Hi,
I have an imbalanced dataset that contains continuous and categorical features. I am trying to use SMOTENC to oversample my minority class. I give SMOTENC the raw categorical features (strings). When I run this on a tiny subset of my original dataset (about 188 samples), it works fine and generates new samples with raw categorical features. But when I run it on the original dataset (~3M samples), I get the error below.
The shape (42507, 72255) suggests that the algorithm is one-hot encoding my raw categorical features under the hood. I cannot understand this, because the original SMOTE paper handles categorical features through the median of the standard deviations of the continuous features, so categorical features should not need to be encoded before being passed to SMOTENC. While debugging, I found that when std = 0, some computation is done with the one-hot encoding to include it in the distance, as far as I understand. But the line below raises an error when it tries to put together the minority-class samples and their corresponding neighbors:
# convert to dense array since scipy.sparse doesn't handle 3D
nn_data = (nn_data.toarray() if sparse.issparse(nn_data) else nn_data)
File "/home/manou/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py", line 1189, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError: Unable to allocate 22.9 GiB for an array with shape (42507, 72255) and data type float64
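For reference, the 22.9 GiB figure follows directly from the shape and dtype in the error message: a dense float64 array needs 8 bytes per element no matter how many entries are zero. The small sketch below reproduces that arithmetic; the helper name `dense_gib` is only illustrative and is not part of imbalanced-learn or SciPy.

```python
def dense_gib(shape, itemsize=8):
    """Return the size in GiB of a dense array with the given shape.

    itemsize=8 corresponds to float64, the dtype reported in the error.
    """
    n_rows, n_cols = shape
    return n_rows * n_cols * itemsize / 2**30


# Shape taken from the MemoryError above.
size = dense_gib((42507, 72255))
print(f"{size:.1f} GiB")
```

So the 72255 columns (presumably the continuous features plus one column per distinct category value) are what make the `toarray()` call so expensive, even though the sparse representation of the same data fits in memory.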