
SMOTENC MemoryError #752

Closed
MokaddemMouna opened this issue Sep 9, 2020 · 2 comments

@MokaddemMouna commented Sep 9, 2020

Hi,
I have an imbalanced dataset containing continuous and categorical features, and I am trying to use SMOTENC to oversample my minority class. I pass SMOTENC the raw categorical features (strings). When I run this on a tiny subset of my original dataset (about 188 samples), it works fine and generates new samples with raw categorical features. But when I run it on the full dataset (~3M samples), I get the error below.
Seeing the shape (42507, 72255), it looks as if the algorithm is one-hot encoding my raw categorical features under the hood. This is something I don't understand, since the original SMOTE paper handles categorical features using the median of the standard deviations of the continuous features, so categorical features shouldn't need to be encoded before being passed to SMOTENC. While debugging, I found that when the std is 0, some computation is done with the one-hot encoding to include it in the distance, as far as I understand. But the line below raises an error when trying to put together the samples of the minority class and their corresponding neighbors:

    # convert to dense array since scipy.sparse doesn't handle 3D
    nn_data = (nn_data.toarray() if sparse.issparse(nn_data) else nn_data)
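
To see why this densification is the failure point, here is a minimal sketch (mine, not from the thread): toarray() materializes every zero of a mostly-zero matrix, so the dense copy of a large one-hot encoded block can be orders of magnitude larger than its sparse representation.

    import numpy as np
    from scipy import sparse

    # Illustrative only: a sparse matrix standing in for the one-hot
    # encoded neighbor data (the shape and density here are made up).
    nn_data = sparse.random(1000, 5000, density=0.01, format="csr")
    print(f"sparse values footprint: ~{nn_data.data.nbytes / 1e6:.2f} MB")

    # The pattern quoted above: densify because scipy.sparse has no 3D arrays.
    dense = nn_data.toarray() if sparse.issparse(nn_data) else nn_data
    print(f"dense footprint:         ~{dense.nbytes / 1e6:.2f} MB")  # ~100x larger

For reference, the categorical handling described in the SMOTE-NC section of the original paper can be sketched as follows, where med is the median of the standard deviations of the minority class's continuous features and every categorical mismatch contributes med to the distance (all names here are illustrative, not imbalanced-learn's API):

    import numpy as np

    def smotenc_distance(a_cont, b_cont, a_cat, b_cat, med):
        # Euclidean distance over the continuous features ...
        diff = np.asarray(a_cont, float) - np.asarray(b_cont, float)
        # ... plus med**2 added to the squared distance per categorical mismatch.
        mismatches = sum(x != y for x, y in zip(a_cat, b_cat))
        return np.sqrt(np.sum(diff**2) + mismatches * med**2)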

Here's my code and the error.

from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler

oversample = SMOTENC(categorical_features=[0, 1, 2, 3, 4, 11, 12],
                     k_neighbors=5,
                     sampling_strategy={1: 60000},
                     n_jobs=8)
undersample = RandomUnderSampler(sampling_strategy={0: 120000})
x_train, y_train = oversample.fit_resample(x_tot, y_tot)
x_train, y_train = undersample.fit_resample(x_train, y_train)
  File "/home/manou/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py", line 1189, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError: Unable to allocate 22.9 GiB for an array with shape (42507, 72255) and data type float64
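
As a sanity check, the reported 22.9 GiB is exactly the size of a dense float64 array with the shape in the traceback, which points at the densification of the one-hot encoded matrix rather than the raw data itself:

    import numpy as np

    # Size of a dense float64 array of shape (42507, 72255), as in the traceback.
    n_bytes = 42507 * 72255 * np.dtype(np.float64).itemsize
    print(f"{n_bytes / 2**30:.1f} GiB")  # -> 22.9 GiB, matching the MemoryError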
@hayesall (Member)

I'd need the full error traceback.

This currently looks like a scipy error (the MemoryError is being raised from scipy.sparse.base.spmatrix._process_toarray_args).

@glemaitre (Member)

I am closing this issue because I opened #771.

Basically, we should investigate whether we really need to convert the sparse matrix to a dense array, which makes things blow up because of the OHE.
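
Until that is resolved, a speculative workaround (not discussed in this thread) is to shrink the one-hot encoded width before resampling by capping the cardinality of the categorical columns; cap_cardinality, df, cat_cols, and top_k below are hypothetical names, not part of imbalanced-learn:

    import pandas as pd

    def cap_cardinality(df: pd.DataFrame, cat_cols, top_k=50, other="__other__"):
        # Keep the top_k most frequent categories per column and merge the
        # long tail into a single placeholder, so SMOTENC's internal
        # one-hot encoding produces far fewer columns.
        df = df.copy()
        for col in cat_cols:
            keep = df[col].value_counts().nlargest(top_k).index
            df[col] = df[col].where(df[col].isin(keep), other)
        return df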
