Returning a numpy array in one hot encoder #442

Open
adriencrtr opened this issue Sep 3, 2024 · 1 comment

@adriencrtr

Expected Behavior

Even if category_encoders.one_hot.OneHotEncoder doesn't encode any features, we would expect it to convert a pd.DataFrame into a numpy.ndarray when the parameter return_df=False is set.

Actual Behavior

When category_encoders.one_hot.OneHotEncoder is given a DataFrame with only numerical features, so that the parameter cols is empty, and return_df=False is set, the fit_transform method still returns a pd.DataFrame object.

Steps to Reproduce the Problem

import numpy as np
import pandas as pd

from category_encoders.one_hot import OneHotEncoder

rng = np.random.RandomState(42)

This works

n_rows = 100

col1 = rng.rand(n_rows) * 100
col2 = rng.randint(1, 100, n_rows)
col3 = rng.choice([True, False], n_rows)
modalities = ['A', 'B', 'C', 'D']
col4 = rng.choice(modalities, n_rows)

df = pd.DataFrame({
    'Numeric1': col1,
    'Numeric2': col2,
    'Boolean': col3,
    'Object': col4
})

encoder = OneHotEncoder(
    cols=df.select_dtypes(include=["object", "bool"]).columns,
    return_df=False,
    handle_missing='return_nan'
)
X = encoder.fit_transform(df)
type(X)

Out: numpy.ndarray

This is the unexpected behavior

data = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=200)
df = pd.DataFrame(data=data, columns=["Column 1", "Column 2"])

encoder = OneHotEncoder(
    cols=df.select_dtypes(include=["object", "bool"]).columns,
    return_df=False,
    handle_missing='return_nan'
)
X = encoder.fit_transform(df)
type(X)

Out: pandas.core.frame.DataFrame

Specifications

  • Version: 2.6.3
  • Platform: macOS Sonoma 14.6.1
@PaulWestenthanner
Collaborator

PaulWestenthanner commented Oct 1, 2024

Hi @adriencrtr
The issue is actually that cols must be a list rather than a pandas column object.
Column objects should be supported in the future though; that would be a useful addition.
I'm leaving the issue open to remind myself to add support for them.

Also, in the case at hand there are no columns of type object or bool, so cols is empty and the input is returned unchanged:

if not list(self.cols):
    return X
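
Until this is handled in the library, one possible workaround is to pass cols as a plain list and convert the result on the caller's side. The following is a minimal sketch, assuming the caller simply wants an ndarray whether or not any column was encoded. The helper name as_ndarray is hypothetical and not part of category_encoders, and the small frame stands in for the numeric-only data from the report.

```python
import numpy as np
import pandas as pd


def as_ndarray(X):
    """Return X as a numpy.ndarray, converting pandas objects if needed.

    Hypothetical helper for illustration; not part of category_encoders.
    """
    if isinstance(X, (pd.DataFrame, pd.Series)):
        return X.to_numpy()
    return np.asarray(X)


# Numeric-only frame, similar to the second snippet in the report.
df = pd.DataFrame({"Column 1": [0.1, 0.2], "Column 2": [1.5, 2.5]})

# Pass cols as a plain list instead of a pandas Index.
cols = list(df.select_dtypes(include=["object", "bool"]).columns)
print(cols)                  # [] since there is nothing to encode

# With empty cols the encoder hands back its input, so convert explicitly.
print(type(as_ndarray(df)))  # <class 'numpy.ndarray'>
```

In practice the helper would wrap the encoder call, for example as_ndarray(encoder.fit_transform(df)), so the downstream code always receives the same type.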
