OrdinalEncoder unseen value spec #283

Open
lqrz opened this issue Nov 15, 2020 · 2 comments

lqrz commented Nov 15, 2020

Hi, this is not a bug report but rather a feature request (I'm not sure whether this is the right place for it).

It would be great to be able to specify the value an unseen category should take when using the OrdinalEncoder, rather than having it fixed to -1.

For instance, I would like to use this encoder as my preprocessing step, before calling a LightGBM classifier (which expects all categorical feature values to be non-negative integers), within a PMMLPipeline (which currently supports ce.OrdinalEncoder).

Currently, the way around this would be to construct my encoding mapping beforehand and pass it as the mapping param, which is a bit of an overkill. Am I missing something? Is there another way?
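To make the workaround concrete, here is a minimal plain-Python sketch of building the mapping up front and choosing the fallback code yourself. The helpers `build_mapping` and `encode` and the `unknown_value` argument are hypothetical names used only for illustration; they are not part of the category_encoders API.

```python
def build_mapping(values):
    """Assign codes 1, 2, 3, ... to distinct values in first-seen order."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def encode(values, mapping, unknown_value=0):
    """Encode values; anything absent from the mapping gets unknown_value."""
    return [mapping.get(value, unknown_value) for value in values]


train = ["red", "green", "red", "blue"]
mapping = build_mapping(train)

# "purple" was never seen during fitting, so it falls back to 0,
# which keeps every encoded value non-negative.
encoded = encode(["green", "purple"], mapping, unknown_value=0)
print(encoded)
```

Starting the known codes at 1 is a choice made in this sketch so that 0 stays free for unseen values.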

Thanks!

Author

lqrz commented Nov 15, 2020

I understand this is going to be introduced in scikit-learn's OrdinalEncoder in v0.24.
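For comparison, a short sketch of the scikit-learn behaviour referred to here, assuming scikit-learn 0.24 or later is installed. The sample data and the fallback code 999 are made up for illustration; scikit-learn requires `unknown_value` to differ from the codes assigned to the fitted categories.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data: three known categories at fit time.
X_train = np.array([["red"], ["green"], ["blue"]], dtype=object)
X_new = np.array([["green"], ["purple"]], dtype=object)

# Unseen categories are encoded as the value given in unknown_value.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=999)
enc.fit(X_train)

# Categories are sorted at fit time (blue=0, green=1, red=2),
# and the unseen "purple" becomes 999.
encoded = enc.transform(X_new)
print(encoded)
```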

Collaborator

PaulWestenthanner commented Jan 31, 2023

One of the big advantages of this library is a common interface across all the different encoders (e.g. for handling missing values or unknowns), and it makes a lot of sense to keep it. So if we want this to be flexible, we'd need to introduce it for all encoders, and then it would be consistent to make the missing value (currently -2) configurable as well.
The downside is that this introduces another two parameters in the `__init__` function, which makes it somewhat big. The workaround is also rather easy, isn't it? You just replace the -1 with some other value using `df.replace`.
But I'm open to discussing this further. If a lot of people find it super convenient, it can be worth it.

EDIT: I just realised the replace workaround won't work in pipelines (as you noted), and supplying a mapping does indeed feel like overkill.
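For reference, a rough sketch of the `df.replace` workaround applied outside a pipeline, assuming pandas is available. The DataFrame stands in for encoder output, and `UNKNOWN_CODE` and the column names are hypothetical choices for illustration.

```python
import pandas as pd

# Stand-in for encoder output: -1 marks an unseen category and
# -2 a missing value (the sentinels quoted in this thread).
encoded = pd.DataFrame(
    {"color": [1, 2, -1, -2], "price": [9.5, -1.0, 3.0, 4.0]}
)

categorical_cols = ["color"]
UNKNOWN_CODE = 0  # hypothetical replacement for the -1 sentinel

# Restrict the replacement to the encoded columns so that a genuine
# -1 in a numeric column such as "price" is left untouched.
encoded[categorical_cols] = encoded[categorical_cols].replace({-1: UNKNOWN_CODE})
print(encoded)
```

Because this step runs on the encoder's output as separate code, it sits outside the pipeline, which is the limitation noted in the EDIT.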
