ENH: Reimplement DataFrame.lookup #61185
Open
stevenae wants to merge 20 commits into pandas-dev:main from stevenae:enh-lookup
20 commits
a4057e5 dev setup (stevenae)
0f5ad86 Update dev_attempts.py (stevenae)
7e30181 removed mixed type and threshold (stevenae)
6fed58d Delete dev_attempts.py (stevenae)
8156c42 Update indexing.rst (stevenae)
c17a020 bringing tests back from 1.1.x (stevenae)
4a0b856 extend underline (stevenae)
e0b0b57 spacing (stevenae)
2a6dfae remove dev_version (stevenae)
21280ed fixed test_lookup_requires_unique_axes (stevenae)
48f1cde Reduce columns to those in lookup (stevenae)
9c060a8 Update frame.py (stevenae)
d620710 Merge branch 'enh-lookup-subset' into enh-lookup (stevenae)
4e0c17f one line to separate sections (stevenae)
a5e379b Update v3.0.0.rst (stevenae)
47e0b1b Adding an example (stevenae)
0c04e97 Update frame.py (stevenae)
7d6dea5 shorter example (stevenae)
5018365 Update frame.py (stevenae)
6aa6218 rewrite to preserve types (stevenae)
Do we have other places in our API where we return a NumPy array? With the prevalence of the Arrow type system, it doesn't seem desirable to be locked into returning a NumPy array.
It looks like `values` also does this.
Agreed. I think this API should return an `ExtensionArray` or a NumPy array, depending on the initial type or result type.
`values` only returns a NumPy array for NumPy types. For extension types or Arrow-backed types you get something different. I don't think we should force a NumPy array return here; particularly for string data, that could be non-performant and expensive.
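A minimal sketch of the difference described above, assuming the point concerns `Series.values`. The dtypes chosen here (`Int64`, `category`) are illustrative and not taken from the thread; an Arrow-backed dtype would need `pyarrow` installed, so it is left out.

```python
import numpy as np
import pandas as pd

# Plain NumPy-backed data: .values is an ndarray.
numpy_backed = pd.Series([1, 2, 3])

# Extension dtypes (illustrative choices): .values is an ExtensionArray.
masked = pd.Series([1, 2, None], dtype="Int64")
categorical = pd.Series(["a", "b", "a"], dtype="category")

numpy_values = numpy_backed.values
masked_values = masked.values
categorical_values = categorical.values

for name, vals in [
    ("int64", numpy_values),
    ("Int64", masked_values),
    ("category", categorical_values),
]:
    # Print the concrete container type returned for each dtype.
    print(name, "->", type(vals).__name__)
```

Forcing these through `to_numpy()` instead would typically fall back to an object array for the nullable and categorical cases, which is the cost the comment is worried about.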
Thought this through and did a more heavy-handed rewrite.
Now using `melt` to achieve the outcome of `values` or `to_numpy`.
Performance does take a hit; however, we are still outperforming the naive lookup via `to_numpy` for mixed-type lookups.
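To make the comparison concrete, here is a rough sketch of the two approaches. It is not the PR's implementation: `lookup_naive` and `lookup_melt` are hypothetical helpers, and the sketch assumes unique row and column labels and no existing column named `_row`.

```python
import numpy as np
import pandas as pd


def lookup_naive(df, row_labels, col_labels):
    """Hypothetical baseline: convert the whole frame to a single NumPy array."""
    row_pos = df.index.get_indexer(row_labels)
    col_pos = df.columns.get_indexer(col_labels)
    return df.to_numpy()[row_pos, col_pos]


def lookup_melt(df, row_labels, col_labels):
    """Hypothetical melt-based variant that only touches the requested columns."""
    # Restrict to the columns that are actually looked up.
    needed = df.loc[:, pd.Index(col_labels).unique()]
    # Reshape to long form: one (row label, column label, value) triple per cell.
    long = (
        needed.rename_axis(index="_row")
        .reset_index()
        .melt(id_vars="_row", var_name="_col", value_name="_val")
    )
    # Select the requested (row, column) pairs in order.
    keys = pd.MultiIndex.from_arrays([row_labels, col_labels])
    return long.set_index(["_row", "_col"])["_val"].reindex(keys).values


df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [10, 20, 30], "C": ["p", "q", "r"]},
    index=["x", "y", "z"],
)
rows = ["x", "z", "y"]
cols = ["B", "A", "B"]

naive_result = lookup_naive(df, rows, cols)
melt_result = lookup_melt(df, rows, cols)
print(naive_result.dtype, melt_result.dtype)
```

In this sketch the naive version upcasts to `object` because column `C` holds strings, while the melt version only sees the integer columns `A` and `B`. Whether a shared extension dtype survives `melt` may depend on the pandas version.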
`factorize`
This function can operate on multiple columns of different dtypes. I think the only option in such a case is to return a NumPy array.
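A small illustration of why mixed columns push toward a NumPy result: when several columns are combined into one array, pandas has to find a common dtype. The frames below are made up for the example.

```python
import numpy as np
import pandas as pd

# Integer and float columns share a numeric common dtype.
numeric = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
numeric_dtype = numeric.to_numpy().dtype

# Adding a string column leaves object as the only common dtype.
mixed = numeric.assign(c=["x", "y"])
mixed_dtype = mixed.to_numpy().dtype

print(numeric_dtype, mixed_dtype)
```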
That's true of `factorize`, but it isn't a 100% equivalent comparison. The indexer is certainly a NumPy array, but the values in the two-tuple are an Index that should be type-preserving.
That's also a great point on the mixed column types, but it makes me wary of re-implementing this function. With all of the work going towards clarifying our nullability handling and implementing more than just NumPy types, it seems like this function is going to have a ton of edge cases.
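The `factorize` comparison above can be checked with a short sketch. The categorical input is an illustrative choice, not something from the thread.

```python
import numpy as np
import pandas as pd

# Illustrative input: a Series with an extension (categorical) dtype.
ser = pd.Series(["a", "b", "a"], dtype="category")

# factorize returns a two-tuple: integer codes and the unique values.
codes, uniques = pd.factorize(ser)

print(type(codes).__name__, codes.dtype)      # the indexer is a NumPy array
print(type(uniques).__name__, uniques.dtype)  # the uniques keep the input dtype
```

The codes are a plain NumPy integer array, while the uniques come back as an Index that carries the categorical dtype, which is the type-preserving half of the result.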