Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG(string dtype): groupby/resampler.min/max returns float on all NA strings #60985

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

rhshadrach
Copy link
Member

Built on top of #60936

@rhshadrach rhshadrach added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data Reduction Operations sum, mean, min, max, etc. Bug labels Mar 19, 2025
@rhshadrach rhshadrach added this to the 2.3 milestone Mar 19, 2025
@rhshadrach rhshadrach marked this pull request as ready for review March 23, 2025 12:07
expected_dtype, expected_value = dtype, pd.NA
if reduction_func in ["all", "any"]:
expected_dtype = "bool"
# TODO: For skipna=False, bool(pd.NA) raises; should groupby?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there are a few TODOs / inconsistencies in our interface here. I think rather than branching and trying to document all of them, it would help to simplify this test and just skip/xfail the cases where things are not consistent. It may even be helpful to split this up into multiple tests that are more focused

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rather than branching and trying to document all of them, it would help to simplify this test and just skip/xfail the cases where things are not consistent.

If they added significant complexity I would agree. However the complexity added seems minimal to me, and testing the current behavior tells us when it changes. So even if it's not the final behavior we'd desire, testing it seems better than skipping or xfailing.

It may even be helpful to split this up into multiple tests that are more focused

If there was a good way of doing this while ensuring we are going through all the reduction funcs, I'd definitely be on board. However to my knowledge there is not.

res_values = res_values.astype(object, copy=False)
elif is_string_dtype(dtype) and how in ["min", "max"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to avoid special-casing these functions here? Where is the return value from other functions being handled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea - good call. We only get here with min/max today. If we do end up here with other ops at some point in the future, either (a) the dtype is already correct in which case _from_sequence is O(1) or (b) we want to cast. So I've removed the condition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Inconsistent dtype with GroupBy for StrDtype and all missing values
2 participants