
Stop using boston housing dataset #6180

Open · wants to merge 1 commit into branch-25.02
Conversation

@betatim (Member) commented Dec 12, 2024

Instead we use the California housing dataset.

closes #5158

copy-pr-bot (bot) commented Dec 12, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the Cython / Python Cython or Python issue label Dec 12, 2024
Instead we use the california housing dataset.
@betatim betatim marked this pull request as ready for review December 12, 2024 13:52
@betatim betatim requested a review from a team as a code owner December 12, 2024 13:52
@betatim betatim requested review from bdice and viclafargue December 12, 2024 13:52
@bdice (Contributor) commented Dec 12, 2024

Looks like there are some failures in hdbscan with the new dataset:

=========================== short test summary info ============================
FAILED test_hdbscan.py::test_hdbscan_sklearn_datasets[housing_dataset-knn-leaf-True-min_samples_cluster_size_bounds1-0.0] - AssertionError: 
Arrays are not almost equal to -25 decimals

(shapes (47,), (48,) mismatch)
 ACTUAL: array([   25,    27,    28,    29,    29,    29,    31,    32,    33,
          35,    36,    36,    40,    41,    43,    45,    47,    49,
          50,    54,    61,    62,    65,    71,    72,    72,    78,...
 DESIRED: array([   25,    25,    25,    26,    29,    30,    30,    31,    32,
          33,    35,    35,    35,    37,    40,    41,    44,    44,
          47,    49,    50,    57,    60,    61,    64,    71,    71,...
FAILED test_hdbscan.py::test_hdbscan_sklearn_datasets[housing_dataset-knn-leaf-False-min_samples_cluster_size_bounds1-0.0] - AssertionError: 
Arrays are not almost equal to -25 decimals

(shapes (47,), (48,) mismatch)
 ACTUAL: array([   25,    27,    28,    29,    29,    29,    31,    32,    33,
          35,    36,    36,    40,    41,    43,    45,    47,    49,
          50,    54,    61,    62,    65,    71,    72,    72,    78,...
 DESIRED: array([   25,    25,    25,    26,    29,    30,    30,    31,    32,
          33,    35,    35,    35,    37,    40,    41,    44,    44,
          47,    49,    50,    57,    60,    61,    64,    71,    71,...
= 2 failed, 14387 passed, 6025 skipped, 646 xfailed, 66 xpassed, 1152 warnings in 4889.78s (1:21:29) =

https://github.com/rapidsai/cuml/actions/runs/12297823749/job/34322294951?pr=6180#step:8:8331

@dantegd (Member) commented Dec 13, 2024

@betatim can you open an issue with the log of the error?

I think we should use this PR to debug that failure. Specifically, it fails with this particular set of relevant parameters:

  • cluster_selection_method=leaf (not eom)
  • cluster_selection_epsilon=0.0 (not the larger values)
  • min_samples=50
  • min_cluster_size=25

This combination of parameters should give us a clue when debugging. Importantly, it fails with cluster_selection_epsilon=0.0, which means it would affect out-of-sample inference.

On a very first quick thought, this seems to suggest a bug either in the leaf code in `void leaf(const raft::handle_t& handle, ...)` or somewhere else in the cluster selection logic.

@dantegd (Member) commented Dec 13, 2024

Something additional that always frustrates me about assert in pytest: I'd love to know beforehand the results of the subsequent assertions (calculating adjusted_rand_score and checking for a matching cluster_persistence_). In more recent tests I've been moving away from a simple assert that fails on the first check, and instead accumulating the errors in a list so they can all be reported and displayed at the end of the test:

# Report all errors at the end

Might be useful to adopt that approach in general; I will open an issue to capture that as well.
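The pattern described above could look something like the following minimal sketch. The check names, values, and threshold are illustrative, not cuml's actual test code:

```python
# Sketch of the "accumulate errors, report at the end" test pattern.
# check_clustering, the 0.95 threshold, and the messages are hypothetical.
def check_clustering(rand_score, persistence_matches):
    errors = []

    # Each check appends a message instead of raising immediately,
    # so a single test run reports every failing condition at once.
    if rand_score < 0.95:
        errors.append(f"adjusted_rand_score too low: {rand_score}")
    if not persistence_matches:
        errors.append("cluster_persistence_ does not match the reference")

    # Report all errors at the end
    assert not errors, "\n".join(errors)
```

With this shape, a run that fails both checks shows both messages in the pytest log instead of stopping at the first assert.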

@betatim (Member, Author) commented Dec 13, 2024

An interesting thing that I didn't spot initially is that the lengths of the arrays are different. So it isn't just the number of samples assigned to each cluster that differs, but also (I think) the total number of clusters.

Poking around a bit to understand what is what.

@betatim (Member, Author) commented Dec 13, 2024

adjusted_rand_score(cuml_agg.labels_, sk_agg.labels_)=0.9694956618348944
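For context on reading that number: adjusted_rand_score is invariant to label permutations, so a value this close to 1.0 means the two clusterings agree on almost all pairs of points even when the integer cluster ids differ. A small sketch, assuming scikit-learn is installed:

```python
from sklearn.metrics import adjusted_rand_score

# The score only depends on which points share a cluster, not on the
# integer ids, so a relabelled but identical partition scores 1.0.
cuml_like = [0, 0, 1, 1, 2, 2]
sk_like = [2, 2, 0, 0, 1, 1]  # same partition, permuted labels
print(adjusted_rand_score(cuml_like, sk_like))  # -> 1.0
```

So 0.969 here says the cuml and sklearn clusterings disagree on only a small fraction of point pairs.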

@divyegala (Member) commented Dec 17, 2024

@betatim @dantegd

Here's a full sequence of min_samples values that occur in the mutual reachability graph for the last data point in the housing dataset:

10.9316,10.9316,10.9316,10.9316,13.351,11.2694,10.9316,15.2398,10.9316,10.9316,11.3027,12.227,11.1692,10.9772,13.3884,10.9316,13.3978,11.4673,12.6491,11.3798,10.9316,10.9316,10.9316,11.2694,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,12.1347,15.6445,10.9316,18.2688,11.8954,10.9316,11.5434,12.052,10.9316,11.2138,10.9316,12.5399,10.9316,10.9316,10.9316,10.9316,10.9316,11.3248,10.9316,10.9316,14.2653,10.9316,11.1131,11.6833,10.9316,10.9316,12.0934,10.9316,11.2027,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,3.40282e+38

where the minimum of this array, 10.9316, is highly repeated. In fact, there appear to be very few unique edge weights for this data point. Thus, no parallel MST can be deterministic, as you could select any edge with weight 10.9316 to build the MST. A sequential MST, however, would be deterministic and always select the same edge despite the repetitions (most likely the first instance of 10.9316). Once this difference occurs, we are looking at the possibility of completely different dendrograms being constructed, which leads to very different final solutions.

NOTE: I can also verify with an eye-test that almost all data points in the mutual reachability graph seem to have similar edge weight patterns.
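To illustrate why ties make the result order-dependent, here is a toy Kruskal sketch (not cuml's MST code): with all edge weights tied, different tie-breaking orders produce equally valid spanning trees with the same total weight but different edge sets.

```python
# Toy illustration of MST tie-breaking; all names here are hypothetical.
def kruskal(edges, n):
    parent = list(range(n))

    def find(x):
        # Find the root of x's component with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in edges:  # assumes edges already sorted by weight
        ru, rv = find(u), find(v)
        if ru != rv:  # edge joins two components: keep it
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Three points, all pairwise distances tied (like the repeated 10.9316s).
edges = [(1.0, 0, 1), (1.0, 0, 2), (1.0, 1, 2)]
t1 = kruskal(edges, 3)
t2 = kruskal(list(reversed(edges)), 3)  # same weights, different tie order
# t1 and t2 have equal total weight but different edge sets, so the
# dendrograms built from them downstream can differ.
```

A parallel MST effectively explores tied edges in a nondeterministic order, which is exactly the reversed-list situation above.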

@betatim (Member, Author) commented Dec 19, 2024

The takeaway here is that we can't make the current test pass, and that is a feature (not a bug)? Because cuml uses a parallel MST construction, we can't promise to break ties in the same way as a sequential implementation.

If yes, I see two(+1) options:

  1. special-case the California housing dataset to only check e.g. the score but not the exact clustering
  2. in general, loosen the requirements in the test, because we are just being lucky that it passes for the other datasets
  3. try other datasets until we find one that passes the test

WDYT?

@divyegala (Member) commented
@betatim I would want to try a dataset with fewer repetitive patterns. The California housing dataset has a bunch of repeated values in the first 3 columns.

@betatim (Member, Author) commented Jan 15, 2025

I don't see the repetitiveness?

california_housing.frame.head(20)
Out[5]: 
    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0   8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1   8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2   7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3   5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4   3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422
5   4.0368      52.0  4.761658   1.103627       413.0  2.139896     37.85    -122.25        2.697
6   3.6591      52.0  4.931907   0.951362      1094.0  2.128405     37.84    -122.25        2.992
7   3.1200      52.0  4.797527   1.061824      1157.0  1.788253     37.84    -122.25        2.414
8   2.0804      42.0  4.294118   1.117647      1206.0  2.026891     37.84    -122.26        2.267
9   3.6912      52.0  4.970588   0.990196      1551.0  2.172269     37.84    -122.25        2.611
10  3.2031      52.0  5.477612   1.079602       910.0  2.263682     37.85    -122.26        2.815
11  3.2705      52.0  4.772480   1.024523      1504.0  2.049046     37.85    -122.26        2.418
12  3.0750      52.0  5.322650   1.012821      1098.0  2.346154     37.85    -122.26        2.135
13  2.6736      52.0  4.000000   1.097701       345.0  1.982759     37.84    -122.26        1.913
14  1.9167      52.0  4.262903   1.009677      1212.0  1.954839     37.85    -122.26        1.592
15  2.1250      50.0  4.242424   1.071970       697.0  2.640152     37.85    -122.26        1.400
16  2.7750      52.0  5.939577   1.048338       793.0  2.395770     37.85    -122.27        1.525
17  2.1202      52.0  4.052805   0.966997       648.0  2.138614     37.85    -122.27        1.555
18  1.9911      50.0  5.343675   1.085919       990.0  2.362768     37.84    -122.26        1.587
19  2.6033      52.0  5.465455   1.083636       690.0  2.509091     37.84    -122.27        1.629

A lot of 52-year-old houses, but otherwise this doesn't look too bad?

Do you have an alternative dataset in mind?

And for my education, why would having repeated values be a problem? Real-world data might have this as well, no?
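One way to quantify the repetitiveness divyegala mentions is to count distinct values per column. Using just the Latitude column of the 20 rows printed above as a sketch:

```python
from collections import Counter

# Latitude values copied from the 20 rows shown above.
latitude = [37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84,
            37.84, 37.84, 37.85, 37.85, 37.85, 37.84, 37.85, 37.85,
            37.85, 37.85, 37.84, 37.84]

counts = Counter(latitude)
# Only 4 distinct values among 20 rows; 37.85 alone appears 11 times.
# Tied coordinates produce tied pairwise distances, which is what feeds
# the repeated mutual-reachability edge weights seen earlier.
print(len(counts), counts[37.85])  # -> 4 11
```

On the full dataset, `california_housing.frame.nunique()` would show the same effect across HouseAge, Latitude, and Longitude.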

Labels: Cython / Python (Cython or Python issue)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
  [BUG] Remove usage of Boston dataset in test files
4 participants