
Stop using boston housing dataset #6180

Open · wants to merge 1 commit into branch-25.02
Conversation

@betatim (Member) commented Dec 12, 2024

Instead we use the California housing dataset.

closes #5158

copy-pr-bot (bot) commented Dec 12, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the Cython / Python Cython or Python issue label Dec 12, 2024
Instead we use the california housing dataset.
@betatim betatim marked this pull request as ready for review December 12, 2024 13:52
@betatim betatim requested a review from a team as a code owner December 12, 2024 13:52
@betatim betatim requested review from bdice and viclafargue December 12, 2024 13:52
@bdice (Contributor) commented Dec 12, 2024

Looks like there are some failures in hdbscan with the new dataset:

=========================== short test summary info ============================
FAILED test_hdbscan.py::test_hdbscan_sklearn_datasets[housing_dataset-knn-leaf-True-min_samples_cluster_size_bounds1-0.0] - AssertionError: 
Arrays are not almost equal to -25 decimals

(shapes (47,), (48,) mismatch)
 ACTUAL: array([   25,    27,    28,    29,    29,    29,    31,    32,    33,
          35,    36,    36,    40,    41,    43,    45,    47,    49,
          50,    54,    61,    62,    65,    71,    72,    72,    78,...
 DESIRED: array([   25,    25,    25,    26,    29,    30,    30,    31,    32,
          33,    35,    35,    35,    37,    40,    41,    44,    44,
          47,    49,    50,    57,    60,    61,    64,    71,    71,...
FAILED test_hdbscan.py::test_hdbscan_sklearn_datasets[housing_dataset-knn-leaf-False-min_samples_cluster_size_bounds1-0.0] - AssertionError: 
Arrays are not almost equal to -25 decimals

(shapes (47,), (48,) mismatch)
 ACTUAL: array([   25,    27,    28,    29,    29,    29,    31,    32,    33,
          35,    36,    36,    40,    41,    43,    45,    47,    49,
          50,    54,    61,    62,    65,    71,    72,    72,    78,...
 DESIRED: array([   25,    25,    25,    26,    29,    30,    30,    31,    32,
          33,    35,    35,    35,    37,    40,    41,    44,    44,
          47,    49,    50,    57,    60,    61,    64,    71,    71,...
= 2 failed, 14387 passed, 6025 skipped, 646 xfailed, 66 xpassed, 1152 warnings in 4889.78s (1:21:29) =

https://github.com/rapidsai/cuml/actions/runs/12297823749/job/34322294951?pr=6180#step:8:8331

@dantegd (Member) commented Dec 13, 2024

@betatim can you open an issue with the log of the error?

I think we should use this PR to debug that failure. Specifically, it fails with this particular set of relevant parameters:

  • cluster_selection_method=leaf (not eom)
  • cluster_selection_epsilon=0.0 (not the larger values)
  • min_samples=50
  • min_cluster_size=25

This combination of parameters should give us a clue when debugging. Importantly, it fails with cluster_selection_epsilon=0.0, which means it would affect out-of-sample inference.

On a very first quick thought, this seems to suggest a bug either in the leaf code in `void leaf(const raft::handle_t& handle, ...)` or somewhere else in the cluster selection logic.

@dantegd (Member) commented Dec 13, 2024

Something additional that always frustrates me about assert in pytest: I'd love to know beforehand the results of the subsequent assertions (calculating adjusted_rand_score and checking for a matching cluster_persistence_). In more recent tests I've been moving away from a simple assert that fails on the first check, and instead accumulating the errors in a list so they can all be reported and displayed at the end of the test:

# Report all errors at the end

Might be useful to adopt that approach in general; I will open an issue to capture that as well.
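The pattern described above could look something like the following minimal sketch. The check names, values, and threshold are illustrative, not cuml's actual test code:

```python
# Sketch of the "accumulate errors, report at the end" test pattern.
# check_clustering, the 0.95 threshold, and the messages are hypothetical.
def check_clustering(rand_score, persistence_matches):
    errors = []

    # Each check appends a message instead of raising immediately,
    # so a single test run reports every failing condition at once.
    if rand_score < 0.95:
        errors.append(f"adjusted_rand_score too low: {rand_score}")
    if not persistence_matches:
        errors.append("cluster_persistence_ does not match the reference")

    # Report all errors at the end
    assert not errors, "\n".join(errors)
```

With this shape, a run that fails both checks shows both messages in the pytest log instead of stopping at the first assert.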

@betatim (Member, Author) commented Dec 13, 2024

An interesting thing that I didn't spot initially is that the lengths of the arrays are different. So it isn't just the number of samples assigned to each cluster that differs, but also (I think) the total number of clusters.

Poking around a bit to understand what is what.

@betatim (Member, Author) commented Dec 13, 2024

adjusted_rand_score(cuml_agg.labels_, sk_agg.labels_)=0.9694956618348944
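For context on reading that number: adjusted_rand_score is invariant to label permutations, so a value this close to 1.0 means the two clusterings agree on almost all pairs of points even when the integer cluster ids differ. A small sketch, assuming scikit-learn is installed:

```python
from sklearn.metrics import adjusted_rand_score

# The score only depends on which points share a cluster, not on the
# integer ids, so a relabelled but identical partition scores 1.0.
cuml_like = [0, 0, 1, 1, 2, 2]
sk_like = [2, 2, 0, 0, 1, 1]  # same partition, permuted labels
print(adjusted_rand_score(cuml_like, sk_like))  # -> 1.0
```

So 0.969 here says the cuml and sklearn clusterings disagree on only a small fraction of point pairs.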

@divyegala (Member) commented Dec 17, 2024

@betatim @dantegd

Here's a full sequence of min_samples values that occur in the mutual reachability graph for the last data point in the housing dataset:

10.9316,10.9316,10.9316,10.9316,13.351,11.2694,10.9316,15.2398,10.9316,10.9316,11.3027,12.227,11.1692,10.9772,13.3884,10.9316,13.3978,11.4673,12.6491,11.3798,10.9316,10.9316,10.9316,11.2694,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,12.1347,15.6445,10.9316,18.2688,11.8954,10.9316,11.5434,12.052,10.9316,11.2138,10.9316,12.5399,10.9316,10.9316,10.9316,10.9316,10.9316,11.3248,10.9316,10.9316,14.2653,10.9316,11.1131,11.6833,10.9316,10.9316,12.0934,10.9316,11.2027,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,3.40282e+38

where the minimum of this array, 10.9316, is highly repeated. In fact, there appear to be very few unique edge weights for this data point. Thus, no parallel MST can be deterministic, as you could select any edge with weight 10.9316 to build the MST. A sequential MST, however, would be deterministic and always select the same edge despite the repetitions (most likely the first instance of 10.9316). Once this difference occurs, we are looking at the possibility of completely different dendrograms being constructed, which leads to very different final solutions.

NOTE: I can also verify with an eye-test that almost all data points in the mutual reachability graph seem to have similar edge weight patterns.
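To illustrate why ties make the result order-dependent, here is a toy Kruskal sketch (not cuml's MST code): with all edge weights tied, different tie-breaking orders produce equally valid spanning trees with the same total weight but different edge sets.

```python
# Toy illustration of MST tie-breaking; all names here are hypothetical.
def kruskal(edges, n):
    parent = list(range(n))

    def find(x):
        # Find the root of x's component with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in edges:  # assumes edges already sorted by weight
        ru, rv = find(u), find(v)
        if ru != rv:  # edge joins two components: keep it
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Three points, all pairwise distances tied (like the repeated 10.9316s).
edges = [(1.0, 0, 1), (1.0, 0, 2), (1.0, 1, 2)]
t1 = kruskal(edges, 3)
t2 = kruskal(list(reversed(edges)), 3)  # same weights, different tie order
# t1 and t2 have equal total weight but different edge sets, so the
# dendrograms built from them downstream can differ.
```

A parallel MST effectively explores tied edges in a nondeterministic order, which is exactly the reversed-list situation above.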

@betatim (Member, Author) commented Dec 19, 2024

The takeaway here is that we can't make the current test pass, and that is a feature (not a bug)? Because cuml uses a parallel MST construction, we can't promise to break ties in the same way as a sequential implementation.

If yes, I see two(+1) options:

  1. special-case the California housing dataset to only check e.g. the score but not the exact clustering
  2. in general, loosen the requirements in the test, because we are just being lucky that it passes for the other datasets
  3. try other datasets until we find one that passes the test

WDYT?

@divyegala (Member) commented
@betatim I would want to try a dataset with fewer repetitive patterns. The California housing dataset has a bunch of repeated values in the first 3 columns.

@betatim (Member, Author) commented Jan 15, 2025

I don't see the repetitiveness?

california_housing.frame.head(20)
Out[5]: 
    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0   8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1   8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2   7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3   5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4   3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422
5   4.0368      52.0  4.761658   1.103627       413.0  2.139896     37.85    -122.25        2.697
6   3.6591      52.0  4.931907   0.951362      1094.0  2.128405     37.84    -122.25        2.992
7   3.1200      52.0  4.797527   1.061824      1157.0  1.788253     37.84    -122.25        2.414
8   2.0804      42.0  4.294118   1.117647      1206.0  2.026891     37.84    -122.26        2.267
9   3.6912      52.0  4.970588   0.990196      1551.0  2.172269     37.84    -122.25        2.611
10  3.2031      52.0  5.477612   1.079602       910.0  2.263682     37.85    -122.26        2.815
11  3.2705      52.0  4.772480   1.024523      1504.0  2.049046     37.85    -122.26        2.418
12  3.0750      52.0  5.322650   1.012821      1098.0  2.346154     37.85    -122.26        2.135
13  2.6736      52.0  4.000000   1.097701       345.0  1.982759     37.84    -122.26        1.913
14  1.9167      52.0  4.262903   1.009677      1212.0  1.954839     37.85    -122.26        1.592
15  2.1250      50.0  4.242424   1.071970       697.0  2.640152     37.85    -122.26        1.400
16  2.7750      52.0  5.939577   1.048338       793.0  2.395770     37.85    -122.27        1.525
17  2.1202      52.0  4.052805   0.966997       648.0  2.138614     37.85    -122.27        1.555
18  1.9911      50.0  5.343675   1.085919       990.0  2.362768     37.84    -122.26        1.587
19  2.6033      52.0  5.465455   1.083636       690.0  2.509091     37.84    -122.27        1.629

A lot of 52-year-old houses, but otherwise this doesn't look too bad?

Do you have an alternative dataset in mind?

And for my education, why would having repeated values be a problem? Real-world data might have this as well, no?
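One way to quantify the repetitiveness divyegala mentions is to count distinct values per column. Using just the Latitude column of the 20 rows printed above as a sketch:

```python
from collections import Counter

# Latitude values copied from the 20 rows shown above.
latitude = [37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84,
            37.84, 37.84, 37.85, 37.85, 37.85, 37.84, 37.85, 37.85,
            37.85, 37.85, 37.84, 37.84]

counts = Counter(latitude)
# Only 4 distinct values among 20 rows; 37.85 alone appears 11 times.
# Tied coordinates produce tied pairwise distances, which is what feeds
# the repeated mutual-reachability edge weights seen earlier.
print(len(counts), counts[37.85])  # -> 4 11
```

On the full dataset, `california_housing.frame.nunique()` would show the same effect across HouseAge, Latitude, and Longitude.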

Labels: Cython / Python (Cython or Python issue)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
  [BUG] Remove usage of Boston dataset in test files
4 participants