Missing group-id assignments in data export #1930

Open
metasoarous opened this issue Feb 18, 2025 · 7 comments
@metasoarous
Member

Expected behavior:
The number of nonempty entries in the group-id column of participant-votes.csv should match the "grouped" counts from the automated report.

Actual behavior:
There's an order of magnitude difference between the counts.

To Reproduce:
Download the export and count the nonempty values in the group-id column with pandas or another CSV processing utility.
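For anyone reproducing without pandas, here is a minimal sketch of the same count using only the Python standard library. The `group-id` column name comes from this issue. The other column names and the sample rows are made up for illustration and are not real export data.

```python
import csv
import io
from collections import Counter


def count_group_ids(csv_file, column="group-id"):
    """Tally values in the group-id column; empty cells are counted under None."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        counts[value if value else None] += 1
    return counts


# Tiny made-up sample standing in for participant-votes.csv.
sample = io.StringIO(
    "participant,group-id,n-votes\n"
    "0,0,12\n"
    "1,1,7\n"
    "2,,3\n"
    "3,,5\n"
)
counts = count_group_ids(sample)
# Number of participants with a nonempty group-id.
grouped = sum(n for value, n in counts.items() if value is not None)
print(counts, grouped)
```

Against a real download, the same function can be called on `open("participant-votes.csv", newline="")`, and `grouped` is the number to compare with the report's "grouped" count.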

Screenshots:
Counts from the automated report interface:
Image

Counts from applying python csvkit's csvstat utility to export data (for clarity, note that it's interpreting 0/1 as boolean):
Image

Counts from the pandas value_counts() method on export data:
Image

Additional context:

  • Reviewing the export data in greater detail, it seems that all of the participants who were grouped fall within the pid range of 1-170, i.e. they are front-loaded in terms of when they joined the conversation.
  • It could be a red herring, but it was a bit suspicious that the number of grouped participants (as inferred by the export data) was exactly 100 🤔
  • This didn't change over the course of a couple of hours, during which there's been active participation.
@metasoarous
Member Author

metasoarous commented Feb 19, 2025

Update: It looks like this data is getting updated, but is still capping out at 100 nonzero group-id entries.

The latest counts on this conversation:

Image

This specific number (100) has me thinking that the issue may have to do with an error in how the export is using the math blob participant buckets to assign group-ids, since the number of buckets is set to 100 by default.

@ballPointPenguin ballPointPenguin self-assigned this Feb 20, 2025
@ballPointPenguin
Contributor

@metasoarous FYI I'm trying to repro this locally and get a better look at root cause. Thanks for the insight!

@ballPointPenguin
Contributor

In my dev convo test, with 159 participants, I see that exactly the first 100 are assigned to groups, none after that.

@ballPointPenguin
Contributor

ballPointPenguin commented Feb 24, 2025

Interestingly, when I export the data the "old school" way, via clojure -M:run, the group-ids are all included as expected.
This makes me wonder if the problem is in server and not math.
update: I think the bug is in math, just not the part that is used for CLI export construction

@metasoarous
Member Author

Thanks for looking into this @ballPointPenguin!

I'm a bit skeptical that it's in the math, unless it's coming from the new math implementation. That's because I can see from the network console that the math blob that hits the report appears to have all of the base-clusters members coming through:

Image

I think the problem is that the group-clusters members entries point not directly to participants, but to the base-clusters. You need to look up the base clusters via those ids, then go from there to the underlying participant ids:

Image
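The two-step lookup described above can be sketched as follows. This is only an illustration in Python, not the export endpoint's actual code. The blob layout is an assumption: `base-clusters` is taken to hold parallel `id` and `members` lists, and each `group-clusters` entry to hold an `id` and a `members` list of base-cluster ids. The function names are hypothetical.

```python
def participant_group_ids(math_blob):
    """Map participant id -> group id by resolving group members through base clusters."""
    base = math_blob["base-clusters"]
    # Assumed layout: parallel lists of base-cluster ids and their participant members.
    members_by_base_id = dict(zip(base["id"], base["members"]))

    assignments = {}
    for group in math_blob["group-clusters"]:
        # group["members"] holds base-cluster ids, not participant ids.
        for base_id in group["members"]:
            for pid in members_by_base_id.get(base_id, []):
                assignments[pid] = group["id"]
    return assignments


def naive_group_ids(math_blob):
    """The suspected bug: treating group members as participant ids directly."""
    return {
        member: group["id"]
        for group in math_blob["group-clusters"]
        for member in group["members"]
    }


# Made-up blob: 3 base clusters covering 6 participants, split into 2 groups.
blob = {
    "base-clusters": {"id": [0, 1, 2], "members": [[0, 3], [1, 4, 5], [2]]},
    "group-clusters": [{"id": 0, "members": [0, 2]}, {"id": 1, "members": [1]}],
}
resolved = participant_group_ids(blob)
naive = naive_group_ids(blob)
print(len(resolved), len(naive))
```

In this toy blob the naive version assigns only as many ids as there are base clusters, which would be consistent with the export capping out at 100 when the number of base clusters is 100.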

I can explain more about why it's set up this way, but hopefully this should make for an easy fix.

Thanks again!

@colinmegill
Member

I also don't think it's in the math, and do believe it's in the new export endpoints. Agree!

@colinmegill
Member


3 participants