Bug 1922986 - Create script that exports BMO data as JSON suitable for import into a BigQuery instance in GCP #2370

Merged (13 commits) on Jan 29, 2025

Collaborator

@dklawren dklawren commented Dec 10, 2024

Sorry this is so late. Lots of moving pieces had to be coordinated with webservices-infra, etc. just so I could test, and then it took small changes one after another to get to this point.

  • Added config options for BMO ETL settings such as base url, project id, dataset id, etc.
  • An export script that will run nightly at 11pm and export a daily snapshot of specific bug data to BigQuery
  • A test BQ emulator and test script that can be used to do basic testing of exporting bug data to the emulator locally. There is a data.yml file that creates the same schema as what will be in production before exporting.
  • Small change to the CircleCI and GitHub CI configs to run the tests in the extensions directory.

Let me know if you have any questions as it is a lot.

@dklawren dklawren requested a review from cgsheeh December 12, 2024 19:50
Collaborator

@cgsheeh cgsheeh left a comment

Looks good to me, just a few comments/questions.

It's a lot less scary in terms of PR size once you realize there is quite a lot of repetition. You could abstract a lot of that away, but if you're pressed for time/deadlines it's okay to leave as-is.

docker-compose.yml (outdated, resolved)
# to make sure other entries with this date are not already present.
check_for_duplicates();

### Bugs
Collaborator

The code related to each individual table should be moved to a separate function, like extract_bugs, instead of being one large script separated by comments. This goes for attachments, flags, keywords etc.

Nit: also, looking at each of these sections, it seems there is some repeated structure amongst them. Each one preps a statement, runs a while loop over the total number of rows, fetches a row, checks the cache, then extracts the data and submits. You could eliminate that repetition.
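
A minimal sketch of the shape this suggestion points at, written in Python for illustration rather than the script's Perl. The names `export_table`, `row_key`, `extract`, `cache`, and `send` are hypothetical, and an in-memory set stands in for whatever cache the real script uses. Each table then supplies only its query and two small callbacks:

```python
import sqlite3


def export_table(conn, query, row_key, extract, cache, send, batch_size=2):
    """Run `query`, skip rows whose key is already cached, and send the rest in batches.

    Returns the number of rows handed to `send`.
    """
    batch = []
    sent = 0
    for row in conn.execute(query):
        key = row_key(row)
        if key in cache:  # already exported, nothing to do
            continue
        cache.add(key)
        batch.append(extract(row))
        if len(batch) >= batch_size:
            send(batch)
            sent += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        send(batch)
        sent += len(batch)
    return sent


# Illustrative data only; the table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (bug_id INTEGER, keyword TEXT)")
conn.executemany(
    "INSERT INTO keywords VALUES (?, ?)",
    [(1, "crash"), (1, "regression"), (2, "crash")],
)

sent_batches = []
count = export_table(
    conn,
    "SELECT bug_id, keyword FROM keywords ORDER BY bug_id, keyword",
    row_key=lambda r: (r[0], r[1]),
    extract=lambda r: {"bug_id": r[0], "keyword": r[1]},
    cache={(1, "crash")},  # pretend this row was exported earlier
    send=sent_batches.append,
)
```

With this shape, adding attachments, flags, or keywords becomes one more call with a different query and `extract` callback instead of another copy of the loop.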

extensions/BMO/bin/export_bmo_etl.pl (5 resolved threads, 4 of them outdated)
@dklawren
Collaborator Author

I need to make some other unrelated changes anyway, as I realized I left out two additional requirements: 1) I need to skip sending bug data related to the 'Legal' product and 2) I need to filter the summaries, etc. for bugs that are private.

https://docs.google.com/document/d/1D-0wLQCrYXw17k_qSZiaH8djaUV-iS3qUencaj0FTtc/edit?tab=t.0#heading=h.l46cjmt1mwn6
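
A rough sketch of how those two requirements could look, in Python for illustration rather than the script's Perl. The field names (`product`, `summary`, `is_private`), the `SENSITIVE_FIELDS` tuple, and the choice to blank filtered fields with `None` are all assumptions, not what the export script necessarily does:

```python
# Hypothetical list of fields to blank out for private bugs.
SENSITIVE_FIELDS = ("summary",)

# Product whose bugs are never exported (from the requirement above).
SKIPPED_PRODUCT = "Legal"


def prepare_bug(bug):
    """Return a copy of `bug` that is safe to export, or None to skip it entirely."""
    if bug["product"] == SKIPPED_PRODUCT:
        return None
    record = dict(bug)  # copy so the caller's data is left untouched
    if record["is_private"]:
        for field in SENSITIVE_FIELDS:
            record[field] = None
    return record


legal = {"id": 1, "product": "Legal", "summary": "contract", "is_private": True}
private = {"id": 2, "product": "Core", "summary": "sec issue", "is_private": True}
public = {"id": 3, "product": "Core", "summary": "typo in UI", "is_private": False}

exported = [r for r in map(prepare_bug, [legal, private, public]) if r is not None]
```

Here `exported` holds two records: the private bug with its summary blanked and the public bug unchanged.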

@dklawren
Collaborator Author

I am going to make one last schema change to the BQ data, so I will need to create the needed PRs in webservices-infra for this :( Once that is done, I will be able to submit the updated code here for a last review.

Context: The bugs.group field needs to be a string (not what I originally had it as) because 1) most bugs do not have more than one group anyway and 2) if there is more than one group, we just want to record the name of the group with the fewest users. I missed that little detail from the schema outlined here: https://docs.google.com/spreadsheets/d/1LDrYkGxZGnktI3-Ei3fpPK2TAXjxTj7fBUjHaJ8tpBE/edit

Sorry for the extra delay.
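
The group rule described above can be sketched as follows, in Python for illustration rather than the script's Perl. The function name `pick_group`, the `member_counts` mapping, the group names, and the alphabetical tie-break are assumptions made for the example:

```python
def pick_group(groups, member_counts):
    """Pick the single group name to export for a bug.

    Returns None for a bug with no groups, otherwise the group with the
    fewest users. Ties are broken alphabetically (an assumption).
    """
    if not groups:
        return None
    return min(groups, key=lambda name: (member_counts[name], name))


# Hypothetical group names and membership counts.
member_counts = {"group-a": 1500, "group-b": 40, "group-c": 40}

single = pick_group(["group-a"], member_counts)
smallest = pick_group(["group-a", "group-c", "group-b"], member_counts)
none = pick_group([], member_counts)
```

Because the result is always one name or nothing, the exported field can be a plain nullable string rather than a repeated field.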

- Added test to make sure we also add flag entries for deletions
- Fixed some typos in export script
…table into memory

** This is better for memory efficiency and helps avoid container crashes
* Created a single function for all two-column tables
@dklawren dklawren requested a review from cgsheeh January 24, 2025 22:05
@dklawren
Collaborator Author

Ready for a hopefully final review. Optimized as much as I could. Refactored to create functions where possible. Switched to using LIMIT/OFFSET to break the tables that do not have a primary key (e.g. dependencies) into chunks instead of loading the whole table into memory. Let me know if anything needs to be elaborated on.
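
A minimal sketch of the LIMIT/OFFSET chunking idea, using Python and an in-memory SQLite database rather than the script's Perl and the real database. The helper name `iter_chunks` and the sample rows are hypothetical, and the table and column names are used only for illustration. An explicit ORDER BY is included because LIMIT/OFFSET paging is only deterministic over a stable ordering:

```python
import sqlite3


def iter_chunks(conn, table, order_by, chunk_size):
    """Yield lists of rows from `table`, at most `chunk_size` rows at a time.

    `table` and `order_by` are interpolated into the SQL, so they must be
    trusted identifiers, never user input.
    """
    offset = 0
    while True:
        rows = conn.execute(
            f"SELECT * FROM {table} ORDER BY {order_by} LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += chunk_size


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependencies (blocked INTEGER, dependson INTEGER)")
conn.executemany(
    "INSERT INTO dependencies VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 4), (3, 4), (5, 1)],
)

chunks = list(iter_chunks(conn, "dependencies", "blocked, dependson", chunk_size=2))
chunk_sizes = [len(chunk) for chunk in chunks]
```

Only one chunk is held in memory at a time when the generator is consumed lazily; the trade-off is that large offsets make later queries progressively slower, since the database still has to skip over the earlier rows.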

extensions/BMO/bin/export_bmo_etl.pl (2 resolved threads, both outdated)
@dklawren dklawren merged commit 9caa744 into mozilla-bteam:master Jan 29, 2025
15 of 16 checks passed
@dklawren dklawren deleted the 1922986 branch January 29, 2025 23:03
2 participants