Conversation
* Files include commit history from the legacy project HTTPArchive/bigquery
* Removed intermediate JSONL writes to GCS and related arguments
* Removed schema auto-detection
* Updated table name generation to use a combination of config and pipeline data
* Fixed `startedDateTime` conversion to epoch time
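As a rough sketch of those last two bullets (hypothetical helper names and table naming; the pipeline's actual code may differ):

```python
from datetime import datetime

def started_datetime_to_epoch(started: str) -> float:
    """Convert a HAR `startedDateTime` (ISO 8601, e.g.
    '2022-05-12T00:00:00.000+00:00') to epoch seconds."""
    # Python < 3.11 fromisoformat() does not accept a trailing 'Z'.
    return datetime.fromisoformat(started.replace("Z", "+00:00")).timestamp()

def summary_table(dataset: str, crawl_date: datetime, client: str) -> str:
    # Hypothetical: combine the configured dataset with values derived
    # from the pipeline input, e.g. 'summary_pages.2022_05_01_desktop'.
    return f"{dataset}.{crawl_date:%Y_%m_%d}_{client}"
```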
OK, I've fixed the linting errors with some more setup in the repo and a very small number of edits to the code. Hope you don't mind, but it was just easier to test this in this branch since some of these config files are a little annoying. The static code analysis still has one error and three warnings, which I think are correct and should be looked at (or they can be accepted if you strongly disagree). Will give it a quick review now.
* Use BigQuery streaming inserts
* Reshuffle() before reading files in batch mode
* Logging and error handling updates
* Add explicit `return None` where missing previously
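A minimal Apache Beam sketch of the first two bullets, with a hypothetical GCS path, table name, and parse function (the real pipeline's code will differ):

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery

def parse_har_file(path):
    """Hypothetical stand-in for the pipeline's HAR parsing;
    yields one summary row dict per page in the file."""
    yield {"url": "https://example.com/", "pageid": 1}

with beam.Pipeline() as p:
    (
        p
        | "ListFiles" >> beam.Create(["gs://bucket/crawl/*.har.gz"])  # hypothetical path
        # Reshuffle the file names before the expensive read/parse step
        # so batch-mode work is rebalanced across workers.
        | "Reshuffle" >> beam.Reshuffle()
        | "ReadAndParse" >> beam.FlatMap(parse_har_file)
        # Stream rows into an existing table rather than using batch file loads.
        | "WriteSummary" >> WriteToBigQuery(
            "project:dataset.table",  # hypothetical table
            method=WriteToBigQuery.Method.STREAMING_INSERTS,
        )
    )
```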
* Add unittest.yml GitHub Actions workflow
* Update unit tests and related code
* Add requirements.txt
Suggested changes to base the `url` field on the `_metadata.tested_url` property per #44 (comment).
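A sketch of what that suggestion implies, with hypothetical names:

```python
def page_url(har_page: dict) -> str:
    # Prefer the crawl metadata's tested URL over the page's own URL;
    # fall back when the metadata is absent (hypothetical helper).
    metadata = har_page.get("_metadata") or {}
    return metadata.get("tested_url") or har_page.get("url", "")
```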
* README: topic should not expire
* summary_pages.json: add `metadata` column
* test_utils.py: add tests for crawl date and url hashing of empty strings
* transformation.py: add new metadata
* transformation.py: update `url` (page) to leverage new metadata
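A sketch of the empty-string tests, assuming hypothetical helper names and stub implementations (the repo's actual utils may be spelled differently):

```python
import hashlib
import unittest
from datetime import datetime

def crawl_date(dir_name: str):
    """Hypothetical: parse a crawl date out of a directory name;
    empty input yields None."""
    if not dir_name:
        return None
    return datetime.strptime(dir_name, "%Y_%m_%d").date()

def url_hash(url: str):
    """Hypothetical: hash a page URL; empty input yields None."""
    if not url:
        return None
    return int(hashlib.md5(url.encode()).hexdigest()[:8], 16)

class TestUtils(unittest.TestCase):
    def test_crawl_date_empty_string(self):
        self.assertIsNone(crawl_date(""))

    def test_url_hash_empty_string(self):
        self.assertIsNone(url_hash(""))

if __name__ == "__main__":
    unittest.main()
```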
Thanks @giancarloaf! This is good to merge now. There are still some open comments that we can resolve in a follow-up PR.
Resolves #16 by creating a new Dataflow pipeline from HAR JSON in GCS to summary tables in BigQuery.
Port existing logic from legacy PHP
Implement pipeline features
Validation
Follow-up notes