This repository has been archived by the owner on Dec 18, 2024. It is now read-only.

M0 5/pipeline poc #34

Merged
merged 55 commits into from
May 12, 2022

giancarloaf
Collaborator

@giancarloaf giancarloaf commented Feb 24, 2022

Resolves #16 by creating a new Dataflow pipeline that reads HAR JSON from GCS and writes summary tables to BigQuery

Port existing logic from legacy PHP

  • construct status info
  • parse pages
  • generate pages summary table
  • parse requests
  • parse responses
  • generate requests summary table

Implement pipeline features

  • batch
  • streaming

Validation

Follow up notes

  • Discrepancies found when comparing pages could be due to the way MySQL handles duplicates: the new pipeline returns more results than expected, but its counts line up with those from the raw HAR outputs
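
One way to check the duplicate hypothesis above is to compare total and distinct row counts for the same set of page keys. The sketch below is illustrative only: `duplicate_report` and the idea of keying pages by URL are assumptions, not code from this PR.

```python
from collections import Counter


def duplicate_report(keys):
    """Summarize how many keys repeat in a collection of page keys.

    `keys` is any iterable of hashable page identifiers (hypothetical:
    for example, page URLs read from the raw HAR outputs).
    """
    keys = list(keys)
    counts = Counter(keys)
    # Keep only the keys that occur more than once.
    duplicates = {key: n for key, n in counts.items() if n > 1}
    return {
        "total": len(keys),
        "distinct": len(counts),
        "duplicates": duplicates,
    }


report = duplicate_report(
    ["https://a.example/", "https://b.example/", "https://a.example/"]
)
print(report)
```

If the legacy MySQL tables collapsed duplicate keys, the legacy row count should be close to `distinct`, while the new pipeline's count should be close to `total`.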

@giancarloaf giancarloaf added this to the M0.5: Migration to GCP milestone Feb 24, 2022
giancarloaf and others added 13 commits March 1, 2022 09:27
* files include commit history from legacy project HTTPArchive/bigquery
* Removed intermediate JSONL writes to GCS and related arguments

* Removed schema auto-detection

* Updated table name generation to use a combination of config and pipeline data

* Fixed `startedDateTime` conversion to epoch time
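
For context on the `startedDateTime` fix above: HAR files store this field as an ISO 8601 timestamp. A minimal sketch of a conversion to epoch seconds, assuming Python's standard library, could look like the following. The function name `started_datetime_to_epoch` is hypothetical and this is not necessarily how the PR implements it.

```python
from datetime import datetime, timezone


def started_datetime_to_epoch(started: str) -> int:
    """Convert an ISO 8601 `startedDateTime` string to whole epoch seconds."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11
    # onward, so normalize it to an explicit UTC offset first.
    parsed = datetime.fromisoformat(started.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


print(started_datetime_to_epoch("2022-05-12T00:00:00.000Z"))
```

Attaching an explicit timezone before calling `timestamp()` avoids the result depending on the worker's local timezone. Timestamps with unusual fractional-second precision may need a more lenient parser on older Python versions.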
@tunetheweb
Member

tunetheweb commented Apr 16, 2022

OK, I've fixed the linting errors with some more setup in the repo and a very small number of edits to the code. Hope you don't mind, but it was just easier to test this in this branch, as some of these config files are a little annoying.

The static code analysis still has one error and 3 warnings, which I think are correct and should be looked at (or they can be accepted if you strongly disagree).

Will give it a quick review now.

giancarloaf added 10 commits May 1, 2022 21:27
* use BigQuery streaming inserts

* Reshuffle() before reading files in batch mode

* Logging and error handling updates

* Add explicit `return None` where missing previously
* add unittest.yml GitHub Actions

* update unit tests and related code

* add requirements.txt
@giancarloaf giancarloaf marked this pull request as ready for review May 5, 2022 00:32
Member

@rviscomi rviscomi left a comment

Suggested changes to base the `url` field on the `_metadata.tested_url` property per #44 (comment)

* README topic should not expire

* summary_pages.json add `metadata` column

* test_utils.py add tests for crawl date and url hashing empty strings

* transformation.py add new metadata

* transformation.py update `url`(page) to leverage new metadata
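
The `url` change in the commits above, which bases the page URL on `_metadata.tested_url`, might look roughly like the sketch below. Only the `_metadata.tested_url` property comes from the review comment. The function name `get_page_url`, the `fallback_key` parameter, and the fallback behaviour are assumptions for illustration and are not taken from `transformation.py`.

```python
def get_page_url(page: dict, fallback_key: str = "url") -> str:
    """Return the page URL, preferring `_metadata.tested_url` when present.

    `fallback_key` is a hypothetical field name used only when the
    metadata is missing or empty.
    """
    metadata = page.get("_metadata") or {}
    tested_url = metadata.get("tested_url")
    if tested_url:
        return tested_url
    # Assumption: fall back to another field, or an empty string.
    return page.get(fallback_key) or ""


page = {
    "_metadata": {"tested_url": "https://example.com/"},
    "url": "https://example.com/redirected",
}
print(get_page_url(page))
```

Preferring the tested URL keeps the summary row keyed to the URL that was requested for the crawl rather than to whatever the page eventually resolved to.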
@rviscomi
Member

rviscomi commented May 12, 2022

Thanks @giancarloaf! This is good to merge now. There are still some open comments that we can resolve in a follow-up PR:

Development

Successfully merging this pull request may close these issues.

Generate summary tables in Dataflow