M0 5/pipeline poc #34

giancarloaf · 2022-02-24T04:12:52Z

Resolves #16 by creating a new Dataflow pipeline that reads HAR JSON from GCS and writes summary tables to BigQuery.

* Port existing logic from legacy PHP
* Implement pipeline features
  * batch
  * streaming
* Validation
  * successful batch job https://console.cloud.google.com/dataflow/jobs/us-west1/2022-03-13_23_28_58-7029567823971004188?project=httparchive
  * successful streaming job https://console.cloud.google.com/dataflow/jobs/us-west1/2022-03-13_23_53_43-13862671684015744184?project=httparchive
  * compare pages table
  * compare responses table
  * test streaming live (in parallel with existing pipeline)

Follow up notes

Discrepancies found when comparing the pages tables could be due to the way MySQL handles duplicates: the new pipeline produces more rows than expected, but its counts line up with those from the raw HAR outputs.
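One way to check the duplicate hypothesis is to compare the total row count with the number of distinct keys. The sketch below is illustrative only: the `duplicate_report` helper, the `url` key, and the sample rows are assumptions for the example, not code or data from either pipeline.

```python
from collections import Counter


def duplicate_report(rows, key="url"):
    """Summarize how many rows share the same key value.

    `rows` is any iterable of dicts; `key` is a hypothetical column name.
    """
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    distinct = len(counts)
    # Keep only the key values that occur more than once.
    duplicated = {k: n for k, n in counts.items() if n > 1}
    return {"total": total, "distinct": distinct, "duplicated": duplicated}


# Made-up rows standing in for pages parsed from raw HAR files.
sample_rows = [
    {"url": "https://example.com/"},
    {"url": "https://example.com/"},
    {"url": "https://example.org/"},
]
report = duplicate_report(sample_rows)
```

If `total` matched the new pipeline's count while `distinct` matched the legacy table, collapsed duplicates would be a plausible explanation for the gap.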

tunetheweb · 2022-04-16T13:48:02Z

OK, I've fixed the linting errors with some more setup in the repo and a very small number of edits to the code. Hope you don't mind, but it was just easier to test this in this branch, as some of these config files are a little annoying.

The static code analysis still reports one error and three warnings, which I think are correct and should be looked at (or they can be accepted if you strongly disagree).

Will give it a quick review now.

rviscomi

Suggested changes to base the `url` field on the `_metadata.tested_url` property, per #44 (comment)
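A minimal sketch of what this suggestion could look like, assuming each HAR page entry is a dict. The helper name `page_url` and the `_URL` fallback are assumptions made for illustration, not confirmed behavior of `modules/transformation.py`.

```python
def page_url(page):
    """Return the URL to store for a HAR page entry.

    Prefers `_metadata.tested_url`; the `_URL` fallback is an assumption
    made for this sketch, not confirmed behavior of the pipeline.
    """
    # `_metadata` may be missing or null, so normalize it to a dict first.
    metadata = page.get("_metadata") or {}
    return metadata.get("tested_url") or page.get("_URL")


# Hypothetical page entries, one with metadata and one without.
with_metadata = {"_URL": "https://example.com/redirected", "_metadata": {"tested_url": "https://example.com/"}}
without_metadata = {"_URL": "https://example.org/"}
```

With these inputs, `page_url(with_metadata)` picks the tested URL and `page_url(without_metadata)` falls back to the page's own URL field.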

rviscomi · 2022-05-12T02:35:43Z

Thanks @giancarloaf! This is good to merge now. There are still some open comments that we can resolve in a follow-up PR:

* Creating separate batch/streaming shell scripts: #34 (comment)
* Refactoring pipeline parameters: #34 (comment)
* Using Python libraries to extract file metadata (type, etc.) and appending those to the schema for comparison: #34 (comment)

igrigorik and others added 15 commits August 21, 2013 11:41

update pages schema

f0f8405

update HA schema

8396ea0

update HA schema

df4d78c

update to latest schema

ce35086

update to latest schema

449b77b

update schema

8350b9d

import tools

386559d

update schema

9ba6b38

updated to latest pages schema

fcc427d

Create initial proof of concept DataFlow pipeline

7088b5d

fix readme typo

1ad6bd5

Update import_har.py - remove comment

6b42560

import_har.py update TODOs and naming

2ae882d

utils.py add some globals from original implementation

6492a3c

transformation.py progress on PHP porting

ab7fab3

giancarloaf added this to the M0.5: Migration to GCP milestone Feb 24, 2022

giancarloaf and others added 13 commits March 1, 2022 09:27

progress on PHP porting for requests

713ec61

progress on PHP porting for requests

4a3717f

transformation.py implement aggregate_stats and other minor fixes

8b836fa

Implement streaming pipeline

eed5bb1

Refactor packaging to execute on DataflowRunner

093d34a

Add .gitignore

aa59bdb

Update README.md

797894d

Add client

0e216ec

import tools

63230f7

Add schemas for summary_pages and summary_requests

032fe28

* files include commit history from legacy project HTTPArchive/bigquery

Various improvements

49be932

* Removed intermediate JSONL writes to GCS and related arguments
* Removed schema auto-detection
* Updated table name generation to use a combination of config and pipeline data
* Fixed `startedDateTime` conversion to epoch time

Update .gitignore

03e7781

Update imports

3583071

Fix pylint error

57cdd9b

tunetheweb reviewed Apr 16, 2022

giancarloaf added 10 commits May 1, 2022 21:27

Update/remove commented args parser attributes

741319d

Fix missing return in utils.py:client_name()

d9c2455

Update setup.py with explicit version for apache-beam

48ef42c

Updates to transformation.py

b734ffb

* use BigQuery streaming inserts
* Reshuffle() before reading files in batch mode
* Logging and error handling updates
* Add explicit `return None` where missing previously

Code reformatting

a4e0cbb

Malformed startedDateTime produces null, not exception

88fdec4
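The behavior described by this commit can be sketched as follows. This is not the code from the PR: the function name `started_datetime_to_epoch` is hypothetical, and treating timestamps without an offset as UTC is an assumption of the sketch.

```python
from datetime import datetime, timezone


def started_datetime_to_epoch(value):
    """Convert an ISO 8601 `startedDateTime` string to epoch seconds.

    Returns None for missing or malformed input instead of raising.
    Treating offset-less timestamps as UTC is an assumption of this sketch.
    """
    if not isinstance(value, str):
        return None
    # Older Python versions of fromisoformat() reject a trailing "Z".
    normalized = value[:-1] + "+00:00" if value.endswith("Z") else value
    try:
        parsed = datetime.fromisoformat(normalized)
    except ValueError:
        # Malformed input yields a null value rather than failing the record.
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())
```

Returning `None` lets the row be written with a null timestamp, so a single bad record does not fail the whole bundle.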

Code reformatting

0a4809f

Update unit tests and add GitHub Action

67c75fe

* add unittest.yml GitHub Actions
* update unit tests and related code
* add requirements.txt

Code reformatting

b8fc458

Code reformatting

ab9ae51

giancarloaf marked this pull request as ready for review May 5, 2022 00:32

rviscomi suggested changes May 5, 2022

giancarloaf added 3 commits May 6, 2022 00:29

PR review updates

3e0ee22

Code reformatting

eac1e80

Code reformatting

c93844c

rviscomi suggested changes May 10, 2022

PR review updates

f4e77fa

* README topic should not expire
* summary_pages.json add `metadata` column
* test_utils.py add tests for crawl date and url hashing empty strings
* transformation.py add new metadata
* transformation.py update `url` (page) to leverage new metadata
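For illustration, tests for empty-string handling might look like the sketch below. The `url_hash` helper is a hypothetical stand-in, not the function from `modules/utils.py`; both its MD5-based hashing and its choice to return `None` for empty input are assumptions.

```python
import hashlib
import unittest


def url_hash(url):
    """Hypothetical stand-in for the pipeline's URL hashing helper.

    Hashes the URL with MD5 and keeps the first 16 bits as an integer.
    Returning None for empty or missing input is an assumption.
    """
    if not url:
        return None
    return int(hashlib.md5(url.encode("utf-8")).hexdigest()[:4], 16)


class UrlHashTest(unittest.TestCase):
    def test_empty_string_returns_none(self):
        self.assertIsNone(url_hash(""))

    def test_none_returns_none(self):
        self.assertIsNone(url_hash(None))

    def test_hash_is_stable_and_bounded(self):
        first = url_hash("https://example.com/")
        self.assertEqual(first, url_hash("https://example.com/"))
        self.assertTrue(0 <= first < 2 ** 16)
```

The test case can be run with `python -m unittest` from the directory containing the file.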

rviscomi approved these changes May 12, 2022

rviscomi merged commit 9592088 into main May 12, 2022

rviscomi deleted the m0_5/pipeline-poc branch May 12, 2022 02:35

This was referenced May 19, 2022

Unexpected type, found format:avif #58

Closed

Add new image formats and change typ to type #66

Closed

This was referenced May 26, 2022

Include schemes ws:// and wss:// for web sockets in hostname counts #70

Closed

Filter out hash urls HTTPArchive/wptagent#22

Open
