Refactor issue parser - lifecycle methods, use pydantic models, support reviewer lists #174

sneakers-the-rat · 2024-07-03T03:03:36Z

Wanted to do this for awhile, but i refactored the issue parser so it should be a bit easier to make modifications to it going forward.

Previously the field parsing logic was split in a bunch of different places, but this brings it into a linear set of lifecycle methods for parsing a review issue:

Currently:

start with a list of dictionaries
comment_to_list split whole issue body into a list of lists, split by \n and : . Separate out package name
get_issue_meta, iterate over list of lists with a fixed number of lines, processing key into snake_case, and handling special cases for different fields
clean_date_accepted_key a no op as far as i can tell?
get some date metadata from the passed dictionary
get categories
iterate over inline definition of expected keys, subsetting model dictionary
return dict of dicts

This PR:

Start with an Issue object (created from the github API's json schema) or a string (mostly for testing)
_split_header: Split the header from the body of the issue
split and strip body lines
_header_as_dict: Cast the header as a dictionary, preprocessing keys into snake case
_preprocess_meta: Apply any preprocessing steps to header metadata, split out into separable preprocessing methods (eg in this case by gathering old reviewer forms in _combine_reviewers)
_parse_field: Apply any special field parsing logic here, similar to get_issue_meta but only handling a single field at a time
_add_issue_metadata - add any requested fields from the Issue model
_postprocess_meta - do any postprocessing steps like gathering categories from the issue body
return ReviewModel

This gives clear places to add parsing logic, in a preprocessing, field processing, or postprocessing step. It also uses pydantic models instead of anonymous lists of lists, dicts of dicts, etc. so we can know what to expect at each of the stages. I noticed that a few of the methods were sort of awkward because they expected to operate over the whole list-of-lists of the '\n' and ': ' split issue body, so this separates things out a bit so they should be easier to maintain and test.

also:

Support both Reviewer 1:, Reviewer 2: ... and Reviewers: style specification of reviewers
removed need for passing a specific number of lines when parsing an issue by using the --- separator to split the header from the body first.
renamed return_response to get_issues bc i couldn't tell what was supposed to come out of it and parse_issue_header to parse_issues because it's plural (and created a parse_issue method to process just one)
unify handling of username lists where before they were special cased a little bit
added a tests/data directory for sample inputs so tests can be a bit less cluttered by defining them inline
fixed broken github token test by monkeypatching the .env file loader: the tests were failing in two ways for me on this: without a github token in my .env a bunch of tests failed expecting it to be there, and with one that test would fail because it was present but the monkeypatch only deleted the env variable but not the load dotenv function. I assume this is the case because the CI tests set a GITHUB_TOKEN env variable but not a .env file
fixed test dependencies - pytest-mock wasn't specified, not sure where that gets installed in CI but it wasn't when running locally.

…dels, do staged parsing and isolate special casing by treating each key separately, add support for reviewers as lists

sneakers-the-rat · 2024-07-03T03:12:51Z

i guess the thing that would be nice here re: pyOpenSci/pyopensci.github.io#434 is single-sourcing the template: like ideally we would give the human readable names as aliases or title or something and make a classmethod for the ReviewModel that generates that review template, so changing the review template would be a change to the ReviewModel and there is no gap between them where it's possible to forget. Then we could have example reviews from each time the template changes to make sure we still support the current and all prior review checklist formats.

sneakers-the-rat · 2024-07-03T03:51:44Z

it looks like a number of fields were marked as optional in ReviewModel, presumably to be able to test with a reduced input set, but that causes us to improperly validate review issues with missing information.

sneakers-the-rat · 2024-07-03T04:06:29Z

it would b nice to have something like VCR to record all the website updates and put that in pytest for local testing bc rn i'm just using the CI like pytest since i don't have the GITHUB_TOKEN

sneakers-the-rat · 2024-07-03T04:08:28Z

ok there ya have it. since the model has a ton of optional params, i can't be sure that things are actually still working, but that's enough for one day. hopefully saves us some headaches in the future and makes it a little easier to tweak the reviewer template/speeds adoption of the pyos bot by making it easier to handle parsing multiple forms of review template

lwasser · 2024-07-03T22:19:34Z

it would b nice to have something like VCR to record all the website updates and put that in pytest for local testing bc rn i'm just using the CI like pytest since i don't have the GITHUB_TOKEN

@sneakers-the-rat what would that look like. i've used i think VCS (casettes?) in a package where we did api things but someone else set it up..

lwasser · 2024-07-03T22:35:54Z

src/pyosmeta/cli/process_reviews.py

@@ -35,27 +34,21 @@ def main():
    # Get all issues for approved packages - load as dict
    # TODO: this doesn't have to be in process issues at all. it could fully
    # Call the github module
-    issues = process_review.return_response()
-    accepted_reviews = process_review.parse_issue_header(issues, 45)
+    issues = process_review.get_issues()


Above - if you rebase from main (i am pretty sure_ i fiex this

labels=["6/pyOS-approved 🚀🚀🚀"] -->
labels=["6/pyOS-approved"]

this is why we had a magical deleted file in our last pr 🙈 i removed the emojis.

merged in main, got the new label!

thank you!! now we are no longer deleting all of the packages ✨ 😆

lwasser · 2024-07-03T22:49:26Z

@sneakers-the-rat this is working really well. There is one other issue. when i run update-review-teams it used to populate the github data from each user's github profile - specifically name given we always have ghusername. but now it doesn't seem to be popoulating that.

Given this is a big refactor should we handle that separtely? or is this a small enough thing to fix in this pr (along with the other tiny thing that is the label name (i commented on this).

thank you for this. I refactored this code from a much bigger mess to pydantic last year and i learned a TON. but i knew it was still really hard to maintain / debug. My brain isn't always great about knowing how to best refactor things !! so i really appreciate this pr.

the test issue idea in the _data folder is so smart. We can then add other examples of issues to process in the future as things come up.

lwasser

i am approving this.
We can fix the things i noted here or elsewhere.

also happy to provide merge / admin permissions. 🙌🏻

lwasser · 2024-07-03T22:43:41Z

src/pyosmeta/models/base.py

-    editor: dict[str, str | None] = {}
-    reviewer_1: dict[str, str | None] = {}
-    reviewer_2: dict[str, str | None] = {}
+    editor: ReviewUser = {}


Can you please add the eic field here as well? For now allow it to default to None because we need to update all older reviews to have that field. It should exist for newer issues submitted. The goal is to make user our EiC get credit for their work in addition to the editors ✨

got it! cfdcad3

Amazing. thank you!!

lwasser · 2024-07-03T22:50:46Z

@all-contributors please add @sneakers-the-rat for code, review

allcontributors · 2024-07-03T22:50:55Z

@lwasser

I've put up a pull request to add @sneakers-the-rat! 🎉

lwasser · 2024-07-03T22:59:45Z

src/pyosmeta/models/github.py

+from pydantic import AnyUrl, BaseModel, ConfigDict, Field
+
+
+class User(BaseModel):


@sneakers-the-rat we do this in stravalib but have a bot to grab things. Can you answer a few questions for me about this (it's awesome and i want to learn more & document it correctly).

1 should we add datamodel-codegen to optional dependencies?
2. where is issue_schema .json coming from? with strava there is a swagger.json file that we grab online.
3. did you manually trim things to map to our workflow here?

thank you again.

@lwasser sorry I missed this question -

nah, it was just a one-off to create the model from the JSON schema, and then I heavily modified and trimmed it anyway.

the URL I got it from is in the docstring :)

yep! Basically I took out all the fields that we weren't/probably wouldnt use and set extra = "allow" in the model config so that those extra fields would be there/wouldnt throw a validation error, but youd get a static analysis warning if you tried to access them (hinting they should be added to the model formally). I also simplified the various User models - each of the blank models that inherit from the User class now were just duplicates. And I also made some fields optional I blv to let the tests run. But basically those models in the github module shouldnt be changed (unless github's API changes) - we just want those to reflect what comes back from the API, and the models we use internally in this package are separate.

sneakers-the-rat · 2024-07-03T23:20:58Z

There is one other issue. when i run update-review-teams it used to populate the github data from each user's github profile - specifically name given we always have ghusername. but now it doesn't seem to be popoulating that. ... Given this is a big refactor should we handle that separtely? or is this a small enough thing to fix in this pr (along with the other tiny thing that is the label name (i commented on this).

hmm mysterious. seems worth it to fix. let me see

sneakers-the-rat · 2024-07-03T23:57:01Z

@lwasser here try this - 8ff2ac3

looks like the all_active_maintainers was special-cased in the body of that block that handles the username instead of using the role as a generic index. that block was also just duplicated from the single case, so i consolidated them into a single function - could def use some more work bc there's some stuff in there that i'm not sure what it's up to, but hopefully that fixes problem?

lwasser · 2024-07-04T17:58:14Z

@lwasser here try this - 8ff2ac3

looks like the all_active_maintainers was special-cased in the body of that block that handles the username instead of using the role as a generic index. that block was also just duplicated from the single case, so i consolidated them into a single function - could def use some more work bc there's some stuff in there that i'm not sure what it's up to, but hopefully that fixes problem?

Amazing. thank you. yes that CLI script is really really hacky. i really fought with how to create it in a easy-to-maintain and simple way. This is all running locally now. i'm going to merge and also i'm going to bump your permissions here Jonny.

The one item i'll open an issue about is the data-code-generator piece. I just want to document how that was done. thank you so so much!

sneakers-the-rat added 3 commits July 2, 2024 19:41

refactor issue parser into clearer lifecycle methods, use pydantic mo…

a41e251

…dels, do staged parsing and isolate special casing by treating each key separately, add support for reviewers as lists

lint

268e940

whoops remove pdb call

00ad27e

sneakers-the-rat mentioned this pull request Jul 3, 2024

Update contributor and review data pyOpenSci/pyopensci.github.io#434

Merged

sneakers-the-rat added 6 commits July 2, 2024 20:17

handle None reviewers

63e9da1

dang this formatter that fights black is annoying lol

4978019

add repository_link from issue

253ec42

handle review validation errors

0152cee

directly test validation function instead of model instantiation

fbfc13c

handle bolded keys in review issues

df5939a

sneakers-the-rat added 3 commits July 2, 2024 20:53

restore repository_link

2ca1a3f

fix reviewer_1 attr access as reviewers

eeffee1

yet more attr access problems

7bd506d

lwasser reviewed Jul 3, 2024

View reviewed changes

lwasser approved these changes Jul 3, 2024

View reviewed changes

allcontributors bot mentioned this pull request Jul 3, 2024

docs: add sneakers-the-rat as a contributor for code, and review #177

Merged

lwasser mentioned this pull request Jul 3, 2024

Fix sigstore job #175

Merged

lwasser reviewed Jul 3, 2024

View reviewed changes

sneakers-the-rat added 3 commits July 3, 2024 16:08

add eic, make editor default None rather than mutable dict

cfdcad3

Merge branch 'main' into reviewer-list

9594f19

handle eic correctly

8d5eb57

consolidate reviewer name handling

8ff2ac3

handle all kinds of falsy user.name

e59f603

sneakers-the-rat mentioned this pull request Jul 4, 2024

Fix update-contributors handling of missing github_username #176

Closed

lwasser merged commit d9f9f59 into pyOpenSci:main Jul 4, 2024
2 checks passed

This was referenced Jul 4, 2024

Future: make sure we support eic field #141

Closed

Make sure code recognizes a third reviewer if they exist. #77

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor issue parser - lifecycle methods, use pydantic models, support reviewer lists #174

Refactor issue parser - lifecycle methods, use pydantic models, support reviewer lists #174

sneakers-the-rat commented Jul 3, 2024

sneakers-the-rat commented Jul 3, 2024

sneakers-the-rat commented Jul 3, 2024

sneakers-the-rat commented Jul 3, 2024

sneakers-the-rat commented Jul 3, 2024

lwasser commented Jul 3, 2024

lwasser Jul 3, 2024

sneakers-the-rat Jul 3, 2024 •

edited

Loading

lwasser Jul 4, 2024

lwasser commented Jul 3, 2024

lwasser left a comment

lwasser Jul 3, 2024

sneakers-the-rat Jul 3, 2024

lwasser Jul 4, 2024

lwasser commented Jul 3, 2024

allcontributors bot commented Jul 3, 2024

lwasser Jul 3, 2024

sneakers-the-rat Jul 4, 2024

sneakers-the-rat commented Jul 3, 2024

sneakers-the-rat commented Jul 3, 2024

lwasser commented Jul 4, 2024

		from pydantic import AnyUrl, BaseModel, ConfigDict, Field


		class User(BaseModel):

Refactor issue parser - lifecycle methods, use pydantic models, support reviewer lists #174

Refactor issue parser - lifecycle methods, use pydantic models, support reviewer lists #174

Conversation

sneakers-the-rat commented Jul 3, 2024

sneakers-the-rat commented Jul 3, 2024

sneakers-the-rat commented Jul 3, 2024

sneakers-the-rat commented Jul 3, 2024

sneakers-the-rat commented Jul 3, 2024

lwasser commented Jul 3, 2024

lwasser Jul 3, 2024

Choose a reason for hiding this comment

sneakers-the-rat Jul 3, 2024 • edited Loading

Choose a reason for hiding this comment

lwasser Jul 4, 2024

Choose a reason for hiding this comment

lwasser commented Jul 3, 2024

lwasser left a comment

Choose a reason for hiding this comment

lwasser Jul 3, 2024

Choose a reason for hiding this comment

sneakers-the-rat Jul 3, 2024

Choose a reason for hiding this comment

lwasser Jul 4, 2024

Choose a reason for hiding this comment

lwasser commented Jul 3, 2024

allcontributors bot commented Jul 3, 2024

lwasser Jul 3, 2024

Choose a reason for hiding this comment

sneakers-the-rat Jul 4, 2024

Choose a reason for hiding this comment

sneakers-the-rat commented Jul 3, 2024

sneakers-the-rat commented Jul 3, 2024

lwasser commented Jul 4, 2024

sneakers-the-rat Jul 3, 2024 •

edited

Loading