Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor issue parser - lifecycle methods, use pydantic models, support reviewer lists #174

Merged
merged 17 commits into from
Jul 4, 2024

Conversation

sneakers-the-rat
Copy link
Contributor

Fix: #173

Wanted to do this for awhile, but i refactored the issue parser so it should be a bit easier to make modifications to it going forward.

Previously the field parsing logic was split in a bunch of different places, but this brings it into a linear set of lifecycle methods for parsing a review issue:

Currently:

  • start with a list of dictionaries
  • comment_to_list split whole issue body into a list of lists, split by \n and : . Separate out package name
  • get_issue_meta, iterate over list of lists with a fixed number of lines, processing key into snake_case, and handling special cases for different fields
  • clean_date_accepted_key a no op as far as i can tell?
  • get some date metadata from the passed dictionary
  • get categories
  • iterate over inline definition of expected keys, subsetting model dictionary
  • return dict of dicts

This PR:

  • Start with an Issue object (created from the github API's json schema) or a string (mostly for testing)
  • _split_header: Split the header from the body of the issue
  • split and strip body lines
  • _header_as_dict: Cast the header as a dictionary, preprocessing keys into snake case
  • _preprocess_meta: Apply any preprocessing steps to header metadata, split out into separable preprocessing methods (eg in this case by gathering old reviewer forms in _combine_reviewers)
  • _parse_field: Apply any special field parsing logic here, similar to get_issue_meta but only handling a single field at a time
  • _add_issue_metadata - add any requested fields from the Issue model
  • _postprocess_meta - do any postprocessing steps like gathering categories from the issue body
  • return ReviewModel

This gives clear places to add parsing logic, in a preprocessing, field processing, or postprocessing step. It also uses pydantic models instead of anonymous lists of lists, dicts of dicts, etc. so we can know what to expect at each of the stages. I noticed that a few of the methods were sort of awkward because they expected to operate over the whole list-of-lists of the '\n' and ': ' split issue body, so this separates things out a bit so they should be easier to maintain and test.

also:

  • Support both Reviewer 1:, Reviewer 2: ... and Reviewers: style specification of reviewers
  • removed need for passing a specific number of lines when parsing an issue by using the --- separator to split the header from the body first.
  • renamed return_response to get_issues bc i couldn't tell what was supposed to come out of it and parse_issue_header to parse_issues because it's plural (and created a parse_issue method to process just one)
  • unify handling of username lists where before they were special cased a little bit
  • added a tests/data directory for sample inputs so tests can be a bit less cluttered by defining them inline
  • fixed broken github token test by monkeypatching the .env file loader: the tests were failing in two ways for me on this: without a github token in my .env a bunch of tests failed expecting it to be there, and with one that test would fail because it was present but the monkeypatch only deleted the env variable but not the load dotenv function. I assume this is the case because the CI tests set a GITHUB_TOKEN env variable but not a .env file
  • fixed test dependencies - pytest-mock wasn't specified, not sure where that gets installed in CI but it wasn't when running locally.

…dels, do staged parsing and isolate special casing by treating each key separately, add support for reviewers as lists
@sneakers-the-rat
Copy link
Contributor Author

i guess the thing that would be nice here re: pyOpenSci/pyopensci.github.io#434 is single-sourcing the template: like ideally we would give the human readable names as aliases or title or something and make a classmethod for the ReviewModel that generates that review template, so changing the review template would be a change to the ReviewModel and there is no gap between them where it's possible to forget. Then we could have example reviews from each time the template changes to make sure we still support the current and all prior review checklist formats.

@sneakers-the-rat
Copy link
Contributor Author

it looks like a number of fields were marked as optional in ReviewModel, presumably to be able to test with a reduced input set, but that causes us to improperly validate review issues with missing information.

@sneakers-the-rat
Copy link
Contributor Author

it would b nice to have something like VCR to record all the website updates and put that in pytest for local testing bc rn i'm just using the CI like pytest since i don't have the GITHUB_TOKEN

@sneakers-the-rat
Copy link
Contributor Author

ok there ya have it. since the model has a ton of optional params, i can't be sure that things are actually still working, but that's enough for one day. hopefully saves us some headaches in the future and makes it a little easier to tweak the reviewer template/speeds adoption of the pyos bot by making it easier to handle parsing multiple forms of review template

@lwasser
Copy link
Member

lwasser commented Jul 3, 2024

it would b nice to have something like VCR to record all the website updates and put that in pytest for local testing bc rn i'm just using the CI like pytest since i don't have the GITHUB_TOKEN

@sneakers-the-rat what would that look like. i've used i think VCS (casettes?) in a package where we did api things but someone else set it up..

@@ -35,27 +34,21 @@ def main():
# Get all issues for approved packages - load as dict
# TODO: this doesn't have to be in process issues at all. it could fully
# Call the github module
issues = process_review.return_response()
accepted_reviews = process_review.parse_issue_header(issues, 45)
issues = process_review.get_issues()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above - if you rebase from main (i am pretty sure_ i fiex this

labels=["6/pyOS-approved 🚀🚀🚀"] -->
labels=["6/pyOS-approved"]

this is why we had a magical deleted file in our last pr 🙈 i removed the emojis.

Copy link
Contributor Author

@sneakers-the-rat sneakers-the-rat Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merged in main, got the new label!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you!! now we are no longer deleting all of the packages ✨ 😆

@lwasser
Copy link
Member

lwasser commented Jul 3, 2024

@sneakers-the-rat this is working really well. There is one other issue. when i run update-review-teams it used to populate the github data from each user's github profile - specifically name given we always have ghusername. but now it doesn't seem to be popoulating that.

Given this is a big refactor should we handle that separtely? or is this a small enough thing to fix in this pr (along with the other tiny thing that is the label name (i commented on this).

Screenshot 2024-07-03 at 4 47 33 PM

thank you for this. I refactored this code from a much bigger mess to pydantic last year and i learned a TON. but i knew it was still really hard to maintain / debug. My brain isn't always great about knowing how to best refactor things !! so i really appreciate this pr.

the test issue idea in the _data folder is so smart. We can then add other examples of issues to process in the future as things come up.

Copy link
Member

@lwasser lwasser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am approving this.
We can fix the things i noted here or elsewhere.

also happy to provide merge / admin permissions. 🙌🏻

editor: dict[str, str | None] = {}
reviewer_1: dict[str, str | None] = {}
reviewer_2: dict[str, str | None] = {}
editor: ReviewUser = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add the eic field here as well? For now allow it to default to None because we need to update all older reviews to have that field. It should exist for newer issues submitted. The goal is to make user our EiC get credit for their work in addition to the editors ✨

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it! cfdcad3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing. thank you!!

@lwasser
Copy link
Member

lwasser commented Jul 3, 2024

@all-contributors please add @sneakers-the-rat for code, review

Copy link
Contributor

@lwasser

I've put up a pull request to add @sneakers-the-rat! 🎉

@lwasser lwasser mentioned this pull request Jul 3, 2024
from pydantic import AnyUrl, BaseModel, ConfigDict, Field


class User(BaseModel):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sneakers-the-rat we do this in stravalib but have a bot to grab things. Can you answer a few questions for me about this (it's awesome and i want to learn more & document it correctly).

1 should we add datamodel-codegen to optional dependencies?
2. where is issue_schema .json coming from? with strava there is a swagger.json file that we grab online.
3. did you manually trim things to map to our workflow here?

thank you again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwasser sorry I missed this question -

  1. nah, it was just a one-off to create the model from the JSON schema, and then I heavily modified and trimmed it anyway.
  2. the URL I got it from is in the docstring :)
  3. yep! Basically I took out all the fields that we weren't/probably wouldnt use and set extra = "allow" in the model config so that those extra fields would be there/wouldnt throw a validation error, but youd get a static analysis warning if you tried to access them (hinting they should be added to the model formally). I also simplified the various User models - each of the blank models that inherit from the User class now were just duplicates. And I also made some fields optional I blv to let the tests run. But basically those models in the github module shouldnt be changed (unless github's API changes) - we just want those to reflect what comes back from the API, and the models we use internally in this package are separate.

@sneakers-the-rat
Copy link
Contributor Author

There is one other issue. when i run update-review-teams it used to populate the github data from each user's github profile - specifically name given we always have ghusername. but now it doesn't seem to be popoulating that. ... Given this is a big refactor should we handle that separtely? or is this a small enough thing to fix in this pr (along with the other tiny thing that is the label name (i commented on this).

hmm mysterious. seems worth it to fix. let me see

@sneakers-the-rat
Copy link
Contributor Author

@lwasser here try this - 8ff2ac3

looks like the all_active_maintainers was special-cased in the body of that block that handles the username instead of using the role as a generic index. that block was also just duplicated from the single case, so i consolidated them into a single function - could def use some more work bc there's some stuff in there that i'm not sure what it's up to, but hopefully that fixes problem?

@lwasser
Copy link
Member

lwasser commented Jul 4, 2024

@lwasser here try this - 8ff2ac3

looks like the all_active_maintainers was special-cased in the body of that block that handles the username instead of using the role as a generic index. that block was also just duplicated from the single case, so i consolidated them into a single function - could def use some more work bc there's some stuff in there that i'm not sure what it's up to, but hopefully that fixes problem?

Amazing. thank you. yes that CLI script is really really hacky. i really fought with how to create it in a easy-to-maintain and simple way. This is all running locally now. i'm going to merge and also i'm going to bump your permissions here Jonny.

The one item i'll open an issue about is the data-code-generator piece. I just want to document how that was done. thank you so so much!

@lwasser lwasser merged commit d9f9f59 into pyOpenSci:main Jul 4, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support new review template format - process both reviewers list and eic
2 participants