
Ability to tell when new URLs are brought into COSMOS for a given collection #1015

Open
code-geek opened this issue Sep 6, 2024 · 0 comments

Resources

Description

Right now, URLs are scraped only once, the collection is curated only once, and it is brought into prod only once. Starting next month, however, we will begin reindexing sites. When we go back to a site and rescrape it, a few things could happen:

  • URLs could disappear
  • new URLs could be added
  • the metadata for old URLs could change (full text, title, etc.)

So let's say we reindex a site and get 20 new URLs. Emily will then have to curate those 20 URLs, but right now there is no way to know which 20 are new, or to tell Emily which ones they are. We need a way both to identify the new URLs and to surface them to Emily.

Similarly, if the titles change for 7 URLs, Emily might need to update her title rules. The webapp needs a mechanism to identify which URLs have changed, and Emily needs a place where she can review the updated ones and make fixes or changes to those specific 7 rules.
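The delta described above is essentially a three-way set comparison between the previous scrape and the new one. A minimal sketch (all names here are illustrative, not the actual COSMOS data model; metadata is modeled as plain dicts keyed by URL):

```python
def compute_url_delta(old_urls, new_urls):
    """Compare two scrapes of a collection and report the delta.

    `old_urls` and `new_urls` are hypothetical dicts mapping
    URL -> metadata (e.g. {"scraped_title": ...}).
    """
    old_set, new_set = set(old_urls), set(new_urls)
    added = new_set - old_set          # URLs Emily has never curated
    removed = old_set - new_set        # URLs that disappeared on reindex
    changed = {                        # same URL, different metadata
        url for url in old_set & new_set
        if old_urls[url] != new_urls[url]
    }
    return {"added": added, "removed": removed, "changed": changed}
```

The `added` and `changed` sets are exactly what the webapp would need to show Emily so she can curate the new URLs and revisit rules for the changed ones.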

Existing Process

Right now the URL import process works like this:

  • Delete all existing Candidate URLs
  • Bring in fresh Candidate URLs from whichever server the user chooses

This obviously loses information such as:

  • How many URLs were there before
  • How many new URLs have been brought in
  • How many old URLs were removed
  • ...and any associated metadata (scraped_title, etc)

We need to rethink this process so that we can preserve old data while also highlighting anything new we've brought in.
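One possible replacement for the delete-then-reimport flow is a merge that keeps existing records (and their curation state) and tags each URL with its status relative to the fresh scrape. A hedged in-memory sketch, assuming records are dicts keyed by URL and that `scraped_title` is the metadata field being compared (the real model will differ):

```python
def import_candidate_urls(existing, scraped):
    """Merge a fresh scrape into existing Candidate URLs instead of
    deleting and re-importing. `existing` maps URL -> record (including
    curation state); `scraped` maps URL -> fresh metadata. Hypothetical.
    """
    merged = {}
    for url, meta in scraped.items():
        if url not in existing:
            # Brand-new URL: flag it so it can be surfaced for curation.
            merged[url] = {**meta, "status": "new"}
        elif existing[url]["scraped_title"] != meta["scraped_title"]:
            # Known URL with changed metadata: keep curation state,
            # take the fresh metadata, flag for review.
            merged[url] = {**existing[url], **meta, "status": "changed"}
        else:
            merged[url] = {**existing[url], "status": "unchanged"}
    for url, record in existing.items():
        if url not in scraped:
            # Retain removed URLs rather than deleting, so the delta
            # (and its history) is preserved.
            merged[url] = {**record, "status": "removed"}
    return merged
```

Keeping the "removed" records around (rather than hard-deleting) is what makes the before/after counts in the list above answerable.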

Implementation Considerations

  • Needs a lot of frontend changes
  • Might need to be taken on fairly soon

Open Questions

  • Do we need to retain the old URLs so that we can compute the delta?
  • How will we retain them?
  • In the long term, do we add a Slack notification when deltas are discovered?
  • Do we set up the API to withhold new or changed URLs from prod SDE?

Deliverable

  • Design doc

Dependencies

No response
