Merge in OCR from GPT #146

Merged: 35 commits merged into master on Oct 27, 2024

@danvk danvk commented Oct 24, 2024

Closes #145

Associated site update: oldnyc/oldnyc.github.io@fb0d1cc

Overall approach:

  • If there's only on-site OCR or GPT-based OCR, use that.
  • If there's both, default to using GPT. Exceptions, where the on-site text is kept instead (1,663 total):
    • If a user (identified by cookie) with >=100 edits touched the text and there's a big diff (edit distance >=10)
    • If the on-site text contains more unique dates than the GPT-based text (which has a habit of dropping dates)
    • If it's a big diff (edit distance >=70) and is marked as an exception from my manual review (I reviewed the top ~500 diffs manually)
    • If the GPT-based text has more misspelled words than the on-site OCR, except for ~80 items that I reviewed manually because their length changed by 25+ characters.
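
The selection rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Item` fields, `choose_text`, and the year-only date matcher are hypothetical names and simplifications, and the reading of the misspelling rule (a manually reviewed item keeps the GPT text) is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Crude stand-in for date extraction: four-digit years only (assumption).
YEAR_RE = re.compile(r"\b(?:18|19|20)\d{2}\b")


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def unique_dates(text: str) -> set:
    return set(YEAR_RE.findall(text))


@dataclass
class Item:  # hypothetical record; field names are not from the repo
    site_text: Optional[str] = None
    gpt_text: Optional[str] = None
    top_editor_edits: int = 0       # edit count of the busiest cookie that touched the text
    manual_exception: bool = False  # flagged during manual review of the top diffs
    site_misspellings: int = 0
    gpt_misspellings: int = 0
    manually_cleared: bool = False  # assumed meaning of the ~80-item review


def choose_text(item: Item) -> Optional[str]:
    site, gpt = item.site_text, item.gpt_text
    # Only one source (or neither): use whatever exists.
    if site is None or gpt is None:
        return site if gpt is None else gpt
    dist = edit_distance(site, gpt)
    # Exceptions where the on-site text wins.
    if item.top_editor_edits >= 100 and dist >= 10:
        return site
    if len(unique_dates(site)) > len(unique_dates(gpt)):
        return site
    if dist >= 70 and item.manual_exception:
        return site
    if item.gpt_misspellings > item.site_misspellings and not item.manually_cleared:
        return site
    # Default: prefer the GPT-based OCR.
    return gpt
```

For example, `choose_text(Item(site_text="Built in 1923.", gpt_text="Built."))` returns the on-site text, because the GPT text dropped the only date.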

Some stats:

  • Before:
    • 25,421 images have text on the site
    • 29,406 images have GPT-based OCR
    • 21,926 have both
  • After:
    • 27,849 images have OCR from GPT
    • 5,042 have OCR carried over from before
    • 32,891 images have text

Stats for the diff against the existing site:

  • 2% clear losses
  • 6% minor losses
  • 20% neutral changes
  • 12% minor wins
  • 60% clear wins

Other things to like about this:

  • Eliminates the weird circular dependency between ingest.py / generate_static_site.py and the oldnyc.github.io repo. Now ingest.py doesn't need to reference the repo at all, and generate_static_site.py only needs it for timestamps. I'll have to do some follow-up work to support OCR correction again.
  • Much more aggressive about pruning out "NEG # 1234" lines.
  • More aggressive about pulling dates out of text.
  • Eliminates all the line noise that used to sometimes appear in OCR text.
  • Eliminates "editorialization" — where users add their own notes to the photo description. This is valuable, but it belongs in comments. I've moved these over where there were big diffs.
  • Adds a single-/multi-line toggle to OCR review interface; this makes the intra-line diffs wildly more useful.
  • Adds a localturk review mode for the OCR review interface. This uses base64-encoded JSON because of escaping issues.
  • Adds type hints for most JSON files.
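
To illustrate the base64-encoded JSON mentioned for the localturk review mode, here is a minimal round-trip sketch. The function names and the payload shape are hypothetical; the idea is only that a base64 string contains no quotes, commas, or newlines, so it should pass through a CSV field or an HTML template without extra escaping (assuming that is where the escaping trouble came from).

```python
import base64
import json


def encode_task(payload: dict) -> str:
    """Serialize a payload to JSON, then to an ASCII-only base64 string."""
    raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_task(blob: str) -> dict:
    """Reverse of encode_task."""
    return json.loads(base64.b64decode(blob).decode("utf-8"))


# Hypothetical payload with characters that are awkward to escape.
task = {"id": "12345", "site": 'He said "hi",\nthen left.', "gpt": "He said 'hi', then left."}
blob = encode_task(task)
restored = decode_task(blob)
```

On the review page, the same string can be decoded in the browser with `atob` followed by `JSON.parse`, with the caveat that non-ASCII text needs an extra UTF-8 decoding step there.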

There are more wins to be had, but this is good for now. See #147.

@danvk danvk marked this pull request as ready for review October 27, 2024 15:47
@danvk danvk merged commit 28bae2b into master Oct 27, 2024
4 checks passed
@danvk danvk mentioned this pull request Oct 27, 2024

Successfully merging this pull request may close these issues.

Replace some Ocropus OCR with GPT OCR