Merge in OCR from GPT #146

danvk · 2024-10-24T14:15:13Z

Closes #145

Associated site update: oldnyc/oldnyc.github.io@fb0d1cc

Overall approach:

If there's only on-site OCR or GPT-based OCR, use that.
If there's both, default to using GPT. Exceptions (1,663 total):
- If a heavy editor (a cookie with >=100 edits) touched the text and there's a big diff (edit distance >=10)
- If the on-site text contains more unique dates than the GPT-based text (GPT has a habit of dropping dates)
- If it's a big diff (edit distance >=70) and is marked as an exception from my manual review (I reviewed the top ~500 diffs manually)
- If the GPT-based text has more misspelled words than the on-site OCR, except for ~80 items that I reviewed manually because their length changed by 25+ characters
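The selection rules above can be sketched roughly as follows. This is an illustration, not the code in this PR: `choose_ocr`, its keyword arguments, and the year-only `YEAR_RE` are hypothetical stand-ins, and the manual-review overrides for the spelling rule are left out.

```python
import re
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Assumption: a bare 18xx/19xx year stands in for the real date detection.
YEAR_RE = re.compile(r"\b1[89]\d\d\b")


def unique_dates(text: str) -> set:
    return set(YEAR_RE.findall(text))


def choose_ocr(
    site_text: Optional[str],
    gpt_text: Optional[str],
    *,
    heavy_editor: bool = False,      # a cookie with >=100 edits touched the text
    manual_exception: bool = False,  # flagged during manual review of big diffs
    site_misspelled: int = 0,        # misspelled-word counts, computed elsewhere
    gpt_misspelled: int = 0,
) -> str:
    """Return "site" or "gpt"; assumes at least one text is present."""
    if gpt_text is None:
        return "site"
    if site_text is None:
        return "gpt"
    dist = edit_distance(site_text, gpt_text)
    if heavy_editor and dist >= 10:
        return "site"
    if len(unique_dates(site_text)) > len(unique_dates(gpt_text)):
        return "site"
    if dist >= 70 and manual_exception:
        return "site"
    if gpt_misspelled > site_misspelled:
        return "site"
    return "gpt"  # default when both exist
```

For example, `choose_ocr("Main St. 1925", "Main St.")` keeps the on-site text because the GPT text dropped the date.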

Some stats:

Before:
- 25,421 images have text on the site
- 29,406 images have GPT-based OCR
- 21,926 have both
After:
- 27,849 images have OCR from GPT
- 5,042 have OCR carried over from before
- 32,891 images have text

Stats for the diff against the existing site:

- 2% clear losses
- 6% minor losses
- 20% neutral changes
- 12% minor wins
- 60% clear wins

Other things to like about this:

- Eliminates the weird circular dependency between ingest.py / generate_static_site.py and the oldnyc.github.io repo. Now ingest.py doesn't need to reference the repo at all, and generate_static_site.py only needs it for timestamps. I'll have to do some follow-up work to support OCR correction again.
- Much more aggressive about pruning out "NEG # 1234" lines.
- More aggressive about pulling dates out of text.
- Eliminates all the line noise that used to sometimes appear in OCR text.
- Eliminates "editorialization", where users add their own notes to the photo description. This is valuable, but it belongs in comments. I've moved these over where there were big diffs.
- Adds a single-/multi-line toggle to the OCR review interface; this makes the intra-line diffs wildly more useful.
- Adds a localturk review mode for the OCR review interface. This uses base64-encoded JSON because of escaping issues.
- Adds type hints for most JSON files.
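Two of the items above lend themselves to a short sketch: dropping "NEG # 1234" lines and passing JSON through base64 to sidestep escaping. The names `NEG_LINE_RE`, `drop_negative_lines`, `encode_task` and `decode_task` are hypothetical, and the single regex is an assumption; the patterns used in this PR are broader.

```python
import base64
import json
import re

# Assumption: matches whole lines such as "NEG # 1234" or "neg. #77".
NEG_LINE_RE = re.compile(r"^\s*neg\.?\s*#?\s*\d+\s*$", re.IGNORECASE)


def drop_negative_lines(text: str) -> str:
    """Remove lines that consist only of a negative number marker."""
    kept = [line for line in text.split("\n") if not NEG_LINE_RE.match(line)]
    return "\n".join(kept)


def encode_task(task: dict) -> str:
    """Serialize a review task to base64 so quotes, commas and newlines
    in the OCR text cannot break a CSV cell or an HTML attribute."""
    raw = json.dumps(task, ensure_ascii=False).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_task(encoded: str) -> dict:
    """Inverse of encode_task."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))
```

The encoded string only contains characters from the base64 alphabet, so it survives being embedded in a task file without further quoting; the review page would decode it before rendering.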

There are more wins to be had, but this is good for now. See #147.

danvk added 23 commits October 24, 2024 10:14

add types for site data

989ae20

del id

82e09b1

diff on-site OCR vs. GPT

2ec11b2

review site vs. GPT OCR

cd2b4cc

strip neg # 1234 lines from OCR

7106a7a

detect Month Year anywhere

90a3512

allow "1900."

3ab2214

fix I937 dates

9315e70

add feedback types, analyze heavy users

9ff2129

fix bug in score_utils

2360de1

expand is_negative

99a383b

ongoing heavy editor analysis

0aa96af

done with heavy user analysis

f14aacd

checkpoint GPT data

855d343

take more care to preserve dates

9e2b1c9

drop negative lines from GPT OCR

92f88fb

drop malformed JSON from GPT OCR

949cbfd

code to drop malformed JSON

97e9938

pull in manually-corrected rotations

3ab6e5f

update images.ndjson with new OCR

b220088

snapshot current site OCR, make a list of 1127 to keep

bd9b3fd

one more negative pattern

2ac712f

adopt GPT-based OCR

f3e8cbb

danvk mentioned this pull request Oct 26, 2024

Indicate source of OCR text oldnyc/oldnyc.github.io#44

Open

danvk added 6 commits October 26, 2024 15:21

fix bug with date detection

71cb184

more OCR replacement analysis

36f60f4

look at spelling

9725c1c

pare back ocr_shootout.py

7c4c18b

gut changes.*

d5468a3

ready for the switch!

e016cab

danvk marked this pull request as ready for review October 27, 2024 15:47

danvk added 6 commits October 27, 2024 11:50

checks

cdef777

move ocr_shootout to ocr/

6ac3f92

pythonpath

3147dbc

extract photos once

bd4781e

fully de-dupe branches

40488b7

drop stray line

b82057f

danvk merged commit 28bae2b into master Oct 27, 2024
4 checks passed

danvk mentioned this pull request Oct 27, 2024

GPT OCR followup #148

Merged
