Get an array of results from GPT #184

Merged
merged 38 commits into from
Nov 26, 2024

@danvk danvk commented Nov 26, 2024

#164 follow-up; data update: oldnyc/oldnyc.github.io@2c8e18d

  • Change GPT prompt to ask for an array of location candidates.
  • Change Coder protocol to return a list of Locatables (could be a generator in the future).
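A minimal sketch of what the list-returning protocol could look like. This is not the repo's actual code: the `Locatable` fields, the `code_record` method name, and the `gpt_response`, `type` and `value` keys are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Locatable:
    """Hypothetical stand-in for the project's location-candidate type."""
    kind: str  # e.g. "address", "intersection", "place_name"
    text: str  # the raw string that would be sent to a geocoder


class Coder(Protocol):
    """Hypothetical protocol: return every candidate rather than a single one."""

    def code_record(self, record: dict) -> list[Locatable]: ...


class GptArrayCoder:
    """Parses a JSON array of candidates (an assumed GPT output format)."""

    def code_record(self, record: dict) -> list[Locatable]:
        raw = record.get("gpt_response", "[]")
        try:
            items = json.loads(raw)
        except json.JSONDecodeError:
            return []  # malformed model output yields no candidates
        if not isinstance(items, list):
            return []
        candidates = []
        for item in items:
            # Skip entries that do not have the assumed shape.
            if isinstance(item, dict) and "type" in item and "value" in item:
                candidates.append(Locatable(str(item["type"]), str(item["value"])))
        return candidates
```

Returning a list keeps callers simple for now; switching the return type to an iterator later would let a caller stop at the first candidate that geocodes.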

This creates a few new failure modes. A notable one is that the rare photographer who includes their address in the byline becomes a major problem, since I give addresses precedence over place names. I've manually reviewed all addresses that appear in 2+ items or cause a photo to move 1+ km, but there are almost certainly more unique photographer addresses lurking.
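To make the failure mode concrete, here is a rough sketch of a precedence rule with a blacklist. The only ordering taken from this PR is that addresses beat place names; where intersections rank, the tuple shape, and the `pick_candidate` name are assumptions.

```python
# Lower number = higher precedence. "address" ahead of "place_name" comes
# from the PR text; the position of "intersection" is an assumption.
PRECEDENCE = {"address": 0, "intersection": 1, "place_name": 2}


def pick_candidate(candidates, blacklist=frozenset()):
    """Pick the highest-precedence candidate that is not blacklisted.

    candidates: list of (kind, text) tuples, in the order they were extracted.
    blacklist: lowercased strings to ignore, e.g. known photographer addresses.
    Returns None when nothing usable remains.
    """
    allowed = [c for c in candidates if c[1].strip().lower() not in blacklist]
    if not allowed:
        return None
    # min() keeps the earliest candidate among ties, so input order breaks ties.
    return min(allowed, key=lambda c: PRECEDENCE.get(c[0], len(PRECEDENCE)))


# A byline address outranks the real subject unless it is blacklisted.
candidates = [("place_name", "Brooklyn Bridge"), ("address", "123 Example Street")]
print(pick_candidate(candidates))
print(pick_candidate(candidates, blacklist={"123 example street"}))
```

The first call selects the (made-up) byline address, which is exactly the problem described above; the second falls back to the place name once that address is blacklisted.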

  • Blacklist many photographer addresses
  • Be more careful about stripping directions from street names where they're critical: West Street, South Street, West End Ave, etc.
  • Blacklist some "cursed" intersections: Broadway & Amsterdam for now.
  • Add a localturk template for reviewing GeoJSON changes (from diff_geojson.py).
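The direction-stripping rule above might look something like the sketch below. The protected names are the three examples from this PR; the regex and the `strip_direction` name are assumptions, not the code in geocode.py.

```python
import re

# Streets whose leading direction word is part of the name itself.
# These three come from the PR description; a real list would be longer.
PROTECTED_PREFIXES = ("west street", "south street", "west end ave")

# A leading compass word or single-letter abbreviation, followed by whitespace.
DIRECTION_RE = re.compile(r"^(north|south|east|west|[nsew]\.?)\s+", re.IGNORECASE)


def strip_direction(street: str) -> str:
    """Drop a leading direction unless the street is on the protected list."""
    name = " ".join(street.split())  # collapse runs of whitespace
    if name.lower().startswith(PROTECTED_PREFIXES):
        return name
    return DIRECTION_RE.sub("", name, count=1)


print(strip_direction("West 42nd Street"))  # 42nd Street
print(strip_direction("West Street"))       # West Street
```

A prefix list like this is crude: any street whose direction is essential but missing from the list (West Broadway, for instance) would still be stripped.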

This is a pretty big diff:

Changed: 543
  +geom: 1,321
  -geom: 34
  • The drops are a mixed bag. Some are losses, but many were incorrect before, e.g. due to photographer addresses.
  • The changes are also a mixed bag.
    • Of the 120 biggest movers (1+ km), 68 (57%) are wins, 30 (25%) are losses and 22 (18%) are neutral.
    • On a random sample of 15: 1 clear loss, 3 small losses, 6 neutral, 2 small wins and 3 clear wins.
  • The additions seem uniformly good. The first 15 (random sample) that I checked were all wins.
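For reference, the categories above (changed, +geom, -geom, and movers of 1+ km) can be computed along these lines. This is a sketch under assumptions, not diff_geojson.py itself: it takes two plain dicts mapping a photo ID to a `(lng, lat)` tuple or `None`, and the function names are made up.

```python
import math
from collections import Counter


def haversine_km(a, b):
    """Great-circle distance in km between two (lng, lat) points."""
    lng1, lat1, lng2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def classify(old_geom, new_geom):
    """Label one photo's change between two geocoding runs."""
    if old_geom is None and new_geom is None:
        return "unlocated"
    if old_geom is None:
        return "+geom"
    if new_geom is None:
        return "-geom"
    return "unchanged" if old_geom == new_geom else "changed"


def summarize(old, new, big_move_km=1.0):
    """Return (counts per label, list of (id, km) for big movers, largest first)."""
    counts = Counter()
    movers = []
    for photo_id in old.keys() | new.keys():
        o, n = old.get(photo_id), new.get(photo_id)
        label = classify(o, n)
        counts[label] += 1
        if label == "changed":
            km = haversine_km(o, n)
            if km >= big_move_km:
                movers.append((photo_id, km))
    movers.sort(key=lambda pair: pair[1], reverse=True)
    return counts, movers
```

Sorting the movers by distance gives a natural review order, since the largest jumps are the most likely to be either clear wins or clear losses.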

The diff on the truth data doesn't look too hot (+8 wins, +5 losses) but none of the new losses bother me too much. The primary cause is "address pinning", where if I geocode "1234 X St" but the highest address on X St is 789, then Google returns the location of "789 X St." This can happen when the street used to continue farther. I'm hoping that plugging in nyc-streets.geojson from the NYPL will help fix this by giving me more historic intersections.
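One possible guard against address pinning is to compare the house number that was requested with the one in the geocoder's answer. The sketch below assumes the result is available as a formatted address string; the function names are hypothetical and nothing here calls a real geocoding API.

```python
import re


def leading_house_number(address: str):
    """Return the leading house number as an int, or None if there isn't one.

    The trailing \\b keeps ordinals like "42nd Street" from counting as 42.
    """
    match = re.match(r"\s*(\d+)\b", address)
    return int(match.group(1)) if match else None


def is_pinned(requested: str, returned: str) -> bool:
    """True when a house number was requested but the result carries a different one.

    A result with no house number at all also counts as a mismatch, on the
    assumption that the geocoder fell back to something coarser.
    """
    wanted = leading_house_number(requested)
    if wanted is None:
        return False  # not an address query, so there is nothing to pin
    return leading_house_number(returned) != wanted


print(is_pinned("1234 X St", "789 X St, New York, NY"))  # True
print(is_pinned("789 X St", "789 X St, New York, NY"))   # False
```

A pinned result could then be rejected outright or demoted below the next candidate in the list.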

TODO (this PR):

  • Refactor Google block in geocode.py

Remaining follow-up work:

  • Blacklist intersections like "59th Street & Central Park"
  • Look at what Google actually geocodes. It often makes surprising changes to the street name that we should reject.
  • Prioritize highly-specific subjects like "Central Park - The Lake" over intersections.
  • West Broadway ≠ Broadway
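As a rough idea for the street-name follow-ups, a normalized comparison between the requested and returned street names would reject a result that silently turned "West Broadway" into "Broadway". The abbreviation table and function names below are assumptions for illustration only.

```python
import re

# Small, hypothetical abbreviation table; a real one would need more entries
# and special handling for cases like "St." meaning "Saint".
ABBREVIATIONS = {
    "st": "street", "ave": "avenue", "av": "avenue", "blvd": "boulevard",
    "pl": "place", "n": "north", "s": "south", "e": "east", "w": "west",
}


def normalize_street(name: str) -> str:
    """Lowercase, drop punctuation, and expand common abbreviations."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def same_street(requested: str, returned: str) -> bool:
    """True only if both names normalize to the same string."""
    return normalize_street(requested) == normalize_street(returned)


print(same_street("W. Broadway", "West Broadway"))  # True
print(same_street("West Broadway", "Broadway"))     # False
```

Because directions are kept rather than stripped here, this check is deliberately stricter than the direction-stripping step used when building queries.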

@danvk danvk marked this pull request as ready for review November 26, 2024 17:42
@danvk danvk merged commit 7c0c638 into master Nov 26, 2024
4 checks passed
@danvk danvk deleted the gpt-array branch November 26, 2024 18:17