Use GPT for geocoding #164

Merged: danvk merged 37 commits into master from gpt-geocode on Nov 13, 2024

@danvk commented Nov 13, 2024

#150; data update: oldnyc/oldnyc.github.io@aea9ddf

I ran all the items that didn't match title-cross or title-address (~17k) through gpt-4o with this prompt:

Your goal is to extract location information from JSON describing a photograph taken in New York City. The location information should be either an intersection of two streets, a place name, or an address. It's also possible that there's no location information, or that the photo was not taken in New York City.

Respond in JSON matching the following TypeScript interface:

{
  type: "intersection";
  street1: string;
  street2: string;
} | {
  type: "address";
  number: number;
  street: string;
} | {
  type: "place_name";
  place_name: string;
} | {
  type: "no location information";
} | {
  type: "not in NYC";
}
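
As a rough sketch (not the code from this PR), a reply in this shape could be checked before it is trusted. The names `GptLocation` and `parseGptLocation` are hypothetical; only the five variants come from the prompt above.

```typescript
// Mirrors the union type given to the model in the prompt.
type GptLocation =
  | { type: "intersection"; street1: string; street2: string }
  | { type: "address"; number: number; street: string }
  | { type: "place_name"; place_name: string }
  | { type: "no location information" }
  | { type: "not in NYC" };

// Hypothetical helper: returns null when the reply is not valid JSON
// or does not match one of the five variants.
function parseGptLocation(raw: string): GptLocation | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const v = value as Record<string, unknown>;
  switch (v.type) {
    case "intersection": {
      const { street1, street2 } = v;
      if (typeof street1 === "string" && typeof street2 === "string") {
        return { type: "intersection", street1, street2 };
      }
      return null;
    }
    case "address": {
      const { number, street } = v;
      if (typeof number === "number" && typeof street === "string") {
        return { type: "address", number, street };
      }
      return null;
    }
    case "place_name": {
      const { place_name } = v;
      return typeof place_name === "string"
        ? { type: "place_name", place_name }
        : null;
    }
    case "no location information":
      return { type: "no location information" };
    case "not in NYC":
      return { type: "not in NYC" };
    default:
      return null;
  }
}
```

Returning null for anything off-schema keeps a malformed reply from being treated as a location; such items would simply fall through to the next coder.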

It cost ~$11 to run this over the 17,000 items.

This is able to replace almost all the remaining matches from the milstein and extended-grid coders. It adds 1979 new items to the site. These are pretty fun to browse through. They're all wins so far as I can tell. The locations tend to come from either complicated patterns in the title, or (more satisfying!) from location information in the backing text that hasn't made it over to the title. A few examples:

  • 734081f: good spot, combines streets and cross-streets from backing text description of two different photos.
  • 732679f: beautiful, extracts location information from backing text that’s not in the metadata. Also 732533f, 732811f.
  • 730004f: Cool “general view” photo of Ft. Greene!

There were 382 items whose location moved. Many of these were wins. A lot of the losses came from GPT misinterpreting titles like "Fifth Avenue #57" as the intersection "5th Ave & 57th Street" instead of as an address. This led me to create the title-address coder to handle these titles directly.
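
To illustrate the idea, here is a minimal sketch of how a title like "Fifth Avenue #57" could be read as an address. The regular expression and the name `parseTitleAddress` are assumptions for illustration; the actual title-address coder in this PR may use different patterns.

```typescript
interface TitleAddress {
  type: "address";
  number: number;
  street: string;
}

// Assumed pattern: a street name, whitespace, then "#" and a house number.
const STREET_HASH_NUMBER = /^(.+?)\s+#(\d+)\b/;

// Hypothetical helper: returns null when the title does not start
// with the "<street> #<number>" shape.
function parseTitleAddress(title: string): TitleAddress | null {
  const m = STREET_HASH_NUMBER.exec(title.trim());
  if (m === null) return null;
  return { type: "address", number: Number(m[2]), street: m[1] };
}

console.log(parseTitleAddress("Fifth Avenue #57"));
```

Handling this shape with a deterministic rule before the item reaches GPT removes the ambiguity between a house number and a numbered cross street.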

I also added a "special cases" coder to help clean up some oddballs like the China Daily News series and the various Squatters colonies. I think these may have been handled specially when the "Address" column was created back in 2013.

-- Final stats --
25753 title-cross
  608 title-address
 2656 gpt
   88 special
 2348 subjects
   29 extended-grid
   69 milstein
31551 (total)

So only 98 items (29 + 69) still fall through to extended-grid and milstein.
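
The tally can be double-checked with a few lines. The counts are copied from the final stats; the variable names are just for illustration.

```typescript
// Per-coder counts from the final stats.
const coderCounts: Array<[string, number]> = [
  ["title-cross", 25753],
  ["title-address", 608],
  ["gpt", 2656],
  ["special", 88],
  ["subjects", 2348],
  ["extended-grid", 29],
  ["milstein", 69],
];

// Sum over all coders.
const totalGeocoded = coderCounts.reduce((sum, [, count]) => sum + count, 0);

// Items still handled by the two older coders.
const fallbackCoders = ["extended-grid", "milstein"];
const fallThroughCount = coderCounts
  .filter(([name]) => fallbackCoders.indexOf(name) !== -1)
  .reduce((sum, [, count]) => sum + count, 0);

console.log(totalGeocoded); // 31551
console.log(fallThroughCount); // 98
```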

Truth data diff:

Before:
  Geocodes
    165 / 269 = 61.34% of locatable images correctly located.
     10 / 175 = 5.71% incorrectly located.
After:
    208 / 269 = 77.32% of locatable images correctly located.
      9 / 217 = 4.15% incorrectly located.

So that's an all-around win. Thanks, GPT!
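
For reference, the percentages in the truth data diff can be reproduced as follows. This assumes the denominator of the "incorrectly located" rate is the number of images that received any geocode (correct + incorrect), which matches the figures: 165 + 10 = 175 and 208 + 9 = 217. The type and function names are hypothetical.

```typescript
interface TruthStats {
  correct: number;   // locatable images placed correctly
  incorrect: number; // images placed at the wrong location
  locatable: number; // images in the truth set that have a known location
}

const truthBefore: TruthStats = { correct: 165, incorrect: 10, locatable: 269 };
const truthAfter: TruthStats = { correct: 208, incorrect: 9, locatable: 269 };

function pct(numerator: number, denominator: number): string {
  return ((100 * numerator) / denominator).toFixed(2) + "%";
}

function summarize(s: TruthStats): { correctRate: string; incorrectRate: string } {
  // Assumption: the error rate is measured over all located images.
  const located = s.correct + s.incorrect;
  return {
    correctRate: pct(s.correct, s.locatable),
    incorrectRate: pct(s.incorrect, located),
  };
}

console.log(summarize(truthBefore)); // correctRate "61.34%", incorrectRate "5.71%"
console.log(summarize(truthAfter));  // correctRate "77.32%", incorrectRate "4.15%"
```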

@danvk marked this pull request as ready for review November 13, 2024 14:30
@danvk merged commit 35a419c into master Nov 13, 2024
4 checks passed
@danvk deleted the gpt-geocode branch November 13, 2024 15:16