Use GPT for geocoding #164

danvk · 2024-11-13T14:17:19Z

#150; data update: oldnyc/oldnyc.github.io@aea9ddf

I ran all the items that didn't match title-cross or title-address (~17k) through gpt-4o with this prompt:

Your goal is to extract location information from JSON describing a photograph taken in New York City. The location information should be either an intersection of two streets, a place name, or an address. It's also possible that there's no location information, or that the photo was not taken in New York City.

Respond in JSON matching the following TypeScript interface:

{
  type: "intersection";
  street1: string;
  street2: string;
} | {
  type: "address";
  number: number;
  street: string;
} | {
  type: "place_name";
  place_name: string;
} | {
  type: "no location information";
} | {
  type: "not in NYC";
}
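A response that claims to follow this interface still needs checking before it is handed to a geocoder. Below is a minimal sketch of such a check in Python. The function name `parse_location_response` and the `REQUIRED_FIELDS` table are hypothetical and are not taken from this PR; only the type names and field names come from the prompt above.

```python
import json
from typing import Optional

# Required fields per response type, mirroring the interface in the prompt.
# (Hypothetical helper table; not part of the PR.)
REQUIRED_FIELDS = {
    "intersection": {"street1": str, "street2": str},
    "address": {"number": (int, float), "street": str},
    "place_name": {"place_name": str},
    "no location information": {},
    "not in NYC": {},
}


def parse_location_response(raw: str) -> Optional[dict]:
    """Return the parsed response if it matches the interface, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    kind = data.get("type")
    if not isinstance(kind, str) or kind not in REQUIRED_FIELDS:
        return None
    for name, expected in REQUIRED_FIELDS[kind].items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
    return data
```

Returning `None` for anything malformed lets the caller treat a bad response the same way as "no location information" and fall through to the next coder.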

It cost ~$11 to run this over the 17,000 items.

This replaces almost all of the remaining matches from the milstein and extended-grid coders, and it adds 1979 new items to the site. These are pretty fun to browse through. They're all wins so far as I can tell. The locations tend to come from either complicated patterns in the title, or (more satisfying!) from location information in the backing text that hasn't made it over to the title. A few examples:

734081f: good spot, combines streets and cross-streets from backing text description of two different photos.
732679f: beautiful, extracts location information from backing text that’s not in the metadata. Also 732533f, 732811f.
730004f: Cool “general view” photo of Ft. Greene!

There were 382 items that moved. Many of these were wins. A lot of the losses were from GPT misinterpreting titles like "Fifth Avenue #57" as "5th Ave & 57th Street" instead of as an address. This led me to create the title-address coder to handle these directly.
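To illustrate the idea behind handling these titles directly, here is a rough sketch of a pattern match for the "Fifth Avenue #57" shape. The regular expression and the function name `match_title_address` are assumptions made for illustration; the actual title-address coder in this PR may use different patterns.

```python
import re
from typing import Optional

# Hypothetical pattern: a street name, then "#", then a house number,
# e.g. "Fifth Avenue #57". Not the pattern used in the PR.
TITLE_ADDRESS_RE = re.compile(r"^\s*(?P<street>[^#]+?)\s*#\s*(?P<number>\d+)\s*$")


def match_title_address(title: str) -> Optional[dict]:
    """Interpret 'Street #N' as an address rather than an intersection."""
    m = TITLE_ADDRESS_RE.match(title)
    if m is None:
        return None
    return {
        "type": "address",
        "number": int(m.group("number")),
        "street": m.group("street"),
    }
```

Running a deterministic match like this ahead of GPT keeps such titles from ever being read as "5th Ave & 57th Street".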

I also added a "special cases" coder to help clean up some oddballs like the China Daily News series and the various Squatters colonies. I think these may have been handled specially when the "Address" column was created back in 2013.

-- Final stats --
25753 title-cross
  608 title-address
 2656 gpt
   88 special
 2348 subjects
   29 extended-grid
   69 milstein
31551 (total)

So only 98 items (29 + 69) still fall through to extended-grid and milstein.
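The "fall through" behavior can be sketched as a chain of coders in which the first one to return a location wins and is counted. This is only a sketch: the function `geocode_all`, the toy coders, and the item shape are hypothetical, and it assumes the coders are tried in the order listed in the stats, which the PR does not state explicitly.

```python
from collections import Counter
from typing import Callable, Optional

# A coder takes an item and returns a location dict, or None if it has no match.
Coder = Callable[[dict], Optional[dict]]


def geocode_all(items: list, coders: list):
    """Assign each item to the first coder that matches; tally matches by coder."""
    counts = Counter()
    results = {}
    for item in items:
        for name, coder in coders:
            location = coder(item)
            if location is not None:
                results[item["id"]] = (name, location)
                counts[name] += 1
                break  # later coders are only reached on a miss
    return results, counts


# Toy stand-ins for real coders (hypothetical, for illustration only).
def toy_title_cross(item: dict) -> Optional[dict]:
    return {"type": "intersection"} if " & " in item["title"] else None


def toy_subjects(item: dict) -> Optional[dict]:
    return {"type": "place_name"} if item.get("subjects") else None


items = [
    {"id": "a", "title": "Broadway & 42nd Street"},
    {"id": "b", "title": "General view", "subjects": ["Fort Greene"]},
    {"id": "c", "title": "Untitled"},
]
results, counts = geocode_all(
    items, [("title-cross", toy_title_cross), ("subjects", toy_subjects)]
)
```

Items that no coder matches, like `"c"` here, are simply absent from `results`.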

Truth data diff:

Before:
  Geocodes
    165 / 269 = 61.34% of locatable images correctly located.
     10 / 175 = 5.71% incorrectly located.
After:
    208 / 269 = 77.32% of locatable images correctly located.
      9 / 217 = 4.15% incorrectly located.

So that's an all-around win. Thanks, GPT!
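For reference, the percentages in the truth data diff can be reproduced with a few lines of arithmetic. The helper `truth_stats` is hypothetical. It assumes the "incorrectly located" denominator is the number of images that were given a location at all (correct plus incorrect), which matches the figures above: 165 + 10 = 175 and 208 + 9 = 217.

```python
def truth_stats(num_correct: int, num_incorrect: int, num_locatable: int) -> dict:
    """Compute the two rates reported in the truth data diff."""
    # Assumption: images that received any location = correct + incorrect.
    num_located = num_correct + num_incorrect
    return {
        "pct_correct": round(100 * num_correct / num_locatable, 2),
        "pct_incorrect": round(100 * num_incorrect / num_located, 2),
    }


before = truth_stats(num_correct=165, num_incorrect=10, num_locatable=269)
after = truth_stats(num_correct=208, num_incorrect=9, num_locatable=269)
```

Under that assumption, the share of locatable images placed correctly rises by about 16 percentage points while the error rate among placed images drops.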

danvk added 30 commits November 7, 2024 15:06

look at patterns

117e69c

clean/normalize alt_title

9047746

clean/normalize alt_title data update

720b2a2

skip boro:A-B pattern for GPT

6b9d2fd

types

86c807e

Merge branch 'master' into gpt-geocode

8d82808

unit test for title_pattern

8db4d62

plug in title-pattern geocoder

e414d06

fix src, stub for extended-grid

c17cfa2

update logs

75c07b6

drop boro_int filter

457d979

rv irrelevant bits

57a013c

Merge branch 'master' into gpt-geocode

4649066

drop one more

d7b5775

update GPT geocodes

d4818d6

avoid feeding address into gpt

d6dfe3e

update gpt geocodes

1531e34

try a new prompt

17fca0d

old/bad geocodes analysis

4ab6652

do not print geocodes

a1fd2e9

Output added/dropped geometries

9dcee76

match an address

ac0e311

more tests

b273d3d

update test data; no more milstein or extended-grid!

37075ef

adjust log format, add --print_geocodes

fad7321

rewrite directional streets for better geocoding

b30e032

cleanup

d224335

try to match milstein punctuation; add special cases coder

3b666a8

with careful punctuation, down to 48 / 87

296b619

more special cases; 29 / 69

a78b84a

danvk added 5 commits November 13, 2024 08:54

slot in special cases coder

37b1bfc

update test data

3bb86f2

update geocache

3fc356d

restore the prompt I actually used

cf5c1fb

use subjects location for Columbus Circle

50ee34c

danvk marked this pull request as ready for review November 13, 2024 14:30

danvk added 2 commits November 13, 2024 09:36

update site data

1cf49fa

add sizes of new images

1f836fc

danvk merged commit 35a419c into master Nov 13, 2024
4 checks passed

danvk deleted the gpt-geocode branch November 13, 2024 15:16

danvk mentioned this pull request Nov 14, 2024

gpt+grid, Central Park West #173

Merged
