textstat_editdist #12

koheiw · 2020-01-18T07:01:50Z

Inspired by a Stack Overflow post, I propose adding textstat_editdist(), whose output would be similar to the following:

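library(quanteda)

# Edit distance from "the" to every type in the inaugural corpus
toks <- tokens(data_corpus_inaugural)
dat <- data.frame(feature = types(toks),
                  dist = stringdist::stringdist("the", types(toks)))
# Show the ten closest types
head(dat[order(dat$dist), ], 10)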
     feature dist
3        the    0
188     they    1
343       he    1
347      The    1
387     them    1
548     then    1
4223     she    1
4261     tie    1
7087     She    1
7683    thee    1
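
A minimal sketch of what such a function might look like is given here. The name `textstat_editdist_sketch()`, its arguments (`types`, `pattern`, `n`) and its return shape are all assumptions, not a settled API. To keep the example free of package dependencies, it takes a plain character vector instead of a tokens object and uses base R's `utils::adist()` (Levenshtein distance) in place of `stringdist::stringdist()`, whose default method is OSA; the two can differ when adjacent characters are transposed.

```r
# Hypothetical sketch: the function name follows the proposal, but the
# signature and return value are assumptions made for illustration.
textstat_editdist_sketch <- function(types, pattern, n = 10) {
  stopifnot(is.character(types), is.character(pattern), length(pattern) == 1)
  # adist() returns a 1 x length(types) matrix here; flatten it to a vector
  dist <- as.vector(utils::adist(pattern, types))
  dat <- data.frame(feature = types, dist = dist, stringsAsFactors = FALSE)
  # order() leaves ties in their original order, as in the output above
  dat <- dat[order(dat$dist), , drop = FALSE]
  head(dat, n)
}

# Toy vocabulary standing in for types(toks)
types <- c("of", "the", "they", "he", "The", "people", "thee")
res <- textstat_editdist_sketch(types, "the", n = 5)
print(res)
```

Swapping `utils::adist()` for `stringdist::stringdist(pattern, types)` inside the function should give the behaviour shown in the output above, at the cost of a package dependency.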

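stringdist seems fairly fast. In the benchmark below, the median time is about 0.27 ms for 1,000 types and about 0.65 ms for 10,000: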

> microbenchmark::microbenchmark(
+   stringdist::stringdist("the", head(types(toks), 1000)),
+   stringdist::stringdist("the", head(types(toks), 10000))
+ )
Unit: microseconds
                                                    expr   min     lq   mean median     uq    max neval
  stringdist::stringdist("the", head(types(toks), 1000)) 241.5 258.95 306.87 271.85 315.95  946.1   100
 stringdist::stringdist("the", head(types(toks), 10000)) 564.1 621.00 702.29 651.65 686.40 1708.9   100
It also seems to work with non-ASCII characters:

stringdist::stringdist("世界人権宣言", "世界平和宣言")
[1] 2


koheiw self-assigned this Jan 18, 2020

kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020
