textstat_editdist #12

Open
koheiw opened this issue Jan 18, 2020 · 0 comments

koheiw commented Jan 18, 2020

Inspired by a Stack Overflow post, I propose adding textstat_editdist(), whose output would be similar to:

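library("quanteda")

# edit distance from "the" to every type in the inaugural corpus
toks <- tokens(data_corpus_inaugural)
dat <- data.frame(feature = types(toks),
                  dist = stringdist::stringdist("the", types(toks)))
# the 10 closest types
head(dat[order(dat$dist), ], 10)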
     feature dist
3        the    0
188     they    1
343       he    1
347      The    1
387     them    1
548     then    1
4223     she    1
4261     tie    1
7087     She    1
7683    thee    1
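
To make the proposal a bit more concrete, here is a minimal sketch of what such a function might look like. The name `textstat_editdist_sketch()`, its arguments and its return format are all assumptions for illustration, not an existing quanteda API. It takes a plain character vector of types so that it runs with only the stringdist package installed.

```r
# Hypothetical sketch of the proposed function; the name, arguments and
# return format are assumptions, not an existing quanteda API.
textstat_editdist_sketch <- function(x, pattern, method = "osa") {
  stopifnot(is.character(x), is.character(pattern), length(pattern) == 1)
  # distance from `pattern` to every element of `x`
  d <- stringdist::stringdist(pattern, x, method = method)
  out <- data.frame(feature = x, dist = d, stringsAsFactors = FALSE)
  # closest features first; order() keeps ties in their original order
  out <- out[order(out$dist), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# small made-up vector of types, just to exercise the function
types_example <- c("world", "they", "the", "He", "tie")
res <- textstat_editdist_sketch(types_example, "the")
print(res)
```

With a tokens object, the same call would presumably be `textstat_editdist_sketch(types(toks), "the")`. The `method` argument is passed straight to `stringdist::stringdist()`, whose default is `"osa"` (optimal string alignment).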

stringdist seems fairly fast: the median time is about 0.27 ms for 1,000 types and 0.65 ms for 10,000 types.

microbenchmark::microbenchmark(
  stringdist::stringdist("the", head(types(toks), 1000)),
  stringdist::stringdist("the", head(types(toks), 10000))
)
Unit: microseconds
                                                    expr   min     lq   mean median     uq    max neval
  stringdist::stringdist("the", head(types(toks), 1000)) 241.5 258.95 306.87 271.85 315.95  946.1   100
 stringdist::stringdist("the", head(types(toks), 10000)) 564.1 621.00 702.29 651.65 686.40 1708.9   100

It also seems to work with non-ASCII characters:

stringdist::stringdist("世界人権宣言", "世界平和宣言")
[1] 2
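
The result of 2 above counts characters, not bytes. As a rough check, the sketch below compares the default with stringdist's `useBytes = TRUE` option, which compares the underlying bytes instead. The variable names are only for illustration, and the strings are assumed to be UTF-8 encoded.

```r
# Same pair of strings as above, forced to UTF-8 for a predictable comparison
a <- enc2utf8("世界人権宣言")
b <- enc2utf8("世界平和宣言")

# default: distance over characters (two kanji differ)
d_chars <- stringdist::stringdist(a, b)

# useBytes = TRUE: distance over raw bytes, so multi-byte characters count more
d_bytes <- stringdist::stringdist(a, b, useBytes = TRUE)

print(c(chars = d_chars, bytes = d_bytes))
```

For a function working on tokens, the character-level default is presumably what users would expect.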

@koheiw koheiw self-assigned this Jan 18, 2020
@kbenoit kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020