Add short section on how to import pre-tokenized text #106

mrwunderbar666 opened this issue Sep 28, 2023 · 4 comments
Hi,

Once in a while I get a dataset that is already pre-tokenized (a data frame with a column of tokens and a doc_id column). Every time that happens, I have to search forever to figure out how to coerce that format into something quanteda likes.

Maybe it is already covered in the docs, but Google fails me when I search for it.

My solution is the one below, but I am not sure it is the best way:

library(quanteda)

# example data from the udpipe package
data("brussels_reviews_anno", package = "udpipe")

# cast tokenized data to list
tmp_list <- aggregate(token ~ doc_id, data = brussels_reviews_anno, FUN = "list")

# unpack data and create named list
l <- tmp_list$token
names(l) <- tmp_list$doc_id

# convert to quanteda tokens, then to a document-feature matrix
converted_dfm <- l |>
  quanteda::as.tokens() |>
  quanteda::dfm()
koheiw commented Sep 28, 2023

You should use split() to make the list. @stefan-mueller, let's add this to https://tutorials.quanteda.io/basic-operations/tokens/tokens/

library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.

data("brussels_reviews_anno", package = "udpipe")
lis <- split(brussels_reviews_anno$token, brussels_reviews_anno$doc_id)
toks <- as.tokens(lis)
head(toks)
#> Tokens consisting of 6 documents.
#> 10049756 :
#>  [1] "Muy"        "buena"      "estadia"    ","          "la"        
#>  [6] "habitacion" "donde"      "nos"        "hospedamos" "es"        
#> [11] "muy"        "amplia"    
#> [ ... and 189 more ]
#> 
#> 10061484 :
#>  [1] "Muy"             "buen"            "departamento,en" "una"            
#>  [5] "excelente"       "ubicacion"       "."               "Jacques"        
#>  [9] "es"              "un"              "buen"            "buen"           
#> [ ... and 34 more ]
#> 
#> 10066128 :
#>  [1] "Nous"        "avons"       "passe"       "un"          "excellent"  
#>  [6] "sejour"      "dans"        "ce"          "tres"        "joli"       
#> [11] "appartement" "."          
#> [ ... and 43 more ]
#> 
#> 10114635 :
#>  [1] "La"        "casa"      "es"        "muy"       "comoda"    ","        
#>  [7] "estaba"    "muy"       "limpia"    ","         "situacion" "perfecta" 
#> [ ... and 82 more ]
#> 
#> 10120339 :
#>  [1] "Sejour"      "parfait"     "chez"        "Olivier"     "."          
#>  [6] "Un"          "hote"        "attentionne" ","           "disponible" 
#> [11] "et"          "accueillant"
#> [ ... and 34 more ]
#> 
#> 10160362 :
#>  [1] "Apartamento" "que"         "esta"        "cerca"       "del"        
#>  [6] "centro"      "."           "Tiene"       "todo"        "lo"         
#> [11] "que"         "necesitas"  
#> [ ... and 124 more ]
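
The tokens object can then be converted to a dfm exactly as in the original post; a minimal continuation of the example above:

# document-feature matrix from the pre-tokenized input
dfmat <- quanteda::dfm(toks)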

kbenoit commented Sep 29, 2023

Since the udpipe output has almost the same structure as the output of spacyr::spacy_parse(), it can use the as.tokens.spacyr_parsed() method. If you want the POS tags appended, you have to slightly rename the udpipe output.

library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.

# example data from the udpipe package
data("brussels_reviews_anno", package = "udpipe")


toks_plain <- brussels_reviews_anno |>
    getS3method("as.tokens", class = "spacyr_parsed")()
print(toks_plain, 3, 6)
#> Tokens consisting of 1,500 documents.
#> 32198807 :
#> [1] "Gwen"       "fue"        "una"        "magnifica"  "anfitriona"
#> [6] "."         
#> [ ... and 111 more ]
#> 
#> 12919832 :
#> [1] "Aurelie"      "fue"          "muy"          "atenta"       "y"           
#> [6] "comunicativa"
#> [ ... and 41 more ]
#> 
#> 23786310 :
#> [1] "La"        "estancia"  "fue"       "muy"       "agradable" "."        
#> [ ... and 60 more ]
#> 
#> [ reached max_ndoc ... 1,497 more documents ]

toks_pos <- dplyr::rename(brussels_reviews_anno, pos = upos) |>
    getS3method("as.tokens", class = "spacyr_parsed")(include_pos = "pos")
print(toks_pos, 3, 6)
#> Tokens consisting of 1,500 documents.
#> 32198807 :
#> [1] "Gwen/NOUN"      "fue/VERB"       "una/DET"        "magnifica/NOUN"
#> [5] "anfitriona/ADJ" "./PUNCT"       
#> [ ... and 111 more ]
#> 
#> 12919832 :
#> [1] "Aurelie/NOUN"     "fue/VERB"         "muy/ADV"          "atenta/ADJ"      
#> [5] "y/CONJ"           "comunicativa/ADJ"
#> [ ... and 41 more ]
#> 
#> 23786310 :
#> [1] "La/DET"        "estancia/NOUN" "fue/VERB"      "muy/ADV"      
#> [5] "agradable/ADJ" "./PUNCT"      
#> [ ... and 60 more ]
#> 
#> [ reached max_ndoc ... 1,497 more documents ]

Created on 2023-09-29 with reprex v2.0.2

koheiw commented Oct 1, 2023

as.tokens.spacyr_parsed() should be just as.tokens.data.frame() so that people can use it more broadly.
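
A minimal sketch of what such a generic converter could look like, assuming a plain split()-based implementation (the function name and arguments are illustrative placeholders, not an existing quanteda API):

library(quanteda)
data("brussels_reviews_anno", package = "udpipe")

# hypothetical data-frame-to-tokens converter; argument names are
# placeholders, not part of quanteda
as_tokens_df <- function(x, docid_field = "doc_id", token_field = "token") {
    quanteda::as.tokens(split(x[[token_field]], x[[docid_field]]))
}

toks <- as_tokens_df(brussels_reviews_anno)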

kbenoit commented Oct 1, 2023

I thought of that too; then we would just need the equivalents of docid_field = "doc_id", tokenid_field = "token", pos_field = "pos", etc. The udpipe output almost matches the spacyr_parsed column names, but not for the POS tag, hence the renaming in my code above.
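
To illustrate the parameters being discussed, a hedged sketch that also handles the POS tag (again hypothetical; as.tokens() has no data.frame method with these arguments at the time of this thread):

library(quanteda)
data("brussels_reviews_anno", package = "udpipe")

# hypothetical generalized converter with a pos_field argument;
# none of these argument names exist in quanteda as of this thread
as_tokens_df <- function(x, docid_field = "doc_id", tokenid_field = "token",
                         pos_field = "pos", include_pos = FALSE) {
    toks <- x[[tokenid_field]]
    if (include_pos)
        toks <- paste(toks, x[[pos_field]], sep = "/")
    quanteda::as.tokens(split(toks, x[[docid_field]]))
}

# udpipe stores the POS tag in "upos", hence the explicit pos_field
toks_pos <- as_tokens_df(brussels_reviews_anno, pos_field = "upos",
                         include_pos = TRUE)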
