Add short section on how to import pre-tokenized text #106

mrwunderbar666 opened this issue Sep 28, 2023 · 4 comments
Hi,

Once in a while I get a dataset that is already pre-tokenized (a data frame with a column of tokens and a doc_id column). Every time that happens, I have to search forever to figure out how to coerce that format into something quanteda likes.

Maybe it is already covered in the docs, but Google fails me when I search for it.

My solution is the one below, but I am not sure it is the best way:

library(quanteda)

# example data from the udpipe package
data("brussels_reviews_anno", package = "udpipe")

# cast tokenized data to list
tmp_list <- aggregate(token ~ doc_id, data = brussels_reviews_anno, FUN = "list")

# unpack data and create named list
l <- tmp_list$token
names(l) <- tmp_list$doc_id

# convert to quanteda tokens, then to a document-feature matrix
converted_dfm <- l |>
  quanteda::as.tokens() |>
  quanteda::dfm()
koheiw commented Sep 28, 2023

You should use split() to make the list. @stefan-mueller, let's add this to https://tutorials.quanteda.io/basic-operations/tokens/tokens/

library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.

data("brussels_reviews_anno", package = "udpipe")
lis <- split(brussels_reviews_anno$token, brussels_reviews_anno$doc_id)
toks <- as.tokens(lis)
head(toks)
#> Tokens consisting of 6 documents.
#> 10049756 :
#>  [1] "Muy"        "buena"      "estadia"    ","          "la"        
#>  [6] "habitacion" "donde"      "nos"        "hospedamos" "es"        
#> [11] "muy"        "amplia"    
#> [ ... and 189 more ]
#> 
#> 10061484 :
#>  [1] "Muy"             "buen"            "departamento,en" "una"            
#>  [5] "excelente"       "ubicacion"       "."               "Jacques"        
#>  [9] "es"              "un"              "buen"            "buen"           
#> [ ... and 34 more ]
#> 
#> 10066128 :
#>  [1] "Nous"        "avons"       "passe"       "un"          "excellent"  
#>  [6] "sejour"      "dans"        "ce"          "tres"        "joli"       
#> [11] "appartement" "."          
#> [ ... and 43 more ]
#> 
#> 10114635 :
#>  [1] "La"        "casa"      "es"        "muy"       "comoda"    ","        
#>  [7] "estaba"    "muy"       "limpia"    ","         "situacion" "perfecta" 
#> [ ... and 82 more ]
#> 
#> 10120339 :
#>  [1] "Sejour"      "parfait"     "chez"        "Olivier"     "."          
#>  [6] "Un"          "hote"        "attentionne" ","           "disponible" 
#> [11] "et"          "accueillant"
#> [ ... and 34 more ]
#> 
#> 10160362 :
#>  [1] "Apartamento" "que"         "esta"        "cerca"       "del"        
#>  [6] "centro"      "."           "Tiene"       "todo"        "lo"         
#> [11] "que"         "necesitas"  
#> [ ... and 124 more ]
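
The tokens object can then be converted to a dfm exactly as in the original post; a minimal continuation of the example above:

# document-feature matrix from the pre-tokenized input
dfmat <- quanteda::dfm(toks)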

kbenoit commented Sep 29, 2023

Since the udpipe output has almost the same structure as the output of spacyr::spacy_parse(), it can use the as.tokens.spacyr_parsed() method. If you want the POS tags appended, you have to slightly rename the udpipe output.

library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.

# example data from the udpipe package
data("brussels_reviews_anno", package = "udpipe")


toks_plain <- brussels_reviews_anno |>
    getS3method("as.tokens", class = "spacyr_parsed")()
print(toks_plain, 3, 6)
#> Tokens consisting of 1,500 documents.
#> 32198807 :
#> [1] "Gwen"       "fue"        "una"        "magnifica"  "anfitriona"
#> [6] "."         
#> [ ... and 111 more ]
#> 
#> 12919832 :
#> [1] "Aurelie"      "fue"          "muy"          "atenta"       "y"           
#> [6] "comunicativa"
#> [ ... and 41 more ]
#> 
#> 23786310 :
#> [1] "La"        "estancia"  "fue"       "muy"       "agradable" "."        
#> [ ... and 60 more ]
#> 
#> [ reached max_ndoc ... 1,497 more documents ]

toks_pos <- dplyr::rename(brussels_reviews_anno, pos = upos) |>
    getS3method("as.tokens", class = "spacyr_parsed")(include_pos = "pos")
print(toks_pos, 3, 6)
#> Tokens consisting of 1,500 documents.
#> 32198807 :
#> [1] "Gwen/NOUN"      "fue/VERB"       "una/DET"        "magnifica/NOUN"
#> [5] "anfitriona/ADJ" "./PUNCT"       
#> [ ... and 111 more ]
#> 
#> 12919832 :
#> [1] "Aurelie/NOUN"     "fue/VERB"         "muy/ADV"          "atenta/ADJ"      
#> [5] "y/CONJ"           "comunicativa/ADJ"
#> [ ... and 41 more ]
#> 
#> 23786310 :
#> [1] "La/DET"        "estancia/NOUN" "fue/VERB"      "muy/ADV"      
#> [5] "agradable/ADJ" "./PUNCT"      
#> [ ... and 60 more ]
#> 
#> [ reached max_ndoc ... 1,497 more documents ]

Created on 2023-09-29 with reprex v2.0.2

koheiw commented Oct 1, 2023

as.tokens.spacyr_parsed() should be just as.tokens.data.frame() so that people can use it more broadly.
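
A minimal sketch of what such a generic converter could look like, assuming a plain split()-based implementation (the function name and arguments are illustrative placeholders, not an existing quanteda API):

library(quanteda)
data("brussels_reviews_anno", package = "udpipe")

# hypothetical data-frame-to-tokens converter; argument names are
# placeholders, not part of quanteda
as_tokens_df <- function(x, docid_field = "doc_id", token_field = "token") {
    quanteda::as.tokens(split(x[[token_field]], x[[docid_field]]))
}

toks <- as_tokens_df(brussels_reviews_anno)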

kbenoit commented Oct 1, 2023

I thought of that too; then we would just need the equivalents of docid_field = "doc_id", tokenid_field = "token", pos_field = "pos", etc. The udpipe output almost matches the spacyr_parsed column names, but not for the POS tag, hence the renaming in my code above.
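
To illustrate the parameters being discussed, a hedged sketch that also handles the POS tag (again hypothetical; as.tokens() has no data.frame method with these arguments at the time of this thread):

library(quanteda)
data("brussels_reviews_anno", package = "udpipe")

# hypothetical generalized converter with a pos_field argument;
# none of these argument names exist in quanteda as of this thread
as_tokens_df <- function(x, docid_field = "doc_id", tokenid_field = "token",
                         pos_field = "pos", include_pos = FALSE) {
    toks <- x[[tokenid_field]]
    if (include_pos)
        toks <- paste(toks, x[[pos_field]], sep = "/")
    quanteda::as.tokens(split(toks, x[[docid_field]]))
}

# udpipe stores the POS tag in "upos", hence the explicit pos_field
toks_pos <- as_tokens_df(brussels_reviews_anno, pos_field = "upos",
                         include_pos = TRUE)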
