Add short section on how to import pre-tokenized text #106
You should use `as.tokens()`:

library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
data("brussels_reviews_anno", package = "udpipe")
lis <- split(brussels_reviews_anno$token, brussels_reviews_anno$doc_id)
toks <- as.tokens(lis)
head(toks)
#> Tokens consisting of 6 documents.
#> 10049756 :
#> [1] "Muy" "buena" "estadia" "," "la"
#> [6] "habitacion" "donde" "nos" "hospedamos" "es"
#> [11] "muy" "amplia"
#> [ ... and 189 more ]
#>
#> 10061484 :
#> [1] "Muy" "buen" "departamento,en" "una"
#> [5] "excelente" "ubicacion" "." "Jacques"
#> [9] "es" "un" "buen" "buen"
#> [ ... and 34 more ]
#>
#> 10066128 :
#> [1] "Nous" "avons" "passe" "un" "excellent"
#> [6] "sejour" "dans" "ce" "tres" "joli"
#> [11] "appartement" "."
#> [ ... and 43 more ]
#>
#> 10114635 :
#> [1] "La" "casa" "es" "muy" "comoda" ","
#> [7] "estaba" "muy" "limpia" "," "situacion" "perfecta"
#> [ ... and 82 more ]
#>
#> 10120339 :
#> [1] "Sejour" "parfait" "chez" "Olivier" "."
#> [6] "Un" "hote" "attentionne" "," "disponible"
#> [11] "et" "accueillant"
#> [ ... and 34 more ]
#>
#> 10160362 :
#> [1] "Apartamento" "que" "esta" "cerca" "del"
#> [6] "centro" "." "Tiene" "todo" "lo"
#> [11] "que" "necesitas"
#> [ ... and 124 more ]
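A tokens object built this way plugs into the rest of the quanteda workflow just like the output of `tokens()`; a minimal sketch (assuming quanteda v3 or later, where `dfm()` accepts a tokens object directly):

```r
library(quanteda)

# example data from the udpipe package: one row per token, with a doc_id column
data("brussels_reviews_anno", package = "udpipe")

# split the token column by document, then coerce to a quanteda tokens object
lis  <- split(brussels_reviews_anno$token, brussels_reviews_anno$doc_id)
toks <- as.tokens(lis)

# downstream processing works as usual, e.g. building a document-feature matrix
dfmat <- dfm(tokens_tolower(toks))
topfeatures(dfmat, 5)
```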
Since the udpipe output has almost the same structure as the output from spacyr, you can reuse the `as.tokens()` method for the `spacyr_parsed` class:

library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
# example data from the udpipe package
data("brussels_reviews_anno", package = "udpipe")
toks_plain <- brussels_reviews_anno |>
getS3method("as.tokens", class = "spacyr_parsed")()
print(toks_plain, 3, 6)
#> Tokens consisting of 1,500 documents.
#> 32198807 :
#> [1] "Gwen" "fue" "una" "magnifica" "anfitriona"
#> [6] "."
#> [ ... and 111 more ]
#>
#> 12919832 :
#> [1] "Aurelie" "fue" "muy" "atenta" "y"
#> [6] "comunicativa"
#> [ ... and 41 more ]
#>
#> 23786310 :
#> [1] "La" "estancia" "fue" "muy" "agradable" "."
#> [ ... and 60 more ]
#>
#> [ reached max_ndoc ... 1,497 more documents ]
toks_pos <- dplyr::rename(brussels_reviews_anno, pos = upos) |>
getS3method("as.tokens", class = "spacyr_parsed")(include_pos = "pos")
print(toks_pos, 3, 6)
#> Tokens consisting of 1,500 documents.
#> 32198807 :
#> [1] "Gwen/NOUN" "fue/VERB" "una/DET" "magnifica/NOUN"
#> [5] "anfitriona/ADJ" "./PUNCT"
#> [ ... and 111 more ]
#>
#> 12919832 :
#> [1] "Aurelie/NOUN" "fue/VERB" "muy/ADV" "atenta/ADJ"
#> [5] "y/CONJ" "comunicativa/ADJ"
#> [ ... and 41 more ]
#>
#> 23786310 :
#> [1] "La/DET" "estancia/NOUN" "fue/VERB" "muy/ADV"
#> [5] "agradable/ADJ" "./PUNCT"
#> [ ... and 60 more ]
#>
#> [ reached max_ndoc ... 1,497 more documents ]

Created on 2023-09-29 with reprex v2.0.2
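The same split-and-coerce pattern works for any token-like column, not just the surface forms; a sketch assuming the `lemma` column that udpipe includes in its annotation output:

```r
library(quanteda)

data("brussels_reviews_anno", package = "udpipe")

# use lemmas instead of surface tokens when building the tokens object
lis_lemma  <- split(brussels_reviews_anno$lemma, brussels_reviews_anno$doc_id)
toks_lemma <- as.tokens(lis_lemma)
```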
I thought of that too; then we would just need the equivalents of …
Hi,
Once in a while I get a dataset that is already pre-tokenized (a data frame with a column of tokens and a doc_id column). Every time that happens, I have to search for ages to figure out how to coerce that format into something quanteda accepts.
Maybe it is already in the docs, but Google fails me when I search for it.
My solution is this one, but I am not sure whether it is the best way: