This repository has been archived by the owner on Dec 7, 2023. It is now read-only.

Robust and fast tokenization alignment library for Rust and Python https://tamuhey.github.io/tokenizations/


explosion/tokenizations



  • Demo: demo
  • Rust documentation: docs.rs
  • Blog post: How to calculate the alignment between BERT and spaCy tokens effectively and robustly

Usage (Python)

  • Installation
$ pip install -U pip # update pip
$ pip install pytokenizations
  • Or, install from source

This library uses maturin to build the wheel.

$ git clone https://github.com/tamuhey/tokenizations
$ cd tokenizations/python
$ pip install maturin
$ maturin build

The wheel is now created in the python/target/wheels directory, and you can install it with pip install *.whl.

get_alignments

def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...

Returns alignment mappings for two different tokenizations:

>>> tokens_a = ["å", "BC"]
>>> tokens_b = ["abc"] # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[0], [0]]
>>> print(b2a)
[[0, 1]]

a2b[i] is the list of indices of the tokens in tokens_b that align to tokens_a[i]; b2a is the reverse mapping.
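To make the mapping concrete, here is a minimal pure-Python sketch of what get_alignments computes. It is an illustration only, not the library's implementation: the real library is written in Rust and uses a Myers-style diff over normalized text, while this sketch substitutes difflib.SequenceMatcher and a simple lowercase/diacritic-stripping normalization.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(token):
    # Lowercase and strip diacritics (an assumed, simplified stand-in
    # for the normalization the library performs before aligning).
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def get_alignments(a, b):
    # Record, for every character of the normalized concatenation,
    # which token it came from.
    def flatten(tokens):
        chars, owners = [], []
        for i, tok in enumerate(tokens):
            for c in normalize(tok):
                chars.append(c)
                owners.append(i)
        return "".join(chars), owners

    text_a, owner_a = flatten(a)
    text_b, owner_b = flatten(b)

    a2b = [set() for _ in a]
    b2a = [set() for _ in b]
    # SequenceMatcher stands in for the Myers diff used by the real
    # library; each pair of matched characters links its owning tokens.
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i, j = owner_a[block.a + k], owner_b[block.b + k]
            a2b[i].add(j)
            b2a[j].add(i)
    return [sorted(s) for s in a2b], [sorted(s) for s in b2a]

print(get_alignments(["å", "BC"], ["abc"]))  # ([[0], [0]], [[0, 1]])
```

Because the alignment is computed at the character level after normalization, it stays robust to case, accents, and differing token boundaries, which is exactly what makes it useful for mapping between, say, BERT subwords and spaCy tokens.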

Usage (Rust)

See here: docs.rs

Related