This repository has been archived by the owner on Dec 7, 2023. It is now read-only.

Robust and fast tokenization alignment library for Rust and Python https://tamuhey.github.io/tokenizations/


explosion/tokenizations



  • Demo: demo
  • Rust documentation: docs.rs
  • Blog post: How to calculate the alignment between BERT and spaCy tokens effectively and robustly

Usage (Python)

  • Installation
$ pip install -U pip # update pip
$ pip install pytokenizations
  • Or, install from source

This library uses maturin to build the wheel.

$ git clone https://github.com/tamuhey/tokenizations
$ cd tokenizations/python
$ pip install maturin
$ maturin build

The wheel is now created in the python/target/wheels directory, and you can install it with pip install *.whl.

get_alignments

def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...

Returns alignment mappings for two different tokenizations:

>>> tokens_a = ["å", "BC"]
>>> tokens_b = ["abc"] # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[0], [0]]
>>> print(b2a)
[[0, 1]]

a2b[i] is the list of indices of the tokens in tokens_b that align to tokens_a[i]; b2a is the reverse mapping.
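To make the mapping concrete, here is a minimal pure-Python sketch of what get_alignments computes. It is an illustration only, not the library's implementation: the real library is written in Rust and uses a Myers-style diff over normalized text, while this sketch substitutes difflib.SequenceMatcher and a simple lowercase/diacritic-stripping normalization.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(token):
    # Lowercase and strip diacritics (an assumed, simplified stand-in
    # for the normalization the library performs before aligning).
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def get_alignments(a, b):
    # Record, for every character of the normalized concatenation,
    # which token it came from.
    def flatten(tokens):
        chars, owners = [], []
        for i, tok in enumerate(tokens):
            for c in normalize(tok):
                chars.append(c)
                owners.append(i)
        return "".join(chars), owners

    text_a, owner_a = flatten(a)
    text_b, owner_b = flatten(b)

    a2b = [set() for _ in a]
    b2a = [set() for _ in b]
    # SequenceMatcher stands in for the Myers diff used by the real
    # library; each pair of matched characters links its owning tokens.
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i, j = owner_a[block.a + k], owner_b[block.b + k]
            a2b[i].add(j)
            b2a[j].add(i)
    return [sorted(s) for s in a2b], [sorted(s) for s in b2a]

print(get_alignments(["å", "BC"], ["abc"]))  # ([[0], [0]], [[0, 1]])
```

Because the alignment is computed at the character level after normalization, it stays robust to case, accents, and differing token boundaries, which is exactly what makes it useful for mapping between, say, BERT subwords and spaCy tokens.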

Usage (Rust)

See here: docs.rs

Related