GitHub - gioui/uax: Unicode Text Segmentation Algorithms

Unicode Text Segmentation Algorithms

Text processing applications need to segment text into pieces. Segments may be

words
sentences
paragraphs

and so on. For Western languages this is not too hard a problem, but it can become an involved endeavor once you consider Arabic or Asian languages. From a typographic viewpoint, some of these languages present serious challenges for correct segmentation. The Unicode Consortium publishes recommendations and algorithms for various aspects of text segmentation in its Unicode Standard Annexes (UAX).

Text Segmentation in Go

There exist a number of Unicode standards describing best practices for text segmentation. Unfortunately, implementations in Go are sparse. Marcel van Lohuizen from the Go core team seems to be working on text segmentation, but with low priority. In the long run, it will be best to wait for the standard library to include functions for text segmentation. For now, however, I will implement my own.

Status

This is very much a work in progress and not intended for production use. Please be patient.

