- Extract sentences from PDFs with Apache Tika (Thai sentences with `pythainlp` and English sentences with `nltk`)
python extract_sentences.py --en_dir en_data/ --th_dir th_data/
- Align sentences using the Universal Sentence Encoder
python align_sentences_use.py --en_dir en_data/ --th_dir th_data/ --output_path assorted_government.csv
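
The alignment step can be pictured roughly as follows. This is a minimal sketch, not the actual logic of `align_sentences_use.py`: it assumes each sentence has already been turned into an embedding vector (here, hand-written toy vectors stand in for encoder output), and it pairs each English sentence with its most similar Thai sentence by cosine similarity. The `cosine` and `align` helpers and the `threshold` value are hypothetical names chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(
    en_vecs: Sequence[Sequence[float]],
    th_vecs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the repo
) -> List[Tuple[int, int, float]]:
    """For each English vector, pick the most similar Thai vector.

    Returns (en_index, th_index, score) for pairs scoring >= threshold.
    """
    pairs = []
    if not th_vecs:
        return pairs
    for i, en in enumerate(en_vecs):
        scores = [cosine(en, th) for th in th_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy embeddings standing in for sentence-encoder output (assumed data).
en_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
th_vecs = [[0.0, 0.9, 0.1], [0.8, 0.1, 0.0]]

pairs = align(en_vecs, th_vecs)
for en_idx, th_idx, score in pairs:
    print(f"en[{en_idx}] <-> th[{th_idx}]  score={score:.3f}")
```

In this toy run the third English vector has no close Thai counterpart, so it falls below the threshold and is left unaligned. A real pipeline would replace the toy vectors with encoder output and would likely tune the cutoff on held-out data.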
- @attapol - Extraction and normalization of Thai texts from PDF
- @pinedbean - Universal sentence encoder inference code
- @cstorm125 - Sentence alignment with universal sentence encoder
- @pnphannisa - Sourcing government documents as PDF files