SemSim is a project for computing the semantic similarity of word pairs. In particular, it provides a careful Python implementation of the Semantic Diversity metric (SemD) from Hoffman et al. (2013). It also provides word and document vector models for German and English that are optimized for semantic diversity and perform well on the Synonym Judgement Task. The models were trained on the British National Corpus (BNC) for English and on the DeWaC corpus for German.
conda env create -f environment.yml
conda activate semsim
pip install -r requirements.txt
pip install -e .
For downloading the BNC and DeWaC corpora visit:
In addition, download the models and evaluation content from this
Google Drive folder and
extract it to semsim/data.
The models referenced in Bechtold et al. 2023 are located in:
semsim/data/SemD/DEWAC_1000_40k_v2 (LSI-based word and document vectors + SemD values)
semsim/data/SemD/DEWAC_1000_40k_d2v (doc2vec-based word and document vectors + SemD values)
This section describes how to convert the BNC and DeWaC corpora into a common format for further processing. If you are only interested in the vector models, or in calculating SemD or running the Synonym Judgement Task with pre-trained models, you can skip this section.
Download the full BNC (XML edition) from the Oxford Text Archive via:
British National Corpus (BNC).
The download is a .zip file which can be extracted anywhere, preferably into
semsim/data/corpora/BNC/<corpus-version>,
where <corpus-version> is usually the stem of the .zip file. This path is referenced below as $BNC_DIR
and should contain a download directory.
Run the extraction script via:
python -m semsim.corpus.bnc -i $BNC_DIR
For our reference model we used:
python -m semsim.corpus.bnc -i $BNC_DIR \
--window 1000 \
--min-doc-size 50 \
--lowercase \
--tags-blocklist PUN PUL PUR UNC PUQ
Filtering out uninformative POS tags made no significant difference in model performance, but it improves the efficiency of the downstream pipeline.
For additional options run
python -m semsim.corpus.bnc --help
coming soon
In this step we extract the term-document matrix from the pre-processed corpus and apply a normalization such as tf-idf or log-entropy to the matrix. We then convert the sparse matrix into two dense matrices representing word and document vectors using latent semantic indexing (LSI). These latent representations can later be used to calculate SemD for a given set of words.
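The pipeline above can be sketched in a few lines of NumPy. This is only an illustration, not the project's actual implementation: it uses a toy count matrix, one common tf-idf variant (the project may use log-entropy instead), and a plain truncated SVD in place of a full LSI model.

```python
import numpy as np

# Toy term-document count matrix (terms x documents); in the real
# pipeline this is extracted from the pre-processed corpus.
counts = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 3],
    [0, 0, 1, 1],
], dtype=float)

# tf-idf normalization (one common variant).
tf = counts / counts.sum(axis=0, keepdims=True)
df = (counts > 0).sum(axis=1)                 # document frequency per term
idf = np.log(counts.shape[1] / df)
tfidf = tf * idf[:, None]

# LSI corresponds to a truncated SVD of the normalized matrix.
k = 2                                         # number of latent dimensions
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
word_vectors = U[:, :k] * S[:k]               # dense term vectors (terms x k)
doc_vectors = (Vt[:k, :] * S[:k, None]).T     # dense document vectors (docs x k)
```

The reference models use far larger matrices (e.g. 1000-token windows over a 40k vocabulary), but the structure of the computation is the same.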
Run for details:
python -m semsim.metric.semantic_diversity --help
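Given the latent document vectors, SemD for a word can be sketched as follows. Per Hoffman et al. (2013), SemD is the natural log of the inverse of the mean cosine similarity between all pairs of contexts in which the word occurs; the function below is an illustrative re-statement of that definition, not the module's actual code.

```python
import numpy as np

def semd(context_vectors):
    """Semantic diversity of a word, following Hoffman et al. (2013):
    -ln(mean cosine similarity over all pairs of contexts containing it)."""
    V = np.asarray(context_vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize rows
    sims = V @ V.T                                    # pairwise cosine similarities
    iu = np.triu_indices(len(V), k=1)                 # each unordered pair once
    return -np.log(sims[iu].mean())

# Toy example: a word whose contexts are all similar has low SemD,
# a word appearing in dissimilar contexts has high SemD.
homogeneous = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.15]]
diverse = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(semd(homogeneous) < semd(diverse))  # True
```

In the real pipeline, the context vectors for a word are the LSI document vectors of all documents (windows) in which the word occurs.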
coming soon
coming soon