---
title: "Text Similarity"
author: "AdrienSIEG"
date: "17/04/2019"
output:
  md_document:
    variant: markdown_github
---
# The article:
### https://medium.com/@adriensieg/text-similarities-da019229c894
# Methods studied in the article
- Jaccard Similarity
- Different embeddings + K-means
- Different embeddings + Cosine Similarity
- Word2Vec + Smooth Inverse Frequency + Cosine Similarity
- Different embeddings + LSI + Cosine Similarity
- Different embeddings + LDA + Jensen-Shannon distance
- Different embeddings + Word Mover Distance
- Different embeddings + Variational Auto Encoder (VAE)
- Different embeddings + Universal sentence encoder
- Different embeddings + Siamese Manhattan LSTM
- Knowledge-based Measures
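As a quick starting point, here is a minimal sketch (not from the article) of the two simplest baselines in this list: Jaccard similarity over token sets and cosine similarity over a TF-IDF representation. Any of the embeddings above can be plugged in instead of TF-IDF.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = "The president greets the press in Chicago"
s2 = "Obama speaks to the media in Illinois"

# Jaccard similarity: size of the token-set intersection over the union.
tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
print("Jaccard:", len(tokens1 & tokens2) / len(tokens1 | tokens2))

# Cosine similarity over a TF-IDF representation; any sentence embedding
# (averaged word2vec, GloVe, USE, ...) can replace the TF-IDF vectors.
vectors = TfidfVectorizer().fit_transform([s1, s2])
print("Cosine:", cosine_similarity(vectors[0], vectors[1])[0, 0])
```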
# SOURCES
- **[Incredible !!!]** : https://github.com/nlptown/nlp-notebooks
- An Introduction to Word Embeddings
- Data exploration with sentence similarity
  - Discovering and Visualizing Topics in Texts with LDA (in French!)
- Keras sentiment analysis with Elmo Embeddings
- Multilingual Embeddings - 1. Introduction
- Multilingual Embeddings - 2. Cross-lingual Sentence Similarity
- Multilingual Embeddings - 3. Transfer Learning
- NLP with pretrained models - spaCy and StanfordNLP
- Named Entity Recognition with Conditional Random Fields
- Sequence Labelling with a BiLSTM in PyTorch [sequence labelling tasks such as part-of-speech tagging or named entity recognition]
- Simple Sentence Similarity [Word Mover's Distance + Smooth Inverse Frequency + InferSent + Google Sentence Encoder + Pearson correlation]
- Text classification with BERT in PyTorch
- Text classification with a CNN in PyTorch
- Traditional text classification with Scikit-learn [ELI5]
- Updating spaCy's Named Entity Recognition System
- **[Tutorial WMD in jupyter notebook]** : https://github.com/makcedward/nlp/blob/master/sample/nlp-word_mover_distance.ipynb
- **[Word Mover Distance]** : https://www.kaggle.com/ankitswarnkar/word-embedding-using-glove-vector
- **[lstm-gru-sentiment-analysis]** : https://github.com/javaidnabi31/Word-Embeddding-Sentiment-Classification
- **[ELMo: contextual language embeddings]** : https://towardsdatascience.com/elmo-contextual-language-embedding-335de2268604
- **[Learning Word Embedding (Mathematics)]** https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html
- **[A Beginner's Aha Moments for Word2Vec]** : https://yidatao.github.io/2017-08-03/word2vec-aha/
- **[Glove, Word2Vec, Fastext classes]** : https://github.com/makcedward/nlp/blob/master/sample/nlp-word_embedding.ipynb
- **[!!! Very nice tutorial about how word2vec works]** : https://towardsdatascience.com/word2vec-made-easy-139a31a4b8ae
- **[!!! An implementation guide to Word2Vec using NumPy and Google Sheets]** : https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281
- **[WordRank embedding: “crowned” is most similar to “king”, not word2vec’s “Canute”]** : https://rare-technologies.com/wordrank-embedding-crowned-is-most-similar-to-king-not-word2vecs-canute/
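To experiment with the ideas in the word2vec tutorials above, here is a small gensim training sketch (not from these tutorials; parameter names assume gensim 4.x, where `vector_size` replaced the older `size` argument):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences.
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["the", "press", "reports", "the", "news"],
]

# Train a small skip-gram model on the toy corpus.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("king", topn=3))
print(model.wv.similarity("king", "queen"))
```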
- **[Nice !!! ]**: https://github.com/JacopoMangiavacchi/SwiftNLC/tree/master/ModelNotebooks
- Create Model With GloVe Embedding Bidirectional With Attention
- Create Model With FastText Embedding
- Create Model With GloVe Embedding
- Create Model With GloVe Embedding Bidirectional
- Create Model With NLTK Embedding
- Create Model With NS Linguistic Tagger Embedding
- **[Tutorial from ENSAE]** : http://www.xavierdupre.fr/app/papierstat/helpsphinx/notebooks/text_sentiment_wordvec.html#les-donnees
- **[CoreML with GloVe Word Embedding and Recursive Neural Network - nice tutorial]** : https://medium.com/@JMangia/coreml-with-glove-word-embedding-and-recursive-neural-network-part-2-ab238ca90970
- **[Big Benchmark]** : http://nlp.town/blog/sentence-similarity/
- Average W2V
- Average W2V + Stopwords
- Average W2V + TFIDF
- Average W2V + TFIDF + Stopwords
- Average Glove
- Average Glove + Stopwords
- Average Glove + TFIDF
- Average Glove + TFIDF + Stopwords
- W2V + WMD
- W2V + Stopwords + WMD
- Glove + WMD
- Glove + Stopwords + WMD
- Smooth Inverse Frequency + W2V
- Smooth Inverse Frequency + Glove
- InferSent (INF)
- GSE (Google Sentence Encoder)
**InferSent (INF)** = pre-trained encoder that was developed by Facebook Research. It is a BiLSTM with max pooling, trained on the SNLI dataset, 570k English sentence pairs labelled with one of three categories: entailment, contradiction or neutral.
**GSE (Google Sentence Encoder)** = Google’s answer to Facebook’s InferSent. It comes in two forms:
- an advanced model that takes the element-wise sum of the context-aware word representations produced by the encoding subgraph of a Transformer model
- a simpler Deep Averaging Network (DAN) where input embeddings for words and bigrams are averaged together and passed through a feed-forward deep neural network.
-> the benchmark compares all of these methods using the **Pearson correlation** between their scores and human similarity judgements (see the evaluation sketch below).
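A minimal sketch of that evaluation step with `scipy.stats.pearsonr`; the gold and predicted scores below are placeholders for illustration, not real benchmark output:

```python
from scipy.stats import pearsonr

# Human-annotated similarity scores for a few sentence pairs (e.g. from an STS dataset).
gold = [4.8, 0.5, 3.2, 1.0, 2.7]

# Scores produced by one of the methods above (cosine over averaged vectors, WMD, GSE, ...);
# these numbers are made up for illustration only.
predicted = [0.91, 0.12, 0.66, 0.30, 0.55]

r, p_value = pearsonr(gold, predicted)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```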
- **[How to predict Quora Question Pairs using Siamese Manhattan LSTM]** : https://medium.com/mlreview/implementing-malstm-on-kaggles-quora-question-pairs-competition-8b31b0b16a07
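The core of the MaLSTM described in that article is a shared LSTM encoder whose two sentence encodings are compared with exp(-L1 distance). Here is a hedged Keras sketch of that architecture, with toy dimensions and a plain trainable embedding rather than the pretrained word2vec vectors used in the original:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Toy hyper-parameters; the article initialises the embedding with pretrained word2vec.
vocab_size, embed_dim, max_len, hidden = 20000, 300, 30, 50

left_in = layers.Input(shape=(max_len,), dtype="int32")
right_in = layers.Input(shape=(max_len,), dtype="int32")

embedding = layers.Embedding(vocab_size, embed_dim)  # shared between both branches
encoder = layers.LSTM(hidden)                        # shared LSTM encoder

left_vec = encoder(embedding(left_in))
right_vec = encoder(embedding(right_in))

# MaLSTM similarity: exp(-L1 distance) between the two encodings, in (0, 1].
similarity = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True))
)([left_vec, right_vec])

model = Model(inputs=[left_in, right_in], outputs=similarity)
model.compile(loss="mean_squared_error", optimizer="adam")
model.summary()
```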
- **[Latent Semantic Indexing (LSI) - An Example with mathematics]** : http://www1.se.cuhk.edu.hk/~seem5680/lecture/LSI-Eg.pdf
- **[Finding similar documents with Word2Vec and WMD]** : https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
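With pretrained vectors loaded through gensim, Word Mover's Distance is essentially a one-liner. A minimal sketch (the sentence pair is the classic example from the WMD literature; an EMD solver such as pyemd or POT must be installed, depending on the gensim version):

```python
import gensim.downloader as api

# Download small pretrained GloVe vectors via gensim's downloader (~66 MB).
wv = api.load("glove-wiki-gigaword-50")

s1 = "Obama speaks to the media in Illinois".lower().split()
s2 = "The president greets the press in Chicago".lower().split()

# Word Mover's Distance: minimum cumulative cost to "move" the embedded words of
# one sentence onto the other (lower = more similar).
distance = wv.wmdistance(s1, s2)
print(f"WMD = {distance:.4f}")
```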
- **[Cosine Similarity]** : https://www.machinelearningplus.com/nlp/cosine-similarity/
- **[Tutorial on LSI]** : http://poloclub.gatech.edu/cse6242/2018spring/slides/CSE6242-820-TextAlgorithms.pdf
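A small gensim sketch of LSI-based similarity, roughly in the spirit of the toy corpora used in such tutorials (the documents below are illustrative only):

```python
from gensim import corpora, models
from gensim.similarities import MatrixSimilarity

documents = [
    "Human machine interface for lab computer applications",
    "A survey of user opinion of computer system response time",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
]
texts = [doc.lower().split() for doc in documents]

# Build a bag-of-words corpus, then project it into a low-rank LSI space.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

# Query similarity: cosine similarity between documents in the LSI space.
index = MatrixSimilarity(lsi[corpus])
query = dictionary.doc2bow("human computer interaction".lower().split())
print(list(index[lsi[query]]))
```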
- http://robotics.stanford.edu/~scohen/research/emdg/emdg.html#flow_eqw_notopt
- http://robotics.stanford.edu/~rubner/slides/sld014.htm
- **[Document Similarity with Word Mover's Distance]** : http://jxieeducation.com/2016-06-13/Document-Similarity-With-Word-Movers-Distance/
- **[Beyond Cosine > Jensen-Shannon + Hypothesis Test]** : http://stefansavev.com/blog/beyond-cosine-similarity/
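When documents are represented as LDA topic distributions, the Jensen-Shannon distance compares them directly. A minimal SciPy sketch (the topic mixtures below are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Documents as topic distributions (e.g. the output of an LDA model).
doc_a = np.array([0.70, 0.20, 0.05, 0.05])
doc_b = np.array([0.60, 0.25, 0.10, 0.05])
doc_c = np.array([0.05, 0.05, 0.20, 0.70])

# jensenshannon() returns the Jensen-Shannon *distance* (square root of the divergence).
print("a vs b:", jensenshannon(doc_a, doc_b, base=2))  # small: similar topic mixes
print("a vs c:", jensenshannon(doc_a, doc_c, base=2))  # large: very different documents
```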
- **[Great resources with MANY MANY notebooks]** : https://www.renom.jp/index.html?c=tutorial
- **[Optimal Transport, a Swiss Army Knife for Data Science (in French)]**: https://weave.eu/le-transport-optimal-un-couteau-suisse-pour-la-data-science/
- **[BEST TUTORIAL variational-autoencoders]** : https://www.jeremyjordan.me/variational-autoencoders/
- **[BEST TUTORIAL : Earthmover Distance !!!!!]** : https://jeremykun.com/2018/03/05/earthmover-distance/
  - Problem: compute the distance between points with uncertain locations (given by samples, differing observations, or clusters).
- **[Introduction to Wasserstein metric (earth mover’s distance) -> Mathematics]**: https://yoo2080.wordpress.com/2015/04/09/introduction-to-wasserstein-metric-earth-movers-distance/
- **[Word Mover's distance calculation between word pairs of two documents]** : https://stats.stackexchange.com/questions/303050/word-movers-distance-calculation-between-word-pairs-of-two-documents
- [WMD + Word2Vec] : https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb
- **[Books about Optimal Transport]** : https://optimaltransport.github.io/pdf/ComputationalOT.pdf
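For intuition about the earth mover's / Wasserstein distance that underlies WMD, SciPy ships a one-dimensional version. A toy sketch (WMD itself is the multi-dimensional case and needs a full optimal-transport solver such as the POT library):

```python
from scipy.stats import wasserstein_distance

# Two piles of "earth" on a 1-D axis: positions and how much mass sits at each position.
positions_a, weights_a = [0.0, 1.0, 3.0], [0.5, 0.3, 0.2]
positions_b, weights_b = [0.5, 2.0, 3.5], [0.4, 0.4, 0.2]

# Earth mover's distance in one dimension: the minimum total "mass x distance moved"
# needed to reshape distribution A into distribution B.
d = wasserstein_distance(positions_a, positions_b, weights_a, weights_b)
print(f"EMD = {d:.3f}")
```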
- **[NICE !!!!!! How Autoencoders work - Understanding the math and implementation]** : https://www.kaggle.com/shivamb/how-autoencoders-work-intro-and-usecases
- **[Word2Vec to convert each question into a semantic vector then stack a Siamese network to detect if the pair is duplicate]** : http://www.erogol.com/duplicate-question-detection-deep-learning/
- **[Amazing !!!]** : https://github.com/makcedward/nlp
- ***Distance Measurement:***
- Euclidean Distance, Cosine Similarity and Jaccard Similarity
- Edit Distance + Levenshtein Distance
- Word Moving Distance (WMD)
- Supervised Word Moving Distance (S-WMD)
- Manhattan LSTM
- **Text Representation:**
**1. Traditional Method**
- Bag-of-words (BoW)
- Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA)
**2. Character Level**
- Character Embedding
**3. Word Level**
- Negative Sampling and Hierarchical Softmax
- Word2Vec, GloVe, fastText
- Contextualized Word Vectors (CoVe)
- Embeddings from Language Models (ELMo)
- Generative Pre-Training (GPT)
- Contextual String Embeddings
- Self-Governing Neural Networks (SGNN)
- Multi-Task Deep Neural Networks (MT-DNN)
- Generative Pre-Training-2 (GPT-2)
- Universal Language Model Fine-tuning (ULMFiT)
**4. Sentence Level**
- Skip-thoughts
- InferSent
- Quick-Thoughts
- General Purpose Sentence (GenSen)
- Bidirectional Encoder Representations from Transformers (BERT)
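As a small illustration of the edit-distance entries in the distance-measurement list above, a self-contained Levenshtein implementation (standard dynamic programming, not code from that repository):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    # `previous` holds the edit distances between a[:i-1] and every prefix of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```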
# [Zoom] : Google Sentence Encoder
- **[Reference]** : https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html
- **[Waaaaaaaaaaaa]** : https://machinelearningmastery.com/encoder-decoder-deep-learning-models-text-summarization/
![](https://github.com/adsieg/text_similarity/blob/master/pictures/sentence_embedding_macro.png)
![](https://github.com/adsieg/text_similarity/blob/master/pictures/sentence_embedding_micro.png)
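For completeness, a hedged sketch of querying the Universal Sentence Encoder through TensorFlow Hub (the module URL below is the one published on TF Hub at the time of writing; this is not code from the article):

```python
import numpy as np
import tensorflow_hub as hub

# Load the pretrained Universal Sentence Encoder (DAN variant) from TensorFlow Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "How old are you?",
    "What is your age?",
    "The quick brown fox jumps over the lazy dog.",
]
vectors = embed(sentences).numpy()

# The 512-dimensional embeddings are approximately unit-length, so the dot
# product behaves like a cosine similarity score.
print(np.inner(vectors, vectors).round(2))
```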