---
title: "Text Similarity"
author: "AdrienSIEG"
date: "17/04/2019"
output:
  md_document:
    variant: markdown_github
---
# The article:
### https://medium.com/@adriensieg/text-similarities-da019229c894
# Methods studied in the article
- Jaccard Similarity
- Different embeddings + K-means
- Different embeddings + Cosine Similarity
- Word2Vec + Smooth Inverse Frequency + Cosine Similarity
- Different embeddings + LSI + Cosine Similarity
- Different embeddings + LDA + Jensen-Shannon distance
- Different embeddings + Word Mover Distance
- Different embeddings + Variational Auto Encoder (VAE)
- Different embeddings + Universal sentence encoder
- Different embeddings + Siamese Manhattan LSTM
- Knowledge-based Measures
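As a quick starting point, here is a minimal sketch (not from the article) of the two simplest baselines in this list: Jaccard similarity over token sets and cosine similarity over a TF-IDF representation. Any of the embeddings above can be plugged in instead of TF-IDF.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = "The president greets the press in Chicago"
s2 = "Obama speaks to the media in Illinois"

# Jaccard similarity: size of the token-set intersection over the union.
tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
print("Jaccard:", len(tokens1 & tokens2) / len(tokens1 | tokens2))

# Cosine similarity over a TF-IDF representation; any sentence embedding
# (averaged word2vec, GloVe, USE, ...) can replace the TF-IDF vectors.
vectors = TfidfVectorizer().fit_transform([s1, s2])
print("Cosine:", cosine_similarity(vectors[0], vectors[1])[0, 0])
```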
# SOURCES
- **[Incredible !!!]** : https://github.com/nlptown/nlp-notebooks
- An Introduction to Word Embeddings
- Data exploration with sentence similarity
  - Discovering and Visualizing Topics in Texts with LDA (in French!)
- Keras sentiment analysis with Elmo Embeddings
- Multilingual Embeddings - 1. Introduction
- Multilingual Embeddings - 2. Cross-lingual Sentence Similarity
- Multilingual Embeddings - 3. Transfer Learning
- NLP with pretrained models - spaCy and StanfordNLP
- Named Entity Recognition with Conditional Random Fields
- Sequence Labelling with a BiLSTM in PyTorch [sequence labelling tasks such as part-of-speech tagging or named entity recognition]
- Simple Sentence Similarity [Word Mover's Distance + Smooth Inverse Frequency + InferSent + Google Sentence Encoder + Pearson correlation]
- Text classification with BERT in PyTorch
- Text classification with a CNN in PyTorch
- Traditional text classification with Scikit-learn [ELI5]
- Updating spaCy's Named Entity Recognition System
- **[Tutorial WMD in jupyter notebook]** : https://github.com/makcedward/nlp/blob/master/sample/nlp-word_mover_distance.ipynb
- **[Word Mover Distance]** : https://www.kaggle.com/ankitswarnkar/word-embedding-using-glove-vector
- **[lstm-gru-sentiment-analysis]** : https://github.com/javaidnabi31/Word-Embeddding-Sentiment-Classification
- **[ELMo: contextual language embeddings]** : https://towardsdatascience.com/elmo-contextual-language-embedding-335de2268604
- **[Learning Word Embedding (Mathematics)]** https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html
- **[A Beginner's Aha Moments for Word2Vec]** : https://yidatao.github.io/2017-08-03/word2vec-aha/
- **[Glove, Word2Vec, Fastext classes]** : https://github.com/makcedward/nlp/blob/master/sample/nlp-word_embedding.ipynb
- **[!!! Very nice tutorial about how word2vec works]** : https://towardsdatascience.com/word2vec-made-easy-139a31a4b8ae
- **[!!! An implementation guide to Word2Vec using NumPy and Google Sheets]** : https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281
- **[WordRank embedding: “crowned” is most similar to “king”, not word2vec’s “Canute”]** : https://rare-technologies.com/wordrank-embedding-crowned-is-most-similar-to-king-not-word2vecs-canute/
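To experiment with the ideas in the word2vec tutorials above, here is a small gensim training sketch (not from these tutorials; parameter names assume gensim 4.x, where `vector_size` replaced the older `size` argument):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences.
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["the", "press", "reports", "the", "news"],
]

# Train a small skip-gram model on the toy corpus.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("king", topn=3))
print(model.wv.similarity("king", "queen"))
```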
- **[Nice !!! ]**: https://github.com/JacopoMangiavacchi/SwiftNLC/tree/master/ModelNotebooks
- Create Model With GloVe Embedding Bidirectional With Attention
- Create Model With FastText Embedding
- Create Model With GloVe Embedding
- Create Model With GloVe Embedding Bidirectional
- Create Model With NLTK Embedding
- Create Model With NS Linguistic Tagger Embedding
- **[Tutorial from ENSAE]** : http://www.xavierdupre.fr/app/papierstat/helpsphinx/notebooks/text_sentiment_wordvec.html#les-donnees
- **[CoreML with GloVe Word Embedding and Recursive Neural Network - nice tutorial]** : https://medium.com/@JMangia/coreml-with-glove-word-embedding-and-recursive-neural-network-part-2-ab238ca90970
- **[Big Benchmark]** : http://nlp.town/blog/sentence-similarity/
- Average W2V
- Average W2V + Stopwords
- Average W2V + TFIDF
- Average W2V + TFIDF + Stopwords
- Average Glove
- Average Glove + Stopwords
- Average Glove + TFIDF
- Average Glove + TFIDF + Stopwords
- W2V + WMD
- W2V + Stopwords + WMD
- Glove + WMD
- Glove + Stopwords + WMD
- Smooth Inverse Frequency + W2V
- Smooth Inverse Frequency + Glove
- InferSent (INF)
- GSE (Google Sentence Encoder)
**InferSent (INF)** = pre-trained encoder that was developed by Facebook Research. It is a BiLSTM with max pooling, trained on the SNLI dataset, 570k English sentence pairs labelled with one of three categories: entailment, contradiction or neutral.
**GSE (Google Sentence Encoder)** = Google’s answer to Facebook’s InferSent. It comes in two forms:
- an advanced model that takes the element-wise sum of the context-aware word representations produced by the encoding subgraph of a Transformer model
- a simpler Deep Averaging Network (DAN) where input embeddings for words and bigrams are averaged together and passed through a feed-forward deep neural network.
-> the benchmark compares all of these methods using the **Pearson correlation** between their scores and human similarity judgements (see the evaluation sketch below).
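A minimal sketch of that evaluation step with `scipy.stats.pearsonr`; the gold and predicted scores below are placeholders for illustration, not real benchmark output:

```python
from scipy.stats import pearsonr

# Human-annotated similarity scores for a few sentence pairs (e.g. from an STS dataset).
gold = [4.8, 0.5, 3.2, 1.0, 2.7]

# Scores produced by one of the methods above (cosine over averaged vectors, WMD, GSE, ...);
# these numbers are made up for illustration only.
predicted = [0.91, 0.12, 0.66, 0.30, 0.55]

r, p_value = pearsonr(gold, predicted)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```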
- **[How to predict Quora Question Pairs using Siamese Manhattan LSTM]** : https://medium.com/mlreview/implementing-malstm-on-kaggles-quora-question-pairs-competition-8b31b0b16a07
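The core of the MaLSTM described in that article is a shared LSTM encoder whose two sentence encodings are compared with exp(-L1 distance). Here is a hedged Keras sketch of that architecture, with toy dimensions and a plain trainable embedding rather than the pretrained word2vec vectors used in the original:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Toy hyper-parameters; the article initialises the embedding with pretrained word2vec.
vocab_size, embed_dim, max_len, hidden = 20000, 300, 30, 50

left_in = layers.Input(shape=(max_len,), dtype="int32")
right_in = layers.Input(shape=(max_len,), dtype="int32")

embedding = layers.Embedding(vocab_size, embed_dim)  # shared between both branches
encoder = layers.LSTM(hidden)                        # shared LSTM encoder

left_vec = encoder(embedding(left_in))
right_vec = encoder(embedding(right_in))

# MaLSTM similarity: exp(-L1 distance) between the two encodings, in (0, 1].
similarity = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True))
)([left_vec, right_vec])

model = Model(inputs=[left_in, right_in], outputs=similarity)
model.compile(loss="mean_squared_error", optimizer="adam")
model.summary()
```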
- **[Latent Semantic Indexing (LSI) - An Example with mathematics]** : http://www1.se.cuhk.edu.hk/~seem5680/lecture/LSI-Eg.pdf
- **[Finding similar documents with Word2Vec and WMD]** : https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
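With pretrained vectors loaded through gensim, Word Mover's Distance is essentially a one-liner. A minimal sketch (the sentence pair is the classic example from the WMD literature; an EMD solver such as pyemd or POT must be installed, depending on the gensim version):

```python
import gensim.downloader as api

# Download small pretrained GloVe vectors via gensim's downloader (~66 MB).
wv = api.load("glove-wiki-gigaword-50")

s1 = "Obama speaks to the media in Illinois".lower().split()
s2 = "The president greets the press in Chicago".lower().split()

# Word Mover's Distance: minimum cumulative cost to "move" the embedded words of
# one sentence onto the other (lower = more similar).
distance = wv.wmdistance(s1, s2)
print(f"WMD = {distance:.4f}")
```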
- **[Cosine Similarity]** : https://www.machinelearningplus.com/nlp/cosine-similarity/
- **[Tutorial on LSI]** : http://poloclub.gatech.edu/cse6242/2018spring/slides/CSE6242-820-TextAlgorithms.pdf
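A small gensim sketch of LSI-based similarity, roughly in the spirit of the toy corpora used in such tutorials (the documents below are illustrative only):

```python
from gensim import corpora, models
from gensim.similarities import MatrixSimilarity

documents = [
    "Human machine interface for lab computer applications",
    "A survey of user opinion of computer system response time",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
]
texts = [doc.lower().split() for doc in documents]

# Build a bag-of-words corpus, then project it into a low-rank LSI space.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

# Query similarity: cosine similarity between documents in the LSI space.
index = MatrixSimilarity(lsi[corpus])
query = dictionary.doc2bow("human computer interaction".lower().split())
print(list(index[lsi[query]]))
```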
- http://robotics.stanford.edu/~scohen/research/emdg/emdg.html#flow_eqw_notopt
- http://robotics.stanford.edu/~rubner/slides/sld014.htm
- **[Document Similarity with Word Mover's Distance]** : http://jxieeducation.com/2016-06-13/Document-Similarity-With-Word-Movers-Distance/
- **[Beyond Cosine > Jensen-Shannon + Hypothesis Test]** : http://stefansavev.com/blog/beyond-cosine-similarity/
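When documents are represented as LDA topic distributions, the Jensen-Shannon distance compares them directly. A minimal SciPy sketch (the topic mixtures below are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Documents as topic distributions (e.g. the output of an LDA model).
doc_a = np.array([0.70, 0.20, 0.05, 0.05])
doc_b = np.array([0.60, 0.25, 0.10, 0.05])
doc_c = np.array([0.05, 0.05, 0.20, 0.70])

# jensenshannon() returns the Jensen-Shannon *distance* (square root of the divergence).
print("a vs b:", jensenshannon(doc_a, doc_b, base=2))  # small: similar topic mixes
print("a vs c:", jensenshannon(doc_a, doc_c, base=2))  # large: very different documents
```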
- **[Great resources with MANY MANY notebooks]** : https://www.renom.jp/index.html?c=tutorial
- **[Optimal Transport, a Swiss Army Knife for Data Science (in French)]**: https://weave.eu/le-transport-optimal-un-couteau-suisse-pour-la-data-science/
- **[BEST TUTORIAL variational-autoencoders]** : https://www.jeremyjordan.me/variational-autoencoders/
- **[BEST TUTORIAL : Earthmover Distance !!!!!]** : https://jeremykun.com/2018/03/05/earthmover-distance/
  - Problem: compute the distance between points with uncertain locations (given by samples, differing observations, or clusters).
- **[Introduction to Wasserstein metric (earth mover’s distance) -> Mathematics]**: https://yoo2080.wordpress.com/2015/04/09/introduction-to-wasserstein-metric-earth-movers-distance/
- **[Word Mover's distance calculation between word pairs of two documents]** : https://stats.stackexchange.com/questions/303050/word-movers-distance-calculation-between-word-pairs-of-two-documents
- [WMD + Word2Vec] : https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb
- **[Books about Optimal Transport]** : https://optimaltransport.github.io/pdf/ComputationalOT.pdf
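For intuition about the earth mover's / Wasserstein distance that underlies WMD, SciPy ships a one-dimensional version. A toy sketch (WMD itself is the multi-dimensional case and needs a full optimal-transport solver such as the POT library):

```python
from scipy.stats import wasserstein_distance

# Two piles of "earth" on a 1-D axis: positions and how much mass sits at each position.
positions_a, weights_a = [0.0, 1.0, 3.0], [0.5, 0.3, 0.2]
positions_b, weights_b = [0.5, 2.0, 3.5], [0.4, 0.4, 0.2]

# Earth mover's distance in one dimension: the minimum total "mass x distance moved"
# needed to reshape distribution A into distribution B.
d = wasserstein_distance(positions_a, positions_b, weights_a, weights_b)
print(f"EMD = {d:.3f}")
```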
- **[NICE !!!!!! How Autoencoders work - Understanding the math and implementation]** : https://www.kaggle.com/shivamb/how-autoencoders-work-intro-and-usecases
- **[Word2Vec to convert each question into a semantic vector then stack a Siamese network to detect if the pair is duplicate]** : http://www.erogol.com/duplicate-question-detection-deep-learning/
- **[Amazing !!!]** : https://github.com/makcedward/nlp
- ***Distance Measurement:***
- Euclidean Distance, Cosine Similarity and Jaccard Similarity
- Edit Distance + Levenshtein Distance
- Word Moving Distance (WMD)
- Supervised Word Moving Distance (S-WMD)
- Manhattan LSTM
- **Text Representation:**
**1. Traditional Method**
- Bag-of-words (BoW)
- Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA)
**2. Character Level**
- Character Embedding
**3. Word Level**
- Negative Sampling and Hierarchical Softmax
- Word2Vec, GloVe, fastText
- Contextualized Word Vectors (CoVe)
- Embeddings from Language Models (ELMo)
- Generative Pre-Training (GPT)
- Contextual String Embeddings
- Self-Governing Neural Networks (SGNN)
- Multi-Task Deep Neural Networks (MT-DNN)
- Generative Pre-Training-2 (GPT-2)
- Universal Language Model Fine-tuning (ULMFiT)
**4. Sentence Level**
- Skip-thoughts
- InferSent
- Quick-Thoughts
- General Purpose Sentence (GenSen)
- Bidirectional Encoder Representations from Transformers (BERT)
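As a small illustration of the edit-distance entries in the distance-measurement list above, a self-contained Levenshtein implementation (standard dynamic programming, not code from that repository):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    # `previous` holds the edit distances between a[:i-1] and every prefix of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```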
# [Zoom] : Google Sentence Encoder
- **[Reference]** : https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html
- **[Waaaaaaaaaaaa]** : https://machinelearningmastery.com/encoder-decoder-deep-learning-models-text-summarization/
![](https://github.com/adsieg/text_similarity/blob/master/pictures/sentence_embedding_macro.png)
![](https://github.com/adsieg/text_similarity/blob/master/pictures/sentence_embedding_micro.png)
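For completeness, a hedged sketch of querying the Universal Sentence Encoder through TensorFlow Hub (the module URL below is the one published on TF Hub at the time of writing; this is not code from the article):

```python
import numpy as np
import tensorflow_hub as hub

# Load the pretrained Universal Sentence Encoder (DAN variant) from TensorFlow Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "How old are you?",
    "What is your age?",
    "The quick brown fox jumps over the lazy dog.",
]
vectors = embed(sentences).numpy()

# The 512-dimensional embeddings are approximately unit-length, so the dot
# product behaves like a cosine similarity score.
print(np.inner(vectors, vectors).round(2))
```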