A demo project for playing with word embeddings and emoji using Twitter data. Let's see how we can build a smarter emoji predictor.
Believe it or not, standard Unicode defines 1800+ emoji. How do you find the right one? Traditionally, we search keywords in the emoji descriptions. Here we use word embeddings instead, to find the best match for a given context. The results reflect real user habits from social media. Now you are guided by the most knowledgeable emoji master :)
Check the site http://nlpfun.com/emoji for a preview of what we can do next with the model!
The zip file in the data folder contains 1M sentences with emoji, collected from Twitter around January 2017. It is a randomly selected subset of a much larger corpus. Unzip corpus.txt directly into the data folder for training.
First, install the dependencies (gensim, under Python 3.5):

pip install -r requirements.txt

Then run train.py. Don't forget to set appropriate parameters such as the vector size, window size, etc. It will dump a model file and a raw text file of the embeddings.
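The training step boils down to a few lines of gensim; here is a minimal sketch, assuming corpus.txt holds one tokenized sentence per line (the output file names are illustrative, and older gensim 3.x calls the `vector_size` argument `size`):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the corpus from disk, one whitespace-tokenized sentence per line.
sentences = LineSentence('data/corpus.txt')

# Train the embeddings; tune vector_size and window for your data.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)

model.save('emoji_word2vec.model')                  # full model, for later reuse
model.wv.save_word2vec_format('emoji_vectors.txt')  # raw text embedding file
```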
Let's run model.py and look at the console output. If you have ever played with word2vec before, you know the answer to the famous analogy 'king' - 'man' + 'woman' = ?. Let's check the results with our simple Twitter data.
query + ['king', 'woman'] - ['man']
[('queen', 0.73), ('crown', 0.7), ('goddess', 0.7), ('princess', 0.7), ('actress', 0.69)]
query + ['china', 'tokyo'] - ['beijing']
[('japan', 0.73), ('theatre', 0.69), ('europe', 0.68), ('nyc', 0.67), ('nuttiness', 0.67)]
query + ['dog', 'cats'] - ['dogs']
[('cat', 0.92), ('kitten', 0.75), ('puppy', 0.73), ('coworker', 0.71), ('mom', 0.69)]
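Queries like the ones above can be reproduced with gensim's `most_similar`; a minimal sketch, assuming the model file name used in the training sketch:

```python
from gensim.models import Word2Vec

model = Word2Vec.load('emoji_word2vec.model')  # illustrative file name

def query(positive, negative=()):
    """Nearest neighbors of sum(positive) - sum(negative) in vector space."""
    return model.wv.most_similar(positive=list(positive),
                                 negative=list(negative), topn=5)

print(query(['king', 'woman'], ['man']))  # 'queen' should rank near the top
```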
Finally, let's see how well we can find emoji by keywords!
query + ['cat'] - []
[('🐱', 0.66), ('🐶', 0.5), ('🐈', 0.47), ('🐰', 0.42), ('🐕', 0.4), ('🐭', 0.39)]
query + ['happy', 'new', 'year'] - []
[('🎉', 0.56), ('🎄', 0.38), ('🏆', 0.38), ('🍾', 0.37), ('🎁', 0.36)]
query + ['king', 'woman'] - ['man']
[('👑', 0.63), ('👸', 0.59), ('🦁', 0.56)]
query + ['💗'] - []
[('💓', 0.94), ('💜', 0.92), ('💖', 0.92), ('💘', 0.91), ('💞', 0.91), ('💕', 0.91), ('💛', 0.81), ('💙', 0.81), ('💟', 0.79), ('💝', 0.72)]
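model.py's exact emoji filtering isn't shown here, but one simple way to keep only emoji among the nearest neighbors is to filter by Unicode category; a heuristic sketch building on the `query` helper above:

```python
import unicodedata

def is_emoji(token):
    # Heuristic: a single code point in Unicode category 'So' (Symbol, other);
    # this covers most common emoji but is not exhaustive.
    return len(token) == 1 and unicodedata.category(token) == 'So'

def query_emoji(positive, negative=(), topn=10):
    # Over-fetch neighbors, then keep only the emoji tokens.
    neighbors = model.wv.most_similar(positive=list(positive),
                                      negative=list(negative), topn=200)
    return [(tok, round(sim, 2)) for tok, sim in neighbors if is_emoji(tok)][:topn]

print(query_emoji(['cat']))  # e.g. 🐱 and friends near the top
```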
To better understand what's going on inside, here are some recommended readings:
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
http://web.stanford.edu/class/cs224n/assignment1/index.html
If you want to put this into your own project, it is not hard to write a version in the language you prefer. The C++ demo loads the word embeddings from the compressed file and works the same as the Python version.
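For reuse in Python, the raw text file dumped by train.py can be loaded on its own, without any training state; a short sketch, assuming the illustrative file name from the training sketch:

```python
from gensim.models import KeyedVectors

# Load just the vectors from the word2vec text-format dump.
wv = KeyedVectors.load_word2vec_format('emoji_vectors.txt')
print(wv.most_similar(positive=['cat'], topn=5))
```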