# -*- coding: utf-8 -*-
# ---
# jupyter:
# jupytext:
# formats: ipynb,py:percent
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.11.4
# kernelspec:
# display_name: Python 3 (ipykernel)
# language: python
# name: python3
# ---
# %% [markdown]
# <div class="alert" style="background-color:#fff; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:15px 15px; color:#5d3a8e; font-size:40px'> Topic Modeling with Gensim (Python)</h1>
# </div>
# https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
# %% [markdown] id="t0GCxAfUJMEl"
# Topic Modeling is a technique to extract the hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling with an excellent implementation in Python's Gensim package. The challenge, however, is *how to extract topics that are clear, segregated and meaningful.* This depends heavily on the quality of text preprocessing and on the strategy for finding the optimal number of topics. This tutorial attempts to tackle both of these problems.
# %% [markdown] id="SLRV_Vnq7XTv"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> Content</h2>
# </div>
#
# 1. Introduction
# 2. Prerequisites – Download nltk stopwords and spacy model
# 3. Import Packages
# 4. What does LDA do?
# 5. Prepare Stopwords
# 6. Import Newsgroups Data
# 7. Remove emails and newline characters
# 8. Tokenize words and Clean-up text
# 9. Creating Bigram and Trigram Models
# 10. Remove Stopwords, Make Bigrams and Lemmatize
# 11. Create the Dictionary and Corpus needed for Topic Modeling
# 12. Building the Topic Model
# 13. View the topics in LDA model
# 14. Compute Model Perplexity and Coherence Score
# 15. Visualize the topics-keywords
# 16. Building LDA Mallet Model
# 17. How to find the optimal number of topics for LDA?
# 18. Finding the dominant topic in each sentence
# 19. Find the most representative document for each topic
# 20. Topic distribution across documents
# %% [markdown] id="rhXY8GzxJMEu"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 1. Introduction</h2>
# </div>
# %% [markdown] id="Olq16KXFJMEv"
# One of the primary applications of natural language processing is to automatically extract the topics people are discussing from large volumes of text. Examples of such large text collections include social media feeds, customer reviews of hotels and movies, user feedback, news stories, and customer-complaint e-mails.
#
# Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators and political campaigns. But it is really hard to read through such large volumes manually and compile the topics.
#
# What is needed, therefore, is an automated algorithm that can read through the text documents and output the topics discussed.
#
# In this tutorial, we will take a real example of the ’20 Newsgroups’ dataset and use LDA to extract the naturally discussed topics.
#
# I will be using Gensim's implementation of Latent Dirichlet Allocation (LDA) along with Mallet's implementation (accessed via Gensim). Mallet's implementation of LDA is efficient: it is known to run faster and to give better topic segregation.
#
# We will also extract the volume and percentage contribution of each topic to get an idea of how important a topic is.
#
# Let’s begin!
# %% [markdown] id="g0uCG78TJMEv"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 2. Prerequisites – Download nltk stopwords and spacy model</h2>
# </div>
#
# We will need the stopwords from NLTK and spacy’s en model for text pre-processing. Later, we will be using the spacy model for lemmatization.
#
# Lemmatization is nothing but converting a word to its root word. For example: the lemma of the word ‘machines’ is ‘machine’. Likewise, ‘walking’ –> ‘walk’, ‘mice’ –> ‘mouse’ and so on.
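# %% [markdown]
# As a quick illustration (a minimal sketch, assuming the spaCy English model `en_core_web_sm` is already installed as described in this tutorial), spaCy exposes each token's lemma via `token.lemma_`:
# %%
import spacy
nlp_demo = spacy.load('en_core_web_sm')  # assumes the small English model is installed
print([(tok.text, tok.lemma_) for tok in nlp_demo("The machines were walking past the mice")])
# e.g. machines -> machine, were -> be, walking -> walk, mice -> mouse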
# %% colab={"base_uri": "https://localhost:8080/"} id="t5T9ylYgLEEv" outputId="5fefff2a-4f36-402a-8da0-aa86c0e5d55f"
# Run in python console
import nltk; nltk.download('stopwords')
# Run in terminal or command prompt
#python3 -m spacy download en
# %% [markdown] id="vTOmOEW0JMEw"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 3. Import Packages</h2>
# </div>
#
# The core packages used in this tutorial are `re`, `gensim`, `spacy` and `pyLDAvis`. Besides these, we will also be using `matplotlib`, `numpy` and `pandas` for data handling and visualization. Let's import them.
# %%
# pip install pyLDAvis
# pip install gensim
# pip install spacy==2.2.0
# %% id="2bHmxbxeJMEx"
import re
import numpy as np
import pandas as pd
from pprint import pprint
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# spacy for lemmatization
import spacy
# Plotting tools
import pyLDAvis
# import pyLDAvis.gensim  # for older pyLDAvis releases
import pyLDAvis.gensim_models  # don't skip this
import matplotlib.pyplot as plt
# %matplotlib inline
# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
# %% [markdown] id="AQEEaKhVJMEx"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 4. What does LDA do?</h2>
# </div>
# %% [markdown] id="EqXiW_NrJMEy"
# LDA’s approach to topic modeling is to treat each document as a collection of topics in certain proportions, and each topic as a collection of keywords, again in certain proportions.
#
# Once you provide the algorithm with the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution (a toy example at the end of this section makes this concrete).
#
# When I say topic, what is it actually and how is it represented?
#
# A topic is nothing but a collection of dominant keywords that are typical representatives. Just by looking at the keywords, you can identify what the topic is all about.
#
# The following are the key factors in obtaining well-segregated topics:
#
# 1. The quality of text processing.
# 2. The variety of topics the text talks about.
# 3. The choice of topic modeling algorithm.
# 4. The number of topics fed to the algorithm.
# 5. The algorithm's tuning parameters.
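# %% [markdown]
# To make the "document as a mixture of topics" idea concrete, here is a minimal sketch on a toy, made-up corpus (four tiny documents, two topics). It only illustrates the shape of LDA's output; the real model is built later in this tutorial.
# %%
# Toy example (made-up data): each document becomes a mixture of 2 topics,
# and each topic is a weighted mixture of keywords.
toy_texts = [
    ['car', 'engine', 'oil', 'speed'],
    ['engine', 'oil', 'leak', 'repair'],
    ['faith', 'believe', 'religion', 'god'],
    ['god', 'religion', 'evidence', 'believe'],
]
toy_dict = corpora.Dictionary(toy_texts)
toy_corpus = [toy_dict.doc2bow(text) for text in toy_texts]
toy_lda = gensim.models.ldamodel.LdaModel(toy_corpus, num_topics=2, id2word=toy_dict,
                                          random_state=0, passes=10)
print(toy_lda.get_document_topics(toy_corpus[0]))  # document as a mixture of topics, e.g. [(0, 0.9...), (1, 0.0...)]
print(toy_lda.print_topics())                      # each topic as a weighted mixture of keywords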
# %% [markdown] id="LU8bySG4JMEy"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 5. Prepare Stopwords</h2>
# </div>
#
#
# We have already downloaded the stopwords. Let's import them and make them available in `stop_words`.
# %% id="kxK_dgu1JMEz"
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
# %% [markdown] id="3gjPwI-UJME0"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 6. Import Newsgroups Data</h2>
# </div>
# %% [markdown] id="1N8yzen0_B_4"
# We will be using the 20-Newsgroups dataset for this exercise. This version of the dataset contains about 11k newsgroups posts from 20 different topics. This is available as [newsgroups.json](https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json).
#
# This is imported using `pandas.read_json` and the resulting dataset has 3 columns as shown.
# %% colab={"base_uri": "https://localhost:8080/", "height": 296} id="5LK_zPh8ZzXn" outputId="7d505b12-30d0-49df-ac79-b5d5ffa45a67"
# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()
# %% [markdown] id="_Pvuv-EDJME5"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 7. Remove emails and newline characters</h2>
# </div>
#
# As you can see, there are many e-mail addresses, newline characters and extra spaces that are quite distracting. Let's get rid of them using [regular expressions](https://www.machinelearningplus.com/python/python-regex-tutorial-examples/).
# %% colab={"base_uri": "https://localhost:8080/"} id="H9UFTq4GJME5" outputId="f85bd407-2e06-4416-f0da-4a62dbebc92c"
# Convert to list
data = df.content.values.tolist()
# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Remove newline characters and extra whitespace
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove distracting single quotes
data = [re.sub(r"\'", " ", sent) for sent in data]
pprint(data[:2])
# %% [markdown] id="GMWsrdMYJME6"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 8. Tokenize words and Clean-up text </h2>
# </div>
#
# The sentences look better now, but you want to tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether.
#
# Gensim's `simple_preprocess()` is great for this. Additionally, I have set `deacc=True` to remove accent marks (`simple_preprocess` already strips punctuation while tokenizing).
# %% colab={"base_uri": "https://localhost:8080/", "height": 212} id="hEawaeOJJME6" outputId="932f8566-a89c-4c10-a074-1f1fec6840e9"
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes accent marks

data_words = list(sent_to_words(data))
print(data_words[:1])
# %% [markdown] id="ZSg33gKvJME6"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 9. Create Bigram and Trigram Models</h2>
# </div>
#
# Bigrams are two words frequently occurring together in the document. Trigrams are three words frequently occurring together.
#
# Some examples from our corpus are: 'front bumper', 'oil leak', 'maryland college park', etc.
#
# Gensim's `Phrases` model can build and implement bigrams, trigrams, quadgrams and more. The two important arguments to `Phrases` are `min_count` and `threshold`: the higher their values, the harder it is for words to be combined into bigrams.
# %% id="yrwK3tkBJME7"
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])
# %% [markdown] id="mSKmvrrMJME7"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 10. Remove Stopwords, Make Bigrams and Lemmatize</h2>
# </div>
#
#
# The bigram model is ready. Let's define functions to remove stopwords, make bigrams and lemmatize, and then call them sequentially.
# %% id="pIAmYQLkJME7"
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
# %% [markdown] id="9vD4jR4UBygH"
# Let’s call the functions in order.
# %%
# pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
# %% id="MGC-N8_EJME7"
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)
# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)
# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])
# %% [markdown] id="R7uMrXUpJME8"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 11. Create the Dictionary and Corpus needed for Topic Modeling</h2>
# </div>
#
# The two main inputs to the LDA topic model are the dictionary (`id2word`) and the corpus. Let's create them.
# %% id="DqlOgG27JME8"
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)
# Create Corpus
texts = data_lemmatized
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1])
# %% [markdown] id="LjFNNXzAJME9"
# Gensim creates a unique id for each word in the document. The produced corpus shown above is a mapping of (word_id, word_frequency).
#
# For example, (0, 1) above implies that word id 0 occurs once in the first document. Likewise, word id 1 occurs twice, and so on.
#
# This is used as the input by the LDA model.
#
# %% [markdown] id="vl0gK6hEJME-"
# If you want to see what word a given id corresponds to, pass the id as a key to the dictionary.
# %% id="RLtyx0RpJME-"
id2word[0]
# %% [markdown] id="rUb3j6Q3JME_"
# Or, you can see a human readable form of the corpus itself.
# %% id="02ezXeDuJME_"
corpus[:1][0][:10]
# %% id="U6ZzD5UiJME_"
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
# %% [markdown] id="tpqMJzvFJMFA"
# Alright, without digressing further let's jump back on track with the next step: Building the topic model.
# %% [markdown] id="DAuJo71iJMFA"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 12. Building the Topic Model</h2>
# </div>
# %% [markdown] id="rJ0KiNu8JMFA"
# We have everything required to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well.
#
# Apart from that, `alpha` and `eta` are hyperparameters that affect the sparsity of the topics. According to the Gensim docs, both default to a 1.0/num_topics prior.
#
# `chunksize` is the number of documents to be used in each training chunk. `update_every` determines how often the model parameters should be updated and `passes` is the total number of training passes.
# %% id="VAPlP5krJMFA"
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
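# %% [markdown]
# With `alpha='auto'` the model learns an asymmetric document-topic prior during training. As an optional check (a small sketch, not part of the original workflow), you can inspect the learned values; `lda_model.alpha` is an array with one entry per topic.
# %%
# Optional: inspect the learned asymmetric alpha prior (one value per topic)
print(lda_model.alpha)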
# %% [markdown] id="YVvKis8MJMFA"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 13. View the topics in LDA model</h2>
# </div>
#
# The above LDA model is built with 20 different topics, where each topic is a combination of keywords and each keyword contributes a certain weightage to the topic.
#
# You can see the keywords for each topic and the weightage (importance) of each keyword using `lda_model.print_topics()` as shown next.
# %% id="_xrMa9_nJMFB"
# Print the keywords for the 20 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
# %% [markdown] id="y9ZOlWfRJMFB"
# How to interpret this?
#
# Topic 0 is represented as 0.040*"evidence" + 0.030*"believe" + 0.030*"reason" + 0.027*"claim" + 0.021*"sense" + 0.021*"say" + 0.019*"faith" + 0.019*"exist" + 0.014*"people" + 0.014*"science".
#
# It means the top 10 keywords that contribute to this topic are 'evidence', 'believe', 'reason' and so on, and the weight of 'evidence' in topic 0 is 0.040.
#
# The weights reflect how important a keyword is to that topic.
#
# Looking at these keywords, can you guess what this topic could be? You might summarise it as 'religion' or 'belief'.
#
# Likewise, can you go through the remaining topic keywords and judge what each topic is?
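# %% [markdown]
# The same keyword weights can also be read programmatically rather than parsed from the printed string (a minimal sketch using the model built above): `show_topic()` returns (keyword, weight) pairs for a given topic.
# %%
# Top 10 keywords of topic 0 as (word, weight) pairs
pprint(lda_model.show_topic(0, topn=10))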
# %% [markdown] id="EgvO-m-H5DH3"
# 
#
# Inferring Topic from Keywords
# %% [markdown] id="efQl15CYJMFB"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 14. Compute Model Perplexity and Coherence Score</h2>
# </div>
#
# %% [markdown] id="41dafJVjJMFC"
# Model perplexity and [topic coherence](https://rare-technologies.com/what-is-topic-coherence/) provide convenient measures to judge how good a given topic model is. In my experience, the topic coherence score, in particular, has been more helpful.
# %% id="wuud9M0VJMFC"
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus)) # a measure of how good the model is. lower the better.
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
# %% [markdown] id="987_oCkWJMFC"
# There you have a coherence score of 0.44.
# %% [markdown] id="UqNrGJbJJMFC"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 15. Visualize the topics-keywords</h2>
# </div>
#
# Now that the LDA model is built, the next step is to examine the produced topics and the associated keywords. There is no better tool than the pyLDAvis package's interactive chart, which is designed to work well with Jupyter notebooks.
# %% id="VwiiQ6bEJMFD"
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis
# %% [markdown] id="lvtNAgXwJMFD"
# So how do we interpret pyLDAvis’s output?
#
# Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent that topic is.
#
# A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.
#
# A model with too many topics will typically have many overlapping, small-sized bubbles clustered in one region of the chart.
#
# Alright, if you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.
#
# We have successfully built a good-looking topic model.
#
# Given our prior knowledge of the number of natural topics in the document, finding the best model was fairly straightforward.
#
# Up next, we will improve upon this model by using Mallet’s version of the LDA algorithm, and then we will focus on how to arrive at the optimal number of topics given any large corpus of text.
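# %% [markdown]
# If you want to view or share the chart outside of a Jupyter notebook, pyLDAvis can also write it to a standalone HTML file (a small sketch; the filename `lda_vis.html` is arbitrary):
# %%
# Save the interactive visualization as a standalone HTML page
pyLDAvis.save_html(vis, 'lda_vis.html')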
# %% [markdown] id="wKDDvZbFJMFE"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 16. Building LDA Mallet Model</h2>
# </div>
#
# So far you have seen Gensim’s built-in version of the LDA algorithm. Mallet’s version, however, often gives better-quality topics.
#
# Gensim provides a wrapper to run Mallet's LDA from within Gensim itself. You only need to [download](http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip) the zipfile, unzip it and provide the path to mallet inside the unzipped directory to `gensim.models.wrappers.LdaMallet`. Note that the `gensim.models.wrappers` module was removed in Gensim 4.0, so this section assumes a Gensim 3.x release (and a Java installation, which Mallet requires). See how I have done this below.
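# %% [markdown]
# A quick sanity check before running the wrapper (a small sketch): `gensim.models.wrappers` only exists in Gensim 3.x releases, so confirm the installed version first.
# %%
# The Mallet wrapper below assumes a Gensim 3.x release (e.g. 3.8.3)
print(gensim.__version__)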
# %%
import os
os.environ.update({'MALLET_HOME':r'C:/Users/Lenovo/Desktop/mallet-2.0.8'})
# %% id="axMNiS5wJMFE"
## Download File: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
mallet_path = 'C:/Users/Lenovo/Desktop/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)
# %% id="vkmCE0EzJMFF"
# Show Topics
pprint(ldamallet.show_topics(formatted=False))
# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
# %% [markdown] id="9VklPkhfJMFG"
# Just by changing the LDA algorithm, we increased the coherence score from .448 to .54. Not bad!
# %% [markdown] id="QbLZKoXbJMFG"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 17. How to find the optimal number of topics for LDA?</h2>
# </div>
# %% [markdown] id="PFtNpYa_JMFG"
# My approach to finding the optimal number of topics is to build many LDA models with different values of number of topics (k) and pick the one that gives the highest coherence value.
#
# Choosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics.
#
# If you see the same keywords being repeated in multiple topics, it’s probably a sign that the ‘k’ is too large.
#
# The `compute_coherence_values()` function (see below) trains multiple LDA models and returns them along with their corresponding coherence scores.
# %% id="uMFCxJkwJMFG"
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values
# %% id="D0-aktieJMFG"
# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=40, step=6)
# %% id="RL2BqYroJMFH"
# Show graph
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values"), loc='best')
plt.show()
# %% id="SRYNvTVRJMFH"
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
# %% [markdown] id="JYI58TziJMFI"
# If the coherence score seems to keep increasing, it may make better sense to pick the model that gave the highest coherence value before flattening out. This is exactly the case here.
#
# So, for the further steps, I will choose the model with 20 topics (index 3 in `model_list`, since the number of topics runs over 2, 8, 14, 20, ...).
# %% id="JS53wUsOJMFI"
# Select the model and print the topics
optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
# %% [markdown] id="aqEb9HL-JMFI"
# Those were the topics for the chosen LDA model.
# %% [markdown] id="8elTXI8hJMFJ"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 18. Finding the dominant topic in each sentence</h2>
# </div>
#
# One of the practical applications of topic modeling is to determine what topic a given document is about.
#
# To find that, we find the topic number that has the highest percentage contribution in that document.
#
# The `format_topics_sentences()` function below nicely aggregates this information in a presentable table.
# %% id="2va-5pS2JMFJ"
def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
# Show
df_dominant_topic.head(10)
# %% [markdown] id="pRwcvOu2JMFK"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 19. Find the most representative document for each topic</h2>
# </div>
#
# Sometimes just the topic keywords may not be enough to make sense of what a topic is about. So, to help with understanding the topic, you can find the documents a given topic has contributed to the most and infer the topic by reading that document. Whew!!
#
#
# %% id="QsU-DC9_JMFK"
# Keep the single most representative document for each topic
sent_topics_sorteddf_mallet = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet,
                                             grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)],
                                            axis=0)
# Reset Index
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)
# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]
# Show
sent_topics_sorteddf_mallet.head()
# %% [markdown] id="SAsfgPE4JMFK"
# The tabular output above has 20 rows, one for each topic. It shows the topic number, the keywords and the most representative document. The `Perc_Contribution` column is simply the percentage contribution of the topic in the given document.
# %% [markdown] id="DuzxtBiAJMFL"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 20. Topic distribution across documents</h2>
# </div>
#
# Finally, we want to understand the volume and distribution of topics in order to judge how widely each topic was discussed. The table below exposes that information.
# %% id="0cFm7x2wJMFL"
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# Topic Number and Keywords (one row per topic, indexed by topic number so the concat below aligns)
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']].drop_duplicates('Dominant_Topic').set_index('Dominant_Topic')

# Concatenate Column wise (aligned on the topic number)
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1).reset_index()

# Change Column names
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Show
df_dominant_topics.head(20)
# %% [markdown] id="KLinOkXaJMFM"
# <div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 21. Conclusion</h2>
# </div>
#
# We started with understanding what topic modeling can do. We built a basic topic model using Gensim’s LDA and visualized the topics using pyLDAvis. Then we built Mallet’s LDA implementation. You saw how to find the optimal number of topics using coherence scores and how to arrive at a logical choice of the optimal model.
#
# Finally, we saw how to aggregate and present the results to generate insights that may be more actionable.
#
# Hope you enjoyed reading this. I would appreciate it if you leave your thoughts in the comments section below.
#
# Edit: I see some of you are experiencing errors while using the LDA Mallet and I don’t have a solution for some of the issues. So, I’ve implemented a workaround and more useful [topic model visualizations](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/). Hope you will find it helpful.
# %% [markdown]
# ### All Rights Reserved. This notebook is proprietary content of machinelearningplus.com. This can be shared solely for educational purposes, with due credits to machinelearningplus.com