Commit b7bd0cc

committed
updates
1 parent 547bb26 commit b7bd0cc

2 files changed

+1702
-1841
lines changed

loading_corpora.ipynb

+40-40
Original file line number | Diff line number | Diff line change
@@ -234,7 +234,7 @@
234234
"source": [
235235
"It seems messy, but nothing we can't clean. This basic method replaces some of the issues with the formatting, and prints the errors if any for debugging. Let us clean one of the raw text files. \n",
236236
"\n",
237-
"Note: we skip any text data which isn't utf-8 encoded here. I do this to keep things clean; you might want more data and not have that in."
237+
"Note: we skip any text data which isn't utf-8 encoded here. I do this to keep things clean; you might want more data or anticipate special characters and not include that restriction."
238238
]
239239
},
240240
{
@@ -273,7 +273,7 @@
273273
"cell_type": "markdown",
274274
"metadata": {},
275275
"source": [
276-
"Nice. This is looking a lot cleaner. We can now run some of our lucem_illud text cleaning methods we talked about in week 4. "
276+
"Nice. This is looking a lot cleaner. We can now run some of our lucem_illud text cleaning methods we discuss/model in week 4. "
277277
]
278278
},
279279
{
@@ -2329,7 +2329,7 @@
23292329
"cell_type": "markdown",
23302330
"metadata": {},
23312331
"source": [
2332-
"Great - now let us create a dataframe with the movie names, the raw words, the tokenized words, and so on.\n",
2332+
"Great! Now let us create a Pandas dataframe with movie names, raw words, tokenized words, and so on.\n",
23332333
"The file \"sources_movies.zip\" has this information. Similar information files are found for the other datasets too, in their respective folders."
23342334
]
23352335
},
@@ -2404,7 +2404,7 @@
24042404
"\n",
24052405
"First, let us create a dictionary mapping file-id to all the text. Each movie will be mapped to a list of the tokenized words.\n",
24062406
"\n",
2407-
"In this example, I only use it to load 1000 movies. You can comment this out or increase/decrease the number as you see fit."
2407+
"In this example, I only use it to load 1000 movies. You can comment this out or increase/decrease the number as inspired."
24082408
]
24092409
},
24102410
{
@@ -2504,63 +2504,63 @@
25042504
" </thead>\n",
25052505
" <tbody>\n",
25062506
" <tr>\n",
2507-
" <th>3603861</th>\n",
2508-
" <td>Another Fine Mess</td>\n",
2509-
" <td>Comedy, Short</td>\n",
2510-
" <td>1930</td>\n",
2507+
" <th>6861982</th>\n",
2508+
" <td>Blonde Crazy</td>\n",
2509+
" <td>Comedy, Crime, Drama</td>\n",
2510+
" <td>1931</td>\n",
25112511
" <td>English</td>\n",
2512-
" <td>[Dear, ladies, and, gentlemen, Hal, Roach, pre...</td>\n",
2512+
" <td>[Who, cares, for, starlit, skies, when, you, '...</td>\n",
25132513
" </tr>\n",
25142514
" <tr>\n",
2515-
" <th>6421562</th>\n",
2516-
" <td>East of Shanghai</td>\n",
2517-
" <td>Comedy, Romance, Thriller</td>\n",
2515+
" <th>6606107</th>\n",
2516+
" <td>Five and Ten</td>\n",
2517+
" <td>Drama, Romance</td>\n",
25182518
" <td>1931</td>\n",
25192519
" <td>English</td>\n",
2520-
" <td>[Hello, Em, Hello, Fred, I, think, you, 'll, l...</td>\n",
2520+
" <td>[Subtitles, Lu, s, Filipe, Bernardes, Mr, Rari...</td>\n",
25212521
" </tr>\n",
25222522
" <tr>\n",
2523-
" <th>3130930</th>\n",
2524-
" <td>The Skin Game</td>\n",
2525-
" <td>Drama</td>\n",
2523+
" <th>6406611</th>\n",
2524+
" <td>Five Star Final</td>\n",
2525+
" <td>Crime, Drama</td>\n",
25262526
" <td>1931</td>\n",
25272527
" <td>English</td>\n",
2528-
" <td>[Captioning, made, possible, by, lions, gate, ...</td>\n",
2528+
" <td>[Extra, Extra, Extra, Five, star, final, Indis...</td>\n",
25292529
" </tr>\n",
25302530
" <tr>\n",
2531-
" <th>4735124</th>\n",
2532-
" <td>The Lost Atlantis</td>\n",
2533-
" <td>Adventure, Fantasy</td>\n",
2534-
" <td>1932</td>\n",
2535-
" <td>English</td>\n",
2536-
" <td>[All, in, one, word, Atlantis, An, ancient, dr...</td>\n",
2531+
" <th>3251135</th>\n",
2532+
" <td>The Smiling Lieutenant</td>\n",
2533+
" <td>Comedy, Romance, Musical</td>\n",
2534+
" <td>1931</td>\n",
2535+
" <td>English, French</td>\n",
2536+
" <td>[Bell_Rings, Bell_Rings, Sighing, Yawns, Yes, ...</td>\n",
25372537
" </tr>\n",
25382538
" <tr>\n",
2539-
" <th>96272</th>\n",
2540-
" <td>I Am a Fugitive from a Chain Gang</td>\n",
2541-
" <td>Crime, Drama, Film-Noir</td>\n",
2539+
" <th>6909562</th>\n",
2540+
" <td>Faithless</td>\n",
2541+
" <td>Drama</td>\n",
25422542
" <td>1932</td>\n",
25432543
" <td>English</td>\n",
2544-
" <td>[Hey, pipe, down, you, mugs, Sorry, to, break,...</td>\n",
2544+
" <td>[But, Carol, this, bank, is, your, guardian, W...</td>\n",
25452545
" </tr>\n",
25462546
" </tbody>\n",
25472547
"</table>\n",
25482548
"</div>"
25492549
],
25502550
"text/plain": [
2551-
" Movie Name Genre Year \\\n",
2552-
"3603861 Another Fine Mess Comedy, Short 1930 \n",
2553-
"6421562 East of Shanghai Comedy, Romance, Thriller 1931 \n",
2554-
"3130930 The Skin Game Drama 1931 \n",
2555-
"4735124 The Lost Atlantis Adventure, Fantasy 1932 \n",
2556-
"96272 I Am a Fugitive from a Chain Gang Crime, Drama, Film-Noir 1932 \n",
2551+
" Movie Name Genre Year \\\n",
2552+
"6861982 Blonde Crazy Comedy, Crime, Drama 1931 \n",
2553+
"6606107 Five and Ten Drama, Romance 1931 \n",
2554+
"6406611 Five Star Final Crime, Drama 1931 \n",
2555+
"3251135 The Smiling Lieutenant Comedy, Romance, Musical 1931 \n",
2556+
"6909562 Faithless Drama 1932 \n",
25572557
"\n",
2558-
" Country Tokenized Texts \n",
2559-
"3603861 English [Dear, ladies, and, gentlemen, Hal, Roach, pre... \n",
2560-
"6421562 English [Hello, Em, Hello, Fred, I, think, you, 'll, l... \n",
2561-
"3130930 English [Captioning, made, possible, by, lions, gate, ... \n",
2562-
"4735124 English [All, in, one, word, Atlantis, An, ancient, dr... \n",
2563-
"96272 English [Hey, pipe, down, you, mugs, Sorry, to, break,... "
2558+
" Country Tokenized Texts \n",
2559+
"6861982 English [Who, cares, for, starlit, skies, when, you, '... \n",
2560+
"6606107 English [Subtitles, Lu, s, Filipe, Bernardes, Mr, Rari... \n",
2561+
"6406611 English [Extra, Extra, Extra, Five, star, final, Indis... \n",
2562+
"3251135 English, French [Bell_Rings, Bell_Rings, Sighing, Yawns, Yes, ... \n",
2563+
"6909562 English [But, Carol, this, bank, is, your, guardian, W... "
25642564
]
25652565
},
25662566
"execution_count": 28,
@@ -2584,7 +2584,7 @@
25842584
"cell_type": "markdown",
25852585
"metadata": {},
25862586
"source": [
2587-
"You are encouraged to try the similar process and load other datasets."
2587+
"You are encouraged to try the similar process and load the other datasets."
25882588
]
25892589
}
25902590
],
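
Note on the changed cells: the notebook text in this diff describes two loading choices, skipping any text that is not UTF-8 encoded and building a dictionary from file id to tokenized words for at most 1000 movies. The following is a minimal sketch of that pattern, not the notebook's actual code. The function name `load_utf8_texts`, the `*.txt` folder layout, and the plain whitespace split (standing in for the lucem_illud tokenizer) are all assumptions made for illustration.

```python
from pathlib import Path


def load_utf8_texts(folder, max_files=1000):
    """Map each file id (file name without extension) to a list of tokens.

    Hypothetical helper: files that fail UTF-8 decoding are reported and
    skipped, and loading stops once max_files texts have been kept.
    """
    texts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Stop once the cap is reached; pass max_files=None to load everything.
        if max_files is not None and len(texts) >= max_files:
            break
        try:
            raw = path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            # Print the error for debugging and move on to the next file.
            print(f"skipping {path.name}: {err}")
            continue
        # Whitespace split is a stand-in for a real tokenizer.
        texts[path.stem] = raw.split()
    return texts
```

Dropping the `try`/`except` restriction (for example by decoding with `errors="replace"`) keeps more data at the cost of possible stray characters, which is the trade-off the updated note in the first hunk points to.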
