Commit 3ca7b1f

Merge pull request #35 from nlp-with-transformers/add-figures
Add figures
2 parents: ae5b7c1 + 4c0aab9

File tree

88 files changed: +566 -0 lines changed

Note: this is a large commit, so some content is hidden by default; only 5 of the 88 changed files are shown below.

01_introduction.ipynb (+77 lines)

@@ -31,27 +31,83 @@
     "# Hello Transformers"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"transformer-timeline\" caption=\"The transformers timeline\" src=\"images/chapter01_timeline.png\" id=\"transformer-timeline\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## The Encoder-Decoder Framework"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"rnn\" caption=\"Unrolling an RNN in time.\" src=\"images/chapter01_rnn.png\" id=\"rnn\"/>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"enc-dec\" caption=\"Encoder-decoder architecture with a pair of RNNs. In general, there are many more recurrent layers than those shown.\" src=\"images/chapter01_enc-dec.png\" id=\"enc-dec\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Attention Mechanisms"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"enc-dec-attn\" caption=\"Encoder-decoder architecture with an attention mechanism for a pair of RNNs.\" src=\"images/chapter01_enc-dec-attn.png\" id=\"enc-dec-attn\"/>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"attention-alignment\" width=\"500\" caption=\"RNN encoder-decoder alignment of words in English and the generated translation in French (courtesy of Dzmitry Bahdanau).\" src=\"images/chapter02_attention-alignment.png\" id=\"attention-alignment\"/>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"transformer-self-attn\" caption=\"Encoder-decoder architecture of the original Transformer.\" src=\"images/chapter01_self-attention.png\" id=\"transformer-self-attn\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Transfer Learning in NLP"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"transfer-learning\" caption=\"Comparison of traditional supervised learning (left) and transfer learning (right).\" src=\"images/chapter01_transfer-learning.png\" id=\"transfer-learning\"/>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"ulmfit\" width=\"500\" caption=\"The ULMFiT process (courtesy of Jeremy Howard).\" src=\"images/chapter01_ulmfit.png\" id=\"ulmfit\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -497,13 +553,34 @@
     "## The Hugging Face Ecosystem"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"ecosystem\" width=\"500\" caption=\"An overview of the Hugging Face ecosystem of libraries and the Hub.\" src=\"images/chapter01_hf-ecosystem.png\" id=\"ecosystem\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### The Hugging Face Hub"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"hub-overview\" width=\"1000\" caption=\"The models page of the Hugging Face Hub, showing filters on the left and a list of models on the right.\" src=\"images/chapter01_hub-overview.png\" id=\"hub-overview\"/>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"hub-model-card\" width=\"1000\" caption=\"An example model card from the Hugging Face Hub. The inference widget is shown on the right, where you can interact with the model.\" src=\"images/chapter01_hub-model-card.png\" id=\"hub-model-card\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

03_transformer-anatomy.ipynb (+56 lines)

@@ -48,20 +48,41 @@
     "## The Transformer Architecture"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"transformer-encoder-decoder\" caption=\"Encoder-decoder architecture of the transformer, with the encoder shown in the upper half of the figure and the decoder in the lower half\" src=\"images/chapter03_transformer-encoder-decoder.png\" id=\"transformer-encoder-decoder\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## The Encoder"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"encoder-zoom\" caption=\"Zooming into the encoder layer\" src=\"images/chapter03_encoder-zoom.png\" id=\"encoder-zoom\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Self-Attention"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Contextualized embeddings\" caption=\"Diagram showing how self-attention updates raw token embeddings (upper) into contextualized embeddings (lower) to create representations that incorporate information from the whole sequence\" src=\"images/chapter03_contextualized-embedding.png\" id=\"contextualized-embeddings\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

@@ -181,6 +202,13 @@
     "### End sidebar"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Operations in scaled dot-product attention\" height=\"125\" caption=\"Operations in scaled dot-product attention\" src=\"images/chapter03_attention-ops.png\" id=\"attention-ops\"/>"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,

@@ -351,6 +379,13 @@
     "#### Multi-headed attention"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Multi-head attention\" height=\"125\" caption=\"Multi-head attention\" src=\"images/chapter03_multihead-attention.png\" id=\"multihead-attention\"/>"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,

@@ -557,6 +592,13 @@
     "### Adding Layer Normalization"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Transformer layer normalization\" height=\"500\" caption=\"Different arrangements of layer normalization in a transformer encoder layer\" src=\"images/chapter03_layer-norm.png\" id=\"layer-norm\"/>"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,

@@ -757,6 +799,13 @@
     "## The Decoder"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Transformer decoder zoom\" caption=\"Zooming into the transformer decoder layer\" src=\"images/chapter03_decoder-zoom.png\" id=\"decoder-zoom\"/>"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,

@@ -851,6 +900,13 @@
     "### The Transformer Tree of Life"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Transformer family tree\" caption=\"An overview of some of the most prominent transformer architectures\" src=\"images/chapter03_transformers-compact.png\" id=\"family-tree\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

04_multilingual-ner.ipynb (+35 lines)

@@ -960,6 +960,13 @@
     "### The Tokenizer Pipeline"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Tokenizer pipeline\" caption=\"The steps in the tokenization pipeline\" src=\"images/chapter04_tokenizer-pipeline.png\" id=\"tokenizer-pipeline\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

@@ -1001,6 +1008,20 @@
     "## Transformers for Named Entity Recognition"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Architecture of a transformer encoder for classification.\" caption=\"Fine-tuning an encoder-based transformer for sequence classification\" src=\"images/chapter04_clf-architecture.png\" id=\"clf-arch\"/>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Architecture of a transformer encoder for named entity recognition. The wide linear layer shows that the same linear layer is applied to all hidden states.\" caption=\"Fine-tuning an encoder-based transformer for named entity recognition\" src=\"images/chapter04_ner-architecture.png\" id=\"ner-arch\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

@@ -1015,6 +1036,13 @@
     "### Bodies and Heads"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"bert-body-head\" caption=\"The `BertModel` class only contains the body of the model, while the `BertFor&lt;Task&gt;` classes combine the body with a dedicated head for a given task\" src=\"images/chapter04_bert-body-head.png\" id=\"bert-body-head\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

@@ -4096,6 +4124,13 @@
     "## Interacting with Model Widgets"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"A Hub widget\" caption=\"Example of a widget on the Hugging Face Hub\" src=\"images/chapter04_ner-widget.png\" id=\"ner-widget\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

05_text-generation.ipynb (+28 lines)

@@ -41,13 +41,34 @@
     "# Text Generation"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"LM Meta Learning\" width=\"800\" caption=\"During pretraining, language models are exposed to sequences of tasks that can be adapted during inference (courtesy of Tom B. Brown)\" src=\"images/chapter05_lm-meta-learning.png\" id=\"lm-meta-learning\"/>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Meena\" width=\"300\" caption=\"Meena on the left telling a corny joke to a human on the right (courtesy of Daniel Adiwardana and Thang Luong)\" src=\"images/chapter05_meena.png\" id=\"meena\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## The Challenge with Generating Coherent Text"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Text generation\" width=\"700\" caption=\"Generating text from an input sequence by adding a new word to the input at each step\" src=\"images/chapter05_text-generation.png\" id=\"text-generation\"/>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

@@ -320,6 +341,13 @@
     "## Beam Search Decoding"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"Beam search\" width=\"700\" caption=\"Beam search with two beams—the most probable sequences at each timestep are highlighted in blue\" src=\"images/chapter05_beam-search.png\" id=\"beam-search\"/>"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,

06_summarization.ipynb (+14 lines)

@@ -269,6 +269,13 @@
     "### T5"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"T5\" width=\"700\" caption=\"Diagram of T5's text-to-text framework (courtesy of Colin Raffel); besides translation and summarization, the CoLA (linguistic acceptability) and STSB (semantic similarity) tasks are shown\" src=\"images/chapter08_t5.png\" id=\"T5\"/>"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,

@@ -307,6 +314,13 @@
     "### PEGASUS"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img alt=\"pegasus\" width=\"700\" caption=\"Diagram of PEGASUS architecture (courtesy of Jingqing Zhang et al.)\" src=\"images/chapter08_pegasus.png\" id=\"pegasus\"/>"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
