|
214 | 214 | [2] xGen‐MM (BLIP‐3): A Family of Open Large Multimodal Models. https://www.arxiv.org/abs/2408.08872.
|
215 | 215 | [3] InternVL: Scaling Up Vision Foundation Models and Aligning for Generic Visual‐Linguistic Tasks. https://arxiv.org/abs/2312.14238.
|
216 | 216 | [4] Meta’s Llama. https://llama.meta.com/.
|
217 |
| -[5] Byte‐Pair Encoding Tokenization. https://huggingface.co/learn/nlp‐course/en/chapter6/5. |
| 217 | +[5] Byte‐Pair Encoding Tokenization. https://huggingface.co/learn/nlp-course/en/chapter6/5. |
218 | 218 | [6] LAION‐5B: An Open Large‐Scale Dataset for Training Next Generation Image‐Text Models. https://arxiv.org/abs/2210.08402.
|
219 | 219 | [7] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https://arxiv.org/abs/2010.11929.
|
220 |
| -[8] Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better‐language‐models/language_models_are_unsupervised_multitask_learners.pdf. |
| 220 | +[8] Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. |
221 | 221 | [9] Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020.
|
222 |
| -[10] Cross‐Entropy. https://en.wikipedia.org/wiki/Cross‐entropy. |
| 222 | +[10] Cross‐Entropy. https://en.wikipedia.org/wiki/Cross-entropy. |
223 | 223 | [11] CIDEr: Consensus‐Based Image Description Evaluation. https://arxiv.org/abs/1411.5726.
|
224 |
| -[12] TF‐IDF Introduction. https://web.stanford.edu/class/cs276/19handouts/lecture6‐tfidf‐1per.pdf. |
| 224 | +[12] TF‐IDF Introduction. https://web.stanford.edu/class/cs276/19handouts/lecture6-tfidf-1per.pdf. |
225 | 225 | [13] TF‐IDF. https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
|
226 |
| -[14] Visual Question Answering Introduction. https://huggingface.co/tasks/visual‐question‐answering. |
| 226 | +[14] Visual Question Answering Introduction. https://huggingface.co/tasks/visual-question-answering. |
227 | 227 | [15] Cross‐Domain Image Captioning with Discriminative Finetuning. https://arxiv.org/abs/2304.01662.
|
228 |
| -[16] Crossmodal‐3600 — Multilingual Reference Captions for Geographically Diverse Images. https://research.google/blog/crossmodal‐3600‐multilingual‐reference‐captions‐for‐geographically‐diverse‐images/. |
| 228 | +[16] Crossmodal‐3600 — Multilingual Reference Captions for Geographically Diverse Images. https://research.google/blog/crossmodal-3600-multilingual-reference-captions-for-geographically-diverse-images/. |
229 | 229 | [17] Efficient Image Captioning for Edge Devices. https://arxiv.org/abs/2212.08985.
|
230 | 230 | [18] Ensemble Model Using an Image Captioning and Ranking Example. https://cloud.google.com/dataflow/docs/notebooks/run_inference_multi_model.
|
231 | 231 |
|
|
239 | 239 | [4] Optical Character Recognition. https://en.wikipedia.org/wiki/Optical_character_recognition.
|
240 | 240 | [5] Dedoc GitHub Repository. https://github.com/ispras/dedoc.
|
241 | 241 | [6] LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. https://arxiv.org/abs/2103.15348.
|
242 |
| -[7] Google Cloud document parser API. https://cloud.google.com/document‐ai/docs/layout‐parse‐chunk. |
243 |
| -[8] PDF.CO document parser API. https://developer.pdf.co/api/document‐parser/index.html. |
| 242 | +[7] Google Cloud document parser API. https://cloud.google.com/document-ai/docs/layout-parse-chunk. |
| 243 | +[8] PDF.CO document parser API. https://developer.pdf.co/api/document-parser/index.html. |
244 | 244 | [9] Character text splitter in LangChain. https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/character_text_splitter/.
|
245 | 245 | [10] Elasticsearch. https://www.elastic.co/elasticsearch.
|
246 | 246 | [11] A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. https://ieeexplore.ieee.org/document/9416312.
|
247 | 247 | [12] Christopher D. Manning. Introduction to Information Retrieval.2008.
|
248 | 248 | [13] Modern Information Retrieval: A Brief Overview. http://singhal.info/ieee2001.pdf.
|
249 | 249 | [14] Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020.
|
250 |
| -[15] OpenAI finetuning documentation. https://platform.openai.com/docs/guides/fine‐tuning. |
251 |
| -[16] Anthropic finetuning. https://www.anthropic.com/news/fine‐tune‐claude‐3‐haiku. |
| 250 | +[15] OpenAI finetuning documentation. https://platform.openai.com/docs/guides/fine-tuning. |
| 251 | +[16] Anthropic finetuning. https://www.anthropic.com/news/fine-tune-claude-3-haiku. |
252 | 252 | [17] RAFT: Adapting Language Model to Domain Specific RAG. https://arxiv.org/abs/2403.10131.
|
253 | 253 | [18] Euclidean Distance. https://en.wikipedia.org/wiki/Euclidean_distance.
|
254 | 254 | [19] Cosine Similarity. https://en.wikipedia.org/wiki/Cosine_similarity.
|
|
258 | 258 | [23] Similarity search in high dimensions via hashing. https://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf.
|
259 | 259 | [24] Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. https://arxiv.org/abs/1603.09320.
|
260 | 260 | [25] Faiss Documentation. https://faiss.ai/.
|
261 |
| -[26] ScaNN. https://research.google/blog/announcing‐scann‐efficient‐vector‐similarity‐search/. |
262 |
| -[27] Developer Playground. https://docs.cohere.com/v2/docs/playground‐overview. |
| 261 | +[26] ScaNN. https://research.google/blog/announcing-scann-efficient-vector-similarity-search/. |
| 262 | +[27] Developer Playground. https://docs.cohere.com/v2/docs/playground-overview. |
263 | 263 | [28] Chain‐of‐Thought Prompting Elicits Reasoning in Large Language Models. https://arxiv.org/abs/2201.11903.
|
264 | 264 | [29] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. https://arxiv.org/abs/2305.10601.
|
265 |
| -[30] OpenAI o1. https://openai.com/index/learning‐to‐reason‐with‐llms/. |
| 265 | +[30] OpenAI o1. https://openai.com/index/learning-to-reason-with-llms/. |
266 | 266 | [31] Scaling LLM Test‐Time Compute Optimally can be More Effective than Scaling Model Parameters. https://arxiv.org/abs/2408.03314.
|
267 | 267 | [32] Language Models are Few‐Shot Learners. https://arxiv.org/abs/2005.14165.
|
268 | 268 | [33] Machine Learning System Design Interview. https://www.aliaminian.com/books.
|
|
285 | 285 | [2] Auto‐Encoding Variational Bayes. https://arxiv.org/abs/1312.6114.
|
286 | 286 | [3] Generative Adversarial Networks. https://arxiv.org/abs/1406.2661.
|
287 | 287 | [4] Combating Mode Collapse in GAN Training: An Empirical Analysis Using Hessian Eigenvalues. https://arxiv.org/abs/2012.09673.
|
288 |
| -[5] Google’s GAN Course. https://developers.google.com/machine‐learning/gan/training. |
| 288 | +[5] Google’s GAN Course. https://developers.google.com/machine-learning/gan/training. |
289 | 289 | [6] StackGAN: Text to Photo‐Realistic Image Synthesis with Stacked Generative Adversarial Networks. https://arxiv.org/abs/1612.03242.
|
290 | 290 | [7] Zero‐Shot Text‐to‐Image Generation. https://arxiv.org/abs/2102.12092.
|
291 | 291 | [8] Muse: Text‐To‐Image Generation via Masked Generative Transformers. https://arxiv.org/abs/2301.00704.
|
292 |
| -[9] DALL∙E 3. https://openai.com/index/dall‐e‐3/. |
| 292 | +[9] DALL∙E 3. https://openai.com/index/dall-e-3/. |
293 | 293 | [10] Attribute‐Specific Control Units in StyleGAN for Fine‐Grained Image Manipulation. https://arxiv.org/abs/2111.13010.
|
294 | 294 | [11] A Guide to Convolution Arithmetic for Deep Learning. https://arxiv.org/abs/1603.07285.
|
295 | 295 | [12] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. https://arxiv.org/abs/1502.03167.
|
|
300 | 300 | [17] PyTorch’s Tanh Layer. https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html.
|
301 | 301 | [18] A Style‐Based Generator Architecture for Generative Adversarial Networks. https://arxiv.org/abs/1812.04948.
|
302 | 302 | [19] Minimax. https://en.wikipedia.org/wiki/Minimax.
|
303 |
| -[20] Loss Functions in GANs. https://developers.google.com/machine‐learning/gan/loss. |
| 303 | +[20] Loss Functions in GANs. https://developers.google.com/machine-learning/gan/loss. |
304 | 304 | [21] Towards Principled Methods for Training Generative Adversarial Networks. https://arxiv.org/abs/1701.04862.
|
305 | 305 | [22] Unrolled Generative Adversarial Networks. https://arxiv.org/abs/1611.02163.
|
306 | 306 | [23] Stabilizing Training of Generative Adversarial Networks through Regularization. https://arxiv.org/abs/1705.09367.
|
|
339 | 339 | [12] NTIRE 2024 Challenge on Image Super‐Resolution (×4): Methods and Results. https://arxiv.org/abs/2404.09790.
|
340 | 340 | [13] Muse: Text‐To‐Image Generation via Masked Generative Transformers. https://arxiv.org/abs/2301.00704.
|
341 | 341 | [14] VQGAN‐CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. https://arxiv.org/abs/2204.08583.
|
342 |
| -[15] LAR‐SR: A Local Autoregressive Model for Image Super‐Resolution. https://openac cess.thecvf.com/content/CVPR2022/papers/Guo_LAR‐SR_A_Local_Autoregressive_Model_for_Image_Super‐Resolution_CVPR_2022_paper.pdf. |
| 342 | +[15] LAR‐SR: A Local Autoregressive Model for Image Super‐Resolution. https://openaccess.thecvf.com/content/CVPR2022/papers/Guo_LAR-SR_A_Local_Autoregressive_Model_for_Image_Super-Resolution_CVPR_2022_paper.pdf. |
343 | 343 | [16] Long Horizon Temperature Scaling. https://arxiv.org/abs/2302.03686.
|
344 | 344 | [17] Learning Rate Scheduling. https://d2l.ai/chapter_optimization/lr‐scheduler.html.
|
345 |
| -[18] Adversarial Training. https://adversarial‐ml‐tutorial.org/adversarial_training/. |
| 345 | +[18] Adversarial Training. https://adversarial-ml-tutorial.org/adversarial_training/. |
346 | 346 | [19] Progressive Growing of GANs for Improved Quality, Stability, and Variation. https://arxiv.org/abs/1710.10196.
|
347 | 347 | [20] CogView2: Faster and Better Text‐to‐Image Generation via Hierarchical Transformers. https://arxiv.org/abs/2204.14217.
|
348 | 348 |
|
349 | 349 | ---
|
350 | 350 |
|
351 | 351 | ## Chapter 9: Text-to-Image Generation
|
352 | 352 |
|
353 |
| -[1] OpenAI’s DALL‐E 3. https://openai.com/index/dall‐e‐3/. |
| 353 | +[1] OpenAI’s DALL‐E 3. https://openai.com/index/dall-e-3/. |
354 | 354 | [2] Imagen 3. https://arxiv.org/abs/2408.07009.
|
355 | 355 | [3] Adobe’s Firefly. https://www.adobe.com/products/firefly.html.
|
356 | 356 | [4] Introducing ChatGPT. https://openai.com/index/chatgpt/.
|
|
370 | 370 | [18] Denoising Diffusion Probabilistic Models. https://arxiv.org/abs/2006.11239.
|
371 | 371 | [19] Classifier‐Free Diffusion Guidance. https://arxiv.org/abs/2207.12598.
|
372 | 372 | [20] Denoising Diffusion Implicit Models. https://arxiv.org/abs/2010.02502.
|
373 |
| -[21] Introduction to Diffusion Models. https://lilianweng.github.io/posts/2021‐07‐11‐diffusion‐models/. |
| 373 | +[21] Introduction to Diffusion Models. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/. |
374 | 374 | [22] Mixed Precision Training. https://arxiv.org/abs/1710.03740.
|
375 | 375 | [23] FSDP tutorial. https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html.
|
376 | 376 | [24] DeepSpeed. https://github.com/microsoft/DeepSpeed.
|
|
379 | 379 | [27] Inception score. https://en.wikipedia.org/wiki/Inception_score.
|
380 | 380 | [28] FID calculation. https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance.
|
381 | 381 | [29] CLIPScore: A Reference‐free Evaluation Metric for Image Captioning. https://arxiv.org/abs/2104.08718.
|
382 |
| -[30] Sora overview. https://openai.com/index/video‐generation‐models‐as‐world‐simulators/. |
| 382 | +[30] Sora overview. https://openai.com/index/video-generation-models-as-world-simulators/. |
383 | 383 | [31] Imagen Video: High Definition Video Generation with Diffusion Models. https://arxiv.org/abs/2210.02303.
|
384 |
| -[32] Finetune Stable Diffusion Models with DDPO via TRL. https://huggingface.co/blog/trl‐ddpo. |
| 384 | +[32] Finetune Stable Diffusion Models with DDPO via TRL. https://huggingface.co/blog/trl-ddpo. |
385 | 385 | [33] Kandinsky: an Improved Text‐to‐Image Synthesis with Image Prior and Latent Diffusion. https://arxiv.org/abs/2310.03502.
|
386 | 386 | [34] On the Importance of Noise Scheduling for Diffusion Models. https://arxiv.org/abs/2301.10972.
|
387 | 387 | [35] Patchn’Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. https://arxiv.org/abs/2307.06304.
|
388 | 388 | [36] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual‐Linguistic Tasks. https://arxiv.org/abs/2312.14238.
|
389 | 389 | [37] BLIP‐2: Bootstrapping Language‐Image Pre‐training with Frozen Image Encoders and Large Language Models. https://arxiv.org/abs/2301.12597.
|
390 | 390 | [38] Adding Conditional Control to Text‐to‐Image Diffusion Models. https://arxiv.org/abs/2302.05543.
|
391 |
| -[39] StyleDrop: Text‐to‐image generation in any style. https://research.google/blog/styledrop‐text‐to‐image‐generation‐in‐any‐style/. |
| 391 | +[39] StyleDrop: Text‐to‐image generation in any style. https://research.google/blog/styledrop-text-to-image-generation-in-any-style/. |
392 | 392 |
|
393 | 393 | ---
|
394 | 394 |
|
395 | 395 | ## Chapter 10: Personal Headshot Generation
|
396 | 396 |
|
397 |
| -[1] Imagine yourself: Tuning‐Free Personalized Image Generation. https://ai.meta.com/research/publications/imagine‐yourself‐tuning‐free‐personalized‐image‐generation/. |
| 397 | +[1] Imagine yourself: Tuning‐Free Personalized Image Generation. https://ai.meta.com/research/publications/imagine-yourself-tuning-free-personalized-image-generation/. |
398 | 398 | [2] MoA: Mixture‐of‐Attention for Subject‐Context Disentanglement in Personalized Image Generation. https://arxiv.org/abs/2404.11565.
|
399 | 399 | [3] InstantID: Zero‐shot Identity‐Preserving Generation in Seconds. https://arxiv.org/abs/2401.07519.
|
400 |
| -[4] An Image is Worth One Word: Personalizing Text‐to‐Image Generation using Textual Inversion. https://textual‐inversion.github.io/. |
| 400 | +[4] An Image is Worth One Word: Personalizing Text‐to‐Image Generation using Textual Inversion. https://textual-inversion.github.io/. |
401 | 401 | [5] DreamBooth: Fine Tuning Text‐to‐Image Diffusion Models for Subject‐Driven Generation. https://arxiv.org/abs/2208.12242.
|
402 | 402 | [6] LoRA: Low‐Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685.
|
403 | 403 | [7] Language Models are Few‐Shot Learners. https://arxiv.org/abs/2005.14165.
|
|
407 | 407 | [11] Inception score. https://en.wikipedia.org/wiki/Inception_score.
|
408 | 408 | [12] Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020.
|
409 | 409 | [13] Emerging Properties in Self‐Supervised Vision Transformers. https://arxiv.org/abs/2104.14294.
|
410 |
| -[14] Contrastive Representation Learning. https://lilianweng.github.io/posts/2021‐05‐31‐contrastive/. |
| 410 | +[14] Contrastive Representation Learning. https://lilianweng.github.io/posts/2021-05-31-contrastive/. |
411 | 411 | [15] DINOv2: Learning Robust Visual Features without Supervision. https://arxiv.org/abs/2304.07193.
|
412 | 412 | [16] An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine‐tuning. https://arxiv.org/abs/2308.08747.
|
413 | 413 | [17] SDXL: Improving Latent Diffusion Models for High‐Resolution Image Synthesis. https://arxiv.org/abs/2307.01952.
|
|
419 | 419 |
|
420 | 420 | ## Chapter 11: Text-to-Video Generation
|
421 | 421 |
|
422 |
| -[1] Video generation models as world simulators. https://openai.com/index/video‐generation‐models‐as‐world‐simulators/. |
423 |
| -[2] H100 Tensor Core GPU. https://www.nvidia.com/en‐us/data‐center/h100/. |
| 422 | +[1] Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/. |
| 423 | +[2] H100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h100/. |
424 | 424 | [3] High‐Resolution Image Synthesis with Latent Diffusion Models. https://arxiv.org/abs/2112.10752.
|
425 |
| -[4] Meta Movie Gen. https://ai.meta.com/research/movie‐gen/. |
| 425 | +[4] Meta Movie Gen. https://ai.meta.com/research/movie-gen/. |
426 | 426 | [5] Auto‐Encoding Variational Bayes. https://arxiv.org/abs/1312.6114.
|
427 |
| -[6] The Illustrated Stable Diffusion. https://jalammar.github.io/illustrated‐stable‐diffusion/. |
| 427 | +[6] The Illustrated Stable Diffusion. https://jalammar.github.io/illustrated-stable-diffusion/. |
428 | 428 | [7] On the De‐duplication of LAION‐2B. https://arxiv.org/abs/2303.12733.
|
429 | 429 | [8] The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783.
|
430 |
| -[9] LLaVA‐NeXT: A Strong Zero‐shot Video Understanding Model. https://llava‐vl.github.io/blog/2024‐04‐30‐llava‐next‐video/. |
| 430 | +[9] LLaVA‐NeXT: A Strong Zero‐shot Video Understanding Model. https://llava-vl.github.io/blog/2024-04-30-llava-next-video/. |
431 | 431 | [10] Lumiere: A Space‐Time Diffusion Model for Video Generation. https://arxiv.org/abs/2401.12945.
|
432 |
| -[11] OpenSora Technical Report. https://github.com/hpcaitech/Open‐Sora/blob/main/docs/report_02.md. |
| 432 | +[11] OpenSora Technical Report. https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_02.md. |
433 | 433 | [12] RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864.
|
434 | 434 | [13] Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. https://arxiv.org/abs/2311.15127.
|
435 | 435 | [14] Emu Video: Factorizing Text‐to‐Video Generation by Explicit Image Conditioning. https://arxiv.org/abs/2311.10709.
|
436 | 436 | [15] Imagen Video: High Definition Video Generation with Diffusion Models. https://arxiv.org/abs/2210.02303.
|
437 | 437 | [16] HyperAttention: Long‐context Attention in Near‐Linear Time. https://arxiv.org/abs/2310.05869.
|
438 | 438 | [17] Mixture of Experts Explained. https://huggingface.co/blog/moe.
|
439 |
| -[18] VBench: Comprehensive Benchmark Suite for Video Generative Models. https://vchitect.github.io/VBench‐project/. |
| 439 | +[18] VBench: Comprehensive Benchmark Suite for Video Generative Models. https://vchitect.github.io/VBench-project/. |
440 | 440 | [19] Movie Gen Bench. https://github.com/facebookresearch/MovieGenBench.
|
441 | 441 | [20] FID calculation. https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance.
|
442 | 442 | [21] Inception score. https://en.wikipedia.org/wiki/Inception_score.
|
|
451 | 451 | [30] Photorealistic Text‐to‐Image Diffusion Models with Deep Language Understanding. https://arxiv.org/abs/2205.11487.
|
452 | 452 | [31] CustomVideo: Customizing Text‐to‐Video Generation with Multiple Subjects. https://arxiv.org/abs/2401.09962.
|
453 | 453 | [32] Control‐A‐Video: Controllable Text‐to‐Video Generation with Diffusion Models. https://controlavideo.github.io/.
|
454 |
| -[33] Introducing Stable Cascade. https://stability.ai/news/introducing‐stable‐cascade. |
| 454 | +[33] Introducing Stable Cascade. https://stability.ai/news/introducing-stable-cascade. |
455 | 455 |
|
456 | 456 | ---
|
457 | 457 |
|