
CS2916 LLMs

Lec 1

Introduction

  • Generative Pretraining (vectors)

  • Prompt Engineering

Data annotation, data synthesis

NLP Basics

Challenges:

  • Ambiguity (polysemy)

  • Recursion (recursive structure)

  • Structural complexity

Tasks:

  • Text to Label

  • Text-Span to Label (the input pairs a text with a span)

  • Text-Text to Label

  • Text to Labels

  • Text to Text

  • Text to Tree

  • Word Prediction

Four engineering paradigms: feature engineering → architecture engineering → objective engineering → prompt engineering

Lec 2

Fine-tuning

Prompting

ML is about finding a mapping!

Three Concepts of ML:

  • Model

  • Learning criterion

  • Optimization (SGD is often used in LLMs)

self-supervised training

GD vs SGD

training set, validation set, test set

SGD and batch GD

Optimization and regularization (weight decay, SGD, early stopping)

Some hyperparameters (see the sketch after this list):

  • gradient_accumulation

  • weight_decay

  • warmup_ratio
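A minimal PyTorch sketch of where these three hyperparameters typically plug into a training loop; the model, data, and schedule here are toy placeholders, not the course's setup:

```python
# Sketch: gradient accumulation, weight decay, and warmup in one loop.
import torch

model = torch.nn.Linear(16, 2)   # stand-in for a real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

accum_steps = 4                  # gradient_accumulation
total_steps, warmup_ratio = 1000, 0.1
warmup_steps = int(total_steps * warmup_ratio)

def lr_scale(step):              # linear warmup, then constant
    return min(1.0, step / max(1, warmup_steps))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_scale)

for step in range(total_steps):
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))  # toy batch
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()        # accumulate gradients
    if (step + 1) % accum_steps == 0:
        opt.step()                         # one update per accumulation window
        sched.step()
        opt.zero_grad()
```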

MLPs

LLMs automatically extract features!

NN and DL

The Bitter Lesson

Activation Functions

ReLU, tanh, Sigmoid

gradient vanishing
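A quick sketch of the vanishing-gradient issue: sigmoid saturates for large |x|, so its gradient goes to zero, while ReLU keeps a constant gradient for positive inputs (the input values below are illustrative):

```python
# Sketch: sigmoid saturates (gradient vanishes), ReLU does not.
import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)   # near zero at |x| = 10: the saturated regions

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # 0 for x < 0, exactly 1 for x > 0: no saturation
```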

hyperparameters, learnable parameters, state values

inductive bias

No-Free-Lunch theorem

Lec 3

RNN: difficult to parallelize, and long-distance dependencies are difficult to capture.

Gradient explosion: use gradient clipping to keep the gradient within a proper range.
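A minimal sketch of gradient clipping by global norm, using PyTorch's standard utility; the model and loss are stand-ins:

```python
# Sketch: clip the global gradient norm after backward, before the step.
import torch

model = torch.nn.Linear(16, 2)               # stand-in model
loss = model(torch.randn(8, 16)).pow(2).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# after clipping, the global gradient norm is at most 1.0
```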

LSTM

BERT

ResNet

Highway Network

CNN: kernel, stride, padding.

Narrow (valid), equal (same), and wide convolutions.

Multi-kernel CNN

Pooling; max pooling is probably the most common.

Pooling is a great way to reduce dimensions.
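A one-step sketch of max pooling over time: it collapses a variable-length sequence of CNN features into one fixed-size vector (the shapes are illustrative):

```python
# Sketch: max pooling over the time dimension of CNN feature maps.
import torch

feats = torch.randn(8, 100, 64)    # (batch, seq_len, channels)
pooled = feats.max(dim=1).values   # (batch, channels): one value per filter
print(pooled.shape)                # torch.Size([8, 64])
```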

LM

We use $\prod_{i=1}^{n} P(w_i \mid w_1, w_2, \dots, w_{i-1})$ to judge whether a model performs well or not.
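A sketch of how this product is used in practice: multiplying many probabilities underflows, so we sum log-probabilities instead and report perplexity (the token probabilities below are made up):

```python
# Sketch: scoring a sentence from its per-token conditional probabilities.
import math

token_probs = [0.2, 0.05, 0.4, 0.1]   # hypothetical P(w_i | w_1..w_{i-1})
log_prob = sum(math.log(p) for p in token_probs)
perplexity = math.exp(-log_prob / len(token_probs))
print(log_prob, perplexity)           # higher probability => lower perplexity
```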

Lec 4

NMT

encoder-decoder model

When decoding, each output token is fed back into the model as the next input.

seq2seq: the input tokens form a sequence, and the outputs form a sequence too.
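A sketch of that decode-time feedback loop with greedy decoding; `step`, `VOCAB`, and `EOS` are hypothetical stand-ins, not part of any real API:

```python
# Sketch: autoregressive decoding — each predicted token is re-input.
import torch

VOCAB, EOS = 100, 0

def step(prefix_ids):          # stand-in for a trained decoder
    return torch.randn(VOCAB)  # next-token logits given the prefix

prefix = [1]                   # start-of-sequence token
for _ in range(20):
    next_id = int(step(prefix).argmax())  # greedy choice
    prefix.append(next_id)                # feed the output back in
    if next_id == EOS:
        break
print(prefix)
```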

Attention and RNN

LSTM: consider the situation where a key word is in the middle of the sentence.

gradient vanishing and gradient exploding: the problem of long-term dependency

An easy fix is to apply max pooling over all the hidden states $h$ in the model.

Attention mechanism: the key point is that it builds a direct connection between the input and the output.

RNN with attention: $s_i = f(s_{i-1}, y_{i-1}, c_i)$, where $c_i = \sum_{j=1}^{N} \alpha_{ij} h_j$; $y_{i-1}$ is the last output token, and $s_{i-1}$ is the previous decoder state.
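A minimal sketch of the context-vector computation, assuming a simple dot-product score between the previous decoder state and each encoder state (other score functions work the same way):

```python
# Sketch: c_i = sum_j alpha_ij * h_j, with alpha from a softmax over scores.
import torch

N, d = 6, 32
h = torch.randn(N, d)                 # encoder hidden states h_1..h_N
s_prev = torch.randn(d)               # previous decoder state s_{i-1}
scores = h @ s_prev                   # dot-product scores, shape (N,)
alpha = torch.softmax(scores, dim=0)  # attention weights alpha_ij
c = alpha @ h                         # context vector c_i, shape (d,)
```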

Self-attention: attention embedded within the encoder itself (encoder-encoder), which can be seen as an LSTM with a better way of producing outputs: a weight matrix measures the significance of all the tokens. Multi-head self-attention runs several such heads in parallel.

Furthermore, we drop the RNN entirely and add positional embeddings instead.

Transformer

QKV

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
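A direct sketch of this formula (single head, no masking; the shapes are illustrative):

```python
# Sketch: scaled dot-product attention, following the formula above.
import torch

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # QK^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

Q = torch.randn(10, 64)  # (seq_len, d_k)
K = torch.randn(10, 64)
V = torch.randn(10, 64)
print(attention(Q, K, V).shape)  # torch.Size([10, 64])
```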

In the Transformer, we set $\text{num\_heads} \times \text{head\_dim} = \text{input\_dim}$, which is convenient for the computation: since we stack many Transformer layers, we want the input and the output to have the same dimension, which makes the code simpler.

Position encoding: uses $\cos$ and $\sin$. However, as far as I know, RoPE is much more common today.
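A sketch of the original sinusoidal encoding, assuming the standard $10000^{2i/d_{\text{model}}}$ frequency schedule from the Transformer paper:

```python
# Sketch: sinusoidal positional encoding — sin/cos at varying frequencies.
import torch

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1).float()  # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()           # even dimensions
    freq = 10000 ** (i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos / freq)               # even dims: sin
    pe[:, 1::2] = torch.cos(pos / freq)               # odd dims: cos
    return pe

print(positional_encoding(50, 64).shape)  # torch.Size([50, 64])
```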

Decoder-Only LLMs

Decoding is autoregressive.

GPT-3

SFT (supervised fine-tuning), RLHF (reinforcement learning from human feedback)

Use human preference data to build a reward model, then optimize the policy with PPO.

language generation

in-context learning

world knowledge

SFT: question answering; generalization; code generation; CoT.