- Generative Pretraining (vectors)
- Prompt Engineering
- Data Annotation
Challenges:
- Ambiguity
- Recursion
- Structural complexity
Tasks:
- Text to Label
- Text-Span to Label (a text paired with a span within it)
- Text-Text to Label
- Text to Labels
- Text to Text
- Text to Tree
- Word Prediction
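These formats are easier to see with concrete cases. The sketch below uses invented inputs/outputs (the examples and dictionary keys are mine, not from the notes) to illustrate a few of them:

```python
# Toy illustrations of some task formats (all examples invented).
examples = {
    # Text to Label: e.g. sentiment classification
    "text_to_label": {"input": "The movie was great.", "output": "positive"},
    # Text-Span to Label: the input is a text plus a character span within it
    "text_span_to_label": {"input": ("Paris is lovely in May.", (0, 5)), "output": "LOCATION"},
    # Text-Text to Label: e.g. natural language inference over a sentence pair
    "text_text_to_label": {"input": ("A man is sleeping.", "A person is awake."), "output": "contradiction"},
    # Text to Labels: e.g. sequence labeling, one label per token
    "text_to_labels": {"input": ["John", "lives", "in", "Berlin"], "output": ["B-PER", "O", "O", "B-LOC"]},
    # Text to Text: e.g. translation or summarization
    "text_to_text": {"input": "Bonjour le monde", "output": "Hello world"},
}
```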
Feature-based
Fine-tuning
Prompting
ML is about finding a mapping!
Three Concepts of ML:
- Model
- Learning Policy
- Optimization (SGD is often used for LLMs; see the sketch after this list)
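A minimal PyTorch sketch of how the three concepts fit together: a model (a parametric mapping), a learning policy (a loss), and an optimizer (SGD). The data and hyperparameter values are made up for illustration.

```python
import torch
import torch.nn as nn

# Made-up regression data: learn the mapping x -> y.
x = torch.randn(256, 10)
y = x @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

model = nn.Linear(10, 1)                                   # the model: a parametric mapping
loss_fn = nn.MSELoss()                                     # the learning policy: what "good" means
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # the optimization algorithm

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # how far the current mapping is from the data
    loss.backward()               # compute gradients
    optimizer.step()              # SGD update
```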
self-supervised training
GD vs SGD
training set, validation set, test set
SGD and batch GD
Optimization and regularization (weight decay, SGD, early stopping)
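A toy sketch (my own setup, not from the notes) of how the train/validation split supports early stopping: stop training once the validation loss stops improving, and keep the best checkpoint.

```python
import copy
import torch
import torch.nn as nn

# Toy data split into train and validation sets.
x, y = torch.randn(512, 10), torch.randn(512, 1)
x_train, y_train, x_val, y_val = x[:400], y[:400], x[400:], y[400:]

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
# weight_decay adds an L2 penalty to the SGD update (a form of regularization).
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=1e-4)

best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0
for epoch in range(200):
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val:        # validation improved: remember this model
        best_val, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
    else:                          # no improvement: count toward the patience budget
        bad_epochs += 1
        if bad_epochs >= patience: # early stopping
            break

model.load_state_dict(best_state)
```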
Some hyperparameters (see the sketch after this list):
- gradient_accumulation
- weight_decay
- warmup_ratio
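A rough sketch of what these three hyperparameters do inside a training loop; the loop, data, and values are invented for illustration. (In Hugging Face `transformers`, the corresponding `TrainingArguments` fields are `gradient_accumulation_steps`, `weight_decay`, and `warmup_ratio`.)

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

# weight_decay: L2-style penalty applied inside the optimizer update.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

total_steps, warmup_ratio = 1000, 0.1
warmup_steps = int(total_steps * warmup_ratio)
# warmup_ratio: the LR ramps up linearly over the first 10% of steps, then decays linearly.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda s: s / max(1, warmup_steps) if s < warmup_steps
    else max(0.0, (total_steps - s) / max(1, total_steps - warmup_steps)),
)

grad_accum = 4  # gradient accumulation: simulate a 4x larger batch on limited memory
for step in range(total_steps):
    for micro_step in range(grad_accum):
        x, y = torch.randn(8, 10), torch.randn(8, 1)   # one micro-batch
        loss = loss_fn(model(x), y) / grad_accum       # scale so the sum matches one big batch
        loss.backward()                                # gradients accumulate across micro-batches
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```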
MLPs
LLMs automatically extract features!
The Bitter Lesson
Activation Functions
ReLU, tanh, Sigmoid
gradient vanishing
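A tiny sketch (values mine) of why saturating activations like sigmoid cause vanishing gradients while ReLU does not: the sigmoid derivative is at most 0.25, and multiplying many such factors across layers drives the gradient toward zero.

```python
import torch

x = torch.linspace(-6, 6, steps=5, requires_grad=True)

# Sigmoid saturates: its derivative is at most 0.25 and goes to ~0 for large |x|.
torch.sigmoid(x).sum().backward()
print("sigmoid grads:", x.grad)   # tiny at the ends, <= 0.25 in the middle

x.grad = None
# ReLU passes the gradient through unchanged wherever the unit is active.
torch.relu(x).sum().backward()
print("relu grads:   ", x.grad)   # exactly 0 or 1

# Chained over many layers, factors <= 0.25 multiply together and vanish:
print("0.25 ** 20 =", 0.25 ** 20)
```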
hyperparameters, learnable parameters, state values
inductive bias
No-Free-Lunch Theorem
RNNs: hard to parallelize, and long-distance dependencies are hard to capture.
Gradient explosion: use gradient clipping to keep gradients within a proper range.
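A short sketch of gradient clipping in PyTorch (the model, data, and threshold are placeholders): `clip_grad_norm_` rescales the gradients so their global norm stays below the chosen bound before the optimizer step.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=64)     # any recurrent model will do
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 4, 32)                          # (seq_len, batch, features), made-up data
output, _ = model(x)
loss = output.pow(2).mean()                         # placeholder loss

optimizer.zero_grad()
loss.backward()
# Clip the global gradient norm to 1.0 before updating, to avoid exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```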
LSTM
BERT
ResNet
Highway Network
CNN: kernel, stride, padding
Narrow CNN, Equal CNN and Wide CNN.
Multi-kernel CNN
Pooling; max pooling is probably the most common variant.
Pooling is a great way to reduce dimensions.
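A sketch of a multi-kernel CNN over token embeddings with max pooling, in the style of a TextCNN classifier; the class name, dimensions, kernel sizes, and class count are my own illustrative choices.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, num_classes=2,
                 kernel_sizes=(3, 4, 5), channels=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per kernel size ("multi-kernel");
        # padding keeps the output length close to the input length.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, channels, k, padding=k // 2) for k in kernel_sizes
        )
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)     # (batch, embed_dim, seq_len)
        # Max pooling over the sequence dimension collapses each feature map to one number.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))      # (batch, num_classes)

logits = TextCNN()(torch.randint(0, 1000, (8, 20)))   # toy batch: 8 sequences of length 20
```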
LM
We use the encoder-decoder architecture.
When decoding, each output token is fed back into the model as input for the next step.
seq2seq: the input is a sequence of tokens, and the output is a sequence as well.
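A schematic sketch (not a specific model from the notes) of the seq2seq decoding loop: each generated token is appended to the decoder input and fed back in for the next step. The decoder, token ids, and sizes below are placeholders.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Stand-in decoder: embeds the tokens generated so far and predicts the next one."""
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, memory):
        h, _ = self.rnn(self.embed(tokens), memory)   # condition on the encoder "memory"
        return self.out(h[:, -1])                     # logits for the next token only

BOS, EOS, MAX_LEN = 1, 2, 20
decoder = ToyDecoder()
memory = torch.zeros(1, 1, 32)                        # pretend this came from the encoder

tokens = torch.tensor([[BOS]])
for _ in range(MAX_LEN):
    next_token = decoder(tokens, memory).argmax(dim=-1, keepdim=True)  # greedy choice
    tokens = torch.cat([tokens, next_token], dim=1)   # feed the output back in as input
    if next_token.item() == EOS:
        break
print(tokens)
```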
Attention and RNN
LSTM: consider the situation where a key word sits in the middle of the sentence.
Gradient vanishing and gradient exploding: the long-term dependency problem.
An easy way out is to apply max-pooling over all the hidden states.
Attention mechanism: the key point is that it builds a direct connection between the input and the output.
RNN with attention:
Self-attention: attention within the encoder itself (encoder-to-encoder), which can be seen as giving the LSTM a better way of producing its output: a weight matrix measures the significance of all the tokens. Multi-head self-attention.
Furthermore, we drop the RNN entirely and add positional encodings instead.
QKV
In a Transformer, Q, K, and V are obtained by linear projections of the token representations.
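A minimal sketch of scaled dot-product attention with Q, K, V as linear projections of the same input (the self-attention case); shapes and sizes are illustrative.

```python
import math
import torch
import torch.nn as nn

d_model = 64
x = torch.randn(2, 10, d_model)               # (batch, seq_len, d_model), toy input

# Q, K, V are linear projections of the same token representations (self-attention).
w_q, w_k, w_v = (nn.Linear(d_model, d_model) for _ in range(3))
q, k, v = w_q(x), w_k(x), w_v(x)

# Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)   # (batch, seq_len, seq_len)
weights = scores.softmax(dim=-1)    # how much each token attends to every other token
output = weights @ v                # (batch, seq_len, d_model)
```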
Position Encoding: uses sin and cos. However, as far as I know, RoPE is much more common today.
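A sketch of the original sinusoidal position encoding (sin on even dimensions, cos on odd); the sizes are illustrative. RoPE instead rotates the Q/K vectors by position-dependent angles and is not shown here.

```python
import math
import torch

def sinusoidal_position_encoding(max_len=128, d_model=64):
    position = torch.arange(max_len).unsqueeze(1)                  # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cos
    return pe

# Added to the token embeddings so the model knows where each token sits.
pe = sinusoidal_position_encoding()
```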
When decoding, the decoder is auto-regressive.
GPT-3
SFT (supervised fine-tuning), RLHF (reinforcement learning from human feedback)
Use human preference data to build a reward model; then optimize the policy with PPO.
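A hedged sketch of the pairwise loss commonly used to train a reward model from human preference data (Bradley-Terry style): push the reward of the chosen answer above the rejected one. The reward model and features here are placeholders, and the PPO policy-optimization step is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(128, 1)   # placeholder: maps a response representation to a scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder features for (chosen, rejected) response pairs to the same prompts.
chosen, rejected = torch.randn(16, 128), torch.randn(16, 128)

r_chosen = reward_model(chosen)
r_rejected = reward_model(rejected)
# Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```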
language generation
in-context learning
world knowledge
SFT: answering questions; generalization; code generation; CoT.
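A small made-up example of in-context learning: the task is specified purely through examples in the prompt, with no parameter updates.

```python
# A made-up few-shot prompt; a pretrained LM is expected to continue the pattern
# (ideally producing "eau") without any fine-tuning.
prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: water
French:"""
```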