CS336-2025-lec1

2025-06-25

10:22

Overview, tokenization

 

 

When the parameter count is small, the FLOPs of the MLP and of multi-head attention (MHA) are roughly comparable, but when the parameter count is large, the MLP FLOPs dominate. So optimizing the attention computation benefits large models less than it benefits small models.
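
A minimal back-of-the-envelope sketch of this claim (my own illustration, assuming a decoder-only transformer with a 4x MLP expansion and the usual 2-FLOPs-per-multiply-add accounting; the constants are approximate):

    # Rough per-layer, per-token FLOPs (assumed accounting, not from the lecture).
    def mlp_flops(d):
        # two matmuls: d -> 4d and 4d -> d, each ~2 * d * 4d FLOPs
        return 2 * 2 * d * 4 * d              # = 16 d^2

    def attention_flops(d, n):
        proj = 4 * 2 * d * d                  # Q, K, V, output projections = 8 d^2
        scores = 2 * 2 * n * d                # QK^T scores + weighted sum over n keys
        return proj + scores

    for d, n in [(768, 1024), (12288, 1024)]: # roughly GPT-2-small vs GPT-3-scale width
        m, a = mlp_flops(d), attention_flops(d, n)
        print(f"d={d:6d}  MLP/attention FLOPs ratio = {m / a:.2f}")

As the width d grows, the d^2 terms dominate the n*d attention-score term, so the MLP's share of the compute grows.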

 

Large-model training shows emergent behavior: up to a certain number of training FLOPs a metric stays low, then past that point it suddenly jumps.

 

Optimizing the computational efficiency of large models matters because resources are limited: limited GPUs, limited GPU memory, limited data, limited time, and so on.

 

A rule of thumb: the number of training tokens should be roughly 20x the number of model parameters, i.e. a 1B-parameter model should be trained on about 20B tokens.
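
A small arithmetic sketch of this rule. The C ≈ 6·N·D estimate for training FLOPs is a standard approximation, added here as an assumption rather than something stated above:

    # tokens ≈ 20 × parameters (the rule of thumb above);
    # training compute C ≈ 6 * N * D FLOPs (standard approximation, assumed here).
    def training_budget(n_params):
        n_tokens = 20 * n_params
        flops = 6 * n_params * n_tokens
        return n_tokens, flops

    for n_params in [1e9, 7e9, 70e9]:
        tokens, flops = training_budget(n_params)
        print(f"{n_params / 1e9:5.0f}B params -> {tokens / 1e9:6.0f}B tokens, ~{flops:.1e} FLOPs")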

 

Alignment:

A base model is good at completing the next token; alignment makes the model actually useful.

Goals of alignment:

1. Get the language model to follow instructions.

2. Tune the style, format, tone, etc.

3. Safety: refuse harmful requests.

Two stages of alignment:

1. SFT: fine-tune the model to maximize p(response | prompt); essentially the same objective as pretraining (see the sketch after this list).

2. Learning from feedback:

1) Preference data: which of A or B is better.

2) Verifiers, LLM-as-judge.

3) Algorithms: PPO, DPO (only applicable to preference data), GRPO.
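
A minimal PyTorch-style sketch of the SFT objective from stage 1: maximize log p(response | prompt) by computing the usual next-token cross-entropy on response tokens only, with prompt tokens masked out. The function name and the -100 ignore-index convention are illustrative assumptions, not lecture code:

    import torch
    import torch.nn.functional as F

    def sft_loss(logits, input_ids, prompt_len):
        # logits:     (seq_len, vocab_size) model outputs for one example
        # input_ids:  (seq_len,) token ids of prompt + response
        # prompt_len: number of prompt tokens to exclude from the loss
        shift_logits = logits[:-1]                 # position t predicts token t+1
        shift_labels = input_ids[1:].clone()
        shift_labels[: prompt_len - 1] = -100      # mask prompt targets (ignore_index)
        return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

    # Toy usage with random logits standing in for a model.
    vocab, seq = 100, 12
    loss = sft_loss(torch.randn(seq, vocab), torch.randint(0, vocab, (seq,)), prompt_len=5)
    print(loss.item())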

 

Byte Pair Encoding (BPE)

Basic idea: train the tokenizer on raw text to automatically determine the vocabulary.

    Intuition: common sequences of characters are represented by a single token; rare sequences are represented by many tokens.

    The GPT-2 paper used word-based pre-tokenization to break the text into initial segments, then ran the original BPE algorithm within each segment.

    Sketch: start with each byte as a token, and successively merge the most common pair of adjacent tokens.
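
A toy Python sketch of that byte-level BPE training loop (it skips GPT-2's pre-tokenization step; names are my own):

    from collections import Counter

    def train_bpe(text, num_merges):
        # Start with one token per byte; repeatedly merge the most frequent adjacent pair.
        tokens = list(text.encode("utf-8"))
        merges = []                               # learned merge rules, in order
        next_id = 256                             # new ids start after the 256 byte values
        for _ in range(num_merges):
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]     # most frequent adjacent pair
            merges.append((best, next_id))
            merged, i = [], 0
            while i < len(tokens):                # replace every occurrence of the pair
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(next_id)
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            tokens = merged
            next_id += 1
        return merges, tokens

    merges, encoded = train_bpe("low low lower lowest", num_merges=5)
    print(merges)    # in this toy corpus the first merges build up "lo", then "low"
    print(encoded)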

 

 

 

 
