- Published on
NOV-24 Things I've Read
- Authors
- Name
- Joey00072
- @shxf0072
Small notes on things I've read in November 2024
Weights Don't Move
Weights don't change much from their initial configuration, which is usually random. So if you put some image in the weights of a neural network and train it, you can still see that image afterward. This also means the network is not finding all possible configurations, only ones close to initialization, so more advanced optimization techniques could make better, smaller networks.
But keep in mind, these were experiments on small models with ReLU activations, trained on MNIST with SGD, no Adam, no weight decay, so it's yet to be seen if this extrapolates to large networks.
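You can poke at this yourself. A minimal sketch (my code, not the paper's setup; random data stands in for MNIST) that trains a small ReLU net with plain SGD and measures how far the weights drift from their initialization:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
init = {n: p.detach().clone() for n, p in model.named_parameters()}

# plain SGD, no Adam, no weight decay, matching the paper's regime
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(512, 784), torch.randint(0, 10, (512,))  # stand-in for MNIST
for _ in range(200):
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()

# how far did each layer move relative to where it started?
for n, p in model.named_parameters():
    drift = (p - init[n]).norm() / init[n].norm()
    print(f"{n}: relative drift from init = {drift:.3f}")
```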
paper: https://arxiv.org/pdf/2012.02550
Latent Preference Optimization (LPO)
Problem: current LLMs rely on fixed-temperature decoding; we know high temperature is more creative and low temperature is more factual. Solution: have a dynamic temperature based on context.
This is a neat trick: they add another decoding head at the end (dim, vocab_size) and use it to predict temperature. Now we generate preference data with this dynamic temperature, have people or a model rank it, and DPO-finetune the model on those rankings. The loss for the temperature head can be written in the same closed form as DPO.
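Rough sketch of the idea (my simplification, not the paper's exact head; names and shapes here are assumptions): a small extra head on the final hidden state predicts a per-token temperature, which scales the logits before sampling.

```python
import torch
import torch.nn as nn

class TemperatureHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # softplus keeps the predicted temperature strictly positive
        return nn.functional.softplus(self.proj(hidden)) + 1e-4

dim, vocab_size = 512, 32000
lm_head = nn.Linear(dim, vocab_size)
temp_head = TemperatureHead(dim)

hidden = torch.randn(1, 8, dim)          # (batch, seq, dim) from the transformer
logits = lm_head(hidden)                 # (1, 8, vocab_size)
temperature = temp_head(hidden)          # (1, 8, 1), one temperature per token
probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs[0, -1], num_samples=1)
```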
paper: https://arxiv.org/abs/2411.15124
Tülu 3
The paper is nice and detailed, but the open code makes it delightful.
- SFT performance varies based on the random seed
- Adding a task-specific pretraining data mix to the SFT dataset improves performance
- The Tülu 3 repo is a good template for how to fine-tune an LLM
- Chat template impacts performance
- Random seed affects performance of SFT :)
- The length-normalized variant of DPO works best (loss sketched below)
- Duplication doesn't affect DPO performance
- Prompts not present in the SFT dataset lead to better DPO performance
- DPO on on-policy + off-policy data is best
- PPO is expensive and gives similar results to DPO
I can go on and on, but I'll stop here.
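For reference, the length-normalized DPO loss (my rendering of the standard formulation: each log-ratio is divided by response length, so longer responses aren't automatically favored):

$$
\mathcal{L} = -\log \sigma\!\left(\frac{\beta}{|y_w|}\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \frac{\beta}{|y_l|}\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)
$$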
Annotated History of Modern AI and Deep Learning
Schmidhuber laid out the history of deep learning and AI. Lots of ideas do resurface again and again.
TokenFormer
grift, a hard try at dressing an MLP up as attention
Implemented my version. Read the paper.
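To see why the snark, here's a minimal sketch of the token-parameter attention move (my simplification; it ignores TokenFormer's modified softmax normalization): the input attends over learnable parameter tokens, which is one matmul-softmax-matmul away from a plain two-layer MLP.

```python
import torch
import torch.nn as nn

class Pattention(nn.Module):
    """Keys and values are learnable parameters, not projections of the input."""
    def __init__(self, dim: int, num_param_tokens: int):
        super().__init__()
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim) / dim**0.5)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, dim) / dim**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = x @ self.key_params.T           # (..., num_param_tokens)
        weights = torch.softmax(scores, dim=-1)  # TokenFormer swaps this for a modified norm
        return weights @ self.value_params       # (..., dim)

# structurally: Linear -> nonlinearity -> Linear, i.e. an MLP block
x = torch.randn(2, 16, 512)
out = Pattention(512, 1024)(x)
print(out.shape)  # torch.Size([2, 16, 512])
```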
Scaling Laws for Precision
Quantization has a cost. With higher param counts everyone is trying to move to lower precision, but there is no free lunch: it's possible large models store ~2 bits per parameter, but that doesn't mean we can train at 2 bits. FP6 seems to be compute optimal, and 4-bit is the hard limit for quality.
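To make "no free lunch" concrete, a toy sketch (mine, not from the paper, and integer fake quantization rather than the floating-point formats they study): reconstruction error blows up as the bit width drops.

```python
import torch

def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
    # symmetric uniform quantization: round to 2^(bits-1)-1 signed levels
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

torch.manual_seed(0)
w = torch.randn(4096, 4096) * 0.02  # weight-like tensor
for bits in (8, 6, 4, 2):
    err = (w - fake_quant(w, bits)).pow(2).mean() / w.pow(2).mean()
    print(f"{bits}-bit relative MSE: {err:.2e}")
```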
The Impact of Depth on Compositional Generalization in Transformer Language Models
Deeper models have better compositional generalization and perform better on downstream tasks. But shallow models work almost as well and are faster to train and run inference on; large width is a better tradeoff than depth.
Physics in Next-token Prediction
If you want to see people throwing physics and math concepts at autoregressive models for absolutely no reason, this is the paper for you.