- Published on
NOV-24 Things I've Read
- Authors
- Name
- Joey00072
- @shxf0072
Small notes on things I've read in November 2024
Weights Don't Move
Weights don't change much from their initial configuration, which is usually random. So if you put some image in the weights of a neural network and train it, you can still see that image afterward. This also means the network is not finding all possible configurations, only ones close to initialization, so more advanced optimization techniques could make better, smaller networks.
But keep in mind, these were experiments on small models with ReLU activations, trained on MNIST with SGD, no Adam, no weight decay, so it's yet to be seen if this extrapolates to large networks.
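You can poke at this yourself. A minimal sketch (my code, not the paper's setup; random data stands in for MNIST) that trains a small ReLU net with plain SGD and measures how far the weights drift from their initialization:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
init = {n: p.detach().clone() for n, p in model.named_parameters()}

# plain SGD, no Adam, no weight decay, matching the paper's regime
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(512, 784), torch.randint(0, 10, (512,))  # stand-in for MNIST
for _ in range(200):
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()

# how far did each layer move relative to where it started?
for n, p in model.named_parameters():
    drift = (p - init[n]).norm() / init[n].norm()
    print(f"{n}: relative drift from init = {drift:.3f}")
```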
paper: https://arxiv.org/pdf/2012.02550
Latent Preference Optimization (LPO)
Problem: current LLMs rely on fixed-temperature decoding; we know high temperature is more creative and low temperature is more factual. Solution: have a dynamic temperature based on context.
This is a neat trick: they add another decoding head at the end (dim, vocab_size) and use it to predict temperature. Now we generate preference data with this dynamic temperature, have people or a model rank it, and DPO-finetune the model on those rankings. The loss for the temperature head can be written in the same closed form as DPO.
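Rough sketch of the idea (my simplification, not the paper's exact head; names and shapes here are assumptions): a small extra head on the final hidden state predicts a per-token temperature, which scales the logits before sampling.

```python
import torch
import torch.nn as nn

class TemperatureHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # softplus keeps the predicted temperature strictly positive
        return nn.functional.softplus(self.proj(hidden)) + 1e-4

dim, vocab_size = 512, 32000
lm_head = nn.Linear(dim, vocab_size)
temp_head = TemperatureHead(dim)

hidden = torch.randn(1, 8, dim)          # (batch, seq, dim) from the transformer
logits = lm_head(hidden)                 # (1, 8, vocab_size)
temperature = temp_head(hidden)          # (1, 8, 1), one temperature per token
probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs[0, -1], num_samples=1)
```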
paper: https://arxiv.org/abs/2411.15124
Tülu 3
The paper is nice and detailed, but the open code makes it delightful.
- SFT performance varies based on the random seed
- Adding a task-specific pretraining data mix to the SFT dataset improves performance
- The Tülu 3 repo is a good template for how to fine-tune an LLM
- Chat template impacts performance
- Random seed affects performance of SFT :)
- The length-normalized variant of DPO works best (loss sketched below)
- Duplication doesn't affect DPO performance
- Prompts not present in the SFT dataset lead to better DPO performance
- DPO on on-policy + off-policy data is best
- PPO is expensive and gives similar results to DPO
I can go on and on, but I'll stop here.
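For reference, the length-normalized DPO loss (my rendering of the standard formulation: each log-ratio is divided by response length, so longer responses aren't automatically favored):

$$
\mathcal{L} = -\log \sigma\!\left(\frac{\beta}{|y_w|}\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \frac{\beta}{|y_l|}\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)
$$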
Annotated History of Modern AI and Deep Learning
Schmidhuber laid out the history of deep learning and AI. Lots of ideas do resurface again and again.
TokenFormer
grift, a hard try at dressing an MLP up as attention
Implemented my version. Read the paper.
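To see why the snark, here's a minimal sketch of the token-parameter attention move (my simplification; it ignores TokenFormer's modified softmax normalization): the input attends over learnable parameter tokens, which is one matmul-softmax-matmul away from a plain two-layer MLP.

```python
import torch
import torch.nn as nn

class Pattention(nn.Module):
    """Keys and values are learnable parameters, not projections of the input."""
    def __init__(self, dim: int, num_param_tokens: int):
        super().__init__()
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim) / dim**0.5)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, dim) / dim**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = x @ self.key_params.T           # (..., num_param_tokens)
        weights = torch.softmax(scores, dim=-1)  # TokenFormer swaps this for a modified norm
        return weights @ self.value_params       # (..., dim)

# structurally: Linear -> nonlinearity -> Linear, i.e. an MLP block
x = torch.randn(2, 16, 512)
out = Pattention(512, 1024)(x)
print(out.shape)  # torch.Size([2, 16, 512])
```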
Scaling Laws for Precision
Quantization has a cost. With higher param counts everyone is trying to move to lower precision, but there is no free lunch: it's possible large models store ~2 bits per parameter, but that doesn't mean we can train at 2 bits. FP6 seems to be compute optimal, and 4-bit is the hard limit for quality.
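To make "no free lunch" concrete, a toy sketch (mine, not from the paper, and integer fake quantization rather than the floating-point formats they study): reconstruction error blows up as the bit width drops.

```python
import torch

def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
    # symmetric uniform quantization: round to 2^(bits-1)-1 signed levels
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

torch.manual_seed(0)
w = torch.randn(4096, 4096) * 0.02  # weight-like tensor
for bits in (8, 6, 4, 2):
    err = (w - fake_quant(w, bits)).pow(2).mean() / w.pow(2).mean()
    print(f"{bits}-bit relative MSE: {err:.2e}")
```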
The Impact of Depth on Compositional Generalization in Transformer Language Models
Deeper models have better compositional generalization and perform better on downstream tasks. But shallow models work almost as well and are faster to train and run inference on; large width is a better tradeoff than depth.
Physics in Next-token Prediction
If you want to see people throwing physics and math concepts at autoregressive models for absolutely no reason, this is the paper for you.