OCT-24 Things I've Read
Author: Joey00072 (@shxf0072)
small notes on things I've read in October 2024
Rethinking Softmax: Self-Attention with Polynomial Activations
Paper argues softmax is not a special operation; it works because it acts as a regularizer on the Frobenius norm (frobenius_norm = x.square().sum().sqrt()) of the attention matrix.
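A quick way to see the claim (my own toy illustration, not the paper's code): softmax rows sum to 1, so the Frobenius norm of the attention matrix is bounded by sqrt(n), while an unscaled polynomial activation has no such bound and needs its own regularization.

```python
# Toy illustration (not from the paper): Frobenius norm of a softmax attention
# matrix vs. a cubic-polynomial one on random queries/keys.
import torch

torch.manual_seed(0)
n, d = 128, 64
q, k = torch.randn(n, d), torch.randn(n, d)
scores = q @ k.T / d**0.5

softmax_attn = scores.softmax(dim=-1)  # rows sum to 1 -> frobenius norm <= sqrt(n)
poly_attn = scores.pow(3)              # polynomial activation, no built-in bound

def frobenius_norm(x):
    return x.square().sum().sqrt()

print(frobenius_norm(softmax_attn))  # stays small and bounded
print(frobenius_norm(poly_attn))     # blows up unless you add your own scaling
```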
IMO, this just proves that making models big and stabilizing training makes them just work; architecture doesn't matter as much. They also mention their method worked better for vision than for language, so softmax is still special :?
Yeah, the paper is good, but the conclusion should be: train a bigger model with a reasonable activation and it will work.
Sparse Crosscoders for Cross-Layer Features and Model Diffing
Anthropic is seriously cooking on the mech interpretability side. Crosscoders are a generalization of SAEs; a natural continuation of the idea that features are not stored in a single layer but across layers, and if they are linearly separable we can interpret them.
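A minimal sketch of how I picture a crosscoder (shapes, names, and the L1 penalty are my assumptions, not Anthropic's code): one shared sparse dictionary that reads from and reconstructs several layers at once.

```python
# Crosscoder sketch: shared sparse features across layers (my toy version).
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, n_layers: int, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.ModuleList([nn.Linear(d_model, d_dict, bias=False) for _ in range(n_layers)])
        self.dec = nn.ModuleList([nn.Linear(d_dict, d_model, bias=False) for _ in range(n_layers)])
        self.bias = nn.Parameter(torch.zeros(d_dict))

    def forward(self, acts):  # acts: list of [batch, d_model], one per layer
        # one shared feature vector: sum the per-layer encodings, single nonlinearity
        f = torch.relu(sum(e(a) for e, a in zip(self.enc, acts)) + self.bias)
        recons = [d(f) for d in self.dec]  # per-layer reconstructions from the same features
        return f, recons

model = Crosscoder(n_layers=4, d_model=512, d_dict=4096)
acts = [torch.randn(8, 512) for _ in range(4)]
f, recons = model(acts)
loss = sum((r - a).square().mean() for r, a in zip(recons, acts)) + 1e-3 * f.abs().mean()
```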
HOW TO EVALUATE REWARD MODELS FOR RLHF
Does the reward model lead to good post-RLHF language model performance?
YES, duh.
The paper introduces a new benchmark for reward models, Preference Proxy Evaluations (PPE). Nothing else special to point out, but a new benchmark is good; RewardBench is overused.
Entropix
xjdr and doomslide cooked up this awesome idea: sample from the LLM based on entropy and varentropy. LMs have an innate ability to express their uncertainty; use it to build a better sampler. It's an underexplored area, and maybe o1 uses some of this concept (just guessing). hrishioa put up a nice explanation blog, Entropixplained.
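Rough sketch of the two core quantities (my code, not the actual entropix sampler; thresholds are made up): entropy measures how spread out the next-token distribution is, varentropy measures how uneven that surprise is, and the sampler picks a strategy from both.

```python
# Entropy / varentropy of the next-token distribution, plus a toy strategy switch.
import torch
import torch.nn.functional as F

def entropy_varentropy(logits):
    logp = F.log_softmax(logits, dim=-1)
    p = logp.exp()
    ent = -(p * logp).sum(-1)                                   # expected surprisal
    varent = (p * (logp + ent.unsqueeze(-1)).square()).sum(-1)  # variance of surprisal
    return ent, varent

def sample(logits, low=0.5):
    ent, varent = entropy_varentropy(logits)
    if ent < low and varent < low:  # confident and consistent: just take argmax
        return logits.argmax(-1)
    # high entropy + high varentropy is where entropix would branch or inject a
    # "wait, let me think" token; here we just fall back to temperature sampling
    probs = F.softmax(logits / 0.7, dim=-1)
    return torch.multinomial(probs, 1).squeeze(-1)

logits = torch.randn(32000)  # fake next-token logits
print(sample(logits))
```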
THINKING LLMS - GENERAL INSTRUCTION FOLLOWING WITH THOUGHT
Paper was trying to ride the O1 hype train; they used DPO with CoT:
- LM generates n responses with a think prompt
- Judge model ranks the outputs (not the thoughts)
- DPO on the highest- vs. lowest-ranked response (including the CoT)
They named the method Thought Preference Optimization (TPO) because hype.
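The recipe as I understand it, as a hedged sketch (generate/judge here are placeholder callables, not the paper's code): score only the answers, but keep the thoughts in the chosen/rejected pair.

```python
# Build one DPO preference pair the TPO way (my sketch of the idea).
def build_dpo_pair(prompt, generate, judge, n=8):
    # generate(prompt) -> (thought, answer); judge(prompt, answer) -> float score
    samples = [generate(prompt) for _ in range(n)]
    ranked = sorted(samples, key=lambda s: judge(prompt, s[1]))  # judge sees only the answer
    worst, best = ranked[0], ranked[-1]
    # chosen/rejected keep the chain of thought even though it was never judged
    return {
        "prompt": prompt,
        "chosen": best[0] + best[1],
        "rejected": worst[0] + worst[1],
    }
```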
Sabotage evaluations for frontier models
Did they find anything harmful? NO. They put sabotage into 4 groups: human decision sabotage, code sabotage, sandbagging, and undermining oversight. I get the feeling they're trying to create a bit of paranoia and fear, but still, nice work by Anthropic; safety is important.
Meta Lingua
Nasty Grad Accumulation BUG
Zach Mueller put out a nice TLDR: https://x.com/TheZachMueller/status/1847021850586476919
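If I read the thread right, the gist is: taking the mean cross-entropy per micro-batch and then averaging across accumulation steps silently mis-weights micro-batches that contain different numbers of real (non-padded) tokens. A hedged sketch of the correct normalization (my code; model and micro_batches are placeholders):

```python
# Normalize the summed loss by the token count of the WHOLE accumulation window,
# not per micro-batch (the per-micro-batch mean is the bug).
import torch.nn.functional as F

def accumulation_step(model, micro_batches, optimizer):
    total_tokens = sum((mb["labels"] != -100).sum() for mb in micro_batches)
    for mb in micro_batches:
        logits = model(mb["input_ids"])
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            mb["labels"].view(-1),
            ignore_index=-100,
            reduction="sum",              # sum, not mean, inside each micro-batch
        )
        (loss / total_tokens).backward()  # divide by the global token count
    optimizer.step()
    optimizer.zero_grad()
```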
Efficient Dictionary Learning with Switch Sparse Autoencoders
Replace the SAE up/down (encoder/decoder) matrices with an MoE. It works, but I don't like it: you are essentially doing double gating, 1st the MoE gate and 2nd the activation gate, which is forced deactivation of some neurons. But it's computationally efficient, so it's a good option for the GPU poor or for extra-large models, I guess.
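Roughly what a Switch SAE forward pass looks like in my head (a sketch under my own assumptions, not the paper's code): gate #1 is the router picking one expert, gate #2 is the TopK activation inside that expert.

```python
# Switch SAE sketch: route each input to one expert dictionary, TopK inside it.
import torch
import torch.nn as nn

class SwitchSAE(nn.Module):
    def __init__(self, d_model=512, d_expert=2048, n_experts=8, k=32):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.enc = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.02)
        self.dec = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.02)
        self.k = k

    def forward(self, x):                    # x: [batch, d_model]
        expert = self.router(x).argmax(-1)   # gate 1: one expert per input
        f = torch.einsum("bd,bdh->bh", x, self.enc[expert])
        topv, topi = f.topk(self.k, dim=-1)  # gate 2: TopK activation
        f = torch.zeros_like(f).scatter_(-1, topi, topv.relu())
        return torch.einsum("bh,bhd->bd", f, self.dec[expert])

sae = SwitchSAE()
x = torch.randn(16, 512)
recon_loss = (sae(x) - x).square().mean()
```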
Differential Transformer
The idea is to create two attention matrices and subtract one from the other; this reduces attention noise and outliers. While theoretically this makes sense, I have doubts the performance gain is because of the diffing. Also, when you get to really long contexts, softmax forces you to have really small attention scores; I think the gain is more likely because of the RMS norm applied after attention (which they show as group norm in the paper for some reason). It looks like one of those cases where adding structure improves performance in the short term.
Claims look legit, and the code is open source, so I will test it and see what's going on here.
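The core op as a hedged sketch (not the official code; lambda is a fixed scalar here instead of their learned reparameterized version, and the final normalization is a stand-in for their per-head GroupNorm):

```python
# Differential attention: two softmax maps, subtract, normalize (toy version).
import torch
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam=0.8):
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)
    attn = a1 - lam * a2          # the diff: common-mode attention noise cancels out
    out = attn @ v
    return F.normalize(out, dim=-1) * d**0.5  # stand-in for the per-head norm

b, s, d = 2, 16, 64
q1, k1, q2, k2, v = (torch.randn(b, s, d) for _ in range(5))
print(diff_attention(q1, k1, q2, k2, v).shape)  # torch.Size([2, 16, 64])
```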
Fitting an Elephant with Four non-Zero Parameters
We can create an elephant with 4 parameters.
Is it useful? No, but it's fun. Isn't that why we do science?