Ludwig

Saturday Paper Reading

I thought it would be fun to dump the notes I took while catching up and reading through some of my bookmarks on this fine Saturday. The workflow was something like this:

Some of the readings were inspired by conversations I had here, here, here and there. Others are papers that have been sitting in my backlog for a bit, or that were recently sent to me.

The reading list

Today's program includes:

(I had already skimmed this one around the time it was released, but finally read it top to bottom today.)

(These two I wanted to refresh myself on, since I first read them when I had basically 0 clue about what I was reading. I was just force-feeding myself papers in the hopes I'd start understanding them.)

(Already read it a few times, this time I just jotted down some impressions I had when I read it once more today!)

Tomorrow's program is mostly composed of a few CS articles I saved up and wanna get through, but I also want to try and get rid of as many of these papers as possible:

Do Llamas Work in English? On the Latent Language of Multilingual Transformers

We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language -- a question of key importance for understanding how language models function and the origins of linguistic bias. Focusing on the Llama-2 family of transformer models, our study uses carefully constructed non-English prompts with a unique correct single-token continuation. From layer to layer, transformers gradually map an input embedding of the final prompt token to an output embedding from which next-token probabilities are computed. Tracking intermediate embeddings through their high-dimensional space reveals three distinct phases, whereby intermediate embeddings (1) start far away from output token embeddings; (2) already allow for decoding a semantically correct next token in the middle layers, but give higher probability to its version in English than in the input language; (3) finally move into an input-language-specific region of the embedding space. We cast these results into a conceptual model where the three phases operate in "input space", "concept space", and "output space", respectively. Crucially, our evidence suggests that the abstract "concept space" lies closer to English than to other languages, which may have important consequences regarding the biases held by multilingual language models.

I had skimmed that paper before, but felt like reading it thoroughly after posting this and being linked to it.

The abstract TL;DR'd: they take Llama 2, feed it a few translation tasks from one non-English language to another, then apply the unembedding matrix to the latents after N layers to see whether the model uses English as a pivot while translating from language X to Y.
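A minimal sketch of that unembedding trick (logit-lens-style decoding of intermediate layers), assuming a HuggingFace Llama-2 checkpoint; the prompt is a made-up translation prompt in roughly the paper's spirit, and device/dtype handling is omitted:

```python
# Decode each layer's hidden state for the final prompt token with the unembedding
# matrix, and print which token that layer "wants" to predict.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any Llama-2 checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = 'Français: "fleur" - Deutsch: "'  # hypothetical French -> German prompt
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, d_model]
for layer, h in enumerate(out.hidden_states):
    latent = model.model.norm(h[0, -1])   # final RMSNorm, as the output head expects
    logits = model.lm_head(latent)        # unembedding matrix
    print(f"layer {layer:2d} -> {tok.decode(logits.argmax().item())!r}")
```

If the paper's claim holds, the middle layers should put their mass on the English "flower" before the late layers snap to the German "Blume".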

What they find is that while the model does seem to use English as a pivot language, it's more that the "concept space" it thinks in is biased towards English: the abstract concept representations just happen to sit closer to English tokens because the training data contains a whole bunch of English.

The parts I struggled with:

I fed these explanations I put together to ChatGPT 4.5 and the summary is very clean:

Progress measures for grokking via mechanistic interpretability

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of “grokking” exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.

This one I just skimmed. I'm gonna be honest: there was a lot of math, I'd already had a long day, and I want to play MLB The Show 25.
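Even from a skim, the core trick the abstract describes is cute: the network treats addition mod p as composing rotations around a circle. A toy sketch of just that arithmetic (not the paper's code; the modulus 113 is the one the paper uses, the frequency k is an arbitrary placeholder):

```python
# Represent a residue a mod p as a rotation by angle w*a with w = 2*pi*k/p.
# Adding residues composes rotations, and the correct answer c = a + b (mod p)
# is the one that maximizes cos(w*(a + b - c)), which expands via trig identities
# into products of cos/sin terms -- the quantities the trained network computes
# in Fourier space.
import numpy as np

p = 113   # modulus of the modular addition task
k = 7     # placeholder "key frequency"; the trained models use a handful of them
w = 2 * np.pi * k / p

a, b = 40, 95
scores = np.array([np.cos(w * (a + b - c)) for c in range(p)])
print("argmax over c:", scores.argmax(), "| true answer:", (a + b) % p)
```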

Towards Automated Circuit Discovery for Mechanistic Interpretability

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process' steps: to identify the circuit that implements the specified behavior in the model's computational graph. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work.

Also skimmed. This one feels like the birth of modern mechanistic interpretability (as I know it, at least). 2023 is a lifetime ago in that field and there are tons of more recent papers from Anthropic, so I didn't spend too much time on it. Some notes:
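The core loop the abstract describes is activation patching: run the model on a clean prompt and a corrupted one, splice one component's activation from the corrupted run into the clean run, and measure how much your metric moves. A bare-bones sketch of that idea with plain PyTorch forward hooks (this is not the ACDC codebase; the module and metric in the usage comments are placeholders):

```python
# Cache one module's activation on a "corrupted" run, then overwrite that module's
# output on the "clean" run with it and check how much a chosen metric degrades.
import torch

def run_with_cache(model, inputs, module):
    cache = {}
    def save_hook(mod, inp, out):
        cache["act"] = out[0] if isinstance(out, tuple) else out
    handle = module.register_forward_hook(save_hook)
    with torch.no_grad():
        logits = model(**inputs).logits
    handle.remove()
    return logits, cache["act"]

def run_patched(model, clean_inputs, module, corrupted_act):
    def patch_hook(mod, inp, out):
        # returning a value from a forward hook replaces the module's output
        return (corrupted_act,) + out[1:] if isinstance(out, tuple) else corrupted_act
    handle = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        logits = model(**clean_inputs).logits
    handle.remove()
    return logits

# usage sketch (hypothetical names):
# module = model.transformer.h[9].attn   # e.g. one attention block in GPT-2 Small
# clean_logits, _ = run_with_cache(model, clean_inputs, module)
# _, corrupted_act = run_with_cache(model, corrupted_inputs, module)
# patched_logits = run_patched(model, clean_inputs, module, corrupted_act)
# effect = clean_logits[0, -1, answer_id] - patched_logits[0, -1, answer_id]
```

ACDC's contribution is automating the iteration of this over the edges of the model's computational graph, pruning the ones whose removal barely moves the metric.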

Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes

This one is an incredible banger that is right up my alley. Caleb put me onto it a couple of weeks ago and I've read it a couple of times already, including on the flight back from SF. It's from 2009, but hits on many of my core models of how things work. There are a few Wittgenstein-level hand-waves here and there that I didn't particularly care about, especially around the analogies to "subjective beauty", art, etc., hence why none of those concepts made it into my notes.

It basically claims that agents seek out data that has regularities that aren't yet known, that still has "patterns" yet to be compressed. This is what allows for improvements in prediction and compression. The drive rewards discovery of data whose encoding requires fewer and fewer bits over time, increasing some subjective "simplicity". It's deeply anchored in information theory and Kolmogorov complexity -- the simplest explanation (shortest program) is the most valuable, and it leads agents to seek out environments or experiences where they can keep making these simplifications.
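A toy version of that intrinsic reward, with zlib standing in for the agent's adaptive compressor (in the paper the compressor is the agent's own learned predictive model, not an off-the-shelf codec):

```python
# Schmidhuber's curiosity reward is proportional to compression progress: the number
# of bits the improved compressor saves on the history compared to the old one.
# Here two zlib settings stand in for "compressor before learning" and "after".
import zlib

def bits(data: bytes, level: int) -> int:
    return 8 * len(zlib.compress(data, level))

history = b"some partly regular stream of observations... " * 200

cost_before = bits(history, level=1)         # encoding cost under the weaker model
cost_after = bits(history, level=9)          # encoding cost under the improved model
intrinsic_reward = cost_before - cost_after  # "the number of saved bits"
print(intrinsic_reward)
```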

One of the interesting bits is this model of consciousness being "compression-driven": the paper makes the claim that consciousness naturally arises as the compressor (e.g. the brain) develops internal symbolic representations (including self-representations) in a quest to keep improving encoding efficiency. For the few who've read my thoughts on consciousness on x.com or heard me speak about it in spaces last year, you can probably tell how elegant that idea is to me. Schmidhuber basically claims consciousness is a "computationally straightforward byproduct of the ongoing compression process", an evident necessity to "create some sort of internal symbol or code representing the agent itself" to "efficiently encode the entire data history". The idea that consciousness is a simple byproduct of effective compression is something I've thrown around off-handedly (1, 2, 3, [4](https://x.com/ludwigABAP/status/1823677608854192390), [5](https://x.com/ludwigABAP/status/1858152990420431290)). This is also a very Joscha Bach-esque model of consciousness which I really buy (discovering Joscha Bach helped me get some of the necessary vocabulary to formulate my thoughts on this).

One of the things that ends up sticking out is that making your compression / prediction models better and better is effectively what underlies cognition, general intelligence, etc.

“Since short and simple explanations of the past usually reflect some repetitive regularity that helps to predict the future as well, every intelligent system interested in achieving future goals should be motivated to compress the history of raw sensory inputs in response to its actions, simply to improve its ability to plan ahead.”

“The agent should monitor the improvements of the adaptive data compressor: whenever it learns to reduce the number of bits required to encode the historic data, generate an intrinsic reward signal or curiosity reward signal in proportion to the learning progress or compression progress, that is, the number of saved bits.”

“Generally speaking we may say that a major goal of traditional unsupervised learning is to improve the compression of the observed data, by discovering a program that computes and thus explains the history… but is clearly shorter than the shortest previously known program of this kind.”

Another interesting thing that sticks out is how "valuable" noise is in this paper, due to this train of thought: noise defines the edge of compressibility, and is actually a signal for potential opportunities to progress compression further. Pure noise is not interesting, since it has no compression progress potential. Fully predictable data is also uninteresting, as it's already been compressed as efficiently as possible. So highly interesting data becomes the regions where apparent noise turns out to be compressible once your compression model improves. You have this sweet spot where data appears random at first and your predictive models cannot compress it well, but where deeper, hidden regularities get revealed and allow you to improve your model substantially.

Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

A very simple, short paper -- skimmed to get more food for thought re: circuit creation in transformers. It showcases how attention heads organically "stack" to solve the given chained-reasoning tasks pretty well.

I think it gives a good intuition for the kind of circuitry that attention heads can form for tasks where the setup is something like A = 7, B = A, C = B, D = C and the prompt is "what is the value of D?". In that case, Layer 1 Head 1 can retrieve and encode the information that A = 7 into the residual stream. Then, Layer 2 Head 1 accesses this encoded representation to encode that B = 7, Layer 3 Head 1 then encodes C = 7, and finally Layer 4 Head 1 uses these accumulated representations to correctly infer D = 7.
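For my own benefit, a quick sketch of how one might generate this kind of chained-assignment prompt to poke at a model (variable names, chain length, and phrasing are arbitrary here; the paper's exact task format may differ):

```python
# Build a prompt like "A = 7, B = A, C = B, D = C. What is the value of D?"
# along with the expected answer, for probing how deep a chain a model can resolve.
import random
import string

def make_chain_prompt(depth: int = 4, seed: int = 0):
    rng = random.Random(seed)
    names = list(string.ascii_uppercase[:depth])
    value = rng.randint(0, 9)
    assignments = [f"{names[0]} = {value}"]
    for prev, cur in zip(names, names[1:]):
        assignments.append(f"{cur} = {prev}")
    prompt = ", ".join(assignments) + f". What is the value of {names[-1]}?"
    return prompt, value

prompt, answer = make_chain_prompt(depth=4)
print(prompt, "->", answer)
```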

Note: while writing this for myself, and after realizing it might be fun to publish for anyone who is interested, I realized that paragraphs like this are actually dangerous because I make massive leaps in language when I'm writing to myself. Models don't actually encode symbolic statements like this, obviously. They operate exclusively numerically, on vectors. But since that's obvious to me (to everyone?!), I just skip stating the obvious.