How LLMs work

malwrar • 1 hour ago

Back when ChatGPT came out, I was so shocked by how _good_ it was for an “AI” product that I simply had to know how it worked. Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved next to each step on the board. I’d puzzle over each step along the way, and the triumph of completing the drawing came with a sense of deep understanding. I kept that drawing up for many months after, and would often gaze at it in wonder during meetings and idle moments.

This is to say: the autoregressive decoder-only transformer LLM architecture as pioneered by OpenAI is wildly simple for how revolutionary its results are. I was reading about non-learned classical SLAM systems (which use video + handcrafted math to produce 3D mappings of physical spaces while also locating the camera in those spaces) at the time, and comparatively speaking I’d say the math is about as complicated as ONE of the components in those complex formulations. The only reason frontier LLMs need 6-figure computers to run is that the model designers made the middle bit in those models REALLY BIG, dimensionally speaking. They just took the steam engine, made a few gargantuan versions of it, and are selling them as the ultimate source of power.

This was OpenAI’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities like being able to pick the best ending to a story/set of instructions or answer questions about broad factual knowledge. Meanwhile, I’ve been watching these AI companies attempt, successfully, to sell this capability as some sort of robot consciousness hand-crafted by supergeniuses. The fact that they are getting away with it is almost as shocking to me as the discovery itself.
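
To give a feel for how small the core math mentioned above is, here is a minimal sketch of single-head causal self-attention in plain Python. It is an illustration only, not any lab's actual implementation: the function names and the list-of-lists matrix layout are assumptions made for this example, and a real decoder block would also include multiple heads, an output projection, layer normalization, residual connections, and the feed-forward "middle bit".

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vec, mat):
    # Row vector (length n) times matrix (n rows, k columns) -> length k.
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def causal_self_attention(x, w_q, w_k, w_v):
    # x: list of token vectors, one per position.
    # w_q, w_k, w_v: d_model x d_head projection matrices (hypothetical names).
    q = [matvec(t, w_q) for t in x]
    k = [matvec(t, w_k) for t in x]
    v = [matvec(t, w_v) for t in x]
    d_head = len(q[0])
    out = []
    for i in range(len(x)):
        # Position i may only attend to positions 0..i (the causal mask).
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_head)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(d_head)])
    return out
```

Note the nested loop over positions `i` and `j`: every token is compared against every earlier token, which is where the quadratic cost in sequence length comes from.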

10GBps • 57 minutes ago

Yep. It's nearly identical to the neural nets we were using in the 90s. Back then even a supercomputer wasn't big enough or fast enough to do what we do today.

I have to wonder though. Is this all a human brain is? A similar thing to an LLM just scaled exponentially larger. I mean a brain is not just neurons with simple connections to each other. The neurons, axons, dendrites, <insert_unexplained_thing>, etc in a brain are all holding and processing information in different ways and doing it nearly 100% in parallel. That's a really big model.

The biological discoveries show how complex a biological brain actually is. Even the tiny brains in a bee or spider are able to solve puzzles and use tools. That's crazy.

foxes • 30 minutes ago

If it has all that extra complexity and capacity, then it's probably better not to reduce it by simply saying X is Y.

darksim905 • 52 minutes ago

For anyone who is curious about the first paragraph here, this is actually a great video overview of how LLMs work and the tokenization part.

Tangentially related: This part always seemed fuzzy to me, especially when dealing with data scientists and how they talk about how 'ML' looks at problems. I had this issue when working at a SIEM vendor where they kept going on about use case development having to be designed a certain way to catch things. It was all very frustrating.

jfim • 52 minutes ago

Indeed. It's pretty interesting to realize after implementing GPT-2 that the frontier models are scaled up versions of that, with various tweaks to improve performance, model-wise.

The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.

pkoird • 51 minutes ago

aka "the bitter lesson"

faurroar • 44 minutes ago

Architectures have evolved significantly since then. DeepSeek v4 =/= GPT-3. Even then, a great deal of complexity lies in everything surrounding the architectures e.g. how do you implement them performantly on modern accelerators, how do you distribute the model across a set of accelerators, how do you post-train, etc. And pre-training itself is a dark art. If you legitimately think that frontier labs are doing something equivalent to whatever you wrote on your whiteboard, you’re clueless.

jumploops • 32 minutes ago

Those are all just optimizations.

We still don’t really know why they work, we just know how to build them.

trollbridge • 27 minutes ago

We don't really know why language works with humans, either. If you raise a baby from birth, you kind of observe how it is learning language, but the process is also rather mysterious. My eldest son's first word was to actually imitate a cow mooing, and then after that to imitate a motor noise of a tractor or truck. And then after that a meow. (His first complete sentence was "King Graham fell"...)

My next child took a completely different path to language, including skipping all the non-verbal imitations.

And then at some point, you just suddenly can two-way communicate with them when you couldn't before, and then after that, they can engage in reasoning.

10GBps • 2 hours ago

I learned TCP/IP by watching and reading raw packets over packet radio at 1200 baud.

I've noticed the same thing is possible if you watch the output of a slow LLM. Eventually you start to see the machinery. Input tokens go in, output tokens come out; it's math. I can't exactly predict the tokens generated but I can see how they are formed. It's a lot like chess. You can't see every possible move but the mechanism is understandable.
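
The "machinery" being described is roughly a loop: score every possible next token, pick one, append it, repeat. A rough sketch of that loop follows, assuming a hypothetical `next_token_logits` callable standing in for the model; `toy_logits` is a made-up stand-in so the example runs, not a real model. Greedy decoding is used here to keep it deterministic.

```python
import math

def softmax(logits):
    # Turn raw scores into probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_token_logits, prompt, max_new_tokens, eos_id=None):
    # next_token_logits: hypothetical callable mapping a list of token ids to
    # one score per vocabulary entry. A real model would be a transformer.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(next_token_logits(tokens))
        # Greedy decoding: take the single most probable token. Real systems
        # usually sample from probs instead, which is one reason the exact
        # output is hard to predict even when the mechanism is clear.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

def toy_logits(tokens, vocab_size=5):
    # Stand-in "model": strongly prefers (last token + 1) mod vocab_size.
    target = (tokens[-1] + 1) % vocab_size
    return [5.0 if i == target else 0.0 for i in range(vocab_size)]
```

Each generated token is fed back in as input for the next step, which is what "autoregressive" means.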

trollbridge • 29 minutes ago

Comment <-> username synergy.

andai • 3 hours ago

I couldn't load the article directly due to an SSL issue, so here's the archive link:

https://archive.ph/aWtFG

whateveracct • 29 minutes ago

accidentally quadratic

codeakki • 41 minutes ago

What's the point of this? I'm not here to engage with AI bots.

lhd1 • 2 hours ago

I find it difficult to engage with AI-generated text. What am I getting here that I couldn't get from a chatbot?

dialsMavis • 1 hour ago

Is this text generated by AI? I couldn't tell but I'd believe it if it was.

I imagine if resources were spent writing this text then one benefit of using it is not using more resources or the pollution caused from a chatbot.

zemo • 37 minutes ago

Normal people talk and write with some notion of meter, the cadence of communicating where pauses are inserted at places that naturally suit the speaker (and listener) to pause for thought. LLMs don't really do that, they just write a bunch of sentences.

> Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. One neuron might activate strongly on Eiffel-Tower-related text. Another on programming languages. Another on past-tense verbs.

People don't really write like this and they don't really talk like this (and no, people don't necessarily write exactly how they talk because they don't read exactly how they listen; the written word can be backtracked while the heard cannot, and speakers/writers know this, either consciously or unconsciously). A person would probably structure this more like:

> Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. For example, there could be one neuron that activates strongly on Eiffel-Tower-related text, another that activates strongly on programming languages, a third neuron activating on past-tense verbs, and so on.

Usually people wouldn't write "Another on programming languages." as a standalone sentence like that because the periods introduce an unnatural pause like they're giving a TED talk, unless of course they were punctuating that way for effect, but you'd essentially never communicate with that effect full time.

rippeltippel • 1 hour ago

The voice of several passages resembles ChatGPT very closely.

blackoil • 1 hour ago

Hopefully someone has asked the right questions and removed confusing answers/hallucinations.

singpolyma3 • 3 hours ago

Next do "why LLMs work"

krackers • 60 minutes ago

See Tegmark's "Why does deep and cheap learning work so well?" (well, not so cheap anymore...)

https://www.youtube.com/watch?v=5MdSE-N0bxs is remarkably prescient given that it was written before LLMs

sheeshkebab • 3 hours ago

considering they work with any architecture/configuration given enough compute, just more or less efficiently - then maybe it's fundamental, in the same sense as why electricity works...

soupspaces • 2 hours ago

Universal approximation theorem, embeddings, self-attention, gradient descent. And empirically, scaling laws.

skydhash • 2 hours ago

Why does linear regression work? Why do computers work? Because it's about math and encoding information. If we can encode words as numbers, then why can't we encode their order as a relation? It's just that neural networks are very apt at finding that relation even if it's noisy.
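
As a toy illustration of "encode words as numbers, then encode their order as a relation", here is a sketch that maps words to integer ids and counts which id follows which. This is a simple bigram table, not a neural network, and the helper names are made up for the example; a neural net learns a far richer, noise-tolerant version of the same kind of relation.

```python
from collections import Counter, defaultdict

def build_vocab(words):
    # Assign each distinct word an integer id in order of first appearance.
    vocab = {}
    for w in words:
        vocab.setdefault(w, len(vocab))
    return vocab

def bigram_counts(ids):
    # Record, for each token id, how often each other id follows it.
    follows = defaultdict(Counter)
    for prev, nxt in zip(ids, ids[1:]):
        follows[prev][nxt] += 1
    return follows

def most_likely_next(follows, token_id):
    # Return the id seen most often after token_id, or None if never seen.
    if token_id not in follows:
        return None
    return follows[token_id].most_common(1)[0][0]
```

For example, on the word list `"the cat sat on the cat mat".split()`, the id for "the" is always followed by the id for "cat", so that is what `most_likely_next` returns for it.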