
LLM fundamentals

The origins: neural networks

Before we had LLMs, we had neural networks. The concept goes back to the 1950s, inspired by how neurons in the brain connect and communicate. The basic idea is simple: create a network of interconnected nodes that can learn patterns from data.

Think of it like this: imagine teaching a child to recognize cats. You show them many pictures, saying "cat" or "not cat" for each one. Eventually, they learn to spot the patterns: pointy ears, whiskers, four legs. Neural networks learn the same way, but with numbers instead of pictures.

What's a neural network?

A neural network is a computer program organized in layers of connected nodes. Each connection has a weight (a number) that determines how much influence one node has on another. During training, these weights adjust to recognize patterns in the data.

How neural networks learn

The next few paragraphs explain in detail how neural networks discover and cluster features. For a quicker read, you can skip to "The hardware revolution" section. Essential point: neural networks automatically learn to recognize patterns by adjusting weights, organizing similar inputs into clusters, and building increasingly abstract features across layers.

Here's what makes neural networks powerful: they automatically discover features that matter. When you train a network to recognize cats, you don't tell it to look for whiskers or pointy ears. Instead, the network discovers these features on its own by adjusting its weights.

The network organizes similar patterns into clusters. Show it a picture of a tabby cat, and certain neurons activate. Show it a different tabby cat, and similar neurons activate, even though the pictures aren't identical. The network learned that these images share common features.

This clustering happens across layers. Early layers might recognize simple patterns like edges and corners. Middle layers combine these into more complex features like shapes and textures. Deep layers recognize high-level concepts like "this is a cat face." Each layer feeds its output to the next layer, building increasingly abstract representations.

When you later show the network a cat it has never seen before, it activates neurons similar to those used for other cats. The network recognizes the new cat because it shares features with cats in the training data. This generalization, recognizing patterns in new data based on learned features, is what makes neural networks useful.

Early neural networks were limited. They could handle simple tasks like recognizing handwritten digits, but they couldn't process complex patterns or understand language. The architecture was too simple, the data was too limited, and the computers were too slow.

For decades, neural networks remained mostly a research curiosity. They worked in theory but failed to deliver practical results for complex problems. That all changed in the 2010s when three critical pieces fell into place.

The hardware revolution

Neural networks need massive computational power. Each training run involves billions of calculations, adjusting weights across millions of connections. For years, CPUs handled this work, but they weren't designed for the kind of parallel processing that neural networks require.

Then came GPUs. Originally built to render graphics for video games, GPUs excel at doing many calculations simultaneously. Researchers realized that the same parallel processing that renders realistic game graphics could also train neural networks much faster.

GPU vs CPU

A CPU is like a single very fast worker who tackles tasks one at a time. A GPU is like thousands of slower workers who can all work simultaneously. For neural networks, having many workers doing simple calculations in parallel beats having one fast worker doing them sequentially.
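As a rough illustration of that difference (not a real GPU benchmark), here is a minimal Python sketch comparing a one-value-at-a-time loop with NumPy's vectorized equivalent, which applies the same operation across a whole array at once; the array size is an arbitrary choice:

```python
import time
import numpy as np

# A toy "weighted sum" over a million inputs.
weights = np.random.rand(1_000_000)
inputs = np.random.rand(1_000_000)

# Sequential: one multiplication at a time, like a single fast worker.
start = time.perf_counter()
total = 0.0
for w, x in zip(weights, inputs):
    total += w * x
print(f"loop:       {time.perf_counter() - start:.3f}s")

# Vectorized: the whole array in one call, like many workers at once.
start = time.perf_counter()
total = np.dot(weights, inputs)
print(f"vectorized: {time.perf_counter() - start:.3f}s")
```

Even on a CPU, the vectorized version is dramatically faster; GPUs push the same idea much further by running thousands of these simple operations at the same time.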

Deep dive ahead

The next few paragraphs explain the technical details of why CPUs and GPUs are designed differently. If you just want the basics, you can skip to "The big data explosion" section below. The key takeaway: GPUs have thousands of simple cores optimized for parallel work, while CPUs have fewer complex cores optimized for varied tasks.

Why are CPUs and GPUs designed so differently? It comes down to their intended use and the tradeoffs involved.

CPUs are designed for versatility. They handle diverse tasks: running your operating system, managing files, executing complex logic with many branches and decisions. This requires sophisticated features like branch prediction, out-of-order execution, and large caches. These features make individual CPU cores fast and flexible, but they consume space and power. A modern CPU typically has 4 to 32 cores because each core is complex and expensive.

GPUs, by contrast, are designed for repetitive calculations. Rendering graphics means doing the same operation on millions of pixels: calculate color, apply texture, determine lighting. This doesn't require complex logic or decision-making. It requires doing simple math operations over and over on different data.

So GPU designers made a tradeoff: they stripped out the complex features that make CPU cores fast at varied tasks, and instead packed thousands of simpler cores onto the chip. Each core is slower and less flexible than a CPU core, but when you need to do the same operation thousands of times, having thousands of cores wins.

Why can't we have thousands of fast, complex cores? Physics and economics. Each transistor produces heat. More transistors mean more heat. At some point, you can't cool the chip effectively. Complex cores also require more power. There's a limit to how much power you can deliver to a chip and how much heat you can remove.

Additionally, complex cores take up more space on the chip. If each core had all the sophisticated features of a CPU core, you could only fit a handful on a chip. GPU designers chose quantity over individual quality because graphics (and neural networks) benefit more from parallelism than from per-core speed.

The shift from CPUs to GPUs reduced training time from months to days, sometimes hours. Tasks that were previously impractical became feasible. This wasn't just an incremental improvement; it was transformational.

Cloud providers started offering GPU access by the hour. You no longer needed to buy expensive hardware to experiment with neural networks. This democratized AI research, allowing smaller teams and individual researchers to train models that previously only tech giants could afford.

The big data explosion

Neural networks learn from examples. Show them enough examples, and they learn the patterns. But "enough" is a lot more than anyone initially expected. Small datasets produced models that memorized examples rather than learning general patterns.

Then the internet provided the solution: massive amounts of data. GitHub hosted millions of code repositories. Wikipedia contained detailed articles on nearly every topic. Websites, books, research papers, and forums created an unprecedented collection of human knowledge, all digitized and accessible.

Companies started scraping this data. They downloaded entire sections of the internet, GitHub repositories, digitized books, academic papers, and public conversations. This created training datasets containing billions, sometimes trillions, of words.

What does training mean?

Training is the process of showing a neural network many examples and adjusting its internal weights to minimize errors. For language models, this means feeding it massive amounts of text and teaching it to predict what word comes next. The model that emerges has learned statistical patterns about how language works.
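To make "statistical patterns" concrete, here is a deliberately tiny sketch (a counting model, not how real LLMs are trained) that learns which word tends to follow which in a made-up corpus and predicts the most frequent successor:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus; real pre-training uses billions of words.
corpus = "the cat sat on the mat the cat sat on the chair".split()

# Count which word follows which.
successors = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    successors[current_word][next_word] += 1

def predict_next(word):
    """Return the word most often seen after `word`."""
    counts = successors[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' (seen twice, vs 'mat' and 'chair' once each)
print(predict_next("sat"))  # 'on'
```

Real models replace these raw counts with billions of learned weights, but the objective is the same: given the words so far, predict what comes next.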

The combination of better hardware and massive datasets changed what was possible. Models started showing emergent behaviors: capabilities that weren't explicitly programmed but arose from the patterns in the data. They could answer questions, write coherent paragraphs, and even generate working code.

The breakthrough: transformers

In 2017, researchers at Google published a paper titled "Attention Is All You Need." It introduced the transformer architecture, which became the foundation for modern LLMs.

Technical explanation ahead

The next few paragraphs explain why previous neural networks struggled with long sequences. If you want just the essentials, skip to "The architecture also scaled better" below. Key point: old networks compressed information into a fixed-size summary that lost details, while transformers can see all words simultaneously.

Previous neural networks processed text sequentially, word by word, like reading a sentence from left to right. These were called recurrent neural networks (RNNs). At each step, they combined the current word with a hidden state carrying information from previous words.

This created problems. The hidden state acts like a summary of everything seen so far, compressed into a fixed-size vector of numbers. As the sentence gets longer, this summary must capture more information in the same limited space. Important details from early in the sentence get progressively diluted or overwritten by newer information.

Think of it like a game of telephone played with yourself. You read the first few words and create a mental summary. Then you read the next word and update your summary. By word 50, your summary has been updated so many times that details from word 3 have faded. The network faces the same problem: compressing a long sequence into a fixed-size hidden state means losing information.

Researchers tried various solutions like LSTM (Long Short-Term Memory) networks, which were better at retaining information, but they still struggled with very long sequences. The fundamental issue remained: processing sequentially meant relying on a compressed representation that bottlenecked information flow.
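A minimal sketch of that bottleneck (the sizes and random weights are arbitrary): no matter how long the sequence, everything the network "remembers" must fit into one fixed-size hidden vector.

```python
import numpy as np

hidden_size, embed_size = 8, 4                         # tiny, arbitrary sizes
W_x = np.random.randn(hidden_size, embed_size) * 0.1
W_h = np.random.randn(hidden_size, hidden_size) * 0.1
b = np.zeros(hidden_size)

def rnn_step(hidden, word_vector):
    # The new hidden state mixes the current word with the old summary.
    return np.tanh(W_x @ word_vector + W_h @ hidden + b)

sequence = [np.random.randn(embed_size) for _ in range(50)]  # 50 "words"
hidden = np.zeros(hidden_size)
for word_vector in sequence:
    hidden = rnn_step(hidden, word_vector)

# After 50 words, the entire history is whatever survives in these 8 numbers.
print(hidden.shape)  # (8,)
```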

Transformers changed this. Instead of processing words one at a time, they process entire sequences simultaneously. They can look at all words at once and understand how they relate to each other, regardless of their position in the text. No compression into a hidden state, no forgetting early context.

The architecture also scaled better than previous approaches. You could make transformers larger by adding more layers and parameters, and their performance kept improving. Previous architectures hit limits where making them bigger didn't help. Transformers didn't have that problem.

What's an architecture?

In this context, architecture means the structure and organization of the neural network. It defines how nodes connect, how data flows through the network, and what calculations happen at each step. Different architectures solve different problems, and transformers proved ideal for processing language.

How attention mechanisms work

The key innovation in transformers is the attention mechanism. This is what allows the model to understand relationships between words, even when they're far apart in a sentence.

Think of it this way: when you read the sentence "The trophy doesn't fit in the suitcase because it's too big," you need to figure out what "it" refers to. Is "it" the trophy or the suitcase? You naturally consider both options and use context to decide.

Attention mechanisms do something similar. For each word, the model calculates how much attention to pay to every other word in the sequence. This creates a map of relationships: which words are relevant to which other words.

In our example, when processing "it's," the model assigns high attention to both "trophy" and "suitcase," then uses the surrounding context ("too big") to determine that "it" most likely refers to the trophy: a trophy that is too big won't fit. Had the sentence ended with "too small," the same mechanism would point to the suitcase instead.

Self-attention explained

Self-attention means the model attends to different parts of its own input. It compares each word against every other word to understand their relationships. These comparisons happen in parallel across the entire sequence, which makes the mechanism both powerful and well suited to GPU hardware.
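Here is a minimal NumPy sketch of the core calculation, scaled dot-product self-attention (the projection matrices are random stand-ins; in a real model they are learned during training):

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """x: (sequence_length, d_model) array of word representations."""
    queries, keys, values = x @ W_q, x @ W_k, x @ W_v
    # How strongly should each word attend to every other word?
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    attention_weights = softmax(scores)        # each row sums to 1
    return attention_weights @ values          # weighted mix of the other words

d_model = 16
x = np.random.randn(11, d_model)               # 11 words, e.g. the trophy sentence
W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)  # (11, 16): one new vector per word
```

The `attention_weights` matrix is exactly the "map of relationships" described above: entry (i, j) says how much word i draws on word j when building its new representation.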

Detailed architecture explanation

The following sections dive deep into how transformer layers work internally. If you prefer a lighter understanding, skip to "From GPT to modern LLMs" below. Main idea: transformers stack many layers that each refine word representations, with early layers learning basic patterns and deep layers learning complex meanings.

This attention happens in multiple layers stacked on top of each other. Understanding how layers work is crucial to understanding both classical neural networks and transformers.

In a neural network, layers process data sequentially. The input layer receives raw data (like pixel values or word representations). This data passes through one or more hidden layers that transform it. Finally, an output layer produces the result (like "this is a cat" or "the next word should be 'mat'").

Each layer takes the output from the previous layer as its input, applies some mathematical transformations, and produces an output for the next layer. In classical neural networks, each node in a layer connects to all nodes in the previous layer. The node multiplies each input by a weight, sums these products, adds a bias term, and applies an activation function to produce its output.

Think of it like an assembly line. Raw materials enter at one end. Each station transforms the materials based on what the previous station produced. The final station outputs the finished product. The "weights" are like the specific settings at each station that determine how the transformation happens.
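A minimal sketch of that per-node calculation for one fully connected layer (the sizes and the ReLU activation are arbitrary choices for illustration):

```python
import numpy as np

def dense_layer(inputs, weights, biases):
    # Each output node: a weighted sum of all inputs, plus a bias,
    # passed through an activation function (here, ReLU).
    return np.maximum(0, inputs @ weights + biases)

inputs = np.array([0.5, -1.2, 3.0])          # 3 values from the previous layer
weights = np.random.randn(3, 4) * 0.1        # 3 inputs feeding 4 nodes
biases = np.zeros(4)
print(dense_layer(inputs, weights, biases))  # 4 outputs for the next layer
```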

Transformers use a different layer structure. Each transformer layer has two main components: an attention sublayer and a feed-forward sublayer. The attention sublayer takes all the word representations and produces new representations based on how words relate to each other. Then the feed-forward sublayer processes each word's representation independently, applying the same transformation to each position.

Crucially, transformers add the input to each sublayer's output before passing it to the next sublayer. This is called a residual connection. It helps information flow through many layers without degrading. Without residual connections, deep networks struggle because gradients (used during training to adjust weights) become vanishingly small as they propagate backward through many layers.
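A minimal PyTorch-flavored sketch of one such layer, using the library's built-in multi-head attention; real implementations also apply layer normalization and dropout, which are omitted here to keep the two sublayers and the residual connections visible:

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Attention sublayer, then a residual connection: add the input back in.
        attended, _ = self.attention(x, x, x)
        x = x + attended
        # Feed-forward sublayer (same transformation at every position) + residual.
        return x + self.feed_forward(x)

x = torch.randn(1, 11, 64)          # batch of 1, 11 words, 64 numbers per word
print(TransformerLayer()(x).shape)  # torch.Size([1, 11, 64])
```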

Modern LLMs stack dozens of transformer layers. GPT-3 has 96 layers. Each layer refines the representations from the previous layer. Early layers might focus on basic patterns like which words appear together. Middle layers might capture grammatical relationships. Deep layers understand complex semantic relationships and context.

The output of the final layer contains rich representations of each word that capture not just the word itself but its meaning in context, its relationships to other words, and relevant information from the entire sequence. These representations then feed into a final prediction layer that produces the model's output.


From GPT to modern LLMs

GPT stands for Generative Pre-trained Transformer. The first version, GPT-1, arrived in 2018 with 117 million parameters. It could generate coherent text but wasn't particularly impressive. GPT-2 came next with 1.5 billion parameters and showed notable improvements.

What are parameters?

Parameters are the weights in the neural network that the model adjusts during training. More parameters generally mean the model can capture more complex patterns, though there are diminishing returns and practical limits on how large models can grow.
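In code, "parameters" is simply the total count of those weights and biases. A quick sketch with a toy PyTorch model (the layer sizes are arbitrary):

```python
import torch.nn as nn

# A toy two-layer network; an LLM has billions of these numbers.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # (128*256 + 256) + (256*10 + 10) = 35,594
```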

Then came GPT-3 in 2020 with 175 billion parameters. This was a quantum leap. It could write essays, answer questions, translate languages, and generate code with minimal examples. It showed that scaling up transformers continued to improve performance in ways that smaller models couldn't match.

Here's how major LLMs have evolved in terms of scale:

| Model | Year | Parameters | Training tokens (est.) | Increase from previous |
| --- | --- | --- | --- | --- |
| GPT-1 | 2018 | 117M | ~5B | - |
| GPT-2 | 2019 | 1.5B | ~40B | 13x params, 8x tokens |
| GPT-3 | 2020 | 175B | ~300B | 117x params, 7.5x tokens |
| GPT-4 | 2023 | ~1.8T (rumored) | ~13T (estimated) | ~10x params, ~43x tokens |
| LLaMA | 2023 | 7B-65B | 1.4T | N/A (different approach) |
| LLaMA 2 | 2023 | 7B-70B | 2T | Similar params, 1.4x tokens |
| Claude 2 | 2023 | Undisclosed | Undisclosed | N/A |
| Claude 3 | 2024 | Undisclosed | Undisclosed | N/A |
| Gemini 1.0 | 2023 | Undisclosed | Undisclosed | N/A |

Why undisclosed?

Many companies no longer publish exact parameter counts or training data sizes. This is partly competitive strategy and partly because these numbers have become less meaningful as training techniques and data quality have improved. A well-trained smaller model can outperform a poorly trained larger one.

The pattern was clear for several years: bigger models trained on more data performed better. This sparked a race to build ever-larger models. GPT-4, Claude, Gemini, and other modern LLMs followed, each pushing the boundaries of what language models could do.

Research details ahead

The next section discusses scaling research and diminishing returns. For a lighter read, skip to "Training and fine-tuning" below. Core takeaway: bigger models improve performance but with diminishing returns, and recent research shows that training smaller models on more data can be more efficient.

But there's a catch. Scaling doesn't work indefinitely. Research has shown that improvements follow predictable scaling laws, but with diminishing returns. This is captured in papers like "Scaling Laws for Neural Language Models" by Kaplan et al. (2020).

The research shows that a model's loss falls off as a power law in each of three factors: model size (parameters), dataset size (tokens), and training compute. However, the improvement rate slows as you scale. Doubling the parameter count doesn't double performance; each additional parameter buys less improvement as models grow larger.
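In the paper's notation, each relationship takes roughly this power-law form, where N is parameter count, D is dataset size in tokens, C is training compute, and N_c, D_c, C_c, and the alpha exponents are constants fitted to experiments (this is a simplified restatement; each law applies when the other two factors aren't the bottleneck):

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```

The fitted exponents are small (on the order of 0.05 to 0.1 in the paper), which is exactly the diminishing-returns behavior described above: you have to multiply the model size several times over to get a modest drop in loss.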

There are also practical limits. Larger models require more memory to run, more computation to train, and more energy. At some point, the cost and complexity outweigh the marginal benefits. A 1 trillion parameter model might be only marginally better than a 500 billion parameter model, but it could cost twice as much to train and run.

More recent research suggests that training smaller models on more data, for longer, can be more efficient than simply scaling up parameters. This is the insight behind models like LLaMA, which achieved competitive performance with fewer parameters by using more training data and better training techniques.

The current consensus is that there's no single magic number. The optimal model size depends on your use case, available compute, inference requirements, and how much data you can effectively use for training. Blindly adding parameters doesn't guarantee better results.

These models aren't specialized for specific tasks. They're general-purpose text processors trained on diverse data. The same model that writes poetry can also debug code, explain scientific concepts, or draft business emails. This versatility comes from the breadth of their training data and the flexibility of the transformer architecture.

Training and fine-tuning

Training an LLM happens in stages. The first stage is pre-training, where the model learns language patterns from massive datasets. This is expensive and time-consuming, often requiring months of computation on thousands of GPUs.

During pre-training, the model learns to predict the next word in a sequence. Show it "The cat sat on the" and it learns that "mat," "floor," or "chair" are likely next words. Do this billions of times across diverse text, and the model develops a statistical understanding of language.
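A minimal PyTorch sketch of that objective (the vocabulary size, context length, and model are made up and far too small to be realistic): the model scores every possible next token, and its weights are nudged so the token that actually came next gets a higher score.

```python
import torch
import torch.nn as nn

vocab_size, d_model, context_len = 1000, 64, 5       # tiny, made-up sizes
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Flatten(),
                      nn.Linear(context_len * d_model, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

context = torch.randint(0, vocab_size, (1, context_len))  # "The cat sat on the" as token ids
target = torch.randint(0, vocab_size, (1,))               # the token that actually came next

logits = model(context)                           # a score for every word in the vocabulary
loss = nn.functional.cross_entropy(logits, target)
loss.backward()                                   # how should each weight change?
optimizer.step()                                  # nudge the weights in that direction
```

Repeat this step billions of times over real text, and the weights settle into a statistical model of language.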

Pre-training vs fine-tuning

Pre-training teaches the model general language understanding using massive datasets. Fine-tuning adapts the pre-trained model for specific tasks or behaviors using smaller, targeted datasets. Fine-tuning is faster and cheaper than pre-training from scratch.

After pre-training comes fine-tuning. This adapts the model for specific uses. For example, you might fine-tune a model to be more helpful as a chatbot, to follow instructions better, or to specialize in a particular domain like medicine or law.

Fine-tuning uses smaller datasets with examples of desired behavior. You show the model questions and good answers, tasks and correct completions. The model adjusts its weights to match these examples while retaining its general language understanding from pre-training.
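Concretely, a fine-tuning dataset is often just a list of prompt-and-response pairs. A hypothetical sketch (field names vary between providers; these examples are invented for illustration):

```python
# Hypothetical instruction-tuning examples: a prompt paired with a desired response.
fine_tuning_examples = [
    {"prompt": "Summarize: The meeting has moved to Thursday at 3pm.",
     "response": "The meeting is now on Thursday at 3pm."},
    {"prompt": "Write a polite one-sentence reply declining a dinner invitation.",
     "response": "Thank you so much for the invitation, but I won't be able to make it."},
]
```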

Modern LLMs often go through multiple rounds of fine-tuning. First, they're fine-tuned to follow instructions. Then they might be fine-tuned based on human feedback, learning which responses people prefer. This iterative refinement produces models that feel more helpful and aligned with human expectations.

Context windows and memory constraints

When you interact with an LLM, everything you've said in the conversation exists in the context window. This is the model's working memory, the text it can see and reference when generating a response.

Context windows have limits, measured in tokens. A token is roughly a word or part of a word. Early models had context windows of a few thousand tokens. Modern models have pushed this to hundreds of thousands, even millions, but there's always a limit.

What's a token?

A token is the basic unit of text that LLMs process. Common words are single tokens, while longer or uncommon words might split into multiple tokens. For example, "understanding" might be one token, while "understanding's" might be two. Models count tokens, not characters or words.

Once you exceed the context window, the model starts forgetting earlier parts of the conversation. It can't remember what it can't see. This creates challenges for long conversations or tasks requiring extensive context.

Think of it like having a conversation where you can only remember the last few minutes. You might lose track of earlier topics or repeat yourself. LLMs have the same limitation. They don't truly remember previous conversations unless that text is included in the current context window.

Developers work around this by summarizing conversations, extracting key information, or using external memory systems. But fundamentally, the context window represents a hard limit on how much the model can consider at once.
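A minimal sketch of that workaround, assuming the `tiktoken` tokenizer library for counting (the encoding name and token budget are arbitrary choices): older messages are dropped until the conversation fits the window.

```python
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")    # one common tokenizer; an assumption

def trim_to_window(messages, max_tokens=4000):
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for message in reversed(messages):            # walk from newest to oldest
        tokens = len(encoder.encode(message))
        if used + tokens > max_tokens:
            break                                 # everything older is forgotten
        kept.append(message)
        used += tokens
    return list(reversed(kept))

conversation = ["Hi, can you help me plan a trip?", "Sure! Where to?", "Japan in April."]
print(trim_to_window(conversation, max_tokens=4000))
```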


Why we call it AI

After all this, you might wonder why we call these systems artificial intelligence when they're really just pattern-matching prediction machines. The answer is partly historical, partly aspirational, and partly marketing.

The term artificial intelligence dates back to the 1950s, encompassing any computer system that mimics intelligent behavior. Early AI included simple rule-based systems that played chess or proved mathematical theorems. These weren't intelligent in any human sense, but they performed tasks we associate with intelligence.

LLMs fit this broad definition. They perform tasks that seem intelligent: answering questions, writing essays, solving problems. The fact that they do this through statistical patterns rather than reasoning doesn't change the perception that they're doing something intelligent.

Narrow vs general AI

LLMs are narrow AI: they excel at specific tasks (processing and generating text) but lack general intelligence. General AI, which would match or exceed human intelligence across all domains, doesn't exist yet. When people worry about AI taking over, they're usually thinking of general AI, not current LLMs.

There's also the problem that "Large Language Model" is accurate but clunky. "AI" is shorter, more marketable, and more familiar to non-technical audiences. So companies use "AI" even when technically they're describing LLMs or other narrow AI systems.

Understanding this distinction matters. It helps set realistic expectations. LLMs are powerful tools with specific capabilities and limitations. They're not thinking machines, and they're not approaching consciousness. They're sophisticated pattern matchers that happen to be very good at processing language.