The underlying ideas that make AI work — explained simply.
**Attention:** A way for the model to weigh the importance of different parts of the input when processing each token. "Attention Is All You Need" is the 2017 paper that launched the transformer revolution.
**Self-attention:** Each token in a sequence attends to every other token, creating rich contextual representations. The core of the transformer architecture.
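A minimal sketch of scaled dot-product self-attention in NumPy. The shapes and variable names (`Wq`, `Wk`, `Wv`, `d_model`, `d_k`) are illustrative, not from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors
```

Each output row is a weighted average of all value vectors, with weights set by how strongly that token attends to every other token.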
**Multi-head attention:** Running multiple self-attention operations in parallel, each learning different types of relationships. Like having multiple "lenses" to view the input.
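A sketch building on the `self_attention` function above: each head gets its own (smaller) projection matrices, and the head outputs are concatenated. The final output projection that real transformers apply afterwards is omitted for brevity:

```python
import numpy as np

def multi_head_attention(X, heads):
    """heads: a list of (Wq, Wk, Wv) triples, one projection set per head."""
    outputs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1)   # heads side by side, mixed by a later projection
```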
**Positional encoding:** Since transformers process all tokens simultaneously (unlike RNNs), position information must be added explicitly so the model knows word order.
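A sketch of the sinusoidal positional encoding from the original transformer paper; the matrix is simply added to the token embeddings (names and the even-dimension assumption are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Returns a (seq_len, d_model) matrix of position signals."""
    assert d_model % 2 == 0                      # assumes an even embedding size
    pos = np.arange(seq_len)[:, None]            # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]         # index over dimension pairs
    angles = pos / (10000 ** (2 * i / d_model))  # different wavelength per pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe
```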
**Feed-forward network:** After attention, each token passes through a small neural network that transforms its representation. Usually two linear layers with a non-linearity in between.
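A sketch of that position-wise feed-forward block, applied independently to every token's vector (weight names and the ReLU choice are illustrative):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    hidden = np.maximum(0, x @ W1 + b1)   # expand to d_ff and apply ReLU
    return hidden @ W2 + b2               # project back down to d_model
```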
**Layer normalization:** A technique to stabilize training by normalizing the activations of each layer. Helps gradients flow more smoothly through deep networks.
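A sketch of layer normalization: statistics are computed per token vector, across its features, then a learned scale (`gamma`) and shift (`beta`) are applied:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (..., d_model); normalizes each vector along its last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per sample
    return gamma * x_hat + beta             # learned rescale and shift
```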
**Loss function:** A mathematical measure of how far the model's predictions are from the correct answers. Training = minimizing this value. For language models, cross-entropy loss is standard.
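A sketch of cross-entropy loss for next-token prediction. Here `probs` stands in for the softmax output of a model; the names are illustrative:

```python
import numpy as np

def cross_entropy(probs, targets):
    """probs: (n_tokens, vocab_size) predicted probabilities;
    targets: (n_tokens,) indices of the correct next tokens."""
    picked = probs[np.arange(len(targets)), targets]   # probability given to the truth
    return -np.log(picked + 1e-12).mean()              # average negative log-likelihood
```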
**Gradient descent:** The optimization algorithm that adjusts model weights in the direction that reduces loss. "Descent" because you're moving down the loss surface toward a minimum.
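A toy worked example, assuming a one-parameter model fit to y = 2x with mean squared error, just to show the update rule in action:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, lr = 0.0, 0.1
for _ in range(100):
    pred = w * x
    grad = 2 * np.mean((pred - y) * x)   # d(loss)/dw for mean squared error
    w -= lr * grad                        # step opposite the gradient: downhill
print(round(w, 3))                        # converges to ~2.0
```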
**Adam:** The most popular optimizer for training deep learning models. Combines momentum (acceleration) with adaptive learning rates (per-parameter tuning).
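A sketch of a single Adam update for one parameter array. The hyperparameter defaults shown are the commonly used ones; `m` and `v` start at zero and `t` counts steps from 1:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m/v are running moment estimates, t is the step count."""
    m = beta1 * m + (1 - beta1) * grad            # momentum: smoothed gradient
    v = beta2 * v + (1 - beta2) * grad**2         # smoothed squared gradient
    m_hat = m / (1 - beta1**t)                    # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step size
    return w, m, v
```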
**Gradient:** A vector of partial derivatives showing the direction and rate of steepest increase of the loss. We move in the opposite direction to minimize loss.
**Regularization:** Techniques to prevent overfitting: dropout (randomly deactivating neurons), weight decay (penalizing large weights), and early stopping.
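A sketch of dropout at training time, in the common "inverted" form that rescales the survivors so nothing changes at inference. The keep/drop probability is illustrative:

```python
import numpy as np

def dropout(x, p_drop=0.1, training=True):
    """Randomly zero a fraction p_drop of activations during training."""
    if not training or p_drop == 0.0:
        return x
    mask = np.random.rand(*x.shape) >= p_drop   # True = keep, False = drop
    return x * mask / (1.0 - p_drop)            # rescale so the expected value is unchanged
```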
**Batch normalization:** Normalizing layer inputs across each mini-batch. Reduces internal covariate shift and allows higher learning rates.
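A training-time sketch; the contrast with layer norm above is the axis: statistics are taken over the batch, per feature. The running statistics used at inference are omitted:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch_size, n_features); normalizes each feature over the batch."""
    mu = x.mean(axis=0, keepdims=True)      # per-feature mean across the batch
    var = x.var(axis=0, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```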
**Temperature:** Controls randomness in text generation. Low (e.g., 0.2) = focused and nearly deterministic. High (e.g., 0.9) = creative and varied. 1.0 = sample from the model's unmodified distribution.
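A sketch of temperature applied to the logits before sampling (names illustrative):

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.7):
    """Divide logits by temperature: <1 sharpens the distribution, >1 flattens it."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()                # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return np.random.choice(len(probs), p=probs)  # draw the next token id
```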
**Top-K sampling:** At each step, only consider the K most likely next tokens. Reduces weird or irrelevant outputs.
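A sketch of Top-K filtering: everything outside the K highest-scoring tokens is masked out before the softmax, so it can never be sampled:

```python
import numpy as np

def top_k_filter(logits, k=50):
    """Keep only the k largest logits; send the rest to -inf so softmax ignores them."""
    k = min(k, len(logits))
    cutoff = np.sort(logits)[-k]                    # value of the k-th largest logit
    return np.where(logits >= cutoff, logits, -np.inf)
```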
**Top-P (nucleus) sampling:** Only consider the smallest set of tokens whose cumulative probability reaches P. More adaptive than Top-K: the number of candidates adjusts automatically with the shape of the distribution.
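A sketch of Top-P (nucleus) filtering over an already-softmaxed distribution; tokens are kept in probability order until the cumulative mass reaches P, and the rest are zeroed out:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """probs: probabilities summing to 1. Keeps the nucleus whose mass reaches p."""
    order = np.argsort(probs)[::-1]               # most likely tokens first
    cumulative = np.cumsum(probs[order])
    keep = (cumulative - probs[order]) < p        # include the token that crosses p
    kept = np.zeros_like(probs)
    kept[order[keep]] = probs[order[keep]]
    return kept / kept.sum()                      # renormalize over the nucleus, then sample
```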
**Greedy decoding:** Always pick the most likely next token. Fastest but can get stuck in repetitive loops. Often produces the most coherent output for factual tasks.
**Beam search:** Instead of picking the single best token at each step, keep the top B candidate sequences and pick the best overall. Better quality but slower.
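A compact sketch. `step_log_probs` is a hypothetical stand-in for the model: given a partial sequence of token ids, it returns log-probabilities for the next token. The token names and stopping rule are illustrative:

```python
def beam_search(step_log_probs, start_token, end_token, beam_width=3, max_len=20):
    """Keep the beam_width best partial sequences (by total log-probability) each step."""
    beams = [([start_token], 0.0)]                        # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:                   # finished beams carry over unchanged
                candidates.append((tokens, score))
                continue
            log_probs = step_log_probs(tokens)            # model call: scores for the next token
            for token_id, lp in enumerate(log_probs):
                candidates.append((tokens + [token_id], score + lp))
        # keep only the best beam_width candidates overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]                                       # highest-scoring sequence found
```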
**Logits:** The raw, unnormalized scores the model outputs for each token before softmax. Can be adjusted for bias correction, repetition penalties, and custom sampling.
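As one example of adjusting logits, here is a sketch of a repetition penalty in the style popularized by the CTRL paper: tokens that already appeared are made less attractive (the penalty value and rule are illustrative):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Down-weight tokens that have already been generated."""
    logits = logits.copy()
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty   # shrink positive scores
        else:
            logits[token_id] *= penalty   # push negative scores further down
    return logits
```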
**Perplexity:** Measures how "surprised" the model is by test data. Lower is better. A perplexity of 100 means the model is as confused as if it were choosing uniformly among 100 options.
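Perplexity is just exponentiated cross-entropy (base e if the loss uses natural logs, base 2 if it is measured in bits). A sketch using the same illustrative shapes as the loss example above:

```python
import numpy as np

def perplexity(probs, targets):
    """probs: (n_tokens, vocab_size); targets: (n_tokens,) true token ids."""
    picked = probs[np.arange(len(targets)), targets]
    ce = -np.log(picked).mean()   # natural-log cross-entropy per token
    return np.exp(ce)             # uniform confusion over N options gives perplexity N
```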
**Accuracy:** Percentage of correct predictions. Simple but can be misleading for imbalanced datasets.
**Precision & Recall:** Precision = of all positive predictions, how many were correct? Recall = of all actual positives, how many did we find?
**F1 score:** The harmonic mean of precision and recall. A single metric that balances both.
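A sketch covering the three entries above, computing accuracy, precision, recall, and F1 from raw binary predictions (labels and names illustrative):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """y_true, y_pred: arrays of 0/1 labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```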
**BLEU & ROUGE:** Metrics for evaluating text generation quality by comparing model output to reference text. BLEU counts n-gram overlap (used for translation). ROUGE is similar but common for summarization.
**Tokens per second (TPS):** How many tokens the model generates per second. Measures inference speed. Typical range: 20-100+ TPS depending on model size and hardware.
| Concept | Formula | What it means |
|---|---|---|
| Attention | softmax(QKᵀ/√dₖ)V | Weigh inputs by relevance |
| Cross-Entropy Loss | -Σ yᵢ log(pᵢ) | Penalizes wrong predictions |
| Softmax | exp(xᵢ) / Σⱼ exp(xⱼ) | Converts scores to probabilities |
| ReLU | max(0, x) | Activation: passes positive values only |
| Layer Norm | (x - μ) / σ × γ + β | Normalizes per-sample activations |
| F1 Score | 2 × (P×R)/(P+R) | Harmonic mean of precision & recall |
| Perplexity | 2^(cross-entropy) | Effective branching factor |