<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Math &amp; Concepts - Cheat Sheet</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<nav>
<div class="nav-inner">
<a href="../index.html" class="nav-brand">AI Cheat Sheet</a>
<div class="nav-links">
<a href="terminology.html">Terminology</a>
<a href="techniques.html">Techniques</a>
<a href="use-cases.html">Use Cases</a>
<a href="model-types.html">Model Types</a>
<a href="prompts.html">Prompt Guide</a>
<a href="math.html" class="active">Math & Concepts</a>
</div>
<button class="dark-toggle" id="darkToggle" aria-label="Toggle dark mode">🌙</button>
</div>
</nav>
<script>
(function(){
var btn = document.getElementById('darkToggle');
var saved = localStorage.getItem('theme');
if(saved === 'dark' || (!saved && window.matchMedia('(prefers-color-scheme: dark)').matches)){
document.documentElement.setAttribute('data-theme','dark');
btn.textContent = '☀️';
}
btn.addEventListener('click', function(){
var isDark = document.documentElement.getAttribute('data-theme') === 'dark';
if(isDark){
document.documentElement.removeAttribute('data-theme');
btn.textContent = '🌙';
localStorage.setItem('theme','light');
} else {
document.documentElement.setAttribute('data-theme','dark');
btn.textContent = '☀️';
localStorage.setItem('theme','dark');
}
});
})();
</script>
<div class="hero">
<h1>Math &amp; Concepts</h1>
<p>The underlying ideas that make AI work — explained simply.</p>
</div>

<div class="container">
<h2 class="section-title">Core Concepts</h2>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Attention Mechanism</h3>
<p>A way for the model to weigh the importance of different parts of the input when processing each token. "Attention Is All You Need" — the 2017 paper that launched the transformer revolution.</p>
<div class="example"><strong>Analogy:</strong> When reading a sentence, you naturally pay more attention to certain words. "The cat that chased the mouse hid" — you attend to "cat" when processing "hid".</div>
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Self-Attention</h3>
<p>Each token in a sequence attends to every other token, creating rich contextual representations. The core of the transformer architecture.</p>
<div class="example"><strong>Math:</strong> Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V</div>
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Multi-Head Attention</h3>
<p>Running multiple self-attention operations in parallel, each learning different types of relationships. Like having multiple "lenses" to view the input.</p>
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Positional Encoding</h3>
<p>Since transformers process all tokens simultaneously (unlike RNNs), position information must be added explicitly so the model knows word order.</p>
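<div class="example"><strong>Sketch (Python):</strong> one common scheme, the sinusoidal encoding from the original transformer paper; many modern models learn position embeddings or use rotary encodings instead.
<pre><code>import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]              # (seq, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                      # added to token embeddings

print(sinusoidal_positions(seq_len=8, d_model=16).shape)   # (8, 16)
</code></pre></div>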
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Feed-Forward Network (FFN)</h3>
<p>After attention, each token passes through a small neural network that transforms its representation. Usually two linear layers with a non-linearity in between.</p>
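<div class="example"><strong>Sketch (Python):</strong> the two-layer pattern in NumPy; the 4x expansion shown is a common convention, not a requirement, and all names here are invented for the demo.
<pre><code>import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Expand, apply a non-linearity (ReLU), project back."""
    hidden = np.maximum(0, x @ W1 + b1)    # ReLU clips negatives to zero
    return hidden @ W2 + b2                # back to the model dimension

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))                          # 3 tokens, model dim 8
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)      # 4x expansion
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
print(ffn(x, W1, b1, W2, b2).shape)                  # (3, 8)
</code></pre></div>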
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Layer Normalization</h3>
<p>A technique to stabilize training by normalizing the activations of each layer. Helps gradients flow more smoothly through deep networks.</p>
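<div class="example"><strong>Sketch (Python):</strong> a per-row normalization matching the Layer Norm formula in the table below; gamma and beta default to the identity here.
<pre><code>import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each sample over its features, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta

print(layer_norm(np.array([[1.0, 2.0, 3.0]])))   # zero mean, unit variance
</code></pre></div>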
</div>

<h2 class="section-title">Training Concepts</h2>
<div class="def-card">
<span class="category">Training</span>
<h3>Loss Function</h3>
<p>A mathematical measure of how far the model's predictions are from the correct answers. Training = minimizing this value. For language models, cross-entropy loss is standard.</p>
<div class="example"><strong>Example:</strong> If the correct next word is "cat" but the model assigns it 10% probability, the loss is high. If it assigns 90%, the loss is low.</div>
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Gradient Descent</h3>
<p>The optimization algorithm that adjusts model weights in the direction that reduces loss. "Descent" because you're moving down the loss surface toward a minimum.</p>
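<div class="example"><strong>Sketch (Python):</strong> a toy run on loss(w) = w**2, whose gradient is 2w; the step function is a bare-bones illustration, not a framework API.
<pre><code>def sgd_step(w, grad, lr=0.1):
    """One step: move the weight against the gradient."""
    return w - lr * grad

w = 3.0
for _ in range(100):
    w = sgd_step(w, grad=2 * w)
print(round(w, 6))   # approaches the minimum at w = 0
</code></pre></div>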
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Adam Optimizer</h3>
<p>The most popular optimizer for training deep learning models. Combines momentum (acceleration) with adaptive learning rates (per-parameter tuning).</p>
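<div class="example"><strong>Sketch (Python):</strong> a from-scratch version of the textbook update rule, for intuition only; real training loops use a framework's optimizer.
<pre><code>import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # momentum: running mean of gradients
    v = b2 * v + (1 - b2) * grad**2     # adaptivity: running mean of squares
    m_hat = m / (1 - b1**t)             # bias correction for early steps
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w, m, v = 3.0, 0.0, 0.0
for t in range(1, 201):                 # minimize w**2, gradient 2w
    w, m, v = adam_step(w, grad=2 * w, m=m, v=v, t=t)
print(round(w, 3))   # settles near the minimum at 0 (up to small oscillations)
</code></pre></div>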
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Gradient</h3>
<p>A vector of partial derivatives showing the direction and rate of steepest increase of the loss. We move in the opposite direction to minimize loss.</p>
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Regularization</h3>
<p>Techniques to prevent overfitting: dropout (randomly deactivating neurons), weight decay (penalizing large weights), and early stopping.</p>
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Batch Normalization</h3>
<p>Normalizing layer inputs across each mini-batch. Reduces internal covariate shift and allows higher learning rates.</p>
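<div class="example"><strong>Sketch (Python):</strong> training-time statistics only, to show the axis: batch norm normalizes each feature across the batch (axis 0), where layer norm works across each sample's features.
<pre><code>import numpy as np

def batch_norm(x, eps=1e-5):
    mu = x.mean(axis=0, keepdims=True)      # per-feature mean over the batch
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

batch = np.array([[1.0, 10.0], [3.0, 30.0]])   # 2 samples, 2 features
print(batch_norm(batch))   # each column now has mean 0, unit variance
</code></pre></div>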
</div>
<h2 class="section-title">Generation & Sampling</h2>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Temperature</h3>
<p>Controls randomness in text generation. Low (0.2) = focused, near-deterministic. High (0.9) = creative and varied. 1.0 = sampling directly from the model's predicted distribution.</p>
<div class="example"><strong>Low temp:</strong> Technical documentation, code generation<br>
<strong>High temp:</strong> Creative writing, brainstorming</div>
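<div class="example"><strong>Sketch (Python):</strong> temperature scaling before sampling; the three-token logits are invented for the demo. Low T sharpens the distribution, high T flattens it.
<pre><code>import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.1])
print(sample_with_temperature(logits, temperature=0.2))   # almost always 0
print(sample_with_temperature(logits, temperature=1.5))   # more varied picks
</code></pre></div>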
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Top-K Sampling</h3>
<p>At each step, only consider the K most likely next tokens. Reduces weird or irrelevant outputs.</p>
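<div class="example"><strong>Sketch (Python):</strong> filtering a toy four-token distribution down to the top 2, then renormalizing; sampling would then draw from the filtered distribution.
<pre><code>import numpy as np

def top_k_filter(probs, k):
    cutoff = np.sort(probs)[-k]                       # k-th largest probability
    filtered = np.where(probs >= cutoff, probs, 0.0)  # zero out the rest
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_k_filter(probs, k=2))   # [0.625, 0.375, 0, 0]
</code></pre></div>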
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Top-P (Nucleus) Sampling</h3>
<p>Only consider the smallest set of most-likely tokens whose cumulative probability reaches P. More adaptive than Top-K — the candidate pool grows or shrinks automatically with the model's confidence.</p>
<div class="example"><strong>Top-P = 0.9:</strong> Include the smallest set of tokens that together cover 90% probability mass.</div>
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Greedy Decoding</h3>
<p>Always pick the most likely next token. Fastest but can get stuck in repetitive loops. Often produces the most coherent output for factual tasks.</p>
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Beam Search</h3>
<p>Instead of picking the single best token at each step, keep the top B sequences and pick the best overall. Better quality but slower.</p>
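<div class="example"><strong>Sketch (Python):</strong> a minimal beam search; step_fn is a hypothetical stand-in for a model call that returns log-probabilities for the next token over a five-token vocabulary.
<pre><code>import numpy as np

def beam_search(step_fn, bos_id, eos_id, beam_width=3, max_len=20):
    beams = [([bos_id], 0.0)]                     # (sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                 # finished beams carry over
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)              # hypothetical model call
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((seq + [int(tok)], score + log_probs[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                            # best-scoring sequence

def step_fn(seq):                                 # toy "model": wants last+1, stops at 4
    logits = -np.abs(np.arange(5) - (seq[-1] + 1)).astype(float)
    return logits - np.log(np.exp(logits).sum())  # log-softmax

print(beam_search(step_fn, bos_id=0, eos_id=4, beam_width=2, max_len=6))
# [0, 1, 2, 3, 4]
</code></pre></div>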
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Logits</h3>
<p>The raw, unnormalized scores the model outputs for each token before softmax. Can be adjusted for bias correction, repetition penalties, and custom sampling.</p>
</div>

<h2 class="section-title">Evaluation Metrics</h2>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Perplexity</h3>
<p>Measures how "surprised" the model is by test data. Lower is better. A perplexity of 100 means the model is as confused as choosing uniformly from 100 options.</p>
<div class="example"><strong>Example:</strong> Perplexity 5 on a language model means, on average, it's as uncertain as picking from 5 equally likely options at each step.</div>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Accuracy</h3>
<p>Percentage of correct predictions. Simple but can be misleading for imbalanced datasets.</p>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Precision &amp; Recall</h3>
<p>Precision = of all positive predictions, how many were correct? Recall = of all actual positives, how many did we find?</p>
<div class="example"><strong>Spam filter:</strong> High precision = few legitimate emails flagged. High recall = few spam emails missed.</div>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>F1 Score</h3>
<p>The harmonic mean of precision and recall. A single metric that balances both.</p>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>BLEU / ROUGE</h3>
<p>Metrics for evaluating text generation quality by comparing model output to reference text. BLEU counts n-gram overlap and is precision-oriented (standard for translation); ROUGE is recall-oriented and common for summarization.</p>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Tokens per Second (TPS)</h3>
<p>How many tokens the model generates per second. Measures inference speed. Typical range: 20-100+ TPS depending on model size and hardware.</p>
</div>

<h2 class="section-title">Key Formulas</h2>
<table class="glossary-table">
<thead>
<tr><th>Concept</th><th>Formula</th><th>What it means</th></tr>
</thead>
<tbody>
<tr><td>Attention</td><td>softmax(QKᵀ/√dₖ)V</td><td>Weigh inputs by relevance</td></tr>
<tr><td>Cross-Entropy Loss</td><td>-Σ yᵢ log(pᵢ)</td><td>Penalizes wrong predictions</td></tr>
<tr><td>Softmax</td><td>exp(xᵢ) / Σⱼ exp(xⱼ)</td><td>Converts scores to probabilities</td></tr>
<tr><td>ReLU</td><td>max(0, x)</td><td>Activation: passes positive values only</td></tr>
<tr><td>Layer Norm</td><td>(x - μ) / σ × γ + β</td><td>Normalizes per-sample activations</td></tr>
<tr><td>F1 Score</td><td>2 × (P×R)/(P+R)</td><td>Harmonic mean of precision &amp; recall</td></tr>
<tr><td>Perplexity</td><td>2^(cross-entropy)</td><td>Effective branching factor</td></tr>
</tbody>
</table>
</div>
<footer>AI Cheat Sheet — A learning reference for artificial intelligence</footer>
</body>
</html>