<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Math & Concepts - Cheat Sheet</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<nav>
<div class="nav-inner">
<a href="../index.html" class="nav-brand">AI Cheat Sheet</a>
<div class="nav-links">
<a href="terminology.html">Terminology</a>
<a href="techniques.html">Techniques</a>
<a href="use-cases.html">Use Cases</a>
<a href="model-types.html">Model Types</a>
<a href="prompts.html">Prompt Guide</a>
<a href="math.html" class="active">Math & Concepts</a>
</div>
<button class="dark-toggle" id="darkToggle" aria-label="Toggle dark mode">🌙</button>
</div>
</nav>
<script>
(function(){
var btn = document.getElementById('darkToggle');
var saved = localStorage.getItem('theme');
// Apply the saved theme, or fall back to the OS preference
if(saved === 'dark' || (!saved && window.matchMedia('(prefers-color-scheme: dark)').matches)){
document.documentElement.setAttribute('data-theme','dark');
btn.textContent = '☀️';
}
// Toggle the theme on click and remember the choice
btn.addEventListener('click', function(){
var isDark = document.documentElement.getAttribute('data-theme') === 'dark';
if(isDark){
document.documentElement.removeAttribute('data-theme');
btn.textContent = '🌙';
localStorage.setItem('theme','light');
} else {
document.documentElement.setAttribute('data-theme','dark');
btn.textContent = '☀️';
localStorage.setItem('theme','dark');
}
});
})();
</script>
<div class="hero">
<h1>Math & Concepts</h1>
<p>The underlying ideas that make AI work — explained simply.</p>
</div>
<div class="container">
<h2 class="section-title">Core Concepts</h2>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Attention Mechanism</h3>
<p>A way for the model to weigh the importance of different parts of the input when processing each token. "Attention Is All You Need" — the 2017 paper that launched the transformer revolution.</p>
<div class="example"><strong>Analogy:</strong> When reading a sentence, you naturally pay more attention to certain words. "The cat that chased the mouse hid" — you attend to "cat" when processing "hid".</div>
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Self-Attention</h3>
<p>Each token in a sequence attends to every other token, creating rich contextual representations. The core of the transformer architecture.</p>
<div class="example"><strong>Math:</strong> Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V</div>
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Multi-Head Attention</h3>
<p>Running multiple self-attention operations in parallel, each learning different types of relationships. Like having multiple "lenses" to view the input.</p>
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Positional Encoding</h3>
<p>Since transformers process all tokens simultaneously (unlike RNNs), position information must be added explicitly so the model knows word order.</p>
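<div class="example"><strong>Sketch:</strong> One common scheme is the sinusoidal encoding from the original transformer paper; the sequence length and dimensions below are arbitrary.
<pre><code>import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # position index for each token
    i = np.arange(d_model)[None, :]          # embedding dimension index
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])    # odd dimensions: cosine
    return pe                                # added to the token embeddings

print(sinusoidal_positions(seq_len=4, d_model=8).shape)  # (4, 8)
</code></pre></div>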
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Feed-Forward Network (FFN)</h3>
<p>After attention, each token passes through a small neural network that transforms its representation. Usually two linear layers with a non-linearity in between.</p>
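<div class="example"><strong>Sketch:</strong> The position-wise feed-forward block in NumPy; the random weights stand in for learned parameters.
<pre><code>import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)  # first linear layer + ReLU non-linearity
    return hidden @ W2 + b2              # second linear layer back to model dimension

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                    # the hidden layer is usually ~4x wider
x = rng.normal(size=(4, d_model))        # 4 token representations
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 8)
</code></pre></div>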
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Layer Normalization</h3>
<p>A technique to stabilize training by normalizing the activations of each layer. Helps gradients flow more smoothly through deep networks.</p>
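<div class="example"><strong>Sketch:</strong> Normalizing each token's activation vector; γ and β are learned in practice, here left at their identity values.
<pre><code>import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)    # mean over the feature dimension
    sigma = x.std(axis=-1, keepdims=True)  # standard deviation over the feature dimension
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))  # zero mean, unit variance per row
</code></pre></div>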
</div>
<h2 class="section-title">Training Concepts</h2>
<div class="def-card">
<span class="category">Training</span>
<h3>Loss Function</h3>
<p>A mathematical measure of how far the model's predictions are from the correct answers. Training = minimizing this value. For language models, cross-entropy loss is standard.</p>
<div class="example"><strong>Example:</strong> If the correct next word is "cat" but the model assigns it 10% probability, the loss is high. If it assigns 90%, the loss is low.</div>
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Gradient Descent</h3>
<p>The optimization algorithm that adjusts model weights in the direction that reduces loss. "Descent" because you're moving down the loss surface toward a minimum.</p>
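<div class="example"><strong>Sketch:</strong> Gradient descent on a toy one-dimensional loss, (w - 3)², whose gradient is 2(w - 3).
<pre><code>def loss(w):
    return (w - 3) ** 2      # toy loss, minimized at w = 3

def grad(w):
    return 2 * (w - 3)       # derivative of the loss

w, lr = 0.0, 0.1             # starting weight and learning rate
for step in range(100):
    w -= lr * grad(w)        # move against the gradient
print(round(w, 4))           # converges to ~3.0
</code></pre></div>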
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Adam Optimizer</h3>
<p>The most popular optimizer for training deep learning models. Combines momentum (a running average of past gradients) with adaptive, per-parameter learning rates.</p>
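<div class="example"><strong>Sketch:</strong> The Adam update rule on the same toy loss as above; the hyperparameters are the common defaults.
<pre><code>import numpy as np

def grad(w):
    return 2 * (w - 3)                      # gradient of (w - 3)^2

w, lr = 0.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = v = 0.0                                 # running averages of grad and grad^2
for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g         # momentum term
    v = beta2 * v + (1 - beta2) * g**2      # adaptive per-parameter scale
    m_hat = m / (1 - beta1**t)              # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)                                    # converges toward 3.0
</code></pre></div>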
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Gradient</h3>
<p>A vector of partial derivatives showing the direction and rate of steepest increase of the loss. We move in the opposite direction to minimize loss.</p>
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Regularization</h3>
<p>Techniques to prevent overfitting: dropout (randomly deactivating neurons), weight decay (penalizing large weights), and early stopping.</p>
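<div class="example"><strong>Sketch:</strong> Inverted dropout as applied during training; the drop probability and input are illustrative.
<pre><code>import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5):
    mask = rng.random(x.shape) > p  # deactivate a fraction p of the units at random
    return x * mask / (1 - p)       # rescale so the expected activation is unchanged

x = np.ones((1, 8))
print(dropout(x, p=0.5))            # about half the entries zeroed, the rest scaled to 2.0
</code></pre></div>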
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Batch Normalization</h3>
<p>Normalizing layer inputs across each mini-batch. Reduces internal covariate shift and allows higher learning rates.</p>
</div>
<h2 class="section-title">Generation & Sampling</h2>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Temperature</h3>
<p>Controls randomness in text generation. Low (0.2) = focused and deterministic. High (0.9) = creative and varied. 1.0 = standard sampling.</p>
<div class="example"><strong>Low temp:</strong> Technical documentation, code generation<br>
<strong>High temp:</strong> Creative writing, brainstorming</div>
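<div class="example"><strong>Sketch:</strong> How temperature reshapes the output distribution; the logits are made up.
<pre><code>import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])               # raw scores for three candidate tokens
for temp in (0.2, 1.0, 2.0):
    print(temp, softmax(logits / temp).round(3))  # low temp sharpens, high temp flattens
</code></pre></div>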
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Top-K Sampling</h3>
<p>At each step, only consider the K most likely next tokens. Cuts off the long tail of unlikely tokens, reducing incoherent or irrelevant outputs.</p>
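<div class="example"><strong>Sketch:</strong> Top-K filtering before sampling; K and the probabilities are illustrative.
<pre><code>import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k):
    top = np.argsort(probs)[-k:]      # indices of the k most likely tokens
    p = np.zeros_like(probs)
    p[top] = probs[top]
    p /= p.sum()                      # renormalize over the survivors
    return rng.choice(len(probs), p=p)

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
print(top_k_sample(probs, k=2))       # only token 0 or 1 can be drawn
</code></pre></div>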
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Top-P (Nucleus) Sampling</h3>
<p>Only consider the smallest set of most-likely tokens whose cumulative probability reaches P. More adaptive than Top-K — automatically adjusts the number of candidates.</p>
<div class="example"><strong>Top-P = 0.9:</strong> Include the smallest set of tokens that together cover 90% probability mass.</div>
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Greedy Decoding</h3>
<p>Always pick the most likely next token. Fastest but can get stuck in repetitive loops. Often produces the most coherent output for factual tasks.</p>
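<div class="example"><strong>Sketch:</strong> In code, greedy decoding is just an argmax over the next-token scores.
<pre><code>import numpy as np

logits = np.array([2.0, 1.0, 0.1])   # scores for three candidate tokens
next_token = int(np.argmax(logits))  # always pick the single most likely token
print(next_token)                    # 0
</code></pre></div>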
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Beam Search</h3>
<p>Instead of picking the single best token at each step, keep the top B sequences and pick the best overall. Better quality but slower.</p>
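<div class="example"><strong>Sketch:</strong> A toy beam search where each step's log-probabilities come from a fixed random table rather than a real model, which would condition on the prefix.
<pre><code>import numpy as np

def beam_search(step_logprobs, beam_width=3):
    beams = [([], 0.0)]                            # (token sequence, total log-probability)
    for logprobs in step_logprobs:                 # one row of log-probs per step
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in enumerate(logprobs)]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]            # keep only the best B sequences
    return beams[0]                                # best overall sequence and its score

rng = np.random.default_rng(0)
table = np.log(rng.dirichlet(np.ones(5), size=4))  # 4 steps, 5-token vocabulary
print(beam_search(table, beam_width=3))
</code></pre></div>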
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Logits</h3>
<p>The raw, unnormalized scores the model outputs for each token before softmax. Can be adjusted for bias correction, repetition penalties, and custom sampling.</p>
</div>
<h2 class="section-title">Evaluation Metrics</h2>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Perplexity</h3>
<p>Measures how "surprised" the model is by test data. Lower is better. A perplexity of 100 means the model is as confused as choosing uniformly from 100 options.</p>
<div class="example"><strong>Example:</strong> Perplexity 5 on a language model means, on average, it's as uncertain as picking from 5 equally likely options at each step.</div>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Accuracy</h3>
<p>Percentage of correct predictions. Simple but can be misleading for imbalanced datasets.</p>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Precision & Recall</h3>
<p>Precision = of all positive predictions, how many were correct? Recall = of all actual positives, how many did we find?</p>
<div class="example"><strong>Spam filter:</strong> High precision = few legitimate emails flagged. High recall = few spam emails missed.</div>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>F1 Score</h3>
<p>The harmonic mean of precision and recall. A single metric that balances both.</p>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>BLEU / ROUGE</h3>
<p>Metrics for evaluating text generation quality by comparing model output to reference text. BLEU counts n-gram overlap (used for translation). ROUGE is similar but common for summarization.</p>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Tokens per Second (TPS)</h3>
<p>How many tokens the model generates per second. Measures inference speed. Typical range: 20-100+ TPS depending on model size and hardware.</p>
</div>
<h2 class="section-title">Key Formulas</h2>
<table class="glossary-table">
<thead>
<tr><th>Concept</th><th>Formula</th><th>What it means</th></tr>
</thead>
<tbody>
<tr><td>Attention</td><td>softmax(QKᵀ/√dₖ)V</td><td>Weigh inputs by relevance</td></tr>
<tr><td>Cross-Entropy Loss</td><td>-Σ yᵢ log(pᵢ)</td><td>Penalizes wrong predictions</td></tr>
<tr><td>Softmax</td><td>exp(xᵢ) / Σⱼ exp(xⱼ)</td><td>Converts scores to probabilities</td></tr>
<tr><td>ReLU</td><td>max(0, x)</td><td>Activation: passes positive values only</td></tr>
<tr><td>Layer Norm</td><td>(x - μ) / σ × γ + β</td><td>Normalizes per-sample activations</td></tr>
<tr><td>F1 Score</td><td>2 × (P×R)/(P+R)</td><td>Harmonic mean of precision & recall</td></tr>
<tr><td>Perplexity</td><td>2^(cross-entropy)</td><td>Effective branching factor</td></tr>
</tbody>
</table>
</div>
<footer>AI Cheat Sheet &mdash; A learning reference for artificial intelligence</footer>
</body>
</html>