<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Math & Concepts - Cheat Sheet</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<nav>
<div class="nav-inner">
<a href="../index.html" class="nav-brand">AI Cheat Sheet</a>
<div class="nav-links">
<a href="terminology.html">Terminology</a>
<a href="techniques.html">Techniques</a>
<a href="use-cases.html">Use Cases</a>
<a href="model-types.html">Model Types</a>
<a href="prompts.html">Prompt Guide</a>
<a href="math.html" class="active">Math & Concepts</a>
</div>
<button class="dark-toggle" id="darkToggle" aria-label="Toggle dark mode">🌙</button>
</div>
</nav>
<script>
(function(){
var btn = document.getElementById('darkToggle');
var saved = localStorage.getItem('theme');
// Apply the saved theme, or fall back to the OS preference
if(saved === 'dark' || (!saved && window.matchMedia('(prefers-color-scheme: dark)').matches)){
document.documentElement.setAttribute('data-theme','dark');
btn.textContent = '☀️';
}
// Toggle the theme on click and remember the choice
btn.addEventListener('click', function(){
var isDark = document.documentElement.getAttribute('data-theme') === 'dark';
if(isDark){
document.documentElement.removeAttribute('data-theme');
btn.textContent = '🌙';
localStorage.setItem('theme','light');
} else {
document.documentElement.setAttribute('data-theme','dark');
btn.textContent = '☀️';
localStorage.setItem('theme','dark');
}
});
})();
</script>
<div class="hero">
<h1>Math & Concepts</h1>
<p>The underlying ideas that make AI work — explained simply.</p>
</div>
<div class="container">
<h2 class="section-title">Core Concepts</h2>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Attention Mechanism</h3>
<p>A way for the model to weigh the importance of different parts of the input when processing each token. "Attention Is All You Need" — the 2017 paper that launched the transformer revolution.</p>
<div class="example"><strong>Analogy:</strong> When reading a sentence, you naturally pay more attention to certain words. "The cat that chased the mouse hid" — you attend to "cat" when processing "hid".</div>
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Self-Attention</h3>
<p>Each token in a sequence attends to every other token, creating rich contextual representations. The core of the transformer architecture.</p>
<div class="example"><strong>Math:</strong> Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V</div>
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Multi-Head Attention</h3>
<p>Running multiple self-attention operations in parallel, each learning different types of relationships. Like having multiple "lenses" to view the input.</p>
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Positional Encoding</h3>
<p>Since transformers process all tokens simultaneously (unlike RNNs), position information must be added explicitly so the model knows word order.</p>
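<div class="example"><strong>Sketch:</strong> One common scheme is the sinusoidal encoding from the original transformer paper; the sequence length and dimensions below are arbitrary.
<pre><code>import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # position index for each token
    i = np.arange(d_model)[None, :]          # embedding dimension index
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])    # odd dimensions: cosine
    return pe                                # added to the token embeddings

print(sinusoidal_positions(seq_len=4, d_model=8).shape)  # (4, 8)
</code></pre></div>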
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Feed-Forward Network (FFN)</h3>
<p>After attention, each token passes through a small neural network that transforms its representation. Usually two linear layers with a non-linearity in between.</p>
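<div class="example"><strong>Sketch:</strong> The position-wise feed-forward block in NumPy; the random weights stand in for learned parameters.
<pre><code>import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)  # first linear layer + ReLU non-linearity
    return hidden @ W2 + b2              # second linear layer back to model dimension

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                    # the hidden layer is usually ~4x wider
x = rng.normal(size=(4, d_model))        # 4 token representations
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 8)
</code></pre></div>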
</div>
<div class="def-card">
<span class="category">Architecture</span>
<h3>Layer Normalization</h3>
<p>A technique to stabilize training by normalizing the activations of each layer. Helps gradients flow more smoothly through deep networks.</p>
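<div class="example"><strong>Sketch:</strong> Normalizing each token's activation vector; γ and β are learned in practice, here left at their identity values.
<pre><code>import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)    # mean over the feature dimension
    sigma = x.std(axis=-1, keepdims=True)  # standard deviation over the feature dimension
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))  # zero mean, unit variance per row
</code></pre></div>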
</div>
<h2 class="section-title">Training Concepts</h2>
<div class="def-card">
<span class="category">Training</span>
<h3>Loss Function</h3>
<p>A mathematical measure of how far the model's predictions are from the correct answers. Training = minimizing this value. For language models, cross-entropy loss is standard.</p>
<div class="example"><strong>Example:</strong> If the correct next word is "cat" but the model assigns it 10% probability, the loss is high. If it assigns 90%, the loss is low.</div>
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Gradient Descent</h3>
<p>The optimization algorithm that adjusts model weights in the direction that reduces loss. "Descent" because you're moving down the loss surface toward a minimum.</p>
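<div class="example"><strong>Sketch:</strong> Gradient descent on a toy one-dimensional loss, (w - 3)², whose gradient is 2(w - 3).
<pre><code>def loss(w):
    return (w - 3) ** 2      # toy loss, minimized at w = 3

def grad(w):
    return 2 * (w - 3)       # derivative of the loss

w, lr = 0.0, 0.1             # starting weight and learning rate
for step in range(100):
    w -= lr * grad(w)        # move against the gradient
print(round(w, 4))           # converges to ~3.0
</code></pre></div>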
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Adam Optimizer</h3>
<p>The most popular optimizer for training deep learning models. Combines momentum (a running average of past gradients) with adaptive, per-parameter learning rates.</p>
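<div class="example"><strong>Sketch:</strong> The Adam update rule on the same toy loss as above; the hyperparameters are the common defaults.
<pre><code>import numpy as np

def grad(w):
    return 2 * (w - 3)                      # gradient of (w - 3)^2

w, lr = 0.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = v = 0.0                                 # running averages of grad and grad^2
for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g         # momentum term
    v = beta2 * v + (1 - beta2) * g**2      # adaptive per-parameter scale
    m_hat = m / (1 - beta1**t)              # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)                                    # converges toward 3.0
</code></pre></div>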
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Gradient</h3>
<p>A vector of partial derivatives showing the direction and rate of steepest increase of the loss. We move in the opposite direction to minimize loss.</p>
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Regularization</h3>
<p>Techniques to prevent overfitting: dropout (randomly deactivating neurons), weight decay (penalizing large weights), and early stopping.</p>
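<div class="example"><strong>Sketch:</strong> Inverted dropout as applied during training; the drop probability and input are illustrative.
<pre><code>import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5):
    mask = rng.random(x.shape) > p  # deactivate a fraction p of the units at random
    return x * mask / (1 - p)       # rescale so the expected activation is unchanged

x = np.ones((1, 8))
print(dropout(x, p=0.5))            # about half the entries zeroed, the rest scaled to 2.0
</code></pre></div>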
</div>
<div class="def-card">
<span class="category">Training</span>
<h3>Batch Normalization</h3>
<p>Normalizing layer inputs across each mini-batch. Reduces internal covariate shift and allows higher learning rates.</p>
</div>
<h2 class="section-title">Generation & Sampling</h2>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Temperature</h3>
<p>Controls randomness in text generation. Low (0.2) = focused and deterministic. High (0.9) = creative and varied. 1.0 = standard sampling.</p>
<div class="example"><strong>Low temp:</strong> Technical documentation, code generation<br>
<strong>High temp:</strong> Creative writing, brainstorming</div>
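<div class="example"><strong>Sketch:</strong> How temperature reshapes the output distribution; the logits are made up.
<pre><code>import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])               # raw scores for three candidate tokens
for temp in (0.2, 1.0, 2.0):
    print(temp, softmax(logits / temp).round(3))  # low temp sharpens, high temp flattens
</code></pre></div>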
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Top-K Sampling</h3>
<p>At each step, only consider the K most likely next tokens. Cuts off the long tail of unlikely tokens, reducing incoherent or irrelevant outputs.</p>
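<div class="example"><strong>Sketch:</strong> Top-K filtering before sampling; K and the probabilities are illustrative.
<pre><code>import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k):
    top = np.argsort(probs)[-k:]      # indices of the k most likely tokens
    p = np.zeros_like(probs)
    p[top] = probs[top]
    p /= p.sum()                      # renormalize over the survivors
    return rng.choice(len(probs), p=p)

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
print(top_k_sample(probs, k=2))       # only token 0 or 1 can be drawn
</code></pre></div>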
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Top-P (Nucleus) Sampling</h3>
<p>Only consider the smallest set of most-likely tokens whose cumulative probability reaches P. More adaptive than Top-K — automatically adjusts the number of candidates.</p>
<div class="example"><strong>Top-P = 0.9:</strong> Include the smallest set of tokens that together cover 90% probability mass.</div>
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Greedy Decoding</h3>
<p>Always pick the most likely next token. Fastest but can get stuck in repetitive loops. Often produces the most coherent output for factual tasks.</p>
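<div class="example"><strong>Sketch:</strong> In code, greedy decoding is just an argmax over the next-token scores.
<pre><code>import numpy as np

logits = np.array([2.0, 1.0, 0.1])   # scores for three candidate tokens
next_token = int(np.argmax(logits))  # always pick the single most likely token
print(next_token)                    # 0
</code></pre></div>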
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Beam Search</h3>
<p>Instead of picking the single best token at each step, keep the top B sequences and pick the best overall. Better quality but slower.</p>
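<div class="example"><strong>Sketch:</strong> A toy beam search where each step's log-probabilities come from a fixed random table rather than a real model, which would condition on the prefix.
<pre><code>import numpy as np

def beam_search(step_logprobs, beam_width=3):
    beams = [([], 0.0)]                            # (token sequence, total log-probability)
    for logprobs in step_logprobs:                 # one row of log-probs per step
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in enumerate(logprobs)]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]            # keep only the best B sequences
    return beams[0]                                # best overall sequence and its score

rng = np.random.default_rng(0)
table = np.log(rng.dirichlet(np.ones(5), size=4))  # 4 steps, 5-token vocabulary
print(beam_search(table, beam_width=3))
</code></pre></div>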
</div>
<div class="def-card">
<span class="category">Sampling</span>
<h3>Logits</h3>
<p>The raw, unnormalized scores the model outputs for each token before softmax. Can be adjusted for bias correction, repetition penalties, and custom sampling.</p>
</div>
<h2 class="section-title">Evaluation Metrics</h2>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Perplexity</h3>
<p>Measures how "surprised" the model is by test data. Lower is better. A perplexity of 100 means the model is as confused as choosing uniformly from 100 options.</p>
<div class="example"><strong>Example:</strong> Perplexity 5 on a language model means, on average, it's as uncertain as picking from 5 equally likely options at each step.</div>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Accuracy</h3>
<p>Percentage of correct predictions. Simple but can be misleading for imbalanced datasets.</p>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Precision & Recall</h3>
<p>Precision = of all positive predictions, how many were correct? Recall = of all actual positives, how many did we find?</p>
<div class="example"><strong>Spam filter:</strong> High precision = few legitimate emails flagged. High recall = few spam emails missed.</div>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>F1 Score</h3>
<p>The harmonic mean of precision and recall. A single metric that balances both.</p>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>BLEU / ROUGE</h3>
<p>Metrics for evaluating text generation quality by comparing model output to reference text. BLEU counts n-gram overlap (used for translation). ROUGE is similar but common for summarization.</p>
</div>
<div class="def-card">
<span class="category">Metrics</span>
<h3>Tokens per Second (TPS)</h3>
<p>How many tokens the model generates per second. Measures inference speed. Typical range: 20-100+ TPS depending on model size and hardware.</p>
</div>
<h2 class="section-title">Key Formulas</h2>
<table class="glossary-table">
<thead>
<tr><th>Concept</th><th>Formula</th><th>What it means</th></tr>
</thead>
<tbody>
<tr><td>Attention</td><td>softmax(QKᵀ/√dₖ)V</td><td>Weigh inputs by relevance</td></tr>
<tr><td>Cross-Entropy Loss</td><td>-Σ yᵢ log(pᵢ)</td><td>Penalizes wrong predictions</td></tr>
<tr><td>Softmax</td><td>exp(xᵢ) / Σⱼ exp(xⱼ)</td><td>Converts scores to probabilities</td></tr>
<tr><td>ReLU</td><td>max(0, x)</td><td>Activation: passes positive values only</td></tr>
<tr><td>Layer Norm</td><td>(x - μ) / σ × γ + β</td><td>Normalizes per-sample activations</td></tr>
<tr><td>F1 Score</td><td>2 × (P×R)/(P+R)</td><td>Harmonic mean of precision & recall</td></tr>
<tr><td>Perplexity</td><td>2^(cross-entropy)</td><td>Effective branching factor</td></tr>
</tbody>
</table>
</div>
<footer>AI Cheat Sheet &mdash; A learning reference for artificial intelligence</footer>
</body>
</html>