Architectures and families of AI models — what they are and what they do.
**Large language models (LLMs).** Neural networks based on the transformer architecture, trained on massive text corpora. They learn to predict the next token given a sequence of previous tokens, and that single objective is enough to make them fluent across a wide range of language tasks.
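To make "predict the next token" concrete, here is a minimal sketch of the training signal in NumPy: a softmax turns the model's scores into a probability distribution over the vocabulary, and the loss penalizes low probability on the token that actually came next. The logits here are random stand-ins, not real model output.

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy loss for predicting the next token.

    logits: unnormalized scores over the vocabulary for one position.
    target_id: index of the token that actually came next.
    """
    # Softmax turns scores into a probability distribution.
    shifted = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # Training minimizes -log P(actual next token).
    return -np.log(probs[target_id])

vocab_size = 50_000
logits = np.random.randn(vocab_size)         # stand-in for a model's output
print(next_token_loss(logits, target_id=42))
```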
**Encoder-only transformers.** Transformers designed to understand input rather than generate text. Used for classification, sentiment analysis, and embedding generation.
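One common way to get a single embedding out of an encoder is to mean-pool its per-token vectors; a minimal sketch, with random arrays standing in for real encoder output:

```python
import numpy as np

# Suppose an encoder produced one hidden vector per token (shape: tokens x dim).
# Mean-pooling over the real (non-padding) tokens yields a sentence embedding.
token_vectors = np.random.randn(12, 768)   # stand-in for real encoder output
attention_mask = np.ones(12)               # 1 = real token, 0 = padding

masked = token_vectors * attention_mask[:, None]
sentence_embedding = masked.sum(axis=0) / attention_mask.sum()
print(sentence_embedding.shape)            # (768,)
```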
**Decoder-only transformers.** Transformers designed to generate text autoregressively, one token at a time. This is the dominant architecture for modern LLMs.
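Autoregressive generation is just a loop: score the next token, append it, feed the longer sequence back in. A greedy-decoding sketch; `model` is a placeholder callable, not any specific library's API:

```python
import numpy as np

def greedy_decode(model, prompt_ids, max_new_tokens=20, eos_id=0):
    """Generate tokens one at a time, feeding each prediction back in."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                # scores for the next token
        next_id = int(np.argmax(logits))   # greedy: pick the top-scoring token
        ids.append(next_id)
        if next_id == eos_id:              # stop at end-of-sequence
            break
    return ids

# Toy "model": random logits over a 100-token vocabulary.
toy_model = lambda ids: np.random.randn(100)
print(greedy_decode(toy_model, [5, 17, 3]))
```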
**Encoder-decoder transformers.** Transformers with both an encoder and a decoder, used for tasks that transform an input sequence into an output sequence, such as translation and summarization.
**Convolutional neural networks (CNNs).** Neural networks with layers that scan images using small filters, detecting edges, textures, and patterns hierarchically. They were the backbone of computer vision for years.
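The core operation is sliding a small filter across the image and taking a dot product at each position. A minimal 2D convolution with a classic edge-detection filter (a sketch, valid padding only):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`, computing a dot product at each position."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.random.rand(8, 8)
edge_filter = np.array([[1., 0., -1.],   # a classic vertical-edge detector
                        [2., 0., -2.],
                        [1., 0., -1.]])
print(conv2d(image, edge_filter).shape)  # (6, 6)
```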
**Vision transformers (ViTs).** The transformer architecture applied to images by treating image patches as tokens. At scale, they often outperform CNNs.
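The "patches as tokens" step is simple: cut the image into fixed-size squares and flatten each one into a vector. A sketch with ViT-Base-like sizes:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Each patch becomes one 'token' of length patch * patch * C,
    ready for a linear projection into the transformer's embedding space.
    """
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    tokens = []
    for r in range(rows):
        for c in range(cols):
            block = image[r*patch:(r+1)*patch, c*patch:(c+1)*patch]
            tokens.append(block.reshape(-1))
    return np.stack(tokens)

image = np.random.rand(224, 224, 3)
print(patchify(image).shape)   # (196, 768): a 14x14 grid of patch tokens
```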
**Diffusion models.** Models that generate images by iteratively denoising random noise. This is the approach behind most state-of-the-art image generators.
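Sampling has a simple skeleton: start from pure noise and repeatedly apply a learned denoising step. The sketch below keeps only that structure; the noise schedule and variance terms of a real diffusion model are omitted, and the denoiser is a toy stand-in:

```python
import numpy as np

def sample(denoise_step, steps=50, shape=(64, 64)):
    """Skeleton of diffusion sampling: start from pure noise, then
    repeatedly apply a learned denoising step until an image remains.
    `denoise_step` stands in for a trained model.
    """
    x = np.random.randn(*shape)            # pure Gaussian noise
    for t in reversed(range(steps)):       # high noise level -> low
        x = denoise_step(x, t)
    return x

# Toy denoiser: just shrink the noise a little each step.
toy_step = lambda x, t: 0.95 * x
print(sample(toy_step).std())              # far below the initial 1.0
```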
**Multimodal models.** Models that process multiple input types (text, images, audio) and can generate outputs across modalities.
**Generative adversarial networks (GANs).** Two networks compete: a generator creates fake data, and a discriminator tries to detect the fakes. Over time both improve, until the generator's outputs are indistinguishable from real data.
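The competition shows up directly in the two losses. A sketch with plain numbers standing in for network outputs: the discriminator is rewarded for labeling real as real and fake as fake, while the generator is rewarded when its fakes get labeled real:

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for a single probability."""
    eps = 1e-7
    return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))

# Stand-ins for the two networks' outputs (real systems compute these
# with neural nets; here they are just numbers for illustration).
d_on_real = 0.9    # discriminator's probability that a real sample is real
d_on_fake = 0.3    # discriminator's probability that a generated sample is real

# Discriminator wants real -> 1 and fake -> 0.
d_loss = bce(d_on_real, 1) + bce(d_on_fake, 0)
# Generator wants the discriminator to call its fakes real (-> 1).
g_loss = bce(d_on_fake, 1)
print(d_loss, g_loss)
```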
**VQ-VAEs (vector-quantized variational autoencoders).** Autoencoders combined with discrete codebooks to learn compressed, token-like representations. Often used as a foundation for autoregressive generation.
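The codebook step is a nearest-neighbor lookup: each continuous encoder vector is snapped to the closest learned code, and the code's index becomes a discrete token. A sketch:

```python
import numpy as np

def quantize(z, codebook):
    """Snap each encoder output vector to its nearest codebook entry.

    z:        (n, d) continuous vectors from the encoder
    codebook: (K, d) learned discrete codes
    Returns the code indices (the 'tokens') and the quantized vectors.
    """
    # Squared distances between every z and every codebook entry.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.random.randn(512, 64)   # K=512 codes of dimension 64
z = np.random.randn(10, 64)           # 10 encoder outputs
idx, zq = quantize(z, codebook)
print(idx[:5], zq.shape)              # discrete tokens, (10, 64)
```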
**Normalizing flows.** Models that learn a reversible transformation between data and noise, enabling exact likelihood computation and fast generation.
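Reversibility is what makes the likelihood exact: invert the flow to recover the noise variable, then apply the change-of-variables formula. A one-dimensional sketch with a single affine layer:

```python
import numpy as np

# A single affine "flow" layer: x = a * z + b, which is trivially invertible.
a, b = 2.0, 0.5

def log_likelihood(x):
    """Exact log p(x) via the change-of-variables formula:
    log p(x) = log N(z; 0, 1) + log |dz/dx|, with z = (x - b) / a.
    """
    z = (x - b) / a                                   # invert the flow exactly
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))        # standard normal density
    log_det = -np.log(abs(a))                         # |dz/dx| = 1/|a|
    return log_pz + log_det

print(log_likelihood(1.3))
```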
**Recurrent neural networks (RNNs).** Networks that process sequences step by step, maintaining a hidden state that summarizes everything seen so far. Largely replaced by transformers, but still used in some applications.
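Each step folds the next input into the hidden state, so the state is a running summary of the sequence. A vanilla RNN step in NumPy:

```python
import numpy as np

def rnn_step(h, x, W_h, W_x, b):
    """One step of a vanilla RNN: fold the new input into the hidden state."""
    return np.tanh(W_h @ h + W_x @ x + b)

dim_h, dim_x = 16, 8
W_h = np.random.randn(dim_h, dim_h) * 0.1
W_x = np.random.randn(dim_h, dim_x) * 0.1
b = np.zeros(dim_h)

h = np.zeros(dim_h)                      # hidden state starts empty
for x in np.random.randn(5, dim_x):      # process a 5-step sequence
    h = rnn_step(h, x, W_h, W_x, b)      # h now summarizes all inputs so far
print(h.shape)
```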
**Mixture-of-experts (MoE) models.** Models with multiple "expert" subnetworks and a routing mechanism that selects which experts handle each input, enabling very large models that remain computationally efficient at inference.
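Routing is the key trick: a small router scores all experts, but only the top-k actually run. A sketch with tiny linear maps standing in for the expert networks:

```python
import numpy as np

def moe_forward(x, experts, router_w, k=2):
    """Route input x to the top-k experts and mix their outputs.

    Only k of the experts actually run, which is why a model with many
    experts can stay cheap at inference.
    """
    scores = router_w @ x                        # router's score per expert
    top = np.argsort(scores)[-k:]                # indices of the top-k experts
    exp = np.exp(scores[top] - scores[top].max())
    weights = exp / exp.sum()                    # softmax over chosen experts
    return sum(w * experts[i](x) for i, w in zip(top, weights))

dim, n_experts = 32, 8
# Each "expert" is a tiny linear map here; in real models, an MLP block.
mats = [np.random.randn(dim, dim) * 0.1 for _ in range(n_experts)]
experts = [(lambda M: (lambda x: M @ x))(M) for M in mats]
router_w = np.random.randn(n_experts, dim) * 0.1

print(moe_forward(np.random.randn(dim), experts, router_w).shape)
```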
**Retrieval and embedding models.** Models designed specifically for semantic search: finding the most relevant documents for a query in a large corpus by comparing vector embeddings.
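Retrieval then reduces to nearest-neighbor search in embedding space, typically by cosine similarity. A sketch with random vectors standing in for real embeddings:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                              # cosine similarity per document
    return np.argsort(sims)[::-1][:k]         # indices of the best matches

# Stand-ins for embeddings a retrieval model would produce.
docs = np.random.randn(1000, 384)
query = np.random.randn(384)
print(top_k(query, docs))                     # indices of the 3 closest docs
```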
**Small language models (SLMs).** Compact language models (under roughly 7B parameters) optimized for edge devices and low-latency applications. They are getting remarkably capable.
| Model | Type | Best For |
|---|---|---|
| GPT-4 / GPT-4o | Decoder LLM | General-purpose reasoning, coding, multimodal |
| Claude 3.5 | Decoder LLM | Long-context analysis, coding, writing |
| Gemini 1.5 Pro | Decoder LLM | Massive context windows, multimodal |
| Llama 3 | Decoder LLM | Open-source, self-hosting, fine-tuning |
| Mistral Large | Decoder LLM (dense) | High-quality reasoning, multilingual |
| Stable Diffusion | Diffusion | Image generation, open-source |
| CLIP | Encoder (Vision+Text) | Image-text matching, embeddings |
| BERT | Encoder | Text classification, search, NLU |
| Whisper | Encoder-Decoder | Speech recognition, transcription |
| TTS models | Decoder | Text-to-speech, voice synthesis |