Model Types

Architectures and families of AI models — what they are and what they do.

Language Models

LLM (Large Language Model)

Neural networks based on the transformer architecture, trained on massive text corpora. They predict the next token given a sequence, enabling fluency in language tasks.

Examples: GPT-4, Claude, Gemini, Llama 3, Mistral, Qwen
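The next-token loop described above can be sketched in a few lines. This is a toy stand-in (a fixed bigram logit table replaces the transformer, and `generate` uses greedy decoding), not a real model:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy "language model": a fixed bigram logit table over a 4-token vocabulary.
# A real LLM computes these logits with a transformer; the loop is the same.
rng = np.random.default_rng(0)
bigram_logits = rng.normal(size=(4, 4))  # logits[prev_token] -> next-token scores

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        logits = bigram_logits[tokens[-1]]    # condition on the last token
        probs = softmax(logits)               # scores -> probability distribution
        tokens.append(int(np.argmax(probs)))  # greedy decoding: take the argmax
    return tokens

out = generate([0], 5)
```

In practice the argmax is usually replaced by temperature-controlled sampling, but the autoregressive structure is identical.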
Encoder-Only Models

Transformers designed to understand input (not generate text). Used for classification, sentiment analysis, and embedding generation.

Examples: BERT, RoBERTa, DeBERTa
Decoder-Only Models

Transformers designed to generate text autoregressively — the dominant architecture for modern LLMs.

Examples: GPT series, Claude, Llama, Mistral
Encoder-Decoder Models

Transformers with both encoder and decoder, used for tasks that transform input to output (translation, summarization).

Examples: T5, BART, Flan-T5

Vision Models

CNN (Convolutional Neural Network)

Neural networks with layers that scan images with small filters, detecting edges, textures, and patterns hierarchically. The backbone of computer vision for years.

Examples: ResNet, EfficientNet, VGG
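The "small filters scanning the image" idea is just 2-D convolution. A minimal sketch with a hand-written Sobel filter (real CNNs learn their filters and stack many layers):

```python
import numpy as np

# A single convolutional filter sliding over an image ("valid" convolution).
def conv2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# An image that is dark on the left half, bright on the right half.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

# Sobel-x kernel: responds strongly to vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
edges = conv2d(img, sobel_x)  # nonzero only near the dark/bright boundary
```

Stacking learned filters like this one, with nonlinearities and pooling in between, is what lets a CNN go from edges to textures to whole objects.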
ViT (Vision Transformer)

Transformers applied to images by treating fixed-size image patches as tokens. Often outperform CNNs at scale.

Examples: CLIP, DINOv2, ViT-Base
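The "patches as tokens" step is simple to sketch: cut the image into non-overlapping patches and flatten each one. A real ViT then linearly projects these vectors, adds position embeddings, and feeds them to a transformer:

```python
import numpy as np

# Split an image into non-overlapping patches and flatten each into a
# "token" vector -- the input representation a Vision Transformer uses.
def patchify(image, patch):
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append(image[i:i+patch, j:j+patch].reshape(-1))
    return np.stack(tokens)  # shape: (num_patches, patch * patch * c)

img = np.arange(16 * 16 * 3, dtype=float).reshape(16, 16, 3)
tokens = patchify(img, patch=8)  # 4 patches of 8x8x3 = 192 values each
```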
Diffusion Models

Models that generate images by iteratively denoising random noise. The approach behind most state-of-the-art image generators.

Examples: Stable Diffusion, DALL-E 3, Midjourney
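The sampling loop looks like this. Note the denoiser here is a hand-written stand-in that pulls samples toward a known target; in a real diffusion model it is a trained network that predicts the noise to remove at each step:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.ones((8, 8))  # stand-in for the "clean image" the model has learned

def denoise_step(x, t):
    # Stand-in denoiser: nudge x toward the target, gently at high t and
    # more aggressively as t approaches 0. Real models predict noise instead.
    return x + (target - x) / (t + 1)

x = rng.normal(size=(8, 8))  # start from pure random noise
for t in range(50, 0, -1):   # iterate the reverse (denoising) process
    x = denoise_step(x, t)
# After the loop, x has converged close to the clean image.
```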
Multimodal Models

Models that process multiple input types — text, images, audio — and can generate outputs across modalities.

Examples: GPT-4V (vision), Claude 3, Gemini, Qwen-VL

Generative Models

GAN (Generative Adversarial Network)

Two networks compete: a generator creates fake data, and a discriminator tries to detect fakes. Over time both improve, until the generator's outputs are indistinguishable from real data.

Example: Creating photorealistic faces that don't exist (StyleGAN).
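The two competing objectives can be written down directly. A minimal 1-D sketch (logistic-regression discriminator, generator that just shifts noise by a mean; real GANs alternate gradient steps on these two losses):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
real = rng.normal(loc=4.0, size=100)  # "real" data centered at 4
noise = rng.normal(size=100)          # generator input noise

def d_loss(w, b, g_mean):
    fake = noise + g_mean             # generator's samples
    # Discriminator wants D(real) -> 1 and D(fake) -> 0 (binary cross-entropy).
    return (-np.mean(np.log(sigmoid(w * real + b)))
            - np.mean(np.log(1 - sigmoid(w * fake + b))))

# With the generator far from the data, a simple discriminator separates
# real from fake easily (low loss)...
well_separated = d_loss(w=1.0, b=-2.0, g_mean=0.0)
# ...but once the generator matches the real distribution, it cannot.
matched = d_loss(w=1.0, b=-2.0, g_mean=4.0)
```

The generator's loss is the mirror image: it is rewarded when the discriminator labels its fakes as real.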
VQ-VAE (Vector Quantized VAE)

Combines autoencoders with discrete codebooks to learn compressed representations. Used as a foundation for autoregressive generation.

Example: MusicGen (music generation), SoundStream (audio compression)
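The "discrete codebook" part is a nearest-neighbor lookup: each continuous encoder output is snapped to its closest codebook entry, yielding discrete codes an autoregressive model can generate. A minimal sketch with a hand-made codebook (real VQ-VAEs learn both encoder and codebook):

```python
import numpy as np

# Vector quantization: map each latent vector to its nearest codebook entry.
def quantize(latents, codebook):
    # latents: (n, d), codebook: (k, d) -> (code indices, quantized vectors)
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    codes = dists.argmin(axis=1)
    return codes, codebook[codes]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
latents = np.array([[0.9, 1.2], [4.8, 5.1], [0.1, -0.2]])
codes, quantized = quantize(latents, codebook)  # each latent snaps to a code
```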
Flow Models

Models that learn a reversible transformation between data and noise, enabling exact likelihood computation and fast generation.

Examples: Glow, RealNVP; flow matching underpins newer image generators such as Stable Diffusion 3.
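The "exact likelihood" claim comes from the change-of-variables formula. A one-dimensional affine flow makes it concrete (real flows stack many learned invertible layers):

```python
import numpy as np

# Invertible flow z = (x - mu) / sigma, with exact log-likelihood:
#   log p(x) = log N(z; 0, 1) + log |dz/dx| = log N(z) - log sigma
mu, sigma = 2.0, 3.0

def forward(x):
    return (x - mu) / sigma       # data -> noise

def inverse(z):
    return z * sigma + mu         # noise -> data (exact inverse)

def log_prob(x):
    z = forward(x)
    log_base = -0.5 * (z**2 + np.log(2 * np.pi))  # standard normal log-density
    log_det = -np.log(sigma)                       # log-Jacobian of the transform
    return log_base + log_det

x = 5.0  # forward(5.0) == 1.0, and inverse recovers 5.0 exactly
```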

Other Architectures

RNN / LSTM

Recurrent networks that process sequences step-by-step, maintaining a hidden state. Largely replaced by transformers but still used in some applications.

Use case: Time series prediction, speech recognition
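The step-by-step hidden state update is the defining feature. A vanilla RNN cell in a few lines (LSTMs add gating on top of this same loop; the weights here are random, not trained):

```python
import numpy as np

# Vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
rng = np.random.default_rng(0)
d_in, d_hidden = 3, 4
W_x = rng.normal(scale=0.5, size=(d_hidden, d_in))
W_h = rng.normal(scale=0.5, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

def run_rnn(sequence):
    h = np.zeros(d_hidden)
    for x_t in sequence:                      # one step per timestep
        h = np.tanh(W_x @ x_t + W_h @ h + b)  # hidden state carries history
    return h

seq = rng.normal(size=(6, d_in))              # a sequence of 6 timesteps
h_final = run_rnn(seq)                        # summary of the whole sequence
```

The sequential dependency (each step waits for the previous one) is exactly what transformers removed, which is why they train so much faster.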
Mixture of Experts (MoE)

A model with multiple "expert" subnetworks. A routing mechanism selects which experts to use for each input, enabling large models that are computationally efficient at inference.

Examples: Mixtral 8x7B, Switch Transformer, DeepSeek-V3
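The routing mechanism is a small gate plus a top-k selection: every expert exists, but only k of them run per input. A minimal sketch with random toy experts (real MoE layers route per token and add load-balancing losses):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_experts, d = 4, 8
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy experts
gate_w = rng.normal(size=(n_experts, d))                        # router weights

def moe_layer(x, k=2):
    scores = softmax(gate_w @ x)
    top = np.argsort(scores)[-k:]              # indices of the k best experts
    weights = scores[top] / scores[top].sum()  # renormalize over chosen experts
    # Only the selected experts compute anything -- the source of the savings.
    y = sum(w * (experts[i] @ x) for i, w in zip(top, weights))
    return y, top

x = rng.normal(size=d)
y, chosen = moe_layer(x)  # 2 of 4 experts ran for this input
```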
Retrieval Models

Models designed specifically for semantic search — finding the most relevant documents for a query from a large corpus.

Examples: BGE, E5, Cohere embed models
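Once documents and queries are embedded, "finding the most relevant" is a similarity ranking. A sketch with hand-made vectors standing in for a real embedding model's output:

```python
import numpy as np

# Semantic search: rank documents by cosine similarity to the query embedding.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

doc_embeddings = np.array([
    [1.0, 0.0, 0.0],   # doc 0
    [0.0, 1.0, 0.0],   # doc 1
    [0.7, 0.7, 0.0],   # doc 2
])
query = np.array([0.9, 0.1, 0.0])  # closest in direction to doc 0

scores = [cosine(query, d) for d in doc_embeddings]
best = int(np.argmax(scores))
```

At corpus scale, the brute-force loop is replaced by an approximate nearest-neighbor index, but the scoring is the same.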
Small Language Models (SLMs)

Compact language models (under 7B parameters) optimized for edge devices and low-latency applications. Despite their size, they are becoming remarkably capable.

Examples: Phi-3, Gemma 2B, Qwen 1.5B, MicroLlama

Model Comparison

Model | Type | Best For
--- | --- | ---
GPT-4 / GPT-4o | Decoder LLM | General-purpose reasoning, coding, multimodal
Claude 3.5 | Decoder LLM | Long-context analysis, coding, writing
Gemini 1.5 Pro | Decoder LLM | Massive context windows, multimodal
Llama 3 | Decoder LLM | Open-source, self-hosting, fine-tuning
Mixtral 8x7B | MoE LLM | Efficient inference, multilingual
Stable Diffusion | Diffusion | Image generation, open-source
CLIP | Encoder (Vision+Text) | Image-text matching, embeddings
BERT | Encoder | Text classification, search, NLU
Whisper | Encoder-Decoder | Speech recognition, transcription
TTS models | Decoder | Text-to-speech, voice synthesis