Skip to content

Generative AI Tools

Understanding the stack, tools, and models powering modern Generative AI

What Is Generative AI?

Generative AI is no longer just a chatbot interface with input and output like ChatGPT — it has evolved into agentic tools integrated into apps, and full AI assistant platforms.


From Prompt to Response

The path from your text to the model’s answer, end to end. The model emits one token at a time and feeds it back — the loop is what makes generation autoregressive.

Option A — inline SVG (theme-aware, adapts to light/dark):

Promptyour inputTokenizetext → tokensEmbeddings+ context windowTransformerblocks · attentionNext tokensampling · temp.Detokenize→ Responseautoregressive — repeat per token until stop

Tools / MCP: when the prompt needs exact data or an action, the model emits a tool call instead of prose; the tool (code, search, DB) runs deterministically and its result re-enters the context for the next tokens.


Patterns in High-Dimensional Space

A model is closer to a brain completing patterns than to a database lookup or a calculator. It doesn’t retrieve a stored answer; it recognizes patterns and decides the next token by sampling from a probability distribution it computes on the fly. That makes output non-deterministic (the same prompt can vary; temperature dials the spread) — but the randomness is the last step of a decision, not noise for its own sake.

The impressive part is where that decision happens. Knowledge isn’t stored as text or rules — it’s encoded as high-dimensional vectors (embeddings). Every token becomes a tensor with thousands of dimensions (hidden sizes of ~4k–16k+), and meaning is geometry: directions and distances between vectors encode concepts and relations — the classic king − man + woman ≈ queen. A forward pass is a trajectory through this space — the residual stream carrying the state through each layer — that finally projects to a probability over the vocabulary. The “thought” is that geometric shape settling into a representation; the response is its readout.

On the geometry: the working space is a large, mostly Euclidean tensor that encodes the abstraction of what the model learned. Hierarchical structure (taxonomies, trees) is captured more naturally by hyperbolic geometry — the Poincaré-ball / hyperboloid (Lorentz) models, where volume grows exponentially so trees embed with little distortion. That’s an active research direction (Poincaré embeddings), not the mainstream default — but it’s the right intuition for the “curved high-dimensional surface” the model reasons over.

How It’s Trained

The geometry above is learned, in two stages:

  • Pre-training — self-supervised next-token prediction over trillions of tokens of text/code. No labels: the objective is “predict what comes next,” and from that pressure the model builds its representation of language and the world. Output: a raw base model (knowledgeable, not yet helpful or safe).
  • Post-training (alignment) — turns the base model into a usable assistant:
    • SFT (supervised fine-tuning) — imitate curated instruction/response examples.
    • RLHF — train a reward model on human preference rankings, then optimize the policy against it (PPO).
    • DPO — optimize directly on preference pairs, skipping the separate reward model and RL loop.
    • RLAIF / Constitutional AI — replace human raters with AI feedback against a written set of principles.
    • RLVR (reinforcement learning from verifiable rewards) — reward correct math/code/tool results; this is what drives modern reasoning models.
    • Distillation — train a smaller model to mimic a larger one’s outputs.

Patterns Aren’t Proofs

Because answers are pattern-completions, a confident, fluent sequence can still be wrong — this is hallucination. And even at temperature 0 it’s only near-deterministic (floating-point and hardware drift). So for anything exact — math, dates, lookups, code execution — the model calls deterministic tools: probabilistic reasoning on top, exact computation underneath.


The AI Stack

A frontier AI system is a vertical stack — silicon at the bottom, the app on top. Two ecosystems run in parallel: closed / proprietary ones that trade portability for turnkey convenience and lock-in and open-source layers you can self-host and swap freely. The map below is the tooling at a glance, top of stack to bottom.

Stack Layer Closed-Source Open-Source
Web Apps Chat: ChatGPT Claude Gemini Grok
Media: Magnific ElevenLabs Higgsfield
Chat: DeepSeek Qwen z.ai Kimi
Media: ComfyUI Cloud
Code Editors Antigravity Cursor Windsurf VS Code Zed VSCodium
Agents / CLI Claude Code Devin GitHub Copilot Codex OpenCode Cline Hermes
AI Models LLMs: Claude · GPT-5 · Gemini 3 · Kimi K2 · Grok
Media: Midjourney · Ideogram 3 · Seedance 2 · Veo 3 · ElevenLabs
LLMs: Llama 4 · Qwen3 · DeepSeek-V4 · GLM-5 · MiniMax M2 · MiMo · Mistral
Media: FLUX.3 · LTX 2.3 · Stable Diffusion 3.5 · HunyuanVideo · Wan · Whisper · Kokoro
Model Hosting OpenAI Anthropic Vertex AI Hugging Face OpenRouter Replicate Vast.ai
Runtimes Cloud infrastructure llama.cpp Ollama LM Studio vLLM GPT4All Triton ComfyUI
Vector DB / Memory Pinecone Vertex AI Search Qdrant pgvector MongoDB SQLite
Infra / Containers AWS Azure Google Cloud Docker Podman Terraform

Below the badges: the foundation layers share no clean tooling icons. Open — corpora (FineWeb, RedPajama, The Stack), alignment (TRL, Alignment Handbook), frameworks (PyTorch, JAX, DeepSpeed, Megatron-LM), accelerators (NVIDIA H100/B200, AMD MI300X, Apple Silicon). Closed — undisclosed data, proprietary RLHF / Constitutional AI, internal orchestration, cloud ASICs (TPU, Trainium, Axion). Every open layer can be swapped or self-hosted; closed stacks are vertically integrated — lower friction, heavier lock-in.

On models: every entry is the same transformer (plus diffusion for pixels/audio) — only training domain and size change. Coding and multimodal LLMs are where the labs compete hardest. Run open models locally with Ollama (text), ComfyUI (image/video/audio — the Ollama of pixels), and whisper.cpp (speech).

What Is Mostly Standardized

  • Markdown-first text output
  • SSE + JSON delta streaming
  • Markdown -> AST -> component render path
  • MCP as practical tool-calling standard

The middle layers have converged; real differences concentrate in model behavior, context reliability, product UX, and ecosystem lock-in. For the provider-by-provider comparison — and what each chatbot can actually do (web search, Canvas/Artifacts, Mermaid, maps) — see AI Chatbot Platforms.

The Minimal Stack

Four layers — model, runtime, agent, hardware — but two tools cover the software:

ollama run qwen3.6    # runtime + model
opencode              # optional agent

Prompt → tokens → embeddings → stacked Transformer blocks (self-attention + feed-forward, mostly GEMM); the runtime schedules the ops, the chip runs the multiply-adds. Model = numbers, runtime = recipe, hardware = executor. Everything else is optional.


Context, Memory, Connectors, Skills, Plugins & the Harness

A model on its own only knows its training data and what fits in the current prompt. A handful of mechanisms extend it — and one (the harness) wires them together — turning a chatbot into an agent:

  • Context — the token window the model sees this turn (prompt, files, tool output, history). Finite (often 200K–1M tokens); fill it with what’s relevant, everything else gets truncated or summarized. Think RAM: fast, small, wiped each turn.
  • Memory — state that persists across turns/sessions beyond the window: scratch files, conversation summaries, or a vector DB for retrieval (RAG). Think disk: it persists.
  • Connectors (MCP) — the Model Context Protocol, a standard wire format for exposing tools, data, and actions to any model. One MCP server (GitHub, Postgres, filesystem, Slack…) works across Claude, ChatGPT, and IDEs — no per-app glue. Think I/O bus, and it’s the one that’s converged: write a tool once, every model can call it.
  • Skills — folders of instructions, scripts, and resources the agent loads on demand to do a specific task its way (skills.sh, anthropics/skills, openai/skills). Just files — portable across agents, version-controlled, no code to wire in. Think a how-to manual the agent reads when the task calls for it.
  • Plugins — packaged bundles of tools/skills/prompts you install into a host app, built on MCP or a host’s own API (ChatGPT apps, Claude/IDE plugins, editor extensions). Think installed apps.
  • Harness (agent loop) — the runtime around the model that wires the rest together: it assembles the context, sends it, parses any tool/MCP calls, runs them, feeds results back, and repeats until the task is done. The model predicts tokens; the harness is what turns that into an agent. Think the event loop of the whole system.

What You Actually Pay For

The software is mostly free. Open weights, runtimes (Ollama, llama.cpp, vLLM), MCP, and most of the tooling above cost nothing to download and run. The money goes to compute and the managed service around it:

  • Renting compute — GPUs/TPUs are expensive and a large model needs a lot of them. You either rent by the token (a hosted API) or rent the machine by the hour (Vast.ai, Replicate, cloud GPUs). The bill tracks usage.
  • The managed platform — a subscription (ChatGPT Plus, Claude Pro, Gemini AI Pro) bundles inference plus everything around it: uptime, scaling, guardrails, the app/UI, support, and the ongoing cost of training, post-training (RLHF), and the people who do it.

Run an open model on hardware you already own and the software is free — you only pay the electricity. The moment someone else runs it for you, you’re paying rent on their compute and their service, not for the model itself.

Three pricing shapes: self-host (free software, your hardware) · per-token API (rent compute by usage) · subscription (rent the whole managed platform).

How to Choose

Pick where the work runs, then the smallest model that clears the task:

  • Local-first — privacy, cost control, or offline matter most. Ollama + an open model gets you running in one command; you trade some quality for full control.
  • Cloud-first — setup speed and top-end quality matter most. A frontier API (Claude, GPT-5, Gemini) is the fastest path to the best answers, at the cost of per-token billing and lock-in.
  • Hybrid — the common real-world default: a local loop for the cheap, high-volume work and a cloud fallback for the hard cases.

The Takeaway

The model layer is no longer the bottleneck — open weights trail the frontier by months, not years, and the stack between them has converged on the same handful of standards (MCP, SSE streaming, markdown I/O). What separates a demo from a product now is integration: how you manage context, where you keep memory, which tools you wire in, and how reliably they run. The models are commodities; the system you build around them is the work.

References