Token IN, Token OUT
Introduction
Most people interact with LLMs through a chat interface. You type, hit send, and get a response. But what's actually happening under the hood? This post walks through the complete lifecycle of a single inference request: from tokenization, through GPU execution and token-by-token prediction, to the final response.
The Big Picture
Client Side
The user types a prompt and hits send. That text—raw, unstructured, in natural language—travels over the network to the application layer. The client doesn't care about tokens, context windows, or GPU memory. It just sends text and waits for text back. Everything else is the system's problem.
Application Layer
The application layer is where text becomes actionable. First, it tokenizes your raw text—converting it into token IDs (integers) that the model understands. Then it builds the context window by combining your message with system instructions and conversation history. All of this happens before the GPU ever gets involved.
Once the context window is ready, the application layer sends those token IDs to the GPU runtime. After inference completes, it does the reverse: converts the output token IDs back to readable text and sends it to the client.
Think of it as the translator and coordinator—it speaks human on one end, speaks tokens on the other.
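The application layer's three jobs—tokenize, assemble the context window, detokenize—can be sketched in a few lines. Real systems use a learned subword tokenizer (such as BPE); the tiny word-level vocabulary and the `<sys>`/`<user>` markers below are illustrative stand-ins, not any real model's format.

```python
# Toy sketch of the application layer: text -> token IDs -> text.
VOCAB = {"<sys>": 0, "<user>": 1, "you": 2, "are": 3, "helpful": 4, "hi": 5}
INV_VOCAB = {i: w for w, i in VOCAB.items()}

def tokenize(text: str) -> list[int]:
    """Text -> token IDs (the integers the model understands)."""
    return [VOCAB[word] for word in text.lower().split()]

def detokenize(token_ids: list[int]) -> str:
    """Token IDs -> readable text, the reverse mapping."""
    return " ".join(INV_VOCAB[i] for i in token_ids)

def build_context(system: str, history: list[str], user_msg: str) -> list[int]:
    """Combine system instructions, prior turns, and the new message."""
    ids = [VOCAB["<sys>"]] + tokenize(system)
    for turn in history:
        ids += tokenize(turn)
    return ids + [VOCAB["<user>"]] + tokenize(user_msg)

context = build_context("you are helpful", [], "hi")
print(context)               # [0, 2, 3, 4, 1, 5]
print(detokenize(context))   # <sys> you are helpful <user> hi
```

Everything here is plain integer bookkeeping—no GPU involved yet, exactly as described above.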
GPU Inference
Startup: Loading the Model
Before any user request arrives, the deployed LLM system loads model weights from disk into System RAM, then copies them to GPU VRAM. These weights stay loaded for the entire lifetime of the deployment.
Per-Request: Memory Allocation
When your tokenized input arrives, the GPU allocates a separate chunk of VRAM for the context window. Your token IDs are copied here. This memory is temporary—it exists only for this request.
The Inference Loop
GPU compute now reads from two places: your tokens from the context window, and the model weights (both in VRAM). It predicts the next token, writes it back to the context window, then repeats. Each iteration builds on the previous tokens. The loop stops when the model generates a stop token—its signal that the response is complete.
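The loop itself is simple enough to sketch. In a real system, `predict_next` is a full transformer forward pass over the weights in VRAM; here a hypothetical lookup table stands in for it so the control flow—predict, write back, repeat until stop—is visible.

```python
# Minimal sketch of the autoregressive inference loop.
STOP = -1  # stand-in for the model's stop token ID

# Hypothetical "model": maps the last token seen to the next token.
NEXT = {101: 102, 102: 103, 103: STOP}

def predict_next(context: list[int]) -> int:
    """One 'forward pass': read the context, emit one token."""
    return NEXT[context[-1]]

def generate(prompt_ids: list[int], max_new_tokens: int = 16) -> list[int]:
    context = list(prompt_ids)          # the context window in VRAM
    for _ in range(max_new_tokens):
        token = predict_next(context)   # read context + weights, predict
        if token == STOP:               # stop token: response complete
            break
        context.append(token)           # write it back, then repeat
    return context[len(prompt_ids):]    # return only the new tokens

print(generate([101]))  # [102, 103]
```

Note that each iteration appends to the same context buffer it reads from—this is why generation cost grows with the length of what has already been produced.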
Output & Cleanup
Once inference completes, the GPU has generated tokens sitting in VRAM. The GPU runtime extracts these output tokens and passes them back to the application layer (across the API boundary). The application layer then de-tokenizes them—converting token IDs back to readable text—and sends the response to the client. Finally, the GPU deallocates the context window memory. The model weights remain in VRAM, untouched, waiting for the next request.
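The memory lifecycle described above—weights loaded once and kept resident, a context buffer allocated per request and freed after—can be sketched with plain Python objects standing in for VRAM allocations. The class and method names are illustrative, not any real runtime's API.

```python
# Sketch of the deployment's memory lifecycle: persistent weights,
# temporary per-request context buffers.
class GPURuntime:
    def __init__(self, weights: dict):
        # Startup: weights copied into VRAM once, for the lifetime
        # of the deployment.
        self.weights = weights
        self.context_buffers: dict[int, list[int]] = {}

    def serve(self, request_id: int, token_ids: list[int]) -> list[int]:
        # Per-request: allocate a context window and copy the tokens in.
        self.context_buffers[request_id] = list(token_ids)
        output = self._infer(self.context_buffers[request_id])
        # Cleanup: free the context buffer; the weights stay resident.
        del self.context_buffers[request_id]
        return output

    def _infer(self, context: list[int]) -> list[int]:
        # Placeholder for the inference loop; returns a fixed reply here.
        return [42]

runtime = GPURuntime(weights={"layer0": [0.1, 0.2]})
reply = runtime.serve(request_id=1, token_ids=[5, 6, 7])
print(reply)                    # [42]
print(runtime.context_buffers)  # {} -- freed, while weights remain loaded
```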
Context Window
Your context window size isn't determined by a single factor—it's the minimum of two hard limits.
From the Model Side: The transformer was trained with a maximum sequence length. This limit is baked into the model's mathematical architecture—specifically its positional encoding. You can't exceed it without retraining the entire model.
From the Hardware Side: GPU VRAM has a fixed capacity. Each token requires memory (chiefly for the attention key/value cache), and each inference iteration uses additional memory for intermediate computations. The more tokens you try to process, the more VRAM you need. Run out of memory, and inference fails.
The actual context window = MIN(model's trained max, tokens that fit in VRAM)
For example, if a model was trained with a maximum of 100K tokens, but your GPU only has VRAM to handle 50K tokens, your actual context window is 50K. Conversely, if the model supports 100K but your GPU has capacity for 200K, the model's limit wins—you get 100K.
Why This Matters
Context window size directly affects cost and latency. A larger window means more tokens to process per request—more GPU computation time, more VRAM usage, more infrastructure cost. It also means longer latency for the user. Conversely, a smaller window limits what the model can "see" in a single request, forcing you to either drop context or make multiple requests. Understanding this constraint is critical for designing efficient deployments and managing infrastructure costs.
Note on Hardware: GPU compute capacity and VRAM size are baked into the GPU architecture—you can't separate them. For reference, the NVIDIA B200 (Blackwell, 2024) comes with 192GB of GPU memory and petaFLOPS-scale peak low-precision compute—more than double the previous-generation H100's 80GB capacity.
From Tokens to Meaning
You've walked through the entire lifecycle of an LLM inference request. Text gets tokenized into integers, loaded into GPU memory, and fed through billions of matrix multiplications. The model iteratively predicts the next token until it decides to stop, then the result is converted back to text.
No magic (maybe!), just math on silicon, executing in milliseconds. And somehow, that's enough to change what humans can do, once again!