LLM: Understanding Tokens
Overview
- Understand why tokens are the currency of LLMs
- Learn how tokenisers split words into tokens
- Learn why tokens differ across LLM providers
- Understand how token costs are calculated
- Understand token limits
Introduction
In this post, we take a deeper look at tokens and how they relate to Large Language Models (LLMs). If you have read any API documentation or other technical content about LLMs, you will have come across the term tokens. Tokens are fundamental to LLMs, and understanding them is vital to learning how LLMs work under the hood.
What Are Tokens?
Tokens are the currency of LLMs. When you send a text prompt such as “Hello World!” to an LLM, it gets broken down into three tokens (in OpenAI GPT models). These tokens are then fed into the other components that make up the LLM architecture during inference.
Input:
| Prompt | Tokens | Price |
|---|---|---|
| “Hello World!” | [9906, 4435, 0] | $0.00125 / 1k tokens |
The input prompt “Hello World!” gets broken down into three tokens and billed at some small amount per 1k tokens. The tokens are then fed into the LLM, which does some thinking/reasoning and returns an output.
Output:
| Response | Tokens | Price |
|---|---|---|
| “Hi!” | [13347, 0] | $0.01 / 1k tokens |
Here the LLM outputs “Hi!”, which is made up of two tokens; these are billed at a different rate to the input tokens.
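Putting the two tables together, the billing arithmetic is straightforward. A minimal sketch, using the illustrative per-1k rates from the tables above (these are example prices, not current list prices):
# Illustrative rates from the tables above (example prices, not current list prices)
INPUT_RATE_PER_1K = 0.00125   # $ per 1k input tokens
OUTPUT_RATE_PER_1K = 0.01     # $ per 1k output tokens

input_tokens = 3    # "Hello World!" -> [9906, 4435, 0]
output_tokens = 2   # "Hi!" -> [13347, 0]

cost = (input_tokens / 1000) * INPUT_RATE_PER_1K + (output_tokens / 1000) * OUTPUT_RATE_PER_1K
print(f"${cost:.8f}")  # $0.00002375, a tiny fraction of a cent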
We can actually see how words are converted into tokens using the tiktoken library. tiktoken is a tokeniser used by OpenAI’s models.
# setup
uv init
uv add tiktoken
uv run main.py
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
print(encoding.encode("Hello World!"))
Output:
[9906, 4435, 0]
This process is also reversible: we can take the encoded input tokens and decode them to get the original words back:
print(encoding.decode([9906, 4435, 0]))
Output:
Hello World!
Let’s look at some real-world text examples with their approximate token counts (counts like these can be reproduced with the snippet that follows the examples):
"To be, or not to be, that is the question." = 13 tokens
President Barack Obama's Inaugural Address = 2845 tokens
The US Declaration of Independence = 1695 tokens
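These counts are easy to reproduce for any text: encode it and take the length of the resulting token list. A small sketch, again assuming the cl100k_base encoding (other encodings will give slightly different counts):
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    # Number of tokens cl100k_base produces for this text
    return len(encoding.encode(text))

print(count_tokens("To be, or not to be, that is the question."))  # 13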
An important thing to understand is that LLMs do not understand or process words; they only understand and process tokens. Language models do not see text the way humans do; instead, they see a sequence of numbers.
How Words Are Broken into Tokens
Words are broken up into tokens by a tokeniser; the tiktoken library implements the Byte Pair Encoding (BPE) tokeniser used by OpenAI’s models. There are many different types of tokenisers (such as SentencePiece or WordPiece), each with their own advantages and disadvantages, and different LLMs use different tokenisers. The goal of the tokeniser is to convert text into a compact sequence of tokens that helps the model generalise and better understand grammar.
It is important to understand that there is not a simple one-to-one mapping of words to tokens; the tokenisation process is a little more intelligent than that. Let’s have a look at a simple example (using the BPE tokeniser):
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
for word in ['jumping', 'burning', 'encoding']:
    tokens = encoding.encode(word)
    print(f"{word} -> {tokens}")
Output:
jumping -> [44396, 287]
burning -> [22464, 287]
encoding -> [17600]
From the output we can see that “jumping” is broken up into two tokens, “jump” (token 44396) and “ing” (token 287), and the same is true for “burning”. But “encoding” is converted to a single token (17600). Why is this? As stated earlier, tokenisers are a little more intelligent than a simple word-to-token mapping. Tokenisers like BPE build their vocabulary from how frequently character sequences appear in the training data, so the statistics of that data shape which tokens exist. In the example above, “encoding” appears very frequently in text (especially in technical/programming contexts), so it was merged into a single token during training; “jumping” and “burning” are less common, so they remain split into two tokens.
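One way to see these subword splits is to decode each token ID on its own; with cl100k_base you should get pieces back roughly like “jump” + “ing”:
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

for word in ['jumping', 'burning', 'encoding']:
    tokens = encoding.encode(word)
    # Decode each token ID individually to reveal the subword pieces
    pieces = [encoding.decode([t]) for t in tokens]
    print(f"{word} -> {pieces}")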
Another important thing to understand is that tokenisation is deterministic and consistent, i.e. “Hello” will always map to the same token ID (e.g. 9906) during both training and inference. The token vocabulary is fixed after training, so the same text always produces the same token IDs when using that model’s tokeniser.
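A quick sanity check of that determinism: encoding the same string twice always yields identical token IDs for a given encoding:
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

first = encoding.encode("Hello")
second = encoding.encode("Hello")
assert first == second  # same text, same encoding, same token IDs every time
print(first)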
Tokens Are Not Unique Across Different LLMs
Tokens are not unique across LLMs, because different models use different tokenisers and are trained on different datasets. The same text will typically be split into a different number, and different kinds, of tokens across models like GPT, Claude, and Gemini. “encoding” may or may not appear as a single token, depending on how frequently it appears in that model’s training dataset, and even if it does appear as a single token, it will almost certainly have a different token ID. So the same input prompt will use a different number of tokens across LLM providers.
Let’s see an example: we will call various LLM API providers and see how each one splits the same input prompt into a different number of tokens.
uv init
uv add openai anthropic google-genai
import os

import anthropic
from openai import OpenAI
from google import genai
from google.genai import types

# Read API keys from environment variables
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]


def openai_api(prompt: str) -> None:
    print("=== OpenAI ===")
    client = OpenAI(api_key=OPENAI_API_KEY)
    completion = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
    )
    print(f"Response: {completion.choices[0].message.content}")
    print(f"Input Tokens: {completion.usage.prompt_tokens}")
    print(f"Output Tokens: {completion.usage.completion_tokens}")


def claude_api(prompt: str) -> None:
    print("=== Claude API ===")
    claude_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
    claude_response = claude_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"Response: {claude_response.content[0].text}")
    print(f"Input tokens: {claude_response.usage.input_tokens}")
    print(f"Output tokens: {claude_response.usage.output_tokens}")


def gemini_api(prompt: str) -> None:
    print("=== Gemini API ===")
    client = genai.Client(api_key=GEMINI_API_KEY)
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=prompt,
        # The google-genai SDK takes generation options via a config object
        config=types.GenerateContentConfig(max_output_tokens=500),
    )
    print(f"Response: {response.text}")
    print(f"Input tokens: {response.usage_metadata.prompt_token_count}")
    print(
        f"Output tokens: "
        f"{response.usage_metadata.candidates_token_count}"
    )


prompt = "Explain quantum physics in a tweet length message."
openai_api(prompt)
claude_api(prompt)
gemini_api(prompt)
Output:
$ uv run main.py
=== OpenAI ===
Response: Quantum physics studies how tiny particles like atoms and photons
behave, revealing a weird world where things can be in many states at once and
only become definite when observed.
Input Tokens: 16
Output Tokens: 32
=== Claude API ===
Response: Quantum physics: particles exist in multiple states simultaneously
until observed, can be "entangled" across vast distances, and behave as both
waves and particles. Reality is probabilistic, not deterministic—the universe is
far weirder than it appears! 🌊⚛️
Input tokens: 17
Output tokens: 64
=== Gemini API ===
Response: Quantum physics: At the subatomic level, reality is a blur of
probability. Particles can exist in multiple states at once (superposition),
behave like waves, and stay instantly linked across space (entanglement).
Nothing is certain until it’s observed. ⚛️🌌 #Physics
Input tokens: 10
Output tokens: 57
We make three API calls to three different models with the input prompt "Explain quantum physics in a tweet length message." (eight words plus a full stop). From the output above, we can see that this prompt was converted into sixteen input tokens by OpenAI, seventeen by Claude and ten by Gemini.
Hidden Tokens
If we use the tiktoken library again to tokenise the input prompt we used above, you will notice something odd:
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
print(encoding.encode("Explain quantum physics in a tweet length message."))
Output:
[849, 21435, 31228, 22027, 304, 264, 12072, 3160, 1984, 13]
When we made the API call to OpenAI above, it reported using sixteen input tokens (completion.usage.prompt_tokens). But from the output of the tokeniser above (the same tokeniser library used by OpenAI’s models), we can see the same input prompt was converted into only ten tokens. Why the discrepancy?
This happens because the OpenAI API’s token count includes hidden system tokens in addition to our actual prompt. When we use tiktoken locally on just the prompt text, we get ten tokens (the actual content tokens). But when we send that prompt through the Chat Completions API, OpenAI adds additional information around it, and the same is true for the other LLM providers. What additional information are these APIs adding to prompts?
- System message tokens - default system context
- Message formatting tokens - special tokens that structure the message as a chat turn (role markers, delimiters, etc.)
- Other overhead - tokens for the conversation structure itself
So the breakdown is roughly:
10 tokens - the actual prompt content
~6 tokens - API overhead (formatting, system context, message structure)
= 16 tokens total - what the API reports (completion.usage.prompt_tokens)
Understanding hidden tokens is important, especially for prompt engineering and for calculating API/model usage costs.
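If you want to measure this overhead yourself, you can compare the API-reported prompt token count with a local count of just your message text. A rough sketch, assuming the OpenAI client from earlier, an OPENAI_API_KEY environment variable, and that cl100k_base is close enough to the model’s actual tokeniser for an estimate:
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Explain quantum physics in a tweet length message."
completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=100,
)

local_count = len(encoding.encode(prompt))   # just the prompt text
reported = completion.usage.prompt_tokens    # what you are actually billed for
print(f"local: {local_count}, reported: {reported}, overhead: {reported - local_count}")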
How the Cost of Tokens Are Calculated
As mentioned earlier, tokens are the currency of LLMs: an input prompt gets encoded into input tokens and billed at some small amount. The input tokens are fed into the LLM, which does some thinking/reasoning and returns output tokens. The output tokens are also billed at some small amount, but at a different rate to the input tokens.
Let’s look at a real-world pricing structure: the API pricing for GPT-5.2. At the time of writing, this model costs:
GPT-5.2 - The best model for coding and agentic tasks across industries

| Token type | Price |
|---|---|
| Input | $1.750 / 1M tokens |
| Cached input | $0.175 / 1M tokens |
| Output | $14.000 / 1M tokens |

source: https://openai.com/api/pricing/
Usage is priced per token, and rates vary by model and by type of token. Calculating the exact cost of LLM/API usage can be tricky: as we saw in the previous section, hidden tokens get added to the input prompt, and the pricing above also shows a separate rate for cached input tokens. Cached tokens are tokens reused from earlier in the conversation history and are often billed at a reduced rate.
Also, keep in mind that whenever you prompt an LLM, whether via a chat UI, an API, or a coding agent, each new prompt includes the full conversation history (known as the context), because LLMs are stateless: they have no memory between calls. When you use ChatGPT, Claude, or Gemini via the chat UI, the app sends the entire conversation thread with each request for you. When you use the API, you have to manage this yourself.
The important thing to know is that when you send your prompt along with its context, every message gets tokenised and counts towards your input tokens for that request. This is why longer conversations cost more: you are paying to reprocess the entire context every time you send a new message. As you go about prompting an LLM, the flow of prompts plus context looks something like this (a short code sketch follows the listing):
Request 1:
→ User prompt 1 : 10 input tokens
← LLM response 1: 50 output tokens
Request 2:
→ User prompt 2 : 15 tokens
→ User prompt 1 : 10 tokens (from Request 1)
→ LLM response 1 : 50 tokens (from Request 1)
← LLM response 2 : 70 output tokens
Request 3:
→ User prompt 3 : 20 tokens
→ User prompt 2 : 15 tokens (from Request 2)
→ LLM response 2 : 70 output (from Request 2)
→ User prompt 1 : 10 tokens (from Request 1)
→ LLM response 1 : 50 tokens (from Request 1)
← LLM response 3 : 80 output tokens
...
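When you use the API directly, that accumulating context is just a list of messages that you keep appending to and re-sending. A minimal sketch with the OpenAI client (the example prompts are made up; the same pattern applies to the other providers):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = []      # the full conversation history (the context)

for user_prompt in ["What is a token?", "And what is a tokeniser?"]:
    messages.append({"role": "user", "content": user_prompt})
    completion = client.chat.completions.create(
        model="gpt-4.1",
        messages=messages,  # every previous turn is re-sent and re-billed
        max_tokens=200,
    )
    reply = completion.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(f"prompt tokens this request: {completion.usage.prompt_tokens}")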
This is why long conversations can get expensive quickly: you are repeatedly paying for all previous messages (both yours and the LLM’s). Luckily, cached tokens from previous context can help keep the costs down.
Let’s calculate the cost of a conversation using the GPT-5.2 model pricing from above:
Prompt 1:
input tokens = 1000
output tokens = 500
input tokens cost = input tokens / 1M * $1.750 = $0.00175
output tokens cost = output tokens / 1M * $14 = $0.007
total cost = $0.00175 + $0.007 = $0.00875
Prompt 2:
input tokens = 1000
cached tokens = 1500 (cached previous conversation)
output tokens = 500
input tokens cost = input tokens / 1M * $1.750 = $0.00175
cached tokens cost = cached tokens / 1M * $0.175 = $0.0002625
output tokens cost = output tokens / 1M * $14 = $0.007
total cost = $0.00175 + $0.0002625 + $0.007 = $0.0090125
Prompt 3:
input tokens = 1000
cached tokens = 3500 (cached)
output tokens = 500
input tokens cost = input tokens / 1M * $1.750 = $0.00175
cached tokens cost = cached tokens / 1M * $0.175 = $0.0006125
output tokens cost = output tokens / 1M * $14 = $0.007
total cost = $0.00175 + $0.0006125 + $0.007 = $0.0093625
Total conversation cost:
$0.00875 + $0.0090125 + $0.0093625 = $0.027125 (roughly 3 cents)
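The same arithmetic as a small helper function, using the GPT-5.2 rates quoted above (swap in the rates for whichever model you actually use):
# GPT-5.2 rates quoted above, converted to dollars per token
INPUT_RATE = 1.750 / 1_000_000
CACHED_RATE = 0.175 / 1_000_000
OUTPUT_RATE = 14.000 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    return (input_tokens * INPUT_RATE
            + cached_tokens * CACHED_RATE
            + output_tokens * OUTPUT_RATE)

# (input tokens, output tokens, cached tokens) for the three prompts above
conversation = [(1000, 500, 0), (1000, 500, 1500), (1000, 500, 3500)]
total = sum(request_cost(i, o, c) for i, o, c in conversation)
print(f"${total:.6f}")  # $0.027125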
There are various things you can do to control the cost of LLM usage, such as setting the max_tokens parameter in API calls, shortening or rephrasing prompts, and summarising or preprocessing inputs before sending them.
Token Limits
When you use an AI chat via the web, desktop, or mobile apps and have a really long conversation, you will have seen that the interface typically warns you when you are approaching the conversation limit.
LLMs have a maximum combined token limit made up of input + output tokens; once a conversation reaches the model’s limit, you typically have to start a new chat. For example, some of Claude’s models have a 200k token limit (depending on usage tier). Token limits exist because of technical constraints, such as limitations in the transformer architecture that LLMs are built upon and hardware memory requirements; in fact, doubling the number of tokens in the LLM’s context can quadruple memory usage.
One way to picture this is to view the context window as a bucket and tokens as marbles: you can only put so many marbles into the bucket before it fills up and starts overflowing.
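If you manage the context yourself via an API, it helps to count tokens before sending and to drop the oldest messages once you approach the limit. A rough sketch using tiktoken (the 200k figure is an example limit, trim_history is a hypothetical helper, and the counting ignores per-message formatting overhead):
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
TOKEN_LIMIT = 200_000  # example limit; check your model's documentation

def trim_history(messages: list[dict], limit: int = TOKEN_LIMIT) -> list[dict]:
    # Approximate token count of the whole conversation (message content only)
    def count(msgs: list[dict]) -> int:
        return sum(len(encoding.encode(m["content"])) for m in msgs)

    # Drop the oldest messages until the conversation fits within the limit
    while len(messages) > 1 and count(messages) > limit:
        messages = messages[1:]
    return messages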
Summary
- Tokens are the fundamental unit in LLMs.
- LLMs do not process words, they process tokens. They do not see text the way humans do; instead, they see a sequence of numbers.
- Words are converted into tokens using a tokeniser like Byte Pair Encoding (BPE).
- Different LLMs use different tokenisers and have different training sets, so token IDs are not the same across LLMs.
- Input and output tokens are billed at different rates and priced per token.
- When billing input tokens, the entire conversation history (the context) is fed in alongside the new user prompt.
- LLMs have a context window, which has a token limit. Once a conversation exceeds that limit, you may have to start a new chat.
I hope you found this post helpful.
Thank you for reading.