LLM: Understanding Log Probabilities
Overview
- Learn what token log probabilities are.
- Understand and use the logprobs parameter.
- Learn the practical use cases for log probabilities.
Introduction
In this blog post, we will discuss what LLM log probabilities (also known as logprobs) are. We will use the OpenAI API to retrieve the log probabilities of the individual tokens a model outputs in response to user prompts, write some code to understand what we can do with this extra information returned by the API, and finally look at some practical use cases of log probabilities.
Quick Recap on Tokens
Tokens are the currency of LLMs. When you send a text prompt to an LLM, such as “What is the capital of Canada?”, it gets broken down into tokens. These tokens are then fed into the other components that make up the LLM architecture during inference.
Input:
Prompt: "What is the capital of Canada?"
↓
Tokens: [4827, 382, 290, 8444, 563, 328, 10351, 30]
The input prompt “What is the capital of Canada?” gets broken down into eight tokens (in OpenAI GPT models). At this point the tokens are fed into the LLM, which does some thinking/reasoning and returns an output.
Output:
Tokens: [976, 9029, 328, 10351, 382, 67810, 13]
↓
Decoded: "The capital of Canada is Ottawa"
Here the LLM outputs “The capital of Canada is Ottawa” which is made up of seven tokens.
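If you want to see tokenization for yourself, OpenAI’s tiktoken library lets you encode and decode text locally. Here is a minimal sketch (you would need to add tiktoken to your project, e.g. with uv add tiktoken, and the exact token IDs depend on which encoding/model you use, so they may differ from the numbers shown above):
import tiktoken

# o200k_base is the encoding used by recent OpenAI GPT models;
# older models use other encodings, so the IDs will differ
enc = tiktoken.get_encoding("o200k_base")

tokens = enc.encode("What is the capital of Canada?")
print(tokens)              # a list of integer token IDs
print(enc.decode(tokens))  # back to the original string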
Log Probability
When an LLM returns its response to a user after inference, how can we see how likely each individual output token was to be picked as the LLM was building its response? We do this by looking at the log probability of each returned token: the log probability reflects the probability of that token occurring in the sequence given the context.
Let’s look at a simple example - imagine we submit the prompt “What is the capital of Canada?”:
Now imagine that during inference the model has built up the following output so far: “The capital of Canada is”. It is determining which token to pick next and, based on the context of the input prompt and the output built so far, it has (among others) the following tokens to choose from:
Tokens:
Ottawa: 55.50%
Toronto: 22.21%
Montreal: 10.23%
Vancouver: 8.49%
Alberta: 3.57%
In this case, the most probable token is Ottawa, so with greedy decoding the model selects it as the next token in the sequence.
Note: this is a contrived example; in reality the LLM would not simply be selecting from tokens that all represent cities in Canada :)
So, during inference the LLM assigns log probabilities to candidate tokens and (with greedy decoding) selects the token with the highest probability at each step. We can extract this information and use it for various purposes.
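To make the relationship between probabilities and log probabilities concrete, here is a tiny sketch using the contrived candidates above (the numbers are illustrative only):
import math

# Contrived next-token candidates and their probabilities (illustrative only)
candidates = {
    "Ottawa": 0.5550,
    "Toronto": 0.2221,
    "Montreal": 0.1023,
    "Vancouver": 0.0849,
    "Alberta": 0.0357,
}

# A log probability is just the natural log of the probability
logprobs = {token: math.log(p) for token, p in candidates.items()}

# Greedy decoding picks the token with the highest probability,
# which is the same as picking the highest log probability
print(max(logprobs, key=logprobs.get))  # Ottawa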
Code
Let’s look at a code example of how to access the log probabilities of the output tokens. In this example, we will use the OpenAI Chat Completions API to make a request and then print out the logprobs of each token:
# setup
$ mkdir logprobs
$ cd logprobs
$ uv init
$ uv add numpy openai
import os

import numpy as np
from openai import OpenAI

API_KEY = os.environ["OPENAI_API_KEY"]  # assumes the key is set in the environment


def format(token: str) -> str:
    return "<SPACE>" if token.isspace() else token


def main():
    client = OpenAI(api_key=API_KEY)
    params = {
        "model": "gpt-4.1",
        "messages": [{
            "role": "user",
            "content": "How tall is the Eiffel Tower in miles? Be concise"
        }],
        "max_tokens": 500,
        "logprobs": True,
        "top_logprobs": 3,
    }
    completion = client.chat.completions.create(**params)
    print("output:", completion.choices[0].message.content)
    # Walk over each output token and print the top alternatives considered
    for item in completion.choices[0].logprobs.content:
        print(f"selected token: {format(item.token)}")
        for top_logprob in item.top_logprobs:
            token = format(top_logprob.token)
            probability = np.round(np.exp(top_logprob.logprob) * 100, 2)
            print(f"{token:<15} {top_logprob.logprob:<10.5f} {probability:<5}%")
        print()


if __name__ == "__main__":
    main()
In this example, we enable log probabilities by setting the logprobs parameter to True, and we set top_logprobs to 3, which means that for each output position we want the three tokens with the highest probability.
Let’s run the script and analyse the output:
$ uv run main.py
output: The Eiffel Tower is about **0.19 miles** tall.
selected token: The
The 0.00000 100.0%
Approximately -17.50000 0.0 %
0 -18.50000 0.0 %
selected token: Eiffel
Eiffel -0.00001 100.0%
** -11.62501 0.0 %
height -15.25001 0.0 %
selected token: Tower
Tower 0.00000 100.0%
<SPACE> -19.25000 0.0 %
<SPACE> -22.75000 0.0 %
selected token: is
is 0.00000 100.0%
stands -21.18750 0.0 %
's -21.43750 0.0 %
selected token: about
about -0.16033 85.19%
approximately -1.91033 14.8 %
** -9.53533 0.01 %
selected token: **
<SPACE> -0.47408 62.25%
** -0.97408 37.75%
*** -16.22408 0.0 %
selected token: 0
0 -0.00091 99.91%
1 -7.00092 0.09 %
984 -12.62591 0.0 %
selected token: .
. 0.00000 100.0%
, -24.40625 0.0 %
<SPACE> -25.79688 0.0 %
selected token: 19
19 -0.01254 98.75%
000 -5.38754 0.46 %
2 -6.01254 0.24 %
selected token: miles
miles -0.00000 100.0%
** -16.62500 0.0 %
<SPACE> -19.75000 0.0 %
selected token: **
** -0.00286 99.71%
tall -6.00286 0.25 %
( -7.87786 0.04 %
selected token: tall
tall -0.00034 99.97%
( -8.00034 0.03 %
high -18.12534 0.0 %
selected token: .
. -0.57595 56.22%
( -0.82595 43.78%
(~ -12.07595 0.0 %
In response to the prompt “How tall is the Eiffel Tower in miles? Be concise”, we got the output “The Eiffel Tower is about 0.19 miles tall.” Along with the API response, we got the log probability of each token appearing in the sequence at its position. For each output token, we display the top tokens considered at that step, their log probabilities, and the log probabilities converted to percentages.
Let’s isolate a single output token, so we can understand the output of the API better:
selected token: Eiffel
Eiffel -0.00001 100.0%
** -11.62501 0.0 %
height -15.25001 0.0 %
Out of all the tokens sampled at this step during inference:
- The model considered all possible tokens it could generate next.
- “Eiffel” had the highest probability (~100%), so it was selected as the actual output token.
- The model then extracts the top three tokens (top_logprobs=3) that have the highest probability of being selected.
- The API returns those top three tokens with their log probabilities: “Eiffel” (-0.00001), “**” (-11.62501), “height” (-15.25001).
This allows users to measure the model’s confidence in its output or explore alternatives the model considered. Here the model was essentially certain about “Eiffel” (a logprob close to 0.0 means ~100% probability), while the alternatives were extremely unlikely (a logprob of -11.62501 corresponds to a probability of e^(-11.62501) ≈ 0.0009%).
Understanding Log Probability Values
If you look at the values of the log probabilities, you will notice they are negative or zero (in the range (-∞, 0] to be more precise). The logprob value actually stores the natural logarithm of the probability, not the probability itself. The reason for this is that the probabilities of individual tokens are often minuscule (e.g. 0.0000001), and multiplying or dividing such small numbers can cause numerical underflow. Basically, the numbers become so small that computers cannot represent them accurately. So instead we work with the natural logarithm of the small probabilities:
logprob = math.log(8.939686826368393e-06) = -11.62501
Log probability is log(p), where p is the probability of a token occurring at a specific position given the previous tokens in the context. The higher the log probability, the higher the chance that token is selected in that context. Log probability values are either negative or 0.0, where 0.0 corresponds to 100% probability.
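To see why staying in log space matters, here is a toy example (not API output): the joint probability of a long sequence is the product of many tiny per-token probabilities, which underflows to zero, while the sum of the log probabilities remains perfectly representable.
import math

# Toy example: 400 tokens, each with a probability of 0.001
probs = [1e-3] * 400

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value (1e-1200) underflows to zero

log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -2763.1 -- easily representable, so we stay in log space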
Note: Not all APIs expose the logprobs parameter, so make sure you read the relevant API documentation. Here is a table which shows whether popular API providers support the logprobs parameter:
| API | Supports logprobs |
|---|---|
| Google (Gemini) | Yes |
| Anthropic (Claude) | No |
| OpenAI | Yes |
| Azure OpenAI | Yes |
| Hugging Face Inference API | Yes |
Use Cases
Now that we have seen what log probabilities are and how they work, let’s look at the practical use cases for the logprobs parameter:
1. Confidence/Uncertainty Detection
You can use log probabilities to identify how confident a model is when retrieving answers to questions. If you have a Q&A RAG/chat based system, you can detect how confident the model is in its answers and take steps to reduce errors. Let’s look at an example of asking an LLM a series of medical questions and seeing how confident it is in its answers.
import os

import numpy as np
from openai import OpenAI

API_KEY = os.environ["OPENAI_API_KEY"]  # assumes the key is set in the environment

PROMPT = """Medical Text:
A 45-year-old male presents to the emergency department with sudden onset chest
pain radiating to the left arm, accompanied by diaphoresis and shortness of
breath. The pain began 2 hours ago while he was mowing the lawn. He has a
history of hypertension and hyperlipidemia, currently managed with lisinopril
and atorvastatin. His father died of a myocardial infarction at age 52. Vital
signs show BP 150/95, HR 102, RR 22, O2 sat 94% on room air. Physical exam
reveals an anxious-appearing male with cool, clammy skin. Cardiac auscultation
is normal without murmurs. An ECG is ordered.
Based on the clinical presentation above, {question}
"""


def main():
    questions = [
        "what is the most likely diagnosis?",
        "what specific medication and exact dosage should be administered first in the emergency department?",
    ]
    client = OpenAI(api_key=API_KEY)
    for question in questions:
        params = {
            "model": "gpt-4.1",
            "messages": [{
                "role": "user",
                "content": PROMPT.format(question=question),
            }],
            "max_tokens": 500,
            "logprobs": True,
            "top_logprobs": 3,
        }
        completion = client.chat.completions.create(**params)
        print(f"Question: {question}")
        # Average the per-token logprobs, then convert back to a probability
        logprobs = [token.logprob for token in completion.choices[0].logprobs.content]
        avg_logprob = np.mean(logprobs)
        confidence = np.exp(avg_logprob)
        percentage = np.round(confidence * 100, 2)
        print(f"Confidence: {confidence:.6f} ({percentage:.2f}%)")
        print(f"Avg. logprob: {avg_logprob:.4f}")


if __name__ == "__main__":
    main()
Output:
$ uv run main.py
Question: what is the most likely diagnosis?
Confidence: 0.668006 (66.80%)
Avg. logprob: -0.4035
Question: what specific medication and exact dosage should be administered first in the emergency department?
Confidence: 0.000000 (0.00%)
Avg. logprob: -50.3084
We ask the LLM two questions about the medical text. In response to the first question, the model is moderately confident in its answer: it recognises the tell-tale signs of an MI (myocardial infarction, or heart attack) presentation, but it is not 100% certain. For the second question, the model’s confidence is very low, because it is much less certain about exact dosages, differing guidelines and precise medical protocols. The model is essentially guessing, so it is very uncertain in its answer.
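One way to act on these scores, reusing the confidence value computed in the loop above, is to route low-confidence answers for extra handling. The 0.5 threshold below is an arbitrary value chosen for illustration; you would tune it for your application:
CONFIDENCE_THRESHOLD = 0.5  # arbitrary cut-off, purely for illustration

if confidence < CONFIDENCE_THRESHOLD:
    # e.g. add a caveat to the answer, retrieve more context, or escalate to a human
    print("Low confidence - flagging this answer for review")
else:
    print("Confidence looks acceptable - returning the answer to the user")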
2. Token Healing
Token healing is a technique that uses log probabilities to handle prompts that end mid-word or mid-token. For example, if we submit the prompt ‘Is the Eiffel Tower located in Par’, the tokenizer might split it awkwardly (e.g. ‘Par’ becomes its own token, forcing the model to start generation with a new, complete token).
With token healing, you:
- Remove the last token from your prompt (‘Par’)
- Regenerate from the shortened prompt (“Is the Eiffel Tower located in “)
- Use logprobs to examine alternative tokens and their probabilities
- Select the token that naturally completes the original ending (e.g., ‘Paris’)
This lets you produce more coherent outputs by respecting natural token boundaries rather than forcing the model to continue from arbitrary cutoff points.
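Here is a rough sketch of those steps. It is illustrative only: it assumes the legacy OpenAI Completions endpoint (which continues raw text rather than replying to a chat message) and a model that still supports it, such as gpt-3.5-turbo-instruct; real token-healing implementations usually work directly inside the decoding loop.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Is the Eiffel Tower located in Par"
fragment = "Par"                              # the awkward trailing piece
healed_prompt = prompt[: -len(fragment)]      # "Is the Eiffel Tower located in "

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=healed_prompt,
    max_tokens=1,
    logprobs=5,  # the legacy endpoint takes an integer number of alternatives
)

# top_logprobs[0] maps each candidate first token to its log probability
candidates = response.choices[0].logprobs.top_logprobs[0]

# Prefer a candidate that naturally extends the removed fragment
matching = {tok: lp for tok, lp in candidates.items() if tok.strip().startswith(fragment)}
if matching:
    best = max(matching, key=matching.get)
    print("healed continuation:", best)  # e.g. " Paris"
else:
    print("no candidate extends the fragment; keep the original prompt")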
3. Detecting Hallucinations
If you are struggling with hallucinated answers to questions in your application, you can leverage log probabilities to help detect this. We saw this above when we calculated confidence scores for the answers we got in response to questions. Low logprobs on factual claims can indicate hallucinations.
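As a crude sketch of this idea, you can walk over the returned tokens and flag any whose probability falls below a threshold. The prompt and the 0.7 threshold below are made up for illustration, and in practice you would focus on content-bearing tokens (names, numbers, dates) rather than punctuation or stop words:
import numpy as np
from openai import OpenAI

LOW_PROB_THRESHOLD = 0.7  # arbitrary threshold, purely for illustration

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "When was the Eiffel Tower moved to London?"}],
    max_tokens=100,
    logprobs=True,
)

# Flag output tokens the model was not confident about
for item in completion.choices[0].logprobs.content:
    probability = np.exp(item.logprob)
    if probability < LOW_PROB_THRESHOLD:
        print(f"Low-confidence token: {item.token!r} ({probability:.2%})")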
4. Calculate Perplexity
Another use case of log probabilities is that they can be used to calculate perplexity. Perplexity measures how “surprised” or “confused” a model is. You could send the same prompt to gpt-4.1 and gpt-4o-mini and calculate the perplexity of each response to get a measure of how surprised/confused each model is by the same prompt. You can calculate perplexity as follows:
for model in models:
    ...
    logprobs = [token.logprob for token in response.choices[0].logprobs.content]
    avg_logprob = np.mean(logprobs)
    perplexity = np.exp(-avg_logprob)
    print(f"{model}: Perplexity = {perplexity:.2f}")
When calculating perplexity, lower perplexity means the model is more confident/less surprised by its own output, while higher perplexity means the model is struggling or uncertain.
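For completeness, here is one way the loop above could look end to end; the prompt and model list are placeholders chosen for illustration:
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Explain how attention works in a transformer. Be concise."  # placeholder prompt
models = ["gpt-4.1", "gpt-4o-mini"]

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
        logprobs=True,
    )
    # Perplexity is the exponential of the negative average log probability
    logprobs = [token.logprob for token in response.choices[0].logprobs.content]
    avg_logprob = np.mean(logprobs)
    perplexity = np.exp(-avg_logprob)
    print(f"{model}: Perplexity = {perplexity:.2f}")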
We can use perplexity to help us with:
- Model selection - pick the best model for our domain.
- Quality assessment - higher perplexity can indicate lower-quality or less coherent output.
- Domain fit - compare different models across different topics (engineering, medical, literature, etc.) to see where each model excels.
Note: You should not just use perplexity on its own to judge a model, but pair it with other metrics to build a better understanding.
There are many more use cases for log probabilities, but I find the ones above the most interesting.
Summary
In this post, we learned:
- Log probabilities are probabilities assigned to output tokens. The log probability of an output token indicates the chance of that token occurring in the sequence given the context.
- The logprob value stores the natural logarithm of the probability, not the probability itself.
- Log probability is log(p), where p is the probability of a token occurring at a specific position based on the previous tokens in the context.
- You have to request logprobs from the API by setting logprobs=True and requesting the number of most likely tokens to return via the top_logprobs parameter (these parameters are for OpenAI and differ across LLM APIs).
- Log probabilities are only available via APIs and not via the chat web/mobile apps.
- Some practical use cases of log probabilities, and how they can be used to enhance applications or guide model selection.
Thank you for reading.