LLM: Understanding The Temperature Parameter

- 13 mins read

Overview

  • How tokens are the fundamental unit of LLMs.
  • What the token vocabulary is and the role it plays in LLMs.
  • What the LLM temperature parameter is and how it works.
  • Why LLMs are non-deterministic.
  • The use cases where developers may adjust the temperature parameter.

Introduction

During inference, a user can pass various parameters to an LLM to guide its behaviour and the output it returns. One of those parameters is the temperature parameter, which controls the randomness of the LLM’s output depending on whether it is set to a low or high value. In this post we will dive deeper into the temperature parameter: what it is, what its purpose is, and what happens when you set it to different values during LLM inference.

Before learning more about temperature, let’s quickly recap what LLM tokens are and what a token vocabulary is.

Tokens

Tokens are the currency of LLMs. When you send a text prompt such as “Hello World!” to an LLM, it gets broken down into three tokens (in OpenAI GPT models). These tokens are then fed into the other components that make up the LLM architecture during inference.

Input:

“Hello World!”  →  Tokens [2344, 3125, 12]  →  $0.00125 / 1k tokens

The input prompt “Hello World!” gets broken down into three tokens and billed at some small amount per 1k tokens. At this point the tokens are fed into the LLM, which does some thinking/reasoning and returns an output.

Output:

“Hi!”  →  Tokens [2514, 12]  →  $0.01 / 1k tokens

Here the LLM outputs “Hi!”, which is made up of two tokens; these are billed at a different rate to the input tokens.
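
If you want to see tokenization for yourself, here is a small sketch using OpenAI’s tiktoken library (this assumes you have tiktoken installed; the exact token IDs depend on the encoding, so they will differ from the illustrative numbers above):

import tiktoken

# Load the encoding used by recent OpenAI GPT models
encoding = tiktoken.get_encoding("cl100k_base")

token_ids = encoding.encode("Hello World!")
print(token_ids)                    # the token IDs for the prompt
print(len(token_ids))               # 3 tokens for this prompt with this encoding
print(encoding.decode(token_ids))   # back to: Hello World!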

Token Vocabulary

As well as the user-submitted tokens that are fed into an LLM during inference, an LLM also has its own set of tokens called a token vocabulary. A token vocabulary is the complete set of all possible tokens an LLM can generate.

Before an LLM undergoes the training phase of its development, a token vocabulary needs to be built; it provides the foundation for the LLM’s training and is used during its inference stage.

The token vocabulary is built before the preprocessing stage of the LLM. Developers take a large corpus of raw text and run a tokenization algorithm (BPE, WordPiece, etc.) on it. The corpus of raw text comes from web data (crawled from the open web, e.g. news sites, Reddit, Wikipedia), books, scientific/academic papers, code repositories and other sources such as PDF files.

So at a high level, the process order is: corpus of raw text → tokenizer training → token vocabulary created → model training. At its core, the token vocabulary is a dictionary mapping tokens to IDs. Here is an example:

Token (text)         Token ID
------------------------------
'Hello'              9906
'World'              10343
' Hello'             22691
'hello'              15339
'ing'                287
'un'                 359
'.'                  13
'\n'                 198
'42'                 2983

This is just a small, simple example; in reality vocabularies are much larger (100k-200k tokens) and include special tokens like <pad>, <eos>, <unk>. The most important thing is that once the token vocabulary is created, it is fixed, and it is used during inference together with the temperature parameter (as well as other parameters) to drive the output of the LLM.
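
To make the mapping concrete, here is a tiny sketch in plain Python that reuses a few entries from the table above (a toy illustration, not a real tokenizer):

# A toy slice of a token vocabulary: a fixed mapping from token text to ID.
vocab = {
    "Hello": 9906,
    " Hello": 22691,
    "hello": 15339,
    ".": 13,
}

# Encoding is a lookup from text pieces to token IDs...
ids = [vocab["Hello"], vocab["."]]
print(ids)                                    # [9906, 13]

# ...and decoding is the reverse lookup from IDs back to text.
id_to_token = {i: t for t, i in vocab.items()}
print("".join(id_to_token[i] for i in ids))   # Hello.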

What Is LLM Temperature?

Now that we know what tokens and a token vocabulary are, let’s finally learn what the LLM temperature parameter is and how it is used.

The temperature parameter controls the amount of randomness in token selection from the LLM’s token vocabulary. When the temperature is set to a high value, all tokens become more equally likely to be selected, while lower values (closer to 0) make the model favour its highest-confidence tokens. Setting the temperature therefore lets us steer the output of an LLM to be more diverse and creative, or more predictable and focused.

Let’s look at a code example to get a better understanding of how changing the temperature to low and high values affects the LLM’s output:

Note: The OpenAI documentation states that the temperature can be set to a value between 0 and 2.

from openai import OpenAI

# Low temperature: the model strongly favours its highest-probability tokens
temperature = 0

client = OpenAI(api_key="<API-KEY>")

response = client.responses.create(
    model="gpt-4.1",
    temperature=temperature,
    input="Write a one-sentence bedtime story about a unicorn."
)

# The generated bedtime story
print(response.output_text)
$ python3 script.py

Under a sky full of twinkling stars, a gentle unicorn named Luna tiptoed through
a moonlit meadow, sprinkling dreams of magic and kindness wherever she went.

Under a sky full of twinkling stars, a gentle unicorn named Luna tiptoed through
a moonlit meadow, sprinkling dreams of magic and kindness wherever she went.

Under a sky full of twinkling stars, a gentle unicorn named Luna tiptoed through
a moonlit meadow, leaving trails of sparkling dreams for all the children fast asleep.

Under a sky full of twinkling stars, a gentle unicorn named Luna tiptoed through
a moonlit meadow, leaving trails of sparkling dreams for all the children fast asleep.

Under a blanket of twinkling stars, a gentle unicorn named Luna tiptoed through
a moonlit meadow, sprinkling dreams of magic and wonder for all the sleeping children.

We set the temperature to a low value and run the script five times. Notice how the output of each call to the API is very similar. Because the temperature is low there is very little variation; the output is more predictable and focused.

Let’s change the temperature to a higher value and rerun the script again:

temperature = 1.8
$ python3 script.py

A gentle unicorn with a silver mane tiptoed through a moonlit meadow, gently
sprinkling good dreams upon sleeping children as the stars whispered soft lullabies
from above.

At the edge of a magical forest, a gentle unicorn sprinkled starlight on the
flowers every night, guiding sweet dreams to every child.

One windy night, a gentle unicorn sprinkled starlight across the fields, smiling
as every shining beam filled children’s dreams with wondrous magic.

Under a moonlit sky, a gentle unicorn sprinkled silver stardust across dreamland,
tucking every child into a magical, peaceful sleep.

Under a velvet sky sprinkled with stardust, a gentle unicorn named Luna tiptoed
through a silver meadow, guiding sleepy dreamers to lands where every wish comes
true.

Notice how the output of each call to the API is now very different. Because we set the temperature high there is a lot of variation; the output is more diverse and creative.

So, the temperature is like a knob we can use for controlling a model’s behaviour when it’s picking which tokens to select and return in its output.

temperature < 1: sharper probability distribution, more focused/predictable output
temperature = 1: the model’s unscaled softmax distribution
temperature > 1: flatter probability distribution, more creative/random output

LLMs Are Non-Deterministic

If you have any experience with prompting LLMs, you will know that if you enter the same prompt you can get different outputs from the LLM. This is because LLMs are non-deterministic: they do not just predict a single token. Instead, they predict probabilities for what the next token could be, with each token in the LLM’s vocabulary getting assigned a probability. Those token probabilities are then sampled to determine what the next selected token will be.

You may think that setting the temperature to zero would make the LLM deterministic, because the model would always pick the token with the highest probability. But take a closer look at the output of the script when we set the temperature to zero:

Under a sky full of twinkling stars, a gentle unicorn named Luna tiptoed through
a moonlit meadow, sprinkling dreams of magic and kindness wherever she went.

Under a sky full of twinkling stars, a gentle unicorn named Luna tiptoed through
a moonlit meadow, sprinkling dreams of magic and kindness wherever she went.

Under a sky full of twinkling stars, a gentle unicorn named Luna tiptoed through
a moonlit meadow, leaving trails of sparkling dreams for all the children fast asleep.

Under a sky full of twinkling stars, a gentle unicorn named Luna tiptoed through
a moonlit meadow, leaving trails of sparkling dreams for all the children fast asleep.

Under a blanket of twinkling stars, a gentle unicorn named Luna tiptoed through
a moonlit meadow, sprinkling dreams of magic and wonder for all the sleeping children.

Notice that the first two outputs from the model are the same, the third and fourth are the same, and the fifth output is unique. Why is this? If multiple tokens have the same highest predicted probability, the sampling method performs a tie-break and returns one of them. This means you do not always get the same output even when the temperature is zero, which in turn makes LLMs non-deterministic.

As a simple example, imagine the model has generated the following tokens so far:

"Under a "

And we are sampling the next token, with these predicted probabilities:

sky: 40%
blanket: 40%
star: 15%
moon: 5%

The LLM sampling process would perform a tie-break between the highest probability tokens:

tiebreak("sky", "blanket")

Exactly how tie-breaking is implemented is not publicly documented and appears to be implementation-dependent.
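
As a rough illustration of why ties matter, here is a small sketch (assuming NumPy is available; real inference stacks handle this internally and may behave differently):

import numpy as np

tokens = ["sky", "blanket", "star", "moon"]
probs = np.array([0.40, 0.40, 0.15, 0.05])

# A plain argmax always returns the first maximum it finds...
print(tokens[int(np.argmax(probs))])           # always "sky"

# ...but a sampler that breaks ties at random can return either of the
# equally likely tokens, so repeated runs may differ.
tied = np.flatnonzero(probs == probs.max())    # indices of the tied tokens
print(tokens[np.random.choice(tied)])          # "sky" or "blanket"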

How the Temperature Parameter Works

To understand the temperature parameter more thoroughly, we will walk step by step through a simple example of what happens during inference with respect to the temperature parameter.

1. User Submits a Request Which Includes a Prompt and Temperature Variable

response = client.responses.create(
    model="gpt-4.1",
    temperature=1.8,
    input="Write a one-sentence bedtime story about a unicorn."
)

2. LLM Breaks the Prompt into Tokens

"Write a one-sentence bedtime story about a unicorn."
                         ↓
[8144, 264, 832, 1355, 18886, 89607, 3446, 922, 264, 82930, 13]

3. Model Computes A Logit (raw score) For Every Single Token In Its Vocabulary

Note: A logit is a raw, unnormalised score output by a neural network (NN). For LLMs, after the model finishes processing the input tokens, its final NN layer produces a logit for each token in the vocabulary.

{
  "hello": 5.2,
  "world": 3.1,
  "ing": 2.8,
  " ": 1.5,
  "cat": -0.3,
  ... (100K+ more tokens)
}

The logit for every single token in the vocabulary is computed from the input prompt tokens (together with the model’s learned parameters), which means this step happens every time the model does inference.
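
Conceptually, this final layer is a projection from the model’s last hidden state to one raw score per vocabulary entry. Here is a heavily simplified sketch (made-up sizes, random numbers standing in for learned weights, assuming NumPy is available):

import numpy as np

hidden_size, vocab_size = 8, 12             # real models: thousands / ~100k+
rng = np.random.default_rng(0)

hidden_state = rng.normal(size=hidden_size)            # final hidden state for the last position
lm_head = rng.normal(size=(hidden_size, vocab_size))   # stand-in for the learned output weights

logits = hidden_state @ lm_head             # one raw score (logit) per vocabulary token
print(logits.shape)                         # (12,) -> one logit per token in the toy vocabulary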

4. Temperature Parameter Scales the Logits Mathematically

At low temperatures, the differences between logits get amplified: high scores end up relatively much higher and low scores relatively much lower. The highest-scoring token dominates, which means the same tokens tend to get picked, resulting in more predictable output. At high temperatures, the differences between logits get compressed: the scores flatten out and many tokens become closer to equally likely, resulting in more random output.

Imagine the model’s logits for the next token are:

"happy": 8.0
"sad": 2.0
"angry": 1.0

If temperature = 0.1 then the logits are scaled with this formula:

logit / temperature = scaled_logit

So using the logit values from the model:

8.0 / 0.1 = 80.0
2.0 / 0.1 = 20.0
1.0 / 0.1 = 10.0

If temperature = 2.0:

8.0 / 2.0 = 4.0
2.0 / 2.0 = 1.0
1.0 / 2.0 = 0.5

Notice the difference in the scaled logits when the temperature is low versus when it is high. When the temperature is low, the gaps between the scaled logits are amplified; when the temperature is high, the scaled logits are much closer together.
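
The scaling step itself is just a division; here is a minimal sketch in plain Python using the logits from the example above:

# Raw logits for the next token, taken from the example above
logits = {"happy": 8.0, "sad": 2.0, "angry": 1.0}

def scale(logits, temperature):
    # Temperature scaling: divide every logit by the temperature
    return {token: logit / temperature for token, logit in logits.items()}

print(scale(logits, 0.1))   # {'happy': 80.0, 'sad': 20.0, 'angry': 10.0}
print(scale(logits, 2.0))   # {'happy': 4.0, 'sad': 1.0, 'angry': 0.5}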

5. Softmax Function Converts the Scaled Logits into Probabilities

Note: softmax is a function that takes logits, i.e. the raw scores computed by a model, and turns them into probabilities. The function’s goal is to make sure the output values are in the range (0, 1) and sum to 1, which makes them interpretable as probabilities.

Once the model has scaled the logits according to the temperature, the next step is to apply the softmax function to the scaled logits. The softmax function converts the scaled logits to probabilities using the formula e^x / sum(e^x for all tokens). Below, we apply softmax to the scaled logits at both low and high temperatures to see its effect.

Temperature = 0.1:

8.0 / 0.1 = 80.0
2.0 / 0.1 = 20.0
1.0 / 0.1 = 10.0

e^80 = an astronomically large number
e^20 = large, but a factor of e^60 smaller than e^80
e^10 = smaller still

"happy": e^80 / (e^80 + e^20 + e^10) ≈ 1.0 (essentially 100%)
"sad": e^20 / (e^80 + e^20 + e^10) ≈ 0 (on the order of 10^-26)
"angry": e^10 / (e^80 + e^20 + e^10) ≈ 0 (vanishingly small)

At the low temperature, dividing by 0.1 amplifies the logit differences, making the highest logit dominate almost completely (essentially 100% for “happy”). The model becomes more confident/deterministic.

Temperature = 2.0:

8.0 / 2.0 = 4.0
2.0 / 2.0 = 1.0
1.0 / 2.0 = 0.5

e^4.0 ≈ 54.6
e^1.0 ≈ 2.72
e^0.5 ≈ 1.65

"happy": 54.6 / (54.6 + 2.72 + 1.65) ≈ 0.92 (92%)
"sad": 2.72 / (54.6 + 2.72 + 1.65) ≈ 0.04 (4%)
"angry": 1.65 / (54.6 + 2.72 + 1.65) ≈ 0.03 (3%)

At the higher temperature, logits stay closer to original values, probabilities are more uniform. More randomness.
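
The whole scale-then-softmax step fits in a few lines of Python; here is a minimal sketch (standard library only) that reproduces the numbers above:

import math

def softmax_with_temperature(logits, temperature):
    # Step 1: scale the logits by the temperature
    scaled = [x / temperature for x in logits]
    # Step 2: softmax (shifted by the max for numerical stability)
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [round(e / total, 3) for e in exps]

logits = [8.0, 2.0, 1.0]   # "happy", "sad", "angry"

print(softmax_with_temperature(logits, 0.1))   # [1.0, 0.0, 0.0]       "happy" dominates
print(softmax_with_temperature(logits, 2.0))   # [0.926, 0.046, 0.028] flatter distribution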

6. Sampling Next Token Based on Probabilities

Finally, after softmax converts the scaled logits to probabilities, the model samples one token based on those probabilities.

At temperature 0.1 with these probabilities:

"happy": 99%
"sad": 0.8%
"angry": 0.2%

"happy" would almost certainly get selected. With 99% probability, it gets picked the vast majority of the time. “Sad” and “angry” have only a tiny 0.8% and 0.2% chance respectively of being selected.

At temperature 2.0 with these probabilities:

"happy": 92%
"sad": 4%
"angry": 3%

"happy" has a 92% chance of being selected, which is much higher than “sad” and “angry”. The higher temperature makes the distribution more uniform (less dominated by the highest logit), but “happy” is still strongly favoured. It just gives “sad” and “angry” a better chance than they had at temperature 0.1 (where they were essentially 0%).

Remember, this is just sampling one token and adding it to the output. Once added, the model samples the next token and appends it too; this repeats until it hits a stop token and the output is returned. To clarify the process, imagine we ask an LLM to complete the next few words of the poem “roses are red”. The (extremely simplified) process would look something like this:

Feed into NN layers: [roses are red]
Pick most probable token from token vocabulary, let's say 'violets'
Feed into NN layers: [roses are red violets]
Pick most probable token from token vocabulary, let's say 'are'
Feed into NN layers: [roses are red violets are]
Pick most probable token from token vocabulary, let's say 'blue'
! Stop !
Output: roses are red violets are blue
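
Putting steps 3-6 together, the generation loop looks roughly like the sketch below. The TOY_LOGITS table is a made-up stand-in for the model's forward pass, purely for illustration:

import math
import random

# Toy "model": maps the tokens generated so far to logits for the next token.
# A real LLM computes these logits with its neural network layers.
TOY_LOGITS = {
    ("roses", "are", "red"): {"violets": 5.0, "daisies": 2.0, "<eos>": 0.1},
    ("roses", "are", "red", "violets"): {"are": 6.0, "<eos>": 0.1},
    ("roses", "are", "red", "violets", "are"): {"blue": 5.5, "red": 1.0, "<eos>": 0.2},
}

def sample_next_token(logits, temperature):
    # Steps 4-6: scale by temperature, softmax, then weighted random choice
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    weights = [math.exp(l - m) for l in scaled.values()]
    return random.choices(list(scaled), weights=weights, k=1)[0]

def generate(prompt_tokens, temperature):
    tokens = list(prompt_tokens)
    while True:
        logits = TOY_LOGITS.get(tuple(tokens), {"<eos>": 1.0})
        token = sample_next_token(logits, temperature)
        if token == "<eos>":        # stop token ends generation
            return " ".join(tokens)
        tokens.append(token)

print(generate(["roses", "are", "red"], temperature=0.5))
# most of the time: roses are red violets are blue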

Temperature Parameter Use Cases

Now that we have seen what the temperature parameter is and how it works, what are some of the use cases for the parameter? The temperature is only exposed in the API and not in the web/mobile apps of the LLM providers (OpenAI, Anthropic/Claude, Perplexity, etc.). This is so developers can tweak the models to suit their end application requirements.

Here are practical use cases where developers might adjust temperature:

Low temperature (0-0.3):

  • Customer support chatbots - consistent, factual answers to common questions
  • Code generation - deterministic output for reproducible code
  • Data extraction - reliable structured responses (pulling info from documents)
  • Medical/legal advice - accuracy and consistency matter more than variety

Medium temperature (0.5-0.7):

  • General Q&A assistants - balanced between consistency and natural variation
  • Content summarization - faithful to source while sounding natural

High temperature (0.8-1.0+):

  • Creative writing - poetry, fiction, brainstorming
  • Chatbots for entertainment - conversational, unpredictable responses

Dynamic adjustment:

Some apps adjust temperature based on context, as the short sketch after this list illustrates:

  • A writing assistant might use low temperature for technical sections, high for creative sections
  • A code editor uses low temperature for completions, higher for “suggest alternatives”
  • A chatbot uses low temperature for factual queries, higher for open-ended questions
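
Here is a minimal sketch of that pattern; the task labels and temperature values below are made up purely for illustration, not recommendations from any provider:

# Hypothetical mapping from task type to temperature, tuned per application
TEMPERATURE_BY_TASK = {
    "factual_query": 0.2,
    "code_completion": 0.1,
    "summarization": 0.6,
    "creative_writing": 1.0,
}

def pick_temperature(task_type):
    # Fall back to a middle-of-the-road value for unknown task types
    return TEMPERATURE_BY_TASK.get(task_type, 0.7)

print(pick_temperature("creative_writing"))   # 1.0
print(pick_temperature("factual_query"))      # 0.2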

Summary

In this blog post, we learned how prompts are split into tokens and fed into an LLM, and how the model computes probabilities for the tokens that get sampled from the token vocabulary. We saw code examples showing how to set the temperature and how setting it to low and high values drives the output of the model. Without temperature control, you would be stuck with either deterministic outputs (always picking the most likely token) or completely random outputs. Temperature lets you dial in the right balance between consistency and creativity.

Thank you for reading.