Tokens and Embeddings

How text becomes ordered rows of numbers before attention begins.

Also called: tokenisation, prompt embeddings

LLMs do not work directly on text. They work on tokens, which are pieces of text in a fixed order, and on embeddings, which are vectors of numbers attached to those tokens.

Text

the small cat

Tokens in order

position 0 → the, position 1 → small, position 2 → cat

Embedding rows

one row of numbers per token, in the same order

The order is preserved all the way through: token 0 becomes position i = 0, token 1 becomes i = 1, and so on. Attention later uses those positions when it decides what each position may look back at.
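A minimal sketch of this in plain Python (the token list is illustrative, not the assignment's actual data): the position index i is nothing more than the token's place in the list.

```python
# Token order determines the position index i used later by attention.
tokens = ["the", "small", "cat"]

for i, token in enumerate(tokens):
    print(i, token)  # position i is fixed by list order: 0 the, 1 small, 2 cat
```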

In this assignment, you first practise the representation idea with a simplified embedding step.

Stage 1 builds one-hot vectors for the tokens as a simple stand-in for embeddings. That is the representation exercise students do directly. After that, the later attention stages use richer embedding rows as the prompt matrix that gets projected into $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$.
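One-hot rows like those in Stage 1 can be sketched as follows. The vocabulary and the helper name are illustrative assumptions, not the scaffold's actual names:

```python
# Illustrative one-hot construction; the vocabulary order is assumed fixed.
vocab = ["the", "small", "cat"]

def one_hot(token, vocab):
    """Return a vector with 1.0 at the token's vocabulary index, 0.0 elsewhere."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(token)] = 1.0
    return vec

# Each row is one token's stand-in embedding, in prompt order.
rows = [one_hot(t, vocab) for t in ["the", "small", "cat"]]
```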

The prompt matrix can be read as:

Prompt matrix shape

$n$ rows by $d$ columns: rows track token positions, and columns are the components of each embedding vector.

This is why the scaffold naturally uses a 2D array: one axis for token positions and one axis for embedding components.
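A sketch of that two-axis layout, assuming plain Python lists of lists (the real scaffold's array type and sizes may differ):

```python
# prompt[i][j]: row i is token position i, column j is embedding component j.
n, d = 3, 4  # assumed sizes for illustration only
prompt = [[0.0] * d for _ in range(n)]

for i in range(n):       # loop over token positions (rows)
    for j in range(d):   # loop over embedding components (columns)
        prompt[i][j] = i + 0.1 * j  # placeholder values
```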

It is also why the rest of the assignment is mostly about array loops and matrix-style computations rather than text processing.

Why the spec says token instead of word

Models do not literally operate on words. A token might be:

  • a whole word
  • part of a word
  • punctuation
  • even whitespace in some systems

For this assignment, you can safely pretend tokens are words because the important fact is just that token positions have an order and the attention rules depend on that order.
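Under that simplification, a whitespace split is a fair mental model. A real tokenizer (e.g. a subword scheme such as BPE) would split text differently; this sketch is illustrative only:

```python
# Naive word-level "tokenizer" -- a stand-in for real subword tokenizers.
def tokenize(text):
    return text.split()

tokens = tokenize("the small cat")  # ["the", "small", "cat"], order preserved
```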