COMP10002 Foundations of Algorithms
Assignment 1: Attention Is All You Need
A web-first version of the assignment spec. You should be able to read this page straight through; the linked explainers are there only when you want more intuition. If there is ever an inconsistency between this main page and a linked explainer, this page takes precedence, so treat the other pages as supplemental unless otherwise advised.
Important notes
- Submission is a single file: a1.c, through Gradescope.
- Start from the provided scaffold and fill in the TODO functions.
- Do not change the stage header lines or print helpers.
- Do not add extra printf debugging output.
- Your code must compile on the marking system or the testcase component cannot run.
Compilation errors matter
Because the testcase mark is autograded, your program must compile on the marking system. If it does not compile, it cannot be run and you will receive zero for correctness.
How to use this page
You can read this page from top to bottom like the assignment handout. If a term such as softmax, projection, or KV cache slows you down, use the inline link for a deeper explanation, then come back here.
Downloadable PDF
Prefer a printable version? Download the compiled project spec PDF. It bundles this main spec, the concept explainers, and the optional support pages as appendices.
What is this assignment about?
This assignment is inspired by how modern Large Language Models (LLMs) such as chatbots work.
An LLM is trained to predict the next token in a sequence. Conceptually, given a prompt
\[\texttt{token}_0, \texttt{token}_1, \dots, \texttt{token}_{n-1},\]the model produces a probability distribution for the next token $\texttt{token}_n$.
Think of it like autocomplete: based on what you typed so far, what is most likely to come next?
What a real LLM would do, at a high level
You do not need the exact ML details, but a real LLM roughly:
- tokenises text into token IDs
- looks up token embeddings
- runs many Transformer layers
- uses attention inside those layers
- produces scores over a vocabulary
- chooses the next token, appends it, and repeats
What this assignment does instead
You are not implementing a full chatbot. You are implementing a core part used inside them: a simplified single-head self-attention block plus a final KV-cache generation step.
If you want the plain-English picture of what attention is doing before you get into the loops, read the Attention explainer.
To keep the programming focused and testable:
- you begin with a small Stage 1 representation exercise that turns a token list into one-hot vectors
- the later attention stages use the provided prompt and generated embeddings directly as numeric rows
- you are given the parameter matrices Wq, Wk, and Wv directly as numbers
- you compute attention scores, weights, and outputs exactly
- you apply masking rules and stable softmax
- you do not predict English words
If the maths still feels abstract, that is fine. The implementation requirements are stated explicitly below, and the linked explainer pages fill in the intuition when you need it.
Where the numbers in this assignment come from
A real language model starts with text, not matrices. Very roughly, the pipeline is:
- split the text into tokens
- turn each token into an embedding vector
- run attention over those vectors to build context-aware outputs
- use the final vector to help predict what token should come next
This assignment still simplifies a lot, but it lets you practise the representation idea separately from the later attention maths. You do not implement a full tokenizer, embedding-table lookup, vocabulary-scoring layer, or training process.
The project therefore has two related but separate “before attention” pieces:
- a short token list used in Stage 1, where you build and print a one-hot embedding table
- the prompt and generated embedding rows used by the later attention stages
That means Stage 1 is a warm-up representation exercise on tokens, while Stages 2 to 6 operate on the provided numeric embedding matrices.
That means:
- each row of the prompt matrix is one token embedding
- n is the number of prompt tokens
- d is the number of components in each embedding vector
- the prompt positions still have an order, so “earlier token” and “later token” matter
The core job of attention is then: for one position, look back at the earlier positions that are allowed, decide which ones matter most, and blend their information into a new output row.
Recommended visual introduction
If you want one strong visual introduction before or during the assignment, watch 3Blue1Brown: Attention in transformers.
It is especially useful for building intuition about what single-head self-attention is doing before you get lost in the matrix arithmetic. Just note that they show embedding vectors as columns, whereas this assignment uses rows!
The whole job in one pass
Your program follows one continuous pipeline:
- read the token list, mask, embedding rows, and parameter matrices from the input file
- build and print the Stage 1 one-hot embeddings from the token list
- project the provided prompt embeddings into query, key, and value matrices
- compare prompt queries against prompt keys to get attention scores
- turn those scores into attention weights with masking and stable softmax
- use the weights to blend the value vectors into prompt outputs
- extend the same logic to generated tokens while reusing cached keys and values
If you keep that pipeline in mind, the stage descriptions later on should read as one algorithm being built step by step, not as unrelated tasks.
Mental Model For The Stages
Keep this mental model in mind while reading the stages:
- a prompt position starts as “just this token’s embedding”
- by the end of Stage 5, that row has become “this token, rewritten using the earlier rows that mattered to it”
- Stage 6 applies the same attention idea during generation, but reuses cached keys and values instead of rebuilding old work
If you lose the thread while coding, come back to these two questions:
- which earlier rows is this position allowed to use?
- how much should each allowed row contribute to the new output row?
In a full model there would be more machinery around this block, such as tokenisation, embedding lookup, many repeated layers, and a final vocabulary-scoring step. Here you only implement the attention part in the middle.
Optional background and visual explainers
If you want more background, worked examples, or visual explanations, these pages are useful:
- Concept explainers for attention, projections, masking, softmax, and related ideas
- Transformer explainer for a one-page overview of the larger model architecture
- Concepts for both the required explainers and the optional demos, tooling, and background pages
- The Illustrated Transformer for a visual walkthrough of the model structure
- Attention Is All You Need for the original Transformer paper
Those pages support this spec. They are not required pre-reading.
Glossary and notation
This section collects the vocabulary and notation that help you read the assignment, before you get into the stage-by-stage implementation details.
Glossary
| Term | Meaning |
|---|---|
| Token | A piece of text, sometimes a whole word and sometimes part of a word. |
| Prompt | The starting tokens you already have before generating new tokens. |
| Generate | Produce new tokens one at a time after the prompt. |
| Vector | A fixed-length list of numbers. In C, a 1D array of double. |
| Matrix | A table of numbers. In C, a 2D array of double. |
| Embedding | A vector representing a token as numbers. |
| Dot product | Multiply matching positions and add them up. |
| Score | A number saying how relevant one token is to another. |
| Softmax | A function that turns scores into weights that sum to 1. |
| Mask | A rule that says “ignore this position”. |
| Causal mask | The mask that stops row i from using any future position j > i. |
| Padding mask | The mask that ignores prompt positions marked as padding, usually where mask[j] == 0. |
| KV cache | Stored past keys and values used during generation. |
Notation and conventions
Throughout this spec:
- We number tokens starting at 0 when describing arrays, so prompt indices are $0,1,\dots,n-1$.
- A vector of length d has components $0, \dots, d-1$. For clarity we will show vectors as bold with an arrow above and use round brackets. For example $\vec{\mathbf{v}} = \begin{pmatrix}1 & 2\end{pmatrix}$ would be double v[] = {1, 2} in C.
- For clarity we will show matrices as bold and write them using square brackets. For example $\mathbf{M} = \begin{bmatrix}1 & 2\end{bmatrix}$ would be double M[][] = { {1, 2} } in C. If we write $\vec{\mathbf{M}}_i$ we mean the vector corresponding to row $i$ in $\mathbf{M}$; in C that is M[i].
- “$\sum$” means “add up a bunch of terms” (like a loop). For example, $\sum_{t=0}^{d-1} a[t]\cdot b[t] = a[0]b[0] + \dots + a[d-1]b[d-1]$ means “for each t, multiply, then add the results”.
- Subscripts like $Q_{i,t}$ for a matrix $\mathbf{Q}$ mean “row $i$, component $t$”. In C, that is Q[i][t].
- Masked means “treated as if it does not exist”: its weight must be 0.0, and in Bonus Stage 5 you must not compute a dot product for it.
Dot products, vectors and matrices
If we have vectors $\vec{\mathbf{a}} = \begin{pmatrix} 1 & 2\end{pmatrix}$ and $\vec{\mathbf{b}} = \begin{pmatrix} 3 & 4\end{pmatrix}$, then the dot product is given by
\[\vec{\mathbf{a}} \bullet \vec{\mathbf{b}} = \begin{pmatrix} 1 & 2\end{pmatrix} \bullet \begin{pmatrix} 3 & 4\end{pmatrix}= 1\cdot 3 + 2\cdot 4 = 11.\]
Because we will be working with matrices, it is convenient to define our dot product as a matrix multiplication between two matrices. For a matrix multiplication to be defined, the number of columns in the first matrix and the number of rows in the second matrix must be equal. The dimension of the resulting matrix will be the number of rows of the first matrix by the number of columns of the second matrix. In the case below it will be 1 by 1 (just a number, exactly as our dot product between vectors requires).
\[\vec{\mathbf{a}} \bullet \vec{\mathbf{b}} = \mathbf{a} \mathbf{b}^T =\begin{bmatrix} 1 & 2\end{bmatrix} \begin{bmatrix} 3 \\ 4\end{bmatrix}\]By the standard rules of matrix multiplication, this gives us the same result as our dot product. So this is equivalent to the dot product between the vector corresponding to the row of the first matrix, and the column of the second matrix.
In C our matrices $\mathbf{a}$ and $\mathbf{b}$ would be double a_matrix[][] = { {1, 2} } and double b_matrix[][] = { {3, 4} }.
Then $\mathbf{b}^T$ would be the 2D matrix double bT_matrix[][] = { {3}, {4} }.
Here $\mathbf{b}^T$ means the transpose of $\mathbf{b}$, where transpose just means “flip it across the diagonal so rows become columns and vice versa”, which is required
to ensure the matrix multiplication is defined.
Note that we can’t directly extract a ‘column vector’ from the $\mathbf{b}^T$ matrix, like we can with a row (eg with a_matrix[0]), so it can be convenient to do operations in terms of matrices.
In this assignment, dot products appear inside matrix-style calculations, so it is useful to remember that each output component in a projection is “one row dotted with one column”.
If you want help reading the compact notation used in the Transformer paper and later papers, read the paper-notation explainer. If you want a visual explanation of row-vector times matrix multiplication, read the projection explainer.
Getting started
This section is the fast setup pass: what you have, what the scaffold already does, and what constraints matter before you start coding.
You have been given
- a scaffold a1.c with a working main and TODO functions
- sample inputs in test_data/
- matching expected outputs in test_data/
The scaffold already
- prints the required stage headers and row formatting
- calls your functions in the correct order
- allocates arrays at safe maximum sizes
You should
- fill in only the functions marked /* TODO: ... */
- keep all arrays within the provided maximum sizes
- use double for calculations
- include <math.h> and compile with -lm
You must not
- change the stage header lines
- add extra output
- change print_vector
Inspect the scaffold first
Before you start coding, spend some time familiarising yourself with a1.c:
- what arrays already exist
- what each function is meant to compute
- what gets printed
How to get going
Start by reading a1.c, then implement the stages in order.
Begin with
read_input(...)
void read_input(int *n, int *d, int *g, int mask[MAX_TOKENS],
double prompt[MAX_TOKENS][MAX_D],
double gen[MAX_GEN][MAX_D],
double wq[MAX_D][MAX_D], double wk[MAX_D][MAX_D],
double wv[MAX_D][MAX_D]);
Open in the annotated scaffold ↗
Every later stage depends on having the input stored correctly. Then get the Stage 1 embedding output working, then compute Stage 2 projections, Stage 3 scores, Stage 4 weights, Stage 5 outputs, and finally Stage 6 generation with the KV cache.
After each stage, compile and run test0 before moving on.
If you need more intuition while coding, use the linked explainers at the point where they become relevant: one-hot encoding for Stage 1, tokens and embeddings for the input story, projections for Stage 2, dot products and masking for Stage 3, softmax for Stage 4, attention output for Stage 5, and KV cache for Stage 6.
If reading a1.c itself is slowing you down, use the Annotated scaffold. The function links in the stage sections below jump straight to the matching scaffold section there.
Stage descriptions
The most useful way to read this section is as one algorithm unfolding in order. Stage 0 loads the token list and numeric snapshot. Stage 1 builds and prints one-hot embeddings. Stage 2 turns the provided prompt embeddings into Q, K, and V. Stages 3 to 5 perform attention over the prompt. Stage 6 then applies the same ideas during generation using the cache.
Each stage names the scaffold function you should implement first when there is one. Stage 1 is the small embedding-construction stage that now sits in front of the attention pipeline.
Stage 0: Reading the input
If you want to see what a real input file looks like before reading the format rules, open What an input file looks like.
Implement this scaffold function first:
read_input(...)
void read_input(int *n, int *d, int *g, int *text_len,
char embedding_table[MAX_TEXT_SIZE][MAX_TOKEN_LENGTH + 1],
int mask[MAX_TOKENS],
double prompt[MAX_TOKENS][MAX_D],
double gen[MAX_GEN][MAX_D],
double wq[MAX_D][MAX_D], double wk[MAX_D][MAX_D],
double wv[MAX_D][MAX_D]);
Open in the annotated scaffold ↗
Every later stage depends on storing the input snapshot correctly.
Read from stdin in this order:
- n d g text_len: the three attention sizes, followed by the number of tokens in the Stage 1 text.
- text_len whitespace-separated tokens. This is the Stage 1 text that you turn into the one-hot embedding table.
- n integers for mask. These say which prompt positions are real and which ones are padding.
- n prompt embeddings of length d. These rows are the numeric prompt matrix used by the later attention stages.
- g generated embeddings of length d. These rows are used later in the generation stage.
- d lines for Wq, the query projection matrix.
- d lines for Wk, the key projection matrix.
- d lines for Wv, the value projection matrix.
Bounds
| Symbol | Meaning | Range |
|---|---|---|
| n | number of prompt tokens | 1 <= n <= 64 |
| d | embedding dimension | 1 <= d <= 64 |
| g | number of generated tokens | 0 <= g <= 32 |
| text_len | number of tokens in the Stage 1 text | 1 <= text_len <= 64 |
Use scanf. Do not open files in your C program.
Whitespace-separated input means spaces and newlines are equivalent for scanf.
Stage 1: Creating embeddings
This stage is the small representation exercise before the attention maths begins. You take the input text, treat each whitespace-separated piece as a token, build the unique-token list, sort it lexicographically, and then print one one-hot vector per token in that sorted list.
The scaffold function for this stage is
create_embeddings(...)
int compare2Dchar(const void *a, const void *b);
void create_embeddings(char embedding_table[MAX_TEXT_SIZE][MAX_TOKEN_LENGTH + 1],
int text_len);
Open in the annotated scaffold ↗
If you use qsort(...), the comparison helper sits right beside it in the scaffold.
The core job in this stage is:
- sort the token array stored by Stage 0 lexicographically
- remove duplicate tokens from the sorted array
- give each token a vector that is all zero except for one 1, based on its index in the sorted array
- print the token together with its one-hot embedding as per the Output Format
That is the one-hot encoding idea. It is a simple bridge from categories to numbers, and the one-hot explainer shows it visually.
If this stage makes sense in principle but you have not really worked with C strings yet, use Strings in C. It is meant as a short crash course on the string operations this stage relies on.
This stage is the simple practice version of token representation.
Important Stage 1 Difference
The Stage 1 vector length is the number of unique tokens in the sorted list, not the prompt embedding dimension d used in the later attention stages.
You can use either qsort(...) or your own sorting code for the lexicographic ordering step.
How this stage relates to the later attention stages
The later attention stages do not depend on these Stage 1 vectors as input. This stage is there to make the idea of “text becomes vectors” concrete before the rest of the assignment switches to the provided numeric embedding matrices.
The important representation rule is:
- the sorted unique-token array is the embedding table
- the token’s position in that sorted array decides where the 1 goes
- every other position in that embedding is 0
That means the embedding dimension for this stage is “number of unique tokens in the sorted list”.
Worked example: a toy embedding table
If the sorted token list is
algorithms, best, foundations, of, subject, the, to, welcome
the one-hot vectors are:
"algorithms" -> (1 0 0 0 0 0 0 0)
"best" -> (0 1 0 0 0 0 0 0)
"foundations" -> (0 0 1 0 0 0 0 0)
"of" -> (0 0 0 1 0 0 0 0)
"subject" -> (0 0 0 0 1 0 0 0)
"the" -> (0 0 0 0 0 1 0 0)
"to" -> (0 0 0 0 0 0 1 0)
"welcome" -> (0 0 0 0 0 0 0 1)
This is not how modern trained models usually build embeddings, but it is a clean way to show how text can become vectors before attention begins.
For a smaller three-token case, the printed output would look like:
Stage 1: Create Embeddings
"algorithms" -> (1 0 0)
"subject" -> (0 1 0)
"the" -> (0 0 1)
Stage 2: Projections
In an LLM, once the prompt text has been tokenised, it is represented as a sequence of embedding vectors.
In this stage, we provide you with a sequence of embedding vectors corresponding to a tokenised prompt rather than relying on the previous stage, to keep the two parts separate. If we did use the previous stage, then for a given prompt you would look up the embedding for each token and pass an array of those embeddings to this stage. Notably, the embedding for every token is the same regardless of where it appears in the prompt, even though the position of words relative to others alters the meaning. This is where attention comes in!
Once the input is stored, our first computation to implement attention is to build the prompt query, key, and value matrices .
Before reading on, make sure you are familiar with the notation.
The input prompt embeddings are the rows of a matrix $\mathbf{X}$. To get queries, keys, and values, you project the same prompt matrix three times with three different matrices:
\[\mathbf{Q} = \mathbf{X}\mathbf{W_q} \qquad \mathbf{K} = \mathbf{X}\mathbf{W_k} \qquad \mathbf{V} = \mathbf{X}\mathbf{W_v}\]
What Q, K, and V are, and why there are three of them.
How do these matrices link to the C Scaffold?
- $\mathbf{X}$ is the prompt embedding matrix: prompt. Each row of the matrix corresponds to one prompt token embedding vector.
- $\mathbf{W_q}, \mathbf{W_k}, \mathbf{W_v}$ are the weight matrices. In C these are Wq, Wk, Wv stored as 2D arrays.
- $\mathbf{Q},\mathbf{K}, \mathbf{V}$ are the prompt projection matrices. In C these are q_matrix, k_matrix, v_matrix stored as 2D arrays.
Implement the scaffold function for this stage:
compute_projection(...)
void compute_projection(int count, int d, double src[MAX_TOKENS][MAX_D],
double w[MAX_D][MAX_D],
double dest[MAX_TOKENS][MAX_D]);
Open in the annotated scaffold ↗
If $\mathbf{X}$ is the input matrix, $\mathbf{W}$ is one of the projection matrices, and $\mathbf{Y}$ is the projected output matrix, then:
\[Y_{i,j} = \sum_{u=0}^{d-1} X_{i,u}\,W_{u,j} \qquad \mathbf{Y} = \begin{bmatrix} Y_{0,0} & \dots & Y_{0,d-1} \\ \vdots & \ddots & \vdots \\ Y_{n-1,0} & \dots & Y_{n-1,d-1} \end{bmatrix}\]Be careful to reset the sum to 0 for each output component.
This function is called three times to compute prompt $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$
Worked example: one token embedding vector
For a single token vector $\vec{\mathbf{x}}$ of length d, the projected vector $\vec{\mathbf{y}}$ is defined component by component by:
\[y_j = \sum_{u=0}^{d-1} x_u\,W_{u,j}\]
For an example with d = 2, we can equivalently write our component wise sum as a matrix multiplication.
We will define the matrix $\mathbf{X} = \begin{bmatrix}\vec{\mathbf{x}}\end{bmatrix}$ so we can write things in terms of matrix multiplication:
Then
\[\begin{align*} \mathbf{Y} &= \begin{bmatrix} x_{0} & x_{1} \end{bmatrix} \begin{bmatrix} W_{0,0} & W_{0,1}\\ W_{1,0} & W_{1,1} \end{bmatrix} = \begin{bmatrix} \begin{bmatrix} x_{0} & x_{1} \end{bmatrix} \begin{bmatrix} W_{0,0} \\ W_{1,0} \end{bmatrix} & \begin{bmatrix} x_0 & x_1 \end{bmatrix} \begin{bmatrix} W_{0,1} \\ W_{1,1} \end{bmatrix} \end{bmatrix} \\ &= \begin{bmatrix} \begin{pmatrix} x_{0} & x_{1} \end{pmatrix} \bullet \begin{pmatrix} W_{0,0} & W_{1,0} \end{pmatrix} & \begin{pmatrix} x_0 & x_1 \end{pmatrix} \bullet \begin{pmatrix} W_{0,1} & W_{1,1} \end{pmatrix} \end{bmatrix} \\ &= \begin{bmatrix} (x_0 \cdot W_{0,0} + x_1 \cdot W_{1,0}) & (x_0 \cdot W_{0,1} + x_1 \cdot W_{1,1}) \end{bmatrix} \end{align*}\]This is the same loop pattern your code uses for every row of the prompt matrix.
Worked example: two token embedding vectors
In our previous example we treated one input vector as the row of a matrix, but we can extend this to have multiple input vectors, each as a different row, and our matrix multiplication is still defined; the resulting matrix will just have more rows in it!
To get the query, key and value token matrices, we will do this projection operation three times with three different weight matrices:
\[\mathbf{Q} = \mathbf{X}\mathbf{W_q},\quad \mathbf{K} = \mathbf{X}\mathbf{W_k},\quad \mathbf{V} =\mathbf{X}\mathbf{W_v}.\]We can now consider a whole matrix of n prompt embedding vectors (each is a row of $\mathbf{X}$) of dimension d = 2 at once
In this example we have n = 2 token embedding vectors stored in our token embedding matrix: $\begin{pmatrix} X_{0,0} & X_{0,1}\end{pmatrix}$ and $\begin{pmatrix} X_{1,0} & X_{1,1}\end{pmatrix}$.
We can see that in the output matrix, each row corresponds to the projected vector for the given token embedding vector in that row of the input.
The computation in each row is independent of the other rows, so we can calculate the result for each row separately, but it is notationally compact to store the vectors as rows of a matrix, and convenient when it comes to programming as well!
Stage 3: Attention scores (prompt)
With $\mathbf{Q}$ and $\mathbf{K}$ available, the next step is to measure how strongly each key vector corresponding to prompt token $j$ should attend to each query vector corresponding to prompt token $i$. For this task you will be computing the $\mathbf{score}$ matrix, using the projections from the previous stage.
You should implement the scaffold function:
compute_attention_scores_or_weights_prompt(...)
void compute_attention_scores_or_weights_prompt(
int n, int d, const int mask[MAX_TOKENS],
double q[MAX_TOKENS][MAX_D], double k[MAX_TOKENS][MAX_D],
double scores_or_weights[MAX_TOKENS][MAX_TOKENS],
int apply_softmax);
Open in the annotated scaffold ↗
For each query vector corresponding to prompt token $i$ ($\vec{\mathbf{Q}_i}$) and key vector corresponding to prompt token $j$ ($\vec{\mathbf{K}_j}$), we compute an attention score. The score is a scaled dot product:
\[\text{score}_{i,j} = \frac{\vec{\mathbf{Q}_i} \bullet \vec{\mathbf{K}_j}}{\sqrt{d}} = \frac{\sum_{u=0}^{d-1} Q_{i,u}K_{j,u}}{\sqrt{d}} \qquad \textbf{score} = \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}} = \begin{bmatrix} \text{score}_{0,0} & \dots & \text{score}_{0,n-1} \\ \vdots & \ddots & \vdots \\ \text{score}_{n-1,0} & \dots & \text{score}_{n-1,n-1} \end{bmatrix}\]When computing scores, you should apply two different masks:
- causal mask: if
j > i, mask that position by setting the score to $-\infty$ - padding mask:
- if
mask[i] == 0, the whole score row becomes $-\infty$. - if
mask[j] == 0, mask that position by setting the score to $-\infty$
- if
In C, the score of $-\infty$ will be stored as -INFINITY using the INFINITY constant defined in <math.h>.
Why is the masked score $-\infty$?
We set the score to $-\infty$ because score is the dot product between two vectors, and a small dot product means the vectors are very different. By setting the score to $-\infty$, when we compute the weight in the next stage, these masked values will have weight 0.0.
Why do we use the same mask for both dimensions?
We use the same padding mask for both dimensions because the row index and the column index both refer to the same set of tokens, just through different projected vectors!
What is the score formula really doing?
If you want the slower derivation, small examples, and a visual story for why the masks are there, use the dot-product explainer and masking explainer.
Worked example: one attention score
Let's consider an example with n = 3 and d = 2:
Here each row $\vec{\mathbf{Q}_i}$ and $\vec{\mathbf{K}_j}$ of $\mathbf{Q}$ and $\mathbf{K}$ correspond to a vector projected from a token embedding vector.
Suppose that the original token embedding vectors for rows 0, 1, 2 in $\mathbf{X}$ from the previous part
represented the tokens Foundations, of and Algorithms respectively; this is what the rows of $\mathbf{Q}$ and $\mathbf{K}$ also represent.
For example, if we were calculating $\text{score}_{1,0}$, this is the attention score of query vector 1 (of) with key vector 0 (Foundations), or more
precisely how much key vector 0 attends to query vector 1.
As in the previous stage, we can also compute the score matrix as a matrix multiplication (for notational convenience). Here we need to transpose the second matrix so that the matrix multiplication is defined: the number of columns of the first matrix must equal the number of rows of the second matrix. Mathematically this is equivalent, and it is how research papers represent the operation, but you may find one way or the other easier to understand when you come to implement it in code.
\[\textbf{score} = \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}} = \frac{1}{\sqrt{d}}\begin{bmatrix} Q_{0,0} & Q_{0,1} \\ Q_{1,0} & Q_{1,1} \\ Q_{2,0} & Q_{2,1} \end{bmatrix} \begin{bmatrix} K_{0,0} & K_{1,0} & K_{2,0} \\ K_{0,1} & K_{1,1} & K_{2,1} \end{bmatrix}\]
Stage 4: Attention weights (prompt)
The Stage 3 scores are not the final answer. They must be normalised into weights.
For this stage, you should modify the scaffold function from stage 3
compute_attention_scores_or_weights_prompt(...)
void compute_attention_scores_or_weights_prompt(
int n, int d, const int mask[MAX_TOKENS],
double q[MAX_TOKENS][MAX_D], double k[MAX_TOKENS][MAX_D],
double scores_or_weights[MAX_TOKENS][MAX_TOKENS],
int apply_softmax);
Open in the annotated scaffold ↗
so that it can apply softmax.
Hence, when apply_softmax is 1 (this stage), you should take the scores and apply stable softmax for each row in the output matrix.
If a row has no unmasked columns, output 0.0 for each weight in that row; do not divide by zero!
Worked example: scores become weights
If one row of unmasked scores is
(2, 0, -1)
then softmax turns that row into weights roughly like
(0.84, 0.11, 0.05)
The largest score becomes the largest weight, but every unmasked position still gets a non-negative share. Stable softmax produces the same final weights as ordinary softmax; it just subtracts the row maximum first so the exponentials stay numerically safe.
Stage 5: Attention output (prompt)
Once you have the attention weights, you can compute the actual prompt outputs. Implement this scaffold function:
compute_attention_output_prompt(...)
void compute_attention_output_prompt(
int n, int d,
double weights[MAX_TOKENS][MAX_TOKENS],
double v[MAX_TOKENS][MAX_D], double out[MAX_TOKENS][MAX_D]);
Open in the annotated scaffold ↗
Compute the attention output vectors as a matrix $\mathbf{output}$ for each prompt token using:
\[\text{output}_{i,j} = \overrightarrow{\textbf{weight}}_i \bullet \overrightarrow{(\mathbf{V}^T)}_j = \sum_{u=0}^{n-1} \text{weight}_{i,u}\,V_{u,j} \qquad \textbf{output} = (\textbf{weight})\textbf{V}= \begin{bmatrix} \text{output}_{0,0} & \dots & \text{output}_{0,d-1} \\ \vdots & \ddots & \vdots \\ \text{output}_{n-1,0} & \dots & \text{output}_{n-1,d-1} \end{bmatrix}\]
Note that $\overrightarrow{(\mathbf{V}^T)}_j$ just means: take the vector corresponding to column $j$ of matrix $\mathbf{V}$. Writing $\overrightarrow{\mathbf{V}}_j$ would refer to row $j$. Doing the transpose first turns the columns into rows, so we can extract row $j$ to get the vector corresponding to the column.
This is a weighted sum of value vectors using the attention weights you just computed.
Worked example: Three tokens embedded in two dimensions
Output example with $n = 3$ and $d = 2$
\[\textbf{weight} = \begin{bmatrix} \text{weight}_{0,0} & \text{weight}_{0,1} & \text{weight}_{0,2} \\ \text{weight}_{1,0} & \text{weight}_{1,1} & \text{weight}_{1,2} \\ \text{weight}_{2,0} & \text{weight}_{2,1} & \text{weight}_{2,2} \end{bmatrix} \qquad \mathbf{V} = \begin{bmatrix} V_{0,0} & V_{0,1} \\ V_{1,0} & V_{1,1} \\ V_{2,0} & V_{2,1} \end{bmatrix}\]Using the same example as before, suppose that the original token embedding vector for row 0, 1, 2 in $\mathbf{X}$
represented the tokens Foundations, of and Algorithms.
The $\mathbf{weight}$ matrix would be 3 rows by 3 columns, and the value $\mathbf{V}$ matrix would be 3 rows by 2 columns.
The output vector for token 2, which incorporates the context of the prompt tokens preceding it directly in the embedding would be:
\[\begin{align*} \overrightarrow{\mathbf{output}}_2 &= \begin{pmatrix} \begin{bmatrix} \text{weight}_{2,0} &&\text{weight}_{2,1} &&\text{weight}_{2,2} \end{bmatrix} \begin{bmatrix} V_{0,0} \\ V_{1,0} \\ V_{2,0} \end{bmatrix} & \begin{bmatrix} \text{weight}_{2,0} &&\text{weight}_{2,1} &&\text{weight}_{2,2} \end{bmatrix} \begin{bmatrix} V_{0,1} \\ V_{1,1} \\ V_{2,1} \end{bmatrix} \end{pmatrix}\\ &= \begin{pmatrix} (\text{weight}_{2,0} V_{0,0} + \text{weight}_{2,1} V_{1,0} + \text{weight}_{2,2} V_{2,0}) & (\text{weight}_{2,0} V_{0,1} + \text{weight}_{2,1} V_{1,1} + \text{weight}_{2,2} V_{2,1}) \end{pmatrix} \end{align*}\]If one weight is near 1, the output will look very similar to that one value vector, if weights are spread out, the output is a blend.
Much like the previous parts, we can also write this computation very compactly using a matrix multiplication, which will give us a matrix where each row corresponds to the output vector for a given token
\[\mathbf{output} = (\mathbf{weight})\mathbf{V} = \begin{bmatrix} \text{weight}_{0,0} & \text{weight}_{0,1} & \text{weight}_{0,2} \\ \text{weight}_{1,0} & \text{weight}_{1,1} & \text{weight}_{1,2} \\ \text{weight}_{2,0} & \text{weight}_{2,1} & \text{weight}_{2,2} \end{bmatrix} \begin{bmatrix} V_{0,0} & V_{0,1} \\ V_{1,0} & V_{1,1} \\ V_{2,0} & V_{2,1} \end{bmatrix}\]How should I interpret the output vector?
This is the point of attention: one prompt token’s output vector becomes a blend of value vectors based on which key vectors attended to a given query vector the most.
If one weight dominates, the output will look very similar to one value vector. If several weights matter, the output becomes a mixture of several vectors. The attention explainer shows this more visually.
Stage 6: Generated outputs with a KV cache
In this task you will use the prompt tokens and the generated tokens (both provided as part of the input), together with a KV cache, to compute attention outputs. You will not be generating the tokens yourself, as they are provided.
How does this link to a full LLM?
In a full LLM, we would first compute attention for all the prompt tokens, then pass the result to the other components of the model, which use it to generate the next token. Once we have this new generated token, we again compute attention and pass it along so that another token can be generated, repeating as many times as desired. So while your code receives all the generated tokens up front as input, a real system would produce them one at a time.
The final stage applies the same attention logic during generation, but now the program reuses past keys and values through the cache.
Implement this scaffold function:
compute_generation_with_cache(...)
long compute_generation_with_cache(
int n, int d, int g, int t, const int mask[MAX_TOKENS],
double gen[MAX_GEN][MAX_D], double wq[MAX_D][MAX_D],
double wk[MAX_D][MAX_D], double wv[MAX_D][MAX_D],
double k_cache[MAX_TOKENS + MAX_GEN][MAX_D],
double v_cache[MAX_TOKENS + MAX_GEN][MAX_D],
double output[MAX_D]);
Open in the annotated scaffold ↗.
For each generated token $t$ which follows $n$ prompt tokens, you compute:
- $\vec{\mathbf{Q}}_{n+t}$: the query vector for the token numbered $n+t$.
- $\vec{\mathbf{K}}_{n+t}$ and $\vec{\mathbf{V}}_{n+t}$: the key and value vectors for the token numbered $n+t$, which should be appended to the KV cache.
- score the new query $\vec{\mathbf{Q}}_{n+t}$ against all cached keys $\vec{\mathbf{K}}_{i}$ for $0 \le i \le n+t$, remembering to mask any prompt cache entries ($0 \leq i < n$) where
mask[i] == 0 (same as one row of stage 3)
- apply stable softmax over the cached scores to turn them into weights (same as one row of stage 4)
- compute the output as a weighted sum of cached values (same as one row of stage 5)
There is no separate future-token causal mask in Stage 6 because the cache only contains past items plus the current generated token. The current generated token is included, so it can attend to itself.
Dot Products Computed
You must also report the exact number of dot products computed for each generated token.
In this stage, define “a dot product computed” as any dot product between two vectors. This includes every row-by-column product inside a matrix multiplication, whether it involves a weight matrix or the embedding or projection vectors of an unmasked prompt token or a generated token.
That means:
- For each generated token, you should compute each unmasked score exactly once (store it in an array, reuse it for both the max and the denominator).
- The count you report for a generated token should include every dot product computed for that token: the Q, K, and V projections, the scores against unmasked cached keys, and the output vector (see the worked example).
- If a prompt cache entry is masked (mask[i] == 0), no score is computed for it, so it contributes 0 to the count.
Worked example: Stage 6 dot-product counts
In the visible test0.txt case, the prompt mask is 1 1 0, d = 2, and there are two generated tokens (g = 2).
For the first generated token:
- 3d = 6 dot products compute the new query, key, and value vectors
- 3 dot products score against unmasked cached keys
- d = 2 dot products compute the output vector
So the printed count is 11.
For the second generated token:
- 6 dot products compute the new query, key, and value vectors
- 4 dot products score against unmasked cached keys
- 2 dot products compute the output vector
So the printed count is 12.
Why does the cache help here?
The cache exists so you do not keep recomputing old keys and values during generation. The KV-cache explainer shows the cache growth step by step and explains how the per-generated-token dot-product count should be interpreted.
Want A Concrete End-To-End Example?
Now that you have seen the whole pipeline, the Toy demo is a good way to make it concrete.
It lets you generate a small numeric input from real text, run your program on it, and then read back a token-level explanation of the resulting attention weights. It is optional, but it may help you connect the stages to something more human-readable.
The structured version also uses a small one-hot encoding region for entity identity, which is why some of the toy vectors look especially interpretable.
Testing, Marking, and Output Rules
This section collects the parts of the spec that are directly useful while checking your work: testing, common failure modes, marking, and the exact output contract.
Marking and feedback
This assignment has:
- an autograded testcase component
- human-marked components for code style and code structure/approach
- a short written complexity analysis
The stage breakdown in the handout is:
| Stage | Testcases | Style | Structure / Approach | Complexity Analysis | Total |
|---|---|---|---|---|---|
| 1 | 0.5 | 0.5 | 0.5 | 0.5 | 2 |
| 2 | 1 | 0.5 | 1 | 0.5 | 3 |
| 3 | 2 | 0.5 | 1 | 0.5 | 4 |
| 4 | 1 | 0.5 | 1 | 0.5 | 3 |
| 5 | 2 | 0.5 | 1 | 0.5 | 4 |
| 6 | 2 | 0.5 | 1 | 0.5 | 4 |
Code style expectations
- define constants before use instead of relying on magic numbers or strings
- keep `#define` names uppercase
- include function prototypes for all functions
- use descriptive names for variables and functions
- comment on the approach behind sections of code without restating what is clear from the syntax
- keep bracket placement, indentation, and spacing consistent
- use blank lines to separate sections and improve readability
- keep lines of code below 100 characters
Code structure expectations
- avoid duplicated code segments
- avoid unnecessary global variables by passing data into functions
- use helper functions where that improves structure
- keep functions reasonably short and not overly complex
- choose a straightforward algorithmic approach
- avoid unnecessary duplication of data
- avoid deeply nested control flow where possible
Complexity Analysis
Fill in the Complexity Analysis block near the top of a1.c.
In plain English, give the Big-O time complexity in terms of n, d, and g for:
- Stage 1 create embeddings
- Stage 2 projections
- Stage 3 prompt attention scores
- Stage 4 prompt attention weights
- Stage 5 prompt attention outputs
- Stage 6 generation with cache
It is enough to count dominant loop work and ignore constant factors. You do not need to analyse memory usage. The important part is not just stating the Big-O, but briefly explaining how you got it. Analyse the code you actually wrote, not a more optimised version you could have written instead.
Testing your code
Use these sample tests to check your own a1.c as you build it.
Inside the provided bundle you have:
- test_data/test0.txt with expected output test_data/test0-out.txt
- test_data/test1.txt with expected output test_data/test1-out.txt
- test_data/test2.txt with expected output test_data/test2-out.txt
- test_data/test3.txt with expected output test_data/test3-out.txt for a larger end-to-end practice case
Compile:
clang -Wall -Wextra -Wno-newline-eof -Wno-unused-parameter -pedantic -std=c17 -lm -o a1 a1.c
Run a test:
./a1 < test_data/test0.txt
Compare output:
./a1 < test_data/test0.txt > myout.txt
diff -u test_data/test0-out.txt myout.txt
Common pitfalls
- Forgetting the causal mask in Stages 3 to 5. For row i, positions j > i must be masked.
- Forgetting the padding mask in Stages 3 to 5.
- Forgetting that if mask[i] == 0, the whole position i should contribute a zero vector in the later stages.
- Not using stable softmax.
- Dividing by zero when all columns are masked.
- In Stage 6, counting dot products incorrectly.
Exact output format
You must print exactly the following stage headers and data, in order:
- Stage 1: one-hot embeddings for the sorted unique tokens
- Stage 2: prompt Q, K, and V projections, each an n x d table
- Stage 3: attention scores for the prompt, an n x n table
- Stage 4: attention weights for the prompt, an n x n table
- Stage 5: attention outputs for the prompt, an n x d table
- Stage 6: one generated-output line and one dot-product-count line for each generated token
The exact stage header lines are:
Stage 1: Create Embeddings
Stage 2: Projections
Stage 3: Attention Scores (Prompt)
Stage 4: Attention Weights (Prompt)
Stage 5: Attention Output (Prompt)
Stage 6: Generated Outputs
The Stage 2 block format is:
Q Projection:
<n lines of d numbers>
K Projection:
<n lines of d numbers>
V Projection:
<n lines of d numbers>
The Stage 6 line format is:
Gen 0: <d numbers>
Dot products computed: <an integer>
Gen 1: <d numbers>
Dot products computed: <an integer>
...
Numbers must be printed with exactly three digits after the decimal point, separated by a single space. There must be no extra spaces at line ends and no extra debug output anywhere.
The scaffold’s printing functions also clamp extremely small values to 0.000. That is part of the expected output behaviour, so it is safest to leave the provided printing code alone.