COMP10002 Foundations of Algorithms
Assignment 1: Attention Is All You Need
A web-first version of the assignment spec. You should be able to read this page straight through; the linked explainers are there only when you want more intuition. If there is ever an inconsistency between this main page and a linked explainer, this page takes precedence, so treat the other pages as supplemental unless otherwise advised.
Important notes
- Submission is a single file: a1.c, through Gradescope.
- Start from the provided scaffold and fill in the TODO functions.
- Do not change the stage header lines or print helpers.
- Do not add extra printf debugging output.
- Your code must compile on the marking system or the testcase component cannot run.
Compilation errors matter
Because the testcase mark is autograded, your program must compile on the marking system. If it does not compile, it cannot be run and you will receive zero for correctness.
How to use this page
You can read this page from top to bottom like the assignment handout. If a term such as softmax, projection, or KV cache slows you down, use the inline link for a deeper explanation, then come back here.
Downloadable PDF
Prefer a printable version? Download the compiled project spec PDF. It bundles this main spec, the concept explainers, and the optional support pages as appendices.
What is this assignment about?
This assignment is inspired by how modern Large Language Models (LLMs) such as chatbots work.
An LLM is trained to predict the next token in a sequence. Conceptually, given a prompt
\[\texttt{token}_0, \texttt{token}_1, \dots, \texttt{token}_{n-1},\]the model produces a probability distribution for the next token $\texttt{token}_n$.
Think of it like autocomplete: based on what you typed so far, what is most likely to come next?
What a real LLM would do, at a high level
You do not need the exact ML details, but a real LLM roughly:
- tokenises text into token IDs
- looks up token embeddings
- runs many Transformer layers
- uses attention inside those layers
- produces scores over a vocabulary
- chooses the next token, appends it, and repeats
What this assignment does instead
You are not implementing a full chatbot. You are implementing a core part used inside them: a simplified single-head self-attention block plus a final KV-cache generation step.
If you want the plain-English picture of what attention is doing before you get into the loops, read the Attention explainer.
To keep the programming focused and testable:
- you begin with a small Stage 1 representation exercise that turns a token list into one-hot vectors
- the later attention stages use the provided prompt and generated embeddings directly as numeric rows
- you are given the parameter matrices Wq, Wk, and Wv directly as numbers
- you compute attention scores, weights, and outputs exactly
- you apply masking rules and stable softmax
- you do not predict English words
If the maths still feels abstract, that is fine. The implementation requirements are stated explicitly below, and the linked explainer pages fill in the intuition when you need it.
Where the numbers in this assignment come from
A real language model starts with text, not matrices. Very roughly, the pipeline is:
- split the text into tokens
- turn each token into an embedding vector
- run attention over those vectors to build context-aware outputs
- use the final vector to help predict what token should come next
This assignment still simplifies a lot, but it lets you practise the representation idea separately from the later attention maths. You do not implement a full tokenizer, embedding-table lookup, vocabulary-scoring layer, or training process.
The project therefore has two related but separate “before attention” pieces:
- a short token list used in Stage 1, where you build and print a one-hot embedding table
- the prompt and generated embedding rows used by the later attention stages
That means Stage 1 is a warm-up representation exercise on tokens, while Stages 2 to 6 operate on the provided numeric embedding matrices.
That means:
- each row of the prompt matrix is one token embedding
- n is the number of prompt tokens
- d is the number of components in each embedding vector
- the prompt positions still have an order, so “earlier token” and “later token” matter
The core job of attention is then: for one position, look back at the earlier positions that are allowed, decide which ones matter most, and blend their information into a new output row.
Recommended visual introduction
If you want one strong visual introduction before or during the assignment, watch 3Blue1Brown: Attention in transformers.
It is especially useful for building intuition about what single-head self-attention is doing before you get lost in the matrix arithmetic. Just note that they show embedding vectors as columns, whereas this assignment uses rows!
The whole job in one pass
Your program follows one continuous pipeline:
- read the token list, mask, embedding rows, and parameter matrices from the input file
- build and print the Stage 1 one-hot embeddings from the token list
- project the provided prompt embeddings into query, key, and value matrices
- compare prompt queries against prompt keys to get attention scores
- turn those scores into attention weights with masking and stable softmax
- use the weights to blend the value vectors into prompt outputs
- extend the same logic to generated tokens while reusing cached keys and values
If you keep that pipeline in mind, the stage descriptions later on should read as one algorithm being built step by step, not as unrelated tasks.
Mental Model For The Stages
Keep this mental model in mind while reading the stages:
- a prompt position starts as “just this token’s embedding”
- by the end of Stage 5, that row has become “this token, rewritten using the earlier rows that mattered to it”
- Stage 6 applies the same attention idea during generation, but reuses cached keys and values instead of rebuilding old work
If you lose the thread while coding, come back to these two questions:
- which earlier rows is this position allowed to use?
- how much should each allowed row contribute to the new output row?
In a full model there would be more machinery around this block, such as tokenisation, embedding lookup, many repeated layers, and a final vocabulary-scoring step. Here you only implement the attention part in the middle.
Optional background and visual explainers
If you want more background, worked examples, or visual explanations, these pages are useful:
- Concept explainers for attention, projections, masking, softmax, and related ideas
- Transformer explainer for a one-page overview of the larger model architecture
- Concepts for both the required explainers and the optional demos, tooling, and background pages
- The Illustrated Transformer for a visual walkthrough of the model structure
- Attention Is All You Need for the original Transformer paper
Those pages support this spec. They are not required pre-reading.
Glossary and notation
This section collects the vocabulary and notation that help you read the assignment, before you get into the stage-by-stage implementation details.
Glossary
| Term | Meaning |
|---|---|
| Token | A piece of text, sometimes a whole word and sometimes part of a word. |
| Prompt | The starting tokens you already have before generating new tokens. |
| Generate | Produce new tokens one at a time after the prompt. |
| Vector | A fixed-length list of numbers. In C, a 1D array of double. |
| Matrix | A table of numbers. In C, a 2D array of double. |
| Embedding | A vector representing a token as numbers. |
| Dot product | Multiply matching positions and add them up. |
| Score | A number saying how relevant one token is to another. |
| Softmax | A function that turns scores into weights that sum to 1. |
| Mask | A rule that says “ignore this position”. |
| Causal mask | The mask that stops row i from using any future position j > i. |
| Padding mask | The mask that ignores prompt positions marked as padding, usually where mask[j] == 0. |
| KV cache | Stored past keys and values used during generation. |
Notation and conventions
Throughout this spec:
- We number tokens starting at 0 when describing arrays, so prompt indices are $0,1,\dots,n-1$.
- A vector of length d has components $0, \dots, d-1$. For clarity we will show vectors as bold with an arrow above and use round brackets. For example $\vec{\mathbf{v}} = \begin{pmatrix}1 & 2\end{pmatrix}$ would be double v[] = {1, 2} in C.
- For clarity we will show matrices as bold and write them using square brackets. For example $\mathbf{M} = \begin{bmatrix}1 & 2\end{bmatrix}$ would be double M[][] = { {1, 2} } in C. If we write $\vec{\mathbf{M}}_i$ we mean the vector corresponding to row $i$ in $\mathbf{M}$; in C that is M[i].
- “$\sum$” means “add up a bunch of terms” (like a loop). For example, $\sum_{t=0}^{d-1} a[t]\cdot b[t] = a[0]b[0] + \dots + a[d-1]b[d-1]$ means “for each t, multiply, then add the results”.
- Subscripts like $Q_{i,t}$ for a matrix $\mathbf{Q}$ mean “row $i$, component $t$”. In C, that is Q[i][t].
- Masked means “treated as if it does not exist”: its weight must be 0.0, and in Bonus Stage 5 you must not compute a dot product for it.
Dot products, vectors and matrices
If we have vectors $\vec{\mathbf{a}} = \begin{pmatrix} 1 & 2\end{pmatrix}$ and $\vec{\mathbf{b}} = \begin{pmatrix} 3 & 4\end{pmatrix}$, then the dot product is given by
\[\vec{\mathbf{a}} \bullet \vec{\mathbf{b}} = \begin{pmatrix} 1 & 2\end{pmatrix} \bullet \begin{pmatrix} 3 & 4\end{pmatrix}= 1\cdot 3 + 2\cdot 4 = 11.\]
Because we will be working with matrices, it is convenient to define our dot product as a matrix multiplication between two matrices. For a matrix multiplication to be defined, the number of columns in the first matrix and the number of rows in the second matrix must be equal. The dimension of the resulting matrix will be the number of rows of the first matrix by the number of columns of the second matrix. In the case below it will be 1 by 1 (just a number, exactly as our dot product between vectors requires).
\[\vec{\mathbf{a}} \bullet \vec{\mathbf{b}} = \mathbf{a} \mathbf{b}^T =\begin{bmatrix} 1 & 2\end{bmatrix} \begin{bmatrix} 3 \\ 4\end{bmatrix}\]By the standard rules of matrix multiplication, this gives us the same result as our dot product. So this is equivalent to the dot product between the vector corresponding to the row of the first matrix, and the column of the second matrix.
In C our matrices $\mathbf{a}$ and $\mathbf{b}$ would be double a_matrix[][] = { {1, 2} } and double b_matrix[][] = { {3, 4} }.
Then $\mathbf{b}^T$ would be the 2D matrix double bT_matrix[][] = { {3}, {4} }.
Here $\mathbf{b}^T$ means the transpose of $\mathbf{b}$, where transpose just means “flip it across the diagonal so rows become columns and vice versa”, which is required
to ensure the matrix multiplication is defined.
Note that we can’t directly extract a ‘column vector’ from the $\mathbf{b}^T$ matrix, like we can with a row (eg with a_matrix[0]), so it can be convenient to do operations in terms of matrices.
In this assignment, dot products appear inside matrix-style calculations, so it is useful to remember that each output component in a projection is “one row dotted with one column”.
If you want help reading the compact notation used in the Transformer paper and later papers, read the paper-notation explainer. If you want a visual explanation of row-vector times matrix multiplication, read the projection explainer.
Getting started
This section is the fast setup pass: what you have, what the scaffold already does, and what constraints matter before you start coding.
You have been given
- a scaffold a1.c with a working main and TODO functions
- sample inputs in test_data/
- matching expected outputs in test_data/
The scaffold already
- prints the required stage headers and row formatting
- calls your functions in the correct order
- allocates arrays at safe maximum sizes
You should
- fill in only the functions marked /* TODO: ... */
- keep all arrays within the provided maximum sizes
- use double for calculations
- include <math.h> and compile with -lm
You must not
- change the stage header lines
- add extra output
- change print_vector
Inspect the scaffold first
Before you start coding, spend some time familiarising yourself with a1.c:
- what arrays already exist
- what each function is meant to compute
- what gets printed
How to get going
Start by reading a1.c, then implement the stages in order.
Begin with
read_input(...)
void read_input(int *n, int *d, int *g, int mask[MAX_TOKENS],
double prompt[MAX_TOKENS][MAX_D],
double gen[MAX_GEN][MAX_D],
double wq[MAX_D][MAX_D], double wk[MAX_D][MAX_D],
double wv[MAX_D][MAX_D]);
Open in the annotated scaffold ↗
Every later stage depends on having the input stored correctly. Then get the Stage 1 embedding output working, then compute Stage 2 projections, Stage 3 scores, Stage 4 weights, Stage 5 outputs, and finally Stage 6 generation with the KV cache.
After each stage, compile and run test0 before moving on.
If you need more intuition while coding, use the linked explainers at the point where they become relevant: one-hot encoding for Stage 1, tokens and embeddings for the input story, projections for Stage 2, dot products and masking for Stage 3, softmax for Stage 4, attention output for Stage 5, and KV cache for Stage 6.
If reading a1.c itself is slowing you down, use the Annotated scaffold. The function links in the stage sections below jump straight to the matching scaffold section there.
Stage descriptions
The most useful way to read this section is as one algorithm unfolding in order. Stage 0 loads the token list and numeric snapshot. Stage 1 builds and prints one-hot embeddings. Stage 2 turns the provided prompt embeddings into Q, K, and V. Stages 3 to 5 perform attention over the prompt. Stage 6 then applies the same ideas during generation using the cache.
Each stage names the scaffold function you should implement first when there is one. Stage 1 is the small embedding-construction stage that now sits in front of the attention pipeline.
Stage 0: Reading the input
If you want to see what a real input file looks like before reading the format rules, open What an input file looks like.
Implement this scaffold function first:
read_input(...)
void read_input(int *n, int *d, int *g, int *text_len,
char embedding_table[MAX_TEXT_SIZE][MAX_TOKEN_LENGTH + 1],
int mask[MAX_TOKENS],
double prompt[MAX_TOKENS][MAX_D],
double gen[MAX_GEN][MAX_D],
double wq[MAX_D][MAX_D], double wk[MAX_D][MAX_D],
double wv[MAX_D][MAX_D]);
Open in the annotated scaffold ↗
Every later stage depends on storing the input snapshot correctly.
Read from stdin in this order:
- n d g text_len: the three attention sizes, followed by the number of tokens in the Stage 1 text.
- text_len whitespace-separated tokens. This is the Stage 1 text that you turn into the one-hot embedding table.
- n integers for mask. These say which prompt positions are real and which ones are padding.
- n prompt embeddings of length d. These rows are the numeric prompt matrix used by the later attention stages.
- g generated embeddings of length d. These rows are used later in the generation stage.
- d lines for Wq, the query projection matrix.
- d lines for Wk, the key projection matrix.
- d lines for Wv, the value projection matrix.
Bounds
| Symbol | Meaning | Range |
|---|---|---|
| n | number of prompt tokens | 1 <= n <= 64 |
| d | embedding dimension | 1 <= d <= 64 |
| g | number of generated tokens | 0 <= g <= 32 |
| text_len | number of tokens in the Stage 1 text | 1 <= text_len <= 64 |
Use scanf. Do not open files in your C program.
Whitespace-separated input means spaces and newlines are equivalent for scanf.
Stage 1: Creating embeddings
This stage is the small representation exercise before the attention maths begins. You take the input text, treat each whitespace-separated piece as a token, build the unique-token list, sort it lexicographically, and then print one one-hot vector per token in that sorted list.
The scaffold function for this stage is
create_embeddings(...)
int compare2Dchar(const void *a, const void *b);
void create_embeddings(char embedding_table[MAX_TEXT_SIZE][MAX_TOKEN_LENGTH + 1],
int text_len);
Open in the annotated scaffold ↗
If you use qsort(...), the comparison helper sits right beside it in the scaffold.
The core job in this stage is:
- sort the token array stored by Stage 0 lexicographically
- remove duplicate tokens from the sorted array
- give each token a vector that is all zero except for one 1, based on its index in the sorted array
- print the token together with its one-hot embedding as per the Output Format
That is the one-hot encoding idea. It is a simple bridge from categories to numbers, and the one-hot explainer shows it visually.
If this stage makes sense in principle but you have not really worked with C strings yet, use Strings in C. It is meant as a short crash course on the string operations this stage relies on.
This stage is the simple practice version of token representation.
Important Stage 1 Difference
The Stage 1 vector length is the number of unique tokens in the sorted list, not the prompt embedding dimension d used in the later attention stages.
You can use either qsort(...) or your own sorting code for the lexicographic ordering step.
How this stage relates to the later attention stages
The later attention stages do not depend on these Stage 1 vectors as input. This stage is there to make the idea of “text becomes vectors” concrete before the rest of the assignment switches to the provided numeric embedding matrices.
The important representation rule is:
- the sorted unique-token array is the embedding table
- the token’s position in that sorted array decides where the 1 goes
- every other position in that embedding is 0
That means the embedding dimension for this stage is “number of unique tokens in the sorted list”.
Worked example: a toy embedding table
If the sorted token list is
algorithms, best, foundations, of, subject, the, to, welcome
the one-hot vectors are:
"algorithms" -> (1 0 0 0 0 0 0 0)
"best" -> (0 1 0 0 0 0 0 0)
"foundations" -> (0 0 1 0 0 0 0 0)
"of" -> (0 0 0 1 0 0 0 0)
"subject" -> (0 0 0 0 1 0 0 0)
"the" -> (0 0 0 0 0 1 0 0)
"to" -> (0 0 0 0 0 0 1 0)
"welcome" -> (0 0 0 0 0 0 0 1)
This is not how modern trained models usually build embeddings, but it is a clean way to show how text can become vectors before attention begins.
For a smaller three-token case, the printed output would look like:
Stage 1: Create Embeddings
"algorithms" -> (1 0 0)
"subject" -> (0 1 0)
"the" -> (0 0 1)
Stage 2: Projections
In an LLM, once the prompt text has been tokenised, it is represented as a sequence of embedding vectors.
In this stage, we provide you with a sequence of embedding vectors corresponding to a tokenised prompt rather than relying on the previous stage, to keep the two parts separate. If we did use the previous stage, then for a given prompt you would look up the embedding for each token and pass an array of those embeddings to this stage. Notably, the embedding for every token is the same regardless of where it appears in the prompt, even though the position of words relative to others alters the meaning. This is where attention comes in!
Once the input is stored, our first computation to implement attention is to build the prompt query, key, and value matrices .
Before reading on, make sure you are familiar with the notation.
The input prompt embeddings are the rows of a matrix $\mathbf{X}$. To get queries, keys, and values, you project the same prompt matrix three times with three different matrices:
\[\mathbf{Q} = \mathbf{X}\mathbf{W_q} \qquad \mathbf{K} = \mathbf{X}\mathbf{W_k} \qquad \mathbf{V} = \mathbf{X}\mathbf{W_v}\]
What Q, K, and V are, and why there are three of them.
How do these matrices link to the C Scaffold?
- $\mathbf{X}$ is the prompt embedding matrix: prompt. Each row of the matrix corresponds to one prompt token embedding vector.
- $\mathbf{W_q}, \mathbf{W_k}, \mathbf{W_v}$ are the weight matrices. In C these are Wq, Wk, Wv stored as 2D arrays.
- $\mathbf{Q},\mathbf{K}, \mathbf{V}$ are the prompt projection matrices. In C these are q_matrix, k_matrix, v_matrix stored as 2D arrays.
Implement the scaffold function for this stage:
compute_projection(...)
void compute_projection(int count, int d, double src[MAX_TOKENS][MAX_D],
double w[MAX_D][MAX_D],
double dest[MAX_TOKENS][MAX_D]);
Open in the annotated scaffold ↗
If $\mathbf{X}$ is the input matrix, $\mathbf{W}$ is one of the projection matrices, and $\mathbf{Y}$ is the projected output matrix, then:
\[Y_{i,j} = \sum_{u=0}^{d-1} X_{i,u}\,W_{u,j} \qquad \mathbf{Y} = \begin{bmatrix} Y_{0,0} & \dots & Y_{0,d-1} \\ \vdots & \ddots & \vdots \\ Y_{n-1,0} & \dots & Y_{n-1,d-1} \end{bmatrix}\]Be careful to reset the sum to 0 for each output component.
This function is called three times to compute prompt $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$
Worked example: one token embedding vector
For a single token vector $\vec{\mathbf{x}}$ of length d, the projected vector $\vec{\mathbf{y}}$ is defined component by component by:
\[y_j = \sum_{u=0}^{d-1} x_u\,W_{u,j}\]
For an example with d = 2, we can equivalently write our component wise sum as a matrix multiplication.
We will define the matrix $\mathbf{X} = \begin{bmatrix}\vec{\mathbf{x}}\end{bmatrix}$ so we can write things in terms of matrix multiplication:
Then
\[\begin{align*} \mathbf{Y} &= \begin{bmatrix} x_{0} & x_{1} \end{bmatrix} \begin{bmatrix} W_{0,0} & W_{0,1}\\ W_{1,0} & W_{1,1} \end{bmatrix} = \begin{bmatrix} \begin{bmatrix} x_{0} & x_{1} \end{bmatrix} \begin{bmatrix} W_{0,0} \\ W_{1,0} \end{bmatrix} & \begin{bmatrix} x_0 & x_1 \end{bmatrix} \begin{bmatrix} W_{0,1} \\ W_{1,1} \end{bmatrix} \end{bmatrix} \\ &= \begin{bmatrix} \begin{pmatrix} x_{0} & x_{1} \end{pmatrix} \bullet \begin{pmatrix} W_{0,0} & W_{1,0} \end{pmatrix} & \begin{pmatrix} x_0 & x_1 \end{pmatrix} \bullet \begin{pmatrix} W_{0,1} & W_{1,1} \end{pmatrix} \end{bmatrix} \\ &= \begin{bmatrix} (x_0 \cdot W_{0,0} + x_1 \cdot W_{1,0}) & (x_0 \cdot W_{0,1} + x_1 \cdot W_{1,1}) \end{bmatrix} \end{align*}\]This is the same loop pattern your code uses for every row of the prompt matrix.
Worked example: two token embedding vectors
In our previous example we treated one input vector as the row of a matrix, but we can extend this to have multiple input vectors, each as a different row, and our matrix multiplication is still defined; the resulting matrix will just have more rows in it!
To get the query, key and value token matrices, we will do this projection operation three times with three different weight matrices:
\[\mathbf{Q} = \mathbf{X}\mathbf{W_q},\quad \mathbf{K} = \mathbf{X}\mathbf{W_k},\quad \mathbf{V} =\mathbf{X}\mathbf{W_v}.\]We can now consider a whole matrix of n prompt embedding vectors (each is a row of $\mathbf{X}$) of dimension d = 2 at once
In this example we have n = 2 token embedding vectors stored in our token embedding matrix: $\begin{pmatrix} X_{0,0} & X_{0,1}\end{pmatrix}$ and $\begin{pmatrix} X_{1,0} & X_{1,1}\end{pmatrix}$.
We can see that in the output matrix, each row corresponds to the projected vector for the given token embedding vector in that row of the input.
The computation in each row is independent of the other rows, so we can calculate the result for each row separately, but it is notationally compact to store the vectors as rows of a matrix, and convenient when it comes to programming as well!
Stage 3: Attention scores (prompt)
With $\mathbf{Q}$ and $\mathbf{K}$ available, the next step is to measure how strongly each key vector corresponding to prompt token $j$ should attend to each query vector corresponding to prompt token $i$. For this task you will be computing the $\mathbf{score}$ matrix, using the projections from the previous stage.
You should implement the scaffold function:
compute_attention_scores_or_weights_prompt(...)
void compute_attention_scores_or_weights_prompt(
int n, int d, const int mask[MAX_TOKENS],
double q[MAX_TOKENS][MAX_D], double k[MAX_TOKENS][MAX_D],
double scores_or_weights[MAX_TOKENS][MAX_TOKENS],
int apply_softmax);
Open in the annotated scaffold ↗
For each query vector corresponding to prompt token $i$ ($\vec{\mathbf{Q}_i}$) and key vector corresponding to prompt token $j$ ($\vec{\mathbf{K}_j}$), we compute an attention score. The score is a scaled dot product:
\[\text{score}_{i,j} = \frac{\vec{\mathbf{Q}_i} \bullet \vec{\mathbf{K}_j}}{\sqrt{d}} = \frac{\sum_{u=0}^{d-1} Q_{i,u}K_{j,u}}{\sqrt{d}} \qquad \textbf{score} = \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}} = \begin{bmatrix} \text{score}_{0,0} & \dots & \text{score}_{0,n-1} \\ \vdots & \ddots & \vdots \\ \text{score}_{n-1,0} & \dots & \text{score}_{n-1,n-1} \end{bmatrix}\]When computing scores, you should apply two different masks:
- causal mask: if
j > i, mask that position by setting the score to $-\infty$ - padding mask:
- if
mask[i] == 0, the whole score row becomes $-\infty$. - if
mask[j] == 0, mask that position by setting the score to $-\infty$
- if
In C, the score of $-\infty$ will be stored as -INFINITY using the INFINITY constant defined in <math.h>.
Why is the masked score $-\infty$?
We set the score to $-\infty$ because score is the dot product between two vectors, and a small dot product means the vectors are very different. By setting the score to $-\infty$, when we compute the weight in the next stage, these masked values will have weight 0.0.
Why do we use the same mask for both dimensions?
We use the same padding mask for both dimensions because the row index and the column index both refer to the same set of tokens, just through different projected vectors!
What is the score formula really doing?
If you want the slower derivation, small examples, and a visual story for why the masks are there, use the dot-product explainer and masking explainer.
Worked example: one attention score
Let's consider an example with n = 3 and d = 2:
Here each row $\vec{\mathbf{Q}_i}$ and $\vec{\mathbf{K}_j}$ of $\mathbf{Q}$ and $\mathbf{K}$ correspond to a vector projected from a token embedding vector.
Suppose that the original token embedding vectors for rows 0, 1, 2 in $\mathbf{X}$ from the previous part
represented the tokens Foundations, of and Algorithms respectively; this is what the rows of $\mathbf{Q}$ and $\mathbf{K}$ also represent.
For example, if we were calculating $\text{score}_{1,0}$, this is the attention score of query vector 1 (of) with key vector 0 (Foundations), or more
precisely how much key vector 0 attends to query vector 1.
As in the previous stage, we can also compute the score matrix as a matrix multiplication (for notational convenience). Here we need to transpose the second matrix so that the matrix multiplication is defined: the number of columns of the first matrix must equal the number of rows of the second matrix. Mathematically this is equivalent, and it is how research papers represent the operation, but you may find one way or the other easier to understand when you come to implement it in code.
\[\textbf{score} = \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}} = \frac{1}{\sqrt{d}}\begin{bmatrix} Q_{0,0} & Q_{0,1} \\ Q_{1,0} & Q_{1,1} \\ Q_{2,0} & Q_{2,1} \end{bmatrix} \begin{bmatrix} K_{0,0} & K_{1,0} & K_{2,0} \\ K_{0,1} & K_{1,1} & K_{2,1} \end{bmatrix}\]
Stage 4: Attention weights (prompt)
The Stage 3 scores are not the final answer. They must be normalised into weights.
For this stage, you should modify the scaffold function from stage 3
compute_attention_scores_or_weights_prompt(...)
void compute_attention_scores_or_weights_prompt(
int n, int d, const int mask[MAX_TOKENS],
double q[MAX_TOKENS][MAX_D], double k[MAX_TOKENS][MAX_D],
double scores_or_weights[MAX_TOKENS][MAX_TOKENS],
int apply_softmax);
Open in the annotated scaffold ↗
so that it can apply softmax.
Hence, when apply_softmax is 1 (this stage), you should take the scores and apply stable softmax for each row in the output matrix.
If a row has no unmasked columns, output 0.0 for each weight in that row; do not divide by zero!
Worked example: scores become weights
If one row of unmasked scores is
(2, 0, -1)
then softmax turns that row into weights roughly like
(0.84, 0.11, 0.05)
The largest score becomes the largest weight, but every unmasked position still gets a non-negative share. Stable softmax produces the same final weights as ordinary softmax; it just subtracts the row maximum first so the exponentials stay numerically safe.
Stage 5: Attention output (prompt)
Once you have the attention weights, you can compute the actual prompt outputs. Implement this scaffold function:
compute_attention_output_prompt(...)
void compute_attention_output_prompt(
int n, int d,
double weights[MAX_TOKENS][MAX_TOKENS],
double v[MAX_TOKENS][MAX_D], double out[MAX_TOKENS][MAX_D]);
Open in the annotated scaffold ↗
Compute the attention output vectors as a matrix $\mathbf{output}$ for each prompt token using:
\[\text{output}_{i,j} = \overrightarrow{\textbf{weight}}_i \bullet \overrightarrow{(\mathbf{V}^T)}_j = \sum_{u=0}^{n-1} \text{weight}_{i,u}\,V_{u,j} \qquad \textbf{output} = (\textbf{weight})\textbf{V}= \begin{bmatrix} \text{output}_{0,0} & \dots & \text{output}_{0,d-1} \\ \vdots & \ddots & \vdots \\ \text{output}_{n-1,0} & \dots & \text{output}_{n-1,d-1} \end{bmatrix}\]
Note that $\overrightarrow{(\mathbf{V}^T)}_j$ just means: take the vector corresponding to column $j$ of matrix $\mathbf{V}$. Writing $\overrightarrow{\mathbf{V}}_j$ would refer to row $j$. Doing the transpose first turns the columns into rows, so we can extract row $j$ to get the vector corresponding to the column.
This is a weighted sum of value vectors using the attention weights you just computed.
Worked example: Three tokens embedded in two dimensions
Output example with $n = 3$ and $d = 2$
\[\textbf{weight} = \begin{bmatrix} \text{weight}_{0,0} & \text{weight}_{0,1} & \text{weight}_{0,2} \\ \text{weight}_{1,0} & \text{weight}_{1,1} & \text{weight}_{1,2} \\ \text{weight}_{2,0} & \text{weight}_{2,1} & \text{weight}_{2,2} \end{bmatrix} \qquad \mathbf{V} = \begin{bmatrix} V_{0,0} & V_{0,1} \\ V_{1,0} & V_{1,1} \\ V_{2,0} & V_{2,1} \end{bmatrix}\]Using the same example as before, suppose that the original token embedding vector for row 0, 1, 2 in $\mathbf{X}$
represented the tokens Foundations, of and Algorithms.
The $\mathbf{weight}$ matrix would be 3 rows by 3 columns, and the value $\mathbf{V}$ matrix would be 3 rows by 2 columns.
The output vector for token 2, which incorporates the context of the prompt tokens preceding it directly in the embedding would be:
\[\begin{align*} \overrightarrow{\mathbf{output}}_2 &= \begin{pmatrix} \begin{bmatrix} \text{weight}_{2,0} &&\text{weight}_{2,1} &&\text{weight}_{2,2} \end{bmatrix} \begin{bmatrix} V_{0,0} \\ V_{1,0} \\ V_{2,0} \end{bmatrix} & \begin{bmatrix} \text{weight}_{2,0} &&\text{weight}_{2,1} &&\text{weight}_{2,2} \end{bmatrix} \begin{bmatrix} V_{0,1} \\ V_{1,1} \\ V_{2,1} \end{bmatrix} \end{pmatrix}\\ &= \begin{pmatrix} (\text{weight}_{2,0} V_{0,0} + \text{weight}_{2,1} V_{1,0} + \text{weight}_{2,2} V_{2,0}) & (\text{weight}_{2,0} V_{0,1} + \text{weight}_{2,1} V_{1,1} + \text{weight}_{2,2} V_{2,1}) \end{pmatrix} \end{align*}\]If one weight is near 1, the output will look very similar to that one value vector, if weights are spread out, the output is a blend.
Much like the previous parts, we can also write this computation very compactly using a matrix multiplication, which will give us a matrix where each row corresponds to the output vector for a given token
\[\mathbf{output} = (\mathbf{weight})\mathbf{V} = \begin{bmatrix} \text{weight}_{0,0} & \text{weight}_{0,1} & \text{weight}_{0,2} \\ \text{weight}_{1,0} & \text{weight}_{1,1} & \text{weight}_{1,2} \\ \text{weight}_{2,0} & \text{weight}_{2,1} & \text{weight}_{2,2} \end{bmatrix} \begin{bmatrix} V_{0,0} & V_{0,1} \\ V_{1,0} & V_{1,1} \\ V_{2,0} & V_{2,1} \end{bmatrix}\]How should I interpret the output vector?
This is the point of attention: one prompt token’s output vector becomes a blend of value vectors based on which key vectors attended to a given query vector the most.
If one weight dominates, the output will look very similar to one value vector. If several weights matter, the output becomes a mixture of several vectors. The attention explainer shows this more visually.
Stage 6: Generated outputs with a KV cache
In this task you will use the prompt tokens and the generated tokens (both provided as part of the input), together with a KV cache, to compute attention outputs. You will not be generating the tokens yourself, as they are provided.
How does this link to a full LLM?
In a full LLM, we would first compute attention for all the prompt tokens, then pass the result to the other components of the model, which use it to generate the next token. Once we have this new generated token, we again compute attention and pass it along so that another token can be generated, repeating as many times as desired. So while your code receives all the generated tokens up front as input, a real system would produce them one at a time.
The final stage applies the same attention logic during generation, but now the program reuses past keys and values through the cache.
Implement this scaffold function:
compute_generation_with_cache(...)
long compute_generation_with_cache(
int n, int d, int g, int t, const int mask[MAX_TOKENS],
double gen[MAX_GEN][MAX_D], double wq[MAX_D][MAX_D],
double wk[MAX_D][MAX_D], double wv[MAX_D][MAX_D],
double k_cache[MAX_TOKENS + MAX_GEN][MAX_D],
double v_cache[MAX_TOKENS + MAX_GEN][MAX_D],
double output[MAX_D]);
Open in the annotated scaffold ↗.
For each generated token $t$ which follows $n$ prompt tokens, you compute:
- $\vec{\mathbf{Q}}_{n+t}$: the query vector for the token numbered $n+t$.
- $\vec{\mathbf{K}}_{n+t}$ and $\vec{\mathbf{V}}_{n+t}$: the key and value vectors for the token numbered $n+t$, which should be appended to the KV cache.
- score the new query $\vec{\mathbf{Q}}_{n+t}$ against all cached keys $\vec{\mathbf{K}}_{i}$ for $0 \le i \le n+t$, remembering to mask any prompt cache entries ($0 \leq i < n$) where
mask[i] == 0 (same as one row of stage 3)
- apply stable softmax over the cached scores to turn them into weights (same as one row of stage 4)
- compute the output as a weighted sum of cached values (same as one row of stage 5)
There is no separate future-token causal mask in Stage 6 because the cache only contains past items plus the current generated token. The current generated token is included, so it can attend to itself.
Dot Products Computed
You must also report the exact number of dot products computed for each generated token.
In this stage, define “a dot product computed” as any dot product between two vectors. This includes every row-by-column product inside a matrix multiplication, whether it involves a weight matrix or the embedding or projection vectors of an unmasked prompt token or a generated token.
That means:
- For each generated token, you should compute each unmasked score exactly once (store it in an array, reuse it for both the max and the denominator).
- The count you report for a generated token should include every dot product computed for that token: the Q, K, and V projections, the scores against unmasked cached keys, and the output vector (see the worked example).
- If a prompt cache entry is masked (mask[i] == 0), no score is computed for it, so it contributes 0 to the count.
Worked example: Stage 6 dot-product counts
In the visible test0.txt case, the prompt mask is 1 1 0, d = 2, and there are two generated tokens (g = 2).
For the first generated token:
- 3d = 6 dot products compute the new query, key, and value vectors
- 3 dot products score against unmasked cached keys
- d = 2 dot products compute the output vector
So the printed count is 11.
For the second generated token:
- 6 dot products compute the new query, key, and value vectors
- 4 dot products score against unmasked cached keys
- 2 dot products compute the output vector
So the printed count is 12.
Why does the cache help here?
The cache exists so you do not keep recomputing old keys and values during generation. The KV-cache explainer shows the cache growth step by step and explains how the per-generated-token dot-product count should be interpreted.
Want A Concrete End-To-End Example?
Now that you have seen the whole pipeline, the Toy demo is a good way to make it concrete.
It lets you generate a small numeric input from real text, run your program on it, and then read back a token-level explanation of the resulting attention weights. It is optional, but it may help you connect the stages to something more human-readable.
The structured version also uses a small one-hot encoding region for entity identity, which is why some of the toy vectors look especially interpretable.
Testing, Marking, and Output Rules
This section collects the parts of the spec that are directly useful while checking your work: testing, common failure modes, marking, and the exact output contract.
Marking and feedback
This assignment has:
- an autograded testcase component
- human-marked components for code style and code structure/approach
- a short written complexity analysis
The stage breakdown in the handout is:
| Stage | Testcases | Style | Structure / Approach | Complexity Analysis | Total |
|---|---|---|---|---|---|
| 1 | 0.5 | 0.5 | 0.5 | 0.5 | 2 |
| 2 | 1 | 0.5 | 1 | 0.5 | 3 |
| 3 | 2 | 0.5 | 1 | 0.5 | 4 |
| 4 | 1 | 0.5 | 1 | 0.5 | 3 |
| 5 | 2 | 0.5 | 1 | 0.5 | 4 |
| 6 | 2 | 0.5 | 1 | 0.5 | 4 |
Code style expectations
- define constants before use instead of relying on magic numbers or strings
- keep `#define` names uppercase
- include function prototypes for all functions
- use descriptive names for variables and functions
- comment on the approach behind sections of code without restating what is clear from the syntax
- keep bracket placement, indentation, and spacing consistent
- use blank lines to separate sections and improve readability
- keep lines of code below 100 characters
Code structure expectations
- avoid duplicated code segments
- avoid unnecessary global variables by passing data into functions
- use helper functions where that improves structure
- keep functions reasonably short and not overly complex
- choose a straightforward algorithmic approach
- avoid unnecessary duplication of data
- avoid deeply nested control flow where possible
Complexity Analysis
Fill in the Complexity Analysis block near the top of a1.c.
In plain English, give the Big-O time complexity in terms of n, d, and g for:
- Stage 1 create embeddings
- Stage 2 projections
- Stage 3 prompt attention scores
- Stage 4 prompt attention weights
- Stage 5 prompt attention outputs
- Stage 6 generation with cache
It is enough to count dominant loop work and ignore constant factors. You do not need to analyse memory usage. The important part is not just stating the Big-O, but briefly explaining how you got it. Analyse the code you actually wrote, not a more optimised version you could have written instead.
Testing your code
Use these sample tests to check your own a1.c as you build it.
Inside the provided bundle you have:
- test_data/test0.txt with expected output test_data/test0-out.txt
- test_data/test1.txt with expected output test_data/test1-out.txt
- test_data/test2.txt with expected output test_data/test2-out.txt
- test_data/test3.txt with expected output test_data/test3-out.txt for a larger end-to-end practice case
Compile:
clang -Wall -Wextra -Wno-newline-eof -Wno-unused-parameter -pedantic -std=c17 -lm -o a1 a1.c
Run a test:
./a1 < test_data/test0.txt
Compare output:
./a1 < test_data/test0.txt > myout.txt
diff -u test_data/test0-out.txt myout.txt
Common pitfalls
- Forgetting the causal mask in Stages 3 to 5. For row i, positions j > i must be masked.
- Forgetting the padding mask in Stages 3 to 5.
- Forgetting that if mask[i] == 0, the whole position i should contribute a zero vector in the later stages.
- Not using stable softmax.
- Dividing by zero when all columns are masked.
- In Stage 6, counting dot products incorrectly.
Exact output format
You must print exactly the following stage headers and data, in order:
- Stage 1: one-hot embeddings for the sorted unique tokens
- Stage 2: prompt Q, K, and V projections, each an n x d table
- Stage 3: attention scores for the prompt, an n x n table
- Stage 4: attention weights for the prompt, an n x n table
- Stage 5: attention outputs for the prompt, an n x d table
- Stage 6: one generated-output line and one dot-product-count line for each generated token
The exact stage header lines are:
Stage 1: Create Embeddings
Stage 2: Projections
Stage 3: Attention Scores (Prompt)
Stage 4: Attention Weights (Prompt)
Stage 5: Attention Output (Prompt)
Stage 6: Generated Outputs
The Stage 2 block format is:
Q Projection:
<n lines of d numbers>
K Projection:
<n lines of d numbers>
V Projection:
<n lines of d numbers>
The Stage 6 line format is:
Gen 0: <d numbers>
Dot products computed: <an integer>
Gen 1: <d numbers>
Dot products computed: <an integer>
...
Numbers must be printed with exactly three digits after the decimal point, separated by a single space. There must be no extra spaces at line ends and no extra debug output anywhere.
The scaffold’s printing functions also clamp extremely small values to 0.000. That is part of the expected output behaviour, so it is safest to leave the provided printing code alone.