Please read the general instructions on this page first, and then check the individual tasks for more details. For each task you can download a zip file that contains the code templates you can use for development.
| Task | Attempts | Points | Max | Rating | Rec. | Deadline for full points |
|---|---|---|---|---|---|---|
| LLM9a: CPU optimization. Implement efficient processing of an LLM prompt using all available single-precision processing power of the CPU. | – | – | 5 + 2 | ★★ | R | 2024-08-28 at 23:59:59 |
In this exercise, you will optimize an implementation of a Large Language Model, in this particular case models of the LLaMA family. A short description of the connection between parallel computation and the success of transformer-based language models such as LLaMA is given below.
Generally, the work of an LLM can be split into two phases: processing the user prompt and generating new tokens based on the prompt. In this exercise, we concentrate on the former. Even with prompt processing alone, we can build interesting applications, such as predicting how surprising a given piece of text is; see the demo section below for more information.
For this exercise, we have prepared a baseline implementation of the LLaMA 2 model, based on Andrej Karpathy's llama2.c. Your task is to identify the bottlenecks, and apply the techniques you have learned in this course to speed up prompt processing.
You need to implement the following function:
void llm(LLamaConfig config, LLamaParameters params, const std::vector<int>& tokens, std::vector<float>& logits);
The parameters are as follows:

- `config`: A struct that describes the configuration of the LLaMA model, in particular the shapes of the weight matrices. See `llm.h` for the definition of this struct.
- `params`: This struct contains the actual weights of the model.
- `tokens`: The sequence of tokens that serve as the prompt for the model.
- `logits`: This array needs to be filled in with the predicted scores (logits).

You can assume that the configuration always describes a valid network (i.e., all sizes > 0). You may assume that the network dimensions (`config.dim` and `config.hidden_dim`) will be a multiple of 16.
The exercise comes with a baseline implementation that handles single-threaded, scalar processing of the LLaMA model. For each token, this consists of the following steps:

- Look up the token embedding, which becomes the `dim`-dimensional hidden state of the network.
- For each layer:
  - Normalize the hidden state using rmsnorm with scaling parameters `rms_attention`.
  - Compute the `query`, `key`, and `value` vectors for the attention mechanism using a matrix multiplication with `query_weight_matrix`, `key_weight_matrix`, and `value_weight_matrix`, respectively.
  - Rotate `query` and `key` using rotary positional embeddings.
  - Compare the current `query` with the `key` and `value` at each past token: compute a score for each past token using an inner product, and convert these into scores that sum to one using the softmax function.
  - Sum up the `value` vectors using the scores as weights.
  - Multiply with `out_weight_matrix` and add the result to the hidden state.
  - Normalize the hidden state using rmsnorm.
  - Expand the hidden state to `hidden_dim` by multiplying with the weight matrices `feed_forward_w1` and `feed_forward_w3`. Combine the two vectors using the SwiGLU pointwise operation.
  - Project back to `dim` using weight matrix `feed_forward_w2` and add the result to the hidden state.
- Normalize the hidden state using rmsnorm with scaling parameters `RmsFinal`.
- Multiply with the `num_tokens × dim` output weight matrix `TokenOutputMatrix` and convert the result into probabilities using softmax.
We provide implementations of these utilities in the llm.h
header. You are free to use these functions,
or provide your own implementation if you think this will improve performance.
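
To make the structure of this work concrete, here is a minimal scalar sketch of the two kinds of kernels that dominate the steps above, a matrix-vector product and rmsnorm. The names, signatures, and constants below are illustrative assumptions and do not necessarily match the helpers in `llm.h`.

```cpp
#include <cmath>
#include <vector>

// Illustrative scalar matrix-vector product: out[i] = sum_j w[i*n + j] * x[j]
// for a row-major (d x n) weight matrix. Almost all floating-point work in
// the steps above has this form.
void matmul(std::vector<float>& out, const std::vector<float>& x,
            const std::vector<float>& w, int d, int n) {
    for (int i = 0; i < d; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) {
            acc += w[i * n + j] * x[j];
        }
        out[i] = acc;
    }
}

// Illustrative rmsnorm: scale the vector by the inverse of its root mean
// square, then apply learned per-channel scaling parameters.
void rmsnorm(std::vector<float>& out, const std::vector<float>& x,
             const std::vector<float>& scale, int n) {
    float ss = 0.0f;
    for (int j = 0; j < n; ++j) {
        ss += x[j] * x[j];
    }
    // The small epsilon is an assumed constant for numerical stability.
    float inv = 1.0f / std::sqrt(ss / n + 1e-5f);
    for (int j = 0; j < n; ++j) {
        out[j] = scale[j] * x[j] * inv;
    }
}
```

Each output element of the matrix-vector product is an independent inner product, which is what makes this step a natural target for parallelization.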
Our tests contain setups in which specific submodules of the Transformer are disabled. These might help in tracking down implementation errors.
Familiarize yourself with the provided reference implementation. Try to identify where most of the time is spent, and which parts of the computation are easily parallelizable. Just adding `#pragma omp parallel for` to the right places will give you at least one point for this exercise.
Further improvements are possible if you apply ILP and vectorization, but to achieve full points in this exercise, you need to come up with a more efficient parallelization strategy.
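
For example, the row loop of an illustrative matrix-vector product such as the one sketched above consists of independent inner products, so it parallelizes directly. This is the kind of change the hint about `#pragma omp parallel for` refers to; the sketch below uses the same assumed signature as before and is not the exact code of the template.

```cpp
#include <vector>

// Same illustrative matrix-vector product as above, with the independent
// rows distributed across threads. Vectorizing the inner loop and reusing
// loaded weights across several rows and/or tokens are natural next steps.
void matmul_parallel(std::vector<float>& out, const std::vector<float>& x,
                     const std::vector<float>& w, int d, int n) {
    #pragma omp parallel for
    for (int i = 0; i < d; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) {
            acc += w[i * n + j] * x[j];
        }
        out[i] = acc;
    }
}
```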
While generally, their ability to generate human-like text is what makes LLMs impressive, there are also interesting things that can be done purely by prompt processing. In particular, since we can compare the predicted probability distributions with the actual tokens in the text, we can find out which parts of the prompt were most surprising to the network.
In the exercise template, we provide a demo tool, invoked with `./grading demo`, that visualizes how surprising each token in the given text is. For example, running `./grading demo "Once upon a time, there was a test story."` produces the following output:
Once upon a time, there was a test story.
Note that this involves downloading a pre-trained network from the internet.
Modern machine learning can achieve amazing results when given a large amount of labeled training data. However, most existing data is unlabeled, and still contains valuable information that can be learned within its structure. This is what an autoregressive language model is trying to capture: Given a sequence of words, predict which word is going to be the next one. More generally, the text is split into tokens, which could be words, syllables, or even single letters. All that is needed for training a task like this is a large corpus of text, which can be gathered by, e.g., scraping Wikipedia.
Possibly the simplest way to model this task is to apply a Markov assumption: The next element in the sequence only depends on the current element. In that case, the language model is just a huge probability table — given this current token, these are the probabilities for the next token. While this is extremely efficient, it is also quite limited in its modeling capacity. It can be used, for example, to generate fake words that capture the characteristics of an actual language, but it cannot generate coherent sentences.
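
As a toy illustration of such a probability table, a first-order (bigram) model over a small vocabulary could be sketched as follows; this is purely hypothetical and not part of the exercise code.

```cpp
#include <random>
#include <vector>

// Hypothetical first-order Markov ("bigram") language model:
// table[cur][next] stores P(next token | current token).
struct BigramModel {
    std::vector<std::vector<float>> table;  // vocab_size x vocab_size, rows sum to one

    // Sample the next token given only the current one.
    int sample_next(int cur, std::mt19937& rng) const {
        std::discrete_distribution<int> dist(table[cur].begin(), table[cur].end());
        return dist(rng);
    }
};
```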
To increase expressibility, one could lengthen the context window of the Markov model. Instead of considering just the
last token, consider the last n
tokens. Unfortunately, this makes the memory consumption (and required training data)
grow exponentially with the context size. An alternative is to compress the context into a state
vector,
and make predictions based on that state. Conceptually:

    for token in sequence do
        state = update_state(state, token)
        prediction = output_prediction(state)
To further improve expressibility, in typical deep-learning fashion, one can then stack multiple of these processes together in layers, such that the state `state[l, p]` of layer `l` at location `p` in the sequence is a function of the state at the previous time step and of the previous layer at the same time step: `state[l, p] = update_state(state[l, p-1], state[l-1, p])`.
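
A minimal sketch of this dependency structure, with a placeholder update rule; the scalar state and the particular mixing formula are illustrative assumptions, as real models use learned, high-dimensional updates.

```cpp
#include <cstddef>
#include <vector>

// Placeholder update rule; a real model applies learned weights and a
// non-linearity to high-dimensional state vectors.
float update_state(float prev_same_layer, float below_same_position) {
    return 0.5f * prev_same_layer + 0.5f * below_same_position;
}

// state[l][p] = update_state(state[l][p-1], state[l-1][p]).
// Layer 0 reads the (already embedded) input tokens. Assumes num_layers >= 1.
std::vector<float> run_stacked_rnn(const std::vector<float>& inputs, int num_layers) {
    std::size_t n = inputs.size();
    std::vector<std::vector<float>> state(num_layers, std::vector<float>(n, 0.0f));
    for (int l = 0; l < num_layers; ++l) {
        float prev = 0.0f;  // state[l][-1], the initial state for this layer
        for (std::size_t p = 0; p < n; ++p) {
            float below = (l == 0) ? inputs[p] : state[l - 1][p];
            state[l][p] = update_state(prev, below);
            prev = state[l][p];
        }
    }
    return state[num_layers - 1];  // top-layer states, fed to output_prediction
}
```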
There are several challenges when trying to use this approach for longer sequences; in particular, the function update_state
needs to be carefully chosen to allow signals to propagate over long time distances. If you are interested in learning more, Andrej Karpathy has a nice summary of the success of recurrent neural networks.
Unfortunately, one challenge remains: How to train these networks with the vast amounts of data available? The structure of the network update means that there exist dependencies both towards the preceding layer, and towards the previous state within the same layer. This is in contrast to the more recent Transformer architecture, in which the next state is calculated based on all past states of only the previous layer, removing one axis of dependencies. Consequently, while processing different layers has to be done sequentially, all the positions within a layer can be handled in parallel. As a result, we now have large language models trained on trillions of tokens.
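
To contrast with the recurrent sketch above, here is a hedged sketch of the transformer-style dependency structure, with `update_from_all_past` as a hypothetical stand-in for the attention and feed-forward computation. The point is only that positions within a layer have no loop-carried dependency, so the position loop parallelizes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-position work of one transformer layer:
// it may read all positions 0..p of the previous layer, but nothing from
// the layer currently being computed.
float update_from_all_past(const std::vector<float>& prev_layer, std::size_t p) {
    float acc = 0.0f;
    for (std::size_t q = 0; q <= p; ++q) {
        acc += prev_layer[q];
    }
    return acc / static_cast<float>(p + 1);
}

std::vector<float> run_transformer_like(std::vector<float> state, int num_layers) {
    for (int l = 0; l < num_layers; ++l) {
        std::vector<float> next(state.size());
        // Positions within a layer are independent, so this loop parallelizes.
        #pragma omp parallel for
        for (long p = 0; p < static_cast<long>(state.size()); ++p) {
            next[p] = update_from_all_past(state, static_cast<std::size_t>(p));
        }
        state = std::move(next);
    }
    return state;
}
```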
The phenomenon that the success and failure of a machine learning method can be strongly dependent on the capabilities of the current hardware has become known as the Hardware Lottery.