Please read the general instructions on this page first, and then check the individual tasks for more details. For each task you can download a zip file that contains the code templates you can use for development.
| Task | Attempts | Points | Max | Rating | Rec. | Deadline for full points |
|---|---|---|---|---|---|---|
| **I8MM2: CPU baseline**: Implement a simple sequential baseline solution. Do not try to use any form of parallelism yet; try to make it work correctly first. | – | – | 3 + 0 | ★ | R | 2025-10-08 at 23:59:59 |
| **I8MM3: fast CPU**: Implement a fast CPU solution using multithreading. Use instruction-level parallelism, vectorization, as well as register and cache reuse. | – | – | 5 + 0 | ★+ | R | 2025-10-22 at 23:59:59 |
| **I8MM4: GPU baseline**: Implement a GPU baseline solution. | – | – | 3 + 0 | ★ | R | 2025-11-05 at 23:59:59 |
| **I8MM5: fast GPU**: Implement a fast GPU solution using regular (portable) CUDA. | – | – | 3 + 0 | ★★ | R | 2025-11-19 at 23:59:59 |
| **I8MM6: Tensorcore baseline**: Implement a tensorcore baseline. Do not try to use any form of data reuse yet. This is a technique exercise; a valid solution must make use of the tensor cores. | – | – | 4 + 0 | ★★ | R | 2025-12-03 at 23:59:59 |
| **I8MM9a: CPU AVX512-VNNI**: Implement a fast CPU solution using AVX512-VNNI instructions. | – | – | 4 + 2 | ★★ | R | 2025-12-03 at 23:59:59 |
| **I8MM9b: SIMD GPU**: Implement a fast GPU solution using specialized SIMD instructions available on Turing. This is a technique exercise; a valid solution must make use of these instructions. | – | – | 4 + 2 | ★★ | R | 2025-12-03 at 23:59:59 |
| **I8MM9c: fast Tensorcore**: Implement a fast tensorcore solution. | – | – | 7 + 2 | ★★★ | R | 2025-12-03 at 23:59:59 |
You are given an m × k matrix and a k × n matrix consisting of 8-bit integers. Your task is to calculate the m × n product of the two matrices.
You need to implement the following function:
void gemm(int m, int n, int k, const int8_t* A, const int8_t* B, int32_t* C);
Here `A` and `B` are pointers to the input matrices, with `m` rows and `k` columns for `A`, and `k` rows and `n` columns for `B`.
For all `0 <= y < m` and `0 <= x < k`, the element at row `y` and column `x` of matrix `A` is stored in `A[x + y*k]`.
For all `0 <= y < k` and `0 <= x < n`, the element at row `y` and column `x` of matrix `B` is stored in `B[x + y*n]`.
The function has to solve the following task: for all `i` and `j` with `0 <= i < m` and `0 <= j < n`, calculate the inner product between row `i` of `A` and column `j` of `B`, and store the result in `C[j + i*n]`.
The arrays `A`, `B`, and `C` are already allocated by whoever calls this function; you do not need to do any memory management related to these arrays.
For the tasks that are to be solved on the CPU, `A`, `B`, and `C` point to CPU memory; for the tasks that are to be solved on the GPU, they point to device memory.
You should not assume that `C` contains any valid values at the point of call. In particular, it is not guaranteed to be initialized with zeros.
The reduction dimension `k` is guaranteed to be less than 65536, so that all results can be represented as 32-bit signed integers.
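For reference, the indexing conventions above translate directly into a triple loop. A minimal sketch of a sequential CPU baseline (the kind of solution asked for in I8MM2) might look as follows:

```cpp
#include <cstdint>

// Sequential baseline: C[i][j] = sum over l of A[i][l] * B[l][j].
// A is m x k, B is k x n, C is m x n, all stored row-major.
void gemm(int m, int n, int k, const int8_t* A, const int8_t* B, int32_t* C) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = 0;  // 32-bit accumulator; k < 65536 guarantees no overflow
            for (int l = 0; l < k; ++l) {
                acc += int32_t(A[l + i * k]) * int32_t(B[j + l * n]);
            }
            C[j + i * n] = acc;  // C is overwritten, not accumulated into
        }
    }
}
```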
While floating-point and integer matrix multiplication appear very similar, at the micro-architectural level there is one crucial difference: when multiplying two 32-bit floating-point numbers, the result is again a 32-bit floating-point number that can be added to a 32-bit floating-point accumulator. In contrast, the product of two 8-bit integers is a 16-bit integer, and if you want to add up several of these products, the accumulator needs to be a 32-bit integer.
There cannot be a SIMD instruction that takes two vector registers of packed 8-bit integers and accumulates into a third register (like, e.g., `VFMADDPS`); the destination register is simply too small to accumulate all 64 products, assuming 512-bit wide registers. Instead, the hardware implements inner-product-like operations: take pairs (or groups of 4) of 8-bit integers in one operand, multiply each with the corresponding 8-bit integer in the second operand, sum the individual products, and accumulate into the destination operand. This way, the destination can contain fewer, but higher bit-width, integers.
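In scalar code, one 32-bit lane of such an inner-product-like operation (here with groups of 4; this helper is only an illustration, not a real intrinsic) computes the following:

```cpp
#include <cstdint>

// One 32-bit output lane: multiply 4 pairs of 8-bit integers,
// sum the 4 products, and add the sum to a 32-bit accumulator.
int32_t dot4_accumulate(const int8_t a[4], const int8_t b[4], int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        acc += int32_t(a[i]) * int32_t(b[i]);  // each product fits in 16 bits
    }
    return acc;
}
```

A 512-bit SIMD instruction performs 16 such lane updates at once.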
In generic AVX-512, there is one instruction for computing an 8-bit inner product over pairs of numbers with 16-bit accumulation. This is not particularly useful, because accumulation needs to happen in 32 bits to prevent overflows. However, a similar instruction exists for an inner product over pairs of 16-bit numbers, with accumulation in 32 bits. Expanding the 8-bit numbers to 16 bits and then using `_mm512_madd_epi16` can be a viable strategy.
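As a concrete illustration, here is a minimal sketch of that widening strategy for a single dot product of two contiguous int8 vectors (assuming AVX-512BW and, for brevity, that `k` is a multiple of 32; in a full GEMM you would additionally need to arrange the data so that the elements of a `B` column are contiguous, e.g. by packing or transposing `B`):

```cpp
#include <immintrin.h>
#include <cstdint>

// Dot product of two int8 vectors of length k (k assumed to be a multiple of 32):
// widen 8-bit to 16-bit, then use the 16-bit inner-product instruction (vpmaddwd).
int32_t dot_int8_avx512(const int8_t* a, const int8_t* b, int k) {
    __m512i acc = _mm512_setzero_si512();
    for (int l = 0; l < k; l += 32) {
        // Load 32 bytes and sign-extend to 32 x 16-bit.
        __m512i va = _mm512_cvtepi8_epi16(
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + l)));
        __m512i vb = _mm512_cvtepi8_epi16(
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + l)));
        // Multiply 16-bit pairs and add adjacent products into 16 x 32-bit sums.
        acc = _mm512_add_epi32(acc, _mm512_madd_epi16(va, vb));
    }
    // Horizontal sum of the 16 partial 32-bit sums.
    return _mm512_reduce_add_epi32(acc);
}
```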
For the VNNI task, note that the available instruction `_mm512_dpbusds_epi32` only allows multiplying one signed operand with one unsigned operand. In order to reap the speed benefits of this instruction, you thus need to implement pre- and postprocessing that maps the signed integer matrix multiplication to a signed-times-unsigned matrix multiplication. The `__dp4a` intrinsic in CUDA, in contrast, directly supports signed-times-signed multiplication.
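One common mapping (a sketch of my own, not prescribed by the task) biases the signed operand into the unsigned range: since (a + 128) * b = a * b + 128 * b, you can feed a + 128 as the unsigned operand and afterwards subtract 128 times the sum of the corresponding `B` entries from each output. For a single dot product of two contiguous vectors, assuming AVX512-VNNI and `k` a multiple of 64:

```cpp
#include <immintrin.h>
#include <cstdint>

// Signed int8 dot product via the unsigned-times-signed VNNI instruction.
// Trick: (a + 128) * b = a * b + 128 * b, so bias a into the unsigned range
// and subtract 128 * sum(b) at the end.
// Assumes AVX512-VNNI and that k is a multiple of 64.
int32_t dot_int8_vnni(const int8_t* a, const int8_t* b, int k) {
    const __m512i bias = _mm512_set1_epi8(char(0x80));   // adding 128 == flipping the sign bit
    __m512i acc = _mm512_setzero_si512();
    int32_t b_sum = 0;
    for (int l = 0; l < k; l += 64) {
        __m512i va = _mm512_loadu_si512(a + l);
        __m512i vb = _mm512_loadu_si512(b + l);
        __m512i va_u = _mm512_xor_si512(va, bias);        // a + 128, now a valid uint8
        acc = _mm512_dpbusds_epi32(acc, va_u, vb);        // unsigned(a+128) * signed(b)
        for (int t = 0; t < 64; ++t) b_sum += b[l + t];   // correction term (vectorize in practice)
    }
    return _mm512_reduce_add_epi32(acc) - 128 * b_sum;
}
```

In the full matrix multiplication, the correction term is simply 128 times the column sums of `B`, which can be computed once and subtracted from every row of `C`.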
A simple mental model for the basic operation of a tensorcore is that it extends the vector operations of regular SIMD processing to instructions that operate on fixed-size matrix fragments. When using 8-bit integer operands, each warp of the GPU can process the product of two 16 × 16 fragments in a single instruction. The C++ interface to these instructions is documented in the CUDA Programming Guide.
As such, you can consider the input and output matrices as built up from 16 × 16 tiles, and the algorithm can be implemented the same way as a scalar matrix multiplication, except that each element is now a matrix fragment. In particular, optimizations like register reuse (now on the level of entire fragments), shared memory, and the choice of the right data layout remain critical for good performance.
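To make the fragment-level view concrete, here is a minimal sketch of a tensorcore baseline using the `nvcuda::wmma` interface (assuming a GPU with integer tensor cores such as Turing, that `m`, `n`, and `k` are multiples of 16, and that each warp computes a single 16 × 16 tile of `C`; shared memory, fragment reuse, and layout tuning are deliberately left out):

```cuda
#include <mma.h>
#include <cstdint>
using namespace nvcuda;

// One warp per 16 x 16 output tile; A is m x k, B is k x n, both row-major int8.
// Assumes m, n, k are multiples of 16 (no edge handling in this sketch).
__global__ void gemm_wmma_baseline(int m, int n, int k,
                                   const int8_t* A, const int8_t* B, int32_t* C) {
    int tile_j = blockIdx.x;   // which 16-column tile of C
    int tile_i = blockIdx.y;   // which 16-row tile of C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c_frag;
    wmma::fill_fragment(c_frag, 0);

    // Walk along the reduction dimension, one 16 x 16 fragment at a time.
    for (int l = 0; l < k; l += 16) {
        wmma::load_matrix_sync(a_frag,
            reinterpret_cast<const signed char*>(A + tile_i * 16 * k + l), k);
        wmma::load_matrix_sync(b_frag,
            reinterpret_cast<const signed char*>(B + l * n + tile_j * 16), n);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tile_i * 16 * n + tile_j * 16, c_frag, n, wmma::mem_row_major);
}

// Launch (host side): one warp per block, one block per output tile.
// dim3 grid(n / 16, m / 16);
// gemm_wmma_baseline<<<grid, 32>>>(m, n, k, A, B, C);
```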