cuTile Code Samples

This repository contains various examples demonstrating the use of cuTile for implementing high-performance GPU kernels in Python. cuTile simplifies writing CUDA kernels by providing a Pythonic interface for GPU programming concepts like tiling, shared memory, and warp-level operations.

Each sample showcases a fundamental operation, implemented directly as a cuTile kernel.

Samples Included

Batch Matrix Multiplication (BatchMatMul.py)

Purpose: Implements batched matrix multiplication (C = A * B) for 3D tensors.

Key Concepts: 3D grid launches, ct.mma for efficient matrix multiply-accumulate.

Dependencies: torch, math, numpy
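The semantics this kernel implements can be sketched in plain NumPy (this is a reference for what the cuTile kernel computes, not the kernel itself; the kernel maps each batch index to one slice of a 3D grid and uses `ct.mma` for the per-tile multiply-accumulate):

```python
import numpy as np

def batch_matmul(A, B):
    """Reference semantics for batched matmul: C[b] = A[b] @ B[b] per batch.

    A: (batch, M, K), B: (batch, K, N) -> C: (batch, M, N).
    """
    batch, M, K = A.shape
    _, _, N = B.shape
    C = np.zeros((batch, M, N), dtype=np.result_type(A, B))
    for b in range(batch):  # in the kernel, one grid slice per batch index
        C[b] = A[b] @ B[b]
    return C
```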

Fast Fourier Transform (FFT) (FFT.py)

Purpose: Implements a Batched 1D FFT using a multi-dimensional factorization approach.

Key Concepts: Tensor factorization, complex arithmetic, pre-computed rotation (W) and twiddle (T) factors.

Dependencies: torch, math
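One common multi-dimensional factorization of a 1D FFT is the four-step method: split the length-N transform into small DFTs along one axis, a pointwise twiddle multiply, small DFTs along the other axis, and a transpose. The sketch below (a NumPy reference, assuming this style of factorization; the sample's exact W/T factor layout may differ) shows the idea:

```python
import numpy as np

def four_step_fft(x, N1, N2):
    """Length-N FFT (N = N1*N2) via the four-step factorization:
    column DFTs, twiddle multiply, row DFTs, transpose."""
    N = N1 * N2
    A = x.reshape(N1, N2)                  # A[n1, n2] = x[n1*N2 + n2]
    B = np.fft.fft(A, axis=0)              # length-N1 DFTs down the columns
    k1 = np.arange(N1)[:, None]
    n2 = np.arange(N2)[None, :]
    T = np.exp(-2j * np.pi * k1 * n2 / N)  # twiddle factors W_N^(k1*n2)
    C = np.fft.fft(B * T, axis=1)          # length-N2 DFTs along the rows
    return C.T.reshape(-1)                 # X[k2*N1 + k1] = C[k1, k2]
```

In a GPU implementation, the small DFTs become tile-local matrix multiplies against pre-computed DFT matrices, which is why the sample pre-computes rotation and twiddle factors.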

Matrix Multiplication (MatMul.py)

Purpose: Implements standard (non-batched) matrix multiplication (C = A * B) for 2D matrices.

Key Concepts: Tiled processing, efficient inner loop computation.

Dependencies: torch, math
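The tiled-processing pattern can be illustrated with a NumPy reference (a sketch of the access pattern, not the cuTile kernel; the tile size is an illustrative choice): each output tile accumulates partial products over the K dimension in tile-sized chunks, which is the kernel's inner loop.

```python
import numpy as np

def tiled_matmul(A, B, TILE=32):
    """Tiled matmul reference: accumulate C's (i, j) tile over K in chunks."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            acc = np.zeros_like(C[i:i+TILE, j:j+TILE])
            for k in range(0, K, TILE):  # inner loop: one partial product per K-chunk
                acc += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
            C[i:i+TILE, j:j+TILE] = acc
    return C
```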

Matrix Transposition (Transpose.py)

Purpose: Demonstrates transposing a 2D matrix.

Key Concepts: Tiled processing, index swapping for transposition.

Dependencies: torch, math
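The index-swapping idea is simple to state as a NumPy reference (again a semantic sketch, not the kernel): a tile read from position (i, j) is written transposed at position (j, i).

```python
import numpy as np

def tiled_transpose(A, TILE=32):
    """Transpose reference: read tile at (i, j), write its transpose at (j, i)."""
    M, N = A.shape
    out = np.empty((N, M), dtype=A.dtype)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            out[j:j+TILE, i:i+TILE] = A[i:i+TILE, j:j+TILE].T
    return out
```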

Fused Multi-Head Attention (AttentionFMHA.py)

Purpose: Demonstrates a fused multi-head attention operation, common in transformer models.

Key Concepts: Causal and non-causal attention.

Dependencies: torch, math, numpy
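The operation being fused is scaled dot-product attention. A single-head NumPy reference (a sketch of the math only; the sample's fused kernel computes this tile-by-tile without materializing the full score matrix) with an optional causal mask:

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Single-head reference: softmax(Q @ K.T / sqrt(d)) @ V.

    With causal=True, each query position attends only to keys at the
    same or earlier positions."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (seq_q, seq_k) scores
    if causal:
        seq_q, seq_k = scores.shape
        mask = np.triu(np.ones((seq_q, seq_k), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)  # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

With the causal mask, the first query position can only attend to the first key, so its output is exactly `V[0]` — a handy sanity check.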