Start Writing Kernels Faster Before Writing Faster Kernels

[Hieu’s personal blog]

In this blog post, I introduce a simple tool to accelerate and simplify the process of writing and experimenting with kernels.


While Python is the de facto language for AI/ML, Python programs can be painfully slow. As such, AI/ML practitioners who develop latency-sensitive applications often resort to writing kernels to accelerate their workloads.

Now, unless you work for Google, your kernels are probably written in CUDA, exposed via a C/C++ interface, and bound to Python as a package via PyBind11. This process, despite being conceptually simple, usually results in an appalling amount of boilerplate complexity. Here, “boilerplate complexity” loosely means the code, file structure, libraries, etc. that you need to set up correctly before you can write the first line of code in your kernel. Let us look at two examples.

This boilerplate complexity obscures the core logic of compiling, linking, and binding kernels. It turns a conceptually simple process into something opaque and untraceable. In turn, it imposes a steep barrier to entry on anyone who aspires to develop or research kernels.

EasyPybind – a simple tool to kickstart writing kernels

Writing kernels is a complex business. But the complexity should go into designing, implementing, optimizing, and testing the kernels, rather than into the first step: to kickstart writing the kernels.

In fact, this first step is simple enough to be automated. And I have written a simple, minimalistic tool called EasyPybind to do exactly that.


EasyPybind is available on GitHub. It can be installed via:

$ pip install easy_pybind
$ easy-pybind create --help
# usage: easy_pybind create [-h] --module-name MODULE_NAME [...]

Usage example

Once installed, the command:

$ easy-pybind create --module-name="cu_kernel" --cuda

will generate a directory that looks like this:

├─ .gitignore            # ignore build artifacts
├─              # build the module
├─              # clean up the build
├─ src/
│  ├─       # entry to the module
│  ├─  # implementation of the module in CUDA
│  └─ cu_kernel_impl.h
├─     # if you have --with-pytest
└─               # if you have --with-pymain

The file contains a rather simple function that adds two integers. The function is introduced in the header file cu_kernel_impl.h, and is exposed through a C/C++ interface in

The script will compile the module and link it with the CUDA runtime. If you look into the default, you will see it has 3 components:

Running ./ results in the file in the same directory with a or a file. As long as you have this file, you can do:

import cu_kernel
cu_kernel.add(1, 2)
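If the import fails, a quick first check is whether the compiled file sits where Python can find it. The following is a minimal sketch, not part of EasyPybind: `find_extension` and `check_add` are hypothetical helper names, and it assumes the module is named `cu_kernel` and that the script is run from the directory containing the built file. It relies only on the standard library's `importlib.machinery.EXTENSION_SUFFIXES`, which lists the file suffixes the running interpreter accepts for compiled modules.

```python
import importlib.machinery
from pathlib import Path


def find_extension(module_name, build_dir="."):
    """Return files in build_dir that this interpreter could import as module_name."""
    build_dir = Path(build_dir)
    candidates = []
    # EXTENSION_SUFFIXES holds the suffixes this interpreter accepts for
    # compiled modules; the exact values depend on platform and Python version.
    for suffix in importlib.machinery.EXTENSION_SUFFIXES:
        path = build_dir / f"{module_name}{suffix}"
        if path.is_file():
            candidates.append(path)
    return candidates


def check_add(add_fn):
    """Compare add_fn against Python's own integer addition on a few inputs."""
    for a, b in [(1, 2), (0, 0), (-5, 5), (1000, 2345)]:
        result = add_fn(a, b)
        assert result == a + b, f"add({a}, {b}) returned {result}"
    return True


if __name__ == "__main__":
    found = find_extension("cu_kernel")  # assumed module name
    if not found:
        print("no compiled cu_kernel module found here; run the build script first")
    else:
        import cu_kernel  # only attempted once a matching file exists

        check_add(cu_kernel.add)
        print("cu_kernel.add agrees with Python on the sample inputs")
```

Keeping `check_add` independent of the import means the same check can be pointed at any later version of the function, or reused in a pytest file.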

To write more complex kernels, you would start by changing its interface in, and then propagate the changes to cu_kernel_impl.h and

If you want to change your compilation process, for instance, to add some CUDA libraries, you have all the raw skeleton in If you want to add flags like -DNDEBUG, there goes the nvcc call in I usually add:


to use my beloved CUTLASS library by NVIDIA.
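To make the shape of such a change concrete, here is a rough sketch of how extra include directories and preprocessor defines slot into an nvcc command line. It is an illustration under stated assumptions, not the script EasyPybind generates: `nvcc_command` is a hypothetical helper, the paths and source file name in the usage are placeholders, and `-Xcompiler -fPIC` assumes a Linux host compiler. CUTLASS is header-only, so using it typically amounts to one more `-I` entry pointing at its `include` directory.

```python
import shlex
import sysconfig


def nvcc_command(module_name, sources, pybind11_include,
                 extra_includes=(), defines=()):
    """Assemble an nvcc argument list for a Python extension (illustrative only)."""
    # -shared builds a shared library; -Xcompiler forwards -fPIC to the host compiler.
    cmd = ["nvcc", "-O3", "-std=c++17", "-shared", "-Xcompiler", "-fPIC"]
    # Preprocessor defines, e.g. NDEBUG becomes -DNDEBUG.
    cmd += [f"-D{name}" for name in defines]
    # Headers for the running Python interpreter and for pybind11.
    cmd += [f"-I{sysconfig.get_paths()['include']}", f"-I{pybind11_include}"]
    # Any additional header-only libraries, such as CUTLASS.
    cmd += [f"-I{path}" for path in extra_includes]
    cmd += list(sources)
    # EXT_SUFFIX is the suffix Python expects for compiled modules;
    # fall back to plain ".so" if the interpreter does not report one.
    ext_suffix = sysconfig.get_config_var("EXT_SUFFIX") or ".so"
    cmd += ["-o", f"{module_name}{ext_suffix}"]
    return cmd


if __name__ == "__main__":
    cmd = nvcc_command(
        "cu_kernel",
        ["src/example.cu"],                       # placeholder source file
        "/path/to/pybind11/include",              # placeholder path
        extra_includes=["/path/to/cutlass/include"],  # placeholder path
        defines=["NDEBUG"],
    )
    print(shlex.join(cmd))
```

The function only builds and prints the command; it does not invoke nvcc, so it is safe to run on a machine without the CUDA toolkit.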


I have been using EasyPybind in my own research to develop kernels. I like it quite a lot, so I decided to share it with the world.

I do not intend to turn it into a large and intractable project that I don’t have time to develop and maintain. But if you have any suggestions, please let me know. You can reach me by email at hyhieu [at] gmail [dot] com.