An end-to-end optimizing compiler for PyTorch, forged in the fires of C++, CUDA, and a frankly absurd amount of obsession.

Why a Ferrari Feels Slow in City Traffic
PyTorch is a masterpiece of flexibility. But for raw inference speed, that flexibility comes at a cost: Python overhead, framework abstractions, and—most critically—constant, slow round-trips to the GPU's main memory. I got fixated on this "memory wall." Instead of trying to tune the Ferrari, I decided to build my own hyperloop. Inferno was born from that obsession.
Building a Compiler in Three Acts
You can't optimize what you can't see. I built a frontend using a custom TorchFX tracer that acts like an X-ray, peering inside nn.Modules to see their atomic operations. It translates the model's soul into my own Intermediate Representation (IR)—a clean, universal blueprint that I control completely.
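To give a flavor of what such a frontend does, here is a minimal sketch built directly on torch.fx. The IROp / IRGraph / trace_to_ir names are illustrative placeholders, not Inferno's actual classes:

from dataclasses import dataclass, field

import torch
import torch.fx


@dataclass
class IROp:
    name: str        # unique node name
    op_type: str     # e.g. 'matmul', 'relu'
    inputs: list     # names of the nodes this op reads
    outputs: list    # names this op defines


@dataclass
class IRGraph:
    name: str
    nodes: list = field(default_factory=list)


def trace_to_ir(model: torch.nn.Module) -> IRGraph:
    """Trace a module with torch.fx and lower each call into a flat IR op."""
    gm = torch.fx.symbolic_trace(model)
    ir = IRGraph(name=type(model).__name__)
    for node in gm.graph.nodes:
        if node.op in ("call_function", "call_method", "call_module"):
            # call_method/call_module targets are strings; call_function targets are callables
            op_type = node.target if isinstance(node.target, str) else getattr(node.target, "__name__", str(node.target))
            ir.nodes.append(IROp(
                name=node.name,
                op_type=op_type,
                inputs=[a.name for a in node.args if isinstance(a, torch.fx.Node)],
                outputs=[node.name],
            ))
    return ir

For a small two-layer model, Inferno's dump of the resulting IR looks like this: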
--- Inferno IR: Graph('MyFusionModel') ---
Inputs:
- Tensor('x', shape=(256, 512), dtype=torch.float32)
Parameters:
- Tensor('weight1', shape=(512, 128), dtype=torch.float32)
Nodes:
- Op('matmul', type='matmul', inputs=['x', 'weight1'], outputs=['matmul'])
- Op('relu', type='relu', inputs=['matmul'], outputs=['relu'])
...

This is where the magic happens. A vanilla translation is boring. I wrote an optimization pass—the Fusion Pass—that acts like a ruthless editor. It scans the IR for clumsy patterns like MatMul -> ReLU and forges them into a single, elegant fused_gemm_relu operation, fundamentally rewriting the model for efficiency. After the pass runs, the same graph looks like this:

--- Inferno IR: Graph('MyFusionModel') ---
Inputs:
- Tensor('x', shape=(256, 512), dtype=torch.float32)
Parameters:
- Tensor('weight1', shape=(512, 128), dtype=torch.float32)
Nodes:
- Op('fused_matmul_relu', type='fused_gemm_relu', inputs=['x', 'weight1'], outputs=['relu'])
...
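To make the rewrite concrete, here is a minimal sketch of such a pattern-matching pass over the toy IROp / IRGraph structures from the frontend sketch above (again illustrative, not Inferno's actual pass):

def fuse_matmul_relu(ir: IRGraph) -> IRGraph:
    """Rewrite every matmul -> relu chain into a single fused_gemm_relu op."""
    by_output = {op.outputs[0]: op for op in ir.nodes}

    # Pass 1: find relu ops whose input is produced by a matmul.
    # NOTE: a production pass would also check that nothing else consumes the matmul's output.
    pairs = {}  # relu op name -> producing matmul op
    for op in ir.nodes:
        if op.op_type == "relu":
            producer = by_output.get(op.inputs[0])
            if producer is not None and producer.op_type == "matmul":
                pairs[op.name] = producer

    # Pass 2: rebuild the node list, dropping the fused producers and consumers.
    fused_producers = {p.name for p in pairs.values()}
    new_nodes = []
    for op in ir.nodes:
        if op.name in fused_producers:
            continue  # the matmul is absorbed into the fused op below
        if op.name in pairs:
            producer = pairs[op.name]
            new_nodes.append(IROp(
                name=f"fused_{producer.name}_{op.name}",
                op_type="fused_gemm_relu",
                inputs=producer.inputs,
                outputs=op.outputs,
            ))
        else:
            new_nodes.append(op)
    return IRGraph(name=ir.name, nodes=new_nodes)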
With the optimized blueprint, it was time to build. A Code Generator writes high-performance C++/CUDA source. A JIT Compiler forges it into a runnable library on the fly. An Execution Engine seamlessly wires this fire-breathing kernel back into PyTorch. The user calls a Python function, and this entire factory spins up and delivers a result. Here's a trimmed example of the C++ the code generator emits for the model above:
#include <torch/extension.h>
#include <cublas_v2.h>
// ... headers ...

// Forward declarations from our custom .cu file and runtime helpers
void fused_gemm_relu_forward_cuda(cublasHandle_t, torch::Tensor, torch::Tensor, torch::Tensor);
cublasHandle_t get_cublas_handle();

// Main forward function for the compiled model
torch::Tensor MyFusionModel_forward(
    torch::Tensor x, torch::Tensor weight1, torch::Tensor weight2
) {
    cublasHandle_t handle = get_cublas_handle();

    // Output buffer for the fused kernel, allocated up front
    auto relu = torch::empty({256, 128}, x.options());

    // THE MAGIC: a direct call to our hyper-optimized fused kernel
    fused_gemm_relu_forward_cuda(handle, x, weight1, relu);

    // Fallback for the non-fused operation
    auto matmul_1 = torch::matmul(relu, weight2);
    return matmul_1;
}

// Pybind11 wrapper to expose this to Python
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &MyFusionModel_forward, "Inferno compiled forward pass");
}
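For the JIT step, PyTorch already ships the machinery to compile and import source like this on the fly. A minimal sketch of how that might look using torch.utils.cpp_extension.load; the file paths and module name are illustrative, not Inferno's actual layout:

import torch
from torch.utils.cpp_extension import load

# Compile the generated sources into a Python extension module on the fly.
# Paths are illustrative; in practice the code generator writes these files.
compiled = load(
    name="inferno_my_fusion_model",
    sources=[
        "generated/my_fusion_model.cpp",        # the bindings shown above
        "generated/fused_gemm_relu_kernel.cu",  # the custom fused CUDA kernel
    ],
    extra_cuda_cflags=["-O3"],
    extra_ldflags=["-lcublas"],
    verbose=True,
)

# The execution engine can now call straight into the compiled forward pass.
x = torch.randn(256, 512, device="cuda")
weight1 = torch.randn(512, 128, device="cuda")
weight2 = torch.randn(128, 128, device="cuda")  # shape illustrative
out = compiled.forward(x, weight1, weight2)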
Just a Single Decorator and You're Done!
Build a PyTorch model, add a single decorator, and Inferno fuses the operations, optimizes the graph, and compiles it to a C++/CUDA kernel. The whole compilation pipeline hangs off that one decorator: put it on your model and you're done.
import torch
import torch.nn as nn
import torch.nn.functional as F
from inferno import compile

# Illustrative values (shapes follow the IR dump above; the exact `inputs` format is an assumption)
inputs = (torch.randn(256, 512),)
KERNEL_FILEPATH = "generated/fusion_model_kernel.cu"

@compile(inputs=inputs, kernel_filepath=KERNEL_FILEPATH)
class FusionModel(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(input_size, output_size))
        self.weight2 = nn.Parameter(torch.randn(output_size, output_size))

    def forward(self, x):
        # This pattern: matmul + relu can be fused
        x = torch.matmul(x, self.weight)
        x = F.relu(x)
        x = torch.matmul(x, self.weight2)  # second matmul stays un-fused, as in the C++ above
        return x
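Assuming the decorated class can still be instantiated and called like a regular nn.Module (that is how I'd expect the API to behave from the description above, but it's an assumption, not something pulled from Inferno's docs):

model = FusionModel(input_size=512, output_size=128).cuda()
x = torch.randn(256, 512, device="cuda")
out = model(x)  # routed through the compiled fused C++/CUDA kernel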
Talk is Cheap. Data is Everything.
Peak Speedup: 2.01x faster
On memory-bound workloads, a model compiled by Inferno more than doubled the performance of native PyTorch.

ncu Deep-Dive
Kernel Launches: 50% reduction
Fusion confirmed. One monolithic kernel launch instead of two separate, inefficient ones.
DRAM Memory Traffic: 37% reduction
The smoking gun. By avoiding a round-trip to slow DRAM, we slashed memory traffic from 76.8 KB to 48.1 KB.
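The kernel-launch and DRAM-traffic numbers come from profiling under NVIDIA Nsight Compute (ncu). The speedup itself is the kind of figure you can reproduce with a CUDA-event timing harness like this generic sketch (not Inferno's actual benchmark code; eager_model and compiled_model are placeholders):

import torch

def time_cuda_ms(fn, iters=100, warmup=10):
    """Average runtime of fn() in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Placeholders: eager_model is the plain nn.Module, compiled_model its Inferno-compiled twin
# x = torch.randn(256, 512, device="cuda")
# print(time_cuda_ms(lambda: eager_model(x)) / time_cuda_ms(lambda: compiled_model(x)), "x speedup")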
What I Forged in the Fire
New Game+
This project was my bootcamp for my ultimate goal: to become a go-to expert in Edge AI. The architecture is built. Now, the next quests are clear: implement a Vulkan backend for cross-platform support and a full INT8 quantization pass to truly conquer the edge. The story of Inferno has just begun.