Inferno 🔥

An end-to-end optimizing compiler for PyTorch, forged in the fires of C++, CUDA, and a frankly absurd amount of obsession.

[Inferno project cover image]

The Core Problem

Why a Ferrari Feels Slow in City Traffic

PyTorch is a masterpiece of flexibility. But for raw inference speed, that flexibility comes at a cost: Python overhead, framework abstractions, and—most critically—constant, slow round-trips to the GPU's main memory. I got fixated on this "memory wall." Instead of trying to tune the Ferrari, I decided to build my own hyperloop. Inferno was born from that obsession.

The Architecture of Madness

Building a Compiler in Three Acts

Act I: The Frontend

You can't optimize what you can't see. I built a frontend using a custom TorchFX tracer that acts like an X-ray, peering inside nn.Modules to see their atomic operations. It translates the model's soul into my own Intermediate Representation (IR)—a clean, universal blueprint that I control completely.
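
To make that concrete, here is a minimal sketch of what such a tracing step can look like (the IRNode class and trace_to_ir helper are illustrative names, not Inferno's actual API): torch.fx.symbolic_trace walks the module, and every traced call becomes one node in a small IR you fully control.

from dataclasses import dataclass

import torch
import torch.fx as fx


@dataclass
class IRNode:
    # Illustrative stand-in for Inferno's real IR node class
    name: str
    op_type: str
    inputs: list
    outputs: list


def trace_to_ir(model: torch.nn.Module) -> list:
    """Trace a module with TorchFX and lower each traced call into an IR node."""
    graph_module = fx.symbolic_trace(model)
    ir_nodes = []
    for node in graph_module.graph.nodes:
        # placeholder/get_attr/output nodes become the Inputs/Parameters sections;
        # here we only lower the compute ops
        if node.op in ("call_function", "call_method", "call_module"):
            op_type = getattr(node.target, "__name__", str(node.target))
            ir_nodes.append(IRNode(
                name=node.name,
                op_type=op_type,
                inputs=[str(a) for a in node.args],
                outputs=[node.name],
            ))
    return ir_nodes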

Before the Fusion Pass, the frontend's dump of an example model looks like this:

--- Inferno IR: Graph('MyFusionModel') ---
Inputs:
  - Tensor('x', shape=(256, 512), dtype=torch.float32)
Parameters:
  - Tensor('weight1', shape=(512, 128), dtype=torch.float32)
Nodes:
  - Op('matmul', type='matmul', inputs=['x', 'weight1'], outputs=['matmul'])
  - Op('relu', type='relu', inputs=['matmul'], outputs=['relu'])
...

After the Fusion Pass (Act II, below) has rewritten the graph, the same dump collapses to:

--- Inferno IR: Graph('MyFusionModel') ---
Inputs:
  - Tensor('x', shape=(256, 512), dtype=torch.float32)
Parameters:
  - Tensor('weight1', shape=(512, 128), dtype=torch.float32)
Nodes:
  - Op('fused_matmul_relu', type='fused_gemm_relu', inputs=['x', 'weight1'], outputs=['relu'])
...

Act II: The Optimizer

This is where the magic happens. A vanilla translation is boring. I wrote an optimization pass—the Fusion Pass—that acts like a ruthless editor. It scans the IR for clumsy patterns like MatMul -> ReLU and forges them into a single, elegant fused_gemm_relu operation, fundamentally rewriting the model for efficiency.
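
The core of such a rewrite pass is small. The sketch below reuses the illustrative IRNode structure from Act I and is deliberately naive (it only matches a relu that immediately follows its matmul, where a real pass must also prove the relu is the sole consumer), but it shows the idea: find the pattern, emit one fused node, drop the originals.

def fuse_matmul_relu(nodes: list) -> list:
    """One rewrite pass: collapse matmul -> relu pairs into a single fused op."""
    fused, absorbed = [], set()
    for i, node in enumerate(nodes):
        if i in absorbed:
            continue
        nxt = nodes[i + 1] if i + 1 < len(nodes) else None
        # Pattern: a matmul whose result feeds the very next node, a relu
        if (node.op_type == "matmul" and nxt is not None
                and nxt.op_type == "relu" and nxt.inputs == node.outputs):
            fused.append(IRNode(
                name=f"fused_{node.name}_{nxt.name}",
                op_type="fused_gemm_relu",
                inputs=node.inputs,      # the matmul's operands
                outputs=nxt.outputs,     # the relu's result name
            ))
            absorbed.add(i + 1)          # the relu is absorbed into the fusion
        else:
            fused.append(node)
    return fused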

Act III: The Backend

With the optimized blueprint, it was time to build. A Code Generator writes high-performance C++/CUDA source. A JIT Compiler forges it into a runnable library on the fly. An Execution Engine seamlessly wires this fire-breathing kernel back into PyTorch. The user calls a Python function, and this entire factory spins up and delivers a result.

#include <torch/extension.h>
#include <cublas_v2.h>
// ... headers ...

// Forward declarations from our custom .cu / runtime files
void fused_gemm_relu_forward_cuda(cublasHandle_t, torch::Tensor, torch::Tensor, torch::Tensor);
cublasHandle_t get_cublas_handle();

// Main forward function for the compiled model
torch::Tensor MyFusionModel_forward(
    torch::Tensor x, torch::Tensor weight1, torch::Tensor weight2
) {
    cublasHandle_t handle = get_cublas_handle();

    // Intermediate tensor for the fused kernel's output
    auto relu = torch::empty({256, 128}, x.options());

    // THE MAGIC: A direct call to our hyper-optimized fused kernel
    fused_gemm_relu_forward_cuda(handle, x, weight1, relu);

    // Fallback for the non-fused operation
    auto matmul_1 = torch::matmul(relu, weight2);

    return matmul_1;
}

// Pybind11 wrapper to expose this to Python
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &MyFusionModel_forward, "Inferno compiled forward pass");
}
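
The "forge it into a runnable library on the fly" step doesn't require a custom build system, because PyTorch already ships one. As a sketch, generated sources like the one above can be JIT-compiled with torch.utils.cpp_extension.load; the file names and module name below are placeholders, not the project's actual layout.

from torch.utils.cpp_extension import load

# Hypothetical file names: the code generator has just written these sources
generated_sources = [
    "build/MyFusionModel_bindings.cpp",  # the pybind11 wrapper shown above
    "build/fused_gemm_relu.cu",          # the custom fused CUDA kernel
]

# One call invokes nvcc plus the host compiler and dlopens the result
inferno_module = load(
    name="inferno_myfusionmodel",
    sources=generated_sources,
    extra_cuda_cflags=["-O3"],
    verbose=True,
)

# The execution engine can then route the Python-side call straight into it:
# out = inferno_module.forward(x, weight1, weight2)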

How It Works

Just a single decorator and you're done!

Build a PyTorch model and add one decorator to it: Inferno fuses the operations, optimizes the graph, and compiles it down to a C++/CUDA kernel. The entire compilation pipeline lives behind that single decorator. Add it to your model and you're done.

import torch
import torch.nn as nn
import torch.nn.functional as F

from inferno import compile

# `inputs` (example tensors used to trace the model) and KERNEL_FILEPATH
# (output path for the generated kernel source) are assumed to be defined
# earlier in the script.
@compile(inputs=inputs, kernel_filepath=KERNEL_FILEPATH)
class FusionModel(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(input_size, output_size))
        self.weight2 = nn.Parameter(torch.randn(output_size, output_size))

    def forward(self, x):
        # This pattern: matmul + relu can be fused
        x = torch.matmul(x, self.weight)
        x = F.relu(x)
        # This second matmul has no fusion partner and stays a plain cuBLAS call
        x = torch.matmul(x, self.weight2)
        return x
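
Assuming the decorator hands back a drop-in replacement for the original module (the construction arguments below simply mirror the shapes from the IR dump), using the compiled model then looks like ordinary PyTorch:

import torch

# Shapes mirror the traced example: a (256, 512) batch through a 512 -> 128 projection
model = FusionModel(input_size=512, output_size=128).cuda()
x = torch.randn(256, 512, device="cuda")

with torch.no_grad():
    out = model(x)    # dispatches into the JIT-compiled fused kernel
print(out.shape)      # torch.Size([256, 128])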

The Moment of Truth

Talk is Cheap. Data is Everything.

Peak Speedup

2.01x Faster

On memory-bound workloads, a model compiled by Inferno more than doubled the performance of native PyTorch.

But Why Is It Faster? The ncu Deep-Dive

Kernel Launches

50% Reduction

Fusion confirmed. One monolithic kernel launch instead of two separate, inefficient ones.

DRAM Memory Traffic

37% Reduction

The smoking gun. By avoiding a round-trip to slow DRAM, we slashed memory traffic from 76.8 KB to 48.1 KB.
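
For context, the wall-clock comparison behind a number like 2.01x is typically measured with CUDA events; the harness below is an illustrative sketch, not the project's actual benchmark script.

import torch

def time_forward(model, x, iters=100, warmup=10):
    """Average forward-pass latency in milliseconds, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):        # warm up caches and any lazy compilation
            model(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# speedup = time_forward(eager_model, x) / time_forward(compiled_model, x)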

The Real Payoff

What I Forged in the Fire

  • Systems Design: Architected and built a multi-stage, end-to-end compiler pipeline from first principles.
  • Low-Level GPU Programming: Authored and debugged custom CUDA kernels, wrestled with cuBLAS, and mastered the GPU memory hierarchy.
  • Compiler Theory in Practice: Designed a custom IR and implemented graph-rewriting optimization passes and a JIT backend.
  • Scientific Benchmarking: Moved beyond "it's faster" to proving why with professional tools like Nsight Compute.

The Road Ahead

New Game+

This project was my bootcamp for my ultimate goal: to become a go-to expert in Edge AI. The architecture is built. Now, the next quests are clear: implement a Vulkan backend for cross-platform support and a full INT8 quantization pass to truly conquer the edge. The story of Inferno has just begun.