GPU & Low-Level Compute
- CUDA programming & custom kernel development
- Tensor Core utilization & warp-level tuning
- Memory-traffic reduction & profiler-driven optimization
I build high-performance systems for AI — compilers, kernels, inference engines, and the low-level infrastructure that lets models scale. My work lives where math meets metal: CUDA kernels, IR optimizations, quantization pipelines, and high-efficiency distributed runtimes. I focus on squeezing every last drop of performance out of hardware, reducing memory traffic, reshaping computation graphs, and designing the machinery that turns abstract models into real, blazing-fast systems. I’m obsessed with the boundary between algorithms and architecture — the place where precision, throughput, and engineering discipline collide.
I own the vertical slice that makes modern AI actually run — from hardware-aware kernels and compiler passes to production inference runtimes and observability.
Stack visualization showing the vertical slice I own — from metal (GPU & compiler) to serving and platform.

Inferno is a compiler designed to translate standard torch.nn.Module objects into highly optimized, low-level CUDA kernels. By performing graph-level optimizations and employing a Just-in-Time (JIT) compilation backend, it systematically eliminates framework abstractions and Python interpreter overhead to achieve significant performance improvements.
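To make the graph-level optimization step concrete, here is a minimal, hypothetical sketch of an elementwise-fusion pass of the kind such a compiler runs before JIT codegen. The `Node`, `ELEMENTWISE`, and `fuse_elementwise` names are illustrative assumptions, not Inferno's actual API; the point is that a chain of elementwise ops collapses into one node, so the backend emits a single kernel (one global-memory read, one write) instead of one kernel per op.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                              # e.g. "relu", "sigmoid", "input"
    inputs: list = field(default_factory=list)

# Unary elementwise ops are trivially fusable: same index space, no reuse.
ELEMENTWISE = {"relu", "sigmoid", "tanh"}

def fuse_elementwise(output: Node) -> Node:
    """Collapse a straight-line chain of elementwise ops into one fused node."""
    chain, node = [], output
    # Walk upward while each node is a single-input elementwise op.
    while node.op in ELEMENTWISE and len(node.inputs) == 1:
        chain.append(node.op)
        node = node.inputs[0]
    if len(chain) < 2:
        return output                    # nothing worth fusing
    # One fused node = one kernel launch instead of len(chain) launches.
    return Node(op="fused(" + "+".join(reversed(chain)) + ")", inputs=[node])
```

A real pass would also handle multi-input ops, broadcasting, and stopping at nodes with multiple consumers, but the traversal-and-collapse shape is the same.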

Inferyx is a no-BS simulation of a real-world AI inference system — built from the ground up to mirror the complexity and pressure of production-scale ML deployment. From batching and retry queues to observability and dynamic worker pools, Inferyx doesn’t play around. It’s built to demonstrate real-world AI systems engineering — infra, latency, load, and failure handling — not just another model-deployment demo.
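The batching component above can be sketched as a dynamic batcher with two flush triggers: the batch is full, or the oldest request has waited past a latency budget. This is a simplified stand-in under assumed names and thresholds (`DynamicBatcher`, `max_batch`, `max_wait_s`), not Inferyx's actual implementation.

```python
import time
from collections import deque

class DynamicBatcher:
    """Flush a batch when it is full OR when the oldest request is stale."""

    def __init__(self, max_batch=8, max_wait_s=0.01):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = deque()             # entries: (arrival_time, request)

    def submit(self, request, now=None):
        self.queue.append((now if now is not None else time.monotonic(), request))

    def maybe_flush(self, now=None):
        """Return a batch to run, or None if neither trigger has fired."""
        if not self.queue:
            return None
        now = now if now is not None else time.monotonic()
        full = len(self.queue) >= self.max_batch
        stale = now - self.queue[0][0] >= self.max_wait_s
        if not (full or stale):
            return None
        take = min(self.max_batch, len(self.queue))
        return [self.queue.popleft()[1] for _ in range(take)]
```

The size trigger protects throughput under load; the wait trigger caps tail latency when traffic is sparse. Tuning the two against each other is exactly the latency/throughput trade-off a production serving layer has to expose.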

A high-performance, hand-crafted quantization engine forged in raw CUDA and C++ that uses register tiling, static calibration, and a 'Moneyball' hybrid-precision strategy to squeeze flagship-level throughput out of consumer hardware — outperforming established libraries like bitsandbytes while preserving model accuracy.
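The static-calibration step can be illustrated with a minimal absmax sketch for symmetric int8 quantization: pick one scale per tensor from calibration samples, then quantize with round-and-clamp. This is a toy Python model of the idea (function names are mine), not the engine's CUDA implementation, which additionally does the work in registers via tiling.

```python
def calibrate_scale(samples, qmax=127):
    """Static calibration: one scale per tensor, scale = absmax / qmax."""
    absmax = max(abs(x) for x in samples)
    return absmax / qmax if absmax else 1.0

def quantize(x, scale, qmax=127):
    """Symmetric int8: round to the grid, clamp to [-qmax, qmax]."""
    q = round(x / scale)
    return max(-qmax, min(qmax, q))

def dequantize(q, scale):
    return q * scale
```

Because the scale is fixed ahead of time from calibration data, inference needs no per-batch statistics; a hybrid-precision strategy then keeps quantization-sensitive layers in higher precision while the rest run int8.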