GPU & Low-Level Compute
- CUDA programming & custom kernel development
- Tensor Core utilization & warp-level tuning
- Memory-traffic reduction & profiler-driven optimization
I build high-performance systems for AI — compilers, kernels, inference engines, and the low-level infrastructure that lets models scale. My work lives where math meets metal: CUDA kernels, IR optimizations, quantization pipelines, and high-efficiency distributed runtimes. I focus on squeezing every last drop of performance out of hardware, reducing memory traffic, reshaping computation graphs, and designing the machinery that turns abstract models into real, blazing-fast systems. I’m obsessed with the boundary between algorithms and architecture — the place where precision, throughput, and engineering discipline collide.
I own the vertical slice that makes modern AI actually run — from hardware-aware kernels and compiler passes to production inference runtimes and observability.
Stack visualization showing the vertical slice I own — from metal (GPU & compiler) to serving and platform.

Inferno is a compiler designed to translate standard torch.nn.Module objects into highly optimized, low-level CUDA kernels. By performing graph-level optimizations and employing a Just-in-Time (JIT) compilation backend, it systematically eliminates framework abstractions and Python interpreter overhead to achieve significant performance improvements.
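To make the graph-level optimization step concrete, here is a minimal, hypothetical sketch of an elementwise-fusion pass of the kind such a compiler runs before JIT codegen. The `Node`, `ELEMENTWISE`, and `fuse_elementwise` names are illustrative assumptions, not Inferno's actual API; the point is that a chain of elementwise ops collapses into one node, so the backend emits a single kernel (one global-memory read, one write) instead of one kernel per op.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                              # e.g. "relu", "sigmoid", "input"
    inputs: list = field(default_factory=list)

# Unary elementwise ops are trivially fusable: same index space, no reuse.
ELEMENTWISE = {"relu", "sigmoid", "tanh"}

def fuse_elementwise(output: Node) -> Node:
    """Collapse a straight-line chain of elementwise ops into one fused node."""
    chain, node = [], output
    # Walk upward while each node is a single-input elementwise op.
    while node.op in ELEMENTWISE and len(node.inputs) == 1:
        chain.append(node.op)
        node = node.inputs[0]
    if len(chain) < 2:
        return output                    # nothing worth fusing
    # One fused node = one kernel launch instead of len(chain) launches.
    return Node(op="fused(" + "+".join(reversed(chain)) + ")", inputs=[node])
```

A real pass would also handle multi-input ops, broadcasting, and stopping at nodes with multiple consumers, but the traversal-and-collapse shape is the same.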

Inferyx is a no-BS simulation of a real-world AI inference system — built from the ground up to mirror the complexity and pressure of production-scale ML deployment. From batching and retry queues to observability and dynamic worker pools, Inferyx doesn’t play around. It’s built to demonstrate real-world AI systems engineering — infra, latency, load, and failure handling — not just another model-deployment demo.
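The batching component above can be sketched as a dynamic batcher with two flush triggers: the batch is full, or the oldest request has waited past a latency budget. This is a simplified stand-in under assumed names and thresholds (`DynamicBatcher`, `max_batch`, `max_wait_s`), not Inferyx's actual implementation.

```python
import time
from collections import deque

class DynamicBatcher:
    """Flush a batch when it is full OR when the oldest request is stale."""

    def __init__(self, max_batch=8, max_wait_s=0.01):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = deque()             # entries: (arrival_time, request)

    def submit(self, request, now=None):
        self.queue.append((now if now is not None else time.monotonic(), request))

    def maybe_flush(self, now=None):
        """Return a batch to run, or None if neither trigger has fired."""
        if not self.queue:
            return None
        now = now if now is not None else time.monotonic()
        full = len(self.queue) >= self.max_batch
        stale = now - self.queue[0][0] >= self.max_wait_s
        if not (full or stale):
            return None
        take = min(self.max_batch, len(self.queue))
        return [self.queue.popleft()[1] for _ in range(take)]
```

The size trigger protects throughput under load; the wait trigger caps tail latency when traffic is sparse. Tuning the two against each other is exactly the latency/throughput trade-off a production serving layer has to expose.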

A high-performance, hand-crafted quantization engine forged in raw CUDA and C++ that uses register tiling, static calibration, and a 'Moneyball' hybrid-precision strategy to squeeze flagship-level throughput out of consumer hardware — outperforming established libraries like bitsandbytes while preserving model accuracy.
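The static-calibration step can be illustrated with a minimal absmax sketch for symmetric int8 quantization: pick one scale per tensor from calibration samples, then quantize with round-and-clamp. This is a toy Python model of the idea (function names are mine), not the engine's CUDA implementation, which additionally does the work in registers via tiling.

```python
def calibrate_scale(samples, qmax=127):
    """Static calibration: one scale per tensor, scale = absmax / qmax."""
    absmax = max(abs(x) for x in samples)
    return absmax / qmax if absmax else 1.0

def quantize(x, scale, qmax=127):
    """Symmetric int8: round to the grid, clamp to [-qmax, qmax]."""
    q = round(x / scale)
    return max(-qmax, min(qmax, q))

def dequantize(q, scale):
    return q * scale
```

Because the scale is fixed ahead of time from calibration data, inference needs no per-batch statistics; a hybrid-precision strategy then keeps quantization-sensitive layers in higher precision while the rest run int8.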