Inferyx - A Production-Grade AI Inference Engine

Inferyx project cover image

Ever Wonder What Happens After You Hit /predict?

Behind every slick AI feature you use, whether it's generating an image, answering a question, or tagging a document, there's a storm of moving parts under the hood. An API call that looks like magic on the surface kicks off a messy, complex dance of queues, batching, timeouts, failures, retries, and metrics.

I wanted to know what that chaos really looked like.

So instead of just reading about it, I built it.

This project is a from-scratch simulation of a real-world AI inference system, complete with queueing, dynamic batching, worker orchestration, retry logic, caching, observability, and full containerization. No tutorials, no scaffolding — just a clean slate and a goal: recreate the madness of production-scale ML inference as faithfully as possible.

Why I Built It

Most AI projects revolve around the model — training it, fine-tuning it, squeezing out better accuracy. But what happens when the model's done and deployed? What happens when it faces real traffic, when latency matters, and when the system needs to be reliable at scale?

That part — the post-model infrastructure — is where things get real.

I wanted to understand what it actually takes to serve a model like it's done in production. Not just with a model.predict() call in a Flask app, but something that could simulate a high-pressure inference backend. One that has to deal with unpredictable load, backpressure, failure, retries, caching, observability — all the things you'd find in the backend of a serious ML product.

This wasn't about building a product or chasing metrics. It was about understanding the system — deeply. And the only way to do that was to stop reading and start building.

The System: What I Set Out to Build

The goal was to simulate what happens after a model inference API is hit — everything that takes place between receiving a request and sending back a response, under real-world constraints.

Not just a REST API that returns a prediction, but an end-to-end inference engine with:

  • Queueing to decouple request load from processing capacity
  • Dynamic batching to simulate GPU-efficient inference
  • Worker orchestration that scales the pool up and down based on load
  • Retry and failure handling to mimic flaky model behavior
  • Caching to reduce latency and improve throughput
  • Observability to track performance and identify bottlenecks
  • Full containerization to simulate production-like deployment

Here's how all of that fits together:

Inferyx system architecture diagram
Above: Inferyx system architecture — job requests move through queuing, batching, worker execution, and caching, with full observability at each step.

This system was built from scratch using:

  • FastAPI for the gateway and job interface
  • Redis for queueing, caching, and job tracking
  • Python workers using multiprocessing
  • Prometheus + Grafana for metrics and real-time observability
  • Docker Compose to tie it all together

Each of these components was built to mimic production-like constraints: timeouts, backpressure, failures, job skipping, worker bottlenecks, retry storms — the stuff you don't see in toy projects but hit hard in real deployments.

Component-by-Component Breakdown

Here's a deep dive into how each part of the system works, from the gateway to the workers to the observability stack.

1. FastAPI Gateway

What it does:

Acts as the system's front door. Accepts inference requests, validates inputs, returns request IDs, and allows clients to poll for results.

How I Built It:

  • Exposed /infer and /result/{request_id} endpoints using FastAPI.
  • Validated inputs using Pydantic models.
  • Assigned each incoming request a unique request_id, wrapped it as a job, and pushed it to a Redis queue.
  • Added minimal key-based API auth to simulate gated access control.
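
Here's a minimal sketch of that flow. The endpoint paths and the job/hash keys match the ones described here, but the field names, the backpressure threshold, and the status values are illustrative, and the API-key check is omitted for brevity:

```python
import json
import time
import uuid

import redis
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
r = redis.Redis(host="redis", port=6379, decode_responses=True)

QUEUE_KEY = "inference_queue"   # Redis list holding pending jobs
MAX_QUEUE_LENGTH = 1000         # illustrative backpressure threshold


class InferenceRequest(BaseModel):
    model_config = {"protected_namespaces": ()}  # allow a field named model_id (Pydantic v2)
    input: str
    model_id: str = "default"


@app.post("/infer")
def infer(req: InferenceRequest):
    # Backpressure: refuse new work if the queue is already too deep.
    if r.llen(QUEUE_KEY) >= MAX_QUEUE_LENGTH:
        raise HTTPException(status_code=429, detail="Too Many Requests")

    request_id = str(uuid.uuid4())
    job = {
        "request_id": request_id,
        "input": req.input,
        "timestamp": time.time(),
        "model_id": req.model_id,
    }

    # Track status in a hash, then enqueue the job for the batcher.
    r.hset(f"job:{request_id}", mapping={"status": "queued"})
    r.lpush(QUEUE_KEY, json.dumps(job))
    return {"request_id": request_id, "status": "queued"}


@app.get("/result/{request_id}")
def result(request_id: str):
    job = r.hgetall(f"job:{request_id}")
    if not job:
        raise HTTPException(status_code=404, detail="Unknown request_id")
    return job  # e.g. {"status": "done", "output": "..."}
```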

What can go wrong:

  • If the request rate spikes and the system can't dequeue fast enough, the /infer endpoint returns 429 Too Many Requests, mimicking real-world rate-limiting/backpressure.
  • If enqueueing fails, the job is lost; the failure is logged and a metric is incremented.

What I measured:

  • Total inference requests received (inference_requests_total)
  • Jobs skipped due to load (jobs_skipped_total)
  • Request queue size (via Redis)
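
For a rough idea of how those show up in code, here's how the counters and gauge might be registered with prometheus_client (the metric names match the ones above; everything else is my own assumption):

```python
from prometheus_client import Counter, Gauge

# Incremented by the gateway on each accepted / rejected request.
INFERENCE_REQUESTS = Counter(
    "inference_requests_total", "Total inference requests received"
)
JOBS_SKIPPED = Counter(
    "jobs_skipped_total", "Jobs skipped or rejected due to load"
)

# Sampled from Redis (LLEN inference_queue) whenever Prometheus scrapes.
QUEUE_LENGTH = Gauge(
    "inference_queue_length", "Current depth of the Redis inference queue"
)
```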

2. Redis Queue

What it does:

Temporarily holds incoming inference jobs, decoupling the rate of incoming requests from the system's ability to process them. It's the first line of defense against load spikes.

How I Built It:

  • Used a Redis list inference_queue to enqueue jobs.
  • Each job is a JSON blob: {"request_id": "...", "input": "...", "timestamp": "...", "model_id": "..."}.
  • Redis hash job:{request_id} tracks job status and output.

What can go wrong:

  • If Redis crashes or enqueue/dequeue isn't atomic, inconsistencies can creep in.
  • I ran into this early: a job made it into the queue but never into the hash, leaving the system in an inconsistent state and teaching me about atomicity firsthand.
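
One way to close that gap (a general sketch, not necessarily the exact fix in Inferyx) is to write the status hash and the queue entry inside a single Redis transaction, so either both land or neither does:

```python
import json

import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)


def enqueue_job(job: dict) -> None:
    """Write the job hash and the queue entry atomically via MULTI/EXEC."""
    request_id = job["request_id"]
    pipe = r.pipeline(transaction=True)  # queues commands inside MULTI/EXEC
    pipe.hset(f"job:{request_id}", mapping={"status": "queued"})
    pipe.lpush("inference_queue", json.dumps(job))
    pipe.execute()  # both writes are applied together, or not at all
```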

What I measured:

  • Queue length over time (inference_queue_length)
  • Failed enqueue attempts (job_failures_total)

3. Dynamic Batching

What it does:

Bundles individual inference jobs into batches to simulate real GPU efficiency — fewer calls, higher throughput.

How I Built It:

  • A tight loop that:
    • Pulls up to N jobs from the Redis queue
    • Waits up to T milliseconds to fill a batch if not full
    • Pushes the batch into the worker job queue
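
In simplified Python, that loop looks roughly like this (BATCH_SIZE, BATCH_TIMEOUT_MS, and the batch queue name stand in for the real values from config.py):

```python
import json
import time

import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

BATCH_SIZE = 8           # N: maximum jobs per batch
BATCH_TIMEOUT_MS = 50    # T: maximum wait for a full batch


def batching_loop():
    while True:
        batch = []
        deadline = time.monotonic() + BATCH_TIMEOUT_MS / 1000
        # Fill the batch until it's full or the deadline passes.
        while len(batch) < BATCH_SIZE and time.monotonic() < deadline:
            raw = r.rpop("inference_queue")  # jobs were LPUSHed, so RPOP keeps FIFO order
            if raw is None:
                time.sleep(0.005)  # queue is empty; back off briefly
                continue
            batch.append(json.loads(raw))
        if batch:
            # Hand the whole batch to the worker pool as one unit.
            r.lpush("batch_queue", json.dumps(batch))
```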

What can go wrong:

  • If batch size is too small, you underutilize the GPU.
  • If you wait too long to fill a batch, you increase latency.
  • This is the core tradeoff of batching: throughput vs latency.

What I measured:

  • Average batch size (batch_size)
  • Throughput (job_processed_total)

4. Worker Pool

What it does:

Executes inference batches. These are the simulated "GPU workers" — dynamically spawned and terminated based on system load to emulate real-world autoscaling behavior.

How I Built It:

  • The system always keeps:
    • Minimum 4 workers alive (base capacity)
    • Maximum 16 workers under load (autoscaling ceiling)
  • Scaling Rules:
    • Spawn more workers if queue pressure builds up (≥10 jobs per worker).
    • Terminate idle workers after timeout to reclaim resources.
  • All parameters are configurable from config.py: batch size, max/min workers, job-to-worker ratio, timeout.
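
A condensed version of that control loop, using the numbers above (4 to 16 workers, roughly 10 queued jobs per worker); the worker body and the idle-timeout bookkeeping are simplified to keep the sketch short:

```python
import multiprocessing as mp
import time

import redis

r = redis.Redis(host="redis", port=6379)

MIN_WORKERS = 4          # base capacity
MAX_WORKERS = 16         # autoscaling ceiling
JOBS_PER_WORKER = 10     # scale up beyond this queue-to-worker ratio
SCALE_INTERVAL_S = 1.0


def worker_loop():
    """Stand-in for the real worker: pop batches and run inference."""
    while True:
        time.sleep(0.1)


def spawn(workers):
    p = mp.Process(target=worker_loop, daemon=True)
    p.start()
    workers.append(p)


def autoscaler():
    workers = []
    while True:
        # Always keep the minimum pool alive.
        while len(workers) < MIN_WORKERS:
            spawn(workers)

        queue_len = r.llen("batch_queue")
        # Scale up when queue pressure exceeds the per-worker threshold.
        if queue_len / len(workers) >= JOBS_PER_WORKER and len(workers) < MAX_WORKERS:
            spawn(workers)
        # Scale down toward the floor once the queue drains (idle timeout simplified away).
        elif queue_len == 0 and len(workers) > MIN_WORKERS:
            workers.pop().terminate()

        time.sleep(SCALE_INTERVAL_S)
```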

What can go wrong:

  • Without throttling, the system can spawn too many workers → CPU/memory blowout (ask my MacBook).
  • Aggressive scaling → short-lived workers → high overhead.
  • Too conservative → job lag, queue overflow.

But when tuned right? Chef's kiss. The system flexes under load and chills when idle.

Why dynamic scaling matters:

  • It reflects real production behavior: infra should adapt to traffic.
  • Showcases systems thinking: not just handling load, but doing it efficiently.
  • Demonstrates understanding of resource management and cost-performance tradeoffs.

Here's what it looks like in action: workers dynamically scaling based on real-time queue pressure, visualized through Grafana:

Grafana GIF showing worker autoscaling

What I measured:

  • Worker count over time (spikes during load, drops afterward)
  • Worker utilization
  • Worker lifespan (helps tune scale-down timer)