Vessel: Near-Realtime Batch Inference Platform
The Batch Inference Problem
Large language model APIs typically optimize for low latency—responding to individual requests as quickly as possible. This design makes sense for interactive applications like chatbots and coding assistants that require near-instant responses. However, many real-world LLM use cases are fundamentally different:
- Processing datasets for classification, sentiment analysis, or content moderation
- Generating embeddings for millions of documents to build vector databases
- Creating synthetic training data for fine-tuning or data augmentation
- Running evaluation benchmarks to test model performance
- Batch conversion of product catalogs or documentation
For these workloads, individual request latency is irrelevant. What matters is total throughput: completing all 100,000 requests in 2 hours instead of 20, at an affordable price. Yet most inference APIs charge per-token pricing optimized for interactive workloads, where GPUs sit at 10-30% utilization waiting for the next request rather than running parallel computations.
The Economics Problem
Traditional inference APIs reflect the cost structure of serving low-latency requests: models must be kept constantly loaded in memory, ready to respond instantly. For batch workloads, this pricing model is inefficient—you're paying for low-latency infrastructure you don't need, and subsidizing idle GPU time between sequential requests.
OpenAI's Batch API addressed this with 50% discounts, acknowledging that batch and real-time inference have different cost structures. However, it introduced a critical limitation: processing time. Batch jobs "typically complete within 24 hours," but in practice, observed completion times for 10,000-request batches range from 45-90 minutes at best to 6-24 hours for larger workloads. For many use cases—alert triaging, rapid experimentation, iterative development—this delay is prohibitive.
A Dedicated Batch Infrastructure
Vessel was built around a fundamental question: What if batch inference could complete in minutes instead of hours, at a fraction of the cost?
The platform achieves 100-1000x faster completion times and 10x lower costs compared to traditional batch APIs through a key architectural insight: traditional inference workflows load a model into GPU memory but process one request at a time, leaving the vast majority of compute units idle. The model weights might occupy 12GB of VRAM, but a single conversation's KV cache uses only ~100MB.
Vessel inverts this completely: load the model once, then process hundreds of requests simultaneously. The KV cache scales to fill available VRAM (20-30GB), and all tensor cores work in parallel using only one loaded copy of the weights. The result is 10,000-50,000+ tokens per second on consumer GPUs versus 500-1,000 tokens per second for request-at-a-time serving.
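The arithmetic behind this inversion can be sketched in a few lines. The 12 GB weight footprint and ~100 MB per-sequence KV cache come from the text above; the 40 GB card and 92% headroom target are illustrative assumptions, not measured figures:

```python
# Back-of-envelope concurrency math: one copy of the weights, many KV caches.
# Figures are illustrative (12 GB weights, ~100 MB KV cache per sequence,
# hypothetical 40 GB GPU with a ~92% VRAM utilization target).

def max_concurrent_sequences(vram_gb, weights_gb, kv_cache_mb_per_seq, headroom=0.92):
    """Estimate how many sequences fit alongside one loaded copy of the weights."""
    usable_mb = vram_gb * 1024 * headroom - weights_gb * 1024
    return int(usable_mb // kv_cache_mb_per_seq)

# Request-at-a-time serving uses one of these slots and leaves the rest idle.
print(max_concurrent_sequences(vram_gb=40, weights_gb=12, kv_cache_mb_per_seq=100))
```

Hundreds of concurrent sequences per GPU is what turns a single loaded model into tens of thousands of tokens per second.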
Architecture Overview
Vessel uses a Captain + Crewmate (master + worker) distributed design that separates orchestration from computation.
The Captain: Lightweight Coordinator
A stateless FastAPI service managing the control plane:
- API Gateway: Exposes OpenAI-compatible REST endpoints (/v1/batches, /v1/files)
- Batch Management: Tracks batch lifecycle from submission through completion
- Task Queue: Maintains a Redis-backed FIFO queue of pending inference tasks
- Worker Registry: Monitors available workers via heartbeat mechanism
- Authentication & Billing: Validates API keys, enforces credit limits, calculates token usage
- Result Aggregation: Assembles chunked results from workers into final output files
The Captain is intentionally lightweight (~500MB memory, minimal CPU), running on inexpensive instances while workers run on expensive GPU instances. This separation enables elastic scaling economics: 1 Captain + 100 Crewmates is far cheaper than 100 integrated nodes.
The Crewmate: GPU Inference Workers
Stateless GPU-enabled Python processes that execute the actual inference:
- Task Claiming: Polls Captain for available work, atomically claims tasks
- Model Management: Loads models into GPU memory and caches them aggressively
- Batched Inference: Processes hundreds of requests simultaneously with dynamic memory management
- Multi-GPU Support: Distributes batches across multiple GPUs with near-linear scaling
- Result Submission: Uploads completed results with automatic chunking for large batches
- Heartbeat System: Maintains registration with Captain (5-minute TTL, refreshed every 30s)
Workers are completely stateless—no local state beyond the current task. They can be added or removed without disruption, and naturally handle heterogeneous GPU fleets. A worker with a small GPU claims small model tasks; a worker with a large GPU can handle any model. The system intelligently routes work without explicit coordination.
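The heartbeat scheme above (5-minute TTL, refreshed every 30 seconds) can be sketched as follows. This is a minimal in-memory stand-in, not the platform's actual API; a real deployment would lean on Redis key expiry instead of a dict:

```python
import time

# Illustrative heartbeat registry: a worker that stops refreshing within the
# TTL window silently drops out of the live set. `WorkerRegistry` is a
# hypothetical name; the real system stores registrations in Redis with TTLs.

HEARTBEAT_TTL = 300    # seconds until a silent worker is considered gone
REFRESH_INTERVAL = 30  # how often a live worker re-registers

class WorkerRegistry:
    def __init__(self):
        self._expiry = {}  # worker_id -> unix time when the registration lapses

    def heartbeat(self, worker_id, now=None):
        now = time.time() if now is None else now
        self._expiry[worker_id] = now + HEARTBEAT_TTL

    def live_workers(self, now=None):
        now = time.time() if now is None else now
        return [w for w, t in self._expiry.items() if t > now]

registry = WorkerRegistry()
registry.heartbeat("gpu-worker-1", now=0)
print(registry.live_workers(now=100))  # within TTL: still registered
print(registry.live_workers(now=400))  # TTL expired: worker dropped
```

Because removal is just a missed refresh, scaling down (or losing a spot instance) requires no explicit deregistration step.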
Data Layer
Redis serves as the "speed layer" for ephemeral batch state:
- Task queue (FIFO via lists)
- Batch metadata and status
- Worker registrations (with TTLs)
- File metadata

MongoDB provides durable storage for user accounts:
- API keys and authentication
- Credit balances and billing
- Usage metrics and aggregations

Filesystem stores batch content:
- Input JSONL files
- Output result files
- 7-day retention with automatic cleanup
Performance Achievements
Real-World Benchmark: MMLU
The MMLU benchmark consists of 14,042 multiple-choice questions testing model knowledge across 57 subjects. This represents a realistic large-scale batch workload.
OpenAI Batch API (GPT-4o-mini):
- Total time: 50-80 minutes (typical weekday)
- Estimated cost: ~$3-4
- Breakdown: ~10s upload, 30-60 min queue wait, ~20 min processing

Vessel:
- Total time: 1.72 minutes (103 seconds processing)
- Cost: $0.015
- Throughput: 20,183 tokens/second
- Speedup: 29-47x faster
- Cost reduction: 200-267x cheaper
This isn't a synthetic benchmark—it's a real evaluation workload that researchers and developers run regularly. The difference between waiting an hour and waiting under 2 minutes fundamentally changes how you work: instead of submitting a batch and checking back later, you get results before context-switching away.
Throughput Characteristics
Measured sustained throughput on production workloads:
- Embeddings: 50,000-180,000+ tokens/second (depending on model size)
- Chat completions: 10,000-20,000+ tokens/second (depending on output length)
- GPU utilization: 90-92% VRAM usage, near 100% compute utilization
- Multi-GPU scaling: 95% efficiency (2 GPUs = 1.9x throughput, 4 GPUs = 3.7x)
These numbers represent actual production performance, not theoretical peaks. The platform maintains this throughput across hours-long batches processing millions of tokens.
Key Technical Innovations
Vessel's performance comes from several compounding technical innovations that work together:
1. Dynamic Batch Sizing
The critical challenge in batch inference is determining how large a batch you can process without running out of GPU memory. Too conservative and you waste hardware; too aggressive and you crash with out-of-memory errors.
Vessel uses a three-phase approach:
- Theoretical estimation: Calculate expected memory usage based on model architecture
- Empirical calibration: Run test batches and measure actual memory consumption
- Binary search: Find the maximum safe batch size that keeps VRAM at 90-92% utilization
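The binary-search phase can be sketched as follows. The `fits` predicate here is a toy integer memory model standing in for the empirical calibration probe, and the specific budget numbers are illustrative:

```python
# Find the largest batch size whose predicted VRAM footprint stays under the
# utilization target. `fits` is a stand-in for a calibration probe that would
# run a test batch and measure actual memory consumption.

def max_safe_batch(fits, lo=1, hi=4096):
    """Largest batch size in [lo, hi] for which fits(size) is True."""
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best = mid
            lo = mid + 1   # safe: try larger
        else:
            hi = mid - 1   # OOM risk: back off
    return best

# Toy model: 12288 MB of weights + 100 MB per sequence against a 92% budget
# on a hypothetical 40 GB (40960 MB) card.
fits = lambda n: 12288 + 100 * n <= int(40960 * 0.92)
print(max_safe_batch(fits))
```

Because each probe is O(1) after calibration, the search converges in about a dozen steps even with a wide size range.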
This approach consistently achieves 2.3x better hardware utilization than fixed batch sizing, which directly translates to 2.3x better throughput per dollar spent.
2. Aggressive Model Caching
Traditional serverless inference suffers from cold starts—loading a model from disk to GPU takes 10-30 seconds. Vessel keeps models loaded in GPU memory between tasks, achieving zero cold start latency after the first task.
The system includes intelligent model switching: if a worker needs to load a different model, it performs aggressive cleanup (model deletion, garbage collection, VRAM cache clearing) before loading the new one. Workers also implement "lazy claiming"—if a worker can handle a task but has the wrong model loaded, it waits a few seconds to give priority to workers that already have the correct model cached.
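The lazy-claiming rule reduces to a small decision function. The 5-second grace period is an illustrative value for the "waits a few seconds" behavior described above, and the function name is hypothetical:

```python
# "Lazy claiming": a worker with the right model cached claims immediately;
# one that would need a model switch backs off briefly so cached workers get
# first pick. The delay constant is illustrative, not the platform's value.

LAZY_CLAIM_DELAY_S = 5  # grace period before claiming with the wrong model loaded

def claim_delay(task_model, loaded_model):
    """Seconds to wait before attempting to claim a task for `task_model`."""
    return 0 if task_model == loaded_model else LAZY_CLAIM_DELAY_S

print(claim_delay("qwen-7b", loaded_model="qwen-7b"))  # cached: claim now
print(claim_delay("qwen-7b", loaded_model="phi-2"))    # wrong model: back off
```

If no cached worker claims the task within the grace period, the mismatched worker takes it and pays the switch cost once.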
3. Elastic Worker Architecture
The Captain-Crewmate split enables several critical capabilities:
Spot Instance Optimization: Workers are designed to be ephemeral and can run on spot instances (70% cost reduction). Because workers are stateless and task claiming is atomic, a worker can be terminated mid-task without data loss. The Captain detects the loss via missed heartbeats and another worker claims the task.
Heterogeneous Fleets: Different workers with different GPU capabilities naturally specialize. The queue returns "one task per model," allowing each worker to claim tasks matching its capabilities. A small-GPU worker claims small model tasks; a large-GPU worker can handle anything.
Horizontal Scaling: Adding capacity is as simple as starting a new worker process. It detects its GPU capabilities, registers with the Captain, and immediately starts processing. No coordination protocol, no leader election, no distributed consensus required.
4. Multi-GPU Parallelism
Batch inference is embarrassingly parallel—each sequence in the batch can be processed independently. Vessel splits large batches across multiple GPUs using simple data parallelism, achieving 92-95% scaling efficiency.
The key insight: you only need to load the model weights once per GPU, then each GPU processes its subset of the batch in parallel. Result aggregation happens in memory with minimal overhead.
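Because the batch is embarrassingly parallel, the split itself is trivial. A minimal sketch, with `run_on_gpu` standing in for per-device inference:

```python
# Data-parallel batch split: shard one batch contiguously across N GPUs, run
# each shard independently, then concatenate results in submission order.
# `run_on_gpu` is a hypothetical stand-in for per-device inference.

def split_batch(requests, num_gpus):
    """Contiguous sharding that keeps result ordering trivial to restore."""
    k, rem = divmod(len(requests), num_gpus)
    shards, start = [], 0
    for i in range(num_gpus):
        end = start + k + (1 if i < rem else 0)  # spread the remainder evenly
        shards.append(requests[start:end])
        start = end
    return shards

def run_data_parallel(requests, num_gpus, run_on_gpu):
    shards = split_batch(requests, num_gpus)
    results = [run_on_gpu(i, shard) for i, shard in enumerate(shards)]
    return [r for shard_out in results for r in shard_out]  # in-order merge

out = run_data_parallel(list(range(10)), 4, lambda gpu, shard: [x * 2 for x in shard])
print(out)
```

No inter-GPU communication is needed during inference, which is why scaling efficiency stays in the 92-95% range.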
5. Chunked Result Submission
Large batches (100K+ requests) create gigabyte-sized result payloads. Network failures during submission would lose hours of compute. Vessel automatically chunks large results into progressive submissions:
- Try full payload first (fast path for small batches)
- On failure, chunk into pieces (1000 → 500 → 250 → 100 requests per chunk)
- Submit incrementally with accumulation
- Each chunk is a commit point—if submission fails at chunk 15/20, chunks 1-14 are already saved
This makes the system resilient to network issues and payload size limits without sacrificing performance for common cases.
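The fallback ladder above can be sketched as follows. `submit` stands in for the HTTP upload (which may raise on failure); the accumulation counter is what makes each accepted chunk a durable commit point:

```python
# Progressive chunked submission: try the whole payload, then retry the
# unaccepted remainder in ever-smaller chunks. `submit` is a hypothetical
# transport callable standing in for the real HTTP upload.

CHUNK_SIZES = [1000, 500, 250, 100]

def submit_with_chunking(results, submit):
    accepted = 0  # results[:accepted] are already saved server-side (commit points)
    try:
        submit(results)  # fast path: one request covers small batches
        return
    except Exception:
        pass
    for size in CHUNK_SIZES:
        try:
            while accepted < len(results):
                submit(results[accepted:accepted + size])
                accepted = min(accepted + size, len(results))
            return
        except Exception:
            continue  # keep what was accepted, retry the rest in smaller chunks
    raise RuntimeError("all chunk sizes exhausted")

# Demo: a transport that rejects any payload over 250 items.
saved = []
def submit(chunk):
    if len(chunk) > 250:
        raise IOError("payload too large")
    saved.extend(chunk)

submit_with_chunking(list(range(600)), submit)
print(len(saved))  # all 600 results delivered, in 250-item chunks
```

Already-accepted chunks are never resubmitted, so a failure partway through costs only the remainder, not the whole batch.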
6. Model-Based Task Grouping
Rather than maintaining a simple FIFO queue, Vessel groups tasks by model and returns "one task per model" to polling workers. This enables model specialization: workers that already have a model loaded see tasks for that model immediately and claim them, avoiding model switching overhead.
This seemingly simple optimization dramatically reduces model loading churn in heterogeneous workloads where different batches use different models.
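The grouping itself is a one-pass scan over the FIFO. A minimal sketch, assuming a queue of `(model, task_id)` pairs in arrival order:

```python
# "One task per model": instead of offering only the FIFO head, surface the
# oldest pending task for each model so a polling worker can pick the one
# matching its cached weights. Model names here are purely illustrative.

def one_task_per_model(queue):
    """queue: list of (model, task_id) in FIFO order -> oldest pending task per model."""
    offers = {}
    for model, task_id in queue:
        offers.setdefault(model, task_id)  # keep only the first (oldest) per model
    return offers

queue = [("qwen-7b", 1), ("phi-2", 2), ("qwen-7b", 3), ("phi-2", 4)]
print(one_task_per_model(queue))  # one offer per model, oldest task first
```

FIFO fairness is preserved within each model's stream, while workers self-select across models.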
Design Philosophy & Tradeoffs
Vessel makes explicit tradeoffs to achieve its performance characteristics:
What You Give Up
Reasoning Quality: Vessel uses smaller, faster models (0.7B-21B parameters) rather than state-of-the-art reasoning models like GPT-5 or Claude 4.5. For the MMLU benchmark, Vessel's smallest model achieves 46% accuracy versus GPT-4o-mini's estimated 82%. This 36-percentage-point gap is the cost of speed.
Latest Knowledge: Models don't have access to real-time internet search or tools (though this may be added in the future).
Output Polish: Smaller models may require more prompt engineering or post-processing to achieve desired output formats.
What You Gain
Speed: 29-1000x faster completion times. The MMLU benchmark runs in under 2 minutes versus 50-80 minutes. For rapid experimentation and iteration, this is transformative—you can run 50+ experiments per day versus 1-2.
Cost: 10-200x cheaper per token. Processing a million requests becomes affordable rather than prohibitively expensive.
Throughput: 10,000-180,000+ tokens per second sustained. Workloads that would take days complete in hours.
The Target Sweet Spot
Vessel is designed for use cases where:
- Volume matters: Processing 1,000+ requests per batch
- Speed matters: Need results in minutes, not hours
- Cost matters: Budget is tight, need to process millions of requests
- Quality is sufficient: Tasks don't require deep reasoning (classification, extraction, summarization, embedding generation, evaluation)
This covers a large and growing set of real-world applications:
- Dataset evaluation and benchmarking
- Large-scale classification and content moderation
- Embedding generation for vector databases
- Synthetic data generation for training
- Batch document processing and reformatting
- Alert triaging and log analysis
OpenAI API Compatibility
Vessel implements the OpenAI Batch API specification precisely, enabling zero-code migration. Only the client constructor changes:
```python
# Before
client = OpenAI(api_key="sk-...")

# After
client = OpenAI(
    base_url="https://vessel-platform.acampi.dev/v1",
    api_key="vessel-..."
)

# Everything else works unchanged
batch = client.batches.create(...)
```
This compatibility extends to:
- Same HTTP endpoints and request/response schemas
- Same status lifecycle (validating → queued → in_progress → completed)
- Same error formats
- Works with OpenAI SDK, LangChain, LlamaIndex, and other tools
The compatibility adds ~15% overhead (JSONL parsing, file I/O, polling-based status) but eliminates all adoption friction. For a new platform, where adoption friction is the killer, this tradeoff is worth making every time.
Production Deployment
Vessel is currently deployed in private preview mode at vessel-platform.acampi.dev, processing real production workloads. The platform demonstrates that batch-first inference can be both practical and performant.
Architectural Resilience
The system is designed to tolerate failures gracefully:
Three-layer timeout detection:
- Unclaimed tasks time out after 1 hour
- In-progress tasks time out after 12 hours
- Worker heartbeats expire after 5 minutes
Atomic task claiming: Race conditions are handled through Redis atomic operations—two workers can attempt to claim the same task; exactly one succeeds.
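The claim semantics are set-if-absent. A minimal in-process sketch of the same guarantee Redis provides (via SETNX-style atomic operations), with a lock standing in for Redis's single-threaded command execution:

```python
import threading

# Atomic task claiming sketch: two workers race for the same task; exactly
# one wins. `TaskClaims` is an illustrative stand-in for a Redis-backed
# set-if-absent operation.

class TaskClaims:
    def __init__(self):
        self._owner = {}
        self._lock = threading.Lock()  # models Redis's atomic command execution

    def claim(self, task_id, worker_id):
        """Return True iff this worker won the claim (set-if-absent)."""
        with self._lock:
            if task_id in self._owner:
                return False
            self._owner[task_id] = worker_id
            return True

claims = TaskClaims()
print(claims.claim("task-42", "worker-a"))  # True: first claim wins
print(claims.claim("task-42", "worker-b"))  # False: already claimed
```

Combined with the timeout layers above, a claim that is never completed simply lapses and becomes claimable again.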
Acceptable data loss: The system explicitly tolerates Redis restarts (wipes queues) because batch jobs have timeout protections and are inherently retryable. This conscious choice optimizes for the 99.9% case rather than the 0.1% catastrophic failure case.
The Broader Insight
Vessel demonstrates a fundamental principle: when you design infrastructure specifically for batch workloads rather than adapting real-time infrastructure, you unlock order-of-magnitude improvements.
Traditional batch APIs are afterthoughts—shared infrastructure where batch jobs fill gaps between real-time requests. Vessel inverts this: batch processing is the primary design primitive, and every architectural choice (stateless workers, aggressive batching, model caching, spot instance optimization) flows from that decision.
The result is a system that excels at throughput-oriented workloads while explicitly sacrificing goals that conflict: individual request latency, state-of-the-art reasoning quality, and high-availability guarantees. For workloads that fit this profile—and there are many of them—the platform delivers transformational improvements in both speed and economics.
As LLM applications mature beyond interactive chatbots toward data processing, evaluation frameworks, and infrastructure building, throughput-first inference becomes increasingly important. Vessel stakes out the extreme of that tradeoff space: maximum throughput at minimum cost, accepting quality ceilings as a necessary tradeoff. For the right workloads, this combination is unbeatable.