Pricing verified against replicate.com as of April 2026. Replicate bills per second of compute with no subscription required.
Pricing Overview
Replicate uses a pure pay-as-you-go pricing model billed per second of compute time. There are no monthly subscriptions, seat licenses, or minimum commitments. You pay only for the hardware seconds your model predictions consume. This usage-based approach makes Replicate accessible for experimentation while scaling costs linearly with production workloads.
Hardware pricing ranges from $0.09/hr for CPU instances to $43.92/hr for 8x H100 GPU clusters. Public models hosted on Replicate have fixed per-prediction pricing: Flux Schnell costs $0.003/image, Flux 1.1 Pro costs $0.04/image, and DeepSeek R1 runs at $3.75 per 1M input tokens. Video generation with Wan 2.1 at 480p costs $0.09 per second of generated video. Enterprise customers can negotiate volume discounts through committed spend agreements.
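To make the billing model concrete, here is a minimal sketch of how these published rates translate into a bill. The tier names, rate tables, and helper functions are illustrative constants built from the figures quoted above, not an official SDK; re-check the numbers against replicate.com before relying on them.

```python
# Minimal sketch of how Replicate's published rates translate into a bill.
# Rates below are the figures quoted in this article, not an official SDK.

HARDWARE_PER_SECOND = {
    "cpu": 0.000025,      # $0.09/hr
    "h100": 0.001525,     # $5.49/hr
    "8x-h100": 0.012200,  # $43.92/hr
}

PER_PREDICTION = {
    "flux-schnell": 0.003,  # $/image
    "flux-1.1-pro": 0.04,   # $/image
}


def hardware_cost(tier: str, seconds: float) -> float:
    """Cost of a custom model consuming `seconds` of compute on a tier."""
    return HARDWARE_PER_SECOND[tier] * seconds


def public_model_cost(model: str, predictions: int) -> float:
    """Cost of a fixed-price public model for a number of predictions."""
    return PER_PREDICTION[model] * predictions


# Two hours of H100 time plus 1,000 Flux Schnell images
print(hardware_cost("h100", 2 * 3600))          # 10.98
print(public_model_cost("flux-schnell", 1000))  # 3.0
```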
Plan Comparison
Replicate does not use traditional subscription tiers. Instead, pricing is determined by the hardware tier selected for each model deployment:
| Hardware Tier | Hourly Rate | Per-Second Rate | Best For |
|---|---|---|---|
| CPU | $0.09/hr | $0.000025/sec | Lightweight preprocessing, text models |
| Nvidia T4 | $0.81/hr | $0.000225/sec | Budget inference, small image models |
| Nvidia A40 Large | $1.48/hr | $0.000411/sec | Mid-range inference workloads |
| A100 40GB | $3.15/hr | $0.000875/sec | Large language models, training |
| A100 80GB | $5.04/hr | $0.001400/sec | 70B+ parameter models, high-memory tasks |
| H100 | $5.49/hr | $0.001525/sec | Fastest single-GPU inference |
| 4x H100 | $21.96/hr | $0.006100/sec | Distributed inference, large batch jobs |
| 8x H100 | $43.92/hr | $0.012200/sec | Maximum throughput, multi-GPU training |
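The per-second column is simply the hourly rate divided by 3,600, so sanity-checking or extending the table is a one-liner (hourly rates taken from the table above):

```python
# Per-second rates are just the hourly rates divided by 3,600.
hourly_rates = {
    "cpu": 0.09,
    "t4": 0.81,
    "a40-large": 1.48,
    "a100-40gb": 3.15,
    "a100-80gb": 5.04,
    "h100": 5.49,
    "4x-h100": 21.96,
    "8x-h100": 43.92,
}

for tier, hourly in hourly_rates.items():
    print(f"{tier}: ${hourly / 3600:.6f}/sec")
    # e.g. "h100: $0.001525/sec"
```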
Hidden Costs and Considerations
Cold start latency. Models that are not actively running incur a cold start delay of 5-30 seconds while Replicate provisions the GPU. For latency-sensitive production APIs, this means either accepting occasional slow responses or keeping a model "warm" by sending periodic predictions, which adds to compute costs.
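If you take the keep-warm route, a scheduled lightweight prediction is usually enough. The sketch below assumes the official `replicate` Python client (authenticated via the `REPLICATE_API_TOKEN` environment variable) and a hypothetical model identifier; note that every ping is itself billed compute, so tune the interval against the cold-start tolerance you actually need.

```python
# Hedged sketch: keep a model warm by sending a periodic lightweight prediction.
# "your-org/your-model" is a hypothetical identifier; each ping is billed.
import time

import replicate

PING_INTERVAL_SECONDS = 240  # tune against the idle window you observe


def keep_warm() -> None:
    while True:
        try:
            # A minimal input keeps the billed compute per ping small.
            replicate.run("your-org/your-model", input={"prompt": "ping"})
        except Exception as exc:
            print(f"warm-up ping failed: {exc}")
        time.sleep(PING_INTERVAL_SECONDS)


if __name__ == "__main__":
    keep_warm()
```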
Idle time billing. Replicate bills per second of actual compute, but the clock starts when hardware is allocated, not when your code begins executing. Model loading time (downloading weights, initializing frameworks) counts as billable seconds, which is especially costly for large models that take 10-20 seconds to load.
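A useful way to reason about this is to amortize the billable load time across the predictions each boot serves. A rough sketch with illustrative (not measured) load and inference times:

```python
# Rough sketch: billable load time inflates the effective cost per prediction.
# Load and inference times below are illustrative assumptions.

def effective_cost_per_prediction(
    per_second_rate: float,
    load_seconds: float,
    predict_seconds: float,
    predictions_per_boot: int,
) -> float:
    """Amortize billable load time across the predictions served per cold boot."""
    billable_seconds = load_seconds + predict_seconds * predictions_per_boot
    return per_second_rate * billable_seconds / predictions_per_boot


# A100 80GB ($0.0014/sec), 15s load, 5s per prediction
print(effective_cost_per_prediction(0.0014, 15, 5, 1))    # 0.028  -> cold boot every call
print(effective_cost_per_prediction(0.0014, 15, 5, 100))  # ~0.0072 -> load amortized
```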
Enterprise volume discounts. Replicate offers committed spend agreements for high-volume customers. The exact discount tiers are not published, but teams spending $5,000+/month should contact Replicate sales to negotiate lower rates.
No free tier for custom models. While Replicate offers free credits for new accounts, running custom models in production requires a payment method. Public model pricing (like $0.003/image for Flux Schnell) applies regardless of volume.
Cost Estimates by Team Size
Solo developer or hobbyist: Running 1,000 image generations per month with Flux Schnell at $0.003/image costs $3/month. Occasional experimentation with larger models on T4 GPUs ($0.81/hr) for 10 hours adds $8.10. Monthly total: $11-$15.
Small startup (3-5 engineers): A team might run 50,000 Flux Schnell predictions per month ($150), a custom model on an A100 80GB for 100 hours ($504), and DeepSeek R1 processing 10M input tokens ($37.50). Monthly total: $500-$900.
Mid-size company (15-25 engineers): Production workloads might include custom models on H100 GPUs for 500 hours/month ($2,745), 500,000 Flux Schnell image generations ($1,500), and 1,000 seconds of generated Wan 2.1 video ($90). Before enterprise discounts, the monthly total is $4,000-$6,000; with committed spend discounts, expect 15-25% savings.
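The arithmetic behind these estimates is simple enough to script; the rates below are the ones quoted earlier in this article:

```python
# Back-of-the-envelope totals for the three scenarios, using rates quoted above.
scenarios = {
    "solo": {
        "flux_schnell_images": 1_000 * 0.003,    # $3.00
        "t4_hours": 10 * 0.81,                   # $8.10
    },
    "startup": {
        "flux_schnell_images": 50_000 * 0.003,   # $150.00
        "a100_80gb_hours": 100 * 5.04,           # $504.00
        "deepseek_r1_10m_tokens": 10 * 3.75,     # $37.50
    },
    "midsize": {
        "h100_hours": 500 * 5.49,                # $2,745.00
        "flux_schnell_images": 500_000 * 0.003,  # $1,500.00
        "wan_video_seconds": 1_000 * 0.09,       # $90.00
    },
}

for name, items in scenarios.items():
    print(f"{name}: ${sum(items.values()):,.2f}")
# solo: $11.10, startup: $691.50, midsize: $4,335.00
```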
How Replicate Pricing Compares
Replicate's per-second billing model differs fundamentally from competitors that charge per token or per million tokens. Direct cost comparison depends on the specific model and workload pattern.
vs. Fireworks AI: Fireworks charges per token, starting at $0.10 per 1M tokens for sub-4B parameter models and $0.90 per 1M tokens for 16B+ models. For LLM inference, Fireworks is substantially cheaper for high-throughput text workloads. Replicate's advantage is broader model support including image, video, and audio models that Fireworks does not host.
vs. Together AI: Together AI offers inference from $0.10 per 1M tokens for smaller models. For pure LLM serving, Together provides more predictable per-token pricing. Replicate's per-second hardware billing can be more cost-effective for models with variable output lengths or non-text modalities.
vs. Groq: Groq charges $0.59 per 1M input tokens and $0.79 per 1M output tokens for Llama 70B. For LLM-only workloads requiring the lowest latency, Groq undercuts Replicate on price and speed. Replicate serves a broader set of use cases beyond text generation.
Replicate's strongest cost advantage is for teams that need to run custom models (fine-tuned or proprietary) across multiple modalities. The per-second billing model works well for bursty, unpredictable workloads where you want to avoid paying for idle capacity. For teams focused purely on LLM inference at scale, token-based providers like Fireworks AI and Together AI deliver better unit economics.
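To compare the two billing models for your own workload, convert per-second hardware cost into an effective price per million tokens. The throughput and utilization figures below are assumptions to be replaced with measurements from your own model; only the H100 hourly rate comes from this article:

```python
# Hedged sketch: effective $/1M tokens when renting hardware by the second.
# Tokens/sec and utilization are assumptions, not Replicate figures.

def cost_per_million_tokens(
    hourly_rate: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """Effective price per 1M tokens for per-second hardware billing."""
    effective_tps = tokens_per_second * utilization
    seconds_per_million = 1_000_000 / effective_tps
    return (hourly_rate / 3600) * seconds_per_million


# Example: H100 at $5.49/hr, assumed 1,500 tokens/sec at 60% utilization
print(round(cost_per_million_tokens(5.49, 1_500, 0.6), 2))  # ~1.69 per 1M tokens
```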