Pricing verified against replicate.com as of April 2026. Replicate bills per second of compute with no subscription required.
Pricing Overview
Replicate uses a pure pay-as-you-go pricing model billed per second of compute time. There are no monthly subscriptions, seat licenses, or minimum commitments. You pay only for the hardware seconds your model predictions consume. This usage-based approach makes Replicate accessible for experimentation while scaling costs linearly with production workloads.
Hardware pricing ranges from $0.09/hr for CPU instances to $43.92/hr for 8x H100 GPU clusters. Public models hosted on Replicate have fixed per-prediction pricing: Flux Schnell costs $0.003/image, Flux 1.1 Pro costs $0.04/image, and DeepSeek R1 runs at $3.75 per 1M input tokens. Video generation with Wan 2.1 at 480p costs $0.09 per second of generated video. Enterprise customers can negotiate volume discounts through committed spend agreements.
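To make the billing model concrete, here is a minimal sketch of how these published rates translate into a bill. The tier names, rate tables, and helper functions are illustrative constants built from the figures quoted above, not an official SDK; re-check the numbers against replicate.com before relying on them.

```python
# Minimal sketch of how Replicate's published rates translate into a bill.
# Rates below are the figures quoted in this article, not an official SDK.

HARDWARE_PER_SECOND = {
    "cpu": 0.000025,      # $0.09/hr
    "h100": 0.001525,     # $5.49/hr
    "8x-h100": 0.012200,  # $43.92/hr
}

PER_PREDICTION = {
    "flux-schnell": 0.003,  # $/image
    "flux-1.1-pro": 0.04,   # $/image
}


def hardware_cost(tier: str, seconds: float) -> float:
    """Cost of a custom model consuming `seconds` of compute on a tier."""
    return HARDWARE_PER_SECOND[tier] * seconds


def public_model_cost(model: str, predictions: int) -> float:
    """Cost of a fixed-price public model for a number of predictions."""
    return PER_PREDICTION[model] * predictions


# Two hours of H100 time plus 1,000 Flux Schnell images
print(hardware_cost("h100", 2 * 3600))          # 10.98
print(public_model_cost("flux-schnell", 1000))  # 3.0
```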
Plan Comparison
Replicate does not use traditional subscription tiers. Instead, pricing is determined by the hardware tier selected for each model deployment:
| Hardware Tier | Hourly Rate | Per-Second Rate | Best For |
|---|---|---|---|
| CPU | $0.09/hr | $0.000025/sec | Lightweight preprocessing, text models |
| Nvidia T4 | $0.81/hr | $0.000225/sec | Budget inference, small image models |
| Nvidia A40 Large | $1.48/hr | $0.000411/sec | Mid-range inference workloads |
| A100 40GB | $3.15/hr | $0.000875/sec | Large language models, training |
| A100 80GB | $5.04/hr | $0.001400/sec | 70B+ parameter models, high-memory tasks |
| H100 | $5.49/hr | $0.001525/sec | Fastest single-GPU inference |
| 4x H100 | $21.96/hr | $0.006100/sec | Distributed inference, large batch jobs |
| 8x H100 | $43.92/hr | $0.012200/sec | Maximum throughput, multi-GPU training |
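The per-second column is simply the hourly rate divided by 3,600, so sanity-checking or extending the table is a one-liner (hourly rates taken from the table above):

```python
# Per-second rates are just the hourly rates divided by 3,600.
hourly_rates = {
    "cpu": 0.09,
    "t4": 0.81,
    "a40-large": 1.48,
    "a100-40gb": 3.15,
    "a100-80gb": 5.04,
    "h100": 5.49,
    "4x-h100": 21.96,
    "8x-h100": 43.92,
}

for tier, hourly in hourly_rates.items():
    print(f"{tier}: ${hourly / 3600:.6f}/sec")
    # e.g. "h100: $0.001525/sec"
```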
Hidden Costs and Considerations
Cold start latency. Models that are not actively running incur a cold start delay of 5-30 seconds while Replicate provisions the GPU. For latency-sensitive production APIs, this means either accepting occasional slow responses or keeping a model "warm" by sending periodic predictions, which adds to compute costs.
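If you take the keep-warm route, a scheduled lightweight prediction is usually enough. The sketch below assumes the official `replicate` Python client (authenticated via the `REPLICATE_API_TOKEN` environment variable) and a hypothetical model identifier; note that every ping is itself billed compute, so tune the interval against the cold-start tolerance you actually need.

```python
# Hedged sketch: keep a model warm by sending a periodic lightweight prediction.
# "your-org/your-model" is a hypothetical identifier; each ping is billed.
import time

import replicate

PING_INTERVAL_SECONDS = 240  # tune against the idle window you observe


def keep_warm() -> None:
    while True:
        try:
            # A minimal input keeps the billed compute per ping small.
            replicate.run("your-org/your-model", input={"prompt": "ping"})
        except Exception as exc:
            print(f"warm-up ping failed: {exc}")
        time.sleep(PING_INTERVAL_SECONDS)


if __name__ == "__main__":
    keep_warm()
```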
Idle time billing. Replicate bills per second of actual compute, but the clock starts when hardware is allocated, not when your code begins executing. Model loading time (downloading weights, initializing frameworks) counts as billable seconds, which is especially costly for large models that take 10-20 seconds to load.
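A useful way to reason about this is to amortize the billable load time across the predictions each boot serves. A rough sketch with illustrative (not measured) load and inference times:

```python
# Rough sketch: billable load time inflates the effective cost per prediction.
# Load and inference times below are illustrative assumptions.

def effective_cost_per_prediction(
    per_second_rate: float,
    load_seconds: float,
    predict_seconds: float,
    predictions_per_boot: int,
) -> float:
    """Amortize billable load time across the predictions served per cold boot."""
    billable_seconds = load_seconds + predict_seconds * predictions_per_boot
    return per_second_rate * billable_seconds / predictions_per_boot


# A100 80GB ($0.0014/sec), 15s load, 5s per prediction
print(effective_cost_per_prediction(0.0014, 15, 5, 1))    # 0.028  -> cold boot every call
print(effective_cost_per_prediction(0.0014, 15, 5, 100))  # ~0.0072 -> load amortized
```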
Enterprise volume discounts. Replicate offers committed spend agreements for high-volume customers. The exact discount tiers are not published, but teams spending $5,000+/month should contact Replicate sales to negotiate lower rates.
No free tier for custom models. While Replicate offers free credits for new accounts, running custom models in production requires a payment method. Public model pricing (like $0.003/image for Flux Schnell) applies regardless of volume.
Cost Estimates by Team Size
Solo developer or hobbyist: Running 1,000 image generations per month with Flux Schnell at $0.003/image costs $3/month. Occasional experimentation with larger models on T4 GPUs ($0.81/hr) for 10 hours adds $8.10. Monthly total: $11-$15.
Small startup (3-5 engineers): A team might run 50,000 Flux Schnell predictions per month ($150), a custom model on an A100 80GB for 100 hours ($504), and DeepSeek R1 processing 10M input tokens ($37.50). Monthly total: $500-$900.
Mid-size company (15-25 engineers): Production workloads might include custom models on H100 GPUs for 500 hours/month ($2,745), 500,000 Flux Schnell image generations ($1,500), and 1,000 seconds of generated Wan 2.1 video ($90). Before enterprise discounts, the monthly total is $4,000-$6,000; with committed spend discounts, expect 15-25% savings.
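The arithmetic behind these estimates is simple enough to script; the rates below are the ones quoted earlier in this article:

```python
# Back-of-the-envelope totals for the three scenarios, using rates quoted above.
scenarios = {
    "solo": {
        "flux_schnell_images": 1_000 * 0.003,    # $3.00
        "t4_hours": 10 * 0.81,                   # $8.10
    },
    "startup": {
        "flux_schnell_images": 50_000 * 0.003,   # $150.00
        "a100_80gb_hours": 100 * 5.04,           # $504.00
        "deepseek_r1_10m_tokens": 10 * 3.75,     # $37.50
    },
    "midsize": {
        "h100_hours": 500 * 5.49,                # $2,745.00
        "flux_schnell_images": 500_000 * 0.003,  # $1,500.00
        "wan_video_seconds": 1_000 * 0.09,       # $90.00
    },
}

for name, items in scenarios.items():
    print(f"{name}: ${sum(items.values()):,.2f}")
# solo: $11.10, startup: $691.50, midsize: $4,335.00
```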
How Replicate Pricing Compares
Replicate's per-second billing model differs fundamentally from competitors that charge per token or per million tokens. Direct cost comparison depends on the specific model and workload pattern.
vs. Fireworks AI: Fireworks charges per token, starting at $0.10 per 1M tokens for sub-4B parameter models and $0.90 per 1M tokens for 16B+ models. For LLM inference, Fireworks is substantially cheaper for high-throughput text workloads. Replicate's advantage is broader model support including image, video, and audio models that Fireworks does not host.
vs. Together AI: Together AI offers inference from $0.10 per 1M tokens for smaller models. For pure LLM serving, Together provides more predictable per-token pricing. Replicate's per-second hardware billing can be more cost-effective for models with variable output lengths or non-text modalities.
vs. Groq: Groq charges $0.59 per 1M input tokens and $0.79 per 1M output tokens for Llama 70B. For LLM-only workloads requiring the lowest latency, Groq undercuts Replicate on price and speed. Replicate serves a broader set of use cases beyond text generation.
Replicate's strongest cost advantage is for teams that need to run custom models (fine-tuned or proprietary) across multiple modalities. The per-second billing model works well for bursty, unpredictable workloads where you want to avoid paying for idle capacity. For teams focused purely on LLM inference at scale, token-based providers like Fireworks AI and Together AI deliver better unit economics.
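To compare the two billing models for your own workload, convert per-second hardware cost into an effective price per million tokens. The throughput and utilization figures below are assumptions to be replaced with measurements from your own model; only the H100 hourly rate comes from this article:

```python
# Hedged sketch: effective $/1M tokens when renting hardware by the second.
# Tokens/sec and utilization are assumptions, not Replicate figures.

def cost_per_million_tokens(
    hourly_rate: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """Effective price per 1M tokens for per-second hardware billing."""
    effective_tps = tokens_per_second * utilization
    seconds_per_million = 1_000_000 / effective_tps
    return (hourly_rate / 3600) * seconds_per_million


# Example: H100 at $5.49/hr, assumed 1,500 tokens/sec at 60% utilization
print(round(cost_per_million_tokens(5.49, 1_500, 0.6), 2))  # ~1.69 per 1M tokens
```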