Currently in beta — free during preview

One endpoint.
Every model.

InferWeave is a lightweight inference gateway that routes requests across LLM providers. Swap models without changing code. Add fallback chains in one line. Drop-in compatible with the OpenAI SDK.

MIT Licensed · Changelog · Launching Q3 2026

quickstart.py
from openai import OpenAI

# Just change the base URL. That's it.
client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key="iw_sk_..."  # your InferWeave key
)

response = client.chat.completions.create(
    model="gpt-4o",  # or "claude-4-sonnet", "gemini-2.5-pro"
    messages=[{"role": "user", "content": "Hello"}],

    # InferWeave-specific: automatic fallback
    extra_body={
        "fallback": ["claude-4-sonnet", "gemini-2.5-pro"],
        "budget": "$0.02"  # max cost per request
    }
)
POST https://api.inferweave.cloud/v1/chat/completions

Built for inference at scale

Not another wrapper. InferWeave runs as a stateless proxy with sub-5ms overhead. Configure routing rules, deploy, and forget.

Fallback chains

Define ordered fallback lists per request. If GPT-4o returns a 5xx or times out, the request automatically retries against your next provider. Configurable via the fallback parameter or in your inferweave.yaml.
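
If every provider in the chain fails, the error surfaces like any other OpenAI SDK call. A minimal sketch of handling that case, assuming InferWeave relays the final provider's status code once the chain is exhausted (the filename is illustrative):

fallback_errors.py
import os
from openai import OpenAI, APIStatusError

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
        extra_body={"fallback": ["claude-4-sonnet", "gemini-2.5-pro"]}
    )
except APIStatusError as e:
    # Assumption: after GPT-4o and both fallbacks fail, the gateway
    # relays the last provider's error status to the client
    print(f"Fallback chain exhausted: HTTP {e.status_code}")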


Cost-aware routing

Set a per-request budget with "budget": "$0.01". InferWeave picks the cheapest model that meets your latency and quality constraints. Route simple classification tasks to Haiku, complex reasoning to Opus.
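
A sketch of that contrast using per-request budgets (the budget values and prompts below are illustrative):

budget_routing.py
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

# Tight budget: only small, cheap models qualify
simple = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Is this email spam? ..."}],
    extra_body={"budget": "$0.001"}
)

# Looser budget: frontier models become eligible
hard = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Design a schema migration plan."}],
    extra_body={"budget": "$0.05"}
)

print(simple.model, hard.model)  # e.g. a Haiku-class vs. an Opus-class model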

Unified observability

Every request logs model used, latency (TTFB + total), input/output tokens, and cost. Export via OpenTelemetry or query the built-in dashboard. No sampling — every request, every field.
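
The same metadata is visible client-side. A sketch using the OpenAI Python SDK's raw-response interface; the x-inferweave-* header names are hypothetical, showing where a gateway could expose cost and routing data:

usage_metadata.py
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

# with_raw_response returns HTTP headers alongside the parsed body
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
response = raw.parse()

# Standard OpenAI usage fields: token counts per request
print(response.usage.prompt_tokens, response.usage.completion_tokens)

# Hypothetical gateway headers for cost and routing metadata
print(raw.headers.get("x-inferweave-cost"))
print(raw.headers.get("x-inferweave-model"))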

OpenAI SDK compatible

Swap base_url to api.inferweave.cloud/v1 and you're done. Works with the official OpenAI Python/Node SDKs, LangChain, LlamaIndex, and any client that speaks the OpenAI chat completions format.
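
For example, pointing LangChain at the gateway takes one constructor argument (a sketch using langchain-openai; model and key as in the quickstart):

langchain_quickstart.py
import os
from langchain_openai import ChatOpenAI

# Any OpenAI-compatible client works; only the base URL changes
llm = ChatOpenAI(
    model="gpt-4o",
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

print(llm.invoke("Hello").content)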

Three steps to deploy

Configuration lives in a single YAML file. No vendor lock-in, no proprietary SDK.

1

Define your routing config

Create an inferweave.yaml in your project root. Declare models, fallback chains, budgets, and retry policies.

2

Deploy with one command

Run infer deploy to push your config. Your routing rules are live in under 2 seconds, globally distributed across edge nodes.

3

Point your SDK and ship

Change your base_url to our endpoint. Existing code works unchanged. Monitor everything from the dashboard or via OTLP export.

inferweave.yaml
# inferweave.yaml
version: 1

models:
  primary: gpt-4o
  fallback:
    - claude-4-sonnet
    - gemini-2.5-pro

routing:
  strategy: cost-optimized
  max_budget_per_request: "$0.03"  # quoted to avoid confusion with ${...} interpolation
  timeout_ms: 30000
  retries: 2

observability:
  export: otlp
  endpoint: https://otel.yourinfra.dev
  sample_rate: 1.0  # 100% of requests

keys:
  openai: ${OPENAI_API_KEY}
  anthropic: ${ANTHROPIC_API_KEY}
  google: ${GOOGLE_API_KEY}

Real usage patterns

Two common patterns: streaming with automatic failover, and cost-optimized classification with a per-request budget.

Streaming with automatic failover
stream.ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.inferweave.cloud/v1",
  apiKey: process.env.INFERWEAVE_KEY,
});

// A sample prompt so the snippet runs as-is
const prompt = "Summarize the latest deploy logs.";

const stream = await client.chat.completions.create({
  model: "claude-4-sonnet",
  stream: true,
  messages: [
    { role: "user", content: prompt }
  ],
  // If Claude is down, fall back to GPT-4o.
  // The Node SDK has no extra_body, so pass the
  // gateway-specific field directly in the params.
  // @ts-expect-error -- InferWeave-specific parameter
  fallback: ["gpt-4o", "gemini-2.5-pro"]
});

for await (const chunk of stream) {
  process.stdout.write(
    chunk.choices[0]?.delta?.content ?? ""
  );
}
Cost-optimized classification
classify.py
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

# A sample ticket so the snippet runs as-is
ticket = "My card was charged twice for the same order."

# Let InferWeave pick the cheapest model
# that can handle classification
response = client.chat.completions.create(
    model="auto",  # cost-optimized selection
    messages=[{
        "role": "user",
        "content": f"Classify: {ticket}"
    }],
    extra_body={
        "budget": "$0.002",
        "prefer": "low-latency"
    }
)

# Response includes which model was used
print(response.model)
# => "claude-4-haiku" (cheapest that fit)

Simple, transparent pricing

You pay for InferWeave routing + your underlying model costs (passed through at cost, no markup).

Free
$0 / mo
For prototyping and personal projects
  • 1,000 requests / day
  • 3 model providers
  • Community Discord support
  • 7-day log retention
  • Single fallback chain
Get Started
Enterprise
Custom
For teams with advanced requirements
  • Unlimited requests
  • Dedicated infrastructure
  • SLA with uptime guarantee
  • SSO + RBAC
  • Custom model hosting
  • On-prem deployment option
  • Dedicated support engineer
Contact Sales