Currently in beta — free during preview

One endpoint.
Every model.

InferWeave is a lightweight inference gateway that routes requests across LLM providers. Swap models without changing code. Add fallback chains in one line. Drop-in compatible with the OpenAI SDK.

MIT Licensed · Changelog · Launching Q3 2026

quickstart.py
from openai import OpenAI

# Just change the base URL. That's it.
client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key="iw_sk_..."  # your InferWeave key
)

response = client.chat.completions.create(
    model="gpt-4o",  # or "claude-4-sonnet", "gemini-2.5-pro"
    messages=[{"role": "user", "content": "Hello"}],

    # InferWeave-specific: automatic fallback
    extra_body={
        "fallback": ["claude-4-sonnet", "gemini-2.5-pro"],
        "budget": "$0.02"  # max cost per request
    }
)
POST https://api.inferweave.cloud/v1/chat/completions

Built for inference at scale

Not another wrapper. InferWeave runs as a stateless proxy with sub-5ms overhead. Configure routing rules, deploy, and forget.

Fallback chains

Define ordered fallback lists per request. If GPT-4o returns a 5xx or times out, the request automatically retries against your next provider. Configurable via the fallback parameter or in your inferweave.yaml.
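
If every provider in the chain fails, the error surfaces like any other OpenAI SDK call. A minimal sketch of handling that case, assuming InferWeave relays the final provider's status code once the chain is exhausted (the filename is illustrative):

fallback_errors.py
import os
from openai import OpenAI, APIStatusError

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
        extra_body={"fallback": ["claude-4-sonnet", "gemini-2.5-pro"]}
    )
except APIStatusError as e:
    # Assumption: after GPT-4o and both fallbacks fail, the gateway
    # relays the last provider's error status to the client
    print(f"Fallback chain exhausted: HTTP {e.status_code}")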


Cost-aware routing

Set a per-request budget with "budget": "$0.01". InferWeave picks the cheapest model that meets your latency and quality constraints. Route simple classification tasks to Haiku, complex reasoning to Opus.
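
A sketch of that contrast using per-request budgets (the budget values and prompts below are illustrative):

budget_routing.py
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

# Tight budget: only small, cheap models qualify
simple = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Is this email spam? ..."}],
    extra_body={"budget": "$0.001"}
)

# Looser budget: frontier models become eligible
hard = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Design a schema migration plan."}],
    extra_body={"budget": "$0.05"}
)

print(simple.model, hard.model)  # e.g. a Haiku-class vs. an Opus-class model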

Unified observability

Every request logs model used, latency (TTFB + total), input/output tokens, and cost. Export via OpenTelemetry or query the built-in dashboard. No sampling — every request, every field.
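
The same metadata is visible client-side. A sketch using the OpenAI Python SDK's raw-response interface; the x-inferweave-* header names are hypothetical, showing where a gateway could expose cost and routing data:

usage_metadata.py
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

# with_raw_response returns HTTP headers alongside the parsed body
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
response = raw.parse()

# Standard OpenAI usage fields: token counts per request
print(response.usage.prompt_tokens, response.usage.completion_tokens)

# Hypothetical gateway headers for cost and routing metadata
print(raw.headers.get("x-inferweave-cost"))
print(raw.headers.get("x-inferweave-model"))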

OpenAI SDK compatible

Swap base_url to api.inferweave.cloud/v1 and you're done. Works with the official OpenAI Python/Node SDKs, LangChain, LlamaIndex, and any client that speaks the OpenAI chat completions format.
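
For example, pointing LangChain at the gateway takes one constructor argument (a sketch using langchain-openai; model and key as in the quickstart):

langchain_quickstart.py
import os
from langchain_openai import ChatOpenAI

# Any OpenAI-compatible client works; only the base URL changes
llm = ChatOpenAI(
    model="gpt-4o",
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

print(llm.invoke("Hello").content)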

Three steps to deploy

Configuration lives in a single YAML file. No vendor lock-in, no proprietary SDK.

1

Define your routing config

Create an inferweave.yaml in your project root. Declare models, fallback chains, budgets, and retry policies.

2

Deploy with one command

Run infer deploy to push your config. Your routing rules are live in under 2 seconds, globally distributed across edge nodes.

3

Point your SDK and ship

Change your base_url to our endpoint. Existing code works unchanged. Monitor everything from the dashboard or via OTLP export.

inferweave.yaml
# inferweave.yaml
version: 1

models:
  primary: gpt-4o
  fallback:
    - claude-4-sonnet
    - gemini-2.5-pro

routing:
  strategy: cost-optimized
  max_budget_per_request: "$0.03"  # quoted to avoid confusion with ${...} interpolation
  timeout_ms: 30000
  retries: 2

observability:
  export: otlp
  endpoint: https://otel.yourinfra.dev
  sample_rate: 1.0  # 100% of requests

keys:
  openai: ${OPENAI_API_KEY}
  anthropic: ${ANTHROPIC_API_KEY}
  google: ${GOOGLE_API_KEY}

Real usage patterns

Two common patterns: streaming with automatic failover, and cost-optimized classification with a per-request budget.

Streaming with automatic failover
stream.ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.inferweave.cloud/v1",
  apiKey: process.env.INFERWEAVE_KEY,
});

// A sample prompt so the snippet runs as-is
const prompt = "Summarize the latest deploy logs.";

const stream = await client.chat.completions.create({
  model: "claude-4-sonnet",
  stream: true,
  messages: [
    { role: "user", content: prompt }
  ],
  // If Claude is down, fall back to GPT-4o.
  // The Node SDK has no extra_body, so pass the
  // gateway-specific field directly in the params.
  // @ts-expect-error -- InferWeave-specific parameter
  fallback: ["gpt-4o", "gemini-2.5-pro"]
});

for await (const chunk of stream) {
  process.stdout.write(
    chunk.choices[0]?.delta?.content ?? ""
  );
}
Cost-optimized classification
classify.py
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

# A sample ticket so the snippet runs as-is
ticket = "My card was charged twice for the same order."

# Let InferWeave pick the cheapest model
# that can handle classification
response = client.chat.completions.create(
    model="auto",  # cost-optimized selection
    messages=[{
        "role": "user",
        "content": f"Classify: {ticket}"
    }],
    extra_body={
        "budget": "$0.002",
        "prefer": "low-latency"
    }
)

# Response includes which model was used
print(response.model)
# => "claude-4-haiku" (cheapest that fit)

Simple, transparent pricing

You pay for InferWeave routing + your underlying model costs (passed through at cost, no markup).

Free
$0 / mo
For prototyping and personal projects
  • 1,000 requests / day
  • 3 model providers
  • Community Discord support
  • 7-day log retention
  • Single fallback chain
Get Started
Enterprise
Custom
For teams with advanced requirements
  • Unlimited requests
  • Dedicated infrastructure
  • SLA with uptime guarantee
  • SSO + RBAC
  • Custom model hosting
  • On-prem deployment option
  • Dedicated support engineer
Contact Sales