Rate Limiting

LLM rate limits aren't a single RPM number — they're a combination of 4 metrics, 5 layers, 4 algorithms and 5 enforcement actions.

Most teams arriving at an LLM gateway think of rate limiting as "max requests per minute". Production reality kicks in fast:

A single 100k-context request can wipe out a whole minute of TPM in one go.
Streaming calls hold the connection for 60 seconds — what's the right counter?
The platform allows 10k RPM total, but you want per-customer caps of 100 RPM.
VIP customers want looser limits — without a separate deployment.
A limit got tripped: should we reject, queue, or fall back to a cheaper model?
End of month and 5% budget left — you want to brake without halting in-flight batches.

Hydite Vtslx AO solves all of the above in a single, coherent system. This article is an opinionated decision manual. By the end you should be able to answer four questions:

Which limiting scenario do I fall into?
Which counter and algorithm should I pick?
At which layer (Org / Group / Team / Key / User) do I configure it?
When a limit is tripped, how do I triage and recover?

1. Four counters

LLM gateways can't get away with just request counts. Hydite tracks four counters, individually or combined:

Counter	Measures	Best for	Typical range
RPM (Requests / Min)	Call count	Anti-scraping, abuse caps	60 – 100,000
TPM (Tokens / Min)	prompt + completion tokens	The real compute throttle	10k – 100M
Concurrency	In-flight requests	Long-context / streaming pile-up	1 – 1,000
Budget	Cumulative USD (day / month)	Cost guardrail	$1 – $1M+

Rules of thumb:

Always set TPM. RPM is the symptom; TPM is the truth: one 200k-context call consumes as many tokens as about 200 typical 1k-token calls.
Add RPM as a sanity cap for high-frequency small calls.
Add Concurrency for long-context / streaming pile-ups: hold-open connections + 100+ concurrency = downstream GPU queue meltdown.
Budget is the master fuse. Hard cap stops the cluster; soft cap fires alerts only.
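
The rules of thumb above amount to checking every counter before a request is admitted. The following is a minimal sketch of that idea, not Hydite's implementation: the `Limits` and `Usage` records and the `first_saturated` helper are hypothetical names, and the numbers are illustrative. It shows how one long-context call can pass the RPM check and still fail on TPM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    rpm: int            # requests per minute
    tpm: int            # tokens per minute (prompt + completion)
    concurrency: int    # in-flight requests
    budget_usd: float   # cumulative spend cap


@dataclass
class Usage:
    requests: int = 0       # requests seen in the current minute
    tokens: int = 0         # tokens consumed in the current minute
    in_flight: int = 0      # requests currently open
    spend_usd: float = 0.0  # spend so far in the budget period


def first_saturated(limits: Limits, usage: Usage, tokens: int, cost_usd: float) -> Optional[str]:
    """Return the name of the first counter this request would exceed, or None."""
    if usage.requests + 1 > limits.rpm:
        return "rpm"
    if usage.tokens + tokens > limits.tpm:
        return "tpm"
    if usage.in_flight + 1 > limits.concurrency:
        return "concurrency"
    if usage.spend_usd + cost_usd > limits.budget_usd:
        return "budget"
    return None


limits = Limits(rpm=1_000, tpm=150_000, concurrency=50, budget_usd=100.0)
idle = Usage()
small = first_saturated(limits, idle, tokens=1_000, cost_usd=0.01)    # fits every counter
huge = first_saturated(limits, idle, tokens=200_000, cost_usd=2.0)    # one call, over TPM
```

With an RPM-only policy the second call would have been admitted, which is why TPM is the counter to set first.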

2. Five-layer inheritance

Organization        ← top-level red lines (contract caps, security)
     │
   Group            ← policy boundary (workload / customer / env)
     │
   Team             ← people unit (shared quotas)
     │
   Key              ← credential (frontend / backend / CI / demo)
     │
end-user-id         ← SaaS end users (one rogue user can't blow the pool)

Each layer maintains its own counter; a request must pass every layer or get blocked by the strictest one.

Concrete example:

Layer	RPM cap
Org	100,000
Group `prod-api`	60,000
Team `growth-team`	10,000
Key `frontend-web`	1,000
end-user `user_42`	30

user_42's actual rate = min(100k, 60k, 10k, 1k, 30) = 30 RPM.
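
The same arithmetic as a small sketch. The `binding_layer` helper and the layer labels are illustrative names, not part of Hydite's API; the caps are the ones from the table above.

```python
from typing import Dict, Tuple


def binding_layer(caps: Dict[str, int]) -> Tuple[str, int]:
    """Return (layer, cap) for the strictest cap among the layers."""
    layer = min(caps, key=caps.get)
    return layer, caps[layer]


caps = {
    "org": 100_000,
    "group:prod-api": 60_000,
    "team:growth-team": 10_000,
    "key:frontend-web": 1_000,
    "end-user:user_42": 30,
}
layer, cap = binding_layer(caps)   # the end-user layer binds, at 30 RPM
```

Knowing which layer binds, and not only the resulting number, is what makes the triage described below quick.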

Three operational wins:

Security sets red lines at Org, business sets policy at Group, devs tighten at Key — without stepping on each other.
No separate deployments for VIPs — drop them into a higher-quota Group.
Precise triage: every 429 response names the layer that fired, so you can pinpoint it in seconds.

3. Four algorithms: when to pick which

Algorithm	Accuracy	Burst tolerance	Overhead	Use when
Sliding Window	★★★★★	★★	Medium	Default. Highest accuracy + fairness
Token Bucket	★★★★	★★★★★	Low	Bursty client traffic that should be tolerated briefly
Leaky Bucket	★★★	★★★	Low	You want smoothing — async crawlers, flash sales
Fixed Window	★★	★	Minimal	Legacy compatibility; workloads that tolerate bursts at minute boundaries

When to deviate from the default:

All-batch backend traffic (bursty but stable) → Token Bucket, 50% burst tolerance.
Both your business and upstream GPUs hate spikes → Leaky Bucket for hard smoothing.
Old SDKs expecting minute-boundary resets → Fixed Window.
Everything else → Sliding Window.

Switch algorithms under Channels → Group Settings → Rate Limiting Algorithm; no restart is required.
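
To see why Sliding Window is the default and Fixed Window is the compromise, it helps to compare the two on a burst that straddles a minute boundary. The classes below are textbook in-memory versions written for illustration, assuming a single process and a caller-supplied clock; they are not Hydite's implementation.

```python
from collections import deque


class FixedWindow:
    """Counter that resets at every window boundary."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit, self.window = limit, window
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.current_window:   # boundary crossed: reset the count
            self.current_window, self.count = window_id, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindow:
    """Log-based sliding window: counts requests in the last `window` seconds."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit, self.window = limit, window
        self.stamps = deque()

    def allow(self, now: float) -> bool:
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()              # drop requests older than the window
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False


# 10 requests at t=59s and 10 more at t=61s, against a limit of 10 per minute.
times = [59.0] * 10 + [61.0] * 10
fixed, sliding = FixedWindow(10), SlidingWindow(10)
fixed_admitted = sum(fixed.allow(t) for t in times)       # 20: twice the limit in 2 seconds
sliding_admitted = sum(sliding.allow(t) for t in times)   # 10: the second batch is refused
```

The fixed window admits double the nominal limit within two seconds because its counter resets at t=60s; the sliding window still remembers the first batch.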

4. Five overflow behaviours

A 429 reject is the simplest response — but rarely the best one for LLM workloads:

Action	Client experience	Best for
Reject (429)	Immediate fail + `Retry-After`	Real-time UX where clients should back off
Queue	Request waits for next window	Batch / async (max 60s wait)
Fallback	Switch to cheaper / backup model	High-priority paths that mustn't 429
Soft Limit	No block, fires webhook only	Budget warnings, dashboards
Burst	Short Burst Pool punches through	Marketing flash sales, lightning trades

Queue and Fallback are Hydite's signature plays:

Queue mode: instead of erroring, Hydite holds the request in a bounded internal queue and releases it when the next sliding window opens. Clients see slightly higher latency, no error.
Fallback mode: leverages the Group's routing chain — main claude-sonnet-4-5 is full → fall back to claude-3-5-sonnet → if that's also full, drop to gpt-4o-mini. Business doesn't break.

You can mix per-key inside one Group:

key-frontend → Reject (UI should surface and retry)
key-batch → Queue (background can wait)
key-vip → Fallback (VIP never sees 429)
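
The per-key mix above can be read as a small dispatch table. The sketch below is illustrative only: `ON_EXCEED`, `pick_model` and `handle_overflow` are hypothetical names, and `is_full` stands in for whatever capacity signal the gateway uses internally. The fallback chain and the 60s queue cap are taken from the text above.

```python
from typing import Callable, Optional, Sequence

# Per-key overflow actions, mirroring the example above.
ON_EXCEED = {"key-frontend": "reject", "key-batch": "queue", "key-vip": "fallback"}


def pick_model(chain: Sequence[str], is_full: Callable[[str], bool]) -> Optional[str]:
    """Walk the fallback chain and return the first model with capacity, if any."""
    for model in chain:
        if not is_full(model):
            return model
    return None


def handle_overflow(key: str, chain: Sequence[str], is_full: Callable[[str], bool]) -> dict:
    """Decide what a limited request should do, based on the key's configured action."""
    action = ON_EXCEED.get(key, "reject")
    if action == "fallback":
        model = pick_model(chain, is_full)
        if model is not None:
            return {"action": "fallback", "model": model}
        action = "reject"  # the whole chain is full: nothing left to fall back to
    if action == "queue":
        return {"action": "queue", "max_wait_s": 60}   # 60s cap from the table above
    return {"action": "reject", "status": 429}


chain = ["claude-sonnet-4-5", "claude-3-5-sonnet", "gpt-4o-mini"]
full = {"claude-sonnet-4-5", "claude-3-5-sonnet"}
vip = handle_overflow("key-vip", chain, lambda m: m in full)
batch = handle_overflow("key-batch", chain, lambda m: m in full)
web = handle_overflow("key-frontend", chain, lambda m: m in full)
```

Here the VIP key lands on `gpt-4o-mini`, the batch key is queued, and the frontend key gets a 429. What happens when every model in the chain is full is a design choice; this sketch rejects.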

5. Six real-world recipes

5.1 Anti-bot without harming real users

Don't just look at RPM — bots betray themselves through behaviour:

group: public-website
quota:
  rpm: 100               # max 100 RPM per IP
  tpm: 50_000
  concurrency: 5
end_user_limit:
  rpm: 30                # same end-user-id capped further
network:
  ip_rate_limit: 60      # > 60 RPM per IP → CAPTCHA
on_exceed: reject

Combine with Origin checks and mTLS (see API Key Groups · Networking) — bots have nowhere to hide.

5.2 Multi-tenant SaaS

Your product has 1,000 customers; each should get max 100 RPM, total ≤ 50k RPM:

group: saas-customers
quota:
  rpm: 50_000             # global ceiling
  tpm: 10_000_000
end_user_limit:           # the magic line
  rpm: 100
  tpm: 200_000
on_exceed: reject

Pass metadata.user_id per request:

client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[...],
    extra_body={"metadata": {"user_id": "customer_42"}},
)

Hydite tracks per-customer counters automatically. No per-customer keys, no extra code.
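
Conceptually, the gateway keeps one shared counter for the Group plus one counter per `user_id`. The sketch below illustrates that bookkeeping under simplifying assumptions (a single process, fixed one-minute windows, RPM only); `TenantLimiter` is a hypothetical name and this is not how Hydite is implemented.

```python
from collections import defaultdict


class TenantLimiter:
    """Fixed one-minute windows: one shared counter plus one counter per user_id."""

    def __init__(self, group_rpm: int, user_rpm: int):
        self.group_rpm, self.user_rpm = group_rpm, user_rpm
        self.minute = None
        self.group_count = 0
        self.user_counts = defaultdict(int)

    def allow(self, user_id: str, now: float) -> bool:
        minute = int(now // 60)
        if minute != self.minute:              # new minute: reset all counters
            self.minute, self.group_count = minute, 0
            self.user_counts.clear()
        if self.group_count >= self.group_rpm or self.user_counts[user_id] >= self.user_rpm:
            return False
        self.group_count += 1
        self.user_counts[user_id] += 1
        return True


limiter = TenantLimiter(group_rpm=50_000, user_rpm=100)
noisy = sum(limiter.allow("customer_42", now=0.0) for _ in range(150))   # capped at 100
quiet = sum(limiter.allow("customer_7", now=0.0) for _ in range(20))     # unaffected: 20
```

The noisy customer is cut off at 100 requests while the quiet one is untouched, which is the isolation the `end_user_limit` block is meant to provide.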

5.3 Same Group, isolated blast radius

Group: prod-api  (60k RPM)
├─ Key: frontend-web   → 5k RPM, 1M TPM, cheap models only
├─ Key: backend-batch  → 30k RPM, 5M TPM, all models, Queue mode
└─ Key: ci-bot         → 100 RPM, $20/day budget

A frontend DDoS doesn't crush the backend batch; a buggy CI script doesn't burn the whole budget.

5.4 VIP tier without a second deployment

Just create a higher-quota Group:

group: vip-tier-platinum
quota:
  rpm: 10_000             # 10x baseline
  tpm: 5_000_000
on_exceed: fallback        # VIPs never see 429
fallback_chain:
  - claude-sonnet-4-5
  - claude-3-5-sonnet
  - gpt-4o-mini

Keys minted in this Group inherit the high quota and never-fail UX.

5.5 Marketing flash event

Midnight flash → 100x traffic spike. Burst Pool to the rescue:

group: campaign-double11
quota:
  rpm: 1_000              # baseline
burst:
  enabled: true
  multiplier: 50           # allows 50,000 RPM
  duration: 600            # for 10 minutes
  triggers_per_day: 2
  cooldown: 1800
on_exceed: queue           # within burst, still queue if exceeded

Cost protected, customer experience intact.
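
One plausible reading of the `burst` block, sketched as a small state machine. The semantics here are assumptions, not documented behaviour: a burst opens when demand exceeds the baseline, lasts `duration` seconds, and a new one may open only after `cooldown` seconds have passed since the previous burst ended. The daily reset of `triggers_per_day` is left out, and `BurstPool` is a hypothetical name.

```python
class BurstPool:
    """Tracks when a burst is active and what RPM cap applies."""

    def __init__(self, baseline_rpm, multiplier, duration, triggers_per_day, cooldown):
        self.baseline_rpm = baseline_rpm
        self.multiplier = multiplier
        self.duration = duration              # seconds a burst stays open
        self.triggers_left = triggers_per_day
        self.cooldown = cooldown              # seconds to wait after a burst ends
        self.burst_until = float("-inf")      # no burst has happened yet

    def cap(self, now: float, demand_rpm: int) -> int:
        """Return the RPM cap in force at `now`, opening a burst if allowed."""
        if now < self.burst_until:
            return self.baseline_rpm * self.multiplier      # burst still open
        cooled_down = now >= self.burst_until + self.cooldown
        if demand_rpm > self.baseline_rpm and self.triggers_left > 0 and cooled_down:
            self.triggers_left -= 1
            self.burst_until = now + self.duration
            return self.baseline_rpm * self.multiplier
        return self.baseline_rpm


pool = BurstPool(baseline_rpm=1_000, multiplier=50, duration=600,
                 triggers_per_day=2, cooldown=1800)
at_spike = pool.cap(now=0, demand_rpm=40_000)        # opens the first burst
mid_burst = pool.cap(now=300, demand_rpm=40_000)     # still inside the 10 minutes
in_cooldown = pool.cap(now=900, demand_rpm=40_000)   # burst over, cooling down
second = pool.cap(now=2_400, demand_rpm=40_000)      # 600 + 1800 elapsed: second burst
```

Under these assumptions the cap is 50,000 RPM during each burst and falls back to 1,000 RPM in between, so worst-case spend per day stays bounded by two ten-minute bursts.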

5.6 End-of-month braking

5% budget left, 3 days to go — slow down without halting:

group: prod-api
quota:
  monthly_budget_usd: 50_000
soft_budget_usd: 47_500
on_budget_exceeded: queue
on_budget_breach:
  threshold_pct: 100
  action: reject_writes_only

Webhook pings finance; devs get headroom; nobody gets a hard stop.
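
The two thresholds in this recipe split spend into three zones. A minimal sketch of that classification, using the numbers from the config above; `budget_state` is an illustrative helper, not a Hydite function.

```python
def budget_state(spend_usd: float, soft_usd: float, hard_usd: float) -> str:
    """Classify spend: 'ok' below the soft cap, 'soft' until the hard cap, then 'hard'."""
    if spend_usd >= hard_usd:
        return "hard"      # 100% of budget: the breach action applies
    if spend_usd >= soft_usd:
        return "soft"      # past the soft cap: alert and slow down
    return "ok"


soft_pct = 47_500 * 100 / 50_000          # soft cap as a share of the monthly budget
states = [budget_state(s, 47_500, 50_000) for s in (40_000, 48_000, 50_000)]
```

The soft cap sits at 95% of the monthly budget, which matches the "5% budget left" scenario: queueing starts there, and the stricter breach action only applies at 100%.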

6. End-user limits: the most underrated capability

If you build any kind of AI-powered SaaS, this is the section to read twice.

Hydite lets you rate-limit your customers' customers. Pass an end-user-id with every request:

client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[...],
    user="customer_42",                                     # OpenAI native field
    extra_body={"metadata": {"user_id": "customer_42"}},    # or via metadata
)

Hydite maintains per-end-user counters. Every counter — RPM / TPM / Budget — can be applied at this layer.

Value	Outcome
Stop one runaway end-user from DDoS-ing your platform	Your service stays fair
Build per-user metering products	Read per-user dollars from the dashboard directly
Free / Paid / VIP tiers	Different `metadata.tier` → different Group
Compliance per-user audit trail	First-class support

Strongly recommended for any SaaS / PaaS / agent product exposing LLM capabilities. It removes ~90% of "customer becomes the incident source" risk architecturally.

7. Time-window limiting

Many businesses have tidal traffic, with predictable peaks and troughs:

group: prod-api
quota:
  rpm: 10_000
schedule:
  - cron: "0 0-7 * * *"      # 12am – 7am
    rpm: 500
  - cron: "0 9-18 * * 1-5"   # weekday business hours
    rpm: 30_000
holidays:
  - "2025-01-29 to 2025-02-04"  # Lunar New Year
    rpm: 200

Useful for:

Geo / timezone-sensitive campaigns
Cost shaping during off-peak
Regulator-mandated lockdowns during sensitive dates

The Group → Schedule editor in the dashboard ships common templates — you don't need to write cron.
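
To reason about which cap applies at a given moment, the schedule can be thought of as an ordered list of overrides on top of the base quota. The sketch below makes several assumptions that the config above does not spell out: each cron entry is treated as covering its whole hour range, the first matching rule wins, and holidays are ignored. `RULES` and `rpm_at` are illustrative names.

```python
from datetime import datetime

BASE_RPM = 10_000

# (hours, weekdays, rpm): simplified stand-ins for the cron entries above.
# Weekdays use datetime.weekday(): Monday is 0, Sunday is 6.
RULES = [
    (range(0, 8), range(0, 7), 500),        # hours 0-7, every day
    (range(9, 19), range(0, 5), 30_000),    # hours 9-18, Monday to Friday
]


def rpm_at(moment: datetime) -> int:
    """Return the RPM cap in force at `moment`; the first matching rule wins."""
    for hours, weekdays, rpm in RULES:
        if moment.hour in hours and moment.weekday() in weekdays:
            return rpm
    return BASE_RPM


night = rpm_at(datetime(2025, 3, 3, 2, 30))      # Monday 02:30
office = rpm_at(datetime(2025, 3, 3, 10, 0))     # Monday 10:00
weekend = rpm_at(datetime(2025, 3, 8, 10, 0))    # Saturday 10:00
```

Under these assumptions the Monday night call is capped at 500 RPM, the Monday morning call at 30,000 RPM, and the Saturday call falls through to the 10,000 RPM base quota.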

8. The 429 contract

When a limit fires, every API returns the OpenAI-style error body plus a set of headers:

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 12
X-RateLimit-Limit-Requests: 1000
X-RateLimit-Limit-Tokens: 200000
X-RateLimit-Remaining-Requests: 0
X-RateLimit-Remaining-Tokens: 0
X-RateLimit-Reset-Requests: 12s
X-RateLimit-Reset-Tokens: 12s
X-Hydite-Limit-Layer: group
X-Hydite-Limit-Group: grp_acme_prod
X-Hydite-Limit-Counter: rpm

{
  "error": {
    "type": "rate_limit_error",
    "code": "rpm_limit",
    "message": "Rate limit exceeded for group grp_acme_prod (1000 RPM). Retry in 12s.",
    "param": null,
    "_extra": {
      "layer": "group",
      "limit": 1000,
      "remaining": 0,
      "reset_seconds": 12,
      "counter": "rpm"
    }
  }
}

The Hydite extension headers are gold for triage:

X-Hydite-Limit-Layer → which layer fired (org / group / team / key / user).
X-Hydite-Limit-Counter → which counter saturated (rpm / tpm / concurrency / budget).

Three-second triage, zero log digging.
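
A client or log pipeline can turn that contract into a one-line summary. The sketch below reads only the header and body fields shown above; the `triage` helper and its output keys are illustrative, and falling back to the `_extra` body fields when a header is missing is an assumption about how you might want to degrade.

```python
import json


def triage(headers: dict, body: str) -> dict:
    """Summarise a 429: which layer and counter fired, and how long to wait."""
    h = {k.lower(): v for k, v in headers.items()}      # header names are case-insensitive
    extra = json.loads(body).get("error", {}).get("_extra", {})
    return {
        "layer": h.get("x-hydite-limit-layer", extra.get("layer")),
        "counter": h.get("x-hydite-limit-counter", extra.get("counter")),
        "retry_after_s": float(h.get("retry-after", extra.get("reset_seconds", 0))),
    }


headers = {
    "Retry-After": "12",
    "X-Hydite-Limit-Layer": "group",
    "X-Hydite-Limit-Counter": "rpm",
}
body = json.dumps({
    "error": {
        "type": "rate_limit_error",
        "code": "rpm_limit",
        "_extra": {"layer": "group", "counter": "rpm", "reset_seconds": 12},
    }
})
report = triage(headers, body)
```

For the sample response this yields layer `group`, counter `rpm` and a 12 second wait, which is enough to route the alert to whoever owns the Group policy.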

9. Client-side best practices

9.1 Honour `Retry-After`

Don't hardcode 60s, don't retry instantly. Retry-After (seconds) is Hydite's computed safe retry time:

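import random
import time
import openai

def call_with_retry(fn, max_attempts=5):
    """Call fn(), retrying on 429 and honouring Retry-After."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except openai.RateLimitError as e:
            if attempt == max_attempts - 1:
                break  # out of attempts: don't sleep before giving up
            # Prefer the server's Retry-After; otherwise back off exponentially.
            wait = float(e.response.headers.get("retry-after", 2 ** attempt))
            time.sleep(wait + random.random())  # 0-1s jitter to avoid stampede
    raise RuntimeError("Exhausted retries")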

9.2 Self-throttle with a Token Bucket

If you know your quota (e.g. 1000 RPM), throttle on the client first so traffic leaves your process smoothed:

Fewer 429s, fewer round-trips
Better tail latency
Usually only a few lines with a library such as Resilience4j, aiolimiter or p-throttle.
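
If you would rather not add a dependency, a token bucket is small enough to write by hand. The following is a minimal single-threaded sketch using only the standard library; `TokenBucket` and the burst capacity of 20 are illustrative choices, and the injectable `clock` exists only so the demo does not need to sleep.

```python
import time


class TokenBucket:
    """Client-side throttle: refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# 1000 RPM is about 16.7 requests per second; allow short bursts of 20.
fake_now = [0.0]                                        # injectable clock for the demo
bucket = TokenBucket(rate=1000 / 60, capacity=20, clock=lambda: fake_now[0])
burst = sum(bucket.try_acquire() for _ in range(30))    # only the bucket's 20 pass
fake_now[0] = 0.75                                      # 0.75s later: about 12 tokens refilled
after_wait = sum(bucket.try_acquire() for _ in range(30))
```

In real use you would drop the `clock` argument and sleep briefly when `try_acquire()` returns `False`, so requests leave the process at roughly the quota rate.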

9.3 Streaming clients must listen for `error`

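A streaming call cut off mid-flight by a limit will close the SSE stream. Always handle the error and bail. With the OpenAI Node SDK, a stream error surfaces as an exception thrown from the `for await` loop, so wrap the loop in `try` / `catch`:

try {
  const stream = await openai.chat.completions.create({...})
  for await (const chunk of stream) {
    // ...
  }
} catch (err) {
  // A limit hit mid-stream lands here: stop reading, then surface or retry
  console.error("stream aborted:", err)
}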

10. Observability

Dashboard → Overview → Rate Limits offers:

Live counter dashboard — current RPM / TPM / Concurrency / Budget utilisation per Group / Key.
Event stream — last 24h of every 429 / Queue / Fallback / Burst event with timestamp, layer, counter, source.
Top-N offenders — which Keys, IPs or end-users keep tripping limits.
Trends — limit-hit rate, quota utilisation, burst trigger count.

API:

GET /spend/rate-limits/timeseries?group_id=...
GET /spend/rate-limits/events?layer=group&counter=tpm
GET /spend/rate-limits/topn?n=20&dim=user_id

Plus push to Prometheus / Datadog (see Shared Edge · Observability).

11. Tier matrix

Capability	Shared	Subscription Pro / Team / Business	Enterprise
RPM / TPM	Platform preset	Tiered cap	Unlimited
Concurrency limit	—	✅	✅
Monthly budget	✅	✅	✅
Soft budget	—	✅	✅
End-user limits	—	✅	✅
Queue / Fallback modes	Reject	Reject + Queue	All modes
Burst Pool	—	Business+	✅
Schedule / holidays	—	Business+	✅
Algorithm choice	Sliding Window	All four	All four
Webhook alerting	Email	Multi-channel	Multi-channel + SIEM

See Shared Edge Instance and Dedicated Instance.

12. Anti-patterns

❌ RPM without TPM: a single 200k-context call wipes you out invisibly.
❌ Hard budget without soft budget: hits the limit, everything stops, no warning.
❌ End-user identity set in Key metadata instead of the request payload: when one Key serves 1,000 users, the gateway cannot tell them apart, so per-user limits mean nothing.
❌ Reject-everywhere: VIPs deserve Fallback, not 429.
❌ Clients ignoring Retry-After: turns the gateway into a self-DDoS target.
❌ Skipping Org / Group, putting everything on Keys: upgrades, reuse, resale all get painful.
❌ Streaming without error listener: client hangs, UX dies.
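
Two of these anti-patterns are visible in the policy itself and can be caught before deployment. The sketch below assumes the policy has been loaded into a dict shaped like the YAML examples in this article; `lint_policy` is a hypothetical helper, not a Hydite tool, and it covers only the first two items in the list.

```python
from typing import List


def lint_policy(policy: dict) -> List[str]:
    """Return warnings for anti-patterns that are visible in a Group policy dict."""
    quota = policy.get("quota", {})
    warnings = []
    if "rpm" in quota and "tpm" not in quota:
        warnings.append("rpm without tpm")
    if "monthly_budget_usd" in quota and "soft_budget_usd" not in policy:
        warnings.append("hard budget without soft budget")
    return warnings


risky = {
    "group": "prod-api",
    "quota": {"rpm": 1_000, "monthly_budget_usd": 500},
    "on_exceed": "reject",
}
sound = {
    "group": "prod-api",
    "quota": {"rpm": 60_000, "tpm": 20_000_000, "monthly_budget_usd": 50_000},
    "soft_budget_usd": 47_500,
    "on_exceed": "reject",
}
risky_warnings = lint_policy(risky)   # both checks fire
sound_warnings = lint_policy(sound)   # no warnings
```

The remaining anti-patterns concern client behaviour and org structure, so they need code review and dashboards, not a config check.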

13. Production-ready starter

# Golden config for a typical prod Group
group: prod-api
quota:
  rpm: 60_000
  tpm: 20_000_000
  concurrency: 200
  monthly_budget_usd: 50_000
soft_budget_usd: 47_500
algorithm: sliding_window
on_exceed: reject
on_budget_exceeded: queue
end_user_limit:
  rpm: 100
  tpm: 200_000
schedule:
  - cron: "0 9-18 * * 1-5"
    rpm: 90_000
alerts:
  webhook: https://hooks.acme.com/hydite-alerts
  thresholds:
    budget: 80
    rpm_utilization: 90

Drop it into Group → Edit Policy and you're live.

Next steps

Multi-key governance basics → API Key Groups
Branded subdomains → Custom Domains
Endpoints and headers → API Reference
