Rate Limiting
LLM rate limits aren't a single RPM number — they're a combination of 4 metrics, 5 layers, 4 algorithms and 5 enforcement actions.
Most teams arriving at an LLM gateway think of rate limiting as "max requests per minute". Production reality kicks in fast:
- A single 100k-context request can wipe out a whole minute of TPM in one go.
- Streaming calls hold the connection for 60 seconds — what's the right counter?
- The platform allows 10k RPM total, but you want per-customer caps of 100 RPM.
- VIP customers want looser limits — without a separate deployment.
- A limit got tripped: should we reject, queue, or fall back to a cheaper model?
- End of month and 5% budget left — you want to brake without halting in-flight batches.
Hydite Vtslx AO solves all of the above in a single, coherent system. This article is an opinionated decision manual. By the end you should be able to answer four questions:
- Which limiting scenario do I fall into?
- Which counter and algorithm should I pick?
- At which layer (Org / Group / Team / Key / User) do I configure it?
- When a limit is tripped, how do I triage and recover?
1. Four counters
LLM gateways can't get away with just request counts. Hydite tracks four counters, individually or combined:
| Counter | Measures | Best for | Typical range |
|---|---|---|---|
| RPM (Requests / Min) | Call count | Anti-scraping, abuse caps | 60 – 100,000 |
| TPM (Tokens / Min) | prompt + completion tokens | The real compute throttle | 10k – 100M |
| Concurrency | In-flight requests | Long-context / streaming pile-up | 1 – 1,000 |
| Budget | Cumulative USD (day / month) | Cost guardrail | $1 – $1M+ |
Rules of thumb:
- Always set TPM. RPM is the symptom; TPM is the truth — one 200k-context call equals 200 normal calls.
- Add RPM as a sanity cap for high-frequency small calls.
- Add Concurrency for long-context / streaming pile-ups: hold-open connections + 100+ concurrency = downstream GPU queue meltdown.
- Budget is the master fuse. Hard cap stops the cluster; soft cap fires alerts only.
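To make the "RPM is the symptom; TPM is the truth" rule concrete, here is a minimal sketch of an admission check over three of the per-minute counters. The `Usage` and `Quota` types and the `first_violation` helper are hypothetical names for illustration, not part of Hydite's API, and the request's token count is assumed to be estimated up front.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    requests: int   # requests already counted in the current minute
    tokens: int     # tokens already counted in the current minute
    in_flight: int  # requests currently open


@dataclass
class Quota:
    rpm: int
    tpm: int
    concurrency: int


def first_violation(u: Usage, q: Quota, request_tokens: int) -> Optional[str]:
    """Return the name of the first counter this request would exceed, or None."""
    if u.requests + 1 > q.rpm:
        return "rpm"
    if u.tokens + request_tokens > q.tpm:
        return "tpm"
    if u.in_flight + 1 > q.concurrency:
        return "concurrency"
    return None


quota = Quota(rpm=1_000, tpm=200_000, concurrency=5)
usage = Usage(requests=3, tokens=50_000, in_flight=1)

# Only 3 requests used so far, yet one 200k-token call is blocked by TPM.
big = first_violation(usage, quota, request_tokens=200_000)
small = first_violation(usage, quota, request_tokens=1_000)
```

With an RPM-only quota the large request would have passed unnoticed, which is why TPM should always be set.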
2. Five-layer inheritance

```text
Organization   ← top-level red lines (contract caps, security)
     │
   Group       ← policy boundary (workload / customer / env)
     │
   Team        ← people unit (shared quotas)
     │
    Key        ← credential (frontend / backend / CI / demo)
     │
end-user-id    ← SaaS end users (one rogue user can't blow the pool)
```

Each layer maintains its own counter; a request must pass every layer or get blocked by the strictest one.
Concrete example:
| Layer | RPM cap |
|---|---|
| Org | 100,000 |
| Group prod-api | 60,000 |
| Team growth-team | 10,000 |
| Key frontend-web | 1,000 |
| end-user user_42 | 30 |
user_42's actual rate = min(100k, 60k, 10k, 1k, 30) = 30 RPM.
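The same calculation as a minimal sketch: the effective cap is the minimum across the chain, and the layer holding that minimum is the one that would fire. The `effective_limit` helper and the dictionary keys are illustrative assumptions, not a Hydite API.

```python
from typing import Dict, Tuple


def effective_limit(layer_caps: Dict[str, int]) -> Tuple[str, int]:
    """Return (layer, cap) for the strictest layer in the chain."""
    layer = min(layer_caps, key=layer_caps.get)
    return layer, layer_caps[layer]


caps = {
    "org": 100_000,
    "group:prod-api": 60_000,
    "team:growth-team": 10_000,
    "key:frontend-web": 1_000,
    "end-user:user_42": 30,
}

strictest = effective_limit(caps)
print(strictest)  # ('end-user:user_42', 30)
```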
Three operational wins:
- Security sets red lines at Org, business sets policy at Group, devs tighten at Key — without stepping on each other.
- No separate deployments for VIPs — drop them into a higher-quota Group.
- Precise triage: 429 responses tell you exactly which layer fired, in 10 seconds.
3. Four algorithms: when to pick which
| Algorithm | Accuracy | Burst tolerance | Overhead | Use when |
|---|---|---|---|---|
| Sliding Window | ★★★★★ | ★★ | Medium | Default. Highest accuracy + fairness |
| Token Bucket | ★★★★ | ★★★★★ | Low | Bursty client traffic that should be tolerated briefly |
| Leaky Bucket | ★★★ | ★★★ | Low | You want smoothing — async crawlers, flash sales |
| Fixed Window | ★★ | ★ | Minimal | Legacy compatibility; workloads that tolerate bursts at minute boundaries |
When to deviate from the default:
- All-batch backend traffic (bursty but stable) → Token Bucket, 50% burst tolerance.
- Both your business and upstream GPUs hate spikes → Leaky Bucket for hard smoothing.
- Old SDKs expecting minute-boundary resets → Fixed Window.
- Everything else → Sliding Window.
Switch in Channels → Group Settings → Rate Limiting Algorithm with no restart.
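For intuition about why Sliding Window is the most accurate option, here is a minimal sliding-window-log sketch. It is an illustration under simple assumptions (single process, timestamps passed in by the caller), not Hydite's implementation; the class name `SlidingWindowLimiter` is hypothetical.

```python
from collections import deque
from typing import Deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any rolling `window_s`-second span."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._stamps: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the rolling window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) < self.limit:
            self._stamps.append(now)
            return True
        return False


limiter = SlidingWindowLimiter(limit=2, window_s=60.0)
results = [limiter.allow(t) for t in (58.0, 59.0, 61.0, 118.0)]
```

A Fixed Window counter that resets at t=60 would admit the request at t=61 even though two requests arrived only seconds earlier; the rolling window rejects it and admits the next one only once the request from t=58 has aged out.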
4. Five overflow behaviours
A 429 reject is the simplest response — but rarely the best one for LLM workloads:
| Action | Client experience | Best for |
|---|---|---|
| Reject (429) | Immediate fail + Retry-After | Real-time UX where clients should back off |
| Queue | Request waits for next window | Batch / async (max 60s wait) |
| Fallback | Switch to cheaper / backup model | High-priority paths that mustn't 429 |
| Soft Limit | No block, fires webhook only | Budget warnings, dashboards |
| Burst | Short Burst Pool punches through | Marketing flash sales, lightning trades |
Queue and Fallback are Hydite's signature plays:
- Queue mode: instead of erroring, Hydite holds the request in a bounded internal queue and releases it when the next sliding window opens. Clients see slightly higher latency, no error.
- Fallback mode: leverages the Group's routing chain: main `claude-sonnet-4-5` is full → fall back to `claude-3-5-sonnet` → if that's also full, drop to `gpt-4o-mini`. Business doesn't break.
You can mix per-key inside one Group:
- `key-frontend` → Reject (UI should surface and retry)
- `key-batch` → Queue (background can wait)
- `key-vip` → Fallback (VIP never sees 429)
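The Fallback behaviour described above amounts to walking the routing chain until a model has capacity. A minimal sketch, assuming a hypothetical `has_capacity` predicate supplied by the caller (how Hydite decides this internally is not covered here):

```python
from typing import Callable, Optional, Sequence


def pick_model(
    chain: Sequence[str], has_capacity: Callable[[str], bool]
) -> Optional[str]:
    """Return the first model in the chain with capacity, or None if all are full."""
    for model in chain:
        if has_capacity(model):
            return model
    return None


chain = ["claude-sonnet-4-5", "claude-3-5-sonnet", "gpt-4o-mini"]
saturated = {"claude-sonnet-4-5", "claude-3-5-sonnet"}

choice = pick_model(chain, lambda m: m not in saturated)
print(choice)  # gpt-4o-mini
```

When the function returns `None`, every model in the chain is saturated, and the caller still needs a last-resort action such as Reject or Queue.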
5. Six real-world recipes
5.1 Anti-bot without harming real users
Don't just look at RPM — bots betray themselves through behaviour:
```yaml
group: public-website
quota:
  rpm: 100            # max 100 RPM per IP
  tpm: 50_000
  concurrency: 5
end_user_limit:
  rpm: 30             # same end-user-id capped further
network:
  ip_rate_limit: 60   # > 60 RPM per IP → CAPTCHA
on_exceed: reject
```

Combine with Origin checks and mTLS (see API Key Groups · Networking) so bots have nowhere to hide.
5.2 Multi-tenant SaaS
Your product has 1,000 customers; each should get max 100 RPM, total ≤ 50k RPM:
```yaml
group: saas-customers
quota:
  rpm: 50_000         # global ceiling
  tpm: 10_000_000
end_user_limit:       # the magic line
  rpm: 100
  tpm: 200_000
on_exceed: reject
```

Pass metadata.user_id per request:

```python
client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[...],
    extra_body={"metadata": {"user_id": "customer_42"}},
)
```

Hydite tracks per-customer counters automatically. No per-customer keys, no extra code.
5.3 Same Group, isolated blast radius

```text
Group: prod-api (60k RPM)
├─ Key: frontend-web   → 5k RPM, 1M TPM, cheap models only
├─ Key: backend-batch  → 30k RPM, 5M TPM, all models, Queue mode
└─ Key: ci-bot         → 100 RPM, $20/day budget
```

A frontend DDoS doesn't crush the backend batch; a buggy CI script doesn't burn the whole budget.
5.4 VIP tier without a second deployment
Just create a higher-quota Group:

```yaml
group: vip-tier-platinum
quota:
  rpm: 10_000         # 10x baseline
  tpm: 5_000_000
on_exceed: fallback   # VIPs never see 429
fallback_chain:
  - claude-sonnet-4-5
  - claude-3-5-sonnet
  - gpt-4o-mini
```

Keys minted in this Group inherit the high quota and never-fail UX.
5.5 Marketing flash event
Midnight flash → 100x traffic spike. Burst Pool to the rescue:

```yaml
group: campaign-double11
quota:
  rpm: 1_000          # baseline
burst:
  enabled: true
  multiplier: 50      # allows 50,000 RPM
  duration: 600       # for 10 minutes
  triggers_per_day: 2
  cooldown: 1800
on_exceed: queue      # within burst, still queue if exceeded
```

Cost protected, customer experience intact.
5.6 End-of-month braking
5% budget left, 3 days to go — slow down without halting:
```yaml
group: prod-api
quota:
  monthly_budget_usd: 50_000
soft_budget_usd: 47_500
on_budget_exceeded: queue
on_budget_breach:
  threshold_pct: 100
  action: reject_writes_only
```

Webhook pings finance; devs get headroom; nobody gets a hard stop.
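The two thresholds in this recipe split spend into three zones. A minimal sketch of that classification, using the recipe's figures (a soft budget of 47,500 is 95% of 50,000); the `budget_state` helper and its return labels are hypothetical, chosen only to illustrate the logic:

```python
def budget_state(
    spent_usd: float, monthly_budget_usd: float, soft_budget_usd: float
) -> str:
    """Classify spend as 'ok', 'soft' (warn and slow down) or 'hard' (cap reached)."""
    if spent_usd >= monthly_budget_usd:
        return "hard"
    if spent_usd >= soft_budget_usd:
        return "soft"
    return "ok"


MONTHLY, SOFT = 50_000, 47_500

states = [budget_state(s, MONTHLY, SOFT) for s in (40_000, 47_500, 50_000)]
print(states)  # ['ok', 'soft', 'hard']
```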
6. End-user limits: the most underrated capability
If you build any kind of AI-powered SaaS, this is the section to read twice.
Hydite lets you rate-limit your customers' customers. Pass an end-user-id with every request:
```python
client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[...],
    user="customer_42",                                    # OpenAI native field
    extra_body={"metadata": {"user_id": "customer_42"}},   # or via metadata
)
```

Hydite maintains per-end-user counters. Every counter (RPM / TPM / Budget) can be applied at this layer.
| Value | Outcome |
|---|---|
| Stop one runaway end-user from DDoS-ing your platform | Your service stays fair |
| Build per-user metering products | Read per-user dollars from the dashboard directly |
| Free / Paid / VIP tiers | Different metadata.tier → different Group |
| Compliance per-user audit trail | First-class support |
Strongly recommended for any SaaS / PaaS / agent product exposing LLM capabilities. It removes ~90% of "customer becomes the incident source" risk architecturally.
7. Time-window limiting
Many businesses have tidal traffic:

```yaml
group: prod-api
quota:
  rpm: 10_000
schedule:
  - cron: "0 0-7 * * *"          # 12am – 7am
    rpm: 500
  - cron: "0 9-18 * * 1-5"       # weekday business hours
    rpm: 30_000
holidays:
  - "2025-01-29 to 2025-02-04"   # Lunar New Year
    rpm: 200
```

Useful for:
- Geo / timezone-sensitive campaigns
- Cost shaping during off-peak
- Regulator-mandated lockdowns during sensitive dates
The Group → Schedule editor in the dashboard ships common templates — you don't need to write cron.
8. The 429 contract
When a limit fires, every API returns the OpenAI-style error body plus a set of headers:
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 12
X-RateLimit-Limit-Requests: 1000
X-RateLimit-Limit-Tokens: 200000
X-RateLimit-Remaining-Requests: 0
X-RateLimit-Remaining-Tokens: 0
X-RateLimit-Reset-Requests: 12s
X-RateLimit-Reset-Tokens: 12s
X-Hydite-Limit-Layer: group
X-Hydite-Limit-Group: grp_acme_prod
X-Hydite-Limit-Counter: rpm
```

```json
{
  "error": {
    "type": "rate_limit_error",
    "code": "rpm_limit",
    "message": "Rate limit exceeded for group grp_acme_prod (1000 RPM). Retry in 12s.",
    "param": null,
    "_extra": {
      "layer": "group",
      "limit": 1000,
      "remaining": 0,
      "reset_seconds": 12,
      "counter": "rpm"
    }
  }
}
```

The Hydite extension headers are gold for triage:
- `X-Hydite-Limit-Layer` → which layer fired (org/group/team/key/user).
- `X-Hydite-Limit-Counter` → which counter saturated (rpm/tpm/concurrency/budget).
Three-second triage, zero log digging.
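As a sketch of how a client might turn those headers into a one-line triage message, the helper below reads the header names shown in the example response. The `describe_429` function is a hypothetical convenience for illustration and is not shipped by Hydite.

```python
from typing import Mapping


def describe_429(headers: Mapping[str, str]) -> str:
    """Summarise which layer and counter fired, from 429 response headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    layer = h.get("x-hydite-limit-layer", "unknown")
    counter = h.get("x-hydite-limit-counter", "unknown")
    message = f"{counter} limit hit at {layer} layer"
    retry_after = h.get("retry-after")
    if retry_after is not None:
        message += f"; retry after {retry_after}s"
    return message


sample = {
    "Retry-After": "12",
    "X-Hydite-Limit-Layer": "group",
    "X-Hydite-Limit-Group": "grp_acme_prod",
    "X-Hydite-Limit-Counter": "rpm",
}
summary = describe_429(sample)
print(summary)  # rpm limit hit at group layer; retry after 12s
```

Logging this summary next to the request ID is usually enough to tell whether to raise a Key quota, a Group quota, or a budget.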
9. Client-side best practices
9.1 Honour Retry-After
Don't hardcode 60s, don't retry instantly. Retry-After (seconds) is Hydite's computed safe retry time:

```python
import random
import time

import openai


def call_with_retry(fn, max_attempts=5):
    """Call fn(), retrying on 429 and waiting for the server-provided Retry-After."""
    for i in range(max_attempts):
        try:
            return fn()
        except openai.RateLimitError as e:
            # Prefer Retry-After; fall back to exponential backoff if the header is absent
            wait = float(e.response.headers.get("retry-after", 2 ** i))
            time.sleep(wait + random.random())  # 0-1s jitter to avoid stampede
    raise RuntimeError("Exhausted retries")
```

9.2 Self-throttle with a Token Bucket
If you know your quota (e.g. 1000 RPM), throttle on the client first so traffic leaves your process smoothed:
- Fewer 429s, fewer round-trips
- Better tail latency
- 5 lines with Resilience4j / aiolimiter / `p-throttle` etc.
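If you would rather not pull in a library, the idea fits in a few lines. The following is a minimal, single-threaded token bucket sketch; the `TokenBucket` class and its injectable `clock` parameter are illustrative assumptions, and a production client would also need locking or an async-aware variant.

```python
import time
from typing import Callable


class TokenBucket:
    """Refill at `rate_per_min` tokens per minute, holding at most `capacity` tokens."""

    def __init__(
        self,
        rate_per_min: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_min / 60.0  # tokens per second
        self.capacity = capacity
        self.tokens = capacity           # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Demo with a fake clock: 60 requests/min, burst capacity of 2.
fake_now = [0.0]
bucket = TokenBucket(rate_per_min=60, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]  # third call finds the bucket empty
fake_now[0] = 1.0                                 # one second later, one token is back
after_wait = bucket.try_acquire()
```

For a 1000 RPM quota you would construct `TokenBucket(rate_per_min=1000, capacity=...)` and call `try_acquire()` before each request, sleeping briefly when it returns `False`. Choosing a capacity below the server-side burst allowance keeps the client conservatively under the limit.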
9.3 Streaming clients must listen for error
A streaming call cut off mid-flight by a limit will close the SSE stream. Always handle the error event and bail:

```js
try {
  const stream = await openai.chat.completions.create({...})
  for await (const chunk of stream) {
    if (chunk.choices?.[0]?.finish_reason === "content_filter") break
    // ...
  }
} catch (err) {
  // A limit tripped mid-stream surfaces here as a thrown error: stop reading and report it
  console.error("stream aborted:", err)
}
```

10. Observability
Dashboard → Overview → Rate Limits offers:
- Live counter dashboard — current RPM / TPM / Concurrency / Budget utilisation per Group / Key.
- Event stream — last 24h of every 429 / Queue / Fallback / Burst event with timestamp, layer, counter, source.
- Top-N offenders — which Keys, IPs or end-users keep tripping limits.
- Trends — limit-hit rate, quota utilisation, burst trigger count.
API:
- `GET /spend/rate-limits/timeseries?group_id=...`
- `GET /spend/rate-limits/events?layer=group&counter=tpm`
- `GET /spend/rate-limits/topn?n=20&dim=user_id`
Plus push to Prometheus / Datadog (see Shared Edge · Observability).
11. Tier matrix
| Capability | Shared | Subscription Pro / Team / Business | Enterprise |
|---|---|---|---|
| RPM / TPM | Platform preset | Tiered cap | Unlimited |
| Concurrency limit | — | ✅ | ✅ |
| Monthly budget | ✅ | ✅ | ✅ |
| Soft budget | — | ✅ | ✅ |
| End-user limits | — | ✅ | ✅ |
| Queue / Fallback modes | Reject | Reject + Queue | All modes |
| Burst Pool | — | Business+ | ✅ |
| Schedule / holidays | — | Business+ | ✅ |
| Algorithm choice | Sliding Window | All four | All four |
| Webhook alerting | Multi-channel | Multi-channel + SIEM |
See Shared Edge Instance and Dedicated Instance.
12. Anti-patterns
- ❌ RPM without TPM: a single 200k-context call wipes you out invisibly.
- ❌ Hard budget without soft budget: hits the limit, everything stops, no warning.
- ❌ End-user limits via Key metadata, not request payload: a Key serving 1,000 users gives Key-level limits zero meaning.
- ❌ Reject-everywhere: VIPs deserve Fallback, not 429.
- ❌ Clients ignoring `Retry-After`: turns the gateway into a self-DDoS target.
- ❌ Skipping Org / Group, putting everything on Keys: upgrades, reuse, resale all get painful.
- ❌ Streaming without error listener: client hangs, UX dies.
13. Production-ready starter

```yaml
# Golden config for a typical prod Group
group: prod-api
quota:
  rpm: 60_000
  tpm: 20_000_000
  concurrency: 200
  monthly_budget_usd: 50_000
soft_budget_usd: 47_500
algorithm: sliding_window
on_exceed: reject
on_budget_exceeded: queue
end_user_limit:
  rpm: 100
  tpm: 200_000
schedule:
  - cron: "0 9-18 * * 1-5"
    rpm: 90_000
alerts:
  webhook: https://hooks.acme.com/hydite-alerts
  thresholds:
    budget: 80
    rpm_utilization: 90
```

Drop it into Group → Edit Policy and you're live.
Next steps
- Multi-key governance basics → API Key Groups
- Branded subdomains → Custom Domains
- Endpoints and headers → API Reference