
Rate Limiting

LLM rate limits aren't a single RPM number — they're a combination of 4 metrics, 5 layers, 4 algorithms and 5 enforcement actions.


Most teams arriving at an LLM gateway think of rate limiting as "max requests per minute". Production reality kicks in fast:

  • A single 100k-context request can wipe out a whole minute of TPM in one go.
  • Streaming calls hold the connection for 60 seconds — what's the right counter?
  • The platform allows 10k RPM total, but you want per-customer caps of 100 RPM.
  • VIP customers want looser limits — without a separate deployment.
  • A limit got tripped: should we reject, queue, or fall back to a cheaper model?
  • End of month and 5% budget left — you want to brake without halting in-flight batches.

Hydite Vtslx AO solves all of the above in a single, coherent system. This article is an opinionated decision manual. By the end you should be able to answer four questions:

  1. Which limiting scenario do I fall into?
  2. Which counter and algorithm should I pick?
  3. At which layer (Org / Group / Team / Key / User) do I configure it?
  4. When a limit is tripped, how do I triage and recover?

1. Four counters

LLM gateways can't get away with just request counts. Hydite tracks four counters, individually or combined:

| Counter | Measures | Best for | Typical range |
| --- | --- | --- | --- |
| RPM (Requests / Min) | Call count | Anti-scraping, abuse caps | 60 – 100,000 |
| TPM (Tokens / Min) | prompt + completion tokens | The real compute throttle | 10k – 100M |
| Concurrency | In-flight requests | Long-context / streaming pile-up | 1 – 1,000 |
| Budget | Cumulative USD (day / month) | Cost guardrail | $1 – $1M+ |

Rules of thumb:

  • Always set TPM. RPM is the symptom; TPM is the truth — one 200k-context call equals 200 normal calls.
  • Add RPM as a sanity cap for high-frequency small calls.
  • Add Concurrency for long-context / streaming pile-ups: hold-open connections + 100+ concurrency = downstream GPU queue meltdown.
  • Budget is the master fuse. Hard cap stops the cluster; soft cap fires alerts only.
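
To make the interplay of the four counters concrete, here is a minimal sketch of an admission check that reports which counter a request would saturate first. The `Usage` and `Limits` types, the function name, and the check order are illustrative assumptions, not Hydite's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    """Hypothetical snapshot of what has already been consumed."""
    requests_this_minute: int
    tokens_this_minute: int
    in_flight: int
    spend_usd: float


@dataclass
class Limits:
    """Hypothetical caps, one per counter from the table above."""
    rpm: int
    tpm: int
    concurrency: int
    budget_usd: float


def first_saturated_counter(usage: Usage, limits: Limits,
                            request_tokens: int,
                            request_cost_usd: float) -> Optional[str]:
    """Return the first counter this request would exceed, or None if it fits."""
    if usage.requests_this_minute + 1 > limits.rpm:
        return "rpm"
    if usage.tokens_this_minute + request_tokens > limits.tpm:
        return "tpm"
    if usage.in_flight + 1 > limits.concurrency:
        return "concurrency"
    if usage.spend_usd + request_cost_usd > limits.budget_usd:
        return "budget"
    return None


limits = Limits(rpm=1_000, tpm=200_000, concurrency=50, budget_usd=100.0)
usage = Usage(requests_this_minute=10, tokens_this_minute=150_000,
              in_flight=3, spend_usd=20.0)

# Only 11 requests this minute, yet one 100k-token call trips TPM.
print(first_saturated_counter(usage, limits, 100_000, 0.5))  # tpm
```

The example shows why RPM alone is misleading: the request count is far below its cap while the token counter is already the binding constraint.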

2. Five-layer inheritance

    Organization  ← top-level red lines (contract caps, security)

    Group         ← policy boundary (workload / customer / env)

    Team          ← people unit (shared quotas)

    Key           ← credential (frontend / backend / CI / demo)

    end-user-id   ← SaaS end users (one rogue user can't blow the pool)

Each layer maintains its own counter; a request must pass every layer or get blocked by the strictest one.

Concrete example:

| Layer | RPM cap |
| --- | --- |
| Org | 100,000 |
| Group prod-api | 60,000 |
| Team growth-team | 10,000 |
| Key frontend-web | 1,000 |
| end-user user_42 | 30 |

user_42's actual rate = min(100k, 60k, 10k, 1k, 30) = 30 RPM.
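
The "strictest layer wins" rule can be sketched in a few lines. The function name and the dictionary shape are assumptions made for illustration; only the layer names and the numbers come from the example above.

```python
# Layer names in inheritance order, outermost first.
LAYER_ORDER = ["org", "group", "team", "key", "user"]


def effective_limit(caps: dict) -> tuple:
    """Return (limit, layer) for the strictest configured layer.

    Layers missing from `caps` are treated as unconfigured and skipped.
    """
    configured = [(caps[layer], layer) for layer in LAYER_ORDER if layer in caps]
    if not configured:
        raise ValueError("no layer has a limit configured")
    return min(configured, key=lambda pair: pair[0])


caps = {"org": 100_000, "group": 60_000, "team": 10_000, "key": 1_000, "user": 30}
print(effective_limit(caps))  # (30, 'user')
```

Returning the layer alongside the number mirrors what a 429 needs to report: not just that a limit fired, but where.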

Three operational wins:

  1. Security sets red lines at Org, business sets policy at Group, devs tighten at Key — without stepping on each other.
  2. No separate deployments for VIPs — drop them into a higher-quota Group.
  3. Precise triage: 429 responses tell you exactly which layer fired, in 10 seconds.

3. Four algorithms — when to pick which

| Algorithm | Accuracy + Burst tolerance | Overhead | Use when |
| --- | --- | --- | --- |
| Sliding Window | ★★★★★★★ | Medium | Default. Highest accuracy + fairness |
| Token Bucket | ★★★★★★★★★ | Low | Bursty client traffic that should be tolerated briefly |
| Leaky Bucket | ★★★★★★ | Low | You want smoothing — async crawlers, flash sales |
| Fixed Window | ★★ | Minimal | Legacy compatibility; minute-boundary insensitive |

When to deviate from the default:

  • All-batch backend traffic (bursty but stable) → Token Bucket, 50% burst tolerance.
  • Both your business and upstream GPUs hate spikes → Leaky Bucket for hard smoothing.
  • Old SDKs expecting minute-boundary resets → Fixed Window.
  • Everything else → Sliding Window.

Switch in Channels → Group Settings → Rate Limiting Algorithm with no restart.
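
For intuition about the default, here is a minimal sketch of the textbook sliding-window log algorithm. It is not Hydite's implementation; the class and method names are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` events in any trailing `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # timestamps of admitted requests, oldest first

    def allow(self, now: float) -> bool:
        # Evict timestamps that have slid out of the trailing window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False


limiter = SlidingWindowLimiter(limit=2, window_seconds=60.0)
print(limiter.allow(0.0))   # True
print(limiter.allow(1.0))   # True
print(limiter.allow(2.0))   # False: two requests already inside the window
print(limiter.allow(60.0))  # True: the request from t=0 has expired
```

Unlike a fixed window, capacity frees up one request at a time as old timestamps expire, so there is no minute-boundary reset for clients to exploit.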

4. Five overflow behaviours

A 429 reject is the simplest response — but rarely the best one for LLM workloads:

| Action | Client experience | Best for |
| --- | --- | --- |
| Reject (429) | Immediate fail + Retry-After | Real-time UX where clients should back off |
| Queue | Request waits for next window | Batch / async (max 60s wait) |
| Fallback | Switch to cheaper / backup model | High-priority paths that mustn't 429 |
| Soft Limit | No block, fires webhook only | Budget warnings, dashboards |
| Burst | Short Burst Pool punches through | Marketing flash sales, lightning trades |

Queue and Fallback are Hydite's signature plays:

  • Queue mode: instead of erroring, Hydite holds the request in a bounded internal queue and releases it when the next sliding window opens. Clients see slightly higher latency, no error.
  • Fallback mode: leverages the Group's routing chain — main claude-sonnet-4-5 is full → fall back to claude-3-5-sonnet → if that's also full, drop to gpt-4o-mini. Business doesn't break.
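
The fallback walk described above can be sketched as a simple first-fit search over the chain. `pick_model` and the `has_capacity` callback are hypothetical names; how the gateway actually decides that a model is "full" is not specified here.

```python
from typing import Callable, List, Optional


def pick_model(chain: List[str],
               has_capacity: Callable[[str], bool]) -> Optional[str]:
    """Return the first model in the chain that still has quota, else None."""
    for model in chain:
        if has_capacity(model):
            return model
    return None  # whole chain saturated: caller must reject or queue


chain = ["claude-sonnet-4-5", "claude-3-5-sonnet", "gpt-4o-mini"]
saturated = {"claude-sonnet-4-5"}  # pretend the primary model is at its limit

print(pick_model(chain, lambda m: m not in saturated))  # claude-3-5-sonnet
```

Returning `None` when every model is saturated keeps the final decision (reject or queue) with the caller rather than hiding it inside the routing step.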

You can mix per-key inside one Group:

  • key-frontend → Reject (UI should surface and retry)
  • key-batch → Queue (background can wait)
  • key-vip → Fallback (VIP never sees 429)

5. Six real-world recipes

5.1 Anti-bot without harming real users

Don't just look at RPM — bots betray themselves through behaviour:

    group: public-website
    quota:
      rpm: 100 # max 100 RPM per IP
      tpm: 50_000
      concurrency: 5
    end_user_limit:
      rpm: 30 # same end-user-id capped further
    network:
      ip_rate_limit: 60 # > 60 RPM per IP → CAPTCHA
    on_exceed: reject

Combine with Origin checks and mTLS (see API Key Groups · Networking) — bots have nowhere to hide.

5.2 Multi-tenant SaaS

Your product has 1,000 customers; each should get max 100 RPM, total ≤ 50k RPM:

    group: saas-customers
    quota:
      rpm: 50_000 # global ceiling
      tpm: 10_000_000
    end_user_limit: # the magic line
      rpm: 100
      tpm: 200_000
    on_exceed: reject

Pass metadata.user_id per request:

    client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=[...],
        extra_body={"metadata": {"user_id": "customer_42"}},
    )

Hydite tracks per-customer counters automatically. No per-customer keys, no extra code.

5.3 Same Group, isolated blast radius

    Group: prod-api (60k RPM)
    ├─ Key: frontend-web → 5k RPM, 1M TPM, cheap models only
    ├─ Key: backend-batch → 30k RPM, 5M TPM, all models, Queue mode
    └─ Key: ci-bot → 100 RPM, $20/day budget

A frontend DDoS doesn't crush the backend batch; a buggy CI script doesn't burn the whole budget.

5.4 VIP tier without a second deployment

Just create a higher-quota Group:

    group: vip-tier-platinum
    quota:
      rpm: 10_000 # 10x baseline
      tpm: 5_000_000
    on_exceed: fallback # VIPs never see 429
    fallback_chain:
      - claude-sonnet-4-5
      - claude-3-5-sonnet
      - gpt-4o-mini

Keys minted in this Group inherit the high quota and never-fail UX.

5.5 Marketing flash event

Midnight flash → 100x traffic spike. Burst Pool to the rescue:

    group: campaign-double11
    quota:
      rpm: 1_000 # baseline
    burst:
      enabled: true
      multiplier: 50 # allows 50,000 RPM
      duration: 600 # for 10 minutes
      triggers_per_day: 2
      cooldown: 1800
    on_exceed: queue # within burst, still queue if exceeded

Cost protected, customer experience intact.
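
One way to read the burst fields is sketched below. The semantics are assumptions inferred from the field names and comments in the config (cooldown measured from the end of the previous burst, triggers counted per calendar day); the types and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BurstPolicy:
    baseline_rpm: int = 1_000
    multiplier: int = 50
    duration_s: int = 600        # length of one burst
    triggers_per_day: int = 2    # how many bursts may start per day
    cooldown_s: int = 1_800      # assumed: minimum gap after a burst ends


def current_rpm_cap(policy: BurstPolicy, now: float,
                    burst_started_at: Optional[float]) -> int:
    """RPM ceiling right now: multiplied while a burst is active, else baseline."""
    if burst_started_at is not None and 0 <= now - burst_started_at < policy.duration_s:
        return policy.baseline_rpm * policy.multiplier
    return policy.baseline_rpm


def can_trigger_burst(policy: BurstPolicy, now: float,
                      last_burst_ended_at: Optional[float],
                      triggers_today: int) -> bool:
    """A new burst may start only if the daily budget and cooldown both allow it."""
    if triggers_today >= policy.triggers_per_day:
        return False
    if last_burst_ended_at is not None and now - last_burst_ended_at < policy.cooldown_s:
        return False
    return True


policy = BurstPolicy()
print(current_rpm_cap(policy, now=300, burst_started_at=0))   # 50000
print(current_rpm_cap(policy, now=700, burst_started_at=0))   # 1000
```

The daily trigger count and cooldown are what keep the cost bounded: with these numbers, the elevated ceiling is available for at most 20 minutes per day.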

5.6 End-of-month braking

5% budget left, 3 days to go — slow down without halting:

    group: prod-api
    quota:
      monthly_budget_usd: 50_000
      soft_budget_usd: 47_500
    on_budget_exceeded: queue
    on_budget_breach:
      threshold_pct: 100
      action: reject_writes_only

Webhook pings finance; devs get headroom; nobody gets a hard stop.

6. End-user limits — the most underrated capability

If you build any kind of AI-powered SaaS, this is the section to read twice.

Hydite lets you rate-limit your customers' customers. Pass an end-user-id with every request:

    client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=[...],
        user="customer_42",  # OpenAI native field
        extra_body={"metadata": {"user_id": "customer_42"}},  # or via metadata
    )

Hydite maintains per-end-user counters. Every counter — RPM / TPM / Budget — can be applied at this layer.

| Value | Outcome |
| --- | --- |
| Stop one runaway end-user from DDoS-ing your platform | Your service stays fair |
| Build per-user metering products | Read per-user dollars from the dashboard directly |
| Free / Paid / VIP tiers | Different metadata.tier → different Group |
| Compliance per-user audit trail | First-class support |

Strongly recommended for any SaaS / PaaS / agent product exposing LLM capabilities. It removes ~90% of "customer becomes the incident source" risk architecturally.

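7. Time-window limiting

Many businesses have tidal traffic: load rises and falls with the time of day, the day of the week, and the holiday calendar. A schedule lets the limits follow that pattern: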

    group: prod-api
    quota:
      rpm: 10_000
    schedule:
      - cron: "0 0-7 * * *" # 12am – 7am
        rpm: 500
      - cron: "0 9-18 * * 1-5" # weekday business hours
        rpm: 30_000
    holidays:
      - "2025-01-29 to 2025-02-04" # Lunar New Year
        rpm: 200

Useful for:

  • Geo / timezone-sensitive campaigns
  • Cost shaping during off-peak
  • Regulator-mandated lockdowns during sensitive dates

The Group → Schedule editor in the dashboard ships common templates — you don't need to write cron.

8. The 429 contract

When a limit fires, every API returns the OpenAI-style error body plus a set of headers:

    HTTP/1.1 429 Too Many Requests
    Content-Type: application/json
    Retry-After: 12
    X-RateLimit-Limit-Requests: 1000
    X-RateLimit-Limit-Tokens: 200000
    X-RateLimit-Remaining-Requests: 0
    X-RateLimit-Remaining-Tokens: 0
    X-RateLimit-Reset-Requests: 12s
    X-RateLimit-Reset-Tokens: 12s
    X-Hydite-Limit-Layer: group
    X-Hydite-Limit-Group: grp_acme_prod
    X-Hydite-Limit-Counter: rpm

    {
      "error": {
        "type": "rate_limit_error",
        "code": "rpm_limit",
        "message": "Rate limit exceeded for group grp_acme_prod (1000 RPM). Retry in 12s.",
        "param": null,
        "_extra": {
          "layer": "group",
          "limit": 1000,
          "remaining": 0,
          "reset_seconds": 12,
          "counter": "rpm"
        }
      }
    }

The Hydite extension headers are gold for triage:

  • X-Hydite-Limit-Layer → which layer fired (org / group / team / key / user).
  • X-Hydite-Limit-Counter → which counter saturated (rpm / tpm / concurrency / budget).

Three-second triage, zero log digging.
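
As a sketch of what that triage looks like in client code, the helper below pulls the relevant fields out of a 429's headers. The function name and return shape are illustrative; the header names are the ones shown in the response above.

```python
from typing import Mapping, Optional


def triage_429(headers: Mapping[str, str]) -> dict:
    """Summarise which layer and counter fired, and how long to wait."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    retry_after: Optional[float] = (
        float(h["retry-after"]) if "retry-after" in h else None
    )
    return {
        "layer": h.get("x-hydite-limit-layer"),
        "counter": h.get("x-hydite-limit-counter"),
        "group": h.get("x-hydite-limit-group"),
        "retry_after_s": retry_after,
    }


example_headers = {
    "Retry-After": "12",
    "X-Hydite-Limit-Layer": "group",
    "X-Hydite-Limit-Group": "grp_acme_prod",
    "X-Hydite-Limit-Counter": "rpm",
}
print(triage_429(example_headers))
```

Logging this dictionary alongside the request ID is usually enough to tell a per-user cap from a group-wide ceiling without opening the dashboard.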

9. Client-side best practices

9.1 Honour Retry-After

Don't hardcode 60s, don't retry instantly. Retry-After (seconds) is Hydite's computed safe retry time:

    import random
    import time

    import openai

    def call_with_retry(fn, max_attempts=5):
        """Call fn(), sleeping between attempts whenever the gateway returns 429."""
        for i in range(max_attempts):
            try:
                return fn()
            except openai.RateLimitError as e:
                # Prefer the server's Retry-After; fall back to exponential backoff
                wait = float(e.response.headers.get("retry-after", 2 ** i))
                time.sleep(wait + random.random())  # 0-1s jitter to avoid stampede
        raise RuntimeError("Exhausted retries")

9.2 Self-throttle with a Token Bucket

If you know your quota (e.g. 1000 RPM), throttle on the client first so traffic leaves your process smoothed:

  • Fewer 429s, fewer round-trips
  • Better tail latency
  • 5 lines with Resilience4j / aiolimiter / p-throttle etc.
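
If you would rather not pull in a library, a client-side token bucket is small enough to write by hand. This is a minimal single-threaded sketch with hypothetical names; time is passed in explicitly, so in real code you would supply `time.monotonic()`.

```python
class TokenBucket:
    """Refill at a steady rate, allow short bursts up to `capacity`."""

    def __init__(self, rate_per_minute: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_minute / 60.0   # tokens added per second
        self.capacity = capacity
        self.tokens = float(capacity)        # start full
        self.updated = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Credit tokens for the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_minute=60, capacity=2)
print(bucket.try_acquire(now=0.0))  # True
print(bucket.try_acquire(now=0.0))  # True
print(bucket.try_acquire(now=0.0))  # False: burst capacity used up
print(bucket.try_acquire(now=1.0))  # True: one token refilled after 1s
```

Setting `rate_per_minute` slightly below your quota leaves headroom for clock skew and for other processes that share the same key.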

9.3 Streaming clients must listen for error

A streaming call cut off mid-flight by a limit will close the SSE stream. Always handle the error event and bail:

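    const stream = await openai.chat.completions.create({...})
    try {
      for await (const chunk of stream) {
        if (chunk.choices?.[0]?.finish_reason === "content_filter") break
        // ...
      }
    } catch (err) {
      // Stream closed mid-flight (for example by a limit): stop consuming and surface it
      console.error("stream aborted:", err)
    }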

10. Observability

Dashboard → Overview → Rate Limits offers:

  • Live counter dashboard — current RPM / TPM / Concurrency / Budget utilisation per Group / Key.
  • Event stream — last 24h of every 429 / Queue / Fallback / Burst event with timestamp, layer, counter, source.
  • Top-N offenders — which Keys, IPs or end-users keep tripping limits.
  • Trends — limit-hit rate, quota utilisation, burst trigger count.

API:

  • GET /spend/rate-limits/timeseries?group_id=...
  • GET /spend/rate-limits/events?layer=group&counter=tpm
  • GET /spend/rate-limits/topn?n=20&dim=user_id
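
A small helper for composing these query URLs might look like the sketch below. The base URL is a placeholder and the function name is hypothetical; authentication is not covered here, so add whatever credentials your gateway requires.

```python
from urllib.parse import urlencode


def rate_limit_url(base_url: str, endpoint: str, **params) -> str:
    """Build a /spend/rate-limits/<endpoint> URL with a query string."""
    query = urlencode(params)
    return f"{base_url.rstrip('/')}/spend/rate-limits/{endpoint}?{query}"


# "https://gateway.example.com" is a placeholder, not a real Hydite host.
print(rate_limit_url("https://gateway.example.com", "events",
                     layer="group", counter="tpm"))
print(rate_limit_url("https://gateway.example.com", "topn",
                     n=20, dim="user_id"))
```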

Plus push to Prometheus / Datadog (see Shared Edge · Observability).

11. Tier matrix

| Capability | Shared | Subscription Pro / Team / Business | Enterprise |
| --- | --- | --- | --- |
| RPM / TPM | Platform preset | Tiered cap | Unlimited |
| Concurrency limit | | | |
| Monthly budget | | | |
| Soft budget | | | |
| End-user limits | | | |
| Queue / Fallback modes | Reject | Reject + Queue | All modes |
| Burst Pool | | Business+ | |
| Schedule / holidays | | Business+ | |
| Algorithm choice | Sliding Window | All four | All four |
| Webhook alerting | Email | Multi-channel | Multi-channel + SIEM |

See Shared Edge Instance and Dedicated Instance.

12. Anti-patterns

  • RPM without TPM: a single 200k-context call wipes you out invisibly.
  • Hard budget without soft budget: hits the limit, everything stops, no warning.
  • End-user limits via Key metadata, not request payload: a Key serving 1,000 users gives Key-level limits zero meaning.
  • Reject-everywhere: VIPs deserve Fallback, not 429.
  • Clients ignoring Retry-After: turns the gateway into a self-DDoS target.
  • Skipping Org / Group, putting everything on Keys: upgrades, reuse, resale all get painful.
  • Streaming without error listener: client hangs, UX dies.

13. Production-ready starter

    # Golden config for a typical prod Group
    group: prod-api
    quota:
      rpm: 60_000
      tpm: 20_000_000
      concurrency: 200
      monthly_budget_usd: 50_000
      soft_budget_usd: 47_500
    algorithm: sliding_window
    on_exceed: reject
    on_budget_exceeded: queue
    end_user_limit:
      rpm: 100
      tpm: 200_000
    schedule:
      - cron: "0 9-18 * * 1-5"
        rpm: 90_000
    alerts:
      webhook: https://hooks.acme.com/hydite-alerts
      thresholds:
        budget: 80
        rpm_utilization: 90

Drop it into Group → Edit Policy and you're live.

Next steps