Timeout & Retry Policies - Comprehensive Guide

This guide describes how to configure timeout and retry policies in GAL (Gateway Abstraction Layer) to improve the reliability and resilience of API gateway deployments.

Table of Contents

Overview
Quick Start
Configuration Options
Provider Implementations
Common Use Cases
Best Practices
Troubleshooting
Provider Comparison

Overview

What Are Timeouts?

Timeouts define the maximum wait time for each phase of communication between the gateway and upstream services:

Connection Timeout: Maximum time to establish a TCP connection
Send Timeout: Maximum time to send a request to the upstream
Read Timeout: Maximum time to receive a response from the upstream
Idle Timeout: Maximum time an inactive keep-alive connection is kept open

What Are Retries?

Retries let the gateway automatically repeat failed requests before returning an error to the client.

Retry conditions determine when a request is retried:
- connect_timeout: on connection timeouts
- http_5xx: on any 5xx HTTP status code
- http_502, http_503, http_504: on specific 5xx codes
- reset: on connection reset
- refused: on connection refused

Retry strategies:
- Exponential Backoff: the wait time doubles with each attempt (recommended)
- Linear Backoff: constant wait time between attempts

Why Do Timeouts & Retries Matter?

Resilience: Automatic retries on transient errors
Availability: Prevents requests from hanging on slow upstreams
Performance: Faster failover to healthy servers
User experience: Shorter wait times for users
Resource conservation: Avoids blocked threads

Quick Start

Example 1: Basic Timeout Configuration

version: "1.0"
provider: envoy

global:
  host: 0.0.0.0
  port: 10000

services:
  - name: api_service
    type: rest
    protocol: http
    upstream:
      host: api-backend
      port: 8080
    routes:
      - path_prefix: /api
        timeout:
          connect: "5s"
          send: "30s"
          read: "60s"
          idle: "300s"

Explanation:
- Connection setup: max. 5 seconds
- Sending the request: max. 30 seconds
- Receiving the response: max. 60 seconds
- Idle connection: max. 5 minutes

Example 2: Basic Retry Configuration

services:
  - name: api_service
    upstream:
      host: api-backend
      port: 8080
    routes:
      - path_prefix: /api
        retry:
          enabled: true
          attempts: 3
          backoff: exponential
          base_interval: "25ms"
          max_interval: "250ms"
          retry_on:
            - connect_timeout
            - http_5xx

Explanation:
- At most 3 attempts
- Exponential backoff: the wait doubles from 25ms (25ms → 50ms → 100ms) and is capped at max_interval (250ms)
- Retries on connection timeouts or 5xx errors

Example 3: Timeouts & Retries Combined (RECOMMENDED)

services:
  - name: payment_service
    upstream:
      host: payment-backend
      port: 8080
    routes:
      - path_prefix: /api/payments
        timeout:
          connect: "3s"
          send: "10s"
          read: "30s"
        retry:
          enabled: true
          attempts: 3
          backoff: exponential
          base_interval: "50ms"
          max_interval: "500ms"
          retry_on:
            - connect_timeout
            - http_502
            - http_503
            - http_504

Explanation:
- Short timeouts for fast failover
- Aggressive retry strategy for the critical payment API
- Specific 5xx codes instead of the generic http_5xx
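To see what combined settings mean for a client, it can help to estimate the worst-case time a request may spend in the gateway. The sketch below is a rough upper bound, not a description of GAL's actual behavior: it assumes that `attempts` counts the original request, that the connect, send, and read phases each run to their limit on every attempt, and that the backoff doubles from `base_interval` up to `max_interval`. The function name `worst_case_latency_ms` is hypothetical.

```python
def worst_case_latency_ms(connect_ms: int, send_ms: int, read_ms: int,
                          attempts: int, base_ms: int, max_ms: int) -> int:
    """Upper bound on time spent across all attempts, in milliseconds."""
    # Assumption: every phase of every attempt runs into its timeout.
    per_attempt = connect_ms + send_ms + read_ms
    # Assumption: one backoff wait before each attempt after the first.
    backoff = sum(min(base_ms * 2 ** i, max_ms) for i in range(attempts - 1))
    return attempts * per_attempt + backoff


# Values from Example 3: connect 3s, send 10s, read 30s, 3 attempts, 50ms..500ms
print(worst_case_latency_ms(3000, 10000, 30000, attempts=3, base_ms=50, max_ms=500))
```

Under these assumptions the Example 3 settings allow a request to take about 129 seconds in the worst case, which is worth comparing against any client-side timeout.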

Configuration Options

TimeoutConfig

timeout:
  connect: "5s"      # Connection-Timeout (default: "5s")
  send: "30s"        # Send-Timeout (default: "30s")
  read: "60s"        # Read-Timeout (default: "60s")
  idle: "300s"       # Idle-Timeout (default: "300s")

Parameters:

Parameter	Type	Default	Description
`connect`	string	`"5s"`	Maximum time to establish the TCP connection to the upstream
`send`	string	`"30s"`	Maximum time to send the request to the upstream
`read`	string	`"60s"`	Maximum time to receive the response from the upstream
`idle`	string	`"300s"`	Maximum time an inactive keep-alive connection is kept open (5 minutes)

Format: durations are strings with a unit suffix:
- s = seconds (e.g. "5s", "30s")
- m = minutes (e.g. "1m", "10m")
- ms = milliseconds (e.g. "500ms")
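The duration format can be illustrated with a small parser. This is a minimal sketch that assumes only the three suffixes listed here and whole-number values; the helper name `parse_duration_ms` is hypothetical and not part of GAL.

```python
import re

# Multipliers to milliseconds for the three documented suffixes.
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60000}
_DURATION_RE = re.compile(r"^(\d+)(ms|s|m)$")


def parse_duration_ms(value: str) -> int:
    """Convert a duration string such as "5s", "500ms" or "1m" to milliseconds."""
    match = _DURATION_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_MS[unit]


print(parse_duration_ms("5s"), parse_duration_ms("500ms"), parse_duration_ms("1m"))
```

Working in integer milliseconds avoids floating-point rounding and matches providers that expect millisecond values.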

RetryConfig

retry:
  enabled: true                  # Enable retries (default: true)
  attempts: 3                    # Number of attempts (default: 3)
  backoff: exponential           # Backoff strategy (default: "exponential")
  base_interval: "25ms"          # Base interval (default: "25ms")
  max_interval: "250ms"          # Maximum interval (default: "250ms")
  retry_on:                      # Retry conditions (default: ["connect_timeout", "http_5xx"])
    - connect_timeout
    - http_5xx
    - http_502
    - http_503

Parameters:

Parameter	Type	Default	Description
`enabled`	boolean	`true`	Enables or disables the retry logic
`attempts`	integer	`3`	Number of attempts (including the original request)
`backoff`	string	`"exponential"`	Backoff strategy: `"exponential"` or `"linear"`
`base_interval`	string	`"25ms"`	Base interval for the backoff
`max_interval`	string	`"250ms"`	Maximum interval between retries
`retry_on`	list[string]	`["connect_timeout", "http_5xx"]`	List of retry conditions

Retry conditions:

Condition	Description
`connect_timeout`	Retry on connection timeout
`http_5xx`	Retry on any 5xx HTTP status code
`http_502`	Retry on HTTP 502 Bad Gateway
`http_503`	Retry on HTTP 503 Service Unavailable
`http_504`	Retry on HTTP 504 Gateway Timeout
`retriable_4xx`	Retry on retriable 4xx codes (e.g. 429 Too Many Requests)
`reset`	Retry on connection reset
`refused`	Retry on connection refused

Backoff strategies:

Exponential backoff (recommended):
- Attempt 1: immediately
- Attempt 2: after base_interval (e.g. 25ms)
- Attempt 3: after base_interval * 2 (e.g. 50ms)
- Attempt 4: after base_interval * 4 (e.g. 100ms)
- Maximum: max_interval

Linear backoff:
- Attempt 1: immediately
- Attempt 2: after base_interval (e.g. 25ms)
- Attempt 3: after base_interval (e.g. 25ms)
- Attempt 4: after base_interval (e.g. 25ms)
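The two schedules can be reproduced with a few lines of Python. This is only a sketch of the arithmetic listed here; the function name `backoff_delays` is hypothetical, and applying the `max_interval` cap to the linear strategy as well is an assumption.

```python
from typing import List


def backoff_delays(attempts: int, base_ms: int, max_ms: int,
                   strategy: str = "exponential") -> List[int]:
    """Return the wait in ms before attempt 2, 3, ... up to `attempts`."""
    delays = []
    for retry_index in range(attempts - 1):  # attempt 1 is sent immediately
        if strategy == "exponential":
            delay = base_ms * (2 ** retry_index)  # base, base*2, base*4, ...
        elif strategy == "linear":
            delay = base_ms  # constant wait between attempts
        else:
            raise ValueError(f"unknown backoff strategy: {strategy!r}")
        delays.append(min(delay, max_ms))  # never exceed max_interval
    return delays


print(backoff_delays(4, base_ms=25, max_ms=250))                     # exponential
print(backoff_delays(4, base_ms=25, max_ms=250, strategy="linear"))  # linear
```

With four attempts the exponential call yields the 25ms, 50ms, 100ms sequence from the list, while the linear call yields 25ms three times.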

Provider Implementations

Envoy

Timeout configuration:

# Envoy Static Configuration (envoy.yaml)
clusters:
  - name: api_service_cluster
    connect_timeout: 5s
    # ...

routes:
  - route:
      timeout: 60s          # read timeout
      idle_timeout: 300s    # idle timeout

Retry configuration:

routes:
  - route:
      retry_policy:
        num_retries: 3
        per_try_timeout: 25ms
        retry_on: "connect-failure,5xx"
        retriable_status_codes: [502, 503, 504]

Notes:
- Cluster level: connect_timeout
- Route level: timeout (read), idle_timeout
- Retry conditions: connect-failure, 5xx, reset, refused-stream
- retriable_status_codes for specific 5xx codes (Envoy only evaluates this list when retry_on also contains retriable-status-codes)

GAL mapping:
- connect_timeout → cluster.connect_timeout
- http_5xx → retry_on: "5xx"
- http_502 → retriable_status_codes: [502]
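The mapping can be sketched as a small translation function. This is an illustration under assumptions, not GAL's generator: the name `to_envoy_retry_policy` is hypothetical, `connect_timeout` is assumed to map to Envoy's `connect-failure` (as in the generated example), and `retriable-status-codes` is added to `retry_on` because Envoy needs it to honor the status code list.

```python
from typing import Dict, List


def to_envoy_retry_policy(retry_on: List[str], attempts: int) -> Dict[str, object]:
    """Translate GAL retry conditions into an Envoy-style retry_policy dict."""
    conditions: List[str] = []
    status_codes: List[int] = []
    for cond in retry_on:
        if cond == "connect_timeout":
            conditions.append("connect-failure")  # assumed mapping
        elif cond == "http_5xx":
            conditions.append("5xx")
        elif cond == "reset":
            conditions.append("reset")
        elif cond.startswith("http_") and cond[5:].isdigit():
            status_codes.append(int(cond[5:]))  # e.g. http_502 -> 502
        else:
            raise ValueError(f"no Envoy mapping sketched for {cond!r}")
    policy: Dict[str, object] = {"num_retries": attempts}
    if status_codes:
        conditions.append("retriable-status-codes")
        policy["retriable_status_codes"] = status_codes
    policy["retry_on"] = ",".join(conditions)
    return policy


print(to_envoy_retry_policy(["connect_timeout", "http_502", "http_503"], 3))
```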

Kong

Timeout configuration:

services:
  - name: api_service
    connect_timeout: 10000     # milliseconds
    read_timeout: 120000       # milliseconds
    write_timeout: 60000       # milliseconds

Retry configuration:

services:
  - name: api_service
    retries: 3

Notes:
- Service level: all timeouts are in milliseconds
- Retries: only the number of attempts can be set; there are no conditional retries
- No native backoff configuration

GAL mapping:
- timeout.connect: "10s" → connect_timeout: 10000
- timeout.read: "120s" → read_timeout: 120000
- timeout.send: "60s" → write_timeout: 60000
- retry.attempts: 3 → retries: 3

APISIX

Timeout configuration:

{
  "routes": [{
    "plugins": {
      "timeout": {
        "connect": 10,
        "send": 60,
        "read": 120
      }
    }
  }]
}

Retry configuration:

{
  "routes": [{
    "plugins": {
      "proxy-retry": {
        "retries": 3,
        "retry_timeout": 500,
        "vars": [
          ["status", "==", 502],
          ["status", "==", 503]
        ]
      }
    }
  }]
}

Notes:
- Plugin-based: timeout plugin
- Retries via the proxy-retry plugin
- Timeouts in seconds
- Retry conditions via vars (status code filter)

GAL mapping:
- timeout.connect: "10s" → timeout.connect: 10
- retry_on: [http_502, http_503] → vars: [["status", "==", 502], ...]

Traefik

Timeout configuration:

services:
  api_service_service:
    loadBalancer:
      serversTransport:
        forwardingTimeouts:
          dialTimeout: 10s
          responseHeaderTimeout: 120s
          idleConnTimeout: 600s

Retry configuration:

middlewares:
  api_service_router_0_retry:
    retry:
      attempts: 5
      initialInterval: 50ms

Notes:
- Service level: serversTransport.forwardingTimeouts
- Retries are configured as a middleware
- dialTimeout = connection timeout
- responseHeaderTimeout = read timeout
- idleConnTimeout = idle timeout

GAL mapping:
- timeout.connect → dialTimeout
- timeout.read → responseHeaderTimeout
- timeout.idle → idleConnTimeout
- retry → creates a middleware

Nginx

Timeout configuration:

location /api {
    proxy_connect_timeout 10s;
    proxy_send_timeout 60s;
    proxy_read_timeout 120s;
}

Retry configuration:

location /api {
    proxy_next_upstream timeout http_502 http_503;
    proxy_next_upstream_tries 3;
    proxy_next_upstream_timeout 500ms;
}

Notes:
- Location level: proxy_*_timeout directives
- Retries via proxy_next_upstream
- Retry conditions: timeout, error, http_502, http_503, etc.

GAL mapping:
- connect_timeout → proxy_next_upstream timeout
- http_502 → proxy_next_upstream http_502
- attempts: 3 → proxy_next_upstream_tries 3
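This mapping can be sketched as a function that renders the two Nginx directives. It is an illustration only: the name `to_nginx_retry` is hypothetical, and conditions outside the mapping shown here are rejected rather than guessed.

```python
from typing import List

# GAL condition -> proxy_next_upstream token, limited to the mapping listed in the text.
_NGINX_CONDITIONS = {
    "connect_timeout": "timeout",
    "http_502": "http_502",
    "http_503": "http_503",
    "http_504": "http_504",
}


def to_nginx_retry(retry_on: List[str], attempts: int) -> List[str]:
    """Render proxy_next_upstream directives for a location block."""
    tokens = []
    for cond in retry_on:
        if cond not in _NGINX_CONDITIONS:
            raise ValueError(f"no Nginx mapping sketched for {cond!r}")
        tokens.append(_NGINX_CONDITIONS[cond])
    return [
        "proxy_next_upstream " + " ".join(tokens) + ";",
        f"proxy_next_upstream_tries {attempts};",
    ]


for line in to_nginx_retry(["connect_timeout", "http_502", "http_503"], 3):
    print(line)
```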

HAProxy

Timeout configuration:

backend backend_api_service
    timeout connect 10s
    timeout server 120s
    timeout client 600s

Retry configuration:

backend backend_api_service
    retry-on conn-failure 502 503
    retries 5

Notes:
- Backend level: timeout connect/server/client (HAProxy itself accepts timeout client only in defaults, frontend, and listen sections, so that line may need to move out of the backend block)
- Retries via the retry-on directive
- Retry conditions: conn-failure, empty-response, HTTP status codes

GAL mapping:
- timeout.connect → timeout connect
- timeout.read → timeout server
- timeout.idle → timeout client
- connect_timeout → retry-on conn-failure
- http_503 → retry-on 503

Azure APIM

Azure API Management configures timeouts via policy XML:

Timeout configuration:

<policies>
    <inbound>
        <base />
    </inbound>
    <backend>
        <!-- Forward-request timeout in seconds -->
        <forward-request timeout="120" />
    </backend>
</policies>

Backend Timeout:

<policies>
    <backend>
        <base />
        <set-backend-service base-url="https://backend.example.com" />
        <!-- Backend send timeout -->
        <send-request timeout="60">
            ...
        </send-request>
    </backend>
</policies>

Retry configuration: GAL does not generate a retry configuration for Azure APIM. Options:
- APIM's built-in retry policy (with attributes such as condition, count, and interval), wrapped around forward-request in the backend section
- Azure Functions with retry logic as the backend
- Application Insights for monitoring failed requests

Notes:
- Timeouts are in seconds (not milliseconds)
- forward-request timeout: read timeout for the entire request
- GAL emits no retry configuration (custom policy XML is required)
- Default timeout: 300 seconds (5 minutes)

GAL mapping:
- timeout.read: "120s" → <forward-request timeout="120" />
- timeout.send: "60s" → <send-request timeout="60">
- retry.*: not supported (warning in the log)

Note: For production retries, Azure Functions or Logic Apps with built-in retry logic are an alternative.

GCP API Gateway

GCP API Gateway configures timeouts via the x-google-backend extension in the OpenAPI 2.0 spec.

Timeout configuration:

swagger: "2.0"
info:
  title: "API with Timeouts"
  version: "1.0.0"

x-google-backend:
  address: https://backend.example.com
  deadline: 30.0  # Timeout in seconds (5-300s)
  path_translation: APPEND_PATH_TO_ADDRESS

paths:
  /api/fast:
    get:
      summary: "Fast endpoint"
      x-google-backend:
        address: https://backend.example.com
        deadline: 5.0  # 5-second timeout

  /api/slow:
    get:
      summary: "Slow endpoint (long-running)"
      x-google-backend:
        address: https://backend.example.com
        deadline: 120.0  # 2-minute timeout

Backend deadline parameter:
- Parameter: deadline (in seconds)
- Minimum: 5 seconds
- Maximum: 300 seconds (5 minutes)
- Default: 15 seconds
- Type: float (e.g. 30.0, 5.5)
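A validation step for these limits might look like the sketch below. The limits are the ones stated in this guide; the function name `to_gcp_deadline` is hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumption about the desired behavior.

```python
GCP_MIN_DEADLINE_S = 5.0    # minimum as stated in this guide
GCP_MAX_DEADLINE_S = 300.0  # maximum as stated in this guide


def to_gcp_deadline(read_timeout: str) -> float:
    """Convert a GAL read timeout such as "30s" into a deadline in seconds."""
    if read_timeout.endswith("ms"):  # check "ms" before "s"
        seconds = int(read_timeout[:-2]) / 1000
    elif read_timeout.endswith("s"):
        seconds = float(int(read_timeout[:-1]))
    elif read_timeout.endswith("m"):
        seconds = float(int(read_timeout[:-1]) * 60)
    else:
        raise ValueError(f"unsupported duration: {read_timeout!r}")
    if not GCP_MIN_DEADLINE_S <= seconds <= GCP_MAX_DEADLINE_S:
        raise ValueError(
            f"deadline {seconds}s is outside {GCP_MIN_DEADLINE_S}-{GCP_MAX_DEADLINE_S}s"
        )
    return seconds


print(to_gcp_deadline("30s"), to_gcp_deadline("2m"))
```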

Retry configuration: GCP API Gateway offers no native retry configuration. Alternatives:

Backend-side retries (recommended):

# Cloud Run backend with retries
import requests
from flask import Flask
from tenacity import retry, stop_after_attempt, wait_exponential

app = Flask(__name__)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=0.5, max=5),
    reraise=True,  # surface the last requests error instead of tenacity's RetryError
)
def call_downstream_service():
    response = requests.get('https://downstream-api.com/data', timeout=10)
    response.raise_for_status()
    return response.json()

@app.route('/api/data')
def api_data():
    try:
        data = call_downstream_service()
        return {'data': data}, 200
    except Exception as e:
        return {'error': str(e)}, 503

Cloud Tasks for asynchronous retries:
```
import json

from google.cloud import tasks_v2

def create_task_with_retry(project, location, queue, url, payload):
    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path(project, location, queue)
    task = {
        'http_request': {
            'http_method': tasks_v2.HttpMethod.POST,
            'url': url,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps(payload).encode()
        },
        # Check against the Cloud Tasks API before use: retry settings
        # are documented on the queue resource, so a per-task
        # retry_config may be rejected.
        'retry_config': {
            'max_attempts': 5,
            'max_retry_duration': '3600s',
            'min_backoff': '0.1s',
            'max_backoff': '10s',
            'max_doublings': 5
        }
    }
    return client.create_task(request={'parent': parent, 'task': task})
```

Deployment:

# Create an API config with the timeout configuration
gcloud api-gateway api-configs create config-v1 \
  --api=my-api \
  --openapi-spec=openapi-with-timeouts.yaml \
  --project=my-project \
  --backend-auth-service-account=backend@my-project.iam.gserviceaccount.com

# Create the gateway
gcloud api-gateway gateways create my-gateway \
  --api=my-api \
  --api-config=config-v1 \
  --location=us-central1 \
  --project=my-project

# Monitor timeout metrics
gcloud monitoring time-series list \
  --filter='metric.type="serviceruntime.googleapis.com/api/request_latencies"' \
  --project=my-project

Cloud Monitoring for timeout analysis:

# Show requests that timed out
gcloud logging read "resource.type=api AND httpRequest.status=504" \
  --project=my-project \
  --limit=50 \
  --format=json

# Fetch latency metrics
gcloud monitoring time-series list \
  --filter='metric.type="serviceruntime.googleapis.com/api/request_latencies"' \
  --project=my-project \
  --format=json

GCP API Gateway notes:
- ✅ Timeout configuration via deadline (5-300s)
- ✅ Per-path timeout configuration
- ✅ Global default timeout (15s)
- ❌ No native retry policies
- ⚠️ Backend-side retries required
- ✅ Integration with Cloud Tasks for asynchronous retries
- ✅ Cloud Monitoring for timeout metrics

GAL mapping:
- timeout.read: "30s" → deadline: 30.0
- timeout.connect → not supported (backend responsibility)
- retry.* → not supported (backend implementation required)

Example: Complete timeout configuration:

swagger: "2.0"
info:
  title: "Production API with Timeouts"
  version: "1.0.0"

# Global default timeout
x-google-backend:
  address: https://api-backend.example.com
  deadline: 30.0

paths:
  /api/health:
    get:
      summary: "Health check (fast)"
      x-google-backend:
        deadline: 5.0  # lowest value in the documented 5-300s range

  /api/users:
    get:
      summary: "List users (normal)"
      x-google-backend:
        deadline: 10.0

  /api/reports:
    post:
      summary: "Generate report (slow)"
      x-google-backend:
        deadline: 120.0

  /api/batch:
    post:
      summary: "Batch processing (very slow)"
      x-google-backend:
        deadline: 300.0  # Maximum: 5 minutes

Note: For production-grade timeout and retry management, Google recommends:
1. Backend-side retry logic (e.g. with Tenacity or Backoff)
2. Cloud Tasks for asynchronous workflows with retries
3. Apigee (enterprise API gateway with native retry policies)

AWS API Gateway

AWS API Gateway implements timeouts via timeoutInMillis in x-amazon-apigateway-integration. Retry logic is handled client-side via the AWS SDK.

Integration timeout:
- Parameter: timeoutInMillis in the x-amazon-apigateway-integration block
- Minimum: 50ms
- Maximum: 29000ms (29 seconds); this is a hard limit and AWS API Gateway's absolute maximum
- Default: 29000ms
- Mechanism: applies to the entire backend request (connection + response)
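Because the integration timeout is bounded on both sides, a generator has to decide what to do with values outside the range. The sketch below clamps the value and reports whether it changed; whether GAL clamps, warns, or fails is not stated here, so this behavior and the name `clamp_integration_timeout` are assumptions.

```python
from typing import Tuple

AWS_MIN_TIMEOUT_MS = 50      # minimum as stated in this guide
AWS_MAX_TIMEOUT_MS = 29000   # maximum as stated in this guide


def clamp_integration_timeout(requested_ms: int) -> Tuple[int, bool]:
    """Return (timeoutInMillis, was_clamped) for a requested timeout in ms."""
    clamped = max(AWS_MIN_TIMEOUT_MS, min(requested_ms, AWS_MAX_TIMEOUT_MS))
    return clamped, clamped != requested_ms


for requested in (5000, 60000, 10):
    value, changed = clamp_integration_timeout(requested)
    print(requested, "->", value, "(clamped)" if changed else "")
```

Returning the flag lets the caller log a warning when a configured read timeout, such as 60 seconds, cannot be honored.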

Example of the generated OpenAPI config:

{
  "openapi": "3.0.1",
  "info": {
    "title": "API with Timeouts",
    "version": "1.0.0"
  },
  "paths": {
    "/api/fast": {
      "get": {
        "x-amazon-apigateway-integration": {
          "type": "http_proxy",
          "httpMethod": "GET",
          "uri": "https://backend.example.com/api/fast",
          "timeoutInMillis": 5000,
          "connectionType": "INTERNET"
        }
      }
    },
    "/api/slow": {
      "get": {
        "x-amazon-apigateway-integration": {
          "type": "http_proxy",
          "httpMethod": "GET",
          "uri": "https://backend.example.com/api/slow",
          "timeoutInMillis": 29000,
          "connectionType": "INTERNET"
        }
      }
    }
  }
}

Retry strategy: AWS API Gateway offers no native retry policies. Alternatives:

AWS SDK exponential backoff (client-side):

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

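# AWS SDK with retry config
config = Config(
    retries={
        'max_attempts': 3,
        'mode': 'adaptive'  # one of: adaptive, standard, legacy
    }
)

# Note: the 'apigateway' client talks to the API Gateway management API,
# not to the endpoints of a deployed API.
client = boto3.client('apigateway', config=config)

# API request with automatic retries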
try:
    response = client.get_rest_api(restApiId='abc123xyz')
except ClientError as e:
    print(f"Error: {e}")

Circuit Breaker Pattern (Lambda + DynamoDB):

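import json
import time  # needed for time.sleep in make_backend_request_with_retry
import boto3
from datetime import datetime, timedelta

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('circuit-breaker-state')

# record_success() and record_failure() are assumed to update the failure
# counter in the table; they are not shown here.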

def lambda_handler(event, context):
    endpoint = event['requestContext']['path']

    # Check Circuit Breaker State
    state = get_circuit_state(endpoint)

    if state == 'OPEN':
        return {
            'statusCode': 503,
            'body': json.dumps({'error': 'Circuit breaker open'})
        }

    # Make Backend Request with Retry
    try:
        response = make_backend_request_with_retry(endpoint)
        record_success(endpoint)
        return response
    except Exception as e:
        record_failure(endpoint)
        return {
            'statusCode': 503,
            'body': json.dumps({'error': str(e)})
        }

def get_circuit_state(endpoint):
    # DynamoDB lookup
    item = table.get_item(Key={'endpoint': endpoint})
    if 'Item' not in item:
        return 'CLOSED'

    failures = item['Item'].get('failures', 0)
    if failures > 5:
        return 'OPEN'
    return 'CLOSED'

def make_backend_request_with_retry(endpoint, max_retries=3):
    import requests
    for attempt in range(max_retries):
        try:
            response = requests.get(f'https://backend.com{endpoint}', timeout=10)
            response.raise_for_status()
            return {
                'statusCode': 200,
                'body': response.text
            }
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff

AWS Step Functions for long-running tasks:

{
  "Comment": "Retry State Machine",
  "StartAt": "CallBackend",
  "States": {
    "CallBackend": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:backend-call",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "HandleError"
        }
      ],
      "End": true
    },
    "HandleError": {
      "Type": "Fail",
      "Error": "BackendCallFailed",
      "Cause": "Backend call failed after retries"
    }
  }
}

Deployment:

# Generate the OpenAPI spec with the timeout configuration
gal generate -c config.yaml -p aws_apigateway -o api.json

# Create the API
aws apigateway import-rest-api --body file://api.json

# Deployment
aws apigateway create-deployment \
  --rest-api-id abc123xyz \
  --stage-name prod

# Monitor timeout metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name IntegrationLatency \
  --dimensions Name=ApiName,Value=MyAPI \
  --start-time 2025-10-20T00:00:00Z \
  --end-time 2025-10-20T23:59:59Z \
  --period 300 \
  --statistics Average,Maximum

Testing:

# Backend with an artificial delay
# backend-server.py
from flask import Flask
import time

app = Flask(__name__)

@app.route('/api/slow')
def slow():
    time.sleep(30)  # 30 seconds - exceeds the API Gateway limit
    return {'message': 'This will timeout'}

@app.route('/api/fast')
def fast():
    time.sleep(2)  # 2 seconds - within the limit
    return {'message': 'Success'}

# Test with curl
curl --max-time 30 https://abc123.execute-api.us-east-1.amazonaws.com/prod/api/slow
# Expected: 504 Gateway Timeout after 29 seconds

curl --max-time 30 https://abc123.execute-api.us-east-1.amazonaws.com/prod/api/fast
# Expected: 200 OK

# Check CloudWatch Logs
aws logs filter-log-events \
  --log-group-name /aws/apigateway/MyAPI \
  --filter-pattern "504" \
  --start-time $(date -u -d '1 hour ago' +%s)000

AWS API Gateway-specific features:
- ✅ Integration timeout (50ms - 29000ms)
- ✅ Per-method timeout configuration
- ✅ CloudWatch metrics (IntegrationLatency, Latency)
- ✅ X-Ray tracing for timeout analysis
- ⚠️ 29-second hard limit (not configurable)
- ❌ No native retry policies at the gateway level
- ❌ No separate connection timeout
- ❌ No circuit breaker (requires Lambda + DynamoDB)

Limitations:
- ⚠️ 29-second maximum timeout, not extendable
- ⚠️ No separate connection/send/read timeouts
- ⚠️ Retry logic must be implemented client-side (AWS SDK) or backend-side
- ❌ No exponential backoff at the gateway level
- ❌ No status-code-based retries

Workarounds for long-running tasks:

Asynchronous patterns with SQS:

import json

import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'

def lambda_handler(event, context):
    # Enqueue the task (immediate response)
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({'task': 'long-running-job'})
    )

    return {
        'statusCode': 202,
        'body': json.dumps({
            'message': 'Job queued',
            'job_id': 'job-12345'
        })
    }

Polling Pattern:

# 1. Start the job (immediate response)
JOB_ID=$(curl -X POST https://api.example.com/jobs | jq -r '.job_id')

# 2. Poll the status (repeatedly)
while true; do
  STATUS=$(curl https://api.example.com/jobs/$JOB_ID | jq -r '.status')
  if [ "$STATUS" = "completed" ]; then
    echo "Job completed"
    break
  fi
  sleep 5
done

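WebSocket API for real-time updates:

import json

import boto3

# Lambda for the WebSocket connection
def lambda_handler(event, context):
    connection_id = event['requestContext']['connectionId']

    # Start the long-running task (start_long_running_task is not shown here)
    start_long_running_task(connection_id)

    return {'statusCode': 200}

def send_progress_update(connection_id, progress, endpoint_url):
    # endpoint_url is the connection endpoint of the WebSocket API,
    # e.g. https://{api-id}.execute-api.{region}.amazonaws.com/{stage}
    apigatewaymanagementapi = boto3.client(
        'apigatewaymanagementapi', endpoint_url=endpoint_url
    )
    apigatewaymanagementapi.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps({'progress': progress})
    )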

CloudWatch alarms for timeouts:

# Alarm on a high 5XX error count (gateway timeouts surface as 504)
aws cloudwatch put-metric-alarm \
  --alarm-name "API-Gateway-High-Timeout-Rate" \
  --alarm-description "Alert when 5XX errors exceed 50 per 5 minutes for 2 periods" \
  --metric-name 5XXError \
  --namespace AWS/ApiGateway \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=ApiName,Value=MyAPI \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts

Note: AWS API Gateway is optimized for request-response APIs (< 29 seconds). For long-running tasks, use:
1. SQS + Lambda (asynchronous processing)
2. Step Functions (orchestration with retries)
3. WebSocket API (real-time updates)
4. Polling pattern (job status queries)

Common Use Cases

1. REST API with Conservative Timeouts

Use case: Standard REST API with reasonable timeouts for normal workloads.

services:
  - name: rest_api
    upstream:
      host: api.internal
      port: 8080
    routes:
      - path_prefix: /api
        timeout:
          connect: "5s"
          send: "30s"
          read: "60s"
          idle: "300s"

Explanation: The default timeouts are suitable for most APIs.

2. Payment API with Aggressive Retries

Use case: Critical payment API with high availability requirements.

services:
  - name: payment_api
    upstream:
      host: payment.internal
      port: 8080
    routes:
      - path_prefix: /api/payments
        timeout:
          connect: "3s"
          send: "10s"
          read: "30s"
        retry:
          enabled: true
          attempts: 5
          backoff: exponential
          base_interval: "50ms"
          max_interval: "500ms"
          retry_on:
            - connect_timeout
            - http_502
            - http_503
            - http_504

Explanation:
- Short timeouts for fast failover
- 5 retry attempts (more than the default)
- Only specific 5xx codes (not all of them)

3. Long-Running Operations

Use case: Batch processing or report generation with long run times.

services:
  - name: batch_api
    upstream:
      host: batch.internal
      port: 8080
    routes:
      - path_prefix: /api/batch
        timeout:
          connect: "10s"
          send: "60s"
          read: "600s"       # 10 minutes
          idle: "3600s"      # 1 hour
        retry:
          enabled: false      # No retries for long-running operations

Explanation:
- Very long read timeouts (10 minutes)
- Retries disabled (long-running operations should not be repeated)

4. Microservice with Circuit Breaker

Use case: Microservice with a circuit breaker for fast failover.

services:
  - name: user_service
    upstream:
      host: user.internal
      port: 8080
    routes:
      - path_prefix: /api/users
        timeout:
          connect: "2s"
          send: "10s"
          read: "20s"
        retry:
          enabled: true
          attempts: 3
          backoff: exponential
          base_interval: "25ms"
          max_interval: "100ms"
          retry_on:
            - connect_timeout
            - http_503
        circuit_breaker:
          enabled: true
          max_failures: 5
          timeout: "30s"

Explanation:
- Short timeouts + circuit breaker = fast failover
- Retries only on connect timeout and 503 (Service Unavailable)
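The interplay of `max_failures` and `timeout` can be pictured with a small state machine. This is a conceptual sketch, not the implementation of any provider: it assumes `max_failures` counts consecutive failures and that `timeout` is the time the breaker stays open before a probe request is allowed. The class and its `clock` parameter are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after max_failures consecutive failures, probes after reset_timeout."""

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: reject until reset_timeout has elapsed, then allow a probe.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()  # (re)open the circuit


breaker = CircuitBreaker(max_failures=5, reset_timeout=30.0)
for _ in range(5):
    breaker.record_failure()
print(breaker.allow_request())  # the breaker is open right after the fifth failure
```

While the breaker is open, requests fail immediately instead of waiting for the connect timeout, which is what makes the failover fast.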

5. gRPC Service

Use case: gRPC service with special timeout requirements.

services:
  - name: grpc_service
    type: grpc
    protocol: http2
    upstream:
      host: grpc.internal
      port: 50051
    routes:
      - path_prefix: /
        timeout:
          connect: "5s"
          send: "30s"
          read: "120s"       # gRPC streams can run longer
        retry:
          enabled: true
          attempts: 3
          retry_on:
            - reset              # Connection resets are common with gRPC
            - connect_timeout

Explanation:
- Longer read timeouts for gRPC streams
- Retries on reset (common with gRPC problems)

6. External API with Rate Limiting

Use case: External API with rate limiting and conservative retries.

services:
  - name: external_api
    upstream:
      host: api.external.com
      port: 443
    routes:
      - path_prefix: /api
        timeout:
          connect: "10s"
          send: "30s"
          read: "60s"
        retry:
          enabled: true
          attempts: 3
          backoff: exponential
          base_interval: "100ms"   # Longer backoff for the external API
          max_interval: "1s"
          retry_on:
            - connect_timeout
            - http_503
            - retriable_4xx        # 429 Too Many Requests

Explanation:
- Longer backoff for external APIs
- Retries on retriable_4xx (429 rate limit)

7. Multi-Datacenter with Failover

Use case: Multi-datacenter deployment with fast failover.

services:
  - name: api_service
    upstream:
      targets:
        - host: api-dc1.internal
          port: 8080
        - host: api-dc2.internal
          port: 8080
      load_balancer:
        algorithm: round_robin
    routes:
      - path_prefix: /api
        timeout:
          connect: "2s"        # Short for fast failover
          send: "10s"
          read: "30s"
        retry:
          enabled: true
          attempts: 2          # Only 2 attempts (multi-DC has many servers)
          backoff: linear
          base_interval: "10ms"
          retry_on:
            - connect_timeout
            - reset

Explanation:
- Very short connection timeouts (2s)
- Linear backoff for fast failover between DCs

8. WebSocket with Retry

Use case: WebSocket connections with retries on connection errors.

services:
  - name: websocket_service
    upstream:
      host: ws.internal
      port: 8080
    routes:
      - path_prefix: /ws
        websocket:
          enabled: true
          idle_timeout: "600s"
        timeout:
          connect: "5s"
          send: "30s"
          read: "600s"       # Long timeouts for WebSocket
        retry:
          enabled: true
          attempts: 3
          retry_on:
            - connect_timeout
            - reset

Explanation:
- Long read timeouts for WebSocket connections
- Retries only on connection errors (not on protocol errors)

9. Idempotent API with Many Retries

Use case: Idempotent API (GET/PUT/DELETE) with many retry attempts.

services:
  - name: idempotent_api
    upstream:
      host: api.internal
      port: 8080
    routes:
      - path_prefix: /api/data
        methods:
          - GET
          - PUT
          - DELETE
        timeout:
          connect: "5s"
          send: "30s"
          read: "60s"
        retry:
          enabled: true
          attempts: 7          # Many attempts (idempotent)
          backoff: exponential
          base_interval: "25ms"
          max_interval: "1s"
          retry_on:
            - connect_timeout
            - http_5xx
            - reset

Explanation:
- Many retry attempts (7) are safe for idempotent operations
- Exponential backoff with a higher maximum (1s)

10. Non-Idempotent API (POST) without Retry

Use case: Non-idempotent API (POST) without automatic retries.

services:
  - name: order_api
    upstream:
      host: order.internal
      port: 8080
    routes:
      - path_prefix: /api/orders
        methods:
          - POST
        timeout:
          connect: "5s"
          send: "30s"
          read: "60s"
        retry:
          enabled: false       # No retries for POST (non-idempotent)

Explanation:
- POST requests should not be retried automatically
- Risk of duplicates (e.g. duplicate orders)

Best Practices

1. Always Use Timeouts

❌ Bad:

routes:
  - path_prefix: /api
    # No timeout configuration

✅ Good:

routes:
  - path_prefix: /api
    timeout:
      connect: "5s"
      send: "30s"
      read: "60s"

Rationale: Without timeouts, slow upstreams can block all gateway threads.

2. Combine Timeouts with Retries

❌ Bad:

routes:
  - path_prefix: /api
    timeout:
      connect: "5s"
      read: "60s"
    # No retry configuration

✅ Good:

routes:
  - path_prefix: /api
    timeout:
      connect: "5s"
      read: "60s"
    retry:
      enabled: true
      attempts: 3
      retry_on:
        - connect_timeout
        - http_5xx

Rationale: Retries increase availability when failures are transient.

3. Use Exponential Backoff¶

❌ Bad:

retry:
  backoff: linear
  base_interval: "25ms"

✅ Good:

retry:
  backoff: exponential
  base_interval: "25ms"
  max_interval: "250ms"

Rationale: Exponential backoff spaces retries further apart with each attempt, which helps prevent thundering-herd problems.

4. Tune Timeouts to the Use Case¶

❌ Bad (one size fits all):

# Same timeouts for all routes
timeout:
  connect: "5s"
  read: "60s"

✅ Good (use-case specific):

# Short timeouts for fast endpoints
- path_prefix: /api/health
  timeout:
    connect: "2s"
    read: "5s"

# Long timeouts for report generation
- path_prefix: /api/reports
  timeout:
    connect: "10s"
    read: "600s"

Rationale: Different endpoints have different performance characteristics.

5. Disable Retry for Non-Idempotent Operations¶

❌ Bad:

- path_prefix: /api/orders
  methods:
    - POST
  retry:
    enabled: true     # ❌ POST is not idempotent!

✅ Good:

- path_prefix: /api/orders
  methods:
    - POST
  retry:
    enabled: false    # ✅ No retries for POST

Rationale: Retrying a POST request can create duplicates (e.g. duplicate orders).
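One way to catch this mistake before deployment is a small lint over the route definition. The sketch below is hypothetical and not part of GAL: it assumes a route has been loaded into a Python dict with the same `methods` and `retry.enabled` keys used in the YAML above, and it takes the set of idempotent methods from the HTTP specification (RFC 9110).

```python
# HTTP methods defined as idempotent by RFC 9110 (safe methods plus PUT and DELETE).
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"}


def unsafe_retry_methods(route: dict) -> list[str]:
    """Return the non-idempotent methods of a route that has retries enabled."""
    if not route.get("retry", {}).get("enabled", False):
        return []
    return [m for m in route.get("methods", []) if m.upper() not in IDEMPOTENT_METHODS]


bad_route = {"path_prefix": "/api/orders", "methods": ["POST"], "retry": {"enabled": True}}
good_route = {"path_prefix": "/api/orders", "methods": ["POST"], "retry": {"enabled": False}}

print(unsafe_retry_methods(bad_route))   # ['POST']
print(unsafe_retry_methods(good_route))  # []
```

The same check flags PATCH, which is also not idempotent by definition.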

6. Use Specific Retry Conditions¶

❌ Bad:

retry_on:
  - http_5xx         # ❌ Too broad

✅ Good:

retry_on:
  - connect_timeout
  - http_502         # ✅ Specific codes
  - http_503
  - http_504

Rationale: Not every 5xx error is retriable (e.g. 501 Not Implemented).
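The difference between the two lists can be sketched as a matching function. This is an illustration under assumptions, not GAL's or any provider's actual matching code: `http_5xx` is taken to match every status from 500 to 599 and `http_NNN` to match exactly one status, as described in the overview; the helper name `status_matches` is hypothetical.

```python
def status_matches(status: int, retry_on: list[str]) -> bool:
    """Return True if an upstream HTTP status satisfies any status-based retry condition."""
    for condition in retry_on:
        if condition == "http_5xx" and 500 <= status <= 599:
            return True
        if condition == f"http_{status}":
            return True
    return False  # non-status conditions such as connect_timeout never match a status


broad = ["http_5xx"]
specific = ["connect_timeout", "http_502", "http_503", "http_504"]

print(status_matches(501, broad))     # True  (501 Not Implemented would be retried)
print(status_matches(501, specific))  # False (501 is returned to the client)
print(status_matches(503, specific))  # True
```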

7. Cap the Number of Retry Attempts¶

❌ Bad:

retry:
  attempts: 10       # ❌ Too many attempts

✅ Good:

retry:
  attempts: 3        # ✅ Default: 3 attempts

Rationale: Too many retries increase latency and can overload upstreams.
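The latency cost can be estimated with simple arithmetic. The sketch below makes several assumptions that this guide does not confirm for every provider: `attempts` is treated as the number of retries after the initial request, the timeout applies to each try separately, and the backoff doubles from `base_interval` up to `max_interval` without jitter. The function name `worst_case_latency_ms` is hypothetical.

```python
def worst_case_latency_ms(retries: int, per_try_timeout_ms: int,
                          base_ms: int = 25, max_ms: int = 250) -> int:
    """Upper bound on client wait time if every try runs into its timeout."""
    tries = retries + 1  # initial request plus retries (assumption)
    backoff = sum(min(base_ms * 2 ** n, max_ms) for n in range(retries))
    return tries * per_try_timeout_ms + backoff


# With a 60s read timeout per try:
print(worst_case_latency_ms(retries=3, per_try_timeout_ms=60_000))   # 240175
print(worst_case_latency_ms(retries=10, per_try_timeout_ms=60_000))  # 661875
```

Under these assumptions, going from 3 to 10 retries raises the worst case from about 4 minutes to about 11 minutes, and almost all of that time comes from the per-try timeout rather than the backoff.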

Troubleshooting¶

Problem 1: Requests Time Out Too Quickly¶

Symptoms:
- Many 504 Gateway Timeout errors
- Logs show "upstream timed out"

Solution:

timeout:
  read: "120s"      # ✅ Increase the read timeout

Diagnosis:

# Check provider-specific logs
kubectl logs -n gateway gateway-pod | grep timeout

Problem 2: Retries Do Not Work¶

Symptoms:
- Failed requests are not retried automatically
- Logs show only one attempt

Possible causes:

1. Retry is not enabled:

retry:
  enabled: true     # ✅ Must be true

2. Wrong retry conditions:

retry_on:
  - http_502        # ✅ Check that the actual error code is listed

3. Provider-specific limitations:
- Kong: no conditional retries (retry count only)
- Traefik: retry requires a middleware

Problem 3: Too Many Retries Overload the Backend¶

Symptoms:
- Backend shows high load
- Cascading failures

Solution:

retry:
  attempts: 2               # ✅ Reduce attempts
  base_interval: "100ms"    # ✅ Increase backoff
  max_interval: "1s"
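Why fewer attempts help can be shown with a rough load estimate. The sketch below is a simplification under stated assumptions: each try fails independently with the same probability, `attempts` is treated as the number of retries, and every failure is retried until the limit is reached. The function name `expected_tries` is hypothetical.

```python
def expected_tries(failure_rate: float, retries: int) -> float:
    """Expected upstream requests per client request: 1 + p + p**2 + ... + p**retries."""
    return sum(failure_rate ** k for k in range(retries + 1))


# Backend fully down: every client request is multiplied by (retries + 1).
print(expected_tries(1.0, 3))  # 4.0
print(expected_tries(1.0, 2))  # 3.0

# Half of all tries fail, 2 retries allowed:
print(expected_tries(0.5, 2))  # 1.75
```

The multiplier is highest exactly when the backend is least able to handle extra traffic, which is why reducing attempts and lengthening the backoff both relieve pressure.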

Problem 4: Connection Timeouts Too Short¶

Symptoms:
- Many "connection timeout" errors
- Backend is slow but reachable

Solution:

timeout:
  connect: "10s"    # ✅ Increase the connection timeout

Problem 5: Idle Connections Are Closed Too Early¶

Symptoms:
- Keep-alive does not work
- Many new connections

Solution:

timeout:
  idle: "600s"      # ✅ Increase the idle timeout (10 minutes)

Problem 6: Exponential Backoff Too Aggressive¶

Symptoms:
- Retries happen too slowly because the wait between attempts grows too long
- High latency even when a retry succeeds

Solution:

retry:
  base_interval: "10ms"     # ✅ Reduce the base interval
  max_interval: "100ms"     # ✅ Reduce the max interval

Provider Comparison¶

Feature Matrix¶

Feature	Envoy	Kong	APISIX	Traefik	Nginx	HAProxy
Timeouts
Connection Timeout	✅	✅	✅	✅	✅	✅
Send/Write Timeout	✅	✅	✅	⚠️	✅	✅
Read Timeout	✅	✅	✅	✅	✅	✅
Idle Timeout	✅	⚠️	⚠️	✅	⚠️	✅
Retries
Retry Attempts	✅	✅	✅	✅	✅	✅
Exponential Backoff	✅	❌	⚠️	✅	❌	❌
Linear Backoff	✅	❌	⚠️	⚠️	✅	⚠️
Retry Conditions	✅	❌	✅	⚠️	✅	✅
Status Code-based Retry	✅	❌	✅	⚠️	✅	✅
Per-Try Timeout	✅	❌	✅	⚠️	✅	⚠️

Legend:
- ✅ = Fully supported
- ⚠️ = Partially supported or available through an alternative
- ❌ = Not supported

Provider-Specific Strengths¶

Envoy:
- ✅ Most comprehensive timeout configuration
- ✅ Granular retry conditions
- ✅ Per-try timeout
- ✅ Retriable status codes

Kong:
- ✅ Simple configuration
- ✅ Timeouts in milliseconds (precise)
- ❌ No conditional retries
- ❌ No backoff

APISIX:
- ✅ Plugin-based (flexible)
- ✅ Status code-based retries
- ⚠️ Retry timeout acts as an overall timeout
- ⚠️ No native exponential backoff

Traefik:
- ✅ Middleware-based (reusable)
- ✅ Exponential backoff
- ⚠️ Limited retry conditions (no granular control)

Nginx:
- ✅ Flexible proxy_next_upstream directive
- ✅ Status code-based retries
- ✅ Per-try timeout
- ❌ No exponential backoff

HAProxy:
- ✅ retry-on with many conditions
- ✅ Status code-based retries
- ✅ Very stable and performant
- ⚠️ No native exponential backoff

Recommendations¶

For maximum flexibility: Envoy
- Best retry configuration
- Granular timeout control
- Per-try timeout

For simplicity: Kong
- Simple configuration
- Sufficient for most use cases
- Well documented

For a plugin ecosystem: APISIX
- Plugin-based architecture
- Flexible extensibility
- Lua scripting for custom logic

For cloud-native setups: Traefik
- Kubernetes-native
- Middleware approach
- Auto-discovery

For performance: Nginx or HAProxy
- Very performant
- Low latency
- Battle-tested

Summary¶

Timeout and retry policies are essential features for resilient API gateway deployments:

Timeouts prevent request hangs and thread blocking
Retries increase availability when failures are transient
Exponential backoff helps prevent thundering-herd problems
Provider-specific implementations offer different trade-offs
Use-case-specific configuration is key to good performance

Next steps:
- Implement timeouts and retries for your services
- Monitor retry rates and timeout metrics
- Tune parameters based on production traffic
- Combine with a circuit breaker for maximum resilience

See also:
- Circuit Breaker Guide
- Health Checks & Load Balancing
- Rate Limiting