diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index e81e997..07d7800 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -145,7 +145,7 @@ Brief description.
 ### Guidelines
 
 - Use **kebab-case** for the directory name and the `name` field.
-- Keep `SKILL.md` **under 500 lines**. Move detailed reference material to separate files in the same directory.
+- Keep `SKILL.md` **under 1000 lines**. Move detailed reference material to separate files in the same directory.
 - Include concrete, executable steps with code examples where helpful.
 - Make the skill self-contained — a reader should be able to follow it without outside context.
 
diff --git a/Claude/skills/production-observability/.gitkeep b/Claude/skills/production-observability/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/Claude/skills/production-observability/SKILL.md b/Claude/skills/production-observability/SKILL.md
new file mode 100644
index 0000000..8f6bb9e
--- /dev/null
+++ b/Claude/skills/production-observability/SKILL.md
@@ -0,0 +1,1047 @@
+---
+name: production-observability
+description: Use when adding logging, metrics, tracing, or alerting to production systems; debugging intermittent failures; or instrumenting code for monitoring.
+---
+
+# Production Observability
+
+This skill ensures production systems are instrumented for visibility, debugging, and operational excellence through logging, metrics, tracing, and alerting.
+
+## When to Activate
+
+- Debugging production-only or intermittent failures
+- Adding monitoring to new features before deployment
+- Investigating performance bottlenecks or error spikes
+- Implementing SLOs, SLIs, or SLA monitoring
+- Setting up alerts for critical thresholds
+- When systematic-debugging requires production data
+
+Do NOT use for local development bugs (use systematic-debugging).
+
+## Logging
+
+### Structured Logging
+
+Structured logs make querying and analysis possible. Always use JSON format with contextual fields.
+
+#### FAIL: Unstructured Logging
+```java
+log.info("User login failed for user " + username + " at " + Instant.now());
+// Hard to query, parse, or correlate
+```
+
+#### PASS: Structured Logging
+```java
+log.info("login_attempt username={} timestamp={} ip={}", username, Instant.now(), clientIp);
+// Queryable: username="john", timestamp="..."
+```
+
+#### TypeScript/Node.js Example
+```typescript
+import pino from 'pino';
+
+const logger = pino();
+
+logger.info({ username, timestamp: new Date(), ip: clientIp }, 'login_attempt');
+
+// Produces JSON:
+// {"level":30,"time":1704067200000,"username":"john","timestamp":"...","ip":"...","msg":"login_attempt"}
+```
+
+#### Python Example
+```python
+from datetime import datetime
+
+import structlog
+
+logger = structlog.get_logger()
+logger.info("login_attempt", username=username, timestamp=datetime.now(), ip=client_ip)
+
+# Produces JSON:
+# {"event":"login_attempt","username":"john","timestamp":"...","ip":"...","level":"info"}
+```
+
+#### Go Example
+```go
+import "go.uber.org/zap"
+
+logger.Info("login_attempt",
+    zap.String("username", username),
+    zap.Time("timestamp", time.Now()),
+    zap.String("ip", clientIp),
+)
+// Produces JSON:
+// {"level":"info","ts":1704067200.123,"msg":"login_attempt","username":"john","ip":"..."}
+```
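+
+Once fields are structured, bind request-scoped context once with a child logger rather than repeating it on every call. A minimal sketch using pino's `child()` API (the `requestId` and `userId` values are placeholders):
+
+```typescript
+import pino from 'pino';
+
+const logger = pino();
+
+// Bind request-scoped fields once; every line from reqLogger inherits them
+const reqLogger = logger.child({ requestId: 'req-123', userId: 42 });
+
+reqLogger.info('login_attempt');  // JSON output includes requestId and userId
+reqLogger.warn('rate_limited');   // same bound context, no repetition
+```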
+
+### Log Levels
+
+Use levels appropriately to control volume and enable production filtering.
+
+| Level | Use For | Example |
+|-------|---------|---------|
+| ERROR | Failures requiring intervention | `log.error("payment_failed", error=err)` |
+| WARN | Unexpected but non-failing conditions | `log.warn("cache_miss", key=cacheKey)` |
+| INFO | Normal operation, business events | `log.info("order_created", orderId=123)` |
+| DEBUG | Detailed execution flow | `log.debug("cache_hit", key=cacheKey, value=result)` |
+| TRACE | Very detailed, typically disabled | `log.trace("db_query", sql=query, params=params)` |
+
+**Production filtering:**
+```yaml
+# config.yaml
+logging:
+  level: INFO  # Production default
+  packages:
+    important_service: DEBUG  # Specific packages
+    chatty_dependency: WARN
+```
+
+### Log Sampling
+
+High-volume events should be sampled to reduce costs while maintaining visibility.
+
+```java
+// Sample 10% of debug logs (check the level first to avoid the random call)
+if (logger.isDebugEnabled() && Math.random() < 0.1) {
+    logger.debug("cache_debug key={} value={}", key, value);
+}
+```
+
+### Correlation IDs
+
+Always include correlation IDs to trace requests across services.
+
+```java
+// Filter to extract/create correlation ID
+@Component
+public class CorrelationFilter extends OncePerRequestFilter {
+    private static final String CORRELATION_ID = "X-Correlation-ID";
+
+    @Override
+    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response, FilterChain chain)
+            throws ServletException, IOException {
+        String correlationId = request.getHeader(CORRELATION_ID);
+        if (correlationId == null) {
+            correlationId = UUID.randomUUID().toString();
+        }
+        MDC.put("correlationId", correlationId);
+
+        try {
+            chain.doFilter(request, response);
+        } finally {
+            response.setHeader(CORRELATION_ID, correlationId);
+            MDC.remove("correlationId");
+        }
+    }
+}
+
+// With a JSON encoder, the MDC's correlationId is attached to every log line
+log.info("api_call endpoint={}", requestPath);
+```
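+
+The same pattern in Node can use `AsyncLocalStorage` so the ID survives async boundaries. A minimal sketch for Express, assuming the same `X-Correlation-ID` header as the Java filter above (`correlationStore` is a name introduced here):
+
+```typescript
+import { AsyncLocalStorage } from 'node:async_hooks';
+import { randomUUID } from 'node:crypto';
+import express from 'express';
+import pino from 'pino';
+
+const logger = pino();
+const correlationStore = new AsyncLocalStorage<string>();
+
+const app = express();
+app.use((req, res, next) => {
+  const id = req.header('X-Correlation-ID') ?? randomUUID();
+  res.setHeader('X-Correlation-ID', id);
+  // Everything downstream of this request sees the same ID
+  correlationStore.run(id, next);
+});
+
+// Anywhere in a handler or service:
+logger.info({ correlationId: correlationStore.getStore() }, 'api_call');
+```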
+
+### Sensitive Data Redaction
+
+Never log passwords, tokens, credit cards, PII, or secrets.
+
+#### FAIL: Logging Sensitive Data
+```typescript
+console.log("User login", { email, password, creditCard });
+```
+
+#### PASS: Redacted Logging
+```typescript
+console.log("User login", { email, userId, cardLast4: card.last4 });
+```
+
+#### Automated Redaction Pattern
+```typescript
+function redact(obj: any): any {
+  const sensitiveFields = ['password', 'token', 'creditcard', 'ssn', 'apikey'];
+  const redacted = { ...obj };  // Note: shallow copy; recurse for nested objects
+
+  Object.keys(redacted).forEach(key => {
+    if (sensitiveFields.some(field => key.toLowerCase().includes(field))) {
+      redacted[key] = '[REDACTED]';
+    }
+  });
+
+  return redacted;
+}
+
+logger.info("user_data", redact(userData));
+```
+
+#### Java Redaction
+```java
+public class RedactingConverter extends MessageConverter {
+
+    private static final Pattern SENSITIVE_PATTERN = Pattern.compile(
+        "(password|token|creditCard|apiKey)=[^&\\s]+", Pattern.CASE_INSENSITIVE
+    );
+
+    @Override
+    public String convert(ILoggingEvent event) {  // MessageConverter#convert is public
+        String message = super.convert(event);
+        return SENSITIVE_PATTERN.matcher(message).replaceAll("$1=[REDACTED]");
+    }
+}
+```
+
+### Verification Steps
+
+- [ ] All logs use structured format (JSON)
+- [ ] Log levels set appropriately (INFO for production, DEBUG for staging)
+- [ ] High-volume events sampled (≤10% rate)
+- [ ] Correlation ID included in all service logs
+- [ ] Sensitive data redacted (passwords, tokens, PII)
+- [ ] No console.log (use proper logger)
+- [ ] Error logs include stack traces and context
+
+## Metrics
+
+### Metric Types
+
+| Type | Use For | Cardinality |
+|------|---------|-------------|
+| Counter | Things that only increase (requests, errors, bytes sent) | Low |
+| Gauge | Current value (connections, memory, queue size) | Very Low |
+| Histogram | Quantiles and distributions (request latency, response sizes) | Medium |
+| Summary | Similar to histogram, computed on the client | Medium |
+
+### Counter Metrics
+
+Count monotonic events.
+
+```java
+// Counter for tracking HTTP requests
+private final Counter httpRequests;
+
+public MetricsController(MeterRegistry registry) {
+    this.httpRequests = Counter.builder("http.requests")
+        .description("HTTP requests")
+        .tag("method", "GET")  // Low cardinality tag
+        .tag("status", "200")
+        .register(registry);
+}
+
+httpRequests.increment();
+```
+
+```typescript
+import { Counter } from 'prom-client';
+
+const httpRequestCounter = new Counter({
+  name: 'http_requests_total',
+  help: 'HTTP requests total',
+  labelNames: ['method', 'status', 'route'],
+});
+
+httpRequestCounter.inc({ method: 'GET', status: '200', route: '/users' });
+```
+
+```python
+from prometheus_client import Counter
+
+http_requests = Counter('http_requests_total', 'HTTP requests total', ['method', 'status', 'route'])
+http_requests.labels(method='GET', status='200', route='/users').inc()
+```
+
+### Gauge Metrics
+
+Track instantaneous values.
+
+```java
+// Gauge for active database connections
+private final AtomicInteger activeConnections = new AtomicInteger(0);
+
+Gauge.builder("db.connections.active", activeConnections::get)
+    .description("Active database connections")
+    .register(registry);
+```
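+
+In prom-client, the same gauge can be computed lazily with a `collect()` callback that runs at scrape time. A minimal sketch; `jobQueue` stands in for whatever structure is being measured:
+
+```typescript
+import { Gauge } from 'prom-client';
+
+const jobQueue: unknown[] = [];  // placeholder for a real queue
+
+const queueDepth = new Gauge({
+  name: 'job_queue_depth',
+  help: 'Jobs currently waiting in the queue',
+  // collect() runs on every /metrics scrape, so the gauge is always current
+  collect() {
+    this.set(jobQueue.length);
+  },
+});
+```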
+
+### Histogram/Summary Metrics
+
+Track distributions (latency, sizes).
+
+```java
+// Micrometer has no standalone Histogram type; use a Timer with SLO buckets
+private final Timer requestTimer;
+
+public MetricsController(MeterRegistry registry) {
+    this.requestTimer = Timer.builder("http.request.duration")
+        .description("HTTP request latency")
+        .serviceLevelObjectives(
+            Duration.ofMillis(10),
+            Duration.ofMillis(50),
+            Duration.ofMillis(100),
+            Duration.ofMillis(500),
+            Duration.ofMillis(1000)
+        )
+        .register(registry);
+}
+
+Timer.Sample sample = Timer.start(registry);
+// ... do work
+sample.stop(requestTimer);
+```
+
+```typescript
+import { Histogram } from 'prom-client';
+
+const httpRequestDuration = new Histogram({
+  name: 'http_request_duration_seconds',
+  help: 'HTTP request duration in seconds',
+  labelNames: ['method', 'route', 'status'],
+  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
+});
+
+const end = httpRequestDuration.startTimer();
+// ... do work
+end({ method: 'GET', route: '/users', status: '200' });
+```
+
+### Metric Naming Conventions
+
+Follow naming standards for consistency.
+
+| Pattern | Example | Use |
+|---------|---------|-----|
+| `<name>_total` | `http_requests_total` | Counters |
+| `<noun>_<state>` | `db_connections_active` | Gauges |
+| `<name>_duration_seconds` | `http_request_duration_seconds` | Duration |
+| `<name>_size_bytes` | `api_response_size_bytes` | Size |
+
+**Good patterns:**
+```java
+Counter.builder("http.requests.total")  // ✅ Clear
+    .register(registry);
+
+Counter.builder("http_req_2xx")  // ✅ Specific
+    .register(registry);
+
+Counter.builder("user_logins")  // ❌ Missing the _total suffix
+    .register(registry);
+```
+
+### Cardinality Management
+
+High cardinality metrics (many unique tag values) cause performance issues and memory bloat.
+
+#### FAIL: High Cardinality
+```java
+// user_id has unlimited values - DON'T DO THIS
+Counter.builder("http.requests")
+    .tag("user_id", userId.toString())  // ❌ Millions of unique values
+    .register(registry);
+```
+
+#### PASS: Low Cardinality
+```java
+// Use aggregates instead
+Counter.builder("http.requests")
+    .tag("method", request.getMethod())  // ✅ Limited (GET, POST, PUT, DELETE)
+    .tag("status", Integer.toString(response.getStatus()))  // ✅ Limited (200, 404, 500)
+    .register(registry);
+```
+
+#### Cardinality Guidelines
+| Tag Type | Max Unique Values | Safe? |
+|----------|-------------------|-------|
+| HTTP method | ~10 | ✅ Yes |
+| Status code | ~50 | ✅ Yes |
+| Service name | ~100 | ✅ Yes |
+| Customer ID | ~10,000+ | ❌ No |
+| Request ID | Unlimited | ❌ No |
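+
+When a label is naturally unbounded, normalize it before use. A sketch of the idea with the prom-client counter defined earlier (the regexes are illustrative, not exhaustive):
+
+```typescript
+// Collapse unbounded paths like /users/12345 into the low-cardinality
+// template /users/:id before using the value as a metric label
+function normalizeRoute(path: string): string {
+  return path
+    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '/:uuid')
+    .replace(/\/\d+/g, '/:id');
+}
+
+httpRequestCounter.inc({ method: 'GET', route: normalizeRoute('/users/12345'), status: '200' });
+// Recorded route label: "/users/:id"
+```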
+
+### Business vs Infrastructure Metrics
+
+**Infrastructure (automated):**
+- CPU usage
+- Memory usage
+- Disk I/O
+- Network throughput
+- Database connections
+
+**Business (require instrumentation):**
+- Orders per minute
+- User registrations
+- Payment success rate
+- Search click-through rate
+- Feature adoption rate
+
+```java
+// Business metric example
+private final Counter ordersCreated;
+
+public OrderService(MeterRegistry registry) {
+    this.ordersCreated = Counter.builder("orders.total")
+        .description("Total orders created")
+        .tag("status", "success")
+        .register(registry);
+}
+
+public Order createOrder(CreateOrderRequest request) {
+    Order order = // ... create order
+    ordersCreated.increment();
+    return order;
+}
+```
+
+### Verification Steps
+
+- [ ] All metrics follow naming conventions
+- [ ] Cardinality managed properly (no high-cardinality tags)
+- [ ] Both infrastructure and business metrics instrumented
+- [ ] Histograms used for latency (include relevant percentiles)
+- [ ] Counters for monotonic events
+- [ ] Gauges for current state
+- [ ] Metrics visible in dashboards (Datadog, Prometheus, Grafana)
+
+## Distributed Tracing
+
+### Tracing Fundamentals
+
+Tracing follows requests across multiple services to visualize latency and identify bottlenecks.
+
+**Span components:**
+- Operation name (e.g., "http.request", "db.query")
+- Start/stop timestamps
+- Tags (metadata)
+- Events (timed annotations)
+- Links (to other traces)
+
+### OpenTelemetry Tracing
+
+#### Java/Spring Boot Example
+```java
+@Service
+class UserService {
+
+    @WithSpan("UserService.getUserById")  // Span created by the OTel instrumentation annotation
+    public User getUserById(Long id) {
+        return userRepository.findById(id);
+    }
+
+    public User createUser(CreateUserRequest request) {
+        Tracer tracer = GlobalOpenTelemetry.getTracer("user-service");
+        Span span = tracer.spanBuilder("UserService.createUser")
+            .setSpanKind(SpanKind.SERVER)
+            .startSpan();
+
+        try (Scope scope = span.makeCurrent()) {
+            // Add tags to span
+            span.setAttribute("user.email", request.getEmail());
+            span.setAttribute("user.role", "customer");
+
+            User user = userRepository.save(new User(request.getEmail()));
+
+            span.setStatus(StatusCode.OK);
+            return user;
+        } catch (Exception e) {
+            span.recordException(e);
+            span.setStatus(StatusCode.ERROR, e.getMessage());
+            throw e;
+        } finally {
+            span.end();  // Always end span
+        }
+    }
+}
+```
+
+#### TypeScript/Node.js Example
+```typescript
+import { trace, SpanStatusCode } from '@opentelemetry/api';
+
+const tracer = trace.getTracer('user-service');
+
+async function createUser(request: CreateUserRequest) {
+  const span = tracer.startSpan('createUser');
+
+  try {
+    // Add attributes
+    span.setAttribute('user.email', request.email);
+    span.setAttribute('user.role', 'customer');
+
+    // Create user
+    const user = await userRepository.create(request);
+
+    span.setStatus({ code: SpanStatusCode.OK });
+    return user;
+  } catch (error) {
+    span.recordException(error);
+    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
+    throw error;
+  } finally {
+    span.end();
+  }
+}
+```
+
+#### Python Example
+```python
+from opentelemetry import trace
+from opentelemetry.trace import Status, StatusCode
+
+tracer = trace.get_tracer(__name__)
+
+def create_user(request: CreateUserRequest) -> User:
+    with tracer.start_as_current_span("createUser") as span:
+        # Set attributes
+        span.set_attribute("user.email", request.email)
+        span.set_attribute("user.role", "customer")
+
+        # Create user
+        user = user_repository.create(request)
+
+        span.set_status(Status(StatusCode.OK))
+        return user
+```
+
+#### Go Example
+```go
+import (
+	"context"
+
+	"go.opentelemetry.io/otel"
+	"go.opentelemetry.io/otel/attribute"
+	"go.opentelemetry.io/otel/codes"
+)
+
+func CreateUser(ctx context.Context, req *CreateUserRequest) (*User, error) {
+	tracer := otel.Tracer("user-service")
+	ctx, span := tracer.Start(ctx, "createUser")
+	defer span.End() // End the span on both the success and error paths
+
+	// Set attributes
+	span.SetAttributes(
+		attribute.String("user.email", req.Email),
+		attribute.String("user.role", "customer"),
+	)
+
+	// Create user
+	user, err := userRepository.Create(ctx, req)
+	if err != nil {
+		span.RecordError(err)
+		span.SetStatus(codes.Error, err.Error())
+		return nil, err
+	}
+
+	span.SetStatus(codes.Ok, "")
+	return user, nil
+}
+```
+
+### Distributed Context Propagation
+
+Trace context must flow across service boundaries.
+
+```java
+// HTTP client with propagation
+@Component
+class ExternalServiceClient {
+
+    private final OpenTelemetry openTelemetry;
+    private final Tracer tracer;
+    private final RestTemplate restTemplate;
+
+    public ExternalApiResponse callExternal(String userId) {
+        Span span = tracer.spanBuilder("external-service.call")
+            .setSpanKind(SpanKind.CLIENT)
+            .startSpan();
+
+        try (Scope scope = span.makeCurrent()) {
+            HttpHeaders headers = new HttpHeaders();
+            // Inject W3C traceparent/tracestate headers into the outgoing request
+            openTelemetry.getPropagators().getTextMapPropagator().inject(
+                Context.current(),
+                headers,
+                HttpHeaders::set
+            );
+
+            ResponseEntity<ExternalApiResponse> response = restTemplate.exchange(
+                "https://external-service.com/user/" + userId,
+                HttpMethod.GET,
+                new HttpEntity<>(headers),
+                ExternalApiResponse.class
+            );
+
+            return response.getBody();
+        } finally {
+            span.end();
+        }
+    }
+}
+```
+
+```typescript
+// Propagation in HTTP calls
+import * as api from '@opentelemetry/api';
+
+async function callExternal(userId: string) {
+  const span = api.trace.getTracer('client').startSpan('external-service.call');
+
+  try {
+    const ctx = api.trace.setSpan(api.context.active(), span);
+
+    // Inject the W3C traceparent/tracestate headers for the active span
+    const headers: Record<string, string> = {};
+    api.propagation.inject(ctx, headers);
+
+    const response = await fetch(`https://external-service.com/user/${userId}`, { headers });
+
+    return response.json();
+  } finally {
+    span.end();
+  }
+}
+```
+
+### Baggage Propagation
+
+Baggage carries metadata across services without timing measurements.
+
+```java
+// In the request handler
+Baggage baggage = Baggage.builder()
+    .put("tenant.id", tenantId)
+    .put("user.segment", userSegment)
+    .build();
+
+try (Scope scope = Context.current().with(baggage).makeCurrent()) {
+    // Downstream calls made in this scope carry the baggage
+}
+
+// In the downstream service
+String tenantId = Baggage.fromContext(Context.current()).getEntryValue("tenant.id");
+```
+
+### Sampling
+
+High-traffic systems must sample traces to reduce costs.
+
+```yaml
+# application.yaml
+otel:
+  traces:
+    sampler:
+      type: parentbased_traceidratio
+      argument: 0.1  # Sample 10% of traces
+```
+
+Or custom sampling:
+```java
+class BusinessAwareSampler implements Sampler {
+
+    @Override
+    public SamplingResult shouldSample(Context parentContext, String traceId, String name,
+                                       SpanKind spanKind, Attributes attributes, List<LinkData> parentLinks) {
+        // Always sample spans already flagged as errors
+        if (Boolean.TRUE.equals(attributes.get(AttributeKey.booleanKey("error")))) {
+            return SamplingResult.recordAndSample();
+        }
+
+        // Sample 5% of GET requests to /health
+        if ("http.request".equals(name)
+                && "/health".equals(attributes.get(AttributeKey.stringKey("http.target")))) {
+            return Math.random() < 0.05 ? SamplingResult.recordAndSample() : SamplingResult.drop();
+        }
+
+        // Default: don't sample
+        return SamplingResult.drop();
+    }
+
+    @Override
+    public String getDescription() {
+        return "BusinessAwareSampler";
+    }
+}
+```
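+
+A sampler only takes effect once the SDK is wired up. A minimal sketch of the equivalent 10% parent-based setup in Node, assuming the standard OpenTelemetry packages and an OTLP-compatible backend:
+
+```typescript
+import { NodeSDK } from '@opentelemetry/sdk-node';
+import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
+import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
+
+const sdk = new NodeSDK({
+  // Mirrors the YAML above: honor the parent's decision, otherwise keep 10%
+  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
+  traceExporter: new OTLPTraceExporter(),
+});
+
+sdk.start();
+```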
+
+### Verification Steps
+
+- [ ] All services instrumented with tracing
+- [ ] Trace context propagated across service boundaries
+- [ ] Appropriate sampling configured
+- [ ] Spans include relevant attributes/tags
+- [ ] Error conditions recorded in spans
+- [ ] Traces visible in tracing backend (Jaeger, Tempo, Datadog)
+- [ ] Span names follow naming conventions (e.g., "http.request", "db.query")
+
+## Alerting
+
+### Alert Design Principles
+
+**Alert fatigue** causes teams to ignore real issues. Design alerts to be actionable.
+
+| Alert Type | Purpose | Example |
+|------------|---------|---------|
+| Threshold | Fixed value exceeded | Error rate > 5% |
+| Anomaly | Deviation from the expected pattern | Latency spikes |
+| Composite | Multiple conditions | CPU > 80% AND requests increasing |
+| Watchdog | Is a service up? | Health check failing |
+
+### Alert Hierarchy
+
+```
+P0 (Critical) - Wake up team immediately
+  - Service completely down
+  - Data loss
+  - Security breach
+
+P1 (High) - Page within 5 minutes
+  - Major functionality broken
+  - SLA violations
+  - Error rate > 5%
+
+P2 (Medium) - Message within 30 minutes
+  - Degraded performance
+  - Minor feature broken
+  - Error rate > 1%
+
+P3 (Low) - Daily digest
+  - Resource utilization
+  - Informational alerts
+  - Trend data
+```
+
+### Alert Examples
+
+#### Critical Alert (P0)
+```yaml
+alert: ServiceDown
+expr: up{job="api-server"} == 0
+for: 1m
+labels:
+  severity: critical
+annotations:
+  summary: "API server {{ $labels.instance }} is down"
+  description: "API server has been down for more than 1 minute."
+
+routes:
+  - receiver: oncall
+    match_re:
+      severity: critical
+```
+
+#### High Priority Alert (P1)
+```yaml
+alert: HighErrorRate
+expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
+for: 2m
+labels:
+  severity: high
+annotations:
+  summary: "High error rate detected"
+  description: "{{ $value | humanizePercentage }} of requests are errors"
+
+routes:
+  - receiver: slack-engineering
+    match_re:
+      severity: high
+```
+
+#### Performance Alert (P2)
+```yaml
+alert: HighLatency
+expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
+for: 5m
+labels:
+  severity: medium
+annotations:
+  summary: "99th percentile latency is high"
+  description: "P99 latency is {{ $value | humanizeDuration }}"
+```
+
+### Alert Suppression
+
+Prevent alert storms during expected events.
+
+```yaml
+suppress_alerts:
+  - name: Deployments
+    match:
+      severity: warning|high|critical
+    condition: deployment_in_progress == true
+    until: deployment_complete
+
+  - name: Scheduled Maintenance
+    match:
+      severity: warning|high
+    condition: maintenance_window == true
+    duration: 2h
+```
+
+### Alert Routing
+
+Route to appropriate teams.
+
+```yaml
+routes:
+  - receiver: oncall-platform
+    match_re:
+      service: (api-server|user-service|payment-service)
+
+  - receiver: oncall-data
+    match_re:
+      service: (database|cache|queue|storage)
+
+  - receiver: slack-qa
+    match_re:
+      service: test-.*
+    match:
+      severity: medium|low
+
+  - receiver: slack-security
+    match:
+      alertname: SecurityEvent
+```
+
+### On-Call Rotation
+
+```yaml
+oncall:
+  team: platform
+  rotation:
+    - engineer: alice
+      timezone: UTC
+      start: 2025-01-01T00:00:00Z
+      duration: 7d
+    - engineer: bob
+      timezone: America/Los_Angeles
+      start: 2025-01-08T00:00:00Z
+      duration: 7d
+
+  escalation:
+    - wait: 15m
+      notify: alice
+    - wait: 30m
+      notify: alice,bob  # Escalate to backup
+    - wait: 45m
+      notify: platform-engineering-manager
+```
+
+### SLO/SLA Alerting
+
+Alert on SLO burn rate, not just thresholds.
+
+```yaml
+# SLO: 99.9% availability (0.1% error budget)
+alert: SLOBurnRateCritical
+expr: |
+  (
+    1 - sum(rate(http_requests_total{status=~"2..|3.."}[30m]))
+    /
+    sum(rate(http_requests_total[30m]))
+  ) > 0.001  # error rate above the 0.1% budget
+for: 5m
+annotations:
+  summary: "SLO burn rate is critical"
+  description: "Error budget burning at {{ $value | humanizePercentage }} per hour"
+
+alert: SLOBurnRateWarning
+expr: |
+  (
+    1 - sum(rate(http_requests_total{status=~"2..|3.."}[1h]))
+    /
+    sum(rate(http_requests_total[1h]))
+  ) > 0.0005  # error rate above half the budget
+for: 15m
+```
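+
+A stricter variant is the multi-window, multi-burn-rate pattern from the Google SRE workbook: page only when the budget is burning fast over both a long and a short window. A sketch under the same 99.9% SLO (14.4x is the conventional fast-burn factor):
+
+```yaml
+alert: SLOFastBurn
+expr: |
+  (
+    1 - sum(rate(http_requests_total{status=~"2..|3.."}[1h]))
+    /
+    sum(rate(http_requests_total[1h]))
+  ) > (14.4 * 0.001)
+  and
+  (
+    1 - sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
+    /
+    sum(rate(http_requests_total[5m]))
+  ) > (14.4 * 0.001)
+labels:
+  severity: critical
+```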
+
+### Verification Steps
+
+- [ ] Alerts have clear actionability (what to do when triggered)
+- [ ] Priorities assigned (P0, P1, P2, P3)
+- [ ] On-call rotation configured
+- [ ] Escalation paths defined
+- [ ] Suppression rules for expected events
+- [ ] Alerts tested (trigger test incident)
+- [ ] Documentation available (runbook, escalation matrix)
+
+## Integration Patterns
+
+### Observability Stack
+
+```
+Application → Logging Library (Winston, SLF4J)
+            → Metrics Library (OpenTelemetry, Micrometer)
+            → Tracing Library (OpenTelemetry)
+
+Logs    → Log Aggregation (Logstash, Fluentd)
+        → Storage and Indexing (Elasticsearch, Loki)
+        → Visualization (Kibana, Grafana)
+
+Metrics → Collection Agent (Prometheus, Datadog Agent)
+        → Time Series DB (Prometheus, InfluxDB)
+        → Alerting (Prometheus Alertmanager, PagerDuty)
+
+Traces  → Collector (OpenTelemetry Collector)
+        → Backend (Jaeger, Tempo, Datadog APM)
+        → Analysis (Dashboards, Root Cause Analysis)
+```
+
+### Cross-Language Observability
+
+OpenTelemetry provides language-agnostic standards.
+
+```
+┌─────────────────┐
+│   API Service   │ (Java/Spring)
+└────────┬────────┘
+         │ HTTP Request
+         │ (traceparent, baggage)
+         ▼
+┌─────────────────┐
+│  User Service   │ (Node.js/TypeScript)
+└────────┬────────┘
+         │ gRPC Call
+         │ (traceparent, baggage)
+         ▼
+┌─────────────────┐
+│ Payment Service │ (Python)
+└─────────────────┘
+
+Single trace spans all services, showing end-to-end latency.
+```
+
+### Log Correlation with Tracing
+
+Inject trace ID into logs for cross-referencing.
+
+```java
+// MDC-based correlation
+import org.slf4j.MDC;
+import io.opentelemetry.api.trace.Span;
+
+@Component
+public class TraceContextFilter extends OncePerRequestFilter {
+
+    @Override
+    protected void doFilterInternal(HttpServletRequest request,
+                                    HttpServletResponse response,
+                                    FilterChain chain) throws ServletException, IOException {
+        Span span = Span.current();
+        if (span.getSpanContext().isValid()) {  // Span.current() is never null; check validity
+            MDC.put("traceId", span.getSpanContext().getTraceId());
+            MDC.put("spanId", span.getSpanContext().getSpanId());
+        }
+
+        try {
+            chain.doFilter(request, response);
+        } finally {
+            MDC.remove("traceId");
+            MDC.remove("spanId");
+        }
+    }
+}
+
+// With a JSON encoder, every log line now carries traceId/spanId from the MDC
+log.info("api_request endpoint={}", "/api/users");
+```
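+
+The same correlation in Node can ride on pino's `mixin` hook, which merges extra fields into every log call. A short sketch using pino and `@opentelemetry/api`:
+
+```typescript
+import pino from 'pino';
+import { trace } from '@opentelemetry/api';
+
+const logger = pino({
+  // mixin() runs on every log call; stamp the active trace/span IDs
+  mixin() {
+    const span = trace.getActiveSpan();
+    if (!span) return {};
+    const { traceId, spanId } = span.spanContext();
+    return { traceId, spanId };
+  },
+});
+
+logger.info({ endpoint: '/api/users' }, 'api_request');
+// {"level":30,...,"traceId":"...","spanId":"...","endpoint":"/api/users","msg":"api_request"}
+```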
+
+## Pre-Deployment Checklist
+
+Before ANY production deployment:
+
+### Logging
+- [ ] Structured logging format (JSON) configured
+- [ ] Log levels set appropriately (INFO in production)
+- [ ] Correlation ID propagation across services
+- [ ] Sensitive data redaction implemented (passwords, tokens, PII)
+- [ ] Log aggregation configured (ELK, CloudWatch, Datadog)
+- [ ] Log retention policy configured (90 days minimum for compliance)
+- [ ] No console.log or print statements in production code
+
+### Metrics
+- [ ] Infrastructure metrics (CPU, memory, disk, network)
+- [ ] Business metrics instrumented (orders, signups, conversion rate)
+- [ ] Metric naming conventions followed
+- [ ] Cardinality managed properly (no high-cardinality tags)
+- [ ] Histograms/buckets configured for latency (P50, P90, P99)
+- [ ] Metrics endpoint exposed (/metrics for Prometheus)
+- [ ] Metrics collection configured (Datadog Agent, Prometheus)
+
+### Tracing
+- [ ] All services instrumented with OpenTelemetry
+- [ ] Trace context propagated across service boundaries
+- [ ] Appropriate sampling configured (10% sample rate for high-traffic)
+- [ ] Span names follow conventions (e.g., "http.request", "db.query")
+- [ ] Spans include relevant attributes (user.id, service.name, error.type)
+- [ ] Errors recorded in spans with stack traces
+- [ ] Tracing backend configured (Jaeger, Tempo, Datadog APM)
+
+### Alerting
+- [ ] Critical alerts defined (P0 - service down, data loss, security)
+- [ ] High-priority alerts defined (P1 - major functionality, SLA violations)
+- [ ] On-call rotation configured
+- [ ] Escalation paths defined (15m, 30m, 45m)
+- [ ] Alert routing configured (oncall, Slack, PagerDuty)
+- [ ] Suppression rules for expected events (deployments, maintenance)
+- [ ] SLO/SLA alerting configured (error budget burn rate)
+- [ ] Runbook documentation available
+
+### General
+- [ ] Observability dashboard created (Grafana, Datadog)
+- [ ] SLOs defined and monitored
+- [ ] Error budget calculation configured
+- [ ] Observability tested in staging environment
+- [ ] Runbook created for common incidents
+- [ ] Postmortem process defined
+
+## Verification Checklist
+
+When debugging production issues:
+
+- [ ] Check logs for correlation ID matching the incident
+- [ ] Review metrics at time of incident (CPU, memory, error rate, latency)
+- [ ] Examine the trace spanning the affected services
+- [ ] Identify the component or service where the failure originated
+- [ ] Check whether the failure correlates with recent deployments
+- [ ] Check alert history (were there prior warnings?)
+- [ ] Review configuration changes (feature flags, environment variables)
+- [ ] Verify the status of external dependencies (third-party APIs, database, cache)
+- [ ] Load-test the scenario to reproduce the issue in a controlled environment
+
+## Common Mistakes
+
+- **Over-logging without sampling** → High costs, log noise, slower storage queries
+- **Missing correlation IDs** → Impossible to trace requests across services
+- **High-cardinality metrics** → Database performance issues, memory bloat
+- **Alerting on everything** → Alert fatigue, ignored real incidents
+- **No sensitive data redaction** → Security compliance violations
+- **No sampling for traces** → Unnecessary costs for high-traffic systems
+- **Missing context in logs** → `log.info('error')` provides no actionable information
+- **Unstructured logs** → Cannot query, parse, or extract insights
+- **Single metric per alert** → May cause false positives without composite conditions
+
+## Red Flags
+
+- **"Can't reproduce in dev"** → Add production instrumentation, log levels, trace sampling
+- **"Guessing without data"** → Use systematic-debugging to gather evidence
+- **"We'll monitor later"** → Instrument before deploy; create logs/metrics/traces first
+- **"High alert volume"** → Tune thresholds, add suppression, adjust priorities
+- **"Logs are unstructured text"** → Convert to JSON, add structured fields
+- **"Metrics exploding"** → Check cardinality, remove high-cardinality tags
+- **"Traces too expensive"** → Adjust sampling rate, filter unnecessary spans
+- **"Alert fatigue"** → Reduce alert count, increase threshold severity
+
+## Rationalizations Table
+
+| Excuse | Reality |
+|--------|---------|
+| "Too busy for monitoring" | Unmonitored code causes MTTR to increase 3-5x |
+| "Simple feature, no need" | Simple code fails too; instrumenting takes minutes |
+| "We'll add later" | Retroactive instrumentation misses the initial failures |
+| "Console.log is enough" | Unstructured logs are impossible to query or alert on |
+| "We have infrastructure alerts" | Business metrics are required for feature health |
+| "Tracing is expensive" | Proper sampling reduces costs by 90%+ |
+| "Logs consume too much storage" | Structured logs + sampling = efficient storage |
+| "Alert on everything is safer" | Alert fatigue causes real issues to be ignored |
+
+## Real-World Impact
+
+- **MTTR Reduction**: Proper observability reduces mean time to resolve by 60-80%
+- **Debugging Speed**: Correlation IDs enable root cause identification in minutes vs hours
+- **Performance**: Latency issues identified before SLA violations
+- **Capacity**: Proactive capacity planning based on metric trends
+- **Cost**: Structured logs + sampling reduce storage costs by 70-90%
+
+## Pair With Other Skills
+
+- **systematic-debugging** - For analyzing production issues with observability data
+- **test-driven-development** - For writing tests for instrumentation code
+- **security-review** - Ensure observability doesn't leak sensitive data
+
+---
+
+**Remember**: Observability is not optional. You cannot improve or fix what you cannot see. Production systems without observability are black boxes that inevitably fail unpredictably.
\ No newline at end of file