Feature/prod logging skill#9
Feature/prod logging skill#9abhex merged 8 commits intoarbisoft:feature/abhex/prod-logging-skillfrom
Conversation
ddc990d
into
arbisoft:feature/abhex/prod-logging-skill
There was a problem hiding this comment.
Pull request overview
Adds a substantially expanded production-observability skill guide covering logging, metrics, tracing, and alerting with multi-language examples and checklists.
Changes:
- Introduces a new
production-observabilityskill with extensive guidance, FAIL/PASS examples, and verification/pre-deployment checklists. - Adds multi-language instrumentation examples for structured logging, metrics, and OpenTelemetry tracing.
- Updates contributing guidance to allow larger
SKILL.mdfiles (under 1000 lines).
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
Claude/skills/production-observability/SKILL.md |
New, large observability skill guide with patterns/examples across logging/metrics/tracing/alerting. |
Claude/skills/production-observability/.gitkeep |
Keeps the new skill directory present in git. |
CONTRIBUTING.md |
Raises the SKILL.md line limit guideline from 500 to 1000 lines. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| service: test-* | ||
| match: | ||
| severity: medium|low |
There was a problem hiding this comment.
In the Alertmanager routing example, match_re: service: test-* is not a valid regex for “prefix test-” (it matches tes + zero/more t-). Also severity: medium|low is under match: (exact match) rather than match_re:. Please correct these patterns to valid Alertmanager matching to avoid misconfiguration.
| service: test-* | |
| match: | |
| severity: medium|low | |
| service: ^test-.*$ | |
| severity: ^(medium|low)$ |
| private static final String CORRELATION_ID = "X-Correlation-ID"; | ||
|
|
||
| @Override | ||
| protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response, FilterChain chain) { |
There was a problem hiding this comment.
OncePerRequestFilter#doFilterInternal in Spring has a checked-exception signature; this example omits the throws ServletException, IOException (or throws Exception) clause, so it won’t compile as shown. Please update the method signature to match the real API to keep the snippet executable.
| protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response, FilterChain chain) { | |
| protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response, FilterChain chain) | |
| throws ServletException, IOException { |
|
|
||
| #### PASS: Redacted Logging | ||
| ```typescript | ||
| console.log("User login", { email, userId, cardLast4: card.last4 }); |
There was a problem hiding this comment.
The “PASS: Redacted Logging” example still uses console.log, but the verification checklist later says “No console.log (use proper logger)”. Please change the PASS example to use the same structured logger used elsewhere (and consider noting that raw emails are often PII and may need redaction/hashing too).
| console.log("User login", { email, userId, cardLast4: card.last4 }); | |
| log.info("user_login", { emailHash, userId, cardLast4: card.last4 }); // raw email may also be PII; redact/hash per policy |
| @Service | ||
| class UserService { | ||
|
|
||
| @Span("UserService.getUserById") // Create span | ||
| public User getUserById(Long id) { | ||
| // Span automatically created by @Span annotation | ||
| return userRepository.findById(id); | ||
| } | ||
|
|
||
| public User createUser(CreateUserRequest request) { | ||
| Tracer tracer = OpenTelemetry.getGlobalTracer(); |
There was a problem hiding this comment.
The Java OpenTelemetry tracing example appears to use nonstandard/outdated APIs: @Span isn’t an OpenTelemetry annotation (commonly @WithSpan), and OpenTelemetry.getGlobalTracer() isn’t a typical entry point (usually GlobalOpenTelemetry.getTracer(...)). Please update to the correct OpenTelemetry APIs so readers can copy/paste working code.
| @Service | |
| class UserService { | |
| @Span("UserService.getUserById") // Create span | |
| public User getUserById(Long id) { | |
| // Span automatically created by @Span annotation | |
| return userRepository.findById(id); | |
| } | |
| public User createUser(CreateUserRequest request) { | |
| Tracer tracer = OpenTelemetry.getGlobalTracer(); | |
| import io.opentelemetry.api.GlobalOpenTelemetry; | |
| import io.opentelemetry.api.trace.Span; | |
| import io.opentelemetry.api.trace.SpanKind; | |
| import io.opentelemetry.api.trace.StatusCode; | |
| import io.opentelemetry.api.trace.Tracer; | |
| import io.opentelemetry.context.Scope; | |
| import io.opentelemetry.instrumentation.annotations.WithSpan; | |
| @Service | |
| class UserService { | |
| @WithSpan("UserService.getUserById") // Create span | |
| public User getUserById(Long id) { | |
| // Span automatically created by @WithSpan annotation | |
| return userRepository.findById(id); | |
| } | |
| public User createUser(CreateUserRequest request) { | |
| Tracer tracer = GlobalOpenTelemetry.getTracer(UserService.class.getName()); |
| .put("user.segment", userSegment) | ||
| .build(); | ||
|
|
||
| BaggageContext.updateCurrent(baggage); |
There was a problem hiding this comment.
BaggageContext.updateCurrent(baggage) is not a standard OpenTelemetry Java API pattern; typically you attach baggage to a Context and make it current (e.g., baggage.storeInContext(Context.current()) / baggage.makeCurrent()). Please update this example to the real API so it’s copy/pasteable.
| BaggageContext.updateCurrent(baggage); | |
| Context contextWithBaggage = baggage.storeInContext(Context.current()); | |
| try (Scope scope = contextWithBaggage.makeCurrent()) { | |
| // Make downstream calls or invoke work that should see this baggage here | |
| } |
| ```java | ||
| // Sample 10% of debug logs | ||
| if (Math.random() < 0.1 && logger.isDebugEnabled()) { | ||
| logger.debug("cache_debug", key=key, value=value); | ||
| } | ||
| ``` |
There was a problem hiding this comment.
Several Java logging snippets use pseudo “named-argument” syntax (e.g., key=key, correlationId=...) which is not valid Java/SLF4J. Please rewrite these examples using either SLF4J placeholders (key={}) or an explicit structured-logging API (e.g., markers/key-value arguments) so the examples are directly usable.
| try (Scope scope = span.makeCurrent()) { | ||
| HttpHeaders headers = new HttpHeaders(); | ||
| tracer.getTextMapPropagator().inject( | ||
| Context.current(), | ||
| headers, | ||
| HttpHeaders::set | ||
| ); |
There was a problem hiding this comment.
The Java context-propagation snippet uses tracer.getTextMapPropagator(), but propagation is typically obtained from OpenTelemetry.getPropagators().getTextMapPropagator() (not from Tracer). As written, this won’t compile and may mislead readers—please update to the correct OpenTelemetry propagation API.
| ```yaml | ||
| alert: ServiceDown | ||
| expr: up{job="api-server"} == 0 | ||
| for: 1m | ||
| labels: | ||
| severity: critical | ||
| annotations: | ||
| summary: "API server {{ $labels.instance }} is down" | ||
| description: "API server has been down for more than 1 minute." | ||
|
|
||
| routes: | ||
| - receiver: oncall | ||
| match_re: | ||
| severity: critical | ||
| ``` |
There was a problem hiding this comment.
The alert examples mix Prometheus alert rules with Alertmanager routing (routes: blocks). In Prometheus, routing is configured in Alertmanager, not inside rule definitions; please separate these into distinct examples (or clearly label which file each snippet belongs to) to avoid users copying an invalid configuration.
| 1 - rate(http_requests_total{status=~"2..,3.."}[30m]) | ||
| / | ||
| rate(http_requests_total[30m]) | ||
| ) > 0.001 # 0.1% error budget |
There was a problem hiding this comment.
The SLO expressions use status=~"2..,3..", but in PromQL regexes the comma is literal—this won’t match 2xx/3xx as intended. Please change to a correct regex (e.g., 2..|3..) or invert with status!~"5.." depending on the desired success definition.
| if (span != null) { | ||
| MDC.put("traceId", span.getSpanContext().getTraceId()); | ||
| MDC.put("spanId", span.getSpanContext().getSpanId()); |
There was a problem hiding this comment.
Span.current() never returns null; it returns an invalid/no-op span when no context exists. Checking span != null will always pass, so this can inject meaningless trace/span IDs into logs—please check span.getSpanContext().isValid() (or equivalent) before writing to MDC.
| if (span != null) { | |
| MDC.put("traceId", span.getSpanContext().getTraceId()); | |
| MDC.put("spanId", span.getSpanContext().getSpanId()); | |
| SpanContext spanContext = span.getSpanContext(); | |
| if (spanContext.isValid()) { | |
| MDC.put("traceId", spanContext.getTraceId()); | |
| MDC.put("spanId", spanContext.getSpanId()); |
Summary
What Changed
Logging Section (major expansion)
Metrics Section (comprehensive coverage)
Tracing Section (added depth)
Alerting Section (design patterns)
Additional Major Additions
Test Plan
Motivation
Production observability is critical infrastructure, and teams need detailed guidance with examples to implement logging, metrics, tracing, and alerting correctly. This expansion makes the skill production-ready and provides actionable patterns that prevent common observability anti-patterns.
Related