The 3 AM problem

Every engineer who has operated production systems knows the feeling. Your phone buzzes at 3 AM. An alert fires. You open your laptop, bleary-eyed, and stare at a dashboard full of red indicators. Something is wrong. But what?

This is the moment that separates systems with good observability from systems without it. In one scenario, you spend the next 90 minutes digging through logs, guessing at causes, and trying to correlate events across multiple systems. In the other, you look at your dashboard, see exactly which component is misbehaving, drill into the relevant traces, and identify the root cause in minutes.

After years of operating enterprise systems across banking, middleware, and distributed platforms, we have developed strong opinions about what actually works in production observability. Most of what the industry teaches is correct in theory but incomplete in practice.

The three pillars and the one-pillar reality

Everyone talks about the three pillars of observability: logs, metrics, and traces. What we have observed is that most teams only do one of them well — usually logging — and treat the other two as afterthoughts.

Logs are the default. Every application produces logs. The problem is that most logs are unstructured streams of text that are nearly useless at 3 AM. We will come back to this.

Metrics are numerical measurements over time. CPU usage, request latency, error rates, queue depth. Metrics tell you what is happening at a high level. They are the early warning system that tells you something is wrong before users notice.

Traces follow a single request through all the services it touches. In distributed systems, a trace is the only way to understand the full journey of a request from entry to response. Without traces, debugging distributed systems is guesswork.

The reason most teams only do logging well is that logging requires almost no upfront investment. You call a log function, and text appears somewhere. Metrics require instrumentation — you have to decide what to measure, how to aggregate it, and where to visualize it. Traces require propagation — you need correlation IDs flowing through every service, every queue, every database call.
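The propagation requirement can be sketched with Python's stdlib `contextvars`: a trace ID set at the edge of a request is readable anywhere on that request's call path without being passed through every function signature. The names (`handle_request`, `charge_payment`) are illustrative, not from any particular framework.

```python
import contextvars
import uuid

# Context variable carrying the current request's trace ID; contextvars
# keeps it isolated per request, even under async concurrency.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def handle_request(incoming_trace_id=None):
    # Honor a trace ID propagated from upstream; otherwise start a new trace.
    trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    return charge_payment()

def charge_payment():
    # Deep in the call stack, the trace ID is available without being
    # threaded through every signature. Attach it to outgoing HTTP
    # headers, queue messages, and log events.
    return trace_id_var.get()

print(handle_request("abc123"))  # prints abc123: the upstream ID flows through
```

In a real system the same idea extends across process boundaries: the ID is written into an outgoing header and read back out on the receiving side.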

Our recommendation: invest in all three from day one. The cost of adding observability after the fact — when systems are already in production and incidents are already happening — is an order of magnitude higher than building it in from the start.

Structured logging that actually helps

The single highest-impact change you can make to your logging is to switch from unstructured text to structured data. Instead of logging "User 12345 failed to process payment of $100.00 - timeout", log a structured event with fields: user_id, action, amount, error_type, duration_ms, service_name, trace_id.
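A minimal sketch of what that looks like in practice, using only Python's stdlib to emit one JSON object per event. The helper name `log_event` and the specific values are illustrative; any structured-logging library gives you the same shape.

```python
import json
import uuid
from datetime import datetime, timezone

def log_event(**fields):
    """Emit one structured log event as a single JSON line on stdout."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": fields.pop("trace_id", None) or uuid.uuid4().hex,
        **fields,
    }
    print(json.dumps(event))

# The unstructured message from above, as a queryable event:
log_event(
    level="error",
    service_name="payments",
    action="payment",
    outcome="failure",
    error_type="timeout",
    user_id=12345,
    amount="100.00",
    duration_ms=5003,
)
```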

Why this matters at 3 AM:

With unstructured logs, finding all failed payments in the last hour requires regex. Finding all failures for a specific user requires a different regex. Correlating failures with a specific downstream service requires yet another. When you are tired and the system is down, you do not want to be writing regex.

With structured logs, these are simple queries. Show me all events where action=payment AND error_type=timeout AND timestamp > 1 hour ago. Group by service_name. Show me the distribution of duration_ms.
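Those queries reduce to plain filters over parsed events. A sketch against a hypothetical in-memory list (a log backend would run the equivalent query server-side); timestamps are simplified to epoch seconds:

```python
import time
from collections import Counter

now = time.time()

# Hypothetical parsed JSON log events.
events = [
    {"action": "payment", "error_type": "timeout", "service_name": "gateway",
     "duration_ms": 5100, "timestamp": now - 600},
    {"action": "payment", "error_type": "timeout", "service_name": "ledger",
     "duration_ms": 4800, "timestamp": now - 300},
    {"action": "login", "error_type": None, "service_name": "auth",
     "duration_ms": 40, "timestamp": now - 100},
]

# action=payment AND error_type=timeout AND timestamp > 1 hour ago
one_hour_ago = now - 3600
timeouts = [e for e in events
            if e["action"] == "payment"
            and e["error_type"] == "timeout"
            and e["timestamp"] > one_hour_ago]

# Group by service_name to see which downstream is failing.
by_service = Counter(e["service_name"] for e in timeouts)
print(by_service)
```

No regex, and the same filter works for any field you logged.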

The fields we include in every log event:

timestamp (ISO 8601), service_name, trace_id, span_id, level (info/warn/error), action (what was being attempted), outcome (success/failure), duration_ms (how long it took), and any domain-specific fields relevant to the action.

The trace_id is critical. It connects your log events to your distributed traces. When you see an error in the logs, you can immediately pull up the full trace and see every service that request touched.

Alert fatigue and how to fix it

Alert fatigue is the number one operational problem we see in enterprise environments. Teams set up alerting, configure thresholds for every metric they can think of, and then get buried in notifications. After a week of false alarms, the on-call engineer starts ignoring alerts. Then a real incident happens, and nobody notices until a user reports it.

The root cause of alert fatigue is alerting on symptoms instead of impact.

A CPU spike to 80% is a symptom. An error rate increase from 0.1% to 2% is a symptom. These may or may not indicate an actual problem. What matters is: are users affected?


We restructured alerting around three tiers:

Tier 1: User-impacting. Error rates, latency percentiles (p95, p99), and availability. These fire to the on-call engineer immediately. When these alerts fire, users are having a bad experience right now.

Tier 2: System health. Resource utilization, queue depths, connection pool saturation. These are warnings. They indicate that a problem may develop if not addressed. They go to a Slack channel, not to the pager.

Tier 3: Anomalies. Deviations from normal patterns that might be interesting but are not immediately actionable. These feed into a daily review dashboard.

The key insight: every Tier 1 alert should have a runbook. When the alert fires, the on-call engineer should know exactly what to check first, what the likely causes are, and what the remediation steps are. If you cannot write a runbook for an alert, the alert is not specific enough to be useful.
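The routing logic behind the three tiers can be sketched in a few lines. The metric names and thresholds here are hypothetical placeholders; real values come from your own SLOs.

```python
def classify_alert(metric, value):
    """Route a firing metric to a tier, per the three-tier scheme above."""
    # Tier 1: user-impacting signals page the on-call engineer.
    # (Thresholds are illustrative, e.g. error_rate above 1%.)
    tier1 = {"error_rate": 0.01, "p99_latency_ms": 2000}
    # Tier 2: system-health signals go to a warning channel.
    tier2 = {"cpu_utilization": 0.85, "queue_depth": 10_000, "pool_saturation": 0.90}

    if metric in tier1 and value > tier1[metric]:
        return "page_oncall"   # must have a runbook attached
    if metric in tier2 and value > tier2[metric]:
        return "warn_channel"
    return "daily_review"      # Tier 3: anomalies, reviewed in batch

print(classify_alert("error_rate", 0.02))       # prints page_oncall
print(classify_alert("cpu_utilization", 0.95))  # prints warn_channel
```

In practice the same decision usually lives in alert-manager routing rules rather than application code, but the structure is the same: impact decides the destination, not the metric's existence.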

Dashboards that matter

We have seen organizations with hundreds of dashboards where nobody looks at any of them. The problem is not the tool — it is that most dashboards are built to look impressive rather than to answer questions.

The three dashboards every system needs:

Dashboard 1: The business dashboard. This answers "is the system working from the user's perspective?" Key metrics: request volume, error rate, p50/p95/p99 latency, active users, and business-specific KPIs (transactions processed, payments completed, messages delivered). This is the dashboard you look at first during an incident.

Dashboard 2: The service health dashboard. One panel per service showing: request rate, error rate, latency, resource utilization. This answers "which service is the problem?" When Dashboard 1 shows something is wrong, Dashboard 2 tells you where.

Dashboard 3: The dependency dashboard. Shows the health of external dependencies: databases, message queues, third-party APIs, downstream services. This answers "is the problem in our code or in something we depend on?"

The best dashboards are the ones that answer a question in under 5 seconds. If you have to think about what a panel is showing you, the dashboard needs work.

Building the operational culture

Observability is not just tooling. It is a culture. The tools are useless if the team does not use them, does not maintain them, and does not improve them based on real incidents.

Post-incident reviews are the engine of observability improvement. After every significant incident, we ask three questions about observability: How did we detect the problem? What information did we have during the investigation? What information were we missing?

Every missing piece becomes an action item. If we could not identify the failing component quickly, we add a metric or improve a dashboard. If we could not trace a request through the system, we add trace instrumentation. If the alert fired too late, we adjust the threshold or add a leading indicator.

Over time, this creates a feedback loop where every incident makes the system more observable. The goal is not to prevent all incidents — that is impossible. The goal is to detect them quickly, diagnose them quickly, and resolve them quickly.

After implementing this approach across multiple enterprise systems, the pattern is consistent. Mean time to detection drops from hours to minutes. Mean time to resolution drops from hours to tens of minutes. And the on-call engineers actually trust the alerts — because every alert means something real.

Observability is not a project with a completion date. It is a practice that improves continuously. The systems we operate today are orders of magnitude more observable than they were two years ago, and in two years they will be better still. That is the point.