The alert fired at 2:47 AM on a Tuesday. P99 latency (the response time that 99% of requests complete within) on the payment validation service had climbed from 82 milliseconds to 340 milliseconds, more than a fourfold increase, and the on-call team was concerned.
The team opened their APM dashboard. The traces looked normal in terms of shape: individual spans were within acceptable ranges, no downstream service was throwing errors, and no database query was obviously hot. Distributed tracing showed the request flowing through the right services in the right order. The APM solution was doing exactly what it was designed to do, and it was telling the team that nothing visible looked wrong.
That is the moment every SRE (Site Reliability Engineer) dreads. The monitoring solution is doing its job, but you still cannot determine the root cause.

What APM Sees
APM tools work by loading a Java agent alongside the JVM, instrumenting bytecode at method entry and exit points to record timing, errors, and trace context. Some vendors also offer continuous profilers that sample stack traces in production to surface hotspots. These are powerful capabilities and essential ones. For distributed systems, there is no substitute for seeing how a request moves across services, or for knowing the moment a downstream dependency starts misbehaving.
But the model has a hard architectural boundary. Sampling-based profilers can extend this picture somewhat, surfacing which code paths consume the most CPU. What neither approach can do is tell you whether a specific class has ever loaded and executed across your entire fleet, or confirm that a code path you retired last quarter has not been seen since. APM and profiling show what’s happening and what’s running slower than expected. The JVM natively knows something different: what code exists in production, and whether it has ever run.
The Missing Context
After exhausting the APM traces, the team turned to Azul Intelligence Cloud to get a fleet-level view of what was actually running. It uses information the JVM inherently has when running a Java application. One of its core capabilities is JVM Inventory, which continuously monitors every running JVM in the environment and captures vendor, version, configuration, and which applications are deployed on each instance.
A reasonable question is what Intelligence Cloud actually installs and whether it requires switching to an Azul JDK. It deploys a lightweight Java agent attached at JVM startup via the standard -javaagent flag, alongside a Forwarder that brokers communication between your JVMs and the Intelligence Cloud SaaS. The agent works with any JVM running Java 8 or newer, regardless of vendor or OS. Unlike a traditional APM agent, it reads data the JVM already tracks natively rather than instrumenting bytecode at method entry and exit points, with negligible impact on performance.
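To make "data the JVM already tracks natively" concrete, here is a minimal sketch using only the standard java.lang.management API. It is not the Intelligence Cloud agent and makes no claim about how that agent works; the class name JvmSnapshot and the map keys are made up for illustration. It only shows that vendor, version, and startup configuration can be read from a running JVM without instrumenting any bytecode.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any vendor product.
public class JvmSnapshot {

    /** Collects facts the running JVM already exposes; nothing is instrumented. */
    public static Map<String, String> collect() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();

        Map<String, String> facts = new LinkedHashMap<>();
        facts.put("vmVendor", runtime.getVmVendor());
        facts.put("vmName", runtime.getVmName());
        facts.put("vmVersion", runtime.getVmVersion());
        facts.put("javaVersion", System.getProperty("java.version"));
        // Startup flags passed to the JVM (heap sizes, -javaagent entries, and so on).
        facts.put("startupFlags", String.join(" ", runtime.getInputArguments()));
        // Number of classes currently loaded in this JVM.
        facts.put("loadedClassCount", String.valueOf(classes.getLoadedClassCount()));
        return facts;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running this on each node and comparing the output is a crude, manual version of the fleet-level comparison described here.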
The inventory surfaced something the APM dashboard had no way to show. Of the twelve nodes running the payment validation service, eleven were on the same JVM vendor and version that the platform team had standardized three months earlier. The twelfth was not. It was running an older build, a configuration that had been formally retired as part of a vendor consolidation effort. That node had been sitting in the cluster behind a load balancer, processing its share of traffic ever since. However, nothing in their APM tooling had flagged it because from a trace perspective it was behaving acceptably.
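The comparison that exposed the twelfth node can be sketched in a few lines. This is an illustration only: the NodeRuntime type, the node names, and the placeholder vendor and version strings are all hypothetical, and the sketch assumes the odd node differs in its build, which the incident write-up does not spell out in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a fleet drift check; not an Intelligence Cloud API.
public class FleetDriftCheck {

    /** One inventory row per node. Field names are illustrative. */
    public static final class NodeRuntime {
        public final String node;
        public final String vendor;
        public final String version;

        public NodeRuntime(String node, String vendor, String version) {
            this.node = node;
            this.vendor = vendor;
            this.version = version;
        }
    }

    /** Returns the names of nodes whose JVM differs from the agreed standard. */
    public static List<String> findDrift(List<NodeRuntime> fleet,
                                         String standardVendor,
                                         String standardVersion) {
        List<String> drifted = new ArrayList<>();
        for (NodeRuntime n : fleet) {
            if (!n.vendor.equals(standardVendor) || !n.version.equals(standardVersion)) {
                drifted.add(n.node);
            }
        }
        return drifted;
    }

    /** Twelve nodes: eleven on the standard build, one on a retired build (placeholder values). */
    public static List<NodeRuntime> sampleFleet() {
        List<NodeRuntime> fleet = new ArrayList<>();
        for (int i = 1; i <= 11; i++) {
            fleet.add(new NodeRuntime("node-" + i, "standard-vendor", "standard-build"));
        }
        fleet.add(new NodeRuntime("node-12", "standard-vendor", "retired-build"));
        return fleet;
    }

    public static void main(String[] args) {
        System.out.println(findDrift(sampleFleet(), "standard-vendor", "standard-build"));
    }
}
```

The check itself is trivial. The hard part in practice is having an accurate, current inventory row for every node, which is what the trace data could not provide.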
It was not the root cause, but it was a new signal for the incident team and it pointed directly at where to look next.

Evaluating the Code
The deeper answer came from Code Inventory, the second major Intelligence Cloud capability. Code Inventory maintains a historical record of when specific classes and methods were first loaded and last executed, down to the package and method level, across the full application fleet. This is not sampling data. It is a continuous catalog of the runtime execution that bytecode instrumentation cannot produce, because it reads what the JVM already knows rather than injecting new probes to gather fresh observations.
When the team queried Code Inventory for the payment validation service, they found something unexpected. A class called LegacyDiscountValidationRoutine, a component marked for removal during a major refactoring effort the previous year, was showing recent execution timestamps. The last execution was just eleven minutes before they pulled the report.
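The shape of that query can be illustrated with a small sketch: take a list of classes believed to be retired and flag any whose last execution falls inside a recent window. Everything here is assumed for illustration, including the ClassActivity type, the other class names, the fixed report time, and the 24-hour window. It does not represent the Code Inventory API or its data model.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a "retired but still running" check.
public class GhostCodeCheck {

    /** Arbitrary fixed report time so the example is deterministic. */
    public static final Instant NOW = Instant.parse("2024-01-02T03:00:00Z");

    /** Classes the team believes are retired. OldReceiptFormatter is invented for contrast. */
    public static final Set<String> RETIRED = new HashSet<>(Arrays.asList(
            "LegacyDiscountValidationRoutine", "OldReceiptFormatter"));

    /** Last-execution record for one class (illustrative shape). */
    public static final class ClassActivity {
        public final String className;
        public final Instant lastExecuted;

        public ClassActivity(String className, Instant lastExecuted) {
            this.className = className;
            this.lastExecuted = lastExecuted;
        }
    }

    /** Returns retired classes that executed within the given window before now. */
    public static List<String> stillExecuting(List<ClassActivity> records,
                                              Set<String> retired,
                                              Instant now,
                                              Duration window) {
        Instant cutoff = now.minus(window);
        List<String> ghosts = new ArrayList<>();
        for (ClassActivity r : records) {
            if (retired.contains(r.className) && r.lastExecuted.isAfter(cutoff)) {
                ghosts.add(r.className);
            }
        }
        return ghosts;
    }

    /** Sample data: one ghost (11 minutes ago), one truly dead class, one live class. */
    public static List<ClassActivity> sampleRecords(Instant now) {
        return Arrays.asList(
                new ClassActivity("LegacyDiscountValidationRoutine", now.minus(Duration.ofMinutes(11))),
                new ClassActivity("OldReceiptFormatter", now.minus(Duration.ofDays(400))),
                new ClassActivity("PaymentValidator", now.minus(Duration.ofSeconds(5))));
    }

    public static void main(String[] args) {
        System.out.println(stillExecuting(sampleRecords(NOW), RETIRED, NOW, Duration.ofHours(24)));
    }
}
```

Only the class that is both on the retired list and recently executed is reported. A retired class that has not run for 400 days, and a live class that was never retired, are both ignored.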
The routine had been removed from the main codebase, but the twelfth node was running a container image that predated the refactoring. Every request that landed on that node was being routed through the old validation path, which made an additional synchronous call to a discount service that was in the process of being decommissioned. The service was still running, but on reduced infrastructure as the team wound it down, so it responded slowly without throwing explicit errors. That was enough extra latency to inflate the P99 for the entire cluster, but not enough to produce a trace that looked wrong in isolation.
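It may not be obvious why one node out of twelve can move a cluster-wide P99. With evenly balanced traffic, that node serves roughly 1 in 12 requests, or about 8.3%, which is well above the slowest 1% that P99 measures. The sketch below works through the arithmetic under deliberately simplified assumptions: round-robin load balancing, a constant 82 ms on healthy nodes, and a constant 340 ms on the stale node. Real latencies are distributions, not constants, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Simplified model: constant per-node latencies, round-robin routing.
public class TailLatencyDemo {

    /** Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order. */
    public static long percentile(long[] latenciesMs, int p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (p * sorted.length + 99) / 100; // integer ceiling of p * n / 100
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Round-robin traffic where the last node in the rotation is the slow one. */
    public static long[] simulate(int nodes, int requests, long healthyMs, long slowMs) {
        long[] latencies = new long[requests];
        for (int i = 0; i < requests; i++) {
            boolean landsOnSlowNode = (i % nodes) == nodes - 1;
            latencies[i] = landsOnSlowNode ? slowMs : healthyMs;
        }
        return latencies;
    }

    public static void main(String[] args) {
        long[] healthyFleet = simulate(12, 1200, 82, 82);
        long[] oneStaleNode = simulate(12, 1200, 82, 340);

        System.out.println("healthy P50 = " + percentile(healthyFleet, 50)); // 82
        System.out.println("healthy P99 = " + percentile(healthyFleet, 99)); // 82
        System.out.println("stale   P50 = " + percentile(oneStaleNode, 50)); // 82
        System.out.println("stale   P99 = " + percentile(oneStaleNode, 99)); // 340
    }
}
```

In this model the median does not move at all, while the P99 jumps to the slow node's latency. That matches the pattern in the incident: most requests looked healthy, and the tail did not.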
The APM trace for those requests showed a slightly longer span in the validation step. Without knowing the legacy routine existed and was executing, there was nothing suspicious about the duration by itself. The instrumentation had faithfully reported what happened. It just had no way to tell the team what was causing it.
Resolving the Issue
The fix was straightforward once the diagnosis was clear:
- Update the stale container image on the twelfth node.
- Confirm Code Inventory no longer showed execution of the legacy routine.
- Watch P99 return to baseline.
Total time from receiving Intelligence Cloud data to resolution was about an hour.
What the incident exposed was more instructive than the fix itself. The team had no mechanism for knowing that a ghost code path was running in production. Their APM solution gave them excellent visibility into individual traces. Their deployment tooling had a gap that allowed a stale image to persist across a significant configuration change.
A fair critique is that good deployment hygiene and image digests should prevent this. But deployment tooling records what you intended to run, not what is actually running. Those two things diverge more than teams expect, through rollbacks applied outside the normal pipeline, cached image layers, and configuration drift. Code Inventory is not a substitute for deployment hygiene. It is the verification layer that confirms intent and reality are aligned, i.e. “Is the code we believe is running actually the code that is running?”
That question turns out to be useful well beyond incident response:
- Performance engineers can scope load tests around the code paths that execute in production.
- SREs get a continuous, always-current inventory of what is deployed and running, not a profiler snapshot frozen at a single moment in time.
- DevOps and platform teams get a way to verify over time that retired code paths have actually stopped executing across the fleet, something deployment logs alone cannot confirm.
- Engineering managers get an answer to a question most observability stacks cannot provide: what code is actually running in production right now, across every node. That has value for compliance, incident governance, and making the case for technical debt remediation.
APM is not going anywhere, nor should it. It is the right tool for understanding what happened inside a request, why a user experienced an error, and where time is being spent across a distributed call graph. But it answers a specific set of questions, and those questions are not the same ones the JVM itself can answer.
Intelligence Cloud is not an observability solution. It is a runtime intelligence layer that surfaces what the JVM inherently knows. The JVM knows what code has ever been loaded and run. It knows what runtime configuration is in place on every node. It knows whether the class your team retired last quarter is still executing at 2 AM on a Tuesday. APM was never built to surface these things because the information does not live in execution events. It lives in the runtime itself.
The teams that operate most effectively in production treat the JVM as a first-class data source, not just a platform their monitoring agent happens to sit on top of. If your APM solution can give you a complete trace of every request but cannot tell you what code is running across your fleet, you have a gap worth closing. The payment service incident took 48 hours to diagnose and an hour to resolve once the right data was in the room. If that gap sounds familiar, we invite you to reach out and schedule a deeper conversation to learn more about Azul Intelligence Cloud.