Logan Rosen's session from Monitorama PDX 2024.
Our team focuses on improving the observability experience for engineers at our company, specifically in triaging alerts and performing causal analysis on them. Previously, we took a time-based correlation approach, which can yield success but does not always point to the right problem. Especially in a multi-layered stack of many microservices, we would often end up with red herrings when comparing metrics that seemed to match up in shape but turned out to be unrelated. It is important that we point engineers in the right direction, and quickly, in order to reduce the time it takes to resolve site impact.
To remedy this, we focused on leveraging distributed tracing to determine causation: by stream processing the events emitted at each hop of a request, we could deterministically point to the problems that led to the alerts very soon after they arose. Specifically, we could look at chains of errors and/or latency on a per-trace basis and use aggregate counts over time to provide a causal graph to engineers debugging site issues.
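As a rough illustration of the per-trace error chain idea, here is a minimal Go sketch. It is not the pipeline described in the talk: the `Span` and `Edge` types and the `CausalEdges` and `Aggregate` functions are hypothetical names invented for this example, and it assumes every span of a trace is already in memory rather than arriving on a stream. It emits an edge whenever an erroring span sits directly beneath an erroring parent in a different service, then counts those edges across traces.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical, simplified trace event for one hop of a request.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for the root span
	Service  string
	Error    bool
}

// Edge records that an error in Cause was seen directly beneath an error in Effect.
type Edge struct {
	Cause  string
	Effect string
}

// CausalEdges inspects the spans of a single trace and returns one edge for
// each erroring span whose parent span also errored in a different service.
func CausalEdges(spans []Span) []Edge {
	byID := make(map[string]Span, len(spans))
	for _, s := range spans {
		byID[s.SpanID] = s
	}
	var edges []Edge
	for _, s := range spans {
		if !s.Error || s.ParentID == "" {
			continue
		}
		parent, ok := byID[s.ParentID]
		if ok && parent.Error && parent.Service != s.Service {
			edges = append(edges, Edge{Cause: s.Service, Effect: parent.Service})
		}
	}
	return edges
}

// Aggregate counts edges across many traces; the counts are the edge weights
// of a simple causal graph.
func Aggregate(traces map[string][]Span) map[Edge]int {
	counts := make(map[Edge]int)
	for _, spans := range traces {
		for _, e := range CausalEdges(spans) {
			counts[e]++
		}
	}
	return counts
}

func main() {
	traces := map[string][]Span{
		"t1": {
			{TraceID: "t1", SpanID: "a", Service: "web", Error: true},
			{TraceID: "t1", SpanID: "b", ParentID: "a", Service: "api", Error: true},
			{TraceID: "t1", SpanID: "c", ParentID: "b", Service: "db", Error: true},
		},
		"t2": {
			{TraceID: "t2", SpanID: "a", Service: "web", Error: true},
			{TraceID: "t2", SpanID: "b", ParentID: "a", Service: "api", Error: true},
			{TraceID: "t2", SpanID: "c", ParentID: "b", Service: "db", Error: false},
		},
	}
	counts := Aggregate(traces)

	// Sort edges so the printed order is stable (map iteration order is random).
	edges := make([]Edge, 0, len(counts))
	for e := range counts {
		edges = append(edges, e)
	}
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].Cause != edges[j].Cause {
			return edges[i].Cause < edges[j].Cause
		}
		return edges[i].Effect < edges[j].Effect
	})
	for _, e := range edges {
		fmt.Printf("%s -> %s: %d\n", e.Cause, e.Effect, counts[e])
	}
}
```

In a streaming setting the same counting would presumably be windowed over time and keyed by trace ID as events arrive; the sketch leaves that out, along with any latency-based chains.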
We were able to build this by leveraging Go and Kafka, and it required significant performance tuning and careful coding to make it as efficient as possible, given the immense scale of trace events being processed. The data coming out of this pipeline has shown tremendous promise, and we aim to surface it to our users by the end of this year or early 2024.
This will be a structured talk that walks audience members through our journey of maturing our triage/RCA approaches and building this pipeline, as well as through the technical challenges we encountered and how we surmounted them. It is targeted at engineers looking to improve the observability tooling within their own organizations.
Monitorama PDX 2024 - From Alerts to Insights: Performing Trace-Based Causation at Scale