Monitorama PDX 2024 - From Alerts to Insights: Performing Trace-Based Causation at Scale

Logan Rosen's session from Monitorama PDX 2024.
Our team focuses on improving the observability experience for engineers at our company, specifically in triaging/performing causal analysis for alerts. Previously, we have taken a time-based correlation approach, which can yield success but does not always point to the right problem. Especially in a multi-layered stack of many microservices, we would often end up with red herrings when comparing metrics that seemed to match up in shape but didn't end up being related. It is important that we point engineers in the right direction, and quickly, in order to reduce the amount of time it takes to resolve site impact.
To remedy this, we focused on leveraging distributed tracing for determining causation - by stream processing the events emitted at each hop of a request, we could deterministically point to the problems that led to the alerts very soon after they arise. Specifically, we could look at chains of errors and/or latency on a per-trace basis and use aggregate counts over time to provide a causal graph to engineers debugging site issues.
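As a rough illustration of the per-trace idea described above, here is a minimal Go sketch. It is not the pipeline from the talk: the `Span` and `Edge` types, their fields, and the `CountErrorEdges` function are all hypothetical names chosen for this example, and it processes an in-memory slice rather than a Kafka stream. It assumes each event carries a trace ID, a span ID, a parent span ID, a service name, and an error flag, and it counts how often an errored span has a direct parent that also errored.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical, simplified trace event emitted at one hop of a request.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for the root span
	Service  string
	IsError  bool
}

// Edge records that an error in Callee was observed directly beneath an
// error in Caller within the same trace.
type Edge struct {
	Caller string
	Callee string
}

// CountErrorEdges groups spans by trace, then counts every parent/child
// pair in which both spans errored. The aggregate counts form a simple
// weighted graph of error propagation between services.
func CountErrorEdges(spans []Span) map[Edge]int {
	// Index spans by trace ID and then by span ID for parent lookups.
	byTrace := make(map[string]map[string]Span)
	for _, s := range spans {
		if byTrace[s.TraceID] == nil {
			byTrace[s.TraceID] = make(map[string]Span)
		}
		byTrace[s.TraceID][s.SpanID] = s
	}

	counts := make(map[Edge]int)
	for _, bySpan := range byTrace {
		for _, s := range bySpan {
			if !s.IsError || s.ParentID == "" {
				continue
			}
			parent, ok := bySpan[s.ParentID]
			if !ok || !parent.IsError {
				continue
			}
			counts[Edge{Caller: parent.Service, Callee: s.Service}]++
		}
	}
	return counts
}

func main() {
	// Made-up sample data: two traces through the same three services.
	spans := []Span{
		{TraceID: "t1", SpanID: "a", Service: "frontend", IsError: true},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Service: "checkout", IsError: true},
		{TraceID: "t1", SpanID: "c", ParentID: "b", Service: "db", IsError: true},
		{TraceID: "t2", SpanID: "a", Service: "frontend", IsError: true},
		{TraceID: "t2", SpanID: "b", ParentID: "a", Service: "checkout", IsError: true},
		{TraceID: "t2", SpanID: "c", ParentID: "b", Service: "db", IsError: false},
	}

	counts := CountErrorEdges(spans)

	// Sort edges so the printed order is stable across runs.
	edges := make([]Edge, 0, len(counts))
	for e := range counts {
		edges = append(edges, e)
	}
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].Caller != edges[j].Caller {
			return edges[i].Caller < edges[j].Caller
		}
		return edges[i].Callee < edges[j].Callee
	})
	for _, e := range edges {
		fmt.Printf("%s -> %s: %d\n", e.Caller, e.Callee, counts[e])
	}
}
```

Because the edges come from explicit parent/child links within a trace rather than from metrics that happen to move together, a service only appears in the graph when its errors actually sit on the failing request path. A streaming version would presumably keep these counts in time windows and handle spans that arrive out of order, which this sketch does not attempt.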
We were able to build this by leveraging Go and Kafka, and it required significant amounts of performance tuning and careful coding to make it as efficient as possible, given the immense scale of trace events being processed. The data coming out of this pipeline has shown tremendous promise, and we aim to surface it to our users by the end of this year/early 2024.
This will be a structured talk that walks audience members through our journey of maturing our triage/RCA approaches and building this pipeline, as well as through the technical challenges we encountered/how we surmounted them. It is targeted at engineers looking to improve their observability tooling within their own organizations.
