Automated Root Cause Analysis in SRE: Applying Machine Learning to Incident Telemetry and Change Events

research-article
Received: Mar 25, 2024
Published: Jun 12, 2024
Authors:

Abstract

Root cause analysis (RCA) remains time-consuming due to fragmented telemetry and complex change histories. This paper proposes a machine learning approach that correlates incident telemetry with deployment events, configuration changes, and dependency health signals to suggest likely fault domains. The study evaluates graph-based correlation, anomaly scoring, and explanation techniques that keep recommendations actionable for engineers. Results show reductions in triage time and improved first-guess accuracy during high-severity incidents.
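
The correlation idea summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's method, which is not available here. It is an assumed, simplified scheme: a z-score anomaly score per service, a fixed boost for services with a change event shortly before the incident, and a discount by hop distance in the dependency graph. All names (`anomaly_score`, `hops_from`, `rank_fault_domains`), the service names, the boost factor, and the sample data are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def anomaly_score(baseline, current):
    """Absolute z-score of the current value against a baseline window."""
    sigma = pstdev(baseline)
    if sigma == 0:
        return 0.0  # assumption: a flat baseline yields no score
    return abs(current - mean(baseline)) / sigma


def hops_from(graph, start):
    """Breadth-first hop counts from the alerting service along dependency edges."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in dist:
                dist[dep] = dist[node] + 1
                queue.append(dep)
    return dist


def rank_fault_domains(graph, alerting, telemetry, changes, incident_ts, window_s=300):
    """Rank reachable services by anomaly score, recent change, and graph distance."""
    ranked = []
    for svc, hops in hops_from(graph, alerting).items():
        if svc not in telemetry:
            continue  # no signal for this service, skip it
        baseline, current = telemetry[svc]
        score = anomaly_score(baseline, current)
        # Hypothetical weighting: double the score if a change landed in the window.
        recent = [ts for ts in changes.get(svc, []) if 0 <= incident_ts - ts <= window_s]
        boost = 2.0 if recent else 1.0
        ranked.append((svc, score * boost / (1 + hops)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical data: service -> dependencies, (baseline window, current value), change timestamps.
graph = {"checkout": ["payments", "cart"], "payments": ["db"], "cart": ["db"]}
telemetry = {
    "checkout": ([100, 102, 98, 100], 104),
    "payments": ([50, 52, 48, 50], 56),
    "cart": ([30, 31, 29, 30], 30),
    "db": ([10, 10, 11, 9], 10),
}
changes = {"payments": [950], "cart": [100]}

ranked = rank_fault_domains(graph, "checkout", telemetry, changes, incident_ts=1000)
for service, score in ranked:
    print(f"{service}: {score:.2f}")
```

With this sample data, `payments` ranks first: it is anomalous and had a change 50 seconds before the incident, while the older `cart` change falls outside the window. The per-factor breakdown (score, boost, hops) is the kind of explanation an engineer could act on. A real system would need learned weights rather than the fixed constants assumed here.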

Cite this article

(2024). Automated Root Cause Analysis in SRE: Applying Machine Learning to Incident Telemetry and Change Events. Research Explorations in Global Knowledge & Technology (REGKT), 3 (2). Retrieved from https://regkt.com/article.php?id=781&slug=automated-root-cause-analysis-sre-applying-ml-incident-telemetry-change-events
