Title A Study on Automated Problem Troubleshooting in Cloud Environments with Rule Induction and Verification
Authors Arnak Poghosyan, Ashot Harutyunyan, Edgar Davtyan, Karen Petrosyan, Nelson Baloian
Publication date 2024
Abstract In the vast majority of cases, remediation of IT issues reflected in domain-specific or user-defined alerts occurring in cloud environments and customer ecosystems suffers from a lack of accurate recommendations that could be supplied in a timely manner for recovery from performance degradations. This is hard to realize by furnishing those abnormality definitions with appropriate expert knowledge, which varies from one environment to another. At the same time, in many support cases, the reported problems under Global Support Services (GSS) or Site Reliability Engineering (SRE) treatment ultimately go down to the product teams, making them waste costly development hours on investigating self-monitoring metrics of our solutions. Therefore, the lack of a systematic approach to adopting AI Ops significantly impacts the mean-time-to-resolution (MTTR) rates of problems/alerts. Such an approach would imply building, maintaining, and continuously improving/annotating a data store of insights on which ML models are trained and generalized across the whole customer base and corporate cloud services. Our ongoing study aligns with this vision and validates an approach that learns the alert resolution patterns in such a global setting and explains them using interpretable AI methodologies. The knowledge store of causative rules is then applied to predicting potential sources of the application degradation reflected in an active alert instance. In this communication, we share our experiences with a prototype solution and an up-to-date analysis demonstrating how root conditions are discovered accurately for a specific type of problem. The approach is validated against historical data of resolutions achieved through heavy manual development effort. We also offer experts a Dempster-Shafer theory-based rule verification framework as a what-if analysis tool to test their hypotheses about the underlying environment.
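For readers unfamiliar with Dempster-Shafer theory, the following is a minimal Python sketch of Dempster's rule of combination, the standard way to fuse two bodies of evidence in that theory. It is an illustration only and is not taken from the publication: the abstract does not describe how the verification framework is built, and the hypothesis names (`CPU`, `DISK`), the mass values, and the helper names are all hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    and the masses of one function are assumed to sum to 1.
    """
    fused = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            fused[common] = fused.get(common, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to contradictory hypotheses is discarded.
            conflict += mass_a * mass_b
    norm = 1.0 - conflict
    if norm <= 1e-12:
        raise ValueError("sources are in total conflict and cannot be combined")
    # Renormalize so the remaining masses sum to 1.
    return {h: v / norm for h, v in fused.items()}


def belief(m, hypothesis):
    """Total mass committed to subsets of the hypothesis."""
    return sum(v for h, v in m.items() if h <= hypothesis)


def plausibility(m, hypothesis):
    """Total mass that does not contradict the hypothesis."""
    return sum(v for h, v in m.items() if h & hypothesis)


# Hypothetical what-if scenario: two rules give evidence about the root
# cause of an alert. FRAME stands for "cause unknown" (all hypotheses).
CPU = frozenset({"cpu"})
DISK = frozenset({"disk"})
FRAME = CPU | DISK

rule_a = {CPU: 0.6, FRAME: 0.4}   # assumed masses, for illustration only
rule_b = {DISK: 0.5, FRAME: 0.5}  # assumed masses, for illustration only

fused = combine(rule_a, rule_b)
cpu_belief = belief(fused, CPU)
cpu_plausibility = plausibility(fused, CPU)
```

The gap between `cpu_belief` and `cpu_plausibility` is the uncertainty that remains about the hypothesis after both pieces of evidence have been combined, and it is this interval that makes the theory suitable for what-if analysis.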
Pages article 1047
Volume 14