Title Discovery of Cloud Incidents through Streaming Consolidation of Events across Timeline and Topology Hierarchy
Authors Ashot Harutyunyan, Arnak Poghosyan, Tigran Bunarjyan, Naira Grigoryan, Artur Grigoryan, Vahan Tadevosyan, Nelson Baloian
Publication date 2024
Abstract With the growing complexity and dynamism of cloud environments,
users of operations management solutions are facing a critical headache of
"event storms". Understanding and prioritizing reactions to such high
volumes of noisy recommendation content for various tasks is beyond the
capacities of human operators. This significantly degrades the resolution
metrics of performance issues and optimization of infrastructures and
applications. We have devised a novel streaming clustering algorithm for
processing alerts and discovering Alert Episodes with their evolution
tracked in time and space. It is based on the principles of the classical
density-based clustering DBSCAN. We learn Unknown Problems applying this
algorithm to low-level events within the VMware Aria Operations manager.
Those episodes might typically be out of alert definitions coverage and
explain new types of emerging incidents. Our solutions with different
hyperparameters are prototyped and integrated into production. We share
experimental insights from an internal environment with interesting alert
episodes learned and unknown problems of alarms/symptoms discovered with a
self-explainable story on where the source of the performance issue lies
and how it evolved into a larger problem situation affecting several objects
and hierarchy layers. The constructs we introduce help reduce user effort
in making sense of waves of events and in performing relevant troubleshooting.
Our solution can be refactored into an independent event management service
for cloud operations.
Pages 1-7
Conference name IEEE Symposium on Network Operations and Management
Publisher IEEE Xplore