Title | Discovery of Cloud Incidents through Streaming Consolidation of Events across Timeline and Topology Hierarchy |
Authors | Ashot Harutyunyan, Arnak Poghosyan, Tigran Bunarjyan, Naira Grigoryan, Artur Grigoryan, Vahan Tadevosyan, Nelson Baloian |
Publication date | 2024 |
Abstract | With the growing complexity and dynamism of cloud environments, users of operations management solutions face a critical problem: "event storms". Understanding such high volumes of noisy recommendation content, and prioritizing reactions to it across various tasks, is beyond the capacity of human operators. This significantly degrades the resolution metrics for performance issues and hampers the optimization of infrastructures and applications. We have devised a novel streaming clustering algorithm that processes alerts and discovers Alert Episodes, tracking their evolution in time and space. It is based on the principles of the classical density-based clustering algorithm DBSCAN. We learn Unknown Problems by applying this algorithm to low-level events within the VMware Aria Operations manager. Such episodes typically fall outside the coverage of alert definitions and can explain new types of emerging incidents. Our solutions, with different hyperparameters, have been prototyped and integrated into production. We share experimental insights from an internal environment, including interesting alert episodes learned and unknown problems of alarms/symptoms discovered, each with a self-explanatory story of where the source of the performance issue lies and how it evolved into a larger problem affecting several objects and hierarchy layers. The constructs we introduce help reduce the user effort needed to make sense of waves of events and to troubleshoot with relevance. Our solution can be refactored into an independent event management service for cloud operations. |
Pages | 1-7 |
Conference name | IEEE Symposium on Network Operations and Management |
Publisher | IEEE Xplore |
Reference URL |