Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach
Cloud service owners need to continuously monitor their services to ensure
high availability and reliability. Gaps in monitoring can lead to delays in
incident detection and significant negative customer impact. The current
process of monitor creation is ad hoc and reactive: developers create monitors
using tribal knowledge and a largely trial-and-error process. As a result,
monitors often have incomplete coverage, which leads to production issues, or
redundancy, which results in noise and wasted effort.
In this work, we address this issue by proposing an intelligent monitoring
framework that recommends monitors for cloud services based on their service
properties. We start by mining the attributes of 30,000+ monitors from 791
production services at Microsoft and derive a structured ontology for monitors.
We focus on two crucial dimensions: what to monitor (resources) and which
metrics to monitor. We conduct an extensive empirical study and derive key
insights on the major classes of monitors employed by cloud services at
Microsoft, their associated dimensions, and the interrelationship between
service properties and this ontology. Using these insights, we propose a
deep-learning-based framework that recommends monitors based on service
properties. Finally, we conduct a user study with engineers from Microsoft
that demonstrates the usefulness of the proposed framework. The framework,
along with the ontology-driven projections, succeeded in producing
production-quality recommendations for the majority of resource classes. This
was also validated by the users in the study, who rated the framework's
usefulness at 4.27 out of 5.
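The abstract leaves the model unspecified; as a rough, hedged sketch of the
recommendation idea (service properties in, a set of monitor classes out), the
toy example below uses a simple multi-label classifier as a stand-in for the
paper's deep learning framework. All feature names and monitor labels are
hypothetical.

    # Hypothetical sketch: recommend monitor classes from service properties.
    # A simple multi-label classifier stands in for the paper's deep model.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    services = [  # made-up service properties
        {"tier": "frontend", "stateful": 0, "uses_sql": 1},
        {"tier": "backend", "stateful": 1, "uses_sql": 1},
        {"tier": "backend", "stateful": 0, "uses_sql": 0},
    ]
    monitor_sets = [  # made-up monitor/resource classes per service
        {"latency", "availability"},
        {"cpu", "sql_connectivity", "availability"},
        {"cpu", "memory"},
    ]

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(services)
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(monitor_sets)

    # One binary classifier per monitor class approximates the ranking step.
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

    new_service = {"tier": "backend", "stateful": 1, "uses_sql": 0}
    recommended = mlb.inverse_transform(clf.predict(vec.transform([new_service])))
    print(recommended)  # recommended monitor classes for the new service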
PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis
Major cloud providers have employed advanced AI-based solutions like large
language models to aid humans in identifying the root causes of cloud
incidents. Despite the growing prevalence of AI-driven assistants in the root
cause analysis process, their effectiveness in assisting on-call engineers is
constrained by low accuracy due to the intrinsic difficulty of the task, a
propensity of LLM-based approaches to hallucinate, and the difficulty of
distinguishing these well-disguised hallucinations. To address this challenge,
we propose performing confidence estimation for the predictions to help
on-call engineers decide whether to adopt a model prediction. Considering
the black-box nature of many LLM-based root cause predictors, fine-tuning or
temperature-scaling-based approaches are inapplicable. We therefore design an
innovative confidence estimation framework based on prompting
retrieval-augmented large language models (LLMs) that demands a minimal amount
of information from the root cause predictor. This approach consists of two
scoring phases: the LLM-based confidence estimator first evaluates its
confidence in making judgments about the current incident, which reflects its
``groundedness'' in the reference data, and then rates the root cause
prediction against historical references. An optimization step combines
these two scores into a final confidence assignment. We show that our method
produces calibrated confidence estimates for predicted root causes, and we
validate the usefulness of the retrieved historical data and the prompting
strategy, as well as its generalizability across different root cause
prediction models. Our study takes an important step toward reliably and
effectively embedding LLMs into cloud incident management systems.
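The abstract does not spell out the optimization step; the sketch below shows
the general shape of the two scoring phases and the score fusion, with
hypothetical LLM interfaces and a simple logistic calibration standing in for
the paper's optimization.

    # Hypothetical sketch of the two-phase confidence estimation.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def llm_groundedness_score(incident, references):
        """Placeholder: prompt an LLM to rate (0-1) how well the current
        incident is grounded in the retrieved reference data."""
        ...

    def llm_prediction_score(incident, root_cause, references):
        """Placeholder: prompt an LLM to rate (0-1) the predicted root
        cause against historical references."""
        ...

    # Suppose we collected score pairs on labeled historical incidents,
    # where y = 1 means the predicted root cause turned out to be correct.
    scores = np.array([[0.9, 0.8], [0.2, 0.6], [0.7, 0.3], [0.95, 0.9]])
    y = np.array([1, 0, 0, 1])

    calibrator = LogisticRegression().fit(scores, y)  # fuses the two scores

    confidence = calibrator.predict_proba([[0.85, 0.75]])[0, 1]
    print(f"calibrated confidence: {confidence:.2f}")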
Hybrid Retrieval-Augmented Generation for Real-time Composition Assistance
Retrieval-augmented models show promise in enhancing traditional language
models by improving their contextual understanding, integrating private data,
and reducing hallucination. However, the processing time required by retrieval-
augmented large language models poses a challenge when applying them to tasks
that require real-time responses, such as composition assistance.
To overcome this limitation, we propose the Hybrid Retrieval-Augmented
Generation (HybridRAG) framework, which combines client and cloud models.
HybridRAG incorporates retrieval-augmented memory generated asynchronously by
a Large Language Model (LLM) in the cloud. By integrating this memory, the
client model gains the ability to generate highly effective responses that
benefit from the LLM's capabilities. Furthermore, because memory integration
is asynchronous, the client model can deliver real-time responses to user
requests without waiting for memory synchronization from the cloud. Our
experiments on Wikitext and Pile subsets show that HybridRAG achieves lower
latency than a cloud-based retrieval-augmented LLM, while outperforming
client-only models in utility.
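As a minimal sketch of the asynchronous memory integration described above
(all interfaces hypothetical, not the paper's implementation), the client
responds immediately with whatever cloud-generated memory is currently
cached, while a background worker refreshes that memory:

    # Hypothetical sketch of asynchronous client/cloud memory integration.
    import threading

    class HybridComposer:
        def __init__(self, client_model, cloud_llm, retriever):
            self.client_model = client_model
            self.cloud_llm = cloud_llm
            self.retriever = retriever
            self.memory = ""          # retrieval-augmented memory from cloud
            self._lock = threading.Lock()

        def _refresh_memory(self, context):
            docs = self.retriever(context)           # retrieve private data
            summary = self.cloud_llm(context, docs)  # slow cloud LLM call
            with self._lock:
                self.memory = summary

        def complete(self, context):
            # Kick off the cloud refresh asynchronously; do not wait for it.
            threading.Thread(target=self._refresh_memory,
                             args=(context,), daemon=True).start()
            with self._lock:
                memory = self.memory  # possibly slightly stale, but instant
            return self.client_model(context, memory)  # real-time response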
Dependency Aware Incident Linking in Large Cloud Systems
Despite significant reliability efforts, large-scale cloud services
inevitably experience production incidents that can significantly impact
service availability and customer satisfaction. Worse, one incident can often
lead to multiple downstream failures due to cascading effects, creating
several related incidents across different dependent services. On-call
engineers (OCEs) frequently examine these incidents in silos, which leads to a
significant amount of manual toil and increases the overall time to mitigate
incidents. Developing efficient incident linking models is therefore of
paramount importance for grouping related incidents into clusters so as to
quickly resolve major outages and reduce on-call fatigue. Existing incident
linking methods mostly leverage the textual and contextual information of
incidents (e.g., title, description, severity, impacted components), and thus
fail to leverage the inter-dependencies between services. In this paper, we
propose the dependency-aware incident linking (DiLink) framework, which
leverages both textual and service dependency graph information to improve
the accuracy and coverage of incident links, not only within the same service
but also across different services and workloads. Furthermore, we propose a
novel method to align the embeddings of multi-modal (i.e., textual and graph)
data using Orthogonal Procrustes. Extensive experimental results on real-world
incidents from 5 Microsoft workloads demonstrate that our alignment method
achieves an F1-score of 0.96 (a 14% gain over current state-of-the-art
methods). We are also in the process of deploying this solution across 610
services from these 5 workloads to continuously support OCEs, improve incident
management, and reduce manual toil.
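Orthogonal Procrustes, named in the abstract, has a well-known closed-form
solution via the SVD; the self-contained sketch below illustrates the
alignment step on synthetic data (the embedding dimensions and data are made
up, not the paper's):

    # Align text embeddings to graph embeddings with Orthogonal Procrustes.
    import numpy as np

    def orthogonal_procrustes(A, B):
        """Orthogonal map R minimizing ||A R - B||_F: R = U V^T,
        where U S V^T is the SVD of A^T B."""
        U, _, Vt = np.linalg.svd(A.T @ B)
        return U @ Vt

    rng = np.random.default_rng(0)
    text_emb = rng.normal(size=(100, 64))   # stand-in text embeddings
    true_R = np.linalg.qr(rng.normal(size=(64, 64)))[0]  # random orthogonal
    graph_emb = text_emb @ true_R           # synthetic "graph" embeddings

    R = orthogonal_procrustes(text_emb, graph_emb)
    print(np.allclose(text_emb @ R, graph_emb))  # True: alignment recovered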
X-lifecycle Learning for Cloud Incident Management using LLMs
Incident management for large cloud services is a complex and tedious process
that requires a significant amount of manual effort from on-call engineers
(OCEs). OCEs typically leverage data from different stages of the software
development lifecycle (SDLC) (e.g., code, configurations, monitor data,
service properties, service dependencies, troubleshooting documents, etc.) to
generate insights for detecting, root-causing, and mitigating incidents.
Recent advancements in large language models (LLMs) (e.g., ChatGPT, GPT-4,
Gemini) have created opportunities to automatically generate contextual
recommendations that help OCEs quickly identify and mitigate critical issues.
However, existing research typically takes a siloed view, solving a given
incident management task with data from a single stage of the SDLC. In this
paper, we demonstrate that augmenting additional contextual data from
different stages of the SDLC improves the performance of two critically
important and practically challenging tasks: (1) automatically generating root
cause recommendations for dependency-failure-related incidents, and (2)
identifying the ontology of service monitors used for automatically detecting
incidents. By leveraging a dataset of 353 incidents and 260 monitors from
Microsoft, we demonstrate that augmenting contextual information from
different stages of the SDLC improves performance over state-of-the-art
methods.
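The abstract does not detail how the cross-stage context is combined; one
plausible, purely illustrative reading is assembling artifacts from several
SDLC stages into a single prompt (all field names below are hypothetical):

    # Hypothetical sketch: fold artifacts from several SDLC stages into one
    # prompt, rather than prompting on the incident text alone.
    def build_prompt(incident, code_snippets, dependencies, monitors, tsgs):
        sections = {
            "Incident": incident,
            "Related code": "\n".join(code_snippets),
            "Service dependencies": "\n".join(dependencies),
            "Monitor data": "\n".join(monitors),
            "Troubleshooting guides": "\n".join(tsgs),
        }
        body = "\n\n".join(f"{name}:\n{text}"
                           for name, text in sections.items())
        return body + "\n\nSuggest the most likely root cause."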
Exploring LLM-based Agents for Root Cause Analysis
The growing complexity of cloud-based software systems has made incident
management an integral part of the software development lifecycle. Root cause
analysis (RCA), a critical part of the incident management process, is a
demanding task for on-call engineers, requiring deep domain knowledge and
extensive experience with a team's specific services. Automating RCA can yield
significant time savings and ease the burden of incident management on on-call
engineers. Recently, researchers have used Large Language Models (LLMs) to
perform RCA and have demonstrated promising results. However, these approaches
cannot dynamically collect additional diagnostic information such as
incident-related logs, metrics, or databases, severely restricting their
ability to diagnose root causes. In this work, we explore the use of LLM-based
agents for RCA to address this limitation. We present a thorough empirical
evaluation of a ReAct agent equipped with retrieval tools on an
out-of-distribution dataset of production incidents collected at Microsoft.
Results show that ReAct performs competitively with strong retrieval and
reasoning baselines, but with substantially higher factual accuracy. We then
extend this evaluation by incorporating discussions associated with incident
reports as additional inputs for the models, which, surprisingly, does not
yield significant performance improvements. Lastly, we conduct a case study
with a team at Microsoft to equip the ReAct agent with tools that give it
access to the external diagnostic services the team uses for manual RCA. Our
results show how agents can overcome the limitations of prior work and
highlight practical considerations for implementing such a system in practice.
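For readers unfamiliar with ReAct, the skeleton below sketches the
reason-act-observe loop with retrieval tools; the LLM and tool interfaces are
hypothetical, not the paper's implementation:

    # Hypothetical sketch of a ReAct-style RCA loop: the agent interleaves
    # reasoning with tool calls that fetch diagnostic data, then commits to
    # a root cause.
    def react_rca(llm, tools, incident, max_steps=5):
        transcript = f"Incident: {incident}\n"
        for _ in range(max_steps):
            step = llm(transcript)                 # model reasons, then acts
            transcript += f"Thought: {step.thought}\n"
            if step.action == "finish":
                return step.argument               # proposed root cause
            observation = tools[step.action](step.argument)  # e.g., get logs
            transcript += (f"Action: {step.action}({step.argument})\n"
                           f"Observation: {observation}\n")
        return llm(transcript + "Final root cause:")  # fall back on budget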
Solving the Batch Stochastic Bin Packing Problem in Cloud: A Chance-constrained Optimization Approach
This paper investigates a critical resource allocation problem in the
first-party cloud: scheduling containers onto machines. There are tens of
services, and each service runs a set of homogeneous containers with dynamic
resource usage; containers of a service are scheduled daily in a batch
fashion. This problem can be naturally formulated as the Stochastic Bin
Packing Problem (SBPP). However, traditional SBPP research often focuses on
the case of empty machines, whose objective, i.e., minimizing the number of
used machines, is not well-defined for the more common reality of nonempty
machines. This paper aims to close this gap. First, we define a new objective
metric, Used Capacity at Confidence (UCaC), which measures the maximum used
resources at a given probability and is proven to be consistent for both
empty and nonempty machines, and we reformulate the SBPP under chance
constraints. Second, by modeling the container resource usage distribution
with a generative approach, we show that UCaC can be approximated with a
Gaussian distribution, which we verify on trace data of real-world
applications. Third, we propose an exact solver that solves the equivalent
cutting stock variant, as well as two heuristics-based solvers: UCaC best fit
and bi-level heuristics. We experimentally evaluate these solvers on both
synthetic datasets and real application traces, demonstrating our
methodology's advantage over the traditional SBPP optimal solver that
minimizes the number of used machines, while maintaining a low rate of
resource violations.
Comment: To appear in SIGKDD 2022 as a Research Track paper.
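Under the Gaussian approximation mentioned in the abstract, UCaC reduces to a
Normal quantile. A minimal sketch, assuming independent container usages (an
assumption made here for simplicity; the paper models the usage distribution
generatively):

    # If a machine's total usage is approximately Normal with mean mu and
    # std sigma, the used capacity at confidence p is the p-quantile
    # mu + z_p * sigma.
    from math import sqrt
    from statistics import NormalDist

    def ucac(container_means, container_vars, p=0.99):
        """Used Capacity at Confidence for independent container usages."""
        mu = sum(container_means)
        sigma = sqrt(sum(container_vars))  # variances add if independent
        return mu + NormalDist().inv_cdf(p) * sigma

    # A machine hosting three containers (made-up usage statistics, in cores):
    print(ucac([2.0, 1.5, 3.0], [0.25, 0.16, 0.49], p=0.99))  # ~8.71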
Rethinking Privacy in Machine Learning Pipelines from an Information Flow Control Perspective
Modern machine learning systems use models trained on ever-growing corpora.
Typically, metadata such as ownership, access control, or licensing information
is ignored during training. Instead, to mitigate privacy risks, we rely on
generic techniques such as dataset sanitization and differentially private
model training, with inherent privacy/utility trade-offs that hurt model
performance. Moreover, these techniques have limitations in scenarios where
sensitive information is shared across multiple participants and fine-grained
access control is required. By ignoring metadata, we therefore miss an
opportunity to better address security, privacy, and confidentiality
challenges. In this paper, we take an information flow control perspective to
describe machine learning systems, which allows us to leverage metadata such as
access control policies and define clear-cut privacy and confidentiality
guarantees with interpretable information flows. Under this perspective, we
contrast two different approaches to achieving user-level non-interference:
1) fine-tuning per-user models, and 2) retrieval-augmented models that access
user-specific datasets at inference time. We compare these two approaches to
a trivially non-interfering zero-shot baseline that uses a public model and
to a baseline that fine-tunes this model on the whole corpus. We evaluate the
trained models on two datasets of scientific articles and demonstrate that
retrieval-augmented architectures deliver the best utility, scalability, and
flexibility while satisfying strict non-interference guarantees.
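A minimal sketch of the second approach, assuming a public base model and a
per-user document store consulted only at inference time (all interfaces
hypothetical): non-interference holds because no user's data ever reaches
another user's queries.

    # Hypothetical sketch: public base model plus per-user retrieval.
    class PerUserRAG:
        def __init__(self, public_model):
            self.model = public_model   # never fine-tuned on private data
            self.indexes = {}           # user id -> that user's documents

        def add_document(self, user, doc):
            self.indexes.setdefault(user, []).append(doc)

        def answer(self, user, query):
            own_docs = self.indexes.get(user, [])  # only the caller's corpus
            context = "\n".join(d for d in own_docs
                                if query.lower() in d.lower())
            return self.model(query, context)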
Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation
Recent advancements in Large Language Models (LLMs) have revolutionized
decision-making by breaking down complex problems into more manageable language
sequences referred to as ``thoughts''. An effective thought design should
consider three key perspectives: performance, efficiency, and flexibility.
However, existing thought paradigms can exhibit at most two of these
attributes. To address these limitations, we introduce a novel thought
prompting approach called ``Everything of Thoughts'' (XoT) to defy the law of
the ``Penrose triangle'' of existing thought paradigms. XoT leverages
pretrained reinforcement learning
and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge
into thoughts, thereby enhancing LLMs' capabilities and enabling them to
generalize to unseen problems efficiently. Through the utilization of the
MCTS-LLM collaborative thought revision framework, this approach autonomously
produces high-quality comprehensive cognitive mappings with minimal LLM
interactions. Additionally, XoT empowers LLMs to engage in unconstrained
thinking, allowing for flexible cognitive mappings for problems with multiple
solutions. We evaluate XoT on several challenging multi-solution
problem-solving tasks, including Game of 24, 8-Puzzle, and Pocket Cube. Our
results demonstrate that XoT significantly outperforms existing approaches.
Notably, XoT can yield multiple solutions with just one LLM call, showcasing
its remarkable proficiency in addressing complex problems across diverse
domains.
Comment: 17 pages, 5 figures.
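The abstract's MCTS-LLM collaborative revision loop can be sketched at a high
level as follows; every interface here is hypothetical, and the real system's
search, verifier, and revision logic are far richer:

    # Hypothetical sketch of the XoT flow: MCTS (guided by a pretrained
    # policy/value network inside mcts_search) proposes a thought
    # trajectory; the LLM solves with it and, if the answer fails
    # verification, asks MCTS to revise from the flagged step.
    def xot_solve(problem, mcts_search, llm, verify, max_revisions=3):
        thought = mcts_search(problem)
        answer = None
        for _ in range(max_revisions):
            answer = llm(problem, thought)       # one LLM call per round
            ok, bad_step = verify(problem, answer)
            if ok:
                return answer
            thought = mcts_search(problem, revise_from=bad_step)
        return answer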
Large Language Models can Deliver Accurate and Interpretable Time Series Anomaly Detection
Time series anomaly detection (TSAD) plays a crucial role in various
industries by identifying atypical patterns that deviate from standard trends,
thereby maintaining system integrity and enabling prompt response measures.
Traditional TSAD models, which often rely on deep learning, require extensive
training data and operate as black boxes, lacking interpretability for detected
anomalies. To address these challenges, we propose LLMAD, a novel TSAD method
that employs Large Language Models (LLMs) to deliver accurate and interpretable
TSAD results. LLMAD innovatively applies LLMs for in-context anomaly detection
by retrieving both positive and negative similar time series segments,
significantly enhancing LLMs' effectiveness. Furthermore, LLMAD employs the
Anomaly Detection Chain-of-Thought (AnoCoT) approach to mimic expert logic for
its decision-making process. This further enhances performance and enables
LLMAD to provide explanations for its detections from versatile perspectives,
which is particularly important for user decision-making. Experiments on
three datasets indicate that LLMAD achieves detection performance comparable
to state-of-the-art deep learning methods while offering remarkable
interpretability for its detections. To the best of our knowledge, this is
the first work that directly employs LLMs for TSAD.
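As a rough sketch of the in-context detection recipe described above
(retrieved positive and negative segments as few-shot examples, plus a
chain-of-thought instruction), with hypothetical retrieval and LLM interfaces:

    # Hypothetical sketch of LLMAD-style in-context anomaly detection.
    def detect(llm, retrieve, window):
        # Few-shot examples: similar segments with known labels.
        positives = retrieve(window, label="anomalous", k=2)
        negatives = retrieve(window, label="normal", k=2)
        examples = "\n".join(
            f"Series: {seg} -> {lab}"
            for seg, lab in [(s, "anomalous") for s in positives]
                          + [(s, "normal") for s in negatives])
        prompt = (f"{examples}\n"
                  f"Series: {window} -> ?\n"
                  "Reason step by step about trend, seasonality, and "
                  "outliers, then answer 'anomalous' or 'normal' with a "
                  "short explanation.")
        return llm(prompt)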
