Explainable AI Based Reliability Analytics for Performance Optimization in Large Scale Cloud Services
Abstract
Large-scale cloud services exhibit highly dynamic workloads, heterogeneous resources, and complex service dependencies, making performance optimization and reliability assurance increasingly difficult for traditional monitoring and rule-based management. Although artificial intelligence has improved cloud service management through anomaly detection, fault prediction, and adaptive optimization, most AI-based solutions offer limited interpretability, which reduces trust and hinders adoption in mission-critical applications. This paper proposes an Explainable AI-based Reliability Analytics (XAI-RA) framework to streamline the operation of large-scale cloud services. The framework integrates distributed monitoring agents, machine-learning-based anomaly detectors, explainable inference via SHAP and LIME, reliability analytics, and an adaptive performance optimization controller within a unified architecture. The system is designed to detect abnormal cloud behavior, identify the most influential performance variables, support root-cause interpretation, and trigger optimization actions such as resource reallocation and automatic scaling. Experimental evaluation on simulated cloud workloads shows that the framework reduces the average response time from 272 ms to 214 ms and raises throughput from 2,580 requests/s to 3,345 requests/s, indicating a substantial performance improvement. In anomaly detection, the proposed XAI-RA model achieves accuracy, precision, recall, and F1-score of 97.3%, 96.9%, 96.4%, and 96.6%, respectively, outperforming traditional machine-learning baselines. The explainability analysis further shows that CPU utilization, memory usage, and network latency are the factors most responsible for cloud performance anomalies.
Scalability tests with loads ranging from 1,000 to 6,000 requests confirm that the framework sustains lower response times and higher throughput than the baseline system, with improvements of 33.7% and 38.7% at peak load, respectively. The results demonstrate how explainable AI and reliability analytics can be integrated to provide a practical and trustworthy foundation for intelligent, scalable, and transparent cloud service management.
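The abstract does not include implementation details. As a minimal illustrative sketch only (not the authors' code), the explainability step of attributing anomaly predictions to telemetry metrics such as CPU utilization, memory usage, and network latency can be approximated on synthetic data, here using impurity-based feature importances from a random forest as a simple stand-in for SHAP or LIME attributions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic telemetry: CPU utilization (%), memory usage (%), network latency (ms).
X = np.column_stack([
    rng.uniform(10, 95, n),   # cpu
    rng.uniform(20, 90, n),   # mem
    rng.uniform(5, 120, n),   # latency
])

# Illustrative labeling rule: an anomaly occurs when CPU saturates or latency spikes.
y = ((X[:, 0] > 85) | (X[:, 2] > 100)).astype(int)

# Fit an anomaly classifier, then rank metrics by their contribution to predictions.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = dict(zip(["cpu", "mem", "latency"], clf.feature_importances_))
ranked = sorted(importances, key=importances.get, reverse=True)
print(ranked)  # metrics ordered from most to least influential
```

Under this synthetic rule, CPU utilization and latency dominate the ranking while memory usage (which never drives the label) falls last, mirroring how the framework's attribution step would surface the metrics behind an anomaly.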