Orchestrating Adaptive Resilience and Continuity Restoration in Cloud-Native Environments
Amar Gurajapu
Principal Member of Technical Staff, Network Systems, AT&T, Middletown, New Jersey, United States
Anurag Agarwal
Senior Software Engineer, Network Systems, AT&T, Middletown, New Jersey, United States
http://doi.org/10.37648/ijiest.v12i01.001
Abstract
Cloud-native services must tolerate node failures, network partitions, and entire-region outages without violating SLAs. We survey Adaptive Resilience Mechanisms (ARMs), including pod-level checkpointing, self-healing circuits, and dynamic redundancy, and Continuity Restoration Strategies (CRSs), such as geo-replication with automated DNS switchover. We then present an AI-driven framework that fuses real-time telemetry, anomaly detection via LSTM autoencoders, failure classification, and Infrastructure-as-Code orchestration. A two-region Kubernetes prototype achieves a Restoration Time Objective (RTO) under 3 minutes and a Continuity Point Objective (CPO) under 5 seconds, improving data continuity by 40% and availability by 10%.
Keywords: Infrastructure as Code; Terraform; CI/CD Resource Allocation; Kubernetes Checkpoint/Restore; Geo-Replication; DNS Switchover; Restoration Time Objective (RTO); Continuity Point Objective (CPO)
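The two objectives named in the abstract can be illustrated with a minimal sketch. It assumes RTO is measured from the moment the primary region fails to the moment service resumes in the standby region, and CPO as the gap between the last write confirmed in the standby region and the failure. Those definitions, the `FailoverEvent` record, and its field names are hypothetical and chosen for illustration only; the 180-second and 5-second targets are the figures quoted in the abstract.

```python
from dataclasses import dataclass

# Targets taken from the abstract: RTO under 3 minutes, CPO under 5 seconds.
RTO_TARGET_S = 180.0
CPO_TARGET_S = 5.0


@dataclass
class FailoverEvent:
    """Hypothetical record of one failover; all timestamps are in seconds."""
    last_replicated_write_s: float  # last write confirmed in the standby region
    failure_s: float                # moment the primary region failed
    restored_s: float               # moment service resumed in the standby region


def restoration_time(event: FailoverEvent) -> float:
    # Assumed definition: elapsed time from failure to restored service.
    return event.restored_s - event.failure_s


def continuity_gap(event: FailoverEvent) -> float:
    # Assumed definition: window of writes that may not have reached the standby.
    return event.failure_s - event.last_replicated_write_s


def meets_objectives(event: FailoverEvent) -> bool:
    # Both targets are strict upper bounds ("under" in the abstract).
    return (restoration_time(event) < RTO_TARGET_S
            and continuity_gap(event) < CPO_TARGET_S)


if __name__ == "__main__":
    event = FailoverEvent(last_replicated_write_s=997.0,
                          failure_s=1000.0,
                          restored_s=1150.0)
    print(restoration_time(event), continuity_gap(event), meets_objectives(event))
```

Keeping the two measurements as separate functions lets a failover drill report each value on its own before checking them against the targets.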