Data Versioning and Drift Detection in ML Pipelines Using AI
Abstract
In the ever-evolving landscape of machine learning (ML), maintaining consistent model performance over time remains a fundamental challenge. One of the key contributors to model degradation is data drift—the change in data distributions over time—which can significantly compromise the reliability of predictions in production environments. Coupled with this is the often-overlooked challenge of data versioning, a critical aspect in the reproducibility and traceability of machine learning experiments. This paper explores the integration of artificial intelligence (AI)-driven methods for robust data versioning and effective drift detection within modern ML pipelines. We begin by examining traditional practices in data versioning using tools such as DVC, Git-LFS, and MLflow, highlighting their limitations in scalability and automation. To address these gaps, we propose an AI-assisted framework that leverages metadata, schema evolution tracking, and automated tagging to enhance version control throughout the pipeline—from data ingestion to model deployment. In parallel, we evaluate advanced techniques for drift detection, including statistical methods (e.g., KS-test, PSI) and AI-enhanced approaches such as autoencoders, recurrent neural networks, and ensemble-based monitoring systems. Through a series of experiments conducted on real-world retail and finance datasets, our framework demonstrates high sensitivity in detecting concept and data drift while minimizing false positives. Additionally, we showcase how AI can predict potential drift before it impacts model accuracy by analyzing historical patterns and input-output shifts using time-series forecasting models. Integration with CI/CD and MLOps platforms ensures seamless deployment and ongoing monitoring in real-time production environments.
The paper concludes by emphasizing the growing need for intelligent, automated systems that provide transparency, accountability, and resilience in ML workflows. As organizations increasingly rely on machine learning models for critical decision-making, ensuring that these systems remain stable and trustworthy becomes paramount. Our findings reinforce the importance of combining AI techniques with software engineering best practices to create adaptive, self-healing ML pipelines capable of handling data drift and ensuring reproducibility through systematic versioning.
This research contributes to the field of MLOps by providing a scalable, modular, and AI-enhanced approach for managing data versioning and drift detection. Future work will involve extending this framework to accommodate federated learning environments and integrating it with blockchain for immutable audit trails.