Data Quality Management for Real-World Graduation Prediction

EasyChair Preprint 15698
6 pages • Date: January 10, 2025

Abstract

The rapid growth of diverse, multi-sourced data has left traditional data storage models unable to handle its sheer volume and complexity. Data lakes, which store all raw data and all data versions in an easily accessible format, are well suited to deep data analysis and to discovering valuable insights. However, the quality of this data is not guaranteed, which raises the question of how to use such a vast repository effectively. Our research proposes a four-step data quality management process (profile, implement, monitor, and improve) to oversee and ensure data usability within a data lake. The process employs five commonly used evaluation criteria: accuracy, completeness, consistency, uniqueness, and timeliness. Our study focuses on higher education data, an area that has not been extensively explored in previous research, using real-world data from a university's computer science department. The application context is managing the quality of input data for a machine learning model that predicts student graduation outcomes. Two advanced boosting machine learning models, LightGBM and CatBoost, are employed, resulting in a 5% improvement in performance. Our research aims to provide a comprehensive solution for assessing data quality in higher education, saving significant time, effort, and cost while enhancing the reliability of data drawn from data lakes.

Keyphrases: Big Data, Data Quality Management, Educational Data Mining, graduation prediction
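To make three of the five evaluation criteria named in the abstract concrete, the following is a minimal sketch of how completeness, uniqueness, and timeliness might be scored on student records. It is not the authors' implementation: the metric definitions, the function names, the field names (`student_id`, `gpa`, `updated`), and the sample records are all assumptions chosen for illustration.

```python
from datetime import date


def completeness(records, fields):
    """Share of (record, field) cells that are neither None nor empty."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(
        1 for r in records for f in fields if r.get(f) not in (None, "")
    )
    return filled / total


def uniqueness(records, key):
    """Distinct key values divided by record count (1.0 means no duplicates)."""
    if not records:
        return 1.0
    return len({r.get(key) for r in records}) / len(records)


def timeliness(records, date_field, as_of, max_age_days):
    """Share of records updated within max_age_days before as_of."""
    if not records:
        return 1.0
    fresh = sum(
        1
        for r in records
        if r.get(date_field) is not None
        and (as_of - r[date_field]).days <= max_age_days
    )
    return fresh / len(records)


# Hypothetical records: one missing GPA, one duplicated ID, one stale row.
sample_records = [
    {"student_id": "S1", "gpa": 3.2, "updated": date(2024, 12, 1)},
    {"student_id": "S2", "gpa": None, "updated": date(2024, 11, 15)},
    {"student_id": "S2", "gpa": 2.8, "updated": date(2023, 1, 10)},
    {"student_id": "S3", "gpa": 3.9, "updated": date(2024, 12, 20)},
]

if __name__ == "__main__":
    fields = ["student_id", "gpa", "updated"]
    print("completeness:", completeness(sample_records, fields))
    print("uniqueness:", uniqueness(sample_records, "student_id"))
    print("timeliness:", timeliness(sample_records, "updated", date(2025, 1, 10), 365))
```

Scores like these could be computed in the profile step and tracked over time in the monitor step. Accuracy and consistency are omitted here because they typically require reference data or cross-field rules that the abstract does not specify.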