In the current clinical trial landscape, data cleaning is evolving from traditional batch review toward real-time cleaning, which makes the process more proactive and efficient. Data management teams must adopt more effective methods for reviewing and cleaning data, as traditional review practices alone are no longer sufficient.
The objective of data cleaning does not change: to ensure complete and consistent data throughout the entire study. However, real-time data cleaning enables this goal to be achieved more efficiently by reducing timelines and improving overall data quality.
Effective data cleaning is not limited to the data itself but also involves the integrated and comprehensive cleaning of all study components.
The Challenges of Traditional Data Cleaning
Traditional data cleaning in clinical trials typically involves collecting and entering data into the Electronic Data Capture (EDC) system without immediate validation. As a result, discrepancies and inconsistencies are not addressed in real time but are instead reviewed and resolved later during scheduled batch cleaning cycles (weekly or monthly), delaying the availability of clean, reliable data.
While automated edit checks allow for some level of validation at the point of data entry, most issues are identified only later, through data review listings or external data reconciliation. This review is often performed manually, and it is where issues such as logical inconsistencies within a single CRF, discrepancies with external data sources, or SAE reconciliation findings are checked to ensure data consistency.
This approach typically results in the generation of a large volume of queries at once, which are more difficult to resolve because they relate to older data. Consequently, there is greater dependence on Clinical Research Associates (CRAs) for resolution. As a result, many errors are detected at a later stage, when they are more difficult to correct.
Real-time data cleaning changes this picture significantly: it allows continuous detection and correction of errors in clinical data throughout the trial, either simultaneously with data capture or with minimal delay (near real time).
Leveraging Real-Time Data Cleaning Strategies
Real-time data cleaning strategies consist of a set of processes and tools designed to validate, detect, and correct errors or inconsistencies in a continuous and near-immediate way, as data are entered or integrated into the system (EDC and external sources).
- A well-defined set of automatic edit checks in the Data Review Plan ensures validation at the point of data entry and prevents incorrect or inconsistent values.
- Risk-Based Monitoring (RBM) principles should be taken into account: they focus review on critical data (e.g., primary endpoints, safety variables) and support early identification of trends or risks, improving the accuracy and quality of data received later in the study. Only a small proportion of all data points is modified during the cleaning process. The traditional approach involved 100% Source Data Verification (SDV), whereas the current approach focuses on a subset of critical data. RBM can support the same targeted strategy in real-time data cleaning.
- Modern systems enable real-time integration of multiple data sources (such as ePRO, eCOA, imaging, laboratory, and safety data) through APIs, allowing continuous data flow and validation. This approach supports automated and ongoing data reconciliation by comparing EDC data with external sources in real time, helping to proactively identify and prevent inconsistencies or quality issues. When discrepancies are detected across sources, queries can be generated automatically and immediately, facilitating faster resolution and improved data quality throughout the study.
It is well known that external data often arrives late, has different formats, and generates complex discrepancies that require resolution not only by the site but also by the external vendor. With real-time integration, external data is loaded frequently, automatically compared with EDC data, and validated instantly.
As clinical trials become more decentralized (DCTs), reliance on external data increases and the role of real-time data cleaning becomes even more critical. Integrating all data in the same system allows all team members, regardless of role, to work with the same data and identify issues simultaneously.
- A robust Clean Patient Tracker is essential for effective real-time data cleaning. By integrating all data into a single platform, it provides a comprehensive, real-time view of data quality and cleaning status across multiple sources: not only EDC but also laboratory systems, ePRO, eCOA, imaging, and safety databases.
Centralizing external data within the same system also supports the creation of global dashboards, enabling the identification of systemic issues and highlighting vendors with recurring data quality concerns.
- The increased adoption of Artificial Intelligence (AI) is transforming real-time data cleaning by enabling the rapid processing and review of large volumes of data. While automation enhances efficiency and accelerates data validation, human oversight remains essential. Clinical Data Managers, Medical Monitors, CRAs, and Statisticians continue to be critical for complex data review and decision-making.
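To make the strategies above more concrete, the following is a minimal Python sketch of three of them: a point-of-entry edit check, an automated reconciliation between EDC and an external laboratory source that raises queries, and a simple clean patient status derived from open queries. It is an illustration under stated assumptions, not a description of any particular EDC system. Every name (Query, edit_check_systolic_bp, reconcile_sample_dates, clean_patient_status), the data layout, and the 60-250 mmHg range are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    """A data query raised against one subject (hypothetical structure)."""
    subject_id: str
    field_name: str
    message: str
    status: str = "open"


def edit_check_systolic_bp(subject_id: str, value: Optional[float]) -> Optional[Query]:
    """Point-of-entry range check; the 60-250 mmHg limits are illustrative only."""
    if value is None:
        return Query(subject_id, "systolic_bp", "Value is missing")
    if not 60 <= value <= 250:
        return Query(subject_id, "systolic_bp", f"Value {value} is outside 60-250 mmHg")
    return None


def reconcile_sample_dates(edc: dict, lab: dict) -> list:
    """Compare sample dates keyed by (subject_id, visit) and raise one query per discrepancy."""
    queries = []
    # Walk the union of keys so records missing on either side are detected.
    for subject_id, visit in sorted(edc.keys() | lab.keys()):
        edc_date = edc.get((subject_id, visit))
        lab_date = lab.get((subject_id, visit))
        if edc_date is None:
            message = f"{visit}: lab record has no matching EDC entry"
        elif lab_date is None:
            message = f"{visit}: no lab record received"
        elif edc_date != lab_date:
            message = f"{visit}: EDC date {edc_date} differs from lab date {lab_date}"
        else:
            continue  # Both sources agree, nothing to raise.
        queries.append(Query(subject_id, "lab_sample_date", message))
    return queries


def clean_patient_status(subject_ids: list, queries: list) -> dict:
    """A subject counts as clean here when it has no open queries (assumed definition)."""
    open_counts = {sid: 0 for sid in subject_ids}
    for query in queries:
        if query.status == "open":
            open_counts[query.subject_id] = open_counts.get(query.subject_id, 0) + 1
    return {sid: count == 0 for sid, count in open_counts.items()}


# Illustrative data: S002 has a date mismatch and S003 has no lab record.
edc_dates = {("S001", "V1"): "2024-01-10", ("S002", "V1"): "2024-01-11", ("S003", "V1"): "2024-01-12"}
lab_dates = {("S001", "V1"): "2024-01-10", ("S002", "V1"): "2024-01-12"}

all_queries = reconcile_sample_dates(edc_dates, lab_dates)
for subject, is_clean in clean_patient_status(["S001", "S002", "S003"], all_queries).items():
    print(subject, "clean" if is_clean else "open queries")
```

In a real study, the reconciliation step would run each time external data is loaded rather than in a scheduled batch, and the definition of a clean patient would normally include further criteria (for example completed SDV on critical data and signed forms) that this sketch leaves out.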
Conclusion
Real-time data cleaning strategies represent a significant change in how clinical data management works. These strategies enable early detection and resolution of data issues and reduce the time needed for Final Database Lock.
Successful implementation requires a well-defined process, cross-functional collaboration, and a combination of automation and expert oversight.
Author:
Evangelina Garcia
Principal Data Project Manager