The Journal of Oral Pathology & Medicine highlights a systematic review examining the application of machine learning models to the diagnosis of intraosseous lesions of the gnathic bones. The review, conducted in line with the PRISMA 2020 guidelines, assesses the reliability, impact, and practicality of these AI models. Notably, it draws attention to how study datasets are sampled and to the need for comprehensive metrics to evaluate model performance, aspects often under-reported in existing studies.
The systematic review aimed to synthesize the evidence on the efficacy of machine learning (ML) models in diagnosing intraosseous lesions of the gnathic bones. The researchers used the PICOS framework to frame the question of AI reliability in this diagnostic field and searched PubMed, Embase, Scopus, the Cochrane Library, Web of Science, LILACS, IEEE Xplore, and several gray literature sources. The review was registered in the PROSPERO database (CRD42022379298).
Methodology and Data Collection
Using the PROBAST tool, the researchers assessed the risk of bias in the included studies and synthesized results by the dataset's task and sampling strategy. They focused on 26 studies encompassing 21,146 radiographic images, primarily investigating ameloblastomas, odontogenic keratocysts, dentigerous cysts, and periapical cysts. The studies were predominantly classified as type 2 under TRIPOD, meaning a single dataset was randomly split into training and test partitions rather than validated on external data (a minimal illustration follows).
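To make the sampling concern concrete, here is a minimal sketch of the random split-sample design that TRIPOD type 2 describes, using scikit-learn; the feature matrix, label array, and split proportions are hypothetical stand-ins, not taken from the review.

```python
# Minimal sketch of TRIPOD type 2 "random split-sample" validation.
# All data below are hypothetical placeholders, not from the review.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 64))    # stand-in radiographic features
y = rng.integers(0, 4, size=1000)  # stand-in labels for four lesion types

# One dataset split at random into training and test partitions.
# Images from the same patient can end up on both sides of the split,
# which is one reason the review calls for detailed sampling reports.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (800, 64) (200, 64)
```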
Only 13 of the studies reported F1 scores, covering 20 experiments, with a mean of 0.71 (±0.25). The F1 score is the harmonic mean of precision and recall, so it summarizes both in a single figure (illustrated below). Even so, the review highlighted significant gaps: data sampling methods were seldom described in detail, and most studies lacked a comprehensive set of metrics for training and validation.
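As a quick illustration of how the F1 score balances precision and recall, the snippet below computes it from a hypothetical set of prediction counts; the numbers are invented for demonstration and do not come from the review.

```python
# F1 as the harmonic mean of precision and recall.
# The counts below are invented for illustration only.
tp, fp, fn = 70, 20, 30  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # 0.78: how many flagged lesions were real
recall = tp / (tp + fn)     # 0.70: how many real lesions were flagged
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# precision=0.78 recall=0.70 F1=0.74
```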
Study Outcomes and Interpretations
Despite the promise of ML models, the review found no conclusive evidence supporting their routine clinical application for detecting, segmenting, and classifying intraosseous lesions in gnathic bones. The absence of external testing and the inadequate detail on data sampling make it difficult to evaluate the models' true performance.
Earlier reports have similarly indicated the potential of ML in medical diagnosis, while citing issues with data quality and the need for more robust validation. This systematic review reiterates those concerns, emphasizing external testing and comprehensive metric reporting as prerequisites for reliable clinical use of ML models.
In contrast to previous studies, which were often more optimistic about AI's capabilities, this review offers a more cautious outlook and underlines the importance of addressing methodological limitations. Its emphasis on rigorous data sampling and comprehensive validation metrics signals a shift toward more exacting standards for evaluating AI models.
To advance the field, future research should report data sampling in detail and apply a comprehensive set of metrics during model training and validation; this would allow more accurate assessment of ML models' performance and of their readiness for clinical practice. External testing, in particular, can provide a more objective evaluation of reliability and effectiveness in real-world scenarios (a minimal sketch of such reporting follows).
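As a closing sketch of the kind of comprehensive metric reporting the review calls for, the function below gathers several standard classification metrics for a held-out test set using scikit-learn; the function name and the external-data variables are hypothetical placeholders, not the review's own protocol.

```python
# Hypothetical sketch of comprehensive metric reporting on an
# external test set; names here are placeholders, not the review's API.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def report_metrics(y_true, y_pred, y_score):
    """Collect a standard set of binary-classification metrics."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }

# In an externally tested study, y_true/y_pred/y_score would come from
# a dataset collected at a different institution than the training data:
# print(report_metrics(external_y, external_pred, external_prob))
```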