Standard validation protocols in materials machine learning continue to rely on the assumption that training and test data are drawn from the same underlying distribution. This assumption is almost invariably violated in real-world materials datasets because of temporal drift in measurement techniques, compositional biases in database construction, and experimental confounders arising from different laboratories and instruments. This conceptual framework article proposes adversarial validation as a diagnostic tool specifically tailored for materials informatics: a method that trains a discriminator to explicitly detect whether a distribution shift exists between any two datasets, thereby revealing hidden generalization failures that conventional train-test splits and k-fold cross-validation cannot expose. The framework introduces the conceptual foundations of adversarial validation, distinguishes it from adversarial attacks, articulates why the technique is particularly powerful in the small-data, high-dimensional, and physically constrained domain of materials science, and offers a five-component structure for its systematic application—feature-space definition, classifier selection, shift-detection thresholding, localization of driving features, and actionable response rules. By embedding materials-specific domain knowledge into the interpretation of discriminator performance, the approach transforms validation from a passive checkpoint into an active diagnostic that can distinguish temporal shift from compositional bias and experimental confounding. 
The implications for materials AI practice are direct: researchers can report adversarial validation results alongside standard metrics, trigger targeted dataset augmentation or model retraining when a shift is detected, and document potential sources of distribution mismatch in experimental workflows, thereby raising the robustness and trustworthiness of the property predictions that underpin materials discovery and design.
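As an illustration of the discriminator idea described above, the following minimal sketch labels training rows 0 and test rows 1, scores held-out rows with a simple classifier, and reports the out-of-fold ROC AUC. An AUC near 0.5 suggests the two sets are hard to tell apart, while an AUC well above 0.5 points to a detectable shift. Everything here is an assumption made for illustration: the nearest-centroid discriminator stands in for whatever classifier the framework would select, the function names are hypothetical, and the data are synthetic Gaussian features rather than materials descriptors.

```python
import random
from statistics import fmean


def roc_auc(labels, scores):
    """Rank-based ROC AUC (Mann-Whitney form); tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def adversarial_auc(train_X, test_X, n_folds=5, seed=0):
    """Out-of-fold AUC of a discriminator separating train rows (0) from test rows (1)."""
    X = list(train_X) + list(test_X)
    y = [0] * len(train_X) + [1] * len(test_X)
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = [0.0] * len(X)
    for k in range(n_folds):
        held = set(idx[k::n_folds])  # rows scored in this fold
        fit = [i for i in idx if i not in held]  # rows used to fit the centroids
        c_train = centroid([X[i] for i in fit if y[i] == 0])
        c_test = centroid([X[i] for i in fit if y[i] == 1])
        for i in held:
            # Higher score means closer to the "test" centroid than to the "train" one.
            scores[i] = sq_dist(X[i], c_train) - sq_dist(X[i], c_test)
    return roc_auc(y, scores)


# Synthetic demonstration (hypothetical data): two features, the first optionally shifted.
rng = random.Random(42)


def sample(n, shift):
    return [[rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]


auc_same = adversarial_auc(sample(200, 0.0), sample(200, 0.0))
auc_shift = adversarial_auc(sample(200, 0.0), sample(200, 2.0))
print(f"no shift: AUC = {auc_same:.2f}; shifted feature: AUC = {auc_shift:.2f}")
```

The threshold at which an AUC counts as evidence of shift is a design choice that this sketch leaves open; it corresponds to the shift-detection thresholding component named in the abstract.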
This review systematically examines the treatment of absence and null results in the materials machine learning literature spanning 2017–2022, drawing exclusively on a curated set of 30 peer-reviewed publications and foundational works that address publication bias, negative findings, and reproducibility challenges in data-driven materials discovery. Through a targeted search strategy across databases such as Web of Science, Scopus, and arXiv using terms including “null result,” “negative result,” “publication bias,” “file drawer,” “failed synthesis,” and “reproducibility” combined with materials informatics keywords, the analysis reveals a persistent imbalance: while successful predictions and syntheses dominate published outputs, systematic documentation of failed predictions, unsuccessful syntheses, null correlations, and abandoned model architectures remains exceedingly rare. What is currently reported tends to be limited to negative outcomes that coincidentally reveal mechanistic insights or contradict high-profile hypotheses, whereas what is systematically unreported encompasses the vast majority of unsuccessful hyperparameter searches, negative active learning campaigns, and non-discoveries that yield no novel materials meeting target criteria. The typology of absence and null results developed here identifies six distinct categories—negative predictive outcomes, null hypothesis non-rejection, failed synthesis, non-discovery, failed replication, and abandoned architecture—each carrying unique implications for scientific progress. The consequences of this non-reporting include severe overestimation of model performance, widespread redundant experimental effort, a false sense of methodological consensus across the field, and slowed overall discovery rates as potentially informative negative signals remain invisible. 
Ultimately, this review offers concrete recommendations for authors, journals, and the broader community to shift incentives toward transparent reporting of absence, thereby restoring balance to the materials AI literature and accelerating reliable data-driven discovery.
In the rapidly expanding domain of artificial intelligence applied to materials science, a persistent conceptual ambiguity undermines the reliability of reported model capabilities. The terms “generalization” and “transfer” are routinely conflated, with authors claiming that a model “generalizes” when it is in fact being evaluated on samples drawn from a different distribution. This definitional paper draws a sharp conceptual distinction between the two notions. Generalization is defined as the expected performance of a trained model on new samples drawn independently and identically from the same underlying distribution as the training data. In contrast, transfer is defined as performance on samples drawn from a different distribution, where the i.i.d. assumption is violated by construction. The distinction matters because a model that generalizes well within its training distribution can fail dramatically under transfer conditions, and, conversely, a successful transfer mechanism may mask poor generalization. Treating the two interchangeably therefore produces overclaims about model robustness that cannot be sustained when materials discovery moves beyond the convex hull of available training data. The paper articulates a two-dimensional boundary framework, defined by distribution-shift magnitude and feature-space overlap, that locates any given evaluation setting along a continuum from pure generalization to pure transfer, thereby enabling authors, reviewers, and practitioners to specify precisely which capability is being claimed and tested. By clarifying these boundaries and exposing the epistemic costs of current usage, the work supplies a conceptual foundation for more disciplined reporting standards and evaluation protocols in materials machine learning.
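The two definitions above can be written compactly as follows. The notation is an assumption introduced here for illustration and does not come from the paper: $P$ is the training distribution, $Q$ the evaluation distribution, $\ell$ a loss function, and $\hat{f}_S$ the model fitted on the training sample $S$.

```latex
% Assumed notation (not from the source): P = training distribution,
% Q = evaluation distribution, \ell = loss, \hat{f}_S = model fitted on S.
\begin{align}
  S &= \{(x_i, y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P
    && \text{(training sample)} \\
  R_P(\hat{f}_S) &= \mathbb{E}_{(x,y)\sim P}\!\left[\ell\big(\hat{f}_S(x),\, y\big)\right]
    && \text{(generalization: evaluated on } P\text{)} \\
  R_Q(\hat{f}_S) &= \mathbb{E}_{(x,y)\sim Q}\!\left[\ell\big(\hat{f}_S(x),\, y\big)\right],
    \quad Q \neq P
    && \text{(transfer: evaluated on } Q\text{)}
\end{align}
```

In this reading, a small $R_P$ says nothing by itself about $R_Q$, which is why a claim of "generalization" supported only by an estimate of $R_Q$, or the reverse, conflates two different quantities.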
This review systematically examines the handling, or more often the neglect, of domain shift within the materials machine learning literature published between 2017 and 2023. It draws on a targeted search of peer-reviewed publications across specialized databases and journals to compile and analyze 30 representative studies that span foundational overviews, application-focused works, and methodological explorations. Domain shift in materials science takes four distinct yet interrelated forms (temporal, compositional, experimental, and theoretical), each arising from the inherently heterogeneous nature of materials data sources, which range from evolving laboratory protocols and diverse chemical families to inter-laboratory variations and discrepancies between computational approximations and experimental realities. Explicit acknowledgment of domain shift remains rare: the majority of papers proceed under the default assumption of identical training and test distributions, and detection methods and adaptation strategies appear in fewer than one in five studies, leaving models vulnerable to silent degradation when deployed on real-world materials problems. The surveyed methods for handling domain shift include statistical detection techniques, domain-adversarial training frameworks, feature-alignment approaches, and shift-robust evaluation protocols, many of which have been proposed in adjacent machine-learning fields yet remain underutilized in materials contexts despite their direct relevance to property prediction and inverse design tasks.
Collectively, these findings underscore the urgent need for standardized shift-reporting protocols, the development of materials-specific out-of-distribution benchmarks, and the integration of domain-adaptation pipelines into routine workflows, thereby elevating the reliability, generalizability, and practical utility of machine-learning models in accelerating materials discovery.
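To make the category of statistical detection techniques concrete, here is a minimal sketch of one such technique: a per-feature two-sample Kolmogorov-Smirnov statistic used to rank features by how strongly their training and test distributions differ. The review does not state which tests the surveyed studies use, so the choice of this statistic, the feature names, and the synthetic data are all assumptions made for illustration.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    # The supremum of the gap between two step functions is attained at a data point.
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


# Synthetic demonstration (hypothetical feature names and values).
rng = random.Random(0)
train = {
    "feature_a": [rng.gauss(0.0, 1.0) for _ in range(300)],
    "feature_b": [rng.gauss(0.0, 1.0) for _ in range(300)],
}
test = {
    "feature_a": [rng.gauss(0.0, 1.0) for _ in range(300)],  # same distribution
    "feature_b": [rng.gauss(1.5, 1.0) for _ in range(300)],  # shifted mean
}

# Rank features from most to least shifted.
ranking = sorted(train, key=lambda name: ks_statistic(train[name], test[name]), reverse=True)
for name in ranking:
    print(name, round(ks_statistic(train[name], test[name]), 3))
```

A univariate test of this kind only sees marginal shifts; shifts that appear solely in the joint distribution of several features would need a multivariate approach.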