Artificial intelligence (AI) has rapidly expanded the scale and ambition of materials research, enabling property prediction, candidate screening, and data-driven optimization across large chemical and structural spaces. However, the field still lacks a discipline-specific standard for scientific accountability: a structured way to report what an AI output legitimately warrants, under which assumptions, and with what limitations. This gap is not cosmetic; it is epistemic. Materials AI often converts heterogeneous proxies (composition features, crystal graphs, microstructure descriptors) into numerical predictions. Yet manuscripts frequently present these outputs as claims of generality, mechanism, or design readiness without specifying the scope conditions that would make such claims defensible. Recent progress in graph neural networks, benchmark suites, and large community datasets improves comparability, but it also amplifies risks of leakage, distribution shift, and proxy instability, which can inflate conclusions while remaining underreported. Meanwhile, uncertainty quantification and explainable AI are increasingly used as trust signals, even though both can be misunderstood when their semantics are not clearly stated and their limitations are not operationalized for decision-making. We propose a conceptual standard, the Scientific Accountability Sheet (SAS), which binds reported claims to explicit claim types, scope boundaries, evidence anchors, uncertainty semantics, and decision admissibility. SAS reframes “responsible reporting” as a scientific warrant structure rather than an optional best-practice appendix.
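As a purely illustrative aid, the five SAS elements named above could be held in a small record that refuses to count as complete while any element is blank. The sketch below is hypothetical: the class name `AccountabilitySheet`, its field names, the `missing_fields` helper, and the example values are all assumptions made here for illustration and are not defined by the SAS proposal itself.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class AccountabilitySheet:
    """Hypothetical record mirroring the five SAS elements (names assumed)."""
    claim: str                    # the statement being reported
    claim_type: str               # e.g. "predictive", "mechanistic" (assumed labels)
    scope_boundaries: List[str]   # conditions under which the claim is asserted
    evidence_anchors: List[str]   # what the claim is tied to (splits, datasets)
    uncertainty_semantics: str    # what the reported uncertainty actually means
    decision_admissibility: str   # which decisions the claim may support

    def missing_fields(self) -> List[str]:
        # An element counts as missing when it is an empty string or list.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Made-up example values, for illustration only.
sheet = AccountabilitySheet(
    claim="Model ranks candidate A above candidate B for formation energy",
    claim_type="predictive",
    scope_boundaries=["compositions within the training chemical families"],
    evidence_anchors=["held-out test split"],
    uncertainty_semantics="ensemble spread, not a calibrated interval",
    decision_admissibility="screening only, not synthesis go/no-go",
)

# The same claim with two elements left blank.
incomplete = replace(sheet, scope_boundaries=[], uncertainty_semantics="")
print(sheet.missing_fields())       # empty list: every element is filled in
print(incomplete.missing_fields())  # names of the blank elements
```

The only design choice in this sketch is that an unfilled element is reported as a gap rather than silently defaulted, which matches the idea of binding a claim to its warrant instead of treating the sheet as an optional appendix.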
Artificial intelligence (AI) in materials science is often treated as a pipeline in which bias primarily emerges during model training, evaluation, or deployment. This framing is structurally incomplete. Many distortions later labeled as “dataset bias” are already introduced before any dataset is formally assembled, labeled, cleaned, or modeled. This conceptual manuscript advances a theory-first account of pre-dataset bias: systematic misrepresentation that originates upstream of data tables through decisions about what counts as a material instance, a property definition, a valid operating regime, and an actionable target. We argue that early bias is not merely a statistical artifact but an epistemic and procedural commitment that shapes what becomes observable, measurable, and publishable. We introduce the bias before data (BBD) framework, which decomposes pre-dataset bias into five coupled mechanisms: problem framing bias, regime availability bias, measurement–proxy bias, curation–visibility bias, and legitimacy bias. BBD provides a structured vocabulary for identifying where bias enters, why it persists despite technical improvements, and how it constrains the legitimacy of scientific claims even when predictive performance appears strong.
The integration of artificial intelligence into materials science has accelerated property prediction, inverse design, and discovery pipelines. Yet the reliability of the resulting scientific claims remains vulnerable to distribution shifts: systematic differences between training and inference data distributions arising from variations in synthesis protocols, characterization instruments, environmental conditions, or sampling biases. This purely conceptual manuscript develops a theoretical framework for robust materials AI inference in the presence of such shifts. We posit that distribution shifts do not merely degrade predictive accuracy but fundamentally alter the epistemic status of scientific claims by introducing unaccounted covariances between material descriptors and latent generative processes. The framework reconceptualizes inference as a multi-layered epistemic process: (i) shift ontology delineation, (ii) value-laden alignment of data representations with domain invariants, and (iii) claim robustness via counterfactual stabilization. By synthesizing insights from materials informatics, machine learning theory on distribution shifts, and philosophical analyses of epistemic values in science, we argue that robust inference requires explicit modeling of shift-induced epistemic uncertainty rather than treating shift mitigation as a post hoc engineering concern. This theory provides a conceptual scaffold for evaluating the validity of AI-derived materials claims across heterogeneous datasets, supporting a move from performance-centric to epistemically grounded AI deployment in materials science.
Standard validation protocols in materials machine learning continue to rely on the assumption that training and test data are drawn from the same underlying distribution. This assumption is almost invariably violated in real-world materials datasets because of temporal drift in measurement techniques, compositional biases in database construction, and experimental confounders arising from different laboratories and instruments. This conceptual framework article proposes adversarial validation as a diagnostic tool specifically tailored for materials informatics: a method that trains a discriminator to tell two datasets apart, so that better-than-chance discrimination signals a distribution shift between them and reveals hidden generalization failures that conventional train-test splits and k-fold cross-validation cannot expose. The framework introduces the conceptual foundations of adversarial validation, distinguishes it from adversarial attacks, articulates why the technique is particularly powerful in the small-data, high-dimensional, and physically constrained domain of materials science, and offers a five-component structure for its systematic application: feature-space definition, classifier selection, shift-detection thresholding, localization of driving features, and actionable response rules. By embedding materials-specific domain knowledge into the interpretation of discriminator performance, the approach transforms validation from a passive checkpoint into an active diagnostic that can distinguish temporal shift from compositional bias and experimental confounding. The implications for materials AI practice are immediate: researchers can report adversarial validation results alongside standard metrics, trigger targeted dataset augmentation or model retraining when shifts are detected, and document potential sources of distribution mismatch in experimental workflows, ultimately raising the robustness and trustworthiness of property predictions that underpin materials discovery and design.
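The core mechanic of adversarial validation can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the framework's prescribed implementation: it uses a plain logistic-regression discriminator written with the Python standard library, synthetic Gaussian "descriptors", a single 50/50 hold-out split, and hypothetical helper names (`fit_logistic`, `adversarial_validation`, `sample`). A held-out AUC near 0.5 suggests the two sets are hard to tell apart; an AUC well above 0.5 suggests a shift, and the largest absolute weight hints at which feature drives it (meaningful here only because all features share the same scale).

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit a logistic-regression discriminator by batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def auc(scores, labels):
    """Probability that a set-B sample outscores a set-A sample (ties = 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def adversarial_validation(set_a, set_b, seed=0):
    """Label set A as 0 and set B as 1, train on half, score the other half."""
    rng = random.Random(seed)
    rows = [(x, 0) for x in set_a] + [(x, 1) for x in set_b]
    rng.shuffle(rows)
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]
    w, b = fit_logistic([x for x, _ in train], [y for _, y in train])
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x, _ in test]
    return auc(scores, [y for _, y in test]), w


def sample(rng, n, shift):
    # Synthetic 3-feature descriptors; only feature 0 is shifted (assumption).
    return [[rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
            for _ in range(n)]


rng = random.Random(42)
reference = sample(rng, 400, 0.0)
same = sample(rng, 400, 0.0)      # drawn from the same distribution
shifted = sample(rng, 400, 2.0)   # feature 0 mean moved by two std devs

auc_same, _ = adversarial_validation(reference, same)
auc_shifted, w_shifted = adversarial_validation(reference, shifted)
driving_feature = max(range(3), key=lambda j: abs(w_shifted[j]))
print(f"AUC same={auc_same:.2f}, shifted={auc_shifted:.2f}, "
      f"driving feature index={driving_feature}")
```

In practice the choice of discriminator, the AUC threshold that counts as "shift detected", and the response to a detection correspond to the classifier-selection, thresholding, and response-rule components listed above; none of the specific values used in this sketch come from the source.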
In the rapidly expanding domain of artificial intelligence applied to materials science, a persistent conceptual ambiguity undermines the reliability of reported model capabilities. The terms “generalization” and “transfer” are routinely conflated, with authors claiming that a model “generalizes” when it is in fact being evaluated on samples drawn from a distinctly different distribution. This definitional paper draws a sharp conceptual distinction between the two notions. Generalization is defined as the expected performance of a trained model on new samples drawn independently and identically distributed (i.i.d.) from the same underlying distribution as the training data. In contrast, transfer is defined as performance on samples drawn from a different distribution, where the i.i.d. assumption is violated by construction. The distinction matters because a model that generalizes excellently within its training distribution can fail dramatically under transfer conditions, and conversely, a successful transfer mechanism may mask poor generalization. Treating the two interchangeably therefore produces overclaims about model robustness that cannot be sustained when materials discovery moves beyond the convex hull of available training data. The paper articulates a two-dimensional boundary framework, spanned by distribution-shift magnitude and feature-space overlap, that locates any given evaluation setting along a continuum from pure generalization to pure transfer, thereby enabling authors, reviewers, and practitioners to specify precisely which capability is being claimed and tested. By clarifying these boundaries and exposing the epistemic costs of current usage, the work supplies a conceptual foundation for more disciplined reporting standards and evaluation protocols in materials machine learning.
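The two definitions above can be written in standard risk notation. The symbols below are assumptions introduced here for illustration and do not appear in the source: $P$ is the training distribution, $Q$ a different evaluation distribution, $f_D$ the model trained on a sample $D$ drawn i.i.d. from $P$, and $\ell$ a loss function.

```latex
% Generalization: expected loss on fresh i.i.d. samples from the training distribution P
R_{\mathrm{gen}}(f_D) \;=\; \mathbb{E}_{(x,y)\sim P}\,\bigl[\ell\bigl(f_D(x),\,y\bigr)\bigr],
\qquad D \overset{\text{i.i.d.}}{\sim} P

% Transfer: expected loss of the same model under a different distribution Q
R_{\mathrm{tr}}(f_D) \;=\; \mathbb{E}_{(x,y)\sim Q}\,\bigl[\ell\bigl(f_D(x),\,y\bigr)\bigr],
\qquad Q \neq P
```

In this notation, a small $R_{\mathrm{gen}}$ places no bound on $R_{\mathrm{tr}}$ unless something further is assumed about how $Q$ relates to $P$, which is why a claim of "generalization" supported only by an evaluation under $Q$ is a claim about transfer.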
This review systematically examines the handling, or more often the neglect, of domain shift within the materials machine learning literature published between 2017 and 2023, drawing on a targeted search of peer-reviewed publications across specialized databases and journals to compile and analyze exactly 30 representative studies that span foundational overviews, application-focused works, and methodological explorations. Domain shift in materials science takes four distinct yet interrelated forms (temporal, compositional, experimental, and theoretical), each arising from the inherently heterogeneous nature of materials data sources that range from evolving laboratory protocols and diverse chemical families to inter-laboratory variations and discrepancies between computational approximations and experimental realities. Current practices reveal that explicit acknowledgment of domain shift remains rare, with the majority of papers proceeding under the default assumption of identical training and test distributions. Detection methods and adaptation strategies appear in fewer than one in five studies, leaving models vulnerable to silent degradation when deployed on real-world materials problems. The surveyed methods for handling domain shift include statistical detection techniques, domain-adversarial training frameworks, feature-alignment approaches, and shift-robust evaluation protocols, many of which have been proposed in adjacent machine-learning fields yet remain underutilized in materials contexts despite their direct relevance to property prediction and inverse design tasks. Collectively, these findings underscore the urgent need for standardized shift-reporting protocols, the development of materials-specific out-of-distribution benchmarks, and the integration of domain-adaptation pipelines into routine workflows, thereby improving the reliability, generalizability, and practical utility of machine-learning models in accelerating materials discovery.