Generative models in materials science have emerged as powerful tools for proposing novel atomic structures, compositions, and functional properties. Yet their scientific evaluation remains conceptually underdeveloped and fragmented across statistical proxies that rarely capture a candidate's actual relevance as a material. This review systematically examines the conceptual foundations of scientific evaluation for generative materials AI, analyzing 30 peer-reviewed publications from 2017–2026 under a PRISMA-guided methodology focused on evaluation metrics, physical plausibility, chemical validity, synthesizability, novelty, and utility. The evaluation dimensions extend far beyond conventional statistical metrics such as validity percentages or reconstruction error to encompass six interlocking scientific criteria (chemical validity, structural plausibility, property accuracy, synthesizability, novelty, and utility) that together determine whether a generated material constitutes a genuine scientific artifact rather than a computational curiosity. Current evaluation practices, as documented across the literature, remain heavily anchored in validity scores, uniqueness counts, and nearest-neighbor novelty checks: approximately 68% of studies rely primarily on chemical-validity filters and only 22% incorporate any form of synthesizability assessment, revealing a persistent gap between computational convenience and experimental realism. Critical analysis shows that these practices are necessary but far from sufficient: they frequently conflate statistical fidelity with scientific value and overlook failure modes such as physically unstable geometries or duplicates of already-reported materials that go undetected. Emerging frameworks, including multi-objective physics-informed scoring, retrospective validation against subsequent experimental discoveries, and downstream task benchmarking, offer promising pathways toward more rigorous standards.
Significant gaps nevertheless persist: the field still lacks community-wide benchmarks, reliable predictors of synthesizability, and domain-specific utility metrics. This review therefore offers actionable recommendations for authors, reviewers, and the broader community to elevate generative materials AI from pattern generation to verifiable scientific discovery, so that evaluation protocols align with the epistemological demands of materials science itself.