Abstract
Text-to-image diffusion models can produce visually impressive images from natural-language prompts, but they often fail to satisfy the detailed semantic constraints expressed in compositional prompts. Typical failure modes include omitted objects, merged entities, incorrect quantities, incorrect attribute binding, and leakage of one entity's attributes onto another. This thesis studies the problem of semantic precision in text-to-image generation: how faithfully a generated image satisfies the structured meaning of its prompt. The thesis makes two linked contributions. First, it presents a training-free inference-time refinement method for diffusion-based image generation. The method operates directly in latent space during denoising and uses noun-phrase-aware cross-attention objectives to improve object presence, attribute binding, and spatial separation between semantically distinct prompt entities. The approach augments an Attend-and-Excite-style attention-activation objective with additional losses for attribute alignment, noun-phrase separation, and centroid separation, without retraining the underlying diffusion model. Second, the thesis introduces a structured evaluation framework for prompt-image semantic alignment. Rather than assigning a single global similarity score to the whole prompt-image pair, the framework decomposes the prompt into explicit semantic constraints, grounds relevant image regions, and verifies each constraint through targeted yes/no visual question answering. The resulting scores are aggregated into interpretable category-level and overall measures covering entity presence, attribute correctness, relation correctness, and quantity satisfaction.
Experiments on a controlled compositional prompt set show that inference-time latent refinement improves semantic alignment over plain Stable Diffusion and simpler attention-guidance baselines. The proposed evaluation pipeline also exposes failure modes that are frequently hidden by global prompt-image similarity metrics, yielding a more fine-grained and diagnostically useful picture of semantic correctness.
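As a rough illustration of the aggregation step in the evaluation framework described above, the following Python sketch turns per-constraint yes/no answers into category-level and overall scores. The abstract does not specify the aggregation rule, so everything here is an assumption made for illustration: the `ConstraintResult` type, the `aggregate` function, the category names, the example questions, and the choice of an unweighted mean within each category followed by a mean over categories. The thesis itself may weight or combine scores differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintResult:
    """One verified constraint (hypothetical structure, not from the thesis)."""
    category: str    # e.g. "entity", "attribute", "relation", "quantity"
    question: str    # the yes/no question posed to the VQA model
    satisfied: bool  # True if the VQA answer confirms the constraint


def aggregate(results):
    """Return (category_scores, overall) for a list of ConstraintResult.

    Assumption: each category score is the fraction of satisfied
    constraints in that category, and the overall score is the
    unweighted mean of the category scores.
    """
    by_category = defaultdict(list)
    for r in results:
        by_category[r.category].append(1.0 if r.satisfied else 0.0)

    category_scores = {c: sum(v) / len(v) for c, v in by_category.items()}
    if not category_scores:
        return {}, 0.0
    overall = sum(category_scores.values()) / len(category_scores)
    return category_scores, overall


# Hypothetical answers for a prompt such as "a brown dog next to a black cat".
results = [
    ConstraintResult("entity", "Is there a dog?", True),
    ConstraintResult("entity", "Is there a cat?", True),
    ConstraintResult("attribute", "Is the dog brown?", False),
    ConstraintResult("attribute", "Is the cat black?", True),
    ConstraintResult("quantity", "Are there exactly two animals?", True),
]

category_scores, overall = aggregate(results)
for name, score in category_scores.items():
    print(f"{name}: {score:.2f}")
print(f"overall: {overall:.2f}")
```

In this made-up example a global similarity score could still look high, whereas the per-category breakdown isolates the single attribute-binding failure, which is the kind of diagnostic detail the abstract attributes to the structured evaluation.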
Committee Chair
Nathan Jacobs
Committee Members
Tao Ju, Ilan Goodman
Degree
Master of Science (MS)
Author's Department
Computer Science & Engineering
Document Type
Thesis
Date of Award
Spring 5-6-2026
Language
English (en)
Author's ORCID
https://orcid.org/0009-0005-5053-2499
Recommended Citation
Rouie Miab, Mohammad, "Improving Semantic Precision in Text-to-Image Diffusion Models via Latent-Space Optimization and Semantically-Parsed Evaluation" (2026). McKelvey School of Engineering Theses & Dissertations. 1351.
https://openscholarship.wustl.edu/eng_etds/1351
Included in
Artificial Intelligence and Robotics Commons, Computational Engineering Commons, Data Science Commons, Graphics and Human Computer Interfaces Commons, Other Computer Sciences Commons