Abstract

Text-to-image diffusion models can produce visually impressive images from natural-language prompts, but they often fail to satisfy the detailed semantic constraints expressed in compositional prompts. Typical failure modes include omitted objects, merged entities, incorrect quantities, incorrect attribute binding, and leakage of one entity's attributes onto another. This thesis studies the problem of semantic precision in text-to-image generation: how faithfully a generated image satisfies the structured meaning of its prompt.

The thesis makes two linked contributions. First, it presents a training-free inference-time refinement method for diffusion-based image generation. The method operates directly in latent space during denoising and uses noun-phrase-aware cross-attention objectives to improve object presence, attribute binding, and spatial separation between semantically distinct prompt entities. The approach augments an Attend-and-Excite style attention-activation objective with additional losses for attribute alignment, noun-phrase separation, and centroid separation, without retraining the underlying diffusion model.

Second, the thesis introduces a structured evaluation framework for prompt-image semantic alignment. Rather than assigning a single global similarity score to the whole prompt-image pair, the framework decomposes the prompt into explicit semantic constraints, grounds relevant image regions, and verifies each constraint through targeted yes/no visual question answering. The resulting scores are aggregated into interpretable category-level and overall measures covering entity presence, attribute correctness, relation correctness, and quantity satisfaction.

Experiments on a controlled compositional prompt set show that inference-time latent refinement improves semantic alignment over plain Stable Diffusion and simpler attention-guidance baselines. The proposed evaluation pipeline also exposes failure modes that are frequently hidden by global prompt-image similarity metrics, yielding a more fine-grained and diagnostically useful picture of semantic correctness.
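As a rough illustration of the aggregation step described in the abstract, the sketch below turns per-constraint yes/no answers into category-level and overall scores. The abstract does not state the aggregation rule, so everything here is an assumption: the function name `aggregate_scores`, the category labels, the use of a pass fraction per category, and the unweighted mean across categories are all hypothetical choices made for illustration, not the thesis's actual implementation.

```python
from collections import defaultdict

# Hypothetical category labels, chosen to mirror the four measures named in the abstract.
CATEGORIES = ("entity", "attribute", "relation", "quantity")


def aggregate_scores(checks):
    """Aggregate yes/no constraint checks into category-level and overall scores.

    `checks` is a list of (category, passed) pairs, one per semantic constraint.
    Assumption: a category score is the fraction of its constraints that passed,
    and the overall score is the unweighted mean over the categories present.
    """
    by_category = defaultdict(list)
    for category, passed in checks:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        by_category[category].append(bool(passed))

    category_scores = {c: sum(v) / len(v) for c, v in by_category.items()}
    if not category_scores:
        return {}, 0.0
    overall = sum(category_scores.values()) / len(category_scores)
    return category_scores, overall


# Illustrative answers for a prompt such as "two red apples next to a blue cup".
example_checks = [
    ("entity", True),      # is there an apple?
    ("entity", True),      # is there a cup?
    ("attribute", True),   # are the apples red?
    ("attribute", False),  # is the cup blue?
    ("relation", True),    # are the apples next to the cup?
    ("quantity", False),   # are there exactly two apples?
]
category_scores, overall = aggregate_scores(example_checks)
print(category_scores, overall)
```

Averaging over categories rather than over all constraints keeps a prompt with many entity checks from masking a failed quantity or relation check; a weighted variant would be an equally plausible design.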

Committee Chair

Nathan Jacobs

Committee Members

Tao Ju, Ilan Goodman

Degree

Master of Science (MS)

Author's Department

Computer Science & Engineering

Author's School

McKelvey School of Engineering

Document Type

Thesis

Date of Award

Spring 5-6-2026

Language

English (en)

Author's ORCID

https://orcid.org/0009-0005-5053-2499
