ReGround: Improving Textual and Spatial Grounding at No Cost

ECCV 2024


tl;dr: ReGround resolves the issue of description omission in GLIGEN [1] while accurately reflecting the bounding boxes, without any extra cost.


When an image generation process is guided by both a text prompt and spatial cues, such as a set of bounding boxes, do these elements work in harmony, or does one dominate the other? Our analysis of a pretrained image diffusion model that integrates gated self-attention into the U-Net reveals that spatial grounding often outweighs textual grounding because of the sequential flow from gated self-attention to cross-attention. We demonstrate that this bias can be substantially mitigated, without sacrificing accuracy in either grounding, by simply rewiring the network architecture so that gated self-attention and cross-attention run in parallel rather than sequentially. This surprisingly simple yet effective solution requires no fine-tuning of the network. In our experiments, the rewired version markedly improves the trade-off between textual grounding and spatial grounding compared with the original GLIGEN.

Trade-off between Textual and Spatial Grounding


(a) Images generated by GLIGEN [1] while varying the activation duration of gated self-attention (γ) in scheduled sampling (Sec. 5.1). The red words in the text prompt denote the words used as labels of the input bounding boxes. Note that to reflect the underlined description in the text prompt in the final image, γ must be decreased to 0.1, which compromises spatial grounding accuracy. (b) In contrast, our ReGround reflects the underlined phrase even when γ=1.0, therefore achieving high accuracy in both textual and spatial grounding.
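The role of γ in the caption above can be illustrated with a minimal sketch. It assumes that γ is the fraction of denoising steps, counted from the start of sampling, during which gated self-attention stays active; the function name `gate_schedule` and the rounding rule are hypothetical and not taken from the GLIGEN or ReGround code.

```python
from typing import List


def gate_schedule(num_steps: int, gamma: float) -> List[bool]:
    """Per denoising step, report whether gated self-attention is active.

    Assumption (not from the source): the gate is on for the first
    round(gamma * num_steps) steps and off for the remaining ones.
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    active_steps = round(gamma * num_steps)
    return [step < active_steps for step in range(num_steps)]
```

Under this assumption, γ=1.0 keeps spatial grounding on for every step, while γ=0.1 switches it off after the first tenth of the steps, which is the setting the caption describes as hurting spatial accuracy in GLIGEN.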

Impact of Cross-Attention on Spatial Grounding


Comparison of the output of GLIGEN [1] with and without cross-attention. While the absence of cross-attention reduces realism and quality of the image, the silhouette of objects remains grounded within the given bounding boxes, as shown in the third column of each case.

Main Idea


Comparison between the U-Net architectures of (a) Latent Diffusion Model (LDM) [2], (b) GLIGEN [1] and (c) our ReGround. Starting from LDM, GLIGEN enables spatial grounding by injecting gated self-attention before cross-attention, so the two modules run sequentially.

Building on GLIGEN, our ReGround rewires the two attention modules to run in parallel, which noticeably improves textual grounding while preserving the spatial grounding capability. (The residual block before self-attention is omitted.)
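The difference between the sequential and parallel wiring can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: features are plain lists of floats, `gated_sa` and `cross_attn` are hypothetical callables that return the residual update of each module, and both residuals are assumed to be simply added to the input.

```python
from typing import Callable, List

Vec = List[float]
Module = Callable[[Vec], Vec]  # hypothetical: maps features to a residual update


def add(a: Vec, b: Vec) -> Vec:
    """Element-wise sum of two equally sized feature vectors."""
    return [x + y for x, y in zip(a, b)]


def sequential_block(v: Vec, gated_sa: Module, cross_attn: Module) -> Vec:
    """GLIGEN-style wiring: cross-attention sees spatially grounded features."""
    v = add(v, gated_sa(v))       # spatial grounding is applied first
    return add(v, cross_attn(v))  # text grounding operates on the updated v


def parallel_block(v: Vec, gated_sa: Module, cross_attn: Module) -> Vec:
    """ReGround-style wiring: both modules read the same input features."""
    spatial_update = gated_sa(v)
    text_update = cross_attn(v)   # input is not altered by gated self-attention
    return add(add(v, spatial_update), text_update)
```

Both functions call the same two modules with the same weights, which is why the rewiring needs no fine-tuning; the only change is which features cross-attention receives as input.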

Qualitative Comparisons

ReGround (last column) resolves the issue of description omission while accurately reflecting the bounding box information.




        title={ReGround: Improving Textual and Spatial Grounding at No Cost},
        author={Lee, Yuseung and Sung, Minhyuk},
        journal={arXiv preprint arXiv:2403.13589},


[1] GLIGEN: Open-Set Grounded Text-to-Image Generation, Li et al., CVPR 2023.
[2] High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., CVPR 2022.