Text-to-image diffusion generative models can generate high-quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations; however, generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space, and scene manipulation abilities. In this work, we propose a novel multi-stage generation paradigm designed for fine-grained control, flexibility, and interactivity. To ensure control over instance attributes, we devise a novel training paradigm to adapt a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles components into realistic scenes. Our experiments show that our RGBA diffusion model can generate diverse, high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach allows images to be built and manipulated from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.
A few modifications were made to the PixArt-α VAE to make it compatible with RGBA images. Firstly, the additional channel is handled by replacing and retraining the input and output layers of the model. Secondly, we observed that learning a joint RGBA latent space leads to entanglement of the RGB and alpha channels, degrading the generation capability of diffusion models trained in this latent space. We address this challenge by disentangling representations in the latent space: our VAE predicts two separate posterior distributions $q(z_{\text{RGB}} \mid x)$ and $q(z_{\alpha} \mid x)$, each associated with a separate KL loss.
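The snippet below is a minimal sketch of what such a disentangled encoder head could look like, assuming a standard diagonal-Gaussian VAE formulation. The module name, channel counts, and loss weighting are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DisentangledRGBAEncoderHead(nn.Module):
    """Illustrative head predicting two separate diagonal Gaussians: one for
    the RGB content latent and one for the alpha (transparency) latent.
    Each posterior gets its own KL term, keeping the two latents disentangled."""

    def __init__(self, feat_ch: int, z_rgb_ch: int = 4, z_alpha_ch: int = 1):
        super().__init__()
        # Separate projections to (mu, logvar) for each latent group.
        self.to_rgb_stats = nn.Conv2d(feat_ch, 2 * z_rgb_ch, kernel_size=1)
        self.to_alpha_stats = nn.Conv2d(feat_ch, 2 * z_alpha_ch, kernel_size=1)

    @staticmethod
    def _kl(mu, logvar):
        # KL(N(mu, sigma^2) || N(0, 1)), averaged over the batch.
        return 0.5 * torch.mean(
            torch.sum(mu**2 + logvar.exp() - 1.0 - logvar, dim=[1, 2, 3])
        )

    def forward(self, feats):
        mu_rgb, logvar_rgb = self.to_rgb_stats(feats).chunk(2, dim=1)
        mu_a, logvar_a = self.to_alpha_stats(feats).chunk(2, dim=1)
        # Reparameterised samples for the two latent groups.
        z_rgb = mu_rgb + torch.randn_like(mu_rgb) * (0.5 * logvar_rgb).exp()
        z_alpha = mu_a + torch.randn_like(mu_a) * (0.5 * logvar_a).exp()
        kl_rgb = self._kl(mu_rgb, logvar_rgb)
        kl_alpha = self._kl(mu_a, logvar_a)
        return torch.cat([z_rgb, z_alpha], dim=1), kl_rgb, kl_alpha
```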
In our case, the latent space also encodes information about the transparency layer. When sampling from our model, we seek to exploit the mutual dependency between the RGB and alpha channels. In particular, given the RGB and alpha noised latents at timestep $t$, the information contained in the RGB noised latent at timestep $t-1$ can be used to inform the update of the alpha noised latent from timestep $t$ to $t-1$.
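A minimal sketch of this idea is given below, assuming a diffusers-style `model`/`scheduler` interface: the RGB latent is denoised first, and its freshly updated value is then passed as extra conditioning when denoising the alpha latent. The `branch` and `rgb_hint` arguments are hypothetical, used only to make the dependency explicit.

```python
import torch

@torch.no_grad()
def sample_rgba(model, scheduler, z_rgb, z_alpha, cond):
    """Illustrative sampling loop exploiting the RGB -> alpha dependency.
    Interfaces are assumed (diffusers-style), not the paper's exact API."""
    for t in scheduler.timesteps:
        # 1) Update the RGB latent from timestep t to t-1.
        eps_rgb = model(z_rgb, t, cond, branch="rgb")
        z_rgb = scheduler.step(eps_rgb, t, z_rgb).prev_sample

        # 2) Update the alpha latent, informed by the already-updated RGB latent.
        eps_alpha = model(z_alpha, t, cond, branch="alpha", rgb_hint=z_rgb)
        z_alpha = scheduler.step(eps_alpha, t, z_alpha).prev_sample
    return z_rgb, z_alpha
```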
We can observe that our approach generates realistic instances that follow the given instructions. Text2Layer shows lower image quality and excessive transparency, while LayerDiffusion struggles to follow prompt details, such as image style. Combining SD with matting yields reasonable segmentation of the generated instances, but when applied to PixArt-α it can sometimes struggle to correctly identify and segment the main object of the image, especially when dealing with artwork content. On top of attributes bleeding into the background, this highlights how unreliable matting can be for instance generation purposes.
We consider the setting where an image layout is available and instances have been generated with our RGBA generator. The layout is represented by a collection of bounding boxes. Our scene composition approach is designed as a multi-layer noise blending process, where instances are sequentially integrated into intermediate layered representations. While this requires generating K+1 images (K instances plus the background), we observed that generating layered images affords more flexibility, gives better control over the relative positions of instances, and yields more natural compositions.
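The sketch below illustrates one plausible form of such sequential multi-layer noise blending, assuming pre-generated instance latents with alpha masks resized to latent resolution and bounding boxes ordered back-to-front. The function names, the use of a diffusers-style scheduler, and the exact blending schedule are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def compose_scene(model, scheduler, z_bg, instances, cond):
    """Illustrative multi-layer noise blending.
    `instances` is a list of (z_inst, alpha, bbox) tuples ordered back-to-front;
    `alpha` is a (B, 1, h, w) mask and `bbox` is (top, left, h, w) in latent
    coordinates. At every denoising step each instance's noised latent is
    alpha-blended into its box, so later instances occlude earlier ones."""
    z = z_bg
    for t in scheduler.timesteps:
        for z_inst, alpha, (top, left, h, w) in instances:
            # Noise the clean instance latent to the current timestep.
            noise = torch.randn_like(z_inst)
            z_inst_t = scheduler.add_noise(z_inst, noise, t)
            # Resize instance latent and alpha mask to the target box.
            z_box = F.interpolate(z_inst_t, size=(h, w), mode="bilinear")
            a_box = F.interpolate(alpha, size=(h, w), mode="bilinear")
            # Alpha-blend into the composite latent (later layers on top).
            region = z[:, :, top:top + h, left:left + w]
            z[:, :, top:top + h, left:left + w] = a_box * z_box + (1 - a_box) * region
        # One denoising step on the blended composite to harmonise the scene.
        eps = model(z, t, cond)
        z = scheduler.step(eps, t, z).prev_sample
    return z
```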
While GLIGEN is capable of accurately reproducing the desired layout, it often fails to assign the right attributes to objects and struggles with highly overlapping objects. In contrast, MultiDiffusion is more accurate in terms of attribute assignment, but struggles to handle overlapping objects. This can be attributed to its noise averaging process, which, unlike our multi-layer approach, does not incorporate a notion of instance ordering. InstanceDiffusion achieves performance closest to ours, but still struggles with complex patterns, attributes, and relative positioning. With our RGBA instance generation and multi-layer noise blending, we are able to accurately assign object attributes and follow the required layout, while successfully building smooth and realistic scenes.
The manipulations we consider here are: attribute modification, instance replacement, and layout adjustment. We note that the first two tasks require RGBA generation of new instances. We highlight that we do not introduce any new explicit scene preservation or image editing technique in this experiment, thereby evaluating our method's inherent potential for scene manipulation and controllability. We are able to control and modify image content easily while maintaining strong consistency across different versions of the scene, without explicitly enforcing content preservation. This highlights the strong potential of multi-layer approaches to facilitate the development of image editing methods. We achieve substantially stronger scene preservation than InstanceDiffusion, which can generate entirely different images and instances when modifications are too strong.
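The following sketch shows how these three manipulations could act on the list of layers before re-running composition. The helpers `generate_rgba` and `compose_scene_layers` are hypothetical wrappers around the RGBA generator and the composition routine sketched above; scene preservation comes simply from reusing the untouched layers.

```python
def edit_scene(layers, idx, new_prompt=None, new_bbox=None):
    """Illustrative layer-level edits on a list of (z_inst, alpha, bbox) layers.
    - Attribute modification / instance replacement: regenerate the RGBA layer
      from an edited prompt (hypothetical `generate_rgba`).
    - Layout adjustment: keep the layer, only move its bounding box.
    All other layers (and the background) are reused as-is, which preserves
    the rest of the scene without any explicit preservation mechanism."""
    z_inst, alpha, bbox = layers[idx]
    if new_prompt is not None:
        z_inst, alpha = generate_rgba(new_prompt)  # hypothetical RGBA generator
    if new_bbox is not None:
        bbox = new_bbox
    layers[idx] = (z_inst, alpha, bbox)
    return compose_scene_layers(layers)  # hypothetical: re-run multi-layer composition
```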
@inproceedings{
Fontanella_NeurIPS_2024_RGBA_Scene,
title = {Generating compositional scenes via Text-to-image {RGBA} Instance Generation},
author = {Alessandro Fontanella and Petru-Daniel Tudosiu and Yongxin Yang and Shifeng Zhang and Sarah Parisot},
booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year = {2024},
url = {https://openreview.net/forum?id=MwFeh4RqvA}
}