Multi-LoRA Composition for Image Generation

University of Illinois Urbana-Champaign, Microsoft Corporation
This project explores new methods for text-to-image generation, with a focus on the integration of multiple Low-Rank Adaptations (LoRAs) to create highly customized and detailed images. We present LoRA Switch and LoRA Composite, approaches that aim to surpass traditional techniques in terms of accuracy and image quality, especially in complex compositions.

Project Features:
  • 🚀 Training-Free Methods
    • LoRA Switch and LoRA Composite enable dynamic and precise integration of multiple LoRAs without fine-tuning.
    • Unlike methods that merge LoRA weights, ours focus on the decoding process, keeping all LoRA weights intact.
  • 📊 ComposLoRA Testbed
    • A new comprehensive platform, featuring 480 composition sets and 22 pre-trained LoRAs across six categories.
    • ComposLoRA is designed for the quantitative evaluation of LoRA-based composable image generation tasks.
  • 📝 GPT-4V-based Evaluator
    • We propose using GPT-4V as an evaluator to assess the efficacy of compositions and the quality of images.
    • This evaluator correlates better with human judgments than existing metrics.
  • 🏆 Superior Performance
    • Both automated and human evaluations show that our approaches substantially outperform the prevalent LoRA Merge.
    • Our methods exhibit a more significant advantage when generating complex compositions.
  • 🕵️‍♂️ Detailed Analysis
    • We delve deeply into the scenarios where each method excels.
    • We explore the potential bias associated with using GPT-4V for evaluation.

Multi-LoRA composition techniques effectively blend different elements into a cohesive image. Unlike the conventional LoRA Merge approach, which can lead to detail loss and image distortion as more LoRAs are added, our methods retain the accuracy of each element and the overall image quality.

Methods of Multi-LoRA Composition

  • LoRA Merge:
    • The prevalent approach to integrating multiple elements into a single image.
    • It linearly combines multiple LoRAs into one unified LoRA, which is then plugged into the text-to-image model.
    • LoRA Merge completely overlooks the interaction with the diffusion model during the generative process, resulting in the deformation of the hamburger and fingers in the Figure.
  • LoRA Switch (LoRA-S):
    • To explore activating a single LoRA in each denoising step, we propose LoRA Switch.
    • This method introduces a dynamic adaptation mechanism within diffusion models by sequentially activating individual LoRAs at designated intervals throughout the decoding process.
    • As illustrated in the Figure, each LoRA is represented by a unique color corresponding to a specific element, with only one LoRA engaged per denoising step.
  • LoRA Composite (LoRA-C):
    • To explore incorporating all LoRAs at each timestep without merging weight matrices, we propose LoRA Composite.
    • It involves calculating both unconditional and conditional score estimates for each LoRA individually at each step.
    • By aggregating these scores, the technique ensures balanced guidance throughout the image generation process, facilitating the cohesive integration of all elements represented by different LoRAs.
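
The difference between merging weights (LoRA Merge) and aggregating scores (LoRA Composite) can be sketched with toy NumPy arrays. This is a minimal illustration, not the authors' implementation: the function names, the uniform weights, and the classifier-free-guidance form of the aggregated score (with a guidance scale) are assumptions introduced here. The description above only states that Merge combines LoRAs linearly and that Composite aggregates per-LoRA unconditional and conditional score estimates.

```python
import numpy as np


def merge_lora_weights(base_weight, lora_pairs, weights=None):
    """LoRA Merge sketch: fold a weighted sum of low-rank updates B @ A
    into one weight matrix before generation starts."""
    k = len(lora_pairs)
    if weights is None:
        weights = [1.0 / k] * k  # assumption: uniform weights summing to 1
    delta = sum(w * (B @ A) for w, (B, A) in zip(weights, lora_pairs))
    return base_weight + delta


def composite_score(uncond_scores, cond_scores, guidance_scale=7.5, weights=None):
    """LoRA Composite sketch: each LoRA contributes its own guided score
    estimate at this denoising step, and the estimates are averaged."""
    k = len(cond_scores)
    if weights is None:
        weights = [1.0] * k  # assumption: every LoRA weighted equally
    guided = [
        w * (e_u + guidance_scale * (e_c - e_u))
        for w, e_u, e_c in zip(weights, uncond_scores, cond_scores)
    ]
    return sum(guided) / k


rng = np.random.default_rng(0)
d, r, k = 8, 2, 3  # hidden size, LoRA rank, number of LoRAs (toy values)

base = rng.normal(size=(d, d))
loras = [(rng.normal(size=(d, r)), rng.normal(size=(r, d))) for _ in range(k)]
merged = merge_lora_weights(base, loras)

# Stand-ins for the noise predictions a diffusion model would return
# with each LoRA loaded separately.
uncond = [rng.normal(size=(4, 4)) for _ in range(k)]
cond = [rng.normal(size=(4, 4)) for _ in range(k)]
step_score = composite_score(uncond, cond)
```

In this sketch, Merge alters the weights once, so the individual LoRAs are no longer distinguishable during decoding, whereas Composite leaves each LoRA's weights untouched and combines only their outputs at every step.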

GPT-4V-based Evaluator

  • While existing metrics can calculate the alignment between text and images, they fall short in assessing the intricacies of specific elements within an image and the quality of their composition.
  • We employ a comparative evaluation method, utilizing GPT-4V to rate generated images across two dimensions: composition quality and image quality.
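
Once an evaluator has returned scores for a pair of images, the comparison can be summarized per dimension. The sketch below assumes a hypothetical record format (one score per image per dimension) and uses placeholder numbers, not results from this project. It does not call GPT-4V.

```python
from collections import Counter

DIMENSIONS = ("composition_quality", "image_quality")


def compare_pair(scores_a, scores_b):
    """Return 'win', 'tie', or 'loss' for image A against image B
    in each evaluation dimension."""
    outcome = {}
    for dim in DIMENSIONS:
        if scores_a[dim] > scores_b[dim]:
            outcome[dim] = "win"
        elif scores_a[dim] < scores_b[dim]:
            outcome[dim] = "loss"
        else:
            outcome[dim] = "tie"
    return outcome


def win_rates(pairs):
    """Fraction of comparisons that image A wins, per dimension."""
    tallies = {dim: Counter() for dim in DIMENSIONS}
    for scores_a, scores_b in pairs:
        for dim, result in compare_pair(scores_a, scores_b).items():
            tallies[dim][result] += 1
    return {dim: tallies[dim]["win"] / len(pairs) for dim in DIMENSIONS}


# Placeholder scores for illustration only.
pairs = [
    ({"composition_quality": 8, "image_quality": 7},
     {"composition_quality": 6, "image_quality": 7}),
    ({"composition_quality": 5, "image_quality": 9},
     {"composition_quality": 7, "image_quality": 8}),
]
rates = win_rates(pairs)
```

Keeping the two dimensions separate lets a method win on composition quality while tying or losing on image quality, which a single combined score would hide.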

Experimental Results

  • Our proposed methods consistently outperform LoRA Merge across all configurations and in both dimensions, with the margin widening as the number of LoRAs grows.
  • LoRA Switch shows superior performance in composition quality, whereas LoRA Composite excels in image quality.
  • The task of compositional image generation remains highly challenging, especially as the number of elements to be composed increases.

  • Human evaluations align with GPT-4V's findings.
  • The GPT-4V-based evaluator we adopt shows substantially higher correlations with human judgments, affirming the validity of our evaluation framework.

Analysis

  • LoRA Switch is more adept at composing elements in realistic-style images.
  • LoRA Composite shows a stronger performance in anime-style imagery.

  • The performance of LoRA Switch improves progressively as the step size increases, peaking at a step size of 5.
  • The initial choice of LoRA in the activation sequence clearly influences overall performance, while alterations in the subsequent order have minimal impact.
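
The roles of step size and activation order in LoRA Switch can be illustrated with a small scheduling helper. This is a sketch under assumptions: the function name `active_lora`, the round-robin rotation, and the example LoRA names are hypothetical. The project description only states that a single LoRA is engaged per denoising step and that LoRAs are activated sequentially at designated intervals.

```python
def active_lora(step, order, step_size):
    """Return the LoRA that is active at a 0-indexed denoising step.

    Assumption: LoRAs are rotated round-robin, each staying active for
    `step_size` consecutive steps before the next one takes over.
    """
    if step_size < 1:
        raise ValueError("step_size must be at least 1")
    return order[(step // step_size) % len(order)]


# Hypothetical LoRA names; the first entry is the one activated first.
order = ["character", "clothing", "background"]

schedule = [active_lora(t, order, step_size=5) for t in range(20)]
print(schedule[:6])
# ['character', 'character', 'character', 'character', 'character', 'clothing']
```

Changing the first element of `order` changes which LoRA is active during the earliest denoising steps, which is the choice the analysis identifies as influential.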

  • GPT-4V exhibits significant positional bias in comparative evaluation.
  • This bias varies depending on the input position of the image and the dimension of the evaluation.
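
One common way to reduce the effect of positional bias in pairwise evaluation is to score each pair in both input orders and average the results. The text above does not state which mitigation, if any, is used, so the sketch below is an assumption. `biased_evaluator` is a toy stand-in that deliberately favors the first image; it is not a model of GPT-4V.

```python
def debiased_scores(evaluate, image_a, image_b):
    """Score a pair in both input orders and average per image.

    `evaluate(first, second)` is a hypothetical callable returning
    (score_for_first, score_for_second).
    """
    a_first, b_second = evaluate(image_a, image_b)
    b_first, a_second = evaluate(image_b, image_a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2


def biased_evaluator(first, second):
    """Toy evaluator: true quality plus a bonus for the first position."""
    return first["quality"] + 1.0, second["quality"]


img_a = {"name": "A", "quality": 6.0}
img_b = {"name": "B", "quality": 6.0}

single_order = biased_evaluator(img_a, img_b)  # first image looks better
both_orders = debiased_scores(biased_evaluator, img_a, img_b)
```

With a single ordering, the toy evaluator ranks two equally good images differently; averaging over both orders cancels the position bonus in this simplified setting.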

BibTeX

@article{zhong2024multi,
      title={Multi-LoRA Composition for Image Generation},
      author={Zhong, Ming and Shen, Yelong and Wang, Shuohang and Lu, Yadong and Jiao, Yizhu and Ouyang, Siru and Yu, Donghan and Han, Jiawei and Chen, Weizhu},
      journal={arXiv preprint arXiv:2402.16843},
      year={2024}
}