A Research Framework for Controllable Multimodal Visual Design Generation
Zheng Wang et al.
Abstract
Controllable image generation has advanced rapidly with diffusion models, yet existing approaches struggle to integrate multiple heterogeneous control signals and lack mechanisms for enforcing professional design principles. Addressing this challenge requires a unified framework capable of harmonizing semantic, structural, and stylistic modalities while maintaining geometric precision and aesthetic coherence. This study proposes UMC-Design, a unified multimodal controllable framework that introduces a shared control representation, a cross-domain fusion mechanism, and a dual-path diffusion architecture synchronized through a design prior network. The model jointly processes text, vector layouts, semantic maps, and reference images, enabling flexible and scalable multimodal conditioning. Experiments on COCO-Stuff, RICO + PubLayNet, and Crello demonstrate that UMC-Design achieves state-of-the-art performance, reducing FID to 22.3 and improving multimodal alignment to 0.81, surpassing leading baselines by large margins.
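The abstract names the key components (shared control representation, cross-domain fusion) without implementation detail. Below is a minimal sketch, assuming a PyTorch-style implementation, of how the four control modalities might be projected into a shared token space and fused; every module name, feature dimension (e.g. 768-d text/image features, 4-d layout boxes), and the learned-query attention scheme is an illustrative assumption, not the authors' design.

```python
# Hypothetical sketch of a shared control representation with cross-domain
# fusion: each modality is projected into a common token space, and learned
# queries attend over the concatenated control tokens. Illustrative only.
import torch
import torch.nn as nn

class SharedControlFusion(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_queries=77):
        super().__init__()
        # Per-modality encoders map heterogeneous inputs to d_model tokens.
        self.text_proj = nn.Linear(768, d_model)    # e.g. CLIP-style text features
        self.layout_proj = nn.Linear(4, d_model)    # vector layout boxes (x, y, w, h)
        self.semantic_proj = nn.Conv2d(1, d_model,  # patchify the semantic map
                                       kernel_size=16, stride=16)
        self.image_proj = nn.Linear(768, d_model)   # reference-image features
        # Cross-domain fusion: learned queries attend over all control tokens.
        self.queries = nn.Parameter(torch.randn(1, n_queries, d_model))
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text, layout, semantic_map, ref_image):
        b = text.size(0)
        # (B, d, H/16, W/16) -> (B, HW/256, d) patch tokens.
        sem = self.semantic_proj(semantic_map).flatten(2).transpose(1, 2)
        tokens = torch.cat([
            self.text_proj(text),        # (B, T_text, d)
            self.layout_proj(layout),    # (B, N_boxes, d)
            sem,                         # (B, N_patches, d)
            self.image_proj(ref_image),  # (B, T_img, d)
        ], dim=1)
        q = self.queries.expand(b, -1, -1)
        fused, _ = self.fusion(q, tokens, tokens)  # unified control tokens
        return fused  # conditioning for the diffusion backbone

# Usage with dummy inputs:
model = SharedControlFusion()
ctrl = model(torch.randn(2, 77, 768), torch.randn(2, 10, 4),
             torch.randn(2, 1, 256, 256), torch.randn(2, 50, 768))
print(ctrl.shape)  # torch.Size([2, 77, 512])
```

In a setup like the one the abstract describes, the fused tokens would presumably condition both branches of the dual-path diffusion architecture (e.g. via cross-attention in the denoising network), with the design prior network keeping the two paths synchronized; how that synchronization is realized is not specified in the abstract.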