A Research Framework for Controllable Multimodal Visual Design Generation

Zheng Wang et al.

Journal of Organizational and End User Computing · 2026 · https://doi.org/10.4018/joeuc.406954 · article
AJG 1 · ABDC B
Weight
0.50

Abstract

Controllable image generation has advanced rapidly with diffusion models, yet existing approaches struggle to integrate multiple heterogeneous control signals and lack mechanisms for enforcing professional design principles. Addressing this challenge requires a unified framework capable of harmonizing semantic, structural, and stylistic modalities while maintaining geometric precision and aesthetic coherence. This study proposes UMC-Design, a unified multimodal controllable framework that introduces a shared control representation, a cross-domain fusion mechanism, and a dual-path diffusion architecture synchronized through a design prior network. The model jointly processes text, vector layouts, semantic maps, and reference images, enabling flexible and scalable multimodal conditioning. Experiments on COCO-Stuff, RICO + PubLayNet, and Crello demonstrate that UMC-Design achieves state-of-the-art performance, reducing FID to 22.3 and improving multimodal alignment to 0.81, surpassing leading baselines by large margins.


Cite this paper

https://doi.org/10.4018/joeuc.406954

Or copy a formatted citation

@article{zheng2026,
  title        = {{A Research Framework for Controllable Multimodal Visual Design Generation}},
  author       = {Wang, Zheng and others},
  journal      = {Journal of Organizational and End User Computing},
  year         = {2026},
  doi          = {10.4018/joeuc.406954},
}

Paste directly into BibTeX, Zotero, or your reference manager.



Evidence weight

0.50

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact    0.50 × 0.40 = 0.20
M · momentum           0.50 × 0.15 = 0.07
V · venue signal       0.50 × 0.05 = 0.03
R · text relevance †   0.50 × 0.40 = 0.20

† Text relevance is estimated at 0.50 on the detail page — for your query’s actual relevance score, open this paper from a search result.
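The breakdown above is a linear blend: each component score is multiplied by its Balanced-mode mix weight and the products are summed. A minimal sketch of that arithmetic, assuming the displayed formula is a plain weighted sum (the component names and mix come from the panel above; Arbiter's actual scoring internals are not shown here):

```python
# Balanced-mode mix weights, as displayed: F 0.40 / M 0.15 / V 0.05 / R 0.40
WEIGHTS = {"F": 0.40, "M": 0.15, "V": 0.05, "R": 0.40}

def evidence_weight(scores: dict[str, float]) -> float:
    """Blend per-component scores (each in [0, 1]) into one evidence weight."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# For this paper every component sits at 0.50, giving the displayed 0.50 total:
total = evidence_weight({"F": 0.50, "M": 0.50, "V": 0.50, "R": 0.50})
print(round(total, 2))  # → 0.5
```

Because the mix weights sum to 1.0, uniform component scores pass through unchanged, which is why four 0.50 inputs yield a 0.50 total.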