In the current media landscape, misinformation has become a multimodal challenge: it is often presented through several modalities at once, particularly text and visuals. Despite growing scholarly attention to visual misinformation, one type of multimodal misinformation, the field lacks a unified theoretical framework for understanding how people cognitively process visual misinformation and become susceptible to it. In this paper, we introduce a psychological processing model, the Visual Misinformation Processing Model (VMPM), to bridge this gap. The model outlines four key cognitive stages: (1) encountering visual misinformation; (2) allocating attention to visuals; (3) engaging in dominant processing of visuals alongside text; and (4) becoming persuaded by misinformation. We discuss the current state of research on visual misinformation and suggest directions for future research.