The explanatory interface of M^2Lens consists of five views. The User Panel (A) displays descriptive statistics about the model and dataset. The Summary View (B) presents a global summary of the importance of individual modalities and their interactions, using a three-layer augmented tree-like layout. The Template View (C) and Projection View (D) complement each other for subset-level explanations. Specifically, the Template View (C) summarizes frequent and influential templates of feature sets in a table, and the Projection View (D) supports multi-faceted exploration of instances that have features of interest. The Instance View (E) provides local explanations by visualizing the important features and the context of individual instances.


Multimodal sentiment analysis aims to recognize people's attitudes from multiple communication channels such as verbal content (i.e., text), voice, and facial expressions. It has become a vibrant and important research topic in natural language processing. Much research focuses on modeling the complex intra- and inter-modal interactions between different communication channels. However, current multimodal models with strong performance are often deep-learning-based and work like black boxes, so it is not clear how they utilize multimodal information for sentiment predictions. Despite recent advances in techniques for enhancing the explainability of machine learning models, these techniques often target unimodal scenarios (e.g., images, sentences), and little research has been done on explaining multimodal models. In this paper, we present an interactive visual analytics system, M2Lens, to visualize and explain multimodal models for sentiment analysis. M2Lens provides explanations of intra- and inter-modal interactions at the global, subset, and local levels. Specifically, it summarizes the influence of three typical interaction types (i.e., dominance, complement, and conflict) on the model predictions. Moreover, M2Lens identifies frequent and influential multimodal features and supports the multi-faceted exploration of model behaviors from the language, acoustic, and visual modalities. Through two case studies and expert interviews, we demonstrate that our system can help users gain deep insights into multimodal models for sentiment analysis.




Xingbo Wang, Jianben He, Zhihua Jin, Muqiao Yang, Yong Wang, and Huamin Qu. 2021. M^2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis. IEEE Transactions on Visualization and Computer Graphics (Proceedings of IEEE VIS 2021). To Appear.