We introduce Affective Visual Dialog, an emotion explanation and reasoning task that serves as a testbed for research on understanding the formation of emotions in visually grounded conversations. The task involves three skills: (1) Dialog-based Question Answering, (2) Dialog-based Emotion Prediction, and (3) Affective Emotion Explanation Generation based on the dialog. Our key contribution is the collection of a large-scale dataset, dubbed AffectVisDial, consisting of 50K 10-turn visually grounded dialogs together with concluding emotion attributions and dialog-informed textual emotion explanations, amounting to a total of 27,180 working hours. We explain our design decisions in collecting the dataset and introduce the questioner and answerer tasks associated with the participants in the conversation. We train and demonstrate solid Affective Visual Dialog baselines adapted from state-of-the-art models. Remarkably, the responses generated by our models show promising emotional reasoning abilities in response to visually grounded conversations.
This project is funded by KAUST BAS/1/1685-01-01 and the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence. The authors express their appreciation to Jack Urbanek, Sirojiddin Karimov, and Umid Nejmatullayev for their valuable assistance in setting up the data collection. Lastly, the authors extend their gratitude to the Amazon Mechanical Turk workers and the DeepenAI and SmartOne teams for their diligent efforts, as their contributions were indispensable to the successful completion of this work.