Visually Grounded Meaning Representations
In this paper we address the problem of grounding distributional representations of lexical meaning. We introduce a new model which uses stacked autoencoders to learn higher-level representations from textual and visual input. The visual modality is encoded via vectors of attributes obtained automatically from images. We create a new large-scale taxonomy of 600 visual attributes representing more than 500 concepts and 700K images. We use this dataset to train attribute classifiers and integrate their predictions with text-based distributional models of word meaning. We evaluate our model on its ability to simulate word similarity judgments and concept categorization. On both tasks, our model yields a better fit to behavioral data than baselines and related models which either rely on a single modality or do not make use of attribute-based input.
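The stacked architecture described in this abstract can be pictured, very roughly, as a pair of unimodal autoencoders whose codes feed a second, joint autoencoder. The sketch below is an illustrative assumption, not the paper's implementation: layer sizes, activation choices, and the 300-dimensional text vector are made up (only the 600 visual attributes echo the taxonomy size mentioned above).

```python
# Minimal sketch of a bimodal stacked autoencoder (illustrative only).
import torch
import torch.nn as nn

class ModalityAutoencoder(nn.Module):
    """First-level autoencoder for a single modality (text or visual attributes)."""
    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.Tanh())
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        code = self.encoder(x)
        return code, self.decoder(code)

class BimodalStackedAutoencoder(nn.Module):
    """Second-level autoencoder over the concatenated unimodal codes."""
    def __init__(self, text_dim=300, visual_dim=600, uni_code=100, joint_code=50):
        super().__init__()
        self.text_ae = ModalityAutoencoder(text_dim, uni_code)
        self.visual_ae = ModalityAutoencoder(visual_dim, uni_code)
        self.joint_encoder = nn.Sequential(nn.Linear(2 * uni_code, joint_code), nn.Tanh())
        self.joint_decoder = nn.Linear(joint_code, 2 * uni_code)

    def forward(self, text_vec, visual_vec):
        t_code, t_rec = self.text_ae(text_vec)
        v_code, v_rec = self.visual_ae(visual_vec)
        joint_in = torch.cat([t_code, v_code], dim=-1)
        joint = self.joint_encoder(joint_in)        # shared multimodal representation
        joint_rec = self.joint_decoder(joint)
        return joint, (t_rec, v_rec, joint_in, joint_rec)

# Toy usage: one word with a 300-d text vector and 600-d visual attribute vector.
model = BimodalStackedAutoencoder()
text_vec = torch.randn(1, 300)     # e.g. a corpus-based distributional vector
visual_vec = torch.rand(1, 600)    # e.g. attribute classifier predictions
joint, _ = model(text_vec, visual_vec)
print(joint.shape)                 # torch.Size([1, 50])
```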
Grounded Models of Semantic Representation
A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than by arriving at semantic knowledge through concatenating the two.
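The contrast drawn in this abstract, simple concatenation versus uncovering shared latent dimensions, can be sketched with placeholder data. The snippet below uses truncated SVD as a stand-in for whatever latent model the study actually employs; matrix shapes and the random vectors are assumptions for illustration only.

```python
# Hedged sketch: concatenation vs. latent fusion of text and feature-norm vectors.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_words = 200
text_vecs = rng.random((n_words, 1000))     # corpus-derived co-occurrence vectors
percept_vecs = rng.random((n_words, 150))   # feature-norm (perceptual) vectors

# Strategy 1: concatenation -- the two modalities are simply kept side by side.
concat = np.hstack([text_vecs, percept_vecs])

# Strategy 2: latent fusion -- a low-rank projection over the joint space,
# intended to expose dimensions shared by both modalities.
latent = TruncatedSVD(n_components=50, random_state=0).fit_transform(concat)

# Either representation can then be scored against human similarity judgments,
# e.g. via pairwise cosine similarity between word vectors.
sims_concat = cosine_similarity(concat)
sims_latent = cosine_similarity(latent)
print(sims_concat.shape, sims_latent.shape)  # (200, 200) (200, 200)
```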
On the Complementarity of Images and Text for the Expression of Emotions in Social Media
Authors of posts in social media communicate their emotions, and what causes them, with text and images. While there is work on emotion and stimulus detection for each modality separately, it is not yet known whether the modalities contain complementary emotion information in social media. We aim to fill this research gap and contribute a novel, annotated corpus of English multimodal Reddit posts. On this resource, we develop models to automatically detect the relation between image and text, the emotion stimulus category, and the emotion class. We evaluate whether these tasks require both modalities and find, for the image-text relations, that text alone is sufficient for most categories (complementary, illustrative, opposing): the information in the text allows us to predict whether an image is required for emotion understanding. The emotions of anger and sadness are best predicted with a multimodal model, while text alone is sufficient for disgust, joy, and surprise. Stimuli depicted by objects, animals, food, or a person are best predicted by image-only models, while multimodal models are most effective on art, events, memes, places, or screenshots.
Comment: accepted for WASSA 2022 at ACL 2022.
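One way to picture the text-only versus multimodal comparison reported in this abstract is to train the same classifier head with and without image features. The sketch below is speculative, not the WASSA 2022 system: the embedding dimensions, label set, and data are placeholders.

```python
# Speculative sketch: comparing a text-only and a multimodal emotion classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_posts = 500
text_emb = rng.normal(size=(n_posts, 768))    # e.g. sentence-encoder output
image_emb = rng.normal(size=(n_posts, 512))   # e.g. image-encoder features
emotions = rng.integers(0, 5, size=n_posts)   # anger, sadness, disgust, joy, surprise

text_only = LogisticRegression(max_iter=1000)
multimodal = LogisticRegression(max_iter=1000)

acc_text = cross_val_score(text_only, text_emb, emotions, cv=5).mean()
acc_multi = cross_val_score(multimodal, np.hstack([text_emb, image_emb]), emotions, cv=5).mean()
print(f"text-only: {acc_text:.3f}  multimodal: {acc_multi:.3f}")
```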
