Guiding Instruction-based Image Editing via Multimodal Large Language Models
Instruction-based image editing improves the controllability and flexibility
of image manipulation via natural commands without elaborate descriptions or
regional masks. However, human instructions are sometimes too brief for current
methods to capture and follow. Multimodal large language models (MLLMs) show
promising capabilities in cross-modal understanding and visual-aware response
generation via LMs. We investigate how MLLMs facilitate edit instructions and
present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive
instructions and provides explicit guidance. The editing model jointly captures
this visual imagination and performs manipulation through end-to-end training.
We evaluate various aspects of Photoshop-style modification, global photo
optimization, and local editing. Extensive experimental results demonstrate
that expressive instructions are crucial to instruction-based image editing,
and our MGIE can lead to a notable improvement in automatic metrics and human
evaluation while maintaining competitive inference efficiency.
Comment: ICLR'24 (Spotlight); Project at https://mllm-ie.github.io ; Code at https://github.com/tsujuifu/pytorch_mgi
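As a rough illustration of the pipeline this abstract sketches (an MLLM derives an expressive instruction, and an editing model consumes that guidance, with both trained end to end), here is a minimal PyTorch sketch. Every class, dimension, and the toy conditioning scheme are placeholder assumptions of mine, not the MGIE implementation.

import torch
import torch.nn as nn

class ExpressiveInstructionModel(nn.Module):
    # Stand-in for the MLLM that rewrites a brief instruction into
    # expressive guidance features.
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, instruction_tokens):
        return self.encoder(self.embed(instruction_tokens))

class EditingModel(nn.Module):
    # Stand-in for the image editing model conditioned on the guidance.
    def __init__(self, dim=256):
        super().__init__()
        self.img_proj = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.out = nn.Conv2d(dim, 3, kernel_size=3, padding=1)

    def forward(self, image, guidance):
        feat = self.img_proj(image)
        # Toy conditioning: broadcast pooled guidance over spatial positions.
        cond = guidance.mean(dim=1)[:, :, None, None]
        return self.out(feat + cond)

mllm = ExpressiveInstructionModel()
editor = EditingModel()
tokens = torch.randint(0, 1000, (1, 8))   # brief instruction, tokenized
image = torch.rand(1, 3, 64, 64)          # input image
edited = editor(image, mllm(tokens))      # gradients flow through both modules
print(edited.shape)                       # torch.Size([1, 3, 64, 64])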
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View
Incremental decision making in real-world environments is one of the most
challenging tasks in embodied artificial intelligence. One particularly
demanding scenario is Vision and Language Navigation (VLN), which requires
visual and natural language understanding as well as spatial and temporal
reasoning capabilities. The embodied agent needs to ground its understanding of
navigation instructions in observations of a real-world environment like Street
View. Despite the impressive results of LLMs in other research areas, how best to connect them with an interactive visual environment remains an open problem. In this work, we propose VELMA, an embodied LLM agent that uses a
verbalization of the trajectory and of visual environment observations as
contextual prompt for the next action. Visual information is verbalized by a
pipeline that extracts landmarks from the human written navigation instructions
and uses CLIP to determine their visibility in the current panorama view. We
show that VELMA is able to successfully follow navigation instructions in
Street View with only two in-context examples. We further finetune the LLM
agent on a few thousand examples and achieve 25%-30% relative improvement in
task completion over the previous state-of-the-art for two datasets.
Comment: Accepted at AAAI 202
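The verbalization pipeline described above (extract landmarks from the instruction, test their visibility in the current panorama with CLIP, and feed the verbalized observation plus trajectory history to the LLM as a prompt) can be sketched roughly as follows. The function names and string formats are assumptions of mine, and landmark_visible is a placeholder for the paper's CLIP-based visibility check, not a reproduction of it.

from typing import List

def landmark_visible(landmark: str, panorama_id: str) -> bool:
    # Placeholder: in the paper this is a CLIP image-text similarity test
    # against the current Street View panorama.
    return hash((landmark, panorama_id)) % 2 == 0

def verbalize_observation(landmarks: List[str], panorama_id: str, heading: str) -> str:
    visible = [lm for lm in landmarks if landmark_visible(lm, panorama_id)]
    if visible:
        return f"You are facing {heading}. You see " + " and ".join(visible) + "."
    return f"You are facing {heading}. You see no listed landmarks."

def build_prompt(instruction: str, trajectory: List[str], observation: str) -> str:
    history = "\n".join(trajectory)
    return (f"Navigation instruction: {instruction}\n"
            f"{history}\n{observation}\nNext action:")

prompt = build_prompt(
    "Walk toward the red awning and stop at the bank.",
    ["Step 1: forward", "Step 2: turn left"],
    verbalize_observation(["red awning", "bank"], "pano_042", "north"),
)
print(prompt)  # the LLM would be asked to continue this with the next action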
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Masked visual modeling (MVM) has been recently proven effective for visual
pre-training. While similar reconstructive objectives on video inputs (e.g.,
masked frame modeling) have been explored in video-language (VidL)
pre-training, previous studies fail to find a truly effective MVM strategy that
can largely benefit the downstream performance. In this work, we systematically
examine the potential of MVM in the context of VidL learning. Specifically, we
base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where
the supervision from MVM training can be backpropagated to the video pixel
space. In total, eight different reconstructive targets of MVM are explored,
from low-level pixel values and oriented gradients to high-level depth maps,
optical flow, discrete visual tokens, and latent visual features. We conduct
comprehensive experiments and provide insights into the factors leading to
effective MVM training, resulting in an enhanced model VIOLETv2. Empirically,
we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.
Comment: CVPR'23; the first two authors contributed equally; code is available at https://github.com/tsujuifu/pytorch_empirical-mv
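For readers unfamiliar with masked visual modeling, the objective explored here can be illustrated with a toy PyTorch sketch: replace a subset of video patch features with a mask token and regress a reconstructive target (raw pixel values, one of the eight targets listed) only at the masked positions. Shapes, the masking pattern, and module names are illustrative assumptions, not the VIOLETv2 code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVidLEncoder(nn.Module):
    def __init__(self, patch_dim=768, dim=256):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, patch_dim)  # predicts the reconstructive target

    def forward(self, patches, mask):
        x = self.proj(patches)
        # Replace masked positions with a learned mask token.
        x = torch.where(mask[..., None], self.mask_token.expand_as(x), x)
        return self.head(self.blocks(x))

B, N, D = 2, 32, 768                 # batch, video patches, patch dimension
patches = torch.rand(B, N, D)        # flattened video patch features
mask = torch.zeros(B, N, dtype=torch.bool)
mask[:, ::4] = True                  # mask every 4th patch (toy pattern)
model = ToyVidLEncoder(D)
pred = model(patches, mask)
# MVM loss only on masked positions; the target here is the pixel values.
loss = F.mse_loss(pred[mask], patches[mask])
loss.backward()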
Collaborative Generative AI: Integrating GPT-k for Efficient Editing in Text-to-Image Generation
The field of text-to-image (T2I) generation has garnered significant
attention both within the research community and among everyday users. Despite
the advancements of T2I models, a common issue encountered by users is the need
for repetitive editing of input prompts in order to receive a satisfactory
image, which is time-consuming and labor-intensive. Given the demonstrated text
generation power of large-scale language models, such as GPT-k, we investigate
the potential of utilizing such models to improve the prompt editing process
for T2I generation. We conduct a series of experiments to compare the common
edits made by humans and GPT-k, evaluate the performance of GPT-k in prompting
T2I, and examine factors that may influence this process. We find that GPT-k models focus more on inserting modifiers, while humans tend to replace words and phrases, including changes to the subject matter. Experimental results show that GPT-k models are more effective at adjusting modifiers than at predicting spontaneous changes to the primary subject. Adopting the edits suggested by GPT-k models may reduce the percentage of remaining edits by 20-30%.
Comment: EMNLP 202
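The editing behavior this study attributes to GPT-k models (inserting descriptive modifiers rather than changing the primary subject) can be mimicked with a small helper; suggest_modifiers below is a hypothetical stand-in for querying a GPT-k model, not a real API call, and the modifier list is made up.

from typing import List

def suggest_modifiers(prompt: str) -> List[str]:
    # Hypothetical placeholder for an LLM call that proposes modifiers.
    return ["highly detailed", "soft lighting"]

def apply_modifier_edit(prompt: str, modifiers: List[str]) -> str:
    # Append modifiers instead of rewriting the subject, mirroring the
    # modifier-insertion pattern reported for GPT-k edits.
    extra = ", ".join(m for m in modifiers if m not in prompt)
    return f"{prompt}, {extra}" if extra else prompt

original = "a cabin in the forest"
edited = apply_modifier_edit(original, suggest_modifiers(original))
print(edited)  # a cabin in the forest, highly detailed, soft lighting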
