Rethinking Referring Object Removal
Referring object removal refers to removing a specific object in an image
referred to by natural language expressions and filling the missing region with
reasonable semantics. To address this task, we construct the ComCOCO, a
synthetic dataset consisting of 136,495 referring expressions for 34,615
objects in 23,951 image pairs. Each pair contains an image with referring
expressions and the ground truth after elimination. We further propose an
end-to-end syntax-aware hybrid mapping network with an encoding-decoding
structure. Linguistic features are hierarchically extracted at the syntactic
level and fused into the visual features during downsampling via multi-head
attention. A feature-aligned pyramid network is leveraged to generate
segmentation masks and replace internal pixels with region affinity learned
from external semantics in high-level feature maps. Extensive experiments
demonstrate that our model outperforms diffusion models and two-stage methods,
which handle segmentation and inpainting separately, by a significant margin.
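
As a rough illustration of the fusion step described in the abstract, the sketch below attends token-level linguistic features into a visual feature map with multi-head attention (PyTorch). All module names, dimensions, and the fusion point are illustrative assumptions, not the authors' released architecture.

# Hedged sketch: cross-modal fusion of linguistic features into visual features
# with multi-head attention. Shapes and the fusion point are assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, vis_dim=256, lang_dim=256, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(lang_dim, vis_dim)          # align channel widths
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_feat, lang_feat):
        # vis_feat:  (B, C, H, W) feature map from the visual downsampling path
        # lang_feat: (B, L, lang_dim) token-level linguistic features
        b, c, h, w = vis_feat.shape
        q = vis_feat.flatten(2).transpose(1, 2)           # (B, H*W, C) queries
        kv = self.proj(lang_feat)                         # (B, L, C) keys/values
        fused, _ = self.attn(q, kv, kv)                   # vision attends to language
        fused = self.norm(q + fused)                      # residual + norm
        return fused.transpose(1, 2).reshape(b, c, h, w)  # back to a feature map

# Example: fuse an 8x8 visual map with a 12-token expression embedding.
vis = torch.randn(2, 256, 8, 8)
lang = torch.randn(2, 12, 256)
out = CrossModalFusion()(vis, lang)   # (2, 256, 8, 8)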
SLNSpeech: solving extended speech separation problem by the help of sign language
A speech separation task can be roughly divided into audio-only separation
and audio-visual separation. To make speech separation technology applicable
to real-world scenarios involving people with disabilities, this paper presents
an extended speech separation problem that refers in particular to
sign-language-assisted speech separation. However, most existing speech
separation datasets contain only the audio and/or visual modalities. To address the
extended speech separation problem, we introduce a large-scale dataset named
Sign Language News Speech (SLNSpeech) dataset in which three modalities of
audio, visual, and sign language coexist. Then, we design a general deep
learning network for the self-supervised learning of the three modalities; in
particular, sign language embeddings are used together with audio or
audio-visual information to better solve the speech separation task.
Specifically, we use a 3D residual convolutional network to extract sign
language features and a pretrained VGGNet model to extract visual features.
After that, an improved U-Net with skip connections in the feature extraction
stage is applied to learn joint embeddings of the mixed spectrogram transformed
from the source audios, the sign language features, and the visual features. Experimental results
show that, besides the visual modality, the sign language modality can also be
used alone to supervise the speech separation task. Moreover, we also show the
effectiveness of sign-language-assisted speech separation when the visual
modality is disturbed. Source code will be released at
http://cheertt.top/homepage/
Comment: 33 pages, 8 figures, 5 tables
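
The sketch below illustrates the kind of conditioning the abstract outlines: sign-language features from a 3D residual CNN injected into the bottleneck of a spectrogram U-Net. The backbone choice (torchvision's r3d_18), all tensor shapes, and the fusion-by-concatenation step are assumptions for illustration, not the released SLNSpeech code.

# Hedged sketch: conditioning a spectrogram U-Net bottleneck on sign-language
# features extracted by a 3D residual CNN. Shapes and fusion rule are assumptions.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class SignConditioner(nn.Module):
    def __init__(self, bottleneck_ch=512, sign_dim=400):
        super().__init__()
        backbone = r3d_18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, sign_dim)  # sign embedding head
        self.sign_net = backbone
        self.fuse = nn.Conv2d(bottleneck_ch + sign_dim, bottleneck_ch, kernel_size=1)

    def forward(self, bottleneck, sign_clip):
        # bottleneck: (B, C, F, T) deepest U-Net feature over the mixture spectrogram
        # sign_clip:  (B, 3, frames, H, W) video clip of the signer
        s = self.sign_net(sign_clip)                       # (B, sign_dim)
        s = s[:, :, None, None].expand(-1, -1, *bottleneck.shape[2:])
        return self.fuse(torch.cat([bottleneck, s], dim=1))

# Example: a (2, 512, 16, 16) bottleneck conditioned on a 16-frame 112x112 clip.
feat = torch.randn(2, 512, 16, 16)
clip = torch.randn(2, 3, 16, 112, 112)
cond = SignConditioner()(feat, clip)   # (2, 512, 16, 16)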
SpineCLUE: Automatic Vertebrae Identification Using Contrastive Learning and Uncertainty Estimation
Vertebrae identification in arbitrary fields-of-view plays a crucial role in
diagnosing spine disease. Most spine CT scans contain only local regions, such
as the neck, chest, or abdomen. Therefore, identification should not depend on
specific vertebrae or a particular number of vertebrae being visible. Existing
methods that operate at the spine level are unable to meet this challenge. In this paper, we
propose a three-stage method to address the challenges in 3D CT vertebrae
identification at the vertebra level. By sequentially performing the tasks of
vertebrae localization, segmentation, and identification, the anatomical prior
information of the vertebrae is effectively utilized throughout the process.
Specifically, we introduce a dual-factor density clustering algorithm to
acquire localization information for each individual vertebra, thereby facilitating
subsequent segmentation and identification processes. In addition, to tackle
the issue of inter-class similarity and intra-class variability, we pre-train
our identification network using a supervised contrastive learning method.
To further refine the identification results, we estimate the uncertainty of
the classification network and use a message fusion module to combine the
uncertainty scores while aggregating global information about the spine.
Our method achieves state-of-the-art results on the VerSe19 and VerSe20
challenge benchmarks. Additionally, our approach demonstrates outstanding
generalization performance on a collected dataset containing a wide range of
abnormal cases.
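
The abstract names supervised contrastive pre-training as the remedy for inter-class similarity and intra-class variability; the sketch below shows a generic supervised contrastive (SupCon-style) loss of that kind. The embedding dimension, batch composition, and temperature are illustrative assumptions, not the paper's exact training setup.

# Hedged sketch: a generic supervised contrastive loss for pre-training an
# identification network. Hyperparameters and shapes are assumptions.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    # features: (B, D) embeddings of vertebra patches; labels: (B,) vertebra-level IDs
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                        # pairwise similarities
    mask_self = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask_self, float('-inf'))      # exclude each anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~mask_self
    # average log-probability over the positives of each anchor that has any
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor[pos.any(1)].mean()

# Example: 8 embeddings drawn from 3 vertebra classes.
feats = torch.randn(8, 128)
labs = torch.tensor([1, 1, 2, 2, 2, 3, 3, 1])
loss = supervised_contrastive_loss(feats, labs)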
Multiscale Low-Frequency Memory Network for Improved Feature Extraction in Convolutional Neural Networks
Deep learning and Convolutional Neural Networks (CNNs) have driven major
transformations in diverse research areas. However, their limitations in
handling low-frequency information present obstacles in certain tasks, such as
interpreting global structures or handling images with smooth transitions. Despite
the promising performance of transformer structures in numerous tasks, their
intricate optimization complexities highlight the persistent need for refined
CNN enhancements using limited resources. Responding to these complexities, we
introduce a novel framework, the Multiscale Low-Frequency Memory (MLFM)
Network, with the goal of harnessing the full potential of CNNs while keeping
their complexity unchanged. The MLFM efficiently preserves low-frequency
information, enhancing performance in targeted computer vision tasks. Central
to our MLFM is the Low-Frequency Memory Unit (LFMU), which stores various
low-frequency data and forms a parallel channel to the core network. A key
advantage of MLFM is its seamless compatibility with various prevalent
networks, requiring no alterations to their original core structure. Testing on
ImageNet demonstrated substantial accuracy improvements in multiple 2D CNNs,
including ResNet, MobileNet, EfficientNet, and ConvNeXt. Furthermore, we
showcase MLFM's versatility beyond traditional image classification by
successfully integrating it into image-to-image translation tasks, specifically
in semantic segmentation networks like FCN and U-Net. In conclusion, our work
signifies a pivotal stride in the journey of optimizing the efficacy and
efficiency of CNNs with limited resources. This research builds upon the
existing CNN foundations and paves the way for future advancements in computer
vision. Our code is available at https://github.com/AlphaWuSeu/MLFM.
Comment: 9 pages, 10 figures, 6 tables. AAAI 2024 conference
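
As a rough illustration of the parallel low-frequency channel described above, the sketch below keeps a running memory of coarse (low-pass) feature content alongside a core CNN stage and re-injects it. The low-pass operator (pool and upsample), the running-average update rule, and the re-entry point are all assumptions for illustration, not the published LFMU design.

# Hedged sketch: a parallel low-frequency side channel in the spirit of the
# Low-Frequency Memory Unit. Operators and the update rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowFreqMemory(nn.Module):
    def __init__(self, channels, momentum=0.1):
        super().__init__()
        self.momentum = momentum
        self.register_buffer('memory', torch.zeros(1, channels, 1, 1))
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (B, C, H, W) feature map from one stage of the core CNN
        low = F.adaptive_avg_pool2d(x, output_size=8)      # crude low-pass: keep coarse structure
        low = F.interpolate(low, size=x.shape[2:], mode='bilinear', align_corners=False)
        if self.training:
            with torch.no_grad():                          # memory update is not back-propagated
                batch_mean = low.mean(dim=(0, 2, 3), keepdim=True)
                self.memory.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
        return x + self.mix(low + self.memory)             # re-inject into the core path

# Example: wrap one stage's output without altering the backbone itself.
feat = torch.randn(2, 64, 56, 56)
out = LowFreqMemory(64)(feat)    # (2, 64, 56, 56)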
Noise reduction of diffusion tensor images by sparse representation and dictionary learning
- …
