ID-Pose: Sparse-view Camera Pose Estimation by Inverting Diffusion Models
Given sparse views of an object, estimating their camera poses is a
long-standing and intractable problem. We harness a pre-trained diffusion model
for novel-view synthesis conditioned on viewpoints (Zero-1-to-3). We present
ID-Pose, which inverts the denoising diffusion process to estimate the relative
pose between two input images: ID-Pose adds noise to one image and predicts the
noise conditioned on the other image and a decision variable for the pose. The
prediction error serves as the objective for finding the optimal pose via
gradient descent. ID-Pose can handle more than two images, estimating each pose
from multiple image pairs using triangular relationships. ID-Pose requires no
training and generalizes to real-world images. We conduct experiments on
high-quality real-scanned 3D objects, where ID-Pose significantly outperforms
state-of-the-art methods.
Comment: 7 pages. GitHub: https://xt4d.github.io/id-pose
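As a rough illustration of the mechanism described above (and not the authors' released code), the following PyTorch sketch treats the relative pose as a decision variable and minimizes the noise-prediction error of a viewpoint-conditioned diffusion model by gradient descent; eps_model and its signature are assumed placeholders for a Zero-1-to-3-style network.

import torch
import torch.nn.functional as F

def estimate_relative_pose(eps_model, img_a, img_b, num_steps=300, lr=1e-2, T=1000):
    # Hypothetical sketch of the ID-Pose idea. `eps_model(noisy, cond_image, pose, t)`
    # is an assumed interface for a viewpoint-conditioned noise predictor.
    alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 2e-2, T), dim=0)  # DDPM-style schedule
    pose = torch.zeros(3, requires_grad=True)   # e.g. (elevation, azimuth, radius) offsets
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(num_steps):
        t = torch.randint(0, T, (1,))
        noise = torch.randn_like(img_a)
        # Forward diffusion: add noise to one image ...
        noisy_a = alpha_bar[t].sqrt() * img_a + (1 - alpha_bar[t]).sqrt() * noise
        # ... and predict it conditioned on the other image and the current pose guess.
        pred = eps_model(noisy_a, cond_image=img_b, pose=pose, t=t)
        loss = F.mse_loss(pred, noise)           # prediction error drives the pose update
        opt.zero_grad(); loss.backward(); opt.step()
    return pose.detach()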
Modality Unifying Network for Visible-Infrared Person Re-Identification
Visible-infrared person re-identification (VI-ReID) is a challenging task due to large cross-modality discrepancies and intra-class variations. Existing methods mainly focus on learning modality-shared representations by embedding different modalities into the same feature space. As a result, the learned feature emphasizes the common patterns across modalities while suppressing modality-specific and identity-aware information that is valuable for Re-ID. To address these issues, we propose a novel Modality Unifying Network (MUN) to explore a robust auxiliary modality for VI-ReID. First, the auxiliary modality is generated by combining the proposed cross-modality learner and intra-modality learner, which can dynamically model the modality-specific and modality-shared representations to alleviate both cross-modality and intra-modality variations. Second, by aligning identity centres across the three modalities, an identity alignment loss function is proposed to discover the discriminative feature representations. Third, a modality alignment loss is introduced to consistently reduce the distribution distance of visible and infrared images by modality prototype modeling. Extensive experiments on multiple public datasets demonstrate that the proposed method surpasses the current state-of-the-art methods by a significant margin.
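For illustration only, the sketch below shows one plausible form of the identity alignment idea, pulling the per-identity feature centres of the visible, infrared, and auxiliary modalities together; it is a guess at the spirit of the loss, not the paper's exact formulation.

import torch

def identity_alignment_loss(feat_vis, feat_ir, feat_aux, labels):
    # Illustrative sketch (not the paper's exact loss): align per-identity
    # feature centres across the visible, infrared, and auxiliary modalities.
    loss = 0.0
    for pid in labels.unique():
        mask = labels == pid
        c_vis = feat_vis[mask].mean(dim=0)   # identity centre, visible modality
        c_ir = feat_ir[mask].mean(dim=0)     # identity centre, infrared modality
        c_aux = feat_aux[mask].mean(dim=0)   # identity centre, auxiliary modality
        # Pull the three centres of the same identity together.
        loss = loss + (c_vis - c_ir).pow(2).sum() \
                    + (c_vis - c_aux).pow(2).sum() \
                    + (c_ir - c_aux).pow(2).sum()
    return loss / labels.unique().numel()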
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models
We present InstantMesh, a feed-forward framework for instant 3D mesh
generation from a single image, featuring state-of-the-art generation quality
and significant training scalability. By synergizing the strengths of an
off-the-shelf multiview diffusion model and a sparse-view reconstruction model
based on the LRM architecture, InstantMesh is able to create diverse 3D assets
within 10 seconds. To enhance training efficiency and exploit more geometric
supervision, e.g., depths and normals, we integrate a differentiable
iso-surface extraction module into our framework and directly optimize on the
mesh representation. Experimental results on public datasets demonstrate that
InstantMesh significantly outperforms other recent image-to-3D baselines, both
qualitatively and quantitatively. We release all the code, weights, and demo of
InstantMesh, in the hope that it can make substantial contributions to the
community of 3D generative AI and empower both researchers and content creators.
Comment: Technical report. Project: https://github.com/TencentARC/InstantMesh
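As a hedged illustration of the geometric supervision mentioned above (depths and normals on the extracted mesh), the sketch below compares rendered depth and normal maps against ground truth; render_fn stands in for an assumed differentiable rasteriser and is not part of the released code.

import torch.nn.functional as F

def geometric_supervision_loss(render_fn, mesh, cameras, gt_depth, gt_normal):
    # Illustrative sketch: supervise a mesh produced by differentiable
    # iso-surface extraction with depth and normal maps. `render_fn(mesh, cam)`
    # is an assumed differentiable renderer returning (depth, normal) maps.
    loss = 0.0
    for i, cam in enumerate(cameras):
        depth, normal = render_fn(mesh, cam)
        loss = loss + F.l1_loss(depth, gt_depth[i])  # depth supervision
        # Normal supervision via cosine distance over the channel dimension.
        loss = loss + (1 - F.cosine_similarity(normal, gt_normal[i], dim=0)).mean()
    return loss / len(cameras)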
Fast Updating Truncated SVD for Representation Learning with Sparse Matrices
Updating a truncated Singular Value Decomposition (SVD) is crucial in
representation learning, especially when dealing with large-scale data matrices
that continuously evolve in practical scenarios. Aligning SVD-based models with
fast-paced updates becomes increasingly important. Existing methods for
updating truncated SVDs employ Rayleigh-Ritz projection procedures, where
projection matrices are augmented based on original singular vectors. However,
these methods suffer from inefficiency due to the densification of the update
matrix and the application of the projection to all singular vectors. To
address these limitations, we introduce a novel method for dynamically
approximating the truncated SVD of a sparse and temporally evolving matrix. Our
approach leverages sparsity in the orthogonalization process of augmented
matrices and utilizes an extended decomposition to independently store
projections in the column space of singular vectors. Numerical experiments
demonstrate an order-of-magnitude efficiency improvement over previous methods,
achieved while maintaining precision comparable to existing approaches.
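For context, here is a NumPy sketch of the standard Rayleigh-Ritz-style (Zha-Simon) update that such methods build on, for the case of appending new columns; the paper's sparsity-aware orthogonalization and extended decomposition are precisely what this baseline sketch does not include.

import numpy as np

def update_truncated_svd(U, s, V, E, k):
    # Baseline sketch: given A ~ U @ diag(s) @ V.T of rank k, append new
    # columns E and return a rank-k SVD of [A, E]. This is the conventional
    # projection-based update the abstract contrasts against.
    n, p = V.shape[0], E.shape[1]
    UE = U.T @ E                        # component of E inside span(U)
    Q, R = np.linalg.qr(E - U @ UE)     # orthonormal basis for the new directions
    # Small core matrix whose SVD yields the updated factors.
    K = np.block([[np.diag(s), UE],
                  [np.zeros((p, len(s))), R]])
    Uk, sk, Vkt = np.linalg.svd(K, full_matrices=False)
    U_new = np.hstack([U, Q]) @ Uk[:, :k]
    W = np.block([[V, np.zeros((n, p))],
                  [np.zeros((p, V.shape[1])), np.eye(p)]])
    V_new = W @ Vkt.T[:, :k]
    return U_new, sk[:k], V_new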
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models
Recent CLIP-guided 3D optimization methods, such as DreamFields and
PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D
synthesis. However, due to training from scratch and random initialization without
prior knowledge, these methods often fail to generate accurate and faithful 3D
structures that conform to the input text. In this paper, we make the first
attempt to introduce explicit 3D shape priors into the CLIP-guided 3D
optimization process. Specifically, we first generate a high-quality 3D shape
from the input text in the text-to-shape stage as a 3D shape prior. We then use
it as the initialization of a neural radiance field and optimize it with the
full prompt. To address the challenging text-to-shape generation task, we
present a simple yet effective approach that directly bridges the text and
image modalities with a powerful text-to-image diffusion model. To narrow the
style domain gap between the images synthesized by the text-to-image diffusion
model and shape renderings used to train the image-to-shape generator, we
further propose to jointly optimize a learnable text prompt and fine-tune the
text-to-image diffusion model for rendering-style image generation. Our method,
Dream3D, is capable of generating imaginative 3D content with superior visual
quality and shape accuracy compared to state-of-the-art methods.
Comment: Accepted by CVPR 2023. Project page: https://bluestyle97.github.io/dream3d
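As a rough illustration of the "learnable text prompt" idea mentioned above (not the paper's implementation), the sketch below optimizes a prompt embedding, optionally together with the diffusion network, using the standard denoising objective on shape renderings; diffusion and text_encoder are assumed interfaces.

import torch
import torch.nn.functional as F

def tune_rendering_style(diffusion, text_encoder, renders, steps=1000, lr=1e-4, T=1000):
    # Illustrative sketch: learn a prompt embedding (and fine-tune the diffusion
    # model) so that generated images match the rendering style of the
    # image-to-shape training data. Interfaces are assumptions, not the paper's API.
    prompt_emb = torch.randn(1, 77, 768, requires_grad=True)   # learnable prompt tokens
    opt = torch.optim.Adam([prompt_emb] + list(diffusion.parameters()), lr=lr)
    alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 2e-2, T), dim=0)
    for _ in range(steps):
        x0 = renders[torch.randint(0, len(renders), (1,)).item()]  # a shape rendering
        t = torch.randint(0, T, (1,))
        noise = torch.randn_like(x0)
        xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise
        pred = diffusion(xt, t, cond=text_encoder(prompt_emb))
        loss = F.mse_loss(pred, noise)                             # standard denoising objective
        opt.zero_grad(); loss.backward(); opt.step()
    return prompt_emb.detach()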
EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting
3D reconstruction of biological tissues from a collection of endoscopic
images is key to unlocking various important downstream surgical applications
with 3D capabilities. Existing methods employ various advanced neural rendering
techniques for photorealistic view synthesis, but they often struggle to
recover accurate 3D representations when only sparse observations are
available, which is usually the case in real-world clinical scenarios. To
tackle this sparsity challenge, we propose a framework, dubbed EndoSparse, that
leverages prior knowledge from multiple foundation models during the
reconstruction process. Experimental results indicate that our proposed
strategy significantly improves the geometric and appearance quality under
challenging sparse-view conditions, including cases with only three input
views. In rigorous benchmarking experiments against state-of-the-art methods,
EndoSparse achieves superior results in terms of accurate geometry, realistic
appearance, and rendering efficiency, confirming its robustness under
sparse-view limitations in endoscopic reconstruction. EndoSparse signifies a
steady step towards the practical deployment of neural 3D reconstruction in
real-world clinical scenarios. Project page: https://endo-sparse.github.io/
Comment: Accepted by MICCAI 2024
Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views
Reconstructing 3D objects from extremely sparse views is a long-standing and
challenging problem. While recent techniques employ image diffusion models for
generating plausible images at novel viewpoints or for distilling pre-trained
diffusion priors into 3D representations using score distillation sampling
(SDS), these methods often struggle to simultaneously achieve high-quality,
consistent, and detailed results for both novel-view synthesis (NVS) and
geometry. In this work, we present Sparse3D, a novel 3D reconstruction method
tailored for sparse view inputs. Our approach distills robust priors from a
multiview-consistent diffusion model to refine a neural radiance field.
Specifically, we employ a controller that harnesses epipolar features from
input views, guiding a pre-trained diffusion model, such as Stable Diffusion,
to produce novel-view images that maintain 3D consistency with the input. By
tapping into 2D priors from powerful image diffusion models, our integrated
model consistently delivers high-quality results, even when faced with
open-world objects. To address the blurriness introduced by conventional SDS,
we introduce category-score distillation sampling (C-SDS) to enhance detail.
We conduct experiments on CO3DV2, a multi-view dataset of real-world objects.
Both quantitative and qualitative evaluations demonstrate that our approach
outperforms previous state-of-the-art works on metrics for both NVS and
geometry reconstruction.
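For reference, the sketch below shows vanilla score distillation sampling (SDS), the baseline whose blurriness motivates C-SDS; the C-SDS variant is not spelled out in the abstract and is therefore not shown. eps_model and its signature are assumptions.

import torch

def sds_loss(eps_model, render, text_emb, alpha_bar, w=1.0):
    # Vanilla SDS step (sketch): with the diffusion model frozen, inject the
    # gradient w * (eps_pred - eps) into the rendered image; the caller
    # backpropagates it into the 3D representation (e.g. a NeRF).
    T = alpha_bar.numel()
    t = torch.randint(int(0.02 * T), int(0.98 * T), (1,))
    noise = torch.randn_like(render)
    noisy = alpha_bar[t].sqrt() * render + (1 - alpha_bar[t]).sqrt() * noise
    with torch.no_grad():
        eps_pred = eps_model(noisy, t, cond=text_emb)   # frozen diffusion prior
    grad = w * (eps_pred - noise)
    # Surrogate loss whose gradient w.r.t. `render` equals `grad`.
    return (grad * render).sum()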
Advances in 3D Generation: A Survey
Generating 3D models lies at the core of computer graphics and has been the
focus of decades of research. With the emergence of advanced neural
representations and generative models, the field of 3D content generation is
developing rapidly, enabling the creation of increasingly high-quality and
diverse 3D models. The rapid growth of this field makes it difficult to stay
abreast of all recent developments. In this survey, we aim to introduce the
fundamental methodologies of 3D generation methods and establish a structured
roadmap, encompassing 3D representation, generation methods, datasets, and
corresponding applications. Specifically, we introduce the 3D representations
that serve as the backbone for 3D generation. Furthermore, we provide a
comprehensive overview of the rapidly growing literature on generation methods,
categorized by the type of algorithmic paradigm, including feedforward
generation, optimization-based generation, procedural generation, and
generative novel view synthesis. Lastly, we discuss available datasets,
applications, and open challenges. We hope this survey will help readers
explore this exciting topic and foster further advancements in the field of 3D
content generation.
Comment: 33 pages, 12 figures
