Hearing Anything Anywhere
Recent years have seen immense progress in 3D computer vision and computer
graphics, with emerging tools that can virtualize real-world 3D environments
for numerous Mixed Reality (XR) applications. However, alongside immersive
visual experiences, immersive auditory experiences are equally vital to our
holistic perception of an environment. In this paper, we aim to reconstruct the
spatial acoustic characteristics of an arbitrary environment given only a
sparse set of (roughly 12) room impulse response (RIR) recordings and a planar
reconstruction of the scene, a setup that is easily achievable by ordinary
users. To this end, we introduce DiffRIR, a differentiable RIR rendering
framework with interpretable parametric models of salient acoustic features of
the scene, including sound source directivity and surface reflectivity. This
allows us to synthesize novel auditory experiences through the space with any
source audio. To evaluate our method, we collect a dataset of RIR recordings
and music in four diverse, real environments. We show that our model
outperforms state-of-the-art baselines on rendering monaural and binaural RIRs
and music at unseen locations, and learns physically interpretable parameters
characterizing acoustic properties of the sound source and surfaces in the
scene.
Comment: CVPR 2024. The first two authors contributed equally. Project page:
https://masonlwang.com/hearinganythinganywhere
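A minimal sketch of the core rendering operation such a differentiable RIR framework exposes: predict an RIR for a listener position and convolve it with dry source audio to synthesize the auditory experience at that location. This is illustrative only, not the authors' code; predict_rir is a hypothetical stand-in for DiffRIR's learned parametric model.

import numpy as np

def predict_rir(listener_xyz, rir_length=48000, sr=48000):
    """Hypothetical placeholder for the learned model: in DiffRIR this
    would come from source directivity and surface reflectivity
    parameters; here it is a toy exponentially decaying impulse."""
    t = np.arange(rir_length) / sr
    rir = np.random.randn(rir_length) * np.exp(-t * 6.0)
    rir[0] = 1.0  # direct-path impulse
    return rir

def render_at(listener_xyz, dry_audio):
    """Convolve dry source audio with the RIR predicted for this location."""
    rir = predict_rir(listener_xyz)
    return np.convolve(dry_audio, rir)[: len(dry_audio)]

wet = render_at(listener_xyz=(1.0, 2.0, 1.5), dry_audio=np.random.randn(48000))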
SoundCam: A Dataset for Finding Humans Using Room Acoustics
A room's acoustic properties are a product of the room's geometry, the
objects within the room, and their specific positions. A room's acoustic
properties can be characterized by its impulse response (RIR) between a source
and listener location, or roughly inferred from recordings of natural signals
present in the room. Variations in the positions of objects in a room can
effect measurable changes in the room's acoustic properties, as characterized
by the RIR. Existing datasets of RIRs either do not systematically vary
positions of objects in an environment, or they consist of only simulated RIRs.
We present SoundCam, the largest dataset of unique RIRs from in-the-wild rooms
publicly released to date. It includes 5,000 10-channel real-world measurements
of room impulse responses and 2,000 10-channel recordings of music in three
different rooms, including a controlled acoustic lab, an in-the-wild living
room, and a conference room, with different humans in positions throughout each
room. We show that these measurements can be used for interesting tasks, such
as detecting and identifying humans, and tracking their positions.
Comment: In NeurIPS 2023 Datasets and Benchmarks Track. Project page:
https://masonlwang.com/soundcam/. Wang and Clarke contributed equally to this
work.
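As an illustration of the kind of task the dataset enables, the sketch below treats human localization from multi-channel RIRs as regression on simple energy-decay features. The array shapes, random stand-in data, and choice of regressor are assumptions, not the released benchmark code.

import numpy as np
from sklearn.linear_model import Ridge

n_measurements, n_channels, rir_len = 200, 10, 4800
rirs = np.random.randn(n_measurements, n_channels, rir_len)   # stand-in for real RIRs
positions = np.random.rand(n_measurements, 2)                  # (x, y) of the human

# Schroeder-style backward-integrated energy decay per channel, subsampled
features = np.log(np.cumsum(rirs[..., ::-1] ** 2, axis=-1)[..., ::-1] + 1e-9)
features = features[..., ::50].reshape(n_measurements, -1)

model = Ridge().fit(features[:150], positions[:150])
print("mean abs error (m):", np.abs(model.predict(features[150:]) - positions[150:]).mean())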
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Large language models (LLMs) are shown to possess a wealth of actionable
knowledge that can be extracted for robot manipulation in the form of reasoning
and planning. Despite the progress, most still rely on pre-defined motion
primitives to carry out the physical interactions with the environment, which
remains a major bottleneck. In this work, we aim to synthesize robot
trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a
large variety of manipulation tasks given an open-set of instructions and an
open-set of objects. We achieve this by first observing that LLMs excel at
inferring affordances and constraints given a free-form language instruction.
More importantly, by leveraging their code-writing capabilities, they can
interact with a vision-language model (VLM) to compose 3D value maps to ground
the knowledge into the observation space of the agent. The composed value maps
are then used in a model-based planning framework to zero-shot synthesize
closed-loop robot trajectories with robustness to dynamic perturbations. We
further demonstrate how the proposed framework can benefit from online
experiences by efficiently learning a dynamics model for scenes that involve
contact-rich interactions. We present a large-scale study of the proposed
method in both simulated and real-robot environments, showcasing the ability to
perform a large variety of everyday manipulation tasks specified in free-form
natural language. Videos and code at https://voxposer.github.io
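A minimal sketch of the composed-value-map idea, with hypothetical helper functions rather than the VoxPoser implementation: an affordance map and an avoidance map over a voxel grid are summed, and a greedy model-based step picks the highest-value voxel near the current end-effector pose. In the actual system, an LLM writes the composition code from a language instruction and a VLM grounds the named objects.

import numpy as np

grid_shape = (50, 50, 50)                      # workspace voxelized at ~2 cm

def affordance_map(target_voxel):
    """Higher value closer to the target voxel (e.g. a drawer handle)."""
    idx = np.indices(grid_shape).transpose(1, 2, 3, 0)
    return -np.linalg.norm(idx - np.array(target_voxel), axis=-1)

def avoidance_map(obstacle_voxel, radius=5):
    """Large penalty within `radius` voxels of an object to avoid."""
    idx = np.indices(grid_shape).transpose(1, 2, 3, 0)
    dist = np.linalg.norm(idx - np.array(obstacle_voxel), axis=-1)
    return np.where(dist < radius, -100.0, 0.0)

value = affordance_map((40, 25, 10)) + avoidance_map((30, 25, 10))

def next_waypoint(current, value, step=3):
    """Greedy step: best voxel within `step` of the current one."""
    lo = np.maximum(np.array(current) - step, 0)
    hi = np.minimum(np.array(current) + step + 1, value.shape)
    window = value[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    return tuple(lo + np.array(np.unravel_index(window.argmax(), window.shape)))

print(next_waypoint((5, 25, 10), value))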
Primitive Skill-based Robot Learning from Human Evaluative Feedback
Reinforcement learning (RL) algorithms face significant challenges when
dealing with long-horizon robot manipulation tasks in real-world environments
due to sample inefficiency and safety issues. To overcome these challenges, we
propose a novel framework, SEED, which leverages two approaches: reinforcement
learning from human feedback (RLHF) and primitive skill-based reinforcement
learning. Both approaches are particularly effective in addressing sparse
reward issues and the complexities involved in long-horizon tasks. By combining
them, SEED reduces the human effort required in RLHF and increases safety in
training robot manipulation with RL in real-world settings. Additionally,
parameterized skills provide a clear view of the agent's high-level intentions,
allowing humans to evaluate skill choices before they are executed. This
feature makes the training process even safer and more efficient. To evaluate
the performance of SEED, we conducted extensive experiments on five
manipulation tasks with varying levels of complexity. Our results show that
SEED significantly outperforms state-of-the-art RL algorithms in sample
efficiency and safety. In addition, SEED also exhibits a substantial reduction
of human effort compared to other RLHF methods. Further details and video
results can be found at https://seediros23.github.io/
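A toy sketch of the human-in-the-loop idea described above (illustrative only, not the SEED implementation): the policy proposes a parameterized primitive skill, a human approves or rejects it before execution, and the verdict is stored as evaluative feedback. The skill names and auto-approval rule are placeholders.

import random

SKILLS = ["pick(obj)", "place(obj, loc)", "push(obj, dir)"]

def propose_skill(observation):
    """Stand-in for the RL policy's high-level skill choice."""
    return random.choice(SKILLS)

def human_evaluates(skill):
    """Stand-in for human evaluative feedback; here, approve only 'pick'."""
    return skill.startswith("pick")

feedback_buffer = []
for step in range(5):
    skill = propose_skill(observation=None)
    approved = human_evaluates(skill)
    feedback_buffer.append((skill, +1 if approved else -1))
    if approved:
        print("executing", skill)   # only approved skills reach the robot
print(feedback_buffer)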
NOIR: Neural Signal Operated Intelligent Robots for Everyday Activities
We present Neural Signal Operated Intelligent Robots (NOIR), a
general-purpose, intelligent brain-robot interface system that enables humans
to command robots to perform everyday activities through brain signals. Through
this interface, humans communicate their intended objects of interest and
actions to the robots using electroencephalography (EEG). Our novel system
demonstrates success in an expansive array of 20 challenging, everyday
household activities, including cooking, cleaning, personal care, and
entertainment. The effectiveness of the system is improved by its synergistic
integration of robot learning algorithms, allowing for NOIR to adapt to
individual users and predict their intentions. Our work enhances the way humans
interact with robots, replacing traditional channels of interaction with
direct, neural communication. Project website: https://noir-corl.github.io/
Differentiable Physics Simulation of Dynamics-Augmented Neural Objects
We present a differentiable pipeline for simulating the motion of objects
that represent their geometry as a continuous density field parameterized as a
deep network. This includes Neural Radiance Fields (NeRFs), and other related
models. From the density field, we estimate the dynamical properties of the
object, including its mass, center of mass, and inertia matrix. We then
introduce a differentiable contact model based on the density field for
computing normal and friction forces resulting from collisions. This allows a
robot to autonomously build object models that are visually and
\emph{dynamically} accurate from still images and videos of objects in motion.
The resulting Dynamics-Augmented Neural Objects (DANOs) are simulated with an
existing differentiable simulation engine, Dojo, interacting with other
standard simulation objects, such as spheres, planes, and robots specified as
URDFs. A robot can use this simulation to optimize grasps and manipulation
trajectories of neural objects, or to improve the neural object models through
gradient-based real-to-simulation transfer. We demonstrate the pipeline to
learn the coefficient of friction of a bar of soap from a real video of the
soap sliding on a table. We also learn the coefficient of friction and mass of
a Stanford bunny through interactions with a Panda robot arm from synthetic
data, and we optimize trajectories in simulation for the Panda arm to push the
bunny to a goal location.
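The estimation of dynamical properties from a density field can be illustrated with Monte Carlo integration over a bounding box, in the spirit of the paper; the density function and sampling scheme below are stand-ins (a solid unit sphere rather than a trained NeRF).

import numpy as np

def density(points):
    """Hypothetical density field: a solid unit sphere at the origin."""
    return (np.linalg.norm(points, axis=-1) < 1.0).astype(float)

n = 200_000
box_lo, box_hi = np.array([-1.0, -1.0, -1.0]), np.array([1.0, 1.0, 1.0])
pts = np.random.uniform(box_lo, box_hi, size=(n, 3))
rho = density(pts)
vol = np.prod(box_hi - box_lo)

mass = rho.mean() * vol                                  # integral of rho dV
com = (rho[:, None] * pts).mean(axis=0) * vol / mass     # center of mass
r = pts - com
# Inertia matrix: integral of rho * (|r|^2 I - r r^T) dV
inertia = (rho[:, None, None] *
           (np.einsum('ij,ij->i', r, r)[:, None, None] * np.eye(3)
            - np.einsum('ij,ik->ijk', r, r))).mean(axis=0) * vol
print(mass, com, np.diag(inertia))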
Learning Object-Centric Neural Scattering Functions for Free-viewpoint Relighting and Scene Composition
Photorealistic object appearance modeling from 2D images is a constant topic
in vision and graphics. While neural implicit methods (such as Neural Radiance
Fields) have shown high-fidelity view synthesis results, they cannot relight
the captured objects. More recent neural inverse rendering approaches have
enabled object relighting, but they represent surface properties as simple
BRDFs, and therefore cannot handle translucent objects. We propose
Object-Centric Neural Scattering Functions (OSFs) for learning to reconstruct
object appearance from only images. OSFs not only support free-viewpoint object
relighting, but also can model both opaque and translucent objects. While
accurately modeling subsurface light transport for translucent objects can be
highly complex and even intractable for neural methods, OSFs learn to
approximate the radiance transfer from a distant light to an outgoing direction
at any spatial location. This approximation avoids explicitly modeling complex
subsurface scattering, making learning a neural implicit model tractable.
Experiments on real and synthetic data show that OSFs accurately reconstruct
appearances for both opaque and translucent objects, allowing faithful
free-viewpoint relighting as well as scene composition. Project website:
https://kovenyu.com/osf/
Comment: Project website: https://kovenyu.com/osf/. Journal extension of
arXiv:2012.08503. The first two authors contributed equally to this work.
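The scattering-function interface can be sketched as a small network mapping a 3D location, an incoming distant-light direction, and an outgoing direction to an RGB radiance-transfer value, which is what lets relighting skip explicit subsurface simulation. This is an assumption-laden toy, not the released OSF model or its encoding scheme.

import torch
import torch.nn as nn

class TinyOSF(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # input: 3 (position) + 3 (light direction) + 3 (outgoing direction)
        self.net = nn.Sequential(
            nn.Linear(9, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB transfer in [0, 1]
        )

    def forward(self, x, light_dir, out_dir):
        return self.net(torch.cat([x, light_dir, out_dir], dim=-1))

osf = TinyOSF()
rgb = osf(torch.rand(4, 3), torch.rand(4, 3), torch.rand(4, 3))
print(rgb.shape)  # (4, 3)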
Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear
Developing embodied agents in simulation has been a key research topic in
recent years. Exciting new tasks, algorithms, and benchmarks have been
developed in various simulators. However, most of them assume deaf agents in
silent environments, while we humans perceive the world with multiple senses.
We introduce Sonicverse, a multisensory simulation platform with integrated
audio-visual simulation for training household agents that can both see and
hear. Sonicverse models realistic continuous audio rendering in 3D environments
in real-time. Together with a new audio-visual VR interface that allows humans
to interact with agents with audio, Sonicverse enables a series of embodied AI
tasks that need audio-visual perception. For semantic audio-visual navigation
in particular, we also propose a new multi-task learning model that achieves
state-of-the-art performance. In addition, we demonstrate Sonicverse's realism
via sim-to-real transfer, which has not been achieved by other simulators: an
agent trained in Sonicverse can successfully perform audio-visual navigation in
real-world environments. Sonicverse is available at:
https://github.com/StanfordVL/Sonicverse.
Comment: In ICRA 2023. Project page:
https://ai.stanford.edu/~rhgao/sonicverse/. Code:
https://github.com/StanfordVL/sonicverse. Gao and Li contributed equally to
this work and are in alphabetical order.
Modeling Dynamic Environments with Scene Graph Memory
Embodied AI agents that search for objects in large environments such as
households often need to make efficient decisions by predicting object
locations based on partial information. We pose this as a new type of link
prediction problem: link prediction on partially observable dynamic graphs. Our
graph is a representation of a scene in which rooms and objects are nodes, and
their relationships are encoded in the edges; only parts of the changing graph
are known to the agent at each timestep. This partial observability poses a
challenge to existing link prediction approaches, which we address. We propose
a novel state representation -- Scene Graph Memory (SGM) -- which captures the
agent's accumulated set of observations, as well as a neural net architecture
called a Node Edge Predictor (NEP) that extracts information from the SGM to
search efficiently. We evaluate our method in the Dynamic House Simulator, a
new benchmark that creates diverse dynamic graphs following the semantic
patterns typically seen at homes, and show that NEP can be trained to predict
the locations of objects in a variety of environments with diverse object
movement dynamics, outperforming baselines both in terms of new scene
adaptability and overall accuracy. The codebase and more can be found at
https://www.scenegraphmemory.com
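A minimal sketch of a scene graph memory and a link-prediction query, using hypothetical structures rather than the released SGM/NEP code: rooms and objects are nodes, observed containment relations are timestamped edges, and a toy scorer mixes observation frequency and recency as a stand-in for the learned Node Edge Predictor.

from collections import defaultdict

class SceneGraphMemory:
    def __init__(self):
        self.counts = defaultdict(int)      # (object, room) -> times observed
        self.last_seen = {}                 # (object, room) -> timestep

    def observe(self, obj, room, t):
        """Record a partial observation: `obj` was seen in `room` at time t."""
        self.counts[(obj, room)] += 1
        self.last_seen[(obj, room)] = t

    def score(self, obj, room, t):
        """Toy link score mixing frequency and recency."""
        freq = self.counts[(obj, room)]
        rec = 1.0 / (1 + t - self.last_seen.get((obj, room), -100))
        return freq + rec

sgm = SceneGraphMemory()
sgm.observe("mug", "kitchen", t=3)
sgm.observe("mug", "living_room", t=7)
rooms = ["kitchen", "living_room", "bathroom"]
print(max(rooms, key=lambda r: sgm.score("mug", r, t=10)))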
