CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise
In this paper, we study the problem of learning image classification models
with label noise. Existing approaches depending on human supervision are
generally not scalable as manually identifying correct or incorrect labels is
time-consuming, whereas approaches not relying on human supervision are
scalable but less effective. To reduce the amount of human supervision for
label noise cleaning, we introduce CleanNet, a joint neural embedding network
that requires only a fraction of the classes to be manually verified to provide
knowledge of label noise that can be transferred to other classes.
We further integrate CleanNet and a conventional convolutional neural network
classifier into one framework for image classification learning. We demonstrate
the effectiveness of the proposed algorithm on both the label noise detection
task and the image classification task on noisy data, using several large-scale
datasets. Experimental results show that CleanNet can reduce the label noise
detection error rate on held-out classes, where no human supervision is
available, by 41.5% compared to current weakly supervised methods. It also
achieves 47% of the performance gain of verifying all images with only 3.2% of
the images verified on an image classification task. Source code and dataset
will be available at kuanghuei.github.io/CleanNetProject.
Comment: Accepted to CVPR 201
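The transfer idea can be caricatured as comparing a query embedding against a class-level embedding built from verified samples, and flagging samples that fall too far away. This is only an illustrative sketch: the function names and the fixed-threshold decision rule are hypothetical, not CleanNet's actual architecture.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def flag_label_noise(query_emb, class_emb, threshold=0.5):
    """Flag a sample as likely mislabeled when its embedding is far
    from the class-level embedding (hypothetical decision rule)."""
    return cosine(query_emb, class_emb) < threshold

# Toy usage: one sample close to its class prototype, one far away.
class_emb = np.array([1.0, 0.0, 0.0])
clean = np.array([0.9, 0.1, 0.0])
noisy = np.array([0.0, 1.0, 0.0])
print(flag_label_noise(clean, class_emb))  # False: consistent with the class
print(flag_label_noise(noisy, class_emb))  # True: likely mislabeled
```

Because the class embedding is learned jointly across classes, the same distance rule can in principle be applied to classes that were never manually verified, which is the transfer claimed in the abstract.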
Several Issues on Hieroglyph of Naxi Ethnic Minority
The hieroglyph of the Naxi ethnic minority is a picture script, and it is so far the only “living hieroglyph”. Naxi Hieroglyph is the general name for Dongba Script, Geba Script, Malimasha Script, and Ruanke Script. Moreover, the creation of Naxi Hieroglyph is closely related to the migration routes of the Naxi ancestors, which correspond with the dialect areas of the Naxi language, and its creation can date back to the 11th century. Geba Script, based on Dongba Script, was created through contact with foreign cultures, and it carries the characteristics of Chinese and Tibetan writing.
Thorium-doping induced superconductivity up to 56 K in Gd1-xThxFeAsO
Following the discovery of superconductivity in an iron-based arsenide
LaO1-xFxFeAs with a superconducting transition temperature (Tc) of 26 K[1], Tc
was pushed up surprisingly to above 40 K by either applying pressure[2] or
replacing La with Sm[3], Ce[4], Nd[5] and Pr[6]. The maximum Tc has climbed to
55 K, observed in SmO1-xFxFeAs[7, 8] and SmFeAsO1-x[9]. The value of Tc was
found to increase with decreasing lattice parameters in LnFeAsO1-xFx (Ln stands
for the lanthanide elements) at an apparently optimal doping level. However,
the F- doping in GdFeAsO is particularly difficult[10,11] due to the lattice
mismatch between the Gd2O2 layers and Fe2As2 layers. Here we report observation
of superconductivity with Tc as high as 56 K by the Th4+ substitution for Gd3+
in GdFeAsO. The incorporation of relatively large Th4+ ions relaxes the lattice
mismatch and hence induces high-temperature superconductivity.
Comment: 4 pages, 3 figures
Doubly Robust Conditional Independence Testing with Generative Neural Networks
This article addresses the problem of testing the conditional independence of
two generic random vectors X and Y given a third random vector Z, which
plays an important role in statistical and machine learning applications. We
propose a new non-parametric testing procedure that avoids explicitly
estimating any conditional distributions but instead requires sampling from the
two marginal conditional distributions, of X given Z and of Y given Z. We
further propose using a generative neural network (GNN) framework to sample
from these approximated marginal conditional distributions, which tends to
mitigate the curse of dimensionality due to its adaptivity to any
low-dimensional structures and smoothness underlying the data. Theoretically,
our test statistic is shown to enjoy a doubly robust property against GNN
approximation errors, meaning that the test statistic retains all desirable
properties of the oracle test statistic utilizing the true marginal conditional
distributions, as long as the product of the two approximation errors decays to
zero faster than the parametric rate. Asymptotic properties of our statistic
and the consistency of a bootstrap procedure are derived under both null and
local alternatives. Extensive numerical experiments and real data analysis
illustrate the effectiveness and broad applicability of our proposed test.
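The sampling-based testing idea can be illustrated with a conditional-randomization-style sketch: a sampler for X given Z (standing in for the paper's GNN sampler) regenerates X while holding Y and Z fixed, and the observed statistic is compared to the resampled ones. All function names, the statistic, and the toy data below are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def crt_pvalue(x, y, z, sample_x_given_z, stat, n_draws=200):
    """Conditional-randomization-style p-value: replace X with draws
    from an (approximate) sampler of X | Z and compare the statistic.
    `sample_x_given_z` stands in for a learned generative sampler."""
    observed = stat(x, y)
    null_stats = [stat(sample_x_given_z(z), y) for _ in range(n_draws)]
    # Fraction of resampled statistics at least as extreme (with +1 smoothing).
    return (1 + sum(s >= observed for s in null_stats)) / (1 + n_draws)

def abs_corr(a, b):
    """Absolute Pearson correlation as a simple dependence statistic."""
    return abs(np.corrcoef(a, b)[0, 1])

# Toy example: X and Y both depend on Z but are conditionally independent.
n = 500
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)
sampler = lambda z: z + 0.5 * rng.normal(size=len(z))  # true X | Z here
p = crt_pvalue(x, y, z, sampler, abs_corr)
print(p)  # a valid p-value in (0, 1]
```

Note that x and y are strongly correlated marginally (through z), yet the conditional resampling accounts for that shared dependence; the doubly robust property in the abstract concerns how errors in the approximate sampler affect such a test.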
Nuclear tunneling effects of charge transport in rubrene, tetracene, and pentacene
The mechanism of charge transport in organic materials is still controversial from both experimental and theoretical perspectives. At room temperature, molecular deformations interact strongly with the charge carrier through both intermolecular and intramolecular phonons, suggesting a thermally activated hopping mechanism as described by the Marcus electron transfer theory. However, several experimental measurements have indicated that the electronic transport behaves in a "bandlike" manner, as indicated by a decrease in mobility with increasing temperature, in contradiction to the Marcus description. Bandlike first-principles calculations based on the Holstein-Peierls model tend to overestimate the charge mobility by about 2 orders of magnitude. Here, a hopping model is derived that not only quantitatively describes the charge mobility but also explains the observed bandlike behavior. This model uses the quantum version of charge-transfer theory coupled with a random-walk simulation of charge diffusion. The results bridge the gap between the two extreme mechanisms. This first-principles method predicts the room-temperature hole mobilities to be 2.4, 2.0, and 0.67 cm^2/(V s) for rubrene, pentacene, and tetracene, respectively, in good agreement with experiment.
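The quantum charge-transfer rate used in the paper reduces, in the high-temperature limit, to the classical Marcus expression. The following minimal sketch shows that limiting form combined with an Einstein-relation mobility estimate for 1-D hopping; it is not the paper's full quantum-nuclear treatment, and the parameter values are purely illustrative.

```python
import math

# Physical constants (SI)
HBAR = 1.054_571_8e-34   # J*s
KB = 1.380_649e-23       # J/K
E = 1.602_176_634e-19    # C

def marcus_rate(V, lam, dG=0.0, T=300.0):
    """Semiclassical Marcus hopping rate (s^-1).
    V: transfer integral (J); lam: reorganization energy (J);
    dG: driving force (J); T: temperature (K)."""
    pref = (2 * math.pi / HBAR) * V**2 / math.sqrt(4 * math.pi * lam * KB * T)
    return pref * math.exp(-(dG + lam)**2 / (4 * lam * KB * T))

def mobility_1d(k, a, T=300.0):
    """Einstein-relation mobility for 1-D hopping:
    D = k * a^2 / 2, mu = e * D / (kB * T).
    a: hop distance (m). Returns mobility in cm^2 / (V s)."""
    D = k * a**2 / 2
    return E * D / (KB * T) * 1e4  # convert m^2 to cm^2

# Illustrative numbers only (not the paper's parameters):
# 50 meV transfer integral, 150 meV reorganization energy, 7 A hop.
k = marcus_rate(V=50e-3 * E, lam=0.15 * E)
print(mobility_1d(k, a=7e-10))
```

With parameters in this range the estimate lands at a few cm^2/(V s), the right order of magnitude for the organic semiconductors discussed; the paper's quantum (nuclear-tunneling) correction is what recovers the bandlike temperature dependence that this classical form misses.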
Large Search Model: Redefining Search Stack in the Era of LLMs
Modern search engines are built on a stack of different components, including
query understanding, retrieval, multi-stage ranking, and question answering,
among others. These components are often optimized and deployed independently.
In this paper, we introduce a novel conceptual framework called large search
model, which redefines the conventional search stack by unifying search tasks
with one large language model (LLM). All tasks are formulated as autoregressive
text generation problems, allowing for the customization of tasks through the
use of natural language prompts. This proposed framework capitalizes on the
strong language understanding and reasoning capabilities of LLMs, offering the
potential to enhance search result quality while simultaneously simplifying the
existing cumbersome search stack. To substantiate the feasibility of this
framework, we present a series of proof-of-concept experiments and discuss the
potential challenges associated with implementing this approach within
real-world search systems.
Comment: SIGIR Forum, Vol. 57 No. 2 - December 202
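The unification idea can be sketched as prompt templates that cast each search-stack component (query understanding, ranking, answering) as text generation for one model. The template wording and task names below are purely illustrative, not the paper's actual prompts.

```python
# Hypothetical prompt templates: each search-stack task becomes an
# autoregressive text-generation problem for a single LLM.
TEMPLATES = {
    "query_understanding": "Rewrite the query for retrieval: {query}",
    "ranking": "Query: {query}\nDocument: {doc}\nRelevant (yes/no):",
    "answering": (
        "Answer the question using the document.\n"
        "Question: {query}\nDocument: {doc}\nAnswer:"
    ),
}

def build_prompt(task, **fields):
    """Render one unified-stack task as a generation prompt."""
    return TEMPLATES[task].format(**fields)

print(build_prompt("ranking", query="iron-based superconductors", doc="..."))
```

Because every task shares one model and differs only in its prompt, a new component can be added or customized by writing a template rather than training and deploying a separate system, which is the simplification the abstract argues for.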
Improving Text Embeddings with Large Language Models
In this paper, we introduce a novel and simple method for obtaining
high-quality text embeddings using only synthetic data and less than 1k
training steps. Unlike existing methods that often depend on multi-stage
intermediate pre-training with billions of weakly-supervised text pairs,
followed by fine-tuning with a few labeled datasets, our method does not
require building complex training pipelines or relying on manually collected
datasets that are often constrained by task diversity and language coverage. We
leverage proprietary LLMs to generate diverse synthetic data for hundreds of
thousands of text embedding tasks across 93 languages. We then fine-tune
open-source decoder-only LLMs on the synthetic data using standard contrastive
loss. Experiments demonstrate that our method achieves strong performance on
highly competitive text embedding benchmarks without using any labeled data.
Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our
model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
Comment: Accepted by ACL 202
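The "standard contrastive loss" mentioned above is commonly the in-batch InfoNCE objective, where each query's positive passage sits on the diagonal of a similarity matrix and the other rows in the batch serve as negatives. Here is a minimal NumPy sketch of that objective (an assumption about the loss form, not the paper's implementation):

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss: query q[i] is paired with
    positive p[i]; other rows act as in-batch negatives.
    q, p: (batch, dim) arrays of L2-normalized embeddings."""
    logits = q @ p.T / temperature             # (batch, batch) similarities
    labels = np.arange(len(q))                 # positives on the diagonal
    # Numerically stable row-wise cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[labels, labels].mean())

# Toy check: when query and positive embeddings match exactly,
# the loss is near zero; mismatched pairings give a large loss.
q = np.eye(3)
print(info_nce_loss(q, q))
```

In practice such a loss is computed on embeddings produced by the fine-tuned decoder-only LLM, with the temperature treated as a tunable hyperparameter.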
