41 research outputs found
Improved training of energy-based models
Chapter 1 introduces the concepts essential to understanding the work presented in this thesis, such as energy-based graphical models, Markov chain Monte Carlo methods, generative adversarial networks, and mutual information estimation. Chapter 2 contains an article detailing our work on improving the training of energy functions. Finally, Chapter 3 presents conclusions drawn from this thesis work, the scope of future work, and open questions that remain unanswered.
Maximum likelihood estimation of energy-based models is a challenging problem due to the intractability of the log-likelihood gradient. In this work, we propose learning both the energy function and an amortized approximate sampling mechanism using a neural generator network, which provides an efficient approximation of the log-likelihood gradient. The resulting objective requires maximizing the entropy of the generated samples, which we perform using recently proposed nonparametric mutual information estimators. Finally, to stabilize the resulting adversarial game, we use a zero-centered gradient penalty derived as a necessary condition from the score-matching literature.
The proposed technique can generate sharp images with Inception and FID scores competitive with recent GAN techniques, does not suffer from mode collapse, and is competitive with state-of-the-art anomaly detection techniques.
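The zero-centered gradient penalty mentioned above can be illustrated on a toy energy with an analytic gradient. The numpy sketch below uses a quadratic energy chosen purely for illustration (not the paper's model or objective); it shows the penalty vanishing exactly where the score is zero and growing away from it:

```python
import numpy as np

# Toy quadratic energy E(x) = 0.5 * ||x - mu||^2 with analytic gradient.
def energy_grad(x, mu):
    return x - mu  # dE/dx, i.e. the (negative) score up to sign

def zero_centered_gradient_penalty(x, mu):
    # Penalize the squared norm of dE/dx at the given points; the penalty
    # is zero exactly where the energy is flat (score = 0).
    g = energy_grad(x, mu)
    return np.mean(np.sum(g ** 2, axis=-1))

mu = np.zeros(2)
x_at_mode = np.zeros((4, 2))   # samples at the energy minimum
x_off_mode = np.ones((4, 2))   # samples away from the minimum
pen_mode = zero_centered_gradient_penalty(x_at_mode, mu)   # 0.0
pen_off = zero_centered_gradient_penalty(x_off_mode, mu)   # 2.0
```

In the actual training objective the penalty is applied to a learned energy network and its gradient is obtained by automatic differentiation rather than analytically.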
The clinicopathological study of postmenopausal bleeding
Background: Postmenopausal bleeding (PMB) represents one of the most common reasons for referral to gynaecological services, largely due to suspicion of an underlying endometrial malignancy.
Methods: In this prospective study, data were collected from 100 patients with postmenopausal bleeding per vaginum attending the outpatient department or admitted for evaluation under obstetrics and gynaecology. Written informed consent was taken from all patients enrolled in the study. They were evaluated by history, clinical examination, and investigations such as transvaginal sonography, endometrial biopsy, fractional curettage, and Papanicolaou smear, performed for all subjects; the collected specimens were sent to the department of pathology for examination and reporting. Descriptive statistics were applied and analyzed using percentages and the chi-square test.
Results: Among patients with postmenopausal bleeding, atrophic endometrium was seen in 31%, proliferative endometrium in 13%, isthmic endometrium in 5%, polyp in 5%, simple hyperplasia without atypia in 35%, simple hyperplasia with atypia in 3%, complex hyperplasia without atypia in 1%, complex hyperplasia with atypia in 1%, and endometrial carcinoma in 6% of patients with PMB. Benign conditions were seen in 94% of cases and malignancy in 6%.
Conclusions: The most common causes of postmenopausal bleeding were endometrial hyperplasia (40%), atrophic endometrium (31%), proliferative endometrium (13%), isthmic endometrium (5%), polyp (5%), and endometrial carcinoma (6%). A definitive diagnosis of PMB can be made by histological evaluation. Obesity, hypertension, diabetes mellitus, and time since menopause are risk factors for PMB.
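The chi-square analysis mentioned in the methods can be illustrated with a minimal goodness-of-fit statistic. The counts below are hypothetical, chosen only to show the calculation, and are not taken from the study's tables:

```python
# Chi-square goodness-of-fit statistic: sum over categories of
# (observed - expected)^2 / expected.
def chi_square_statistic(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical benign vs. malignant counts out of 100 patients,
# compared against an assumed expected split (illustrative only).
observed = [94, 6]
expected = [90, 10]
stat = chi_square_statistic(observed, expected)
```

The statistic would then be compared against a chi-square distribution with the appropriate degrees of freedom to obtain a p-value.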
High-Fidelity Audio Compression with Improved RVQGAN
Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality
neural compression model that can compress high-dimensional natural signals
into lower dimensional discrete tokens. To that end, we introduce a
high-fidelity universal neural audio compression algorithm that achieves ~90x
compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth. We achieve
this by combining advances in high-fidelity audio generation with better vector
quantization techniques from the image domain, along with improved adversarial
and reconstruction losses. We compress all domains (speech, environment, music,
etc.) with a single universal model, making it widely applicable to generative
modeling of all audio. We compare with competing audio compression algorithms,
and find our method outperforms them significantly. We provide thorough
ablations for every design choice, as well as open-source code and trained
model weights. We hope our work can lay the foundation for the next generation
of high-fidelity audio modeling.
Comment: Accepted at NeurIPS 2023 (spotlight).
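The residual vector quantization at the heart of such neural codecs can be sketched in a few lines. The numpy toy below (codebook sizes, dimensions, and function names are illustrative, not the paper's) shows how each stage quantizes the residual left by the previous one:

```python
import numpy as np

# Residual vector quantization (RVQ): each stage quantizes the residual
# left over by the previous stage, so later codes refine the reconstruction.
def rvq_encode(x, codebooks):
    residual = x.copy()
    codes, quantized = [], np.zeros_like(x)
    for cb in codebooks:                          # cb: (K, D) codebook
        d = np.sum((cb - residual) ** 2, axis=1)  # distance to each code
        idx = int(np.argmin(d))                   # nearest code to residual
        codes.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]
    return codes, quantized

rng = np.random.default_rng(0)
x = rng.normal(size=8)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
codes, xq = rvq_encode(x, codebooks)
```

In a real codec the codebooks are learned end-to-end and the per-stage token indices are what is transmitted at the fixed bitrate.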
Towards Automatic Face-to-Face Translation
In light of the recent breakthroughs in automatic machine translation
systems, we propose a novel approach that we term as "Face-to-Face
Translation". As today's digital communication becomes increasingly visual, we
argue that there is a need for systems that can automatically translate a video
of a person speaking in language A into a target language B with realistic lip
synchronization. In this work, we create an automatic pipeline for this problem
and demonstrate its impact on multiple real-world applications. First, we build
a working speech-to-speech translation system by bringing together multiple
existing modules from speech and language. We then move towards "Face-to-Face
Translation" by incorporating a novel visual module, LipGAN for generating
realistic talking faces from the translated audio. Quantitative evaluation of
LipGAN on the standard LRW test set shows that it significantly outperforms
existing approaches across all standard metrics. We also subject our
Face-to-Face Translation pipeline, to multiple human evaluations and show that
it can significantly improve the overall user experience for consuming and
interacting with multimodal content across languages. Code, models and demo
video are made publicly available.
Demo video: https://www.youtube.com/watch?v=aHG6Oei8jF0
Code and models: https://github.com/Rudrabha/LipGAN
Comment: 9 pages (including references), 5 figures, Published in ACM Multimedia, 2019.
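The pipeline described above composes several stages in sequence. The schematic below uses mock stand-in functions (not the authors' modules; the string transforms are placeholders) purely to show the stage ordering from source speech to lip-synced video:

```python
# Toy face-to-face translation pipeline with mock stages
# (illustrative stand-ins, not the actual ASR/MT/TTS/LipGAN models).
def asr(audio_a):              # speech in language A -> text in A
    return audio_a.replace("audio:", "text:")

def translate(text_a):         # text in A -> text in B (mocked)
    return text_a + " [B]"

def tts(text_b):               # text in B -> speech in B (mocked)
    return text_b.replace("text:", "audio:")

def lip_sync(video, audio_b):  # video + translated audio -> synced video
    return (video, audio_b)

def face_to_face(video, audio_a):
    return lip_sync(video, tts(translate(asr(audio_a))))

out = face_to_face("frames", "audio:hola")
# out == ("frames", "audio:hola [B]")
```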
A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild
In this work, we investigate the problem of lip-syncing a talking face video
of an arbitrary identity to match a target speech segment. Current works excel
at producing accurate lip movements on a static image or videos of specific
people seen during the training phase. However, they fail to accurately morph
the lip movements of arbitrary identities in dynamic, unconstrained talking
face videos, resulting in significant parts of the video being out-of-sync with
the new audio. We identify key reasons pertaining to this and hence resolve
them by learning from a powerful lip-sync discriminator. Next, we propose new,
rigorous evaluation benchmarks and metrics to accurately measure lip
synchronization in unconstrained videos. Extensive quantitative evaluations on
our challenging benchmarks show that the lip-sync accuracy of the videos
generated by our Wav2Lip model is almost as good as real synced videos. We
provide a demo video clearly showing the substantial impact of our Wav2Lip
model and evaluation benchmarks on our website:
\url{cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild}.
The code and models are released at this GitHub repository:
\url{github.com/Rudrabha/Wav2Lip}. You can also try out the interactive demo at
this link: \url{bhaasha.iiit.ac.in/lipsync}.
Comment: 9 pages (including references), 3 figures, Accepted in ACM Multimedia, 2020.
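A lip-sync discriminator of the kind described can be sketched as a similarity score between audio and lip embeddings trained with a binary cross-entropy-style loss. This pure-Python toy is a simplified stand-in, not Wav2Lip's exact formulation, and the embeddings are hypothetical:

```python
import math

# SyncNet-style idea: cosine similarity between an audio embedding and a
# lip (video) embedding, pushed toward 1 for in-sync pairs via a
# binary cross-entropy-style loss.
def cosine_sim(a, v):
    dot = sum(x * y for x, y in zip(a, v))
    na = math.sqrt(sum(x * x for x in a))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (na * nv)

def sync_loss(a, v, in_sync=True):
    # Map similarity [-1, 1] to a probability-like score in (0, 1).
    p = min(max((cosine_sim(a, v) + 1) / 2, 1e-7), 1 - 1e-7)
    target = 1.0 if in_sync else 0.0
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

audio_emb = [0.2, 0.9, -0.1]
video_emb = [0.2, 0.9, -0.1]            # identical -> similarity 1
loss_sync = sync_loss(audio_emb, video_emb)               # near zero
loss_off = sync_loss(audio_emb, [-x for x in video_emb])  # large
```

A generator trained against such a discriminator is penalized whenever its generated lip frames drift out of sync with the driving audio.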
DualLip: A System for Joint Lip Reading and Generation
Lip reading aims to recognize text from talking lip, while lip generation
aims to synthesize talking lip according to text, which is a key component in
talking face generation and is a dual task of lip reading. In this paper, we
develop DualLip, a system that jointly improves lip reading and generation by
leveraging the task duality and using unlabeled text and lip video data. The
key ideas of the DualLip include: 1) Generate lip video from unlabeled text
with a lip generation model, and use the pseudo pairs to improve lip reading;
2) Generate text from unlabeled lip video with a lip reading model, and use the
pseudo pairs to improve lip generation. We further extend DualLip to talking
face generation with two additionally introduced components: lip to face
generation and text to speech generation. Experiments on GRID and TCD-TIMIT
demonstrate the effectiveness of DualLip on improving lip reading, lip
generation, and talking face generation by utilizing unlabeled data.
Specifically, the lip generation model in our DualLip system trained with
only 10% of the paired data surpasses the performance of that trained with the whole
paired data. And on the GRID benchmark of lip reading, we achieve 1.16%
character error rate and 2.71% word error rate, outperforming the
state-of-the-art models using the same amount of paired data.
Comment: Accepted by ACM Multimedia 2020.
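The two pseudo-pair directions described above can be sketched schematically. In the toy below, lip video is mocked as a reversed string of its text so that the two tasks are exact inverses; this is a purely illustrative assumption, not how the real models work:

```python
# Schematic of the DualLip pseudo-pair idea with toy stand-in "models".
def lip_generation(text):   # text -> (mock) lip sequence
    return text[::-1]

def lip_reading(lip):       # (mock) lip sequence -> text
    return lip[::-1]

unlabeled_text = ["hello", "world"]
unlabeled_lip = ["dlrow", "olleh"]

# 1) Pseudo pairs for lip reading: generate lip from unlabeled text.
pseudo_for_reading = [(lip_generation(t), t) for t in unlabeled_text]
# 2) Pseudo pairs for lip generation: read text from unlabeled lip.
pseudo_for_generation = [(lip_reading(l), l) for l in unlabeled_lip]
# Each pseudo-pair set would then augment the other task's training data.
```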
