An evaluation of entropy measures for microphone identification
Research findings have shown that microphones can be uniquely identified from audio recordings, since physical features of the microphone components leave repeatable and distinguishable traces on the audio stream. This property can be exploited in security applications to identify a mobile phone through its built-in microphone. The problem is to determine an accurate but also efficient representation of the physical characteristics, which is not known a priori. Usually there is a trade-off between the identification accuracy and the time required to perform the classification. Various approaches have been used in the literature to deal with this trade-off, ranging from handcrafted statistical features to the recent application of deep learning techniques. This paper evaluates different entropy measures (Shannon Entropy, Permutation Entropy, Dispersion Entropy, Approximate Entropy, Sample Entropy, and Fuzzy Entropy) and their suitability for microphone classification. The analysis is validated on an experimental dataset of the built-in microphones of 34 mobile phones, stimulated by three different audio signals. The findings show that selected entropy measures can provide very high identification accuracy in comparison to other statistical features and that they can be robust against the presence of noise. The paper performs an extensive analysis based on filter feature selection methods to identify the most discriminating entropy measures and the related hyper-parameters (e.g., embedding dimension). Results on the trade-off between accuracy and classification time are also presented.
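As a rough illustration of the kind of features the paper evaluates, the sketch below computes two of the listed measures, Shannon entropy and permutation entropy, for a single audio frame; the frame length, bin count, and embedding dimension are illustrative assumptions, not the paper's tuned hyper-parameters.

```python
import numpy as np
from math import factorial

def shannon_entropy(x, bins=64):
    """Shannon entropy of an amplitude histogram of the signal."""
    hist, _ = np.histogram(x, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def permutation_entropy(x, m=3, delay=1):
    """Normalized permutation entropy with embedding dimension m."""
    n = len(x) - (m - 1) * delay
    counts = {}
    for i in range(n):
        # Ordinal pattern of m consecutive (delayed) samples.
        pattern = tuple(np.argsort(x[i:i + (m - 1) * delay + 1:delay]))
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values()), dtype=float) / n
    return -np.sum(p * np.log2(p)) / np.log2(factorial(m))

# Example: a feature vector for one hypothetical 1 s frame at 16 kHz.
frame = np.random.randn(16000)
features = [shannon_entropy(frame), permutation_entropy(frame, m=4)]
```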
Media forensics on social media platforms: a survey
The dependability of visual information on the web and the authenticity of digital media appearing virally on social media platforms have been raising unprecedented concerns. As a result, in recent years the multimedia forensics research community has pursued the ambition to scale forensic analysis to real-world, web-based open systems. This survey describes the work done so far on the analysis of shared data, covering three main aspects: forensic techniques performing source identification and integrity verification on media uploaded to social networks, platform provenance analysis to identify the sharing platforms, and multimedia verification algorithms assessing the credibility of media objects in relation to their associated textual information. The achieved results are highlighted together with current open issues and research challenges to be addressed in order to advance the field in the near future.
CMDD: A novel multimodal two-stream CNN deepfakes detector
Researchers commonly model deepfake detection as a binary classification problem, using a unimodal network for each manipulated modality (such as auditory and visual) and a final ensemble of their predictions. In this paper, we focus our attention on the simultaneous detection of relationships between audio and visual cues, leading to the extraction of more comprehensive information to expose deepfakes. We propose the Convolutional Multimodal Deepfake Detection model (CMDD), a novel multimodal model that relies on the power of two Convolutional Neural Networks (CNNs) to concurrently extract and process spatial and temporal features. We compare it with two baseline models: DeepFakeCVT, which uses two CNNs and a final Vision Transformer, and DeepMerge, which employs a score fusion of the unimodal CNN models. The multimodal FakeAVCeleb dataset was used to train and test our model, resulting in an accuracy of 98.9%, which places our model among the top three evaluated on FakeAVCeleb.
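For readers unfamiliar with the two-stream design, the following is a minimal sketch of a multimodal detector with one CNN per modality and a fused classification head; the layer sizes and the late concatenation are placeholder assumptions, not the actual CMDD architecture.

```python
import torch
import torch.nn as nn

class TwoStreamDetector(nn.Module):
    """Illustrative two-stream detector: one CNN per modality, with the
    embeddings concatenated before a joint classifier. Layer sizes are
    placeholders, not the CMDD configuration."""
    def __init__(self, num_classes=2):
        super().__init__()
        # Audio branch: operates on (B, 1, mel_bins, time) spectrograms.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Visual branch: operates on (B, 3, H, W) face frames.
        self.visual_cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16 + 16, num_classes)

    def forward(self, audio, frames):
        fused = torch.cat([self.audio_cnn(audio),
                           self.visual_cnn(frames)], dim=1)
        return self.head(fused)
```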
METER: a mobile vision transformer architecture for monocular depth estimation
Depth estimation is a fundamental capability for autonomous systems that need to assess their own state and perceive the surrounding environment. Deep learning algorithms for depth estimation have gained significant interest in recent years, owing to their potential to overcome the limitations of active depth sensing systems. Moreover, due to the low cost and size of monocular cameras, researchers have focused their attention on monocular depth estimation (MDE), which consists of estimating a dense depth map from a single RGB video frame. State-of-the-art MDE models typically rely on vision transformer (ViT) architectures that are highly deep and complex, making them unsuitable for fast inference on devices with hardware constraints. In this paper, we therefore address the problem of exploiting ViTs for MDE on embedded devices, which are usually characterized by limited memory and low-power CPUs/GPUs. We propose METER, a novel lightweight vision transformer architecture capable of achieving state-of-the-art estimation and low-latency inference on the considered embedded hardware: the NVIDIA Jetson TX1 and NVIDIA Jetson Nano. We provide a solution consisting of three alternative configurations of METER, a novel loss function to balance pixel estimation and the reconstruction of image details, and a new data augmentation strategy to improve the overall final predictions. The proposed method outperforms previous lightweight works on the two benchmark datasets: the indoor NYU Depth v2 and the outdoor KITTI.
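The abstract mentions a loss balancing pixel estimation against the reconstruction of image details without giving its form; the snippet below sketches one common monocular-depth pattern (a per-pixel L1 term plus an image-gradient term) under that assumption. The weighting factor lam is hypothetical, and this is not METER's published formulation.

```python
import torch
import torch.nn.functional as F

def balanced_depth_loss(pred, target, lam=0.5):
    """Illustrative MDE loss: per-pixel L1 plus an edge-preserving gradient
    term. A common pattern, not METER's actual loss function."""
    pixel = F.l1_loss(pred, target)
    # Differences between horizontally / vertically adjacent pixels
    # approximate image gradients and penalize blurred depth edges.
    dx = F.l1_loss(pred[..., :, 1:] - pred[..., :, :-1],
                   target[..., :, 1:] - target[..., :, :-1])
    dy = F.l1_loss(pred[..., 1:, :] - pred[..., :-1, :],
                   target[..., 1:, :] - target[..., :-1, :])
    return pixel + lam * (dx + dy)
```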
Continuous fake media detection: Adapting deepfake detectors to new generative techniques
Generative techniques continue to evolve at an impressively high rate, driven by the hype around these technologies. This rapid advancement severely limits the application of deepfake detectors, which, despite numerous efforts by the scientific community, struggle to achieve sufficiently robust performance against the ever-changing content. To address these limitations, in this paper we analyze two continual learning techniques on a short and a long sequence of fake media. Both sequences include a complex and heterogeneous range of deepfakes (generated images and videos) from GANs, computer graphics techniques, and unknown sources. Our experiments show that continual learning can be important in mitigating the need for generalizability. In fact, we show that, although with some limitations, continual learning methods help to maintain good performance across the entire training sequence. For these techniques to work in a sufficiently robust way, however, the tasks in the sequence must share similarities. In fact, according to our experiments, the order and similarity of the tasks can affect the performance of the models over time. To address this problem, we show that it is possible to group tasks based on their similarity. This simple measure allows for a significant improvement even on longer sequences. This result suggests that continual learning techniques can be combined with the most promising detection methods, allowing them to keep up with the latest generative techniques. In addition, we propose an overview of how this learning approach can be integrated into a deepfake detection pipeline for continuous integration and continuous deployment (CI/CD). This makes it possible to keep track of different sources, such as social networks, new generative tools, or third-party datasets, and, through the integration of continual learning, to constantly maintain the detectors.
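As a toy illustration of grouping tasks by similarity before continual training, the sketch below greedily clusters tasks using the cosine similarity of their mean feature embeddings; the embedding source, the task names, and the 0.8 threshold are assumptions, and the paper's actual grouping criterion may differ.

```python
import numpy as np

def group_tasks_by_similarity(task_embeddings, threshold=0.8):
    """Greedy grouping of tasks by cosine similarity of their mean
    embeddings. The threshold is an illustrative hyper-parameter."""
    groups = []
    for name, emb in task_embeddings.items():
        emb = emb / np.linalg.norm(emb)
        for group in groups:
            if float(emb @ group["centroid"]) >= threshold:
                group["tasks"].append(name)
                break
        else:
            # No sufficiently similar group found: start a new one.
            groups.append({"tasks": [name], "centroid": emb})
    return [g["tasks"] for g in groups]

# Hypothetical mean embeddings of each deepfake "task" (one per generator).
tasks = {"stylegan": np.random.randn(128), "faceswap": np.random.randn(128)}
print(group_tasks_by_similarity(tasks))
```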
ABC-CapsNet: Attention based Cascaded Capsule Network for Audio Deepfake Detection
In response to the escalating challenge of audio deepfake detection, this study introduces ABC-CapsNet (Attention-Based Cascaded Capsule Network), a novel architecture that merges the perceptual strengths of Mel spectrograms with the robust feature extraction capabilities of VGG18, enhanced by a strategically placed attention mechanism. This architecture pioneers the use of cascaded capsule networks to delve deeper into complex audio data patterns, setting a new standard in the precision of identifying manipulated audio content. Distinctively, ABC-CapsNet not only addresses the inherent limitations found in traditional CNN models but also shows remarkable effectiveness across diverse datasets. The proposed method achieves an equal error rate (EER) of 0.06% on the ASVspoof2019 dataset and an EER of 0.04% on the FoR dataset, underscoring the superior accuracy and reliability of the proposed system in combating the sophisticated threat of audio deepfakes.
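Since this work and the spoofing study further below both report equal error rate, a short reference implementation may help; it assumes higher scores indicate bona fide audio, which is a convention rather than anything specified in the abstracts.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER from detection scores; labels: 1 = bona fide, 0 = spoof.
    Assumes higher scores indicate bona fide audio."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # spoofs accepted
        frr = np.mean(scores[labels == 1] < t)   # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```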
A guided-based approach for deepfake detection: RGB-depth integration via features fusion
Deepfake technology paves the way for a new generation of super-realistic artificial content. While this opens the door to extraordinary new applications, the malicious use of deepfakes allows for far more realistic disinformation attacks than ever before. In this paper, we start from the intuition that generating fake content introduces possible inconsistencies in the depth of the generated images. This extra information provides valuable spatial and semantic cues that can reveal the inconsistencies that facial generative methods introduce. To test this idea, we evaluate different strategies for integrating depth information into an RGB detector and propose an attention mechanism that integrates depth information effectively. In addition to being more accurate than an RGB model, our Masked Depthfake Network is on average +3.2% more robust against common adversarial attacks than a typical RGB detector. Furthermore, we show how this technique allows the model to learn more discriminative features than RGB alone.
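To make the depth-guided attention idea concrete, here is a generic sketch in which depth features produce a spatial gate that reweights RGB features before fusion; the channel sizes and gating design are illustrative and do not reproduce the Masked Depthfake Network.

```python
import torch
import torch.nn as nn

class DepthGuidedFusion(nn.Module):
    """Illustrative fusion: depth features yield a spatial attention map
    that reweights RGB features. Channel sizes are placeholders."""
    def __init__(self, channels=64):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        gate = self.attn(depth_feat)    # (B, 1, H, W) values in [0, 1]
        gated_rgb = rgb_feat * gate     # depth-guided reweighting
        return self.merge(torch.cat([gated_rgb, depth_feat], dim=1))
```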
A Survey on Efficient Vision Transformers: Algorithms, Techniques, and Performance Benchmarking
Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tackle computer vision applications. Their main feature is the capacity to extract global information through the self-attention mechanism, outperforming earlier convolutional neural networks. However, the deployment cost of ViTs has grown steadily with their size, number of trainable parameters, and operations. Furthermore, the computational and memory cost of self-attention increases quadratically with image resolution. Generally speaking, it is challenging to employ these architectures in real-world applications because of hardware and environmental restrictions such as limited processing and computational capabilities. Therefore, this survey investigates methodologies that make ViTs efficient while keeping their estimation performance close to that of full-size models. In detail, four categories are analyzed: compact architectures, pruning, knowledge distillation, and quantization strategies. Moreover, a new metric called Efficient Error Rate is introduced in order to normalize and compare model features that affect hardware devices at inference time, such as the number of parameters, bits, FLOPs, and model size. In summary, this paper first mathematically defines the strategies used to make Vision Transformers efficient, then describes and discusses state-of-the-art methodologies and analyzes their performance over different application scenarios. Toward the end of the paper, we also discuss open challenges and promising research directions.
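The survey's exact Efficient Error Rate definition is not reproduced in this abstract; purely as an illustration of the idea of weighting error by inference-time costs normalized against the largest model compared, one plausible combination is sketched below. The equal weighting and the choice of cost terms are assumptions, and the paper's own formula should be taken from the survey itself.

```python
def efficient_error_rate(error, params, flops, size_mb,
                         max_params, max_flops, max_size_mb):
    """Illustrative composite metric in the spirit of the survey's
    Efficient Error Rate: classification error scaled by the average of
    inference-time costs normalized to the largest compared model.
    Not the paper's exact formula."""
    cost = (params / max_params
            + flops / max_flops
            + size_mb / max_size_mb) / 3
    return error * cost
```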
Learning from Unlabelled data with Transformers: Domain Adaptation for Semantic Segmentation of High Resolution Aerial Images
Data from satellites or aerial vehicles are most of the time unlabelled. Annotating such data accurately is difficult, requires expertise, and is costly in terms of time. Even if Earth Observation (EO) data were correctly labelled, labels might change over time. Learning from unlabelled data within a semi-supervised learning framework for the segmentation of aerial images is challenging. In this paper, we develop a new model for the semantic segmentation of unlabelled images, the Non-annotated Earth Observation Semantic Segmentation (NEOS) model. NEOS performs domain adaptation, as the target domain does not have ground truth masks. The distribution inconsistencies between the target and source domains are due to differences in acquisition scenes, environmental conditions, sensors, and times. Our model aligns the learned representations of the different domains so that they coincide. The evaluation results show that it is successful and outperforms other models for the semantic segmentation of unlabelled data.
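The abstract states that NEOS aligns the learned representations of source and target domains without specifying the objective; a standard choice for such alignment is a maximum mean discrepancy (MMD) penalty between feature batches, sketched below as one plausible option rather than NEOS's actual loss.

```python
import torch

def mmd_loss(source_feat, target_feat, sigma=1.0):
    """Biased Gaussian-kernel MMD estimate between source and target
    feature batches of shape (N, D). A standard alignment objective,
    not necessarily the one NEOS uses."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return (kernel(source_feat, source_feat).mean()
            + kernel(target_feat, target_feat).mean()
            - 2 * kernel(source_feat, target_feat).mean())
```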
Multi Pattern Features-Based Spoofing Detection Mechanism Using One Class Learning
Automatic Speaker Verification systems are prone to various voice spoofing attacks such as replay, voice conversion (VC), and speech synthesis. Using advanced audio manipulation techniques, malicious users can perform specific tasks such as taking control of someone's bank account or of a smart home. This study presents a multi-pattern-features-based spoofing detection mechanism that uses a modified ResNet architecture and an OC-Softmax layer to detect various logical access (LA) and physical access (PA) spoofing attacks. We propose a novel pattern-features-based audio spoof detection scheme. The scheme contains three branches that evaluate different patterns on the Mel spectrogram of the audio file. This is the first work to address the audio spoofing detection task using three different pattern representations of the Mel spectrogram with a modified ResNet architecture and an OC-Softmax layer. The proposed network extracts pattern images from the Mel spectrogram and feeds each of them into the modified ResNet architecture. At the last step of each branch, OC-Softmax produces a score for the current pattern image, and the three scores are then fused to label the input audio. Experimental results on the ASVspoof 2019 and ASVspoof 2021 corpora show that the proposed method achieves better results on the ASVspoof 2019 challenges than state-of-the-art methods. For example, in the logical access scenario, our model improves the tandem decision cost function and equal error rate scores by 0.06% and 2.14%, respectively, compared with state-of-the-art methods. Additionally, the experiments illustrate that the proposed fused decision improves the performance of the system.
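The OC-Softmax layer used here comes from one-class learning for spoofing detection; the sketch below follows the commonly cited formulation, with the margins and scale factor taken from the original OC-Softmax paper rather than from this study's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCSoftmax(nn.Module):
    """One-class softmax loss: embeddings are scored against a single
    learned class center, with different margins for the two classes.
    Margins and alpha follow commonly used values, which may differ
    from this work's configuration."""
    def __init__(self, feat_dim=256, m_real=0.9, m_fake=0.2, alpha=20.0):
        super().__init__()
        self.w = nn.Parameter(torch.randn(feat_dim))
        self.m_real, self.m_fake, self.alpha = m_real, m_fake, alpha

    def forward(self, x, labels):
        # Cosine score between each embedding and the class center.
        score = F.normalize(x, dim=1) @ F.normalize(self.w, dim=0)
        # labels: 0 = bona fide (push score above m_real),
        #         1 = spoof     (push score below m_fake).
        margin = torch.where(labels == 0,
                             self.m_real - score, score - self.m_fake)
        # softplus(z) = log(1 + exp(z)), the OC-Softmax objective.
        return F.softplus(self.alpha * margin).mean(), score
```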
