Twitter Bots’ Detection with Benford’s Law and Machine Learning
Online Social Networks (OSNs) have grown exponentially in terms of active users and have now become an influential factor in the formation of public opinions. For this reason, the use of bots and botnets for spreading misinformation on OSNs has become a widespread concern. Identifying bots and botnets on Twitter can require complex statistical methods to score a profile based on multiple features. Benford's Law, or the Law of Anomalous Numbers, states that, in any naturally occurring sequence of numbers, the First Significant Leading Digit (FSLD) frequency follows a particular pattern: the digits are unevenly distributed, with frequencies decreasing from digit 1 to digit 9. This principle can be applied to the first-degree egocentric network of a Twitter profile to assess its conformity to the law and, thus, classify it as a bot profile or a normal profile. This paper focuses on leveraging Benford's Law in combination with various Machine Learning (ML) classifiers to identify bot profiles on Twitter. In addition, a comparison with other statistical methods is produced to confirm our classification results.
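The FSLD conformity check at the heart of this approach can be sketched as follows. This is a minimal illustration, not the paper's implementation; the follower counts below are hypothetical stand-ins for numeric features drawn from a profile's first-degree egocentric network.

```python
import math
from collections import Counter

def benford_expected(d):
    """Benford's Law: expected frequency of first significant digit d (1-9)."""
    return math.log10(1 + 1 / d)

def fsld_frequencies(values):
    """Observed first-significant-leading-digit frequencies over positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    total = len(digits)
    counts = Counter(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Hypothetical follower counts from a profile's egocentric network.
followers = [12, 134, 19, 2, 1842, 27, 310, 1, 95, 1203, 14, 160]
observed = fsld_frequencies(followers)
for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

A profile whose observed FSLD distribution diverges strongly from the expected frequencies (for example, under a chi-squared test) would be flagged as a candidate bot.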
Fake Malware Generation Using HMM and GAN
In the past decade, the number of malware attacks has grown considerably and, more importantly, evolved. Many researchers have successfully integrated state-of-the-art machine learning techniques to combat this ever-present and rising threat to information security. However, the lack of sufficient data to appropriately train these machine learning models remains a major challenge. Generative modelling has proven to be very efficient at generating image-like synthesized data that can match the actual data distribution. In this paper, we aim to generate malware samples as opcode sequences and attempt to differentiate them from the real ones, with the goal of building fake malware data that can be used to effectively train machine learning models. We use and compare different Generative Adversarial Network (GAN) algorithms and Hidden Markov Models (HMM) to generate such fake samples, obtaining promising results.
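The HMM side of this generation strategy can be sketched as sampling opcode sequences from a trained model. The state count, opcode vocabulary, and probabilities below are illustrative assumptions, not values from any trained model in the paper.

```python
import random

random.seed(0)

# Hypothetical 2-state HMM over a tiny opcode vocabulary; in practice the
# matrices would come from Baum-Welch training on real opcode sequences.
states = [0, 1]
opcodes = ["mov", "push", "call", "jmp"]
initial = [0.6, 0.4]                   # P(state at t = 0)
transition = [[0.7, 0.3], [0.2, 0.8]]  # P(next state | current state)
emission = [[0.5, 0.3, 0.1, 0.1],      # P(opcode | state)
            [0.1, 0.2, 0.4, 0.3]]

def sample_sequence(length):
    """Generate a synthetic opcode sequence by walking the HMM."""
    state = random.choices(states, weights=initial)[0]
    seq = []
    for _ in range(length):
        seq.append(random.choices(opcodes, weights=emission[state])[0])
        state = random.choices(states, weights=transition[state])[0]
    return seq

print(sample_sequence(10))
```

Sequences sampled this way can then be mixed with real opcode sequences to test whether a classifier can tell the two apart.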
Word Embeddings for Fake Malware Generation
Signature and anomaly-based techniques are the fundamental methods to detect malware. However, in recent years this type of threat has advanced to become more complex and sophisticated, making these techniques less effective. For this reason, researchers have resorted to state-of-the-art machine learning techniques to combat threats to information security. Nevertheless, despite the integration of machine learning models, there is still a shortage of training data that prevents these models from performing at their peak. In the past, generative models have been found to be highly effective at generating image-like data that are similar to the actual data distribution. In this paper, we leverage the knowledge of generative modeling on opcode sequences and aim to generate malware samples by taking advantage of the contextualized embeddings from BERT. We obtained promising results when differentiating between real and generated samples. We observe that generated malware has such similar characteristics to actual malware that the classifiers have difficulty distinguishing between the two, with the classifiers falsely identifying the generated malware as actual malware much of the time.
A Blockchain-Based Tamper-Resistant Logging Framework
Since its introduction in Bitcoin, the blockchain has proven to be a versatile data structure. In its role as an immutable ledger, it has grown beyond its initial use in financial transactions to be used in recording a wide variety of other useful information. In this paper, we explore the application of the blockchain outside of its traditional decentralized, financial domain. We show how, even with only a single “mining” node, a proof-of-work blockchain can be the cornerstone of a tamper-resistant logging framework. By attaching a proof-of-work to blocks of logging messages, we make it increasingly difficult for an attacker to modify those logs, even after totally compromising the system. Furthermore, we discuss various strategies an attacker might take to modify the logs without detection and show how effective those evasion techniques are against statistical analysis.
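The single-node scheme can be sketched as follows. This is a minimal illustration under assumed parameters (SHA-256, a 4-hex-digit difficulty, JSON block encoding), not the paper's actual framework.

```python
import hashlib
import json

DIFFICULTY = 4  # leading zero hex digits required; an illustrative choice

def block_hash(block):
    """Deterministic hash of a block's canonical JSON encoding."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def mine_block(prev_hash, log_messages):
    """Attach a proof-of-work to a batch of log messages."""
    block = {"prev": prev_hash, "logs": log_messages, "nonce": 0}
    while not block_hash(block).startswith("0" * DIFFICULTY):
        block["nonce"] += 1
    return block

def verify(chain):
    """A valid chain has correct proofs-of-work and intact hash links."""
    for i, blk in enumerate(chain):
        if not block_hash(blk).startswith("0" * DIFFICULTY):
            return False
        if i > 0 and blk["prev"] != block_hash(chain[i - 1]):
            return False
    return True

# A single "mining" node appending batches of log messages.
chain = [mine_block("0" * 64, ["service started"])]
chain.append(mine_block(block_hash(chain[0]), ["login from 10.0.0.5"]))
print(verify(chain))  # → True
```

Because each block's hash depends on its predecessor, rewriting an old log entry forces the attacker to redo the proof-of-work for every subsequent block, which is what makes after-the-fact tampering costly even on a compromised host.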
Machine learning classification for advanced malware detection
This introductory document discusses topics related to malware detection via the application
of machine learning algorithms. It is intended as a supplement to the published work
submitted (a complete list of which can be found in Table 1) and outlines the motivation
behind the experiments.
The document begins with the following sections:
• Section 2 presents a preliminary discussion of the research methodology employed.
• Section 3 presents background on malware detection in general and on the use of
machine learning in particular.
• Section 4 provides a brief introduction to the most common machine learning
algorithms in current use.
The remaining sections present the main body of the experimental work, leading to the
conclusions in Section 10.
• Section 5 analyzes different initialization strategies for machine learning models, with
a view to ensuring that the most effective training and testing strategy is employed.
Following this, a purely dynamic approach is proposed, which results in perfect
classification of the samples against benign files, and therefore provides a baseline
against which the performance of subsequent static approaches can be compared.
• Section 6 introduces the static-based tests, beginning with the challenging problem of
zero-day sample detection, i.e., malware samples for which insufficient data has yet
been gathered to train the machine learning models.
• Section 7 describes the testing of several different approaches to static malware
detection. During these tests, the effectiveness of these algorithms is analyzed and
compared with other means of classification.
• Section 8 proposes and compares techniques to boost the detection accuracy by
combining the scores obtained from other detection algorithms, with a view to
improving static classification scores and thus reaching the perfect detection obtained
with dynamic features.
• Section 9 assesses the detection effectiveness of generic malware models trained on
several different families. The experiments are intended to introduce a more realistic
scenario where a single, comprehensive machine learning model is used to detect
several families. This section demonstrates the difficulty of building a single model
capable of detecting several malware families.
Robustness of Image-Based Malware Analysis
In previous work, “gist descriptor” features extracted from images have been used in malware classification problems and have shown promising results. In this research, we determine whether gist descriptors are robust with respect to malware obfuscation techniques, as compared to Convolutional Neural Networks (CNN) trained directly on malware images. Using the Python Imaging Library (PIL), we create images from malware executables and from malware that we obfuscate. We conduct experiments to compare classifying these images with a CNN as opposed to extracting the gist descriptor features from these images to use in classification. For the gist descriptors, we consider a variety of classification algorithms including k-nearest neighbors, random forest, support vector machine, and multi-layer perceptron. We find that gist descriptors are more robust than CNNs, with respect to the obfuscation techniques that we consider.
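The byte-to-image step can be sketched as below. This is a minimal pure-Python illustration of the mapping (the actual image files would be written with PIL, e.g. via `Image.frombytes`, which is omitted here); the byte string is a hypothetical stand-in for a malware executable.

```python
import math

def bytes_to_image_rows(data, width=256):
    """Map a binary's raw bytes to a fixed-width grayscale pixel matrix.

    Each byte (0-255) becomes one pixel intensity; the final row is
    zero-padded. PIL can then turn the matrix into an image file.
    """
    height = math.ceil(len(data) / width)
    padded = data + b"\x00" * (height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]

# Hypothetical stand-in for the raw bytes of an executable.
sample = bytes(range(256)) * 3
rows = bytes_to_image_rows(sample, width=64)
print(len(rows), len(rows[0]))  # → 12 64
```

Obfuscation that inserts or reorders bytes shifts these pixel patterns, which is precisely what the robustness experiments probe.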
Malware classification using long short-term memory models
Signature and anomaly-based techniques are the quintessential approaches to malware detection. However, these techniques have become increasingly ineffective as malware has become more sophisticated and complex. Researchers have therefore turned to deep learning to construct better performing models. In this paper, we create four different long short-term memory (LSTM) based models and train each to classify malware samples from 20 families. Our features consist of opcodes extracted from malware executables. We employ techniques used in natural language processing (NLP), including word embedding and bidirectional LSTMs (biLSTM), and we also use convolutional neural networks (CNN). We find that a model consisting of word embedding, biLSTM, and CNN layers performs best in our malware classification experiments.
Black box analysis of android malware detectors
If a malware detector relies heavily on a feature that is obfuscated in a given malware sample, then the detector will likely fail to correctly classify the malware. In this research, we obfuscate selected features of known Android malware samples and determine whether these obfuscated samples can still be reliably detected. Using this approach, we discover which features are most significant for various sets of Android malware detectors, in effect, performing a black box analysis of these detectors. We find that there is a surprisingly high degree of variability among the key features used by popular malware detectors.
Hidden Markov Models with Random Restarts vs Boosting for Malware Detection
Effective and efficient malware detection is at the forefront of research into building secure digital systems. As with many other fields, malware detection research has seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has been used widely in the field of pattern matching in general, and malware detection in particular, is hidden Markov models (HMMs). HMM training is based on a hill climb, and hence we can often improve a model by training multiple times with different initial values. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained with multiple random restarts, in the context of malware detection. These techniques are applied to a variety of challenging malware datasets. We find that random restarts perform surprisingly well in comparison to boosting. Only in the most difficult "cold start" cases (where training data is severely limited) does boosting appear to offer sufficient improvement to justify its higher computational cost in the scoring phase.
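The random-restart idea can be illustrated on a toy hill climb. The bimodal objective below is an assumed stand-in for an HMM likelihood surface with multiple local maxima; it does not reproduce the paper's Baum-Welch training.

```python
import math
import random

def hill_climb(score, x, step=0.1, iters=200):
    """Greedy local search: from a given start, only the nearest local
    optimum is reachable, analogous to Baum-Welch converging from one
    random initialization."""
    for _ in range(iters):
        for cand in (x - step, x + step):
            if score(cand) > score(x):
                x = cand
    return x

def with_random_restarts(score, restarts, rng):
    """Re-run the climb from several random initial values; keep the best."""
    best = None
    for _ in range(restarts):
        x = hill_climb(score, rng.uniform(-3.0, 3.0))
        if best is None or score(x) > score(best):
            best = x
    return best

# Toy bimodal objective: global peak near x = 2, smaller local peak near x = -2.
def score(x):
    return 2.0 * math.exp(-(x - 2.0) ** 2) + math.exp(-(x + 2.0) ** 2)

best = with_random_restarts(score, restarts=8, rng=random.Random(0))
print(round(best, 2), round(score(best), 2))
```

A single run that happens to start near x = -2 gets stuck on the smaller peak; with several restarts, at least one run typically lands in the basin of the global peak, which is why restarts compete so well with boosting at a fraction of the scoring-phase cost.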
