Twitter Bots’ Detection with Benford’s Law and Machine Learning
Online Social Networks (OSNs) have grown exponentially in terms of active users and have now become an influential factor in the formation of public opinions. For this reason, the use of bots and botnets for spreading misinformation on OSNs has become a widespread concern. Identifying bots and botnets on Twitter can require complex statistical methods to score a profile based on multiple features. Benford's Law, or the Law of Anomalous Numbers, states that, in any naturally occurring sequence of numbers, the First Significant Leading Digit (FSLD) frequency follows a particular pattern: the digits are unevenly distributed, with frequencies decreasing from digit 1 to digit 9. This principle can be applied to the first-degree egocentric network of a Twitter profile to assess its conformity to the law and, thus, classify it as a bot profile or a normal profile. This paper focuses on leveraging Benford's Law in combination with various Machine Learning (ML) classifiers to identify bot profiles on Twitter. In addition, a comparison with other statistical methods is produced to confirm our classification results.
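The FSLD conformity check at the heart of this approach can be sketched as follows. This is a minimal illustration, not the paper's implementation; the follower counts below are hypothetical stand-ins for numeric features drawn from a profile's first-degree egocentric network.

```python
import math
from collections import Counter

def benford_expected(d):
    """Benford's Law: expected frequency of first significant digit d (1-9)."""
    return math.log10(1 + 1 / d)

def fsld_frequencies(values):
    """Observed first-significant-leading-digit frequencies over positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    total = len(digits)
    counts = Counter(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Hypothetical follower counts from a profile's egocentric network.
followers = [12, 134, 19, 2, 1842, 27, 310, 1, 95, 1203, 14, 160]
observed = fsld_frequencies(followers)
for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

A profile whose observed FSLD distribution diverges strongly from the expected frequencies (for example, under a chi-squared test) would be flagged as a candidate bot.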
Fake Malware Generation Using HMM and GAN
In the past decade, the number of malware attacks has grown considerably and, more importantly, evolved. Many researchers have successfully integrated state-of-the-art machine learning techniques to combat this ever-present and rising threat to information security. However, the lack of sufficient data to appropriately train these machine learning models remains a major challenge. Generative modelling has proven to be very efficient at generating image-like synthesized data that can match the actual data distribution. In this paper, we aim to generate malware samples as opcode sequences and attempt to differentiate them from the real ones, with the goal of building fake malware data that can be used to effectively train machine learning models. We use and compare different Generative Adversarial Network (GAN) algorithms and Hidden Markov Models (HMM) to generate such fake samples, obtaining promising results.
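The HMM side of this generation strategy can be sketched as sampling opcode sequences from a trained model. The state count, opcode vocabulary, and probabilities below are illustrative assumptions, not values from any trained model in the paper.

```python
import random

random.seed(0)

# Hypothetical 2-state HMM over a tiny opcode vocabulary; in practice the
# matrices would come from Baum-Welch training on real opcode sequences.
states = [0, 1]
opcodes = ["mov", "push", "call", "jmp"]
initial = [0.6, 0.4]                   # P(state at t = 0)
transition = [[0.7, 0.3], [0.2, 0.8]]  # P(next state | current state)
emission = [[0.5, 0.3, 0.1, 0.1],      # P(opcode | state)
            [0.1, 0.2, 0.4, 0.3]]

def sample_sequence(length):
    """Generate a synthetic opcode sequence by walking the HMM."""
    state = random.choices(states, weights=initial)[0]
    seq = []
    for _ in range(length):
        seq.append(random.choices(opcodes, weights=emission[state])[0])
        state = random.choices(states, weights=transition[state])[0]
    return seq

print(sample_sequence(10))
```

Sequences sampled this way can then be mixed with real opcode sequences to test whether a classifier can tell the two apart.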
Word Embeddings for Fake Malware Generation
Signature and anomaly-based techniques are the fundamental methods to detect malware. However, in recent years this type of threat has advanced to become more complex and sophisticated, making these techniques less effective. For this reason, researchers have resorted to state-of-the-art machine learning techniques to combat threats to information security. Nevertheless, despite the integration of machine learning models, there is still a shortage of training data that prevents these models from performing at their peak. In the past, generative models have been found to be highly effective at generating image-like data that are similar to the actual data distribution. In this paper, we leverage the knowledge of generative modeling on opcode sequences and aim to generate malware samples by taking advantage of the contextualized embeddings from BERT. We obtained promising results when differentiating between real and generated samples. We observe that generated malware has such similar characteristics to actual malware that the classifiers have difficulty distinguishing between the two, with the classifiers falsely identifying the generated malware as actual malware much of the time.
A Blockchain-Based Tamper-Resistant Logging Framework
Since its introduction in Bitcoin, the blockchain has proven to be a versatile data structure. In its role as an immutable ledger, it has grown beyond its initial use in financial transactions to be used in recording a wide variety of other useful information. In this paper, we explore the application of the blockchain outside of its traditional decentralized, financial domain. We show how, even with only a single “mining” node, a proof-of-work blockchain can be the cornerstone of a tamper-resistant logging framework. By attaching a proof-of-work to blocks of logging messages, we make it increasingly difficult for an attacker to modify those logs, even after totally compromising the system. Furthermore, we discuss various strategies an attacker might take to modify the logs without detection and show how effective those evasion techniques are against statistical analysis.
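The single-node scheme can be sketched as follows. This is a minimal illustration under assumed parameters (SHA-256, a 4-hex-digit difficulty, JSON block encoding), not the paper's actual framework.

```python
import hashlib
import json

DIFFICULTY = 4  # leading zero hex digits required; an illustrative choice

def block_hash(block):
    """Deterministic hash of a block's canonical JSON encoding."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def mine_block(prev_hash, log_messages):
    """Attach a proof-of-work to a batch of log messages."""
    block = {"prev": prev_hash, "logs": log_messages, "nonce": 0}
    while not block_hash(block).startswith("0" * DIFFICULTY):
        block["nonce"] += 1
    return block

def verify(chain):
    """A valid chain has correct proofs-of-work and intact hash links."""
    for i, blk in enumerate(chain):
        if not block_hash(blk).startswith("0" * DIFFICULTY):
            return False
        if i > 0 and blk["prev"] != block_hash(chain[i - 1]):
            return False
    return True

# A single "mining" node appending batches of log messages.
chain = [mine_block("0" * 64, ["service started"])]
chain.append(mine_block(block_hash(chain[0]), ["login from 10.0.0.5"]))
print(verify(chain))  # → True
```

Because each block's hash depends on its predecessor, rewriting an old log entry forces the attacker to redo the proof-of-work for every subsequent block, which is what makes after-the-fact tampering costly even on a compromised host.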
Machine learning classification for advanced malware detection
This introductory document discusses topics related to malware detection via the application
of machine learning algorithms. It is intended as a supplement to the published work
submitted (a complete list of which can be found in Table 1) and outlines the motivation
behind the experiments.
The document begins with the following sections:
• Section 2 presents a preliminary discussion of the research methodology employed.
• Section 3 presents background on malware detection in general and on the use of
machine learning in particular.
• Section 4 provides a brief introduction to the most common machine learning
algorithms in current use.
The remaining sections present the main body of the experimental work, leading to the
conclusions in Section 10.
• Section 5 analyzes different initialization strategies for machine learning models, with
a view to ensuring that the most effective training and testing strategy is employed.
Following this, a purely dynamic approach is proposed, which results in perfect
classification of the samples against benign files, and therefore provides a baseline
against which the performance of subsequent static approaches can be compared.
• Section 6 introduces the static-based tests, beginning with the challenging problem of
zero-day sample detection, i.e., malware samples for which insufficient data has yet
been gathered to train the machine learning models.
• Section 7 describes the testing of several different approaches to static malware
detection. During these tests, the effectiveness of these algorithms is analyzed and
compared with other means of classification.
• Section 8 proposes and compares techniques to boost the detection accuracy by
combining the scores obtained from other detection algorithms, with a view to
improving static classification scores and thus reaching the perfect detection obtained
with dynamic features.
• Section 9 assesses the detection effectiveness of generic malware models trained on
several different families. The experiments are intended to introduce a more realistic
scenario where a single, comprehensive machine learning model is used to detect
several families. This section demonstrates the difficulty of building a single model
capable of detecting several malware families.
Robustness of Image-Based Malware Analysis
In previous work, “gist descriptor” features extracted from images have been used in malware classification problems and have shown promising results. In this research, we determine whether gist descriptors are robust with respect to malware obfuscation techniques, as compared to Convolutional Neural Networks (CNN) trained directly on malware images. Using the Python Imaging Library (PIL), we create images from malware executables and from malware that we obfuscate. We conduct experiments to compare classifying these images with a CNN as opposed to extracting the gist descriptor features from these images to use in classification. For the gist descriptors, we consider a variety of classification algorithms including k-nearest neighbors, random forest, support vector machine, and multi-layer perceptron. We find that gist descriptors are more robust than CNNs, with respect to the obfuscation techniques that we consider.
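The byte-to-image step can be sketched as below. This is a minimal pure-Python illustration of the mapping (the actual image files would be written with PIL, e.g. via `Image.frombytes`, which is omitted here); the byte string is a hypothetical stand-in for a malware executable.

```python
import math

def bytes_to_image_rows(data, width=256):
    """Map a binary's raw bytes to a fixed-width grayscale pixel matrix.

    Each byte (0-255) becomes one pixel intensity; the final row is
    zero-padded. PIL can then turn the matrix into an image file.
    """
    height = math.ceil(len(data) / width)
    padded = data + b"\x00" * (height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]

# Hypothetical stand-in for the raw bytes of an executable.
sample = bytes(range(256)) * 3
rows = bytes_to_image_rows(sample, width=64)
print(len(rows), len(rows[0]))  # → 12 64
```

Obfuscation that inserts or reorders bytes shifts these pixel patterns, which is precisely what the robustness experiments probe.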
Malware classification using long short-term memory models
Signature and anomaly-based techniques are the quintessential approaches to malware detection. However, these techniques have become increasingly ineffective as malware has become more sophisticated and complex. Researchers have therefore turned to deep learning to construct better performing models. In this paper, we create four different long short-term memory (LSTM) based models and train each to classify malware samples from 20 families. Our features consist of opcodes extracted from malware executables. We employ techniques used in natural language processing (NLP), including word embedding and bidirectional LSTMs (biLSTM), and we also use convolutional neural networks (CNN). We find that a model consisting of word embedding, biLSTM, and CNN layers performs best in our malware classification experiments.
Black box analysis of android malware detectors
If a malware detector relies heavily on a feature that is obfuscated in a given malware sample, then the detector will likely fail to correctly classify the malware. In this research, we obfuscate selected features of known Android malware samples and determine whether these obfuscated samples can still be reliably detected. Using this approach, we discover which features are most significant for various sets of Android malware detectors, in effect, performing a black box analysis of these detectors. We find that there is a surprisingly high degree of variability among the key features used by popular malware detectors.
Hidden Markov Models with Random Restarts vs Boosting for Malware Detection
Effective and efficient malware detection is at the forefront of research into building secure digital systems. As with many other fields, malware detection research has seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has been used widely in the field of pattern matching in general, and malware detection in particular, is hidden Markov models (HMMs). HMM training is based on a hill climb, and hence we can often improve a model by training multiple times with different initial values. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained with multiple random restarts, in the context of malware detection. These techniques are applied to a variety of challenging malware datasets. We find that random restarts perform surprisingly well in comparison to boosting. Only in the most difficult "cold start" cases (where training data is severely limited) does boosting appear to offer sufficient improvement to justify its higher computational cost in the scoring phase.
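The random-restart idea can be illustrated on a toy hill climb. The bimodal objective below is an assumed stand-in for an HMM likelihood surface with multiple local maxima; it does not reproduce the paper's Baum-Welch training.

```python
import math
import random

def hill_climb(score, x, step=0.1, iters=200):
    """Greedy local search: from a given start, only the nearest local
    optimum is reachable, analogous to Baum-Welch converging from one
    random initialization."""
    for _ in range(iters):
        for cand in (x - step, x + step):
            if score(cand) > score(x):
                x = cand
    return x

def with_random_restarts(score, restarts, rng):
    """Re-run the climb from several random initial values; keep the best."""
    best = None
    for _ in range(restarts):
        x = hill_climb(score, rng.uniform(-3.0, 3.0))
        if best is None or score(x) > score(best):
            best = x
    return best

# Toy bimodal objective: global peak near x = 2, smaller local peak near x = -2.
def score(x):
    return 2.0 * math.exp(-(x - 2.0) ** 2) + math.exp(-(x + 2.0) ** 2)

best = with_random_restarts(score, restarts=8, rng=random.Random(0))
print(round(best, 2), round(score(best), 2))
```

A single run that happens to start near x = -2 gets stuck on the smaller peak; with several restarts, at least one run typically lands in the basin of the global peak, which is why restarts compete so well with boosting at a fraction of the scoring-phase cost.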
