SourcererCC: Scaling Code Clone Detection to Big Code
Despite a decade of active research, there is a marked lack of clone
detectors that scale to very large repositories of source code, in particular
for detecting near-miss clones where significant editing activities may take
place in the cloned code. We present SourcererCC, a token-based clone detector
that targets three clone types, and exploits an index to achieve scalability to
large inter-project repositories using a standard workstation. SourcererCC uses
an optimized inverted-index to quickly query the potential clones of a given
code block. Filtering heuristics based on token ordering are used to
significantly reduce the size of the index, the number of code-block
comparisons needed to detect clones, and the number of
token comparisons needed to judge a potential clone.
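The combination described above — an inverted index from tokens to code blocks, plus an ordering-based prefix filter that prunes both the index and the candidate comparisons — can be sketched as follows. This is a minimal illustration of the general technique, not SourcererCC's implementation; the function names and the exact prefix-length formula are assumptions for this sketch.

```python
from collections import defaultdict

def build_index(blocks, threshold=0.8):
    """Index each code block by only a prefix of its globally ordered tokens.

    Prefix filtering: if two blocks must share at least threshold * n tokens
    to be clones, a candidate pair must share at least one token from each
    block's short prefix, so the rest of the tokens need not be indexed.
    (Illustrative sketch; not SourcererCC's exact formula.)
    """
    index = defaultdict(set)
    for block_id, tokens in blocks.items():
        ordered = sorted(set(tokens))  # canonical global token ordering
        prefix_len = len(ordered) - int(threshold * len(ordered)) + 1
        for tok in ordered[:prefix_len]:
            index[tok].add(block_id)
    return index

def query(index, blocks, tokens, threshold=0.8):
    """Return ids of blocks whose token overlap with `tokens` meets the threshold."""
    ordered = sorted(set(tokens))
    prefix_len = len(ordered) - int(threshold * len(ordered)) + 1
    candidates = set()
    for tok in ordered[:prefix_len]:          # only prefix tokens are queried
        candidates |= index.get(tok, set())
    results = []
    for cid in candidates:
        other = set(blocks[cid])
        overlap = len(set(ordered) & other)
        # verify the candidate with a full overlap check
        if overlap >= threshold * max(len(ordered), len(other)):
            results.append(cid)
    return results
```

In this sketch, only the candidates surfaced by the prefix lookup ever reach the full token-overlap check, which is the source of the comparison savings the abstract describes.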
We evaluate the scalability, execution time, recall and precision of
SourcererCC, and compare it to four publicly available and state-of-the-art
tools. To measure recall, we use two recent benchmarks, (1) a large benchmark
of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of
thousands of fine-grained artificial clones. We find SourcererCC has both high
recall and precision, and is able to scale to a large inter-project repository
(250MLOC) using a standard workstation.
Comment: Accepted for publication at ICSE'16 (preprint, unrevised).
Large-Scale Clone Detection and Benchmarking
Code clones are pairs of code fragments that are similar. They are created when developers re-use code by copy and paste, although clones are known to occur for a variety of reasons. Clones have a negative impact on software quality and development by leading to the needless duplication of software maintenance and evolution efforts, causing the duplication and propagation of existing bugs throughout a software system, and even leading to new bugs when duplicate code is not evolved in parallel. It is important that developers detect their clones so that they can be managed and their harm mitigated. This need has been recognized by the many clone detectors available in the literature. Additionally, clone detection in large-scale inter-project repositories has been shown to have many potential applications such as mining for new APIs, license violation detection, similar application detection, code completion, API recommendation and usage support, and so on.
Despite this great interest in clone detection, there has been very little evaluation of the performance of clone detection tools, including the creation of clone benchmarks. Moreover, very few clone detectors have been proposed for the large-scale inter-project use cases. In particular, the existing large-scale clone detectors require extraordinary hardware and long execution times, lack support for common clone types, and are not adaptable to target and explore the emerging large-scale inter-project use-cases. Nor could any of the existing benchmarks evaluate clone detection for these scenarios.
We address these problems in this thesis by introducing new clone benchmarks using both synthetic and real clone data, including a benchmark for evaluating the large-scale inter-project use-case. We use these benchmarks to conduct comprehensive tool evaluation and comparison studies of the state-of-the-art tools. We introduce a new clone detector for fast, scalable and user-guided detection in large inter-project datasets, which we extensively evaluate using our benchmarks and compare against the state of the art.
In the first part of this thesis, we introduce a synthetic clone benchmark we call the Mutation and Injection Framework which measures the recall of clone detection tools at a very fine granularity using artificial clones in a mutation-analysis procedure. We use the Mutation Framework to evaluate the state of the art clone detectors, and compare its results against the previous clone benchmarks. We demonstrate that the Mutation Framework enables accurate, precise and bias-free clone benchmarking experiments, and show that the previous benchmarks are outdated and inappropriate for evaluating modern clone detection tools. We also show that the Mutation Framework can be adapted with custom mutation operators to evaluate tools for any kind of clone.
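The mutation-analysis idea above rests on operators that each apply one small, known edit to a code fragment, producing an artificial clone whose type is known by construction. The two operators below are hypothetical illustrations of that style (roughly, a Type-2-like rename and a Type-3-like statement insertion); their names and signatures are assumptions for this sketch, not the framework's actual operators.

```python
import re

def rename_identifier(code: str, old: str, new: str) -> str:
    """Type-2-style mutation: systematically rename one identifier.

    Uses word boundaries so only whole-identifier occurrences are renamed.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

def insert_statement(code: str, stmt: str, line_no: int) -> str:
    """Type-3-style mutation: inject an extra statement at a given line."""
    lines = code.splitlines()
    lines.insert(line_no, stmt)
    return "\n".join(lines)
```

A mutated fragment and its original then form a clone pair with a known ground-truth type, which is what lets recall be measured at fine granularity: a tool either reports the injected pair or it does not.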
In the second part of this thesis, we introduce BigCloneBench, a large benchmark of 8 million real clones in a large inter-project source dataset (IJaDataset: 25K projects, 250MLOC). We built this benchmark by mining IJaDataset for functions implementing commonly needed functionalities. This benchmark can evaluate clone detection tools for all types of clones, for intra-project vs inter-project clones, for semantic clones, and for clones across the entire spectrum of syntactical similarity. It is also the only benchmark capable of evaluating clone detectors for the emerging large-scale inter-project clone detection use-case. We use this benchmark to thoroughly evaluate the state-of-the-art tools, and demonstrate why both synthetic (Mutation Framework) and real-world (BigCloneBench) benchmarks are needed.
In the third part of this thesis, we explore the scaling of clone detection to large inter-project source datasets. In our first study we introduce the Shuffling Framework, a strategy for scaling existing, natively non-scalable clone detection tools to large-scale inter-project datasets, but at the cost of reduced recall and the need for a small compute cluster. The Shuffling Framework exploits non-deterministic input partitioning, partition shuffling, inverted clone indexing and coarse-grained similarity metrics to achieve scalability. In our second study, we introduce our premier large-scale clone detection tool, CloneWorks, which enables fast, scalable and user-guided clone detection in large-scale inter-project datasets. CloneWorks achieves fast and scalable clone detection on an average personal workstation using the Jaccard similarity coefficient, the sub-block filtering heuristic, an inverted clone index, and an index-based input partitioning heuristic. CloneWorks is one of the only tools to scale to an inter-project dataset of 250MLOC on an average workstation, and has the fastest detection time at just 2-10 hours, while also achieving the best recall and precision as per our clone benchmarks. CloneWorks uses a user-guided approach, which gives the user full control over the transformations applied to their source code before clone detection in order to target any type or kind of clones. CloneWorks includes transformations such as tunable pretty-printing, adaptable identifier renaming, syntax abstraction and filtering, and can be extended by a plug-in architecture. Through scenarios and case studies we evaluate this user-guided aspect, and find it is adaptable and has high precision.
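The similarity measure named above, the Jaccard coefficient, compares two code fragments by their shared tokens relative to all tokens either contains. A minimal sketch over token multisets (bags), assuming the helper name is illustrative rather than CloneWorks' actual API:

```python
from collections import Counter

def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient over token multisets: |A ∩ B| / |A ∪ B|.

    Counter's & and | give multiset intersection (elementwise min) and
    union (elementwise max), so repeated tokens are counted properly.
    """
    a, b = Counter(tokens_a), Counter(tokens_b)
    intersection = sum((a & b).values())
    union = sum((a | b).values())
    return intersection / union if union else 1.0
```

A pair of fragments is then reported as a clone when this value meets a user-chosen threshold; filtering heuristics such as the sub-block filter exist precisely to avoid computing this coefficient for most pairs.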
Efficiently Measuring an Accurate and Generalized Clone Detection Precision using Clone Clustering
An important measure of clone detection performance is precision. However, there has been a marked lack of research into methods of efficiently and accurately measuring the precision of a clone detection tool. Instead, tool authors simply validate a small random sample of the clones their tools detected in a subject software system. Since there could be many thousands of clones reported by the tool, such a small random sample cannot guarantee an accurate and generalized measure of the tool's precision for all the varieties of clones that can occur in any arbitrary software system. In this paper, we propose a machine-learning based approach that can cluster similar clones together, and which can be used to maximize the variety of clones examined when measuring precision, while significantly reducing the biases a specific subject system has on the generality of the precision measured. Our technique reduces the effort of measuring precision, while doubling the variety of clones validated and reducing biases that harm the generality of the measure by up to an order of magnitude. Our case study with the NiCad clone detector and the Java class library shows that our approach is effective in efficiently measuring an accurate and generalized precision of a subject clone detection tool.
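The core idea above — cluster similar detected clones, then validate one representative per cluster so a fixed validation budget covers more clone varieties — can be sketched with a simple greedy clustering over feature vectors. The cosine measure, the greedy assignment, and all names here are illustrative assumptions, not the paper's actual machine-learning method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_cluster(vectors, threshold=0.9):
    """Assign each clone's feature vector to the first sufficiently similar
    cluster, or start a new cluster; returns lists of vector indices."""
    clusters, reps = [], []  # reps: first vector seen in each cluster
    for i, v in enumerate(vectors):
        for members, rep in zip(clusters, reps):
            if cosine(v, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
            reps.append(v)
    return clusters

def sample_for_validation(clusters):
    """Pick one clone per cluster: same validation effort, more variety."""
    return [members[0] for members in clusters]
```

Validating the sampled representatives, instead of a uniform random sample, is what spreads the manual effort across clone varieties rather than concentrating it on whichever variety dominates the subject system.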
