419 research outputs found
Semantic Source Code Models Using Identifier Embeddings
The emergence of online open source repositories in the recent years has led
to an explosion in the volume of openly available source code, coupled with
metadata that relate to a variety of software development activities. As an
effect, in line with recent advances in machine learning research, software
maintenance activities are switching from symbolic formal methods to
data-driven methods. In this context, the rich semantics hidden in source code
identifiers provide opportunities for building semantic representations of code
which can assist tasks of code search and reuse. To this end, we deliver in the
form of pretrained vector space models, distributed code representations for
six popular programming languages, namely, Java, Python, PHP, C, C++, and C#.
The models are produced using fastText, a state-of-the-art library for learning
word representations. Each model is trained on data from a single programming
language; the code mined for producing all models amounts to over 13.000
repositories. We indicate dissimilarities between natural language and source
code, as well as variations in coding conventions in between the different
programming languages we processed. We describe how these heterogeneities
guided the data preprocessing decisions we took and the selection of the
training parameters in the released models. Finally, we propose potential
applications of the models and discuss limitations of the models.Comment: 16th International Conference on Mining Software Repositories (MSR
2019): Data Showcase Trac
Detecting Missing Dependencies and Notifiers in Puppet Programs
Puppet is a popular computer system configuration management tool. It
provides abstractions that enable administrators to setup their computer
systems declaratively. Its use suffers from two potential pitfalls. First, if
ordering constraints are not specified whenever an abstraction depends on
another, the non-deterministic application of abstractions can lead to race
conditions. Second, if a service is not tied to its resources through
notification constructs, the system may operate in a stale state whenever a
resource gets modified. Such faults can degrade a computing infrastructure's
availability and functionality.
We have developed an approach that identifies these issues through the
analysis of a Puppet program and its system call trace. Specifically, we
present a formal model for traces, which allows us to capture the interactions
of Puppet abstractions with the file system. By analyzing these interactions we
identify (1) abstractions that are related to each other (e.g., operate on the
same file), and (2) abstractions that should act as notifiers so that changes
are correctly propagated. We then check the relationships from the trace's
analysis against the program's dependency graph: a representation containing
all the ordering constraints and notifications declared in the program. If a
mismatch is detected, our system reports a potential fault.
We have evaluated our method on a large set of Puppet modules, and discovered
57 previously unknown issues in 30 of them. Benchmarking further shows that our
approach can analyze in minutes real-world configurations with a magnitude
measured in thousands of lines and millions of system calls
On the Feasibility of Transfer-learning Code Smells using Deep Learning
Context: A substantial amount of work has been done to detect smells in
source code using metrics-based and heuristics-based methods. Machine learning
methods have been recently applied to detect source code smells; however, the
current practices are considered far from mature. Objective: First, explore the
feasibility of applying deep learning models to detect smells without extensive
feature engineering, just by feeding the source code in tokenized form. Second,
investigate the possibility of applying transfer-learning in the context of
deep learning models for smell detection. Method: We use existing metric-based
state-of-the-art methods for detecting three implementation smells and one
design smell in C# code. Using these results as the annotated gold standard, we
train smell detection models on three different deep learning architectures.
These architectures use Convolution Neural Networks (CNNs) of one or two
dimensions, or Recurrent Neural Networks (RNNs) as their principal hidden
layers. For the first objective of our study, we perform training and
evaluation on C# samples, whereas for the second objective, we train the models
from C# code and evaluate the models over Java code samples. We perform the
experiments with various combinations of hyper-parameters for each model.
Results: We find it feasible to detect smells using deep learning methods. Our
comparative experiments find that there is no clearly superior method between
CNN-1D and CNN-2D. We also observe that performance of the deep learning models
is smell-specific. Our transfer-learning experiments show that
transfer-learning is definitely feasible for implementation smells with
performance comparable to that of direct-learning. This work opens up a new
paradigm to detect code smells by transfer-learning especially for the
programming languages where the comprehensive code smell detection tools are
not available
Identifying Bugs in Make and JVM-Oriented Builds
Incremental and parallel builds are crucial features of modern build systems.
Parallelism enables fast builds by running independent tasks simultaneously,
while incrementality saves time and computing resources by processing the build
operations that were affected by a particular code change. Writing build
definitions that lead to error-free incremental and parallel builds is a
challenging task. This is mainly because developers are often unable to predict
the effects of build operations on the file system and how different build
operations interact with each other. Faulty build scripts may seriously degrade
the reliability of automated builds, as they cause build failures, and
non-deterministic and incorrect build results.
To reason about arbitrary build executions, we present buildfs, a
generally-applicable model that takes into account the specification (as
declared in build scripts) and the actual behavior (low-level file system
operation) of build operations. We then formally define different types of
faults related to incremental and parallel builds in terms of the conditions
under which a file system operation violates the specification of a build
operation. Our testing approach, which relies on the proposed model, analyzes
the execution of single full build, translates it into buildfs, and uncovers
faults by checking for corresponding violations.
We evaluate the effectiveness, efficiency, and applicability of our approach
by examining hundreds of Make and Gradle projects. Notably, our method is the
first to handle Java-oriented build systems. The results indicate that our
approach is (1) able to uncover several important issues (245 issues found in
45 open-source projects have been confirmed and fixed by the upstream
developers), and (2) orders of magnitude faster than a state-of-the-art tool
for Make builds
Improving the quality of APIs through the analysis of software crash reports
Modern programs depend on APIS to implement a significant part of their functionality. Apart from the way developers use APIS to build their software, the stability of these programs relies on the APIS design and implementation. In this work, we evaluate the reliability of APIS, by examining software telemetry data, in the form of stack traces, coming from Android application crashes. We got 4.9 GB worth of crash data that thousands of applications send to a centralized crash report management service. We processed that data to extract approximately a million stack traces, stitching together parts of chained exceptions, and established heuristic rules to draw the border between applications and API calls.
We examined 80% of the stack traces to map the space of the most common application failure reasons. Our findings show that the top ones can be attributed to memory exhaustion, race conditions or deadlocks, and missing or corrupt resources. At the same time,
a significant number of our stack traces (over 10%) remains unclassified due to generic unchecked exceptions, which do not highlight the problems that lead to crashes. Finally, given the classes of crash causes we found, we argue that API design and implementation improvements, such as specific exceptions, non-blocking algorithms, and default resources, can eliminate common failures
Commands as AI Conversations
Developers and data scientists often struggle to write command-line inputs,
even though graphical interfaces or tools like ChatGPT can assist. The
solution? "ai-cli," an open-source system inspired by GitHub Copilot that
converts natural language prompts into executable commands for various Linux
command-line tools. By tapping into OpenAI's API, which allows interaction
through JSON HTTP requests, "ai-cli" transforms user queries into actionable
command-line instructions. However, integrating AI assistance across multiple
command-line tools, especially in open source settings, can be complex.
Historically, operating systems could mediate, but individual tool
functionality and the lack of a unified approach have made centralized
integration challenging. The "ai-cli" tool, by bridging this gap through
dynamic loading and linking with each program's Readline library API, makes
command-line interfaces smarter and more user-friendly, opening avenues for
further enhancement and cross-platform applicability.Comment: 5 page
Open Source Adoption In Large US Companies
Various organizations increasingly adopt open source software, both on desktop PCs and servers. Since the first movements in open source in the 1960’s its growth has lead to new approaches in software development, licensing, and distribution, as well as in software vendors’ business models. The literature includes very interesting studies regarding prospective benefits, business models and case studies. However, the adoption of open source in large, global companies and its relationship with factors such as profitability, revenues and industry sector has not yet been researched. This study aims to answer these questions based on data we collected from Fortune 1000 companies and provides a method that can be applied in similar contexts
Definitions of a Software Smell
Many authors have defined smells from their perspective. This document attempts to provide a consolidated list of such definitions
Global software development in the freeBSD project
Freebsd is a sophisticated operating system developed and maintained as open-source software by a team of more than 350 individuals located throughout the world. This study uses developer location data, the configuration management repository, and records from the issue database to examine the extent of global development and its effect on produc-tivity, quality, and developer cooperation. The key findings are that global development allows round-the-clock work, but there are some marked differences between the type of work performed at different regions. The effects of multiple dispersed developers on the quality of code and productiv-ity are negligible. Mentoring appears to be sometimes as-sociated with developers living closer together, but ad-hoc cooperation seems to work fine across continents
- …
