64 research outputs found
Data science
Even though it has only entered public perception relatively recently, the term "data science" already means many things to many people. This chapter explores both top-down and bottom-up views on the field, on the basis of which we define data science as "a unique blend of principles and methods from analytics, engineering, entrepreneurship and communication that aim at generating value from the data itself". The chapter then discusses the disciplines that contribute to this "blend", briefly outlining their contributions and giving pointers for readers interested in exploring their backgrounds further
The Newcomb-Benford Law in Its Relation to Some Common Distributions
An often reported, but nevertheless persistently striking observation, formalized as the Newcomb-Benford law (NBL), is that the frequencies with which the leading digits of numbers occur in a large variety of data are far away from being uniform. Most spectacular seems to be the fact that in many data the leading digit 1 occurs in nearly one third of all cases. Explanations for this uneven distribution of the leading digits were, among others, scale- and base-invariance. Little attention, however, found the interrelation between the distribution of the significant digits and the distribution of the observed variable. It is shown here by simulation that long right-tailed distributions of a random variable are compatible with the NBL, and that for distributions of the ratio of two random variables the fit generally improves. Distributions not putting most mass on small values of the random variable (e.g. symmetric distributions) fail to fit. Hence, the validity of the NBL needs the predominance of small values and, when thinking of real-world data, a majority of small entities. Analyses of data on stock prices, the areas and numbers of inhabitants of countries, and the starting page numbers of papers from a bibliography sustain this conclusion. In all, these findings may help to understand the mechanisms behind the NBL and the conditions needed for its validity. That this law is not only of scientific interest per se, but that, in addition, it has also substantial implications can be seen from those fields where it was suggested to be put into practice. These fields reach from the detection of irregularities in data (e.g. economic fraud) to optimizing the architecture of computers regarding number representation, storage, and round-off errors
Network Compression as a Quality Measure for Protein Interaction Networks
With the advent of large-scale protein interaction studies, there is much debate about data quality. Can different noise levels in the measurements be assessed by analyzing network structure? Because proteomic regulation is inherently co-operative, modular and redundant, it is inherently compressible when represented as a network. Here we propose that network compression can be used to compare false positive and false negative noise levels in protein interaction networks. We validate this hypothesis by first confirming the detrimental effect of false positives and false negatives. Second, we show that gold standard networks are more compressible. Third, we show that compressibility correlates with co-expression, co-localization, and shared function. Fourth, we also observe correlation with better protein tagging methods, physiological expression in contrast to over-expression of tagged proteins, and smart pooling approaches for yeast two-hybrid screens. Overall, this new measure is a proxy for both sensitivity and specificity and gives complementary information to standard measures such as average degree and clustering coefficients
Generating Random Floating-Point Numbers by Dividing Integers: A Case Study
International audienceA method widely used to obtain IEEE 754 binary floating-point numbers with a standard uniform distribution involves drawing an integer uniformly at random and dividing it by another larger integer. We survey the various instances of the algorithm that are used in actual software and point out their properties and drawbacks, particularly from the standpoint of numerical software testing and data anonymization
Murine monoclonal antibodies specific for conserved and non-conserved antigenic determinants of the human and murine Ku autoantigens
Vectorized algorithm for multidimensional Monte Carlo integration on modern GPU, CPU and MIC architectures
BoolTest: The Fast Randomness Testing Strategy Based on Boolean Functions with Application to DES, 3-DES, MD5, MD6 and SHA-256
Variance component estimation techniques compared for two mating designs with forest genetic architecture through computer simulation
A symmetry property for q -weighted Robinson–Schensted and other branching insertion algorithms
- …
