A Spark-based workflow for probabilistic record linkage of healthcare data
ABSTRACT Several areas, such as science, economics, finance, business intelligence, and health, are exploring big data as a way to produce new information, make better decisions, and advance their related technologies and systems. In health specifically, big data is a challenging problem because of the poor quality of data in some circumstances and the need to retrieve, aggregate, and process huge amounts of data from disparate databases. In this work, we focus on the Brazilian Public Health System and on large databases from the Ministry of Health and the Ministry of Social Development and Hunger Alleviation. We present our Spark-based approach to data processing and probabilistic record linkage of these databases in order to produce very accurate data marts. These data marts are used by statisticians and epidemiologists to assess the effectiveness of conditional cash transfer programs for poor families with respect to the occurrence of some diseases (tuberculosis, leprosy, and AIDS). The case study we carried out as a proof of concept shows good performance with accurate results. For comparison, we also discuss an OpenMP-based implementation.
Examining the quality of record linkage process using nationwide Brazilian administrative databases to build a large birth cohort.
BACKGROUND: Research using linked routine population-based data collected for non-research purposes has increased in recent years because such data are a rich and detailed source of information. The objective of this study is to present an approach to prepare and link data from administrative sources in a middle-income country, to estimate the quality of the linkage, and to identify potential sources of bias by comparing linked and non-linked individuals. METHODS: We linked two Brazilian administrative datasets covering the period 2001 to 2015: the Live Birth Information System and the 100 Million Brazilian Cohort (created using administrative records from over 114 million individuals whose families applied for social assistance via the Unified Register for Social Programmes). Linkage used maternal attributes (name, age, date of birth, and municipality of residence) and CIDACS-RL, a linkage tool developed in house. We then estimated the proportion of highly probable links and examined the characteristics of missed matches to identify any potential source of bias. RESULTS: A total of 27,699,891 live births were submitted to linkage with maternal information recorded in the baseline of the 100 Million Brazilian Cohort dataset; of those, 16,447,414 (59.4%) children were found registered in the 100 Million Brazilian Cohort dataset. The proportion of highly probable links ranged from 39.3% in 2001 to 82.1% in 2014. A substantial improvement in linkage was observed after the maternal date of birth attribute was introduced in 2011. Our analyses indicated a slightly higher proportion of missing data among missed matches, and a higher proportion of people living in an urban area and self-declared as Caucasian among linked pairs when compared with non-linked sets. 
DISCUSSION: We demonstrated that CIDACS-RL is capable of performing high-quality linkage even with a limited number of common attributes, using indexation as a blocking strategy in large routine databases from a middle-income country. However, residual records occurred more often among people living under worse conditions. The results presented in this study reinforce the need to evaluate linkage quality and, when necessary, to take linkage error into account in the analyses of any generated dataset.
CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability
Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge challenge; hence, designing an efficient linkage tool with reasonable accuracy and scalability is required. Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold standard dataset. We also evaluated its accuracy and scalability using a case study, and its scalability and execution time using a simulated cohort in serial (single core) and multi-core (eight cores) computation settings. Results: Overall, the CIDACS-RL algorithm had superior performance: positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records. Conclusion: The CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools. 
In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors and distributed infrastructures.
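The general idea of combining an index search with a scoring step, as described in the abstract above, can be illustrated with a minimal Python sketch. This is not the CIDACS-RL algorithm and does not use Apache Lucene: the blocking key, the scoring function, the cut-off of 0.85, and the example records are all hypothetical choices made for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(rec):
    # Hypothetical key: first letter of the name plus year of birth.
    return (rec["name"][:1].lower(), rec["dob"][:4])


def build_index(records):
    """Group record ids by blocking key so only likely candidates are compared."""
    index = defaultdict(list)
    for rec_id, rec in records.items():
        index[blocking_key(rec)].append(rec_id)
    return index


def score(a, b):
    """Assumed score: mean of name similarity and exact date-of-birth agreement."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    dob_agree = 1.0 if a["dob"] == b["dob"] else 0.0
    return (name_sim + dob_agree) / 2


def link(queries, targets, cutoff=0.85):
    """Return {query_id: target_id} for the best candidate at or above the cut-off."""
    index = build_index(targets)
    links = {}
    for q_id, q in queries.items():
        candidates = index.get(blocking_key(q), [])
        best = max(candidates, key=lambda t_id: score(q, targets[t_id]), default=None)
        if best is not None and score(q, targets[best]) >= cutoff:
            links[q_id] = best
    return links


# Made-up records, for illustration only.
targets = {
    "t1": {"name": "Maria Souza", "dob": "1990-04-12"},
    "t2": {"name": "Mario Sousa", "dob": "1990-08-30"},
    "t3": {"name": "Ana Lima", "dob": "1985-01-02"},
}
queries = {
    "q1": {"name": "Maria Sousa", "dob": "1990-04-12"},
    "q2": {"name": "Joao Pereira", "dob": "1970-07-07"},
}
result = link(queries, targets)
```

In this sketch the index restricts each query to records sharing its blocking key, so the more expensive similarity score is computed for only a few candidates rather than for every pair.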
National data linkage assessment of live births and deaths in Mexico: Estimating under-five mortality rate ratios for vulnerable newborns and trends from 2008 to 2019
BACKGROUND: Linked datasets that enable longitudinal assessments are scarce in low- and middle-income countries. OBJECTIVES: We aimed to assess the linkage of administrative databases of live births and under-five child deaths to explore mortality and trends for preterm, small for gestational age (SGA) and large for gestational age (LGA) newborns in Mexico. METHODS: We linked individual-level datasets collected by national statistics from 2008 to 2019. Linkage was performed based on agreement on birthday, sex, and residential address. We used the Centre for Data and Knowledge Integration for Health software to identify the best candidate pairs based on similarity. Accuracy was assessed by calculating the area under the receiver operating characteristic curve. We evaluated completeness by comparing the number of linked records with reported deaths. We described the percentage of linked records by baseline characteristics to identify potential bias. Using the linked dataset, we calculated mortality rate ratios (RR) in neonates, infants, and children under five according to gestational age, birthweight, and size. RESULTS: For the period 2008-2019, a total of 24,955,172 live births and 321,165 under-five deaths were available for linkage. We excluded 1,539,046 records (6.2%) with missing or implausible values. We successfully linked 231,765 deaths (72.2%; range: 57.1% in 2009 to 84.3% in 2011). The rate of neonatal mortality was higher for preterm compared with term (RR 3.83, 95% confidence interval [CI] 3.78, 3.88) and for SGA compared with appropriate for gestational age (AGA) (RR 1.22, 95% CI 1.19, 1.24). Births at <28 weeks had the highest mortality (RR 35.92, 95% CI 34.97, 36.88). LGA had no additional risk vs AGA among children under five (RR 0.92, 95% CI 0.90, 0.93). CONCLUSIONS: We demonstrated the utility of linked data to understand neonatal vulnerability and child mortality. We created a linked dataset that would be a valuable resource for future population-based research.
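The abstract above reports ratios with 95% confidence intervals but does not state how they were computed. As a hedged illustration, the following Python sketch computes a ratio of two mortality proportions with a log-scale (Wald-type) interval, which is one common textbook approach and an assumption here. The function name and the counts are hypothetical and are not taken from the study.

```python
import math


def risk_ratio(deaths_exposed, births_exposed, deaths_ref, births_ref, z=1.96):
    """Ratio of two mortality proportions with a log-scale confidence interval.

    Assumes all four counts are positive; z=1.96 gives an approximate 95% CI.
    """
    risk_exposed = deaths_exposed / births_exposed
    risk_ref = deaths_ref / births_ref
    rr = risk_exposed / risk_ref
    # Standard error of log(RR) for two independent binomial proportions.
    se_log = math.sqrt(
        1 / deaths_exposed - 1 / births_exposed + 1 / deaths_ref - 1 / births_ref
    )
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical counts: 300 deaths among 10,000 preterm births
# versus 800 deaths among 100,000 term births.
rr, ci_low, ci_high = risk_ratio(300, 10_000, 800, 100_000)
```

An interval that excludes 1 suggests that mortality differs between the two groups under the assumptions of this approximation.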
Cohort profile: the 100 million Brazilian cohort
The creation of The 100 Million Brazilian Cohort was motivated by the availability of high quality but dispersed social and health databases in Brazil and the need to integrate data and evaluate the impact of policies aiming to improve the social determinants of health (e.g. social protection policies) on health outcomes, overall and in subgroups of interest in a dynamic cohort.
• The baseline of The 100 Million Brazilian Cohort comprises 131 697 800 low-income individuals in 35 358 415 families from 2011 to 2018. The Cohort population is mostly composed of children and young adults, with a higher proportion of females than the general Brazilian population, who identify themselves as Brown and live in the urban area of the country.
• Exposure to social protection and the follow-up of individuals are obtained through: (i) deterministic linkage using the Social Identification Number (NIS) to link the Cohort baseline to social protection programmes and to periodically renewed socioeconomic information in Cadastro Único datasets; and/or (ii) non-deterministic linkage using the CIDACS-RL non-deterministic linkage tool, to link the Cohort baseline to administrative health care datasets such as mortality (Mortality Information System, SIM), disease notification (Information System for Notifiable Diseases, SINAN), birth information (Live Birth Information System, SINASC) and nutrition status (Food and Nutrition Surveillance System, SISVAN).
• So far, studies have used The 100 Million Brazilian Cohort to investigate the socioeconomic and demographic determinants of leprosy, leprosy treatment outcomes and low birthweight and to evaluate the impact of the Bolsa Familia Programme (BFP) on leprosy and child mortality. Other studies are now being conducted that are of utmost relevance to the health inequalities of Brazil and many low- and middle-income countries, and many research opportunities are being opened up with the linkage of a range of health outcomes
Administrative Data Linkage in Brazil: Potentials for Health Technology Assessment.
Health technology assessment (HTA) is the systematic evaluation of the properties and impacts of health technologies and interventions. In this article, we presented a discussion of HTA and its evolution in Brazil, as well as a description of secondary data sources available in Brazil with potential applications to generate evidence for HTA and policy decisions. Furthermore, we highlighted record linkage, ongoing record linkage initiatives in Brazil, and the main linkage tools developed and/or used with Brazilian data. Finally, we discussed the challenges and opportunities of using secondary data for research in the Brazilian context. In conclusion, we emphasized the availability of high-quality data and an open, modern attitude toward the use of data for research and policy. This is supported by a rigorous but enabling legal framework that will allow the conduct of large-scale observational studies to evaluate the clinical, economic, and social impacts of health technologies and social policies.
Correlação probabilística implementada em Spark para big data em saúde (Probabilistic record linkage implemented in Spark for big data in health)
The application of probabilistic record linkage techniques to the health or socioeconomic records of a population has become common practice among epidemiologists as a basis for their non-experimental research. However, the growth in data volume that characterizes the Big Data setting has created a need for computational tools capable of handling these immense repositories. This work describes a solution implemented in the Spark cluster-processing framework for the probabilistic linkage of records from large databases of the Brazilian Public Health System. The work is part of a project that aims to analyze the relationship between the Bolsa Família Program and the incidence of poverty-related diseases such as leprosy and tuberculosis. The results show that this implementation delivers quality that is competitive with other existing tools and approaches, as evidenced by its superior execution-time metrics.
A Novel Distance Measure for Heterogeneous Data: Time Series and Non-Temporal Data
Among machine learning techniques, distance (or similarity) measures are used to calculate the proximity of objects in a dataset. By employing such a measure, it is possible to generate "clusters" in unsupervised learning techniques or classify the objects in supervised learning techniques. In general, these measures are designed for only one type of data. Datasets from real-world applications can comprise a mixture of data types, thus requiring different approaches for identifying patterns and groups. The literature centers on three main approaches for dealing with heterogeneous data: i) using a single distance measure for all data types, ii) using specific measures for each data type, or iii) converting all data types to a single type and then applying the first approach. However, applying machine learning techniques to a dataset with time series and non-temporal data is not trivial, because temporal data can have different behaviors that influence distance measures. Therefore, this work proposes a measure that enables the calculation of the distance between objects comprising time series and numerical data features. To develop this measure, we first sought to identify and analyze existing works on mixed data clustering approaches involving temporal data. Then, we combined measures, each designed for a single data type, to generate a single measure for time series and non-temporal data.
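The second approach listed above (a specific measure per data type, combined into one value) can be sketched in Python as follows. This is not the measure proposed in the paper, whose definition the abstract does not give. The sketch assumes dynamic time warping (DTW) for the time series part, Euclidean distance for the numerical part, and a hypothetical `weight` parameter to balance the two.

```python
import math


def dtw_distance(s, t):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning s[:i] with t[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mixed_distance(obj_a, obj_b, weight=0.5):
    """Weighted sum of a time series distance and a non-temporal distance.

    `weight` (hypothetical) is the share given to the time series component.
    """
    d_series = dtw_distance(obj_a["series"], obj_b["series"])
    d_static = euclidean(obj_a["static"], obj_b["static"])
    return weight * d_series + (1 - weight) * d_static


# Made-up objects: one time series plus two numerical features each.
a = {"series": [1.0, 2.0, 3.0], "static": [0.0, 0.0]}
b = {"series": [1.0, 2.0, 2.0, 3.0], "static": [3.0, 4.0]}
d = mixed_distance(a, b)
```

In practice the two components would usually need to be scaled to comparable ranges before being combined, otherwise one of them can dominate the sum.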
A trainable model to assess the accuracy of probabilistic record linkage
Record linkage (RL) is the process of identifying and linking data that relate to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment: the process of manually verifying and validating matched pairs to further refine linkage parameters and increase overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer's intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from huge Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from a 10-fold cross-validation method. Results show that logistic regression outperforms the other classifiers and enables the creation of a generalized, very accurate model to validate linkage results. Funding: Bill & Melinda Gates Foundation, The Royal Society (UK), Wellcome Trust (UK), Medical Research Council (UK), CNPq.
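The evaluation metrics named in the abstract above (sensitivity, specificity and positive predictive value) can be computed from a set of scored pairs with known match status, as in the following minimal Python sketch. The similarity scores, the true match labels, and the cut-off values are made up for illustration, and the sketch assumes each denominator is non-zero.

```python
def classify(scores, cutoff):
    """Label a pair as a match when its similarity score reaches the cut-off."""
    return [s >= cutoff for s in scores]


def linkage_metrics(predicted, truth):
    """Sensitivity, specificity and positive predictive value from boolean labels.

    Assumes at least one true match, one true non-match and one predicted match.
    """
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Hypothetical similarity scores and manually reviewed match status.
scores = [0.95, 0.91, 0.88, 0.60, 0.40, 0.97, 0.30, 0.89]
truth = [True, True, False, False, False, True, False, True]

strict = linkage_metrics(classify(scores, 0.90), truth)
lenient = linkage_metrics(classify(scores, 0.85), truth)
```

Comparing the two results shows the usual trade-off: lowering the cut-off recovers more true matches (higher sensitivity) at the cost of accepting more false ones (lower specificity and positive predictive value).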
