Earth Mover’s Distance (EMD): A True Metric for Comparing Biomarker Expression Levels in Cell Populations
Changes in the frequencies of cell subsets that (co)express characteristic biomarkers, or in the levels of the biomarkers on the subsets, are widely used as indices of drug response, disease prognosis, stem cell reconstitution, etc. However, although the currently available computational “gating” tools accurately reveal subset frequencies and marker expression levels, they fail to enable statistically reliable judgements as to whether these frequencies and expression levels differ significantly between/among subject groups. Here we introduce a flow cytometry data analysis pipeline that includes the Earth Mover’s Distance (EMD) metric as a solution to this problem. EMD is well known as an informative quantitative measure of differences between distributions. We present three exemplary studies showing that EMD 1) reveals clinically relevant shifts in two markers on blood basophils responding to an offending allergen; 2) shows that ablative tumor radiation induces significant changes in the murine colon cancer tumor microenvironment; and 3) ranks immunological differences in mouse peritoneal cavity cells harvested from three genetically distinct mouse strains.
Bi-allelic loss-of-function CACNA1B mutations in progressive epilepsy-dyskinesia
The occurrence of non-epileptic hyperkinetic movements in the context of developmental epileptic encephalopathies is an increasingly recognized phenomenon. Identification of causative mutations provides an important insight into common pathogenic mechanisms that cause both seizures and abnormal motor control. We report bi-allelic loss-of-function CACNA1B variants in six children from three unrelated families whose affected members present with a complex and progressive neurological syndrome. All affected individuals presented with epileptic encephalopathy, severe neurodevelopmental delay (often with regression), and a hyperkinetic movement disorder. Additional neurological features included postnatal microcephaly and hypotonia. Five children died in childhood or adolescence (mean age at death: 9 years), mainly as a result of secondary respiratory complications. CACNA1B encodes the pore-forming subunit of the pre-synaptic neuronal voltage-gated calcium channel Cav2.2/N-type, crucial for SNARE-mediated neurotransmission, particularly in the early postnatal period. Bi-allelic loss-of-function variants in CACNA1B are predicted to cause disruption of Ca2+ influx, leading to impaired synaptic neurotransmission. The resultant effect on neuronal function is likely to be important in the development of involuntary movements and epilepsy. Overall, our findings provide further evidence for the key role of Cav2.2 in normal human neurodevelopment.
Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking
The potential of the diverse chemistries present in natural products (NP) for biotechnology and medicine remains untapped because NP databases are not searchable with raw data and the NP community has no way to share data other than in published papers. Although mass spectrometry techniques are well-suited to high-throughput characterization of natural products, there is a pressing need for an infrastructure to enable sharing and curation of data. We present Global Natural Products Social molecular networking (GNPS, http://gnps.ucsd.edu), an open-access knowledge base for community-wide organization and sharing of raw, processed or identified tandem mass (MS/MS) spectrometry data. In GNPS, crowdsourced curation of freely available community-wide reference MS libraries will underpin improved annotations. Data-driven social-networking should facilitate identification of spectra and foster collaborations. We also introduce the concept of ‘living data’ through continuous reanalysis of deposited data.
EMD score increases linearly with the growing separation between two populations.
<p>Panel (a) of Fig 1 shows two normal distributions: a large population (black) and a smaller population (green). The green population starts with its mean at the same position as the black population, and its mean then shifts along the x axis in fixed increments (2 standard deviations) in each of the successive panels. At each step, we calculate the probability binning (PB) statistic (T (χ)) [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref007" target="_blank">7</a>], which is based on p-values, and the Earth Mover’s Distance (EMD, described in detail in the Results section) between the “unstimulated” first panel in (a) and the joint distribution of the main (black) population with the stimulated population (green). As the green population moves further from the black population, both the PB and the EMD increase monotonically. However, once the green population is more than 2 standard deviations from the black population, the PB plateaus because the two distributions have reached “maximum” separation according to the PB statistic. No additional movement of the green population provides further evidence about the hypothesis that these two populations are the same, even though a larger separation clearly carries biologically relevant information. In contrast, EMD continues to increase linearly with the growing separation of the green population. This example illustrates the two shortcomings of using p-values to quantitate change: even a small change that may not be meaningful can be highly significant and thus produce a large value of the PB statistic, and larger changes may not increase this statistic further, although from a biological point of view it would be desirable for them to do so. While the data for this figure were generated synthetically, one can imagine an experiment in which increasing amounts of a drug are applied, causing a subset of cells to increase expression of a marker in proportion to the amount of the drug.
In order to correlate the amount of drug with the level of expression in a reliable fashion, one needs a true distance metric to measure the magnitude of the change in distributions. This figure appeared originally in the PhD thesis written by one of the authors (Noah Zimmerman), which was accepted in 2011 by Stanford University and is available online [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref014" target="_blank">14</a>]. Reprinted from [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref014" target="_blank">14</a>] under a CC BY license, with permission from Noah Zimmerman, original copyright 2011.</p>
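The linear growth described in this caption can be reproduced with a minimal one-dimensional sketch in Python. It is an illustration under stated assumptions, not the authors' pipeline: it assumes equal-size samples, for which the 1-D EMD reduces to the mean absolute difference between sorted values, and it moves an assumed 20% of events to the right in steps of 2 standard deviations. The names `emd_1d` and `shifted_mixture` are hypothetical helpers introduced here.

```python
import random


def emd_1d(xs, ys):
    """1-D EMD between two equal-size samples.

    Assumption of this sketch: with equal sample sizes, the optimal transport
    plan pairs the sorted values, so EMD is the mean absolute difference.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def shifted_mixture(base, fraction, shift):
    """Copy of `base` in which the first `fraction` of events moves right by `shift`."""
    k = int(round(fraction * len(base)))
    return [x + shift for x in base[:k]] + base[k:]


rng = random.Random(0)  # fixed seed so the synthetic sample is reproducible
base = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # "unstimulated" population, SD = 1

# Move 20% of events (assumed fraction) by 0, 2, 4, 6 and 8 standard deviations.
scores = [emd_1d(base, shifted_mixture(base, 0.2, 2.0 * step)) for step in range(5)]

for step, score in enumerate(scores):
    print(f"shift = {2 * step} SD, EMD = {score:.3f}")
```

Under these assumptions each score equals the shifted fraction multiplied by the shift, so the score grows by the same amount at every step, which is the linear behaviour the caption attributes to EMD.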
EMD scores detect relatively small differences between wild-type strains and detect the much larger differences between the wild-type mice and knockout mice.
<p><b>(a)</b> We used the following gating strategy (according to [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref026" target="_blank">26</a>]): dead−/FSC-A→FSC-H/FSC-A. We then performed the EMD comparison for the expression levels of the CD19/CD5 biomarkers on peritoneal cells among three mouse strains. EMD values represent the EMD scores between the first BALB/c sample (reference) and the 3 subsequent samples, including a replicate for BALB/c cells. <b>(b)</b> Mouse spleen cells from BALB/c and C57BL/6 mice were stained and analyzed using 13-parameter high-dimensional FACS. We then used the following gating strategy: FSC-H/FSC-A (singlets)→dead−/SSC-A (live). Each plot displays the expression level of CD8 and CD25 for spleen cells from the corresponding mouse strain. In this figure, a replicate is the same sample run several times. EMD scores for inter-sample variability are significantly lower than EMD scores for inter-strain variability.</p>
EMD scores based on expression of two independent flow cytometry markers more accurately distinguish allergic (CF-ABPA) from non-allergic (CF) patients.
<p><b>(a)</b> This panel compares EMD scores for the combined expression of CD203c and CD63 with “classical” median fluorescence intensity (MFI) values computed separately for the expression of each marker and with MFI values computed for the combined expression of CD203c and CD63. <b>(b)</b> Performance comparison between EMD and two other representative “metrics”: one based on a test statistic, Chi-Square (ChS) [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref002" target="_blank">2</a>], and the other a distance measure, Mahalanobis Distance (MD). All six measures were calculated relative to each sample’s unstimulated control. Data are shown for 20/45 CF patients (10 with CF-ABPA and 10 with only CF) drawn from a previously published CF-ABPA study [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref019" target="_blank">19</a>, <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref023" target="_blank">23</a>] and selected as described in Materials and Methods. For each CF (n = 10) and CF-ABPA (n = 10) patient sample, we calculated the EMD on the CD63 and CD203c channels between the unstimulated control and the sample stimulated with the <i>A</i>. <i>fumigatus</i> allergen/extract. Using the SVM method, we then defined thresholds (red dashed lines at 123.67 for MFI CD203c, 112.75 for MFI CD63, 0.92 for MD CD203c/CD63, 426.66 for ChS CD203c/CD63 and 0.03 for EMD CD203c/CD63) to distinguish allergic/positive responses from non-allergic/negative responses; i.e., we tried all possible combinations of 5 CF and 5 CF-ABPA patients and used the scores in each case as a training set to find an SVM threshold that divides the dataset into two categories (CF and CF-ABPA) with the lowest possible misclassification rate.</p>
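The thresholding procedure in this caption can be sketched in Python as follows. This is a simplified stand-in, not the SVM used in the study: it scans the midpoints between adjacent scores and keeps the cut with the fewest training misclassifications, which for perfectly separable one-dimensional scores coincides with the maximum-margin (hard-margin linear SVM) threshold. The scores are invented toy values, the groups hold 6 samples each and are split 3 + 3 rather than 5 + 5 to keep the run short, and `best_threshold` is a hypothetical helper.

```python
from itertools import combinations


def best_threshold(neg, pos):
    """Return (threshold, errors) for the midpoint cut with the fewest
    training misclassifications, calling score > threshold positive."""
    values = sorted(set(neg) | set(pos))
    best = None
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2.0
        # False positives among negatives plus false negatives among positives.
        errors = sum(s > t for s in neg) + sum(s <= t for s in pos)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


# Invented toy scores for illustration only, NOT data from the study.
non_allergic = [0.010, 0.015, 0.020, 0.022, 0.025, 0.028]
allergic = [0.035, 0.040, 0.050, 0.060, 0.080, 0.100]

# Fit a threshold on every 3 + 3 training split (the caption uses 5 + 5).
splits = []
for neg_train in combinations(non_allergic, 3):
    for pos_train in combinations(allergic, 3):
        splits.append(best_threshold(neg_train, pos_train))

# Threshold fitted on all toy samples at once.
threshold_all, errors_all = best_threshold(non_allergic, allergic)
print(len(splits), threshold_all, errors_all)
```

How the per-split thresholds are combined into the single value drawn in the figure is not stated in the caption, so the sketch only collects them.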
Analysis of basophil activation status.
<p><b>(a)</b> To identify basophils, we used the following gating sequence (shown in the figure by the red arrows): FSC-A/SSC-A (total white blood cells)→ FSC-A/FSC-H (singlets)→ CD41a/live/dead (CD41a−live)→ Dump [CD3, CD66b, HLA-DR]/CD123 (Dump−, CD123++)→ use EMD with CD203c/CD63 [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref019" target="_blank">19</a>] to determine basophil activation status. <b>(b)</b> An example of the basophil response to stimulation with the <i>A</i>. <i>fumigatus</i> allergen. Here, MFI represents median fluorescence intensity.</p>
EMD clearly differentiates allergic (n = 13) and non-allergic (n = 9) groups of samples, while PB and MD cannot differentiate them with comparable sensitivity and specificity.
<p>This figure appeared originally in the PhD thesis written by one of the authors (Noah Zimmerman), which was accepted in 2011 by Stanford University and is available online [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref014" target="_blank">14</a>]. The data come from a peanut allergy study by Gernez et al. [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref031" target="_blank">31</a>]. Fold change is calculated as the ratio of the EMD (PB or MD) between the unstimulated sample and the sample stimulated with the offending allergen to the EMD (PB or MD) between the unstimulated sample and the sample stimulated with a non-offending allergen. Reprinted from [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151859#pone.0151859.ref014" target="_blank">14</a>] under a CC BY license, with permission from Noah Zimmerman, original copyright 2011.</p>
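The fold-change normalization in this caption amounts to a single ratio, sketched here in Python. The function name `fold_change`, the guard against a non-positive denominator, and the example values are assumptions added for illustration; they are not taken from the study.

```python
def fold_change(d_offending, d_non_offending):
    """Ratio of two distances, both measured from the unstimulated sample.

    d_offending:     distance (EMD, PB or MD) to the sample stimulated
                     with the offending allergen.
    d_non_offending: distance to the sample stimulated with a
                     non-offending allergen (the normalizer).
    """
    if d_non_offending <= 0:
        # Assumed guard: the ratio is undefined without a positive normalizer.
        raise ValueError("non-offending distance must be positive")
    return d_offending / d_non_offending


# Invented example values: the offending response is twice the control response.
example = fold_change(0.5, 0.25)
print(example)
```

A fold change near 1 would mean the offending allergen moves the distribution no more than the non-offending one does.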
Combined EMD score for three pairs of biomarkers distinguishes mice that received tumor radiotherapy from untreated tumor-bearing mice.
<p>Tumor-infiltrating cells from tumor-bearing mice that received tumor radiotherapy (n = 4, red dots) or were untreated (n = 3, blue dots) were gated according to three different strategies: CD4/CD8 expression on live lymphocytes (dead−/SSC-A; FSC-A/SSC-A); CD25/CD4 expression on B220−/CD4hi live lymphocytes; and CD25/CD8 expression on live lymphocytes. We then calculated three EMD scores for the following combinations of biomarkers and cell populations: CD8/CD4 expression on lymphocytes, CD25/CD4 expression on B220−/CD4hi lymphocytes, and CD25/CD8 expression on lymphocytes. The EMD scores were calculated relative to the control sample (a mouse that did not receive tumor radiotherapy). The threshold (plane) was defined using the SVM method.</p>
Data analysis workflow for application of EMD to flow cytometry data.
<p>Flow data were collected with H-D flow instruments available in the Stanford Shared FACS Facility and preprocessed with AutoGate software (freely available at <a href="http://CytoGenie.org" target="_blank">http://CytoGenie.org</a>). The third step (classification analysis) was applied only to the first two studies described in the Results section.</p>
