Stochastic proofreading mechanism alleviates crosstalk in transcriptional regulation
Gene expression is controlled primarily by interactions between transcription
factor proteins (TFs) and the regulatory DNA sequence, a process that can be
captured well by thermodynamic models of regulation. These models, however,
neglect regulatory crosstalk: the possibility that non-cognate TFs could
initiate transcription, with potentially disastrous effects for the cell. Here
we estimate the importance of crosstalk, suggest that its avoidance strongly
constrains equilibrium models of TF binding, and propose an alternative
non-equilibrium scheme that implements kinetic proofreading to suppress
erroneous initiation. This proposal is consistent with the observed covalent
modifications of the transcriptional apparatus and would predict increased
noise in gene expression as a tradeoff for improved specificity. Using
information theory, we quantify this tradeoff to find when optimal proofreading
architectures are favored over their equilibrium counterparts. Comment: 5 pages, 3 figures
IST Austria Thesis
Single cells constantly interact with their environment and with each other. More importantly, the accurate perception of environmental cues is crucial for growth, survival, and reproduction. This communication between cells and their environment can be formalized in mathematical terms and quantified as the information flow between them, as prescribed by information theory.
The recent availability of real-time dynamical patterns of signaling molecules in single cells has allowed us to identify how the identity of the environment is encoded in the time series. However, efficient estimation of the information transmitted by these signals has been a data-analysis challenge due to the high dimensionality of the trajectories and the limited number of samples. In the first part of this thesis, we develop and evaluate decoding-based estimation methods to lower bound the mutual information and derive model-based precise information estimates for biological reaction networks governed by the chemical master equation. This is followed by applying the decoding-based methods to study the intracellular representation of extracellular changes in budding yeast, by observing the transient dynamics of nuclear translocation of 10 transcription factors in response to 3 stress conditions. Additionally, we apply these estimators to previously published data on ERK and Ca2+ signaling and yeast stress response. We argue that this single-cell decoding-based measure of information provides an unbiased, quantitative and interpretable measure for the fidelity of biological signaling processes.
Finally, in the last section, we deal with gene regulation, which is primarily controlled by transcription factors (TFs) that bind to the DNA to activate gene expression. The possibility that non-cognate TFs activate transcription diminishes the accuracy of regulation, with potentially disastrous effects for the cell. This 'crosstalk' acts as a previously unexplored source of noise in biochemical networks and puts a strong constraint on their performance. To mitigate erroneous initiation, we propose an out-of-equilibrium scheme that implements kinetic proofreading. We show that such architectures are favored over their equilibrium counterparts for complex organisms despite introducing noise in gene expression.
Distributed and dynamic intracellular organization of extracellular information
Although cells respond specifically to environments, how environmental identity is encoded intracellularly is not understood. Here, we study this organization of information in budding yeast by estimating the mutual information between environmental transitions and the dynamics of nuclear translocation for 10 transcription factors. Our method of estimation is general, scalable, and based on decoding from single cells. The dynamics of the transcription factors are necessary to encode the highest amounts of extracellular information, and we show that information is transduced through two channels: generalists (Msn2/4, Tod6 and Dot6, Maf1, and Sfp1) can encode the nature of multiple stresses, but only if stress is high; specialists (Hog1, Yap1, and Mig1/2) encode one particular stress, but do so more quickly and for a wider range of magnitudes. In particular, Dot6 encodes almost as much information as Msn2, the master regulator of the environmental stress response. Each transcription factor reports differently, and it is only their collective behavior that distinguishes between multiple environmental states. Changes in the dynamics of the localization of transcription factors thus constitute a precise, distributed internal representation of extracellular change. We predict that such multidimensional representations are common in cellular decision-making.
Estimating information in time-varying signals
Across diverse biological systems, ranging from neural networks to intracellular signaling and genetic regulatory networks, the information about changes in the environment is frequently encoded in the full temporal dynamics of the network nodes. A pressing data-analysis challenge has thus been to efficiently estimate the amount of information that these dynamics convey from experimental data. Here we develop and evaluate decoding-based estimation methods to lower bound the mutual information about a finite set of inputs, encoded in single-cell high-dimensional time series data. For biological reaction networks governed by the chemical master equation, we derive model-based information approximations and analytical upper bounds, against which we benchmark our proposed model-free decoding estimators. In contrast to the frequently used k-nearest-neighbor estimator, decoding-based estimators robustly extract a large fraction of the available information from high-dimensional trajectories with a realistic number of data samples. We apply these estimators to previously published data on Erk and Ca2+ signaling in mammalian cells and to yeast stress response, and find that a substantial amount of information about environmental state can be encoded by non-trivial response statistics even in stationary signals. We argue that these single-cell, decoding-based information estimates, rather than the commonly used tests for significant differences between selected population response statistics, provide a proper and unbiased measure for the performance of biological signaling networks.
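The decoding-based lower bound mentioned in this abstract can be illustrated with a minimal sketch. It assumes a deliberately simple nearest-class-mean decoder and a single random train/test split; the function names, the decoder choice, and the split fraction are illustrative assumptions, not the estimators evaluated in the paper (which uses SVMs, a Gaussian decoder, and a neural network). The idea is that the mutual information between true and decoded input labels, computed from the test-set confusion matrix, cannot exceed the information carried by the full trajectories.

```python
import numpy as np

def mutual_information_bits(confusion):
    """Mutual information (bits) between true and decoded labels,
    from a confusion matrix of counts (rows: true, columns: decoded)."""
    joint = np.asarray(confusion, dtype=float)
    joint = joint / joint.sum()                 # joint distribution p(true, decoded)
    p_true = joint.sum(axis=1, keepdims=True)   # marginal over true labels
    p_dec = joint.sum(axis=0, keepdims=True)    # marginal over decoded labels
    nonzero = joint > 0                         # skip empty cells (0 * log 0 = 0)
    ratio = joint[nonzero] / (p_true @ p_dec)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))

def nearest_mean_decode(train_x, train_y, test_x):
    """Hypothetical toy decoder: assign each test trajectory to the class
    whose mean training trajectory is closest in Euclidean distance."""
    labels = np.unique(train_y)
    means = np.stack([train_x[train_y == k].mean(axis=0) for k in labels])
    dists = np.linalg.norm(test_x[:, None, :] - means[None, :, :], axis=2)
    return labels[np.argmin(dists, axis=1)]

def decoding_information(x, y, train_fraction=0.5, seed=0):
    """Train on one part of the data, decode the held-out part, and return
    the information (bits) between true and decoded labels."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(train_fraction * len(y))
    train, test = order[:n_train], order[n_train:]
    predicted = nearest_mean_decode(x[train], y[train], x[test])
    labels = np.unique(y)
    confusion = np.zeros((len(labels), len(labels)))
    for i, true_label in enumerate(labels):
        for j, decoded_label in enumerate(labels):
            confusion[i, j] = np.sum((y[test] == true_label) & (predicted == decoded_label))
    return mutual_information_bits(confusion)
```

Here `x` would be an array of shape (number of trajectories, d) and `y` the corresponding input labels. Evaluating the decoder on held-out data is what keeps the estimate conservative; repeating the split and averaging is one plausible way to obtain error bars.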
Two-level mutual information estimates from single-cell time-series data for nuclear translocation of yeast transcription factors.
(A, B) Data replotted from Ref [27] for Msn2 (top row), Dot6 (middle row), and Sfp1 (bottom row); early transient responses (A) after nutrient shift at t = 0 min from glucose rich (2%, blue traces) to glucose poor (0.1%, red traces) medium are shown in the left column, stationary responses (B) are collected after cells are fully adapted to the new medium. Sampling frequency is 2.5 min, d = 45, and the number of sample trajectories per nutrient condition is N = 100. Thin lines are individual single cell traces, solid lines are population averages. (C, D) Information estimates for the transient (left, C) and stationary (right, D) response periods. Colored bars use model-free decoding-based estimators as indicated in the legend, gray bar is the knn estimate; error bars computed from estimation bootstraps by randomly splitting the data into testing and training sets.
Information estimation for multilevel inputs.
(A) Extension of Example 2 from Fig 2B to q = 2, …, 5 discrete inputs. We chose the inputs such that the response for the system at T q equally spaced levels with the same dynamic range as the original example; dynamics at T ≥ 1000 remain unchanged from the original Example 2. (B) Model-based information bounds as a function of the number of input levels for trajectories represented as d = 100 dimensional vectors: exact Monte Carlo calculation (dark gray), MAP decoding bound (black), upper bound (light gray). (C) Performance of model-free estimators, as indicated in the panel, compared to the MAP bound (black). Dashed lines show estimations using N = 10^3 sample trajectories per condition, solid lines using N = 10^4 samples per condition; in both cases, we show an average over 20 independent replicates, error bars are suppressed for readability.
Performance of decoding-based estimators depends on the dimensionality of the response trajectories and on the number of response trajectory samples.
Performance of various model-free decoding estimators (colored lines) for Examples 1 (A, D), 2 (B, E), 3 (C, F), respectively, compared to the MAP bound, I_MAP (black line), as a function of input trajectory dimension, d (at fixed N = 1000) in A, B, C; or as a function of the number of samples, N, per input condition (at fixed d = 100) in D, E, F. Error bars are standard deviations over 20 replicate estimations. Decoding estimators: linear SVM, I_SVM(lin) (orange); radial basis functions SVM, I_SVM(rbf) (blue); the Gaussian decoder with diagonal regularization (see S2 Fig for the effects of covariance matrix regularization and signal filtering on Gaussian decoder estimates), I_GD (yellow); multi-layer perceptron neural network, I_NN (green). Dashed vertical orange line marks the d ≤ 100 regime typical of current experiments. Note that while the amount of information must in principle increase monotonically with d, the amount that decoders can actually extract given a limited number of samples, N, has no such guarantee. The decrease, at high d, in Gaussian decoder information estimate in A, B, C and neural network information estimate in C, happens because the number of parameters of the decoder grows with d albeit at fixed number of samples, leading to overfitting that regularization cannot fully compensate, and thus to the consequent loss of performance on the test data.
Example biochemical reaction networks and their behavior.
Three example birth-death processes, specified by the reactions in the top row for each of the two possible inputs (u(1) in blue, u(2) in red), stylize simple behaviors of biochemical signaling networks. (A) Input is encoded in both the transient approach to steady state and the steady state value. (B) Input is encoded in the magnitude of the transient response which is subsequently adapted away. (C) Input is encoded only at the level of temporal correlations of the response trajectory. Bottom row shows example trajectories generated using the Stochastic Simulation Algorithm for the copy number of molecules, t ∈ [0, 2000], for each network and the two possible inputs (light blue, light red); while plotted as a connected line for clarity, each trajectory represents molecular counts and is thus a step-wise function taking on only integer or zero values. Dark blue, red lines show the conditional means over N = 1000 trajectory realizations.
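For readers unfamiliar with the Stochastic Simulation Algorithm named in this caption, the following is a minimal sketch for a generic birth-death process with constant birth rate and linear death rate. The reactions, rate values, and function names are illustrative assumptions and do not reproduce the three example networks of the figure, whose reaction schemes are not given in this listing.

```python
import numpy as np

def simulate_birth_death(birth_rate, death_rate, t_end, x0=0, seed=0):
    """Gillespie simulation of: (birth) 0 -> X at rate birth_rate,
    (death) X -> 0 at rate death_rate * x. Returns event times and counts."""
    rng = np.random.default_rng(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        total_rate = birth_rate + death_rate * x
        if total_rate <= 0:
            break                                 # no reaction can fire
        t += rng.exponential(1.0 / total_rate)    # waiting time to next event
        if t > t_end:
            break
        # choose which reaction fires, in proportion to its propensity
        if rng.random() < birth_rate / total_rate:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return np.array(times), np.array(counts)

def sample_on_grid(times, counts, grid):
    """Read the piecewise-constant trajectory at the given time points,
    turning one realization into a fixed-length vector."""
    idx = np.searchsorted(times, grid, side="right") - 1
    return counts[idx]
```

Each realization is a step function over integer copy numbers, as the caption notes. Sampling it on a regular grid with `sample_on_grid` gives the kind of d-dimensional response vector that a decoder can take as input.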
Multilevel mutual information estimates from single-cell time-series data for mammalian intracellular signaling.
Data replotted from Ref [26] for ERK (top row) and Ca2+ (bottom row). (A) Early transient responses after addition of 5 different levels of EGF for ERK (or 4 different levels of ATP for Ca2+, respectively) at t = 0 min, as indicated in the legend. (B) In the late response most, but not all, of the transients have decayed. (C,D) Information estimation using different methods (legend) in the early (C) and late (D) period, for ERK (left half of the panels) and Ca2+ (right half). Data for ERK: N = 1678 per condition, T = 30 min (d = 30) for early response and T = 30 min (d = 30) for late response. Data for Ca2+: N = 2995 per condition, T = 10 min (d = 200) for early response and T = 5 min (d = 100) for late response. Plotting conventions as in Fig 8.
