# Research Interests

Even genetically identical cells in identical environments exhibit wildly different phenotypical behaviors due to cellular fluctuations known as gene expression "noise". Previously, such noise was considered a nuisance that compromised cellular responses, complicated modeling, and made predictive understanding all but impossible. Many studies focused on how cellular processes remove or exploit noise to a cell's advantage. However, different cellular mechanisms affect these cellular fluctuations in different ways, and it is now clear that these fluctuations contain valuable information about underlying cellular mechanisms. Finding and exploiting this information requires a strong integration of single-cell/single-molecule measurements with discrete stochastic analyses. My focus is to utilize this information to gain predictive understanding of new biological phenomena. Along these lines, we have studied natural and synthetic transcriptional regulation pathways in bacteria, yeast and mammalian cells.

## Current Projects

Emerging techniques now allow for precise quantification of distributions of biological molecules in single cells. These rapidly advancing experimental methods have created a need for more rigorous and efficient modeling tools. Many of the tools we use are extensions of the Finite State Projection approach, which allows us to compute the precise time-varying probability distributions of single-cell responses, even in fluctuating environments.

**New Bounds on Likelihoods of Single-Cell Data.** We have recently derived new bounds on the likelihood that observations of single-cell, single-molecule responses come from a discrete stochastic model, which we pose in the form of the chemical master equation (CME). These strict upper and lower bounds are based on a technique known as the Finite State Projection (FSP), and they converge monotonically to the exact likelihood value. By calculating these bounds, we can rigorously discriminate between models with a minimum level of computational effort. In practice, we have incorporated these FSP-derived likelihood bounds into stochastic model identification and parameter inference routines, which improve the accuracy and efficiency of efforts to analyze and predict single-cell behavior. We have demonstrated the applicability of our approach using simulated data for multiple models as well as experimental measurements of a time-varying stochastic transcriptional response in yeast.

Z. Fox, G. Neuert and __B. Munsky__, Finite state projection based bounds to compare chemical master equation models using single-cell data, Journal of Chemical Physics, 145:7, 074101 (2016), online here
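The idea of bracketing a likelihood with an FSP solution can be sketched on a toy model. The example below is a minimal illustration, not the method of the paper: it assumes a constitutive expression model (0 -> mRNA -> 0), hypothetical rates and mRNA counts, and the generic FSP property that the truncated solution under-estimates each probability by no more than the total probability mass `eps` that has left the truncated state space. The helper names (`fsp_generator`, `fsp_solve`, `loglik_bounds`) are invented for this sketch.

```python
import math
import numpy as np

def fsp_generator(k_prod, k_deg, n_max):
    """CME generator for 0 -> mRNA -> 0, truncated to states 0..n_max.

    Column j holds the rates out of state j. Production out of n_max
    leaves the truncated state space, so that column sums to less than 0.
    """
    n = np.arange(n_max + 1)
    A = np.zeros((n_max + 1, n_max + 1))
    A[n, n] = -(k_prod + k_deg * n)      # total outflow from each state
    A[n[1:], n[:-1]] = k_prod            # production: n-1 -> n
    A[n[:-1], n[1:]] = k_deg * n[1:]     # degradation: n -> n-1
    return A

def fsp_solve(A, p0, t):
    """p(t) = expm(A t) p0 by uniformization (a sum of non-negative terms)."""
    lam = np.max(-np.diag(A))
    P = np.eye(A.shape[0]) + A / lam
    n_terms = int(lam * t + 12.0 * math.sqrt(lam * t) + 30)
    weight = math.exp(-lam * t)
    term = p0.astype(float)
    p = weight * term
    for k in range(1, n_terms + 1):
        term = P @ term
        weight *= lam * t / k
        p = p + weight * term
    return p

def loglik_bounds(p_fsp, counts):
    """Lower and upper bounds on the log-likelihood of observed counts."""
    eps = max(0.0, 1.0 - p_fsp.sum())    # probability mass outside the truncation
    lower = float(np.sum(np.log(p_fsp[counts])))
    upper = float(np.sum(np.log(p_fsp[counts] + eps)))
    return lower, upper, eps

counts = np.array([5, 8, 10, 12, 7, 9, 6])   # hypothetical mRNA counts per cell
k_prod, k_deg, t = 10.0, 1.0, 2.0            # hypothetical rates and measurement time
for n_max in (15, 25, 40):
    p0 = np.zeros(n_max + 1)
    p0[0] = 1.0                              # every cell starts with zero mRNA
    p = fsp_solve(fsp_generator(k_prod, k_deg, n_max), p0, t)
    lower, upper, eps = loglik_bounds(p, counts)
    print(f"n_max={n_max:3d}  eps={eps:.2e}  {lower:.6f} <= log L <= {upper:.6f}")
```

Enlarging the truncation shrinks `eps`, so the two bounds close in on the exact log-likelihood; a model can be rejected as soon as its upper bound falls below a competitor's lower bound.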

**Estimating and maximizing the information in single-cell experiments.** Modern experiments not only measure single-cell and single-molecule dynamics with high precision, but they can also perturb the cellular environment in myriad controlled and novel settings. Such techniques have opened the door to an effectively unlimited number of potential experiments, which raises the question of how best to choose the next experiment. The Fisher information matrix (FIM) estimates how well potential experiments will constrain model parameters and can be used to design optimal experiments. Here, we introduce the finite state projection (FSP) based FIM, which uses the formalism of the chemical master equation to derive and compute the FIM. The FSP-FIM makes no assumptions about the distribution shapes of single-cell data, and it does not require precise measurements of higher order moments of such distributions. We validate the FSP-FIM against well-known Fisher information results for the simple case of constitutive gene expression. We then demonstrate the use of the FSP-FIM to optimize the timing of single-cell experiments with more complex, non-Gaussian fluctuations. We validate optimal experiments determined using the FSP-FIM with Monte Carlo approaches and contrast these to experiments chosen by traditional analyses that assume Gaussian fluctuations or use the central limit theorem. By systematically designing experiments to use all of the measurable fluctuation information, our method enables a key step to improve co-design of experiments and quantitative models.

ZR Fox and __B. Munsky__, The finite state projection based Fisher information matrix approach to estimate and maximize the information in single-cell experiments, bioRxiv (2018), online here
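For a discrete distribution p(x; θ), the Fisher information per observed cell is the sum over x of (∂p/∂θ_i)(∂p/∂θ_j)/p(x). The sketch below evaluates that sum on an FSP solution of a constitutive expression model. It is an illustration under stated assumptions: the rates and measurement times are hypothetical, the function names are invented here, and the sensitivities are taken by central finite differences for brevity rather than derived from the chemical master equation as in the FSP-FIM.

```python
import math
import numpy as np

def fsp_distribution(k_prod, k_deg, t, n_max=60):
    """FSP solution of 0 -> mRNA -> 0 on states 0..n_max, starting from zero mRNA."""
    n = np.arange(n_max + 1)
    A = np.zeros((n_max + 1, n_max + 1))
    A[n, n] = -(k_prod + k_deg * n)
    A[n[1:], n[:-1]] = k_prod
    A[n[:-1], n[1:]] = k_deg * n[1:]
    lam = np.max(-np.diag(A))                 # uniformization rate
    P = np.eye(n_max + 1) + A / lam
    term = np.zeros(n_max + 1)
    term[0] = 1.0
    weight = math.exp(-lam * t)
    p = weight * term
    for k in range(1, int(lam * t + 12.0 * math.sqrt(lam * t) + 30) + 1):
        term = P @ term
        weight *= lam * t / k
        p = p + weight * term
    return p

def fsp_fim(theta, t, rel_step=1e-4):
    """FIM per cell at time t: sum_x (dp/dtheta_i)(dp/dtheta_j) / p(x)."""
    theta = np.asarray(theta, dtype=float)
    p = fsp_distribution(theta[0], theta[1], t)
    sens = []
    for i in range(theta.size):
        h = rel_step * theta[i]
        up, down = theta.copy(), theta.copy()
        up[i] += h
        down[i] -= h
        # central finite difference (an assumption of this sketch)
        sens.append((fsp_distribution(*up, t) - fsp_distribution(*down, t)) / (2 * h))
    S = np.array(sens)                        # shape (n_params, n_states)
    keep = p > 0
    return (S[:, keep] / p[keep]) @ S[:, keep].T

theta = (10.0, 1.0)                           # hypothetical (k_prod, k_deg)
first = fsp_fim(theta, 0.5)                   # one measurement fixed at t = 0.5
candidates = (1.0, 2.0, 4.0)                  # candidate times for a second measurement
dets = {t2: float(np.linalg.det(first + fsp_fim(theta, t2))) for t2 in candidates}
best = max(dets, key=dets.get)
print("det(FIM) per candidate time:", dets)
print("D-optimal second measurement time:", best)
```

A single time point cannot constrain both rates in this model, because the distribution depends on them only through its mean; combining two time points makes the FIM full rank, and the determinant criterion (D-optimality, chosen here for illustration) ranks the candidate times.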

When biological models under-perform expectations, it is tempting to attribute failure to “bad models” or “insufficient data”. However, predictions from good models and sufficient data may fail due to poor integration of the two. Unlike fluctuations in most engineered systems, biological fluctuations are dominated by discrete changes in DNA, RNA and protein. Integrating stochastic models with single-cell experiments can provide a wealth of information about gene regulatory dynamics (1), but for discrete, positive fluctuations, standard data-model integration analyses (e.g., assuming normal distributions or making central limit theorem (CLT) arguments) can produce nearly perfect fits to old data yet fail dramatically to predict new phenomena. When these standard analyses fail, approaches that dispense with CLT assumptions can still yield extremely accurate quantitative predictions, even for the *exact same data* and *exact same models*. We are demonstrating these crucial model-data integration concerns using single-cell, single-molecule data collected on an evolutionarily conserved Mitogen-Activated Protein Kinase (MAPK) pathway and its downstream induction of mRNA transcription in yeast. We discuss how different modeling assumptions affect parameter uncertainties or bias and how these errors affect predictive understanding.

We examine the High Osmolarity Glycerol (Hog1) stress response pathway and its control of transcription mechanisms (polymerase initiation and elongation, mRNA export, accumulation, and degradation) during transient adaptation to hyper-osmotic shock. Our collaborators in the Neuert Lab at Vanderbilt University have quantified individual mRNA at the site of transcription, in the nucleus, and in the cytoplasm for multiple genes using single-molecule fluorescence in situ hybridization for more than 65,000 cells at many points in time, under different environmental conditions, and in multiple replica experiments. These measured distributions are demonstrably non-normal and non-symmetric, which has important implications for the results of model-data integration.

We extend a multi-state gene expression model (1,2) to account for transcriptional regulation and spatial localization of mRNA. We solve this model with three *exact* computational methods: (1) a deterministic ODE analysis, (2) a linear noise analysis, and (3) a chemical master equation (CME) solution. We invoke the CLT to approximate the likelihood of all data given the model for analyses (1) and (2), and we compute the exact likelihood using analysis (3). For each case, we use Metropolis-Hastings sampling to find the maximum likelihood and posterior distribution of the parameters, given the data.
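The sampling step can be sketched with a pluggable likelihood. The toy example below is an illustration only: it assumes Poisson-distributed counts with a single rate parameter, simulated data, and a random-walk proposal on the log of the parameter. The function names are invented for this sketch, and the Gaussian likelihood of the sample mean merely stands in for the CLT-based analyses. For symmetric, Poisson-like data the two likelihoods give similar posteriors; the failures described in the text arise for strongly asymmetric distributions, which this toy does not reproduce.

```python
import math
import random

def poisson_sample(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def loglik_exact(lam, data):
    """Exact Poisson log-likelihood of every single-cell count."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)

def loglik_clt(lam, data):
    """CLT approximation: the sample mean is Gaussian with variance lam / n."""
    n = len(data)
    mean = sum(data) / n
    var = lam / n
    return -0.5 * math.log(2.0 * math.pi * var) - (mean - lam) ** 2 / (2.0 * var)

def metropolis_hastings(loglik, data, start, n_steps=5000, step=0.05, seed=1):
    """Random-walk Metropolis-Hastings on log(lam), flat prior on log(lam)."""
    rng = random.Random(seed)
    log_lam = math.log(start)
    current = loglik(start, data)
    samples = []
    for _ in range(n_steps):
        proposal = log_lam + rng.gauss(0.0, step)
        candidate = loglik(math.exp(proposal), data)
        # accept with probability min(1, likelihood ratio)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            log_lam, current = proposal, candidate
        samples.append(math.exp(log_lam))
    return samples

data_rng = random.Random(0)
data = [poisson_sample(data_rng, 5.0) for _ in range(200)]   # simulated counts
data_mean = sum(data) / len(data)

posterior_means = {}
for name, loglik in (("exact", loglik_exact), ("CLT", loglik_clt)):
    samples = metropolis_hastings(loglik, data, start=data_mean)
    posterior_means[name] = sum(samples) / len(samples)
    print(f"{name:5s} posterior mean of lam: {posterior_means[name]:.3f}")
```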

Despite excellent fits to training data, the CLT-methods fail to predict the full statistics, and parameter uncertainty and bias errors in the CLT-approaches are orders of magnitude larger than for the CME-approach. Use of second moments (i.e., (co)variances) modestly reduces uncertainty, but exacerbates the bias and yields even worse predictions. We trace this effect to asymmetry in the RNA distributions, which causes systematic under-estimation of the moments and leads CLT-approaches to overestimate RNA degradation rates by multiple orders of magnitude compared to results in different yeast strains (2). In contrast, the CME-approach recovers these rates within 5-8%, indicating strong repeatability of both experiments and analyses.

We used the identified models to predict the elongation dynamics of nascent mRNA at transcription sites (TS). Using TS images for *endogenous* mRNA for the *CTT1* gene, we estimated the Pol II elongation rate in excellent agreement with published rates. Using no additional free parameters, we correctly predicted and then measured (*i*) the average full-length *STL1* mRNA per active TS, (*ii*) the quantitative fraction of cells that have active TSs versus time, and (*iii*) the full distributions of nascent mRNA (or equivalently the number of associated elongating Pol II) per TS.

**Science** **336**:6078, 183-187, 2012.

**Science** **339**:6119, 584-587, 2013.

**Proceedings of the National Academy of Sciences**, 115:29, 7533-7538, 2018.

Translation is an essential step in which ribosomes decipher mRNA sequences to manufacture proteins. Recent advances in single-molecule imaging allow live-cell quantification of the kinetics of ribosome initiation and elongation. Here, we integrate single-molecule data and stochastic models to investigate how elongation rates vary among different gene transcripts. Our computational method automatically generates discrete translation models to match any mRNA sequence. The models are then solved using stochastic dynamics; the results are quantified in terms of translation spot intensity; and we fit these to single-mRNA translating spots observed under the microscope. We compare models with fixed and codon-dependent elongation rates. Additionally, we simulate the effect of chemical perturbations, puromycin and harringtonine, which inhibit elongation and initiation steps, respectively. Because codon usage and chemical environments both affect translation mechanisms, the kinetics of protein production can be decoded to extract new information about cellular states and mRNA sequences.
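A codon-by-codon translation model of this kind can be sketched with a stochastic simulation. The example below is a simplified illustration, not the model described above: it assumes a hypothetical 50-codon transcript, invented initiation and elongation rates, and no ribosome exclusion (ribosomes may overlap), and it uses the time-averaged number of ribosomes on the mRNA as a crude proxy for translation spot intensity. Under these assumptions the expected load is the initiation rate times the summed codon dwell times, which gives a check on the simulation.

```python
import random

def mean_ribosome_load(k_init, k_elong, t_end, seed=0):
    """Gillespie simulation of initiation and codon-by-codon elongation.

    k_elong[j] is the rate (1/s) at which a ribosome leaves codon j.
    Returns the time-averaged number of ribosomes on the mRNA.
    """
    rng = random.Random(seed)
    ribosomes = []                 # codon index of each translating ribosome
    t = 0.0
    load_integral = 0.0
    while True:
        total = k_init + sum(k_elong[pos] for pos in ribosomes)
        dt = rng.expovariate(total)
        if t + dt >= t_end:
            load_integral += len(ribosomes) * (t_end - t)
            return load_integral / t_end
        load_integral += len(ribosomes) * dt
        t += dt
        r = rng.random() * total
        if r < k_init or not ribosomes:
            ribosomes.append(0)    # initiation: new ribosome at the first codon
            continue
        r -= k_init
        chosen = len(ribosomes) - 1
        for i, pos in enumerate(ribosomes):
            r -= k_elong[pos]
            if r < 0.0:
                chosen = i
                break
        if ribosomes[chosen] + 1 == len(k_elong):
            ribosomes.pop(chosen)  # termination after the last codon
        else:
            ribosomes[chosen] += 1 # elongation by one codon

n_codons = 50                      # hypothetical transcript length
k_init = 0.05                      # hypothetical initiation rate (1/s)
models = {
    "fixed": [5.0] * n_codons,
    "codon-dependent": [2.5 if i % 5 == 0 else 6.0 for i in range(n_codons)],
}
results = {}
for name, rates in models.items():
    simulated = mean_ribosome_load(k_init, rates, t_end=50000.0)
    expected = k_init * sum(1.0 / k for k in rates)
    results[name] = (simulated, expected)
    print(f"{name:16s} simulated load {simulated:.3f}, expected {expected:.3f}")
```

Setting `k_init` to zero partway through a run would mimic an initiation inhibitor such as harringtonine in this toy; a realistic model would also need ribosome exclusion and a mapping from ribosome position to fluorescence.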

Flow cytometry typically relies upon specific biochemical labels, but such labels are not always available, can be costly, and can disrupt natural cell behavior. Label-free quantification strategies are needed to correct these issues. Unfortunately, label replacement strategies may be difficult to learn if applied labels or other modifications in training data inadvertently modify intrinsic cell properties. Here we develop a new approach based upon population statistics and machine learning to integrate labeled and unlabeled training data and to identify models for accurate label-free quantification. We evaluate the accuracy of this method by testing how well the machine learning approach quantifies lipid content in new conditions as part of an algal biofuel application. We apply our approach to make and test label-free quantification of lipid content in *Picochlorum sp.* at multiple times following nitrogen starvation and lipid accumulation.

This research is being performed in collaboration with experimental efforts in Babetta Marrone's laboratory at Los Alamos National Laboratory.

One of the many ways that bacteria evade antibiotic treatments is *Bacterial Persistence*. In this phenomenon, rare cells in a large population transiently enter a dormant state in which they do not grow, but they are also not responsive to antibiotics in their environment. These cells can later escape their dormant state to replenish the population after antibiotics are cleared from the environment. We are interested in developing quantitative models for the epigenetic heterogeneity of persistence in the context of time-varying environments. If we can predict what circumstances lead to persistence, we can use these predictions to design control strategies that maximize the effectiveness of antibiotics while minimizing the duration and amount of treatment.
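The qualitative behavior described here can be sketched with a two-state model of mean cell numbers. This is a hypothetical illustration rather than a model from this research: it assumes normal cells that grow (or die under antibiotic) and switch into a persister state, persisters that neither grow nor die and switch back, and invented rates in units of 1/hour. The function names are also invented for this sketch.

```python
import numpy as np

def rate_matrix(growth, kill, to_persister, to_normal, drug_on):
    """Mean dynamics of [normal, persister] cell numbers.

    Persisters neither grow nor die; they only switch back to normal.
    """
    net = -kill if drug_on else growth
    return np.array([[net - to_persister, to_normal],
                     [to_persister, -to_normal]])

def propagate(M, x0, t):
    """x(t) = expm(M t) x0 via eigendecomposition (assumes M is diagonalizable)."""
    w, V = np.linalg.eig(M)
    return np.real((V * np.exp(w * t)) @ np.linalg.solve(V, x0))

def run_schedule(x0, schedule, **rates):
    """Apply a piecewise-constant environment given as (drug_on, duration) pairs."""
    x = np.asarray(x0, dtype=float)
    history = [x]
    for drug_on, duration in schedule:
        x = propagate(rate_matrix(drug_on=drug_on, **rates), x, duration)
        history.append(x)
    return history

# Hypothetical rates (1/hour) and a treat-then-release schedule (hours).
rates = dict(growth=1.0, kill=4.0, to_persister=1e-3, to_normal=0.1)
schedule = [(True, 5.0), (False, 24.0)]
history = run_schedule([1e6, 0.0], schedule, **rates)

for label, (normal, persister) in zip(("start", "after drug", "after regrowth"), history):
    total = normal + persister
    print(f"{label:15s} total={total:.3e}  persister fraction={persister / total:.3f}")
```

In this toy, the antibiotic removes almost all normal cells and leaves a survivor population dominated by persisters, which then reseeds growth once the drug is withdrawn. Changing the entries of `schedule` is a simple way to explore how treatment timing alters the outcome.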

**New Computational Approaches to Model Heterogeneous Populations.** Population modeling aims to capture and predict the dynamics of cell populations in constant or fluctuating environments. At the elementary level, population growth proceeds through sequential divisions of individual cells. Due to stochastic effects, populations of cells are inherently heterogeneous in phenotype, and some phenotypic variables affect division or survival rates, as can be seen in partial drug resistance. Therefore, when modeling population dynamics where the control of growth and division is phenotype dependent, the corresponding model must account for the underlying cellular heterogeneity. The finite state projection (FSP) approach has often been used to analyze the statistics of independent cells. Here, we extend the FSP analysis to explore the coupling of cell dynamics and biomolecule dynamics within a population. This extension provides a general framework with which to model the state occupations of a heterogeneous, isogenic population of dividing and expiring cells. The method is demonstrated with a simple model of cell-cycle progression, which we use to explore possible dynamics of drug resistance phenotypes in dividing cells. We use this method to show how stochastic single-cell behaviors affect population-level efficacy of drug treatments, and we illustrate how slight modifications to treatment regimens may have dramatic effects on drug efficacy.

R. Johnson and __B. Munsky__, The finite state projection approach to analyze dynamics of heterogeneous populations, Physical Biology, 14:3, 035002 (2017), online here

Recent research has shown that microbes drive carbon sequestration in soil; however, the mechanisms by which microbial communities affect carbon sequestration remain unknown. We are using feature selection methods combined with neural network and random forest analyses to gain insight into which indicator species correlate positively or negatively with the fixation of carbon in complex soil communities. Determining which microbial features drive carbon sequestration in soil could have a profound impact on reducing atmospheric carbon dioxide and combating climate change. With machine learning and feature selection techniques, it may be possible to determine which microbial communities are most important for converting atmospheric CO2 into a stable carbon pool in soil.

This research is being performed in collaboration with experimental efforts in John Dunbar's laboratory at Los Alamos National Laboratory.