Date of Award
Master of Science (MS)
In human stool, a large population of bacterial genes and transcripts from hundreds of genera coexists with host genes and transcripts. Assessing the metagenome and transcriptome is particularly challenging because sequences overlap extensively among related species and related genes. We sequenced the total RNA content of a stool sample from a neonate using previously described methods. We then performed stepwise alignment of different populations of RNA sequence reads to different indices, including ribosomal databases, the human genome, and all sequenced bacterial genomes. The pool of RNA at each alignment step was compressed to assess its sequence complexity in bits per symbol. To account for the high degree of overlap among species, a Bayesian network tool (RNABayes) was constructed using one node based on 16S sequencing and a large number of nodes based on alignment scores to bacterial genes. The following algorithm was then employed: (1) fit the 16S census from a sample to a Dirichlet distribution using maximum likelihood estimation to obtain the conjugate prior, (2) estimate the probability of each bacterial genus for each bacterial mRNA alignment using BLAST alignment scores, (3) fit each of these probabilities to a Dirichlet distribution using maximum likelihood estimation, and (4) perform inference iteratively to update the conjugate prior, yielding the posterior probability distribution of metabolically active stool bacteria. This algorithm was applied to three datasets: (1) a simulated dataset with normally distributed mRNAs, (2) a simulated dataset with skewed mRNAs for a single bacterial population, and (3) the RNA-seq dataset from our newborn stool sample. Results indicate that a Bayesian network built in this fashion reliably adjusts the prior bacterial population distribution to more accurately reflect the transcriptionally active bacterial population. Application of this method to real-world samples appears to show an even more marked skew, indicating that transcripts are not uniformly distributed across bacterial populations.
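The compression step in the abstract can be illustrated with a minimal sketch. The abstract does not say which compressor was used, so this example assumes zlib (DEFLATE) as a stand-in; the function name bits_per_symbol is hypothetical. It reports the compressed size in bits divided by the number of bases, so a repetitive read pool scores near zero and a pool with no exploitable structure scores near or above two bits per base.

```python
import random
import zlib


def bits_per_symbol(seq: str, level: int = 9) -> float:
    """Compressed size in bits divided by the number of symbols (bases).

    Assumption: zlib stands in for whichever compressor the thesis used.
    """
    data = seq.encode("ascii")
    if not data:
        raise ValueError("sequence must not be empty")
    return 8 * len(zlib.compress(data, level)) / len(data)


# Toy inputs (not real data): a homopolymer run and a pseudo-random sequence.
rng = random.Random(0)
low_complexity = "A" * 10_000
high_complexity = "".join(rng.choice("ACGU") for _ in range(10_000))

low_bps = bits_per_symbol(low_complexity)
high_bps = bits_per_symbol(high_complexity)
```

A general-purpose compressor only gives an upper bound on the true entropy of a read pool, so the values are best compared across alignment steps rather than read as absolute measurements.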
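The four-step algorithm can be sketched in simplified form as follows. This is not the RNABayes implementation: the prior uses 16S counts plus a pseudocount as Dirichlet concentration parameters instead of a maximum likelihood fit, alignment scores are simply normalized to sum to one, and the update adds fractional (soft) counts per transcript, which approximates the exact posterior when a transcript is ambiguous between genera. All function names, genus counts, and scores below are hypothetical toy values.

```python
from typing import Dict, List

Alpha = Dict[str, float]


def prior_from_16s(counts: Dict[str, int], pseudocount: float = 1.0) -> Alpha:
    """Step 1 (simplified): Dirichlet concentrations from a 16S census."""
    return {genus: n + pseudocount for genus, n in counts.items()}


def genus_probabilities(scores: Dict[str, float]) -> Dict[str, float]:
    """Steps 2-3 (simplified): normalize one transcript's alignment scores."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("alignment scores must sum to a positive value")
    return {genus: s / total for genus, s in scores.items()}


def update(alpha: Alpha, transcripts: List[Dict[str, float]]) -> Alpha:
    """Step 4 (approximate): sequential soft-count update of the prior."""
    alpha = dict(alpha)  # leave the caller's prior untouched
    for scores in transcripts:
        likelihood = genus_probabilities(scores)
        total_alpha = sum(alpha.values())
        # Responsibility of each genus: current mean times alignment evidence.
        weights = {g: alpha[g] / total_alpha * likelihood.get(g, 0.0)
                   for g in alpha}
        norm = sum(weights.values())
        if norm == 0:
            continue  # transcript matches no genus in the census
        for g in alpha:
            alpha[g] += weights[g] / norm
    return alpha


def mean(alpha: Alpha) -> Dict[str, float]:
    """Expected genus proportions under Dirichlet(alpha)."""
    total = sum(alpha.values())
    return {g: a / total for g, a in alpha.items()}


# Toy example: 16S census favors one genus, transcripts favor another.
census = {"Bifidobacterium": 60, "Escherichia": 30, "Bacteroides": 10}
transcripts = [{"Escherichia": 180.0, "Bifidobacterium": 20.0}] * 200

prior = prior_from_16s(census)
posterior = update(prior, transcripts)
prior_mean, posterior_mean = mean(prior), mean(posterior)
```

In this toy case the posterior mean shifts toward the genus that dominates the transcript evidence, which mirrors the skewed-mRNA simulation described in the abstract.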
Phillip I. Tarr, Roman Garnett
Bioinformatics Commons, Engineering Commons, Numerical Analysis and Scientific Computing Commons
Permanent URL: https://doi.org/10.7936/K7862DW5