The Li lab studies the genetic and functional basis of complex human diseases using genomic approaches. An overarching theme in the Li lab is the responsible use of complex data in transparent, reproducible, and community-extendable research. We rely on research computing resources to sustain efficiency and to adhere to the FAIR Data Principles.
Single-cell genomics and sparse data analytics
Dr. Li co-directs the Michigan Center for Single-Cell Genomic Data Analytics, which currently involves five data scientists (mathematicians and statisticians), five biological researchers, and about fifteen students. Single-cell sequencing data are unique in having two types of sparsity: low-rank structure in high-dimensional data, and low integer counts with many zeros. We are developing theoretical support for the unique distribution and inference properties of “double-sparse” integer counts, while actively pushing the community to standardize method evaluation by using simulated datasets that are not tilted towards any particular assumptions. Our approach is to create a compendium of simulated datasets with known truth, linked by known differences to target-specific model choices, and to share them broadly for benchmarking existing tools.
Christopher D. Green, Qianyi Ma, Gabriel L. Manske, Adrienne Niederriter Shami, Xianing Zheng, Simone Marini, Lindsay Moritz, Caleb Sultan, Stephen J. Gurczynski, Bethany B. Moore, Michelle D. Tallquist, Jun Z. Li*, Saher Sue Hammoud*: A comprehensive roadmap of murine spermatogenesis defined by single-cell RNA-seq. Developmental Cell S1534-5807(18)30636-1, 2018.
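As a rough illustration of the kind of simulated dataset with known truth described above, the following Python sketch draws low-count data from a rate matrix that has only a few distinct row profiles. It is not the Center's simulation framework: the function names, the exponential rate profiles, the Poisson noise model, and all parameter values are assumptions chosen for brevity.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_counts(n_cells=60, n_genes=100, n_groups=3, depth=0.3, seed=0):
    """Return (counts, labels, profiles) for a toy double-sparse dataset."""
    rng = random.Random(seed)
    # Known truth, part 1: a few rate profiles, so the rate matrix is low rank.
    profiles = [[depth * rng.expovariate(1.0) for _ in range(n_genes)]
                for _ in range(n_groups)]
    # Known truth, part 2: the group each cell was drawn from.
    labels = [rng.randrange(n_groups) for _ in range(n_cells)]
    # Low rates give small integer counts and many zeros.
    counts = [[poisson(rate, rng) for rate in profiles[g]] for g in labels]
    return counts, labels, profiles


def zero_fraction(counts):
    """Fraction of entries in the count matrix that are exactly zero."""
    flat = [c for row in counts for c in row]
    return sum(1 for c in flat if c == 0) / len(flat)


counts, labels, profiles = simulate_counts()
print("zero fraction:", round(zero_fraction(counts), 2))
print("largest count:", max(max(row) for row in counts))
```

Because the group labels and rate profiles are returned alongside the counts, a clustering or imputation tool run on `counts` can be scored against a truth that is known by construction.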
Data analysis method development
The Li lab has a long-standing interest in unsupervised class discovery, especially in adopting appropriate multivariate methods to detect sample clusters in high-dimensional data. This interest stems from our earlier work on human genetic diversity (7) and cancer subtypes (8). We closely examined the statistical limitations of a popular clustering method, Consensus Clustering (CC). We found that CC is able to divide randomly generated unimodal data into apparently stable clusters for a range of K (the number of clusters), essentially reporting chance partitions of cluster-less data. For data with known structure, the common implementations of CC perform poorly in identifying the true K. We proposed the use of null distributions that contain realistically observed gene-gene correlation (9). More broadly, we advocated systematic appraisal of classification methods using simulated null datasets with known absence of clusters, as well as "positive" datasets with known K and known clustering strength. This work reflects the group's basic approach to addressing rigor and reproducibility in the area of classification.
B. Li, J. Z. Li, A general framework for analyzing tumor subclonality using SNP array and DNA sequencing data. Genome Biol 15, 473 (2014). PMCID: PMC4203890.
B. Li et al., Genomic estimates of aneuploid content in glioblastoma multiforme and improved classification. Clin Cancer Res 18, 5595-5605 (2012).
Y. Senbabaoglu, G. Michailidis, J. Z. Li, Critical limitations of consensus clustering in class discovery. Sci Rep 4, 6207 (2014).
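The null-data check described in the method development section can be sketched in a few lines of Python. This is a minimal illustration, not the implementation evaluated in the cited work: it assumes k-means as the base clusterer, uncorrelated Gaussian features as the unimodal null, and a simple summary (the share of sample pairs with consensus near 0 or 1); all function names and parameter values are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def consensus_matrix(points, k, n_resamples=50, frac=0.8, seed=0):
    """Entry (i, j): share of resamples containing both i and j that co-cluster them."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        idx = rng.sample(range(n), int(frac * n))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def null_data(n=40, dim=5, seed=1):
    """Unimodal null: independent standard-normal features, no clusters."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


def fraction_extreme(m, lo=0.1, hi=0.9):
    """Share of distinct pairs whose consensus value looks 'stable'."""
    n = len(m)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(1 for i, j in pairs if m[i][j] <= lo or m[i][j] >= hi) / len(pairs)


points = null_data()
for k in (2, 3, 4):
    print(k, round(fraction_extreme(consensus_matrix(points, k)), 2))
```

Running the same summary on a dataset with planted clusters of known K gives the "positive" counterpart, so the apparent stability seen on real data can be judged against what arises by chance alone.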
The NIH Common Fund's Molecular Transducers of Physical Activity in Humans (MoTrPAC) program aims to assemble a comprehensive map of the molecular changes that occur in response to acute exercise, and to relate these changes to the benefits of physical activity. The consortium comprises 23 institutions, and our lab serves as one of its seven Chemical Analysis Sites. MoTrPAC will analyze human and rodent samples using genomic, epigenomic, transcriptomic, proteomic, and metabolomic technologies.
Spontaneous mutation patterns in the human genome
Germline mutagenesis is a fundamental biological process and a major source of all heritable genetic variation. Mutation rates and their variation along the genome are widely used in genomics research to calibrate variant calling algorithms, infer demographic history, identify recent patterns of genome evolution, and interpret clinical sequencing data. Although mutation is an inherently stochastic process, the distribution of mutations in the human genome is correlated with genomic and epigenomic features, including local nucleotide context. We investigate the basic biology of germline mutational processes through the regional variation and context dependency of mutation rates. The goal is to understand what drives this variability and to build accurate predictive models (1). Recently, we used ~36 million singleton variants from 3,560 whole-genome sequences to infer fine-scale patterns of mutation rate heterogeneity, showing how mutability is jointly affected by adjacent nucleotide context and diverse genomic features of the surrounding region, including histone modifications, replication timing, and recombination rate, sometimes suggesting specific mutagenic mechanisms (2). Our results provide the most refined survey to date of the factors contributing to genome-wide variability of the human germline mutation rate.
1. V. M. Schaibley et al., The influence of genomic context on mutation patterns in the human genome inferred from rare variants. Genome Res 23, 1974-1984 (2013).
2. J. Carlson, J. Li, S. Zöllner, Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Nature Communications. 2018.
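The idea of context-dependent mutability can be made concrete with a small Python sketch that divides the number of singleton variants observed in each nucleotide context by the number of sites with that context. It is only a toy under stated assumptions: the reference string and variant list are invented, the function names and the `flank` parameter are hypothetical, and the regional features used in the published models (histone marks, replication timing, recombination rate) and reverse-complement folding are left out.

```python
from collections import Counter


def context_counts(ref, flank=1):
    """Count every (2*flank+1)-mer in the reference that contains only A/C/G/T."""
    counts = Counter()
    for i in range(flank, len(ref) - flank):
        kmer = ref[i - flank:i + flank + 1]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def relative_mutation_rates(ref, singletons, flank=1):
    """Singletons per available site, keyed by (context, alternate allele).

    singletons: iterable of (0-based position, alternate allele).
    """
    sites = context_counts(ref, flank)
    hits = Counter()
    for pos, alt in singletons:
        if pos < flank or pos >= len(ref) - flank:
            continue  # no full context at the sequence edges
        kmer = ref[pos - flank:pos + flank + 1]
        if kmer in sites and alt != ref[pos]:
            hits[(kmer, alt)] += 1
    return {key: n / sites[key[0]] for key, n in hits.items()}


# Invented toy input: an 11-base reference and three singleton variants.
ref = "ACGTACGACGT"
singletons = [(1, "T"), (5, "T"), (2, "A")]
for (context, alt), rate in sorted(relative_mutation_rates(ref, singletons).items()):
    print(context, ">", alt, round(rate, 3))
```

In this toy input the context ACG occurs three times and carries two C>T singletons, so its relative rate is 2/3, while CGT occurs twice and carries one G>A singleton, giving 1/2.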
Gene discovery in rat models of addiction and metabolic health
A decrease in aerobic capacity is associated with increased risks for many diseases, including obesity, insulin resistance, hypertension, and type 2 diabetes. To understand the biological underpinnings of this trait, we established two rat lines by divergent selection for intrinsic endurance running capacity. After 32 generations, the high capacity runners (HCR) and low capacity runners (LCR) differed by ~9-fold in untrained endurance running distance, and diverged in body fat, blood glucose, and other health indicators (3). We crossed and genotyped HCRs and LCRs, and collected RNA-Seq data from muscle samples to identify eQTLs related to running capacity. Besides identifying selection-specific genetic sweeps that segregate the high and low fitness phenotypes, we are building a framework for understanding the genetic differences between low and high aerobic capacity (4). In a newly funded U01, we are working on another selectively bred rat model to identify gene variants associated with addiction and substance abuse, and to validate the results in outbred rats.
3. Y. Y. Ren et al., Genetic analysis of a rat model of aerobic capacity and metabolic fitness. PLoS One 8, e77588 (2013).
4. Y. Y. Ren et al., Selection-, age-, and exercise-dependence of skeletal muscle gene expression patterns in a rat model of metabolic fitness. Physiol Genomics 48, 816-825 (2016).
Genome evolution in esophageal cancer
We are interested in an unsolved problem in cancer research: detecting intratumor heterogeneity, defining the order and timing of somatic alterations, and understanding how these alterations contribute to malignancy, metastasis, and drug-resistant phenotypes. Cancer evolution is essentially a Darwinian selection process and can be studied using population genetics principles. We have developed a general framework to accurately estimate the subclonality of somatic events (5), and have an ongoing project to study esophageal squamous cell carcinoma in a Chinese cohort. In this work we extended our earlier method of analyzing intratumor heterogeneity from genome-wide average estimates of aneuploid content to local, segment-specific estimates. Unlike similar methods from other groups, ours provided a complete mathematical treatment of the confounding effect of copy number alterations (CNA) on the observed somatic mutation frequencies in bulk tumor samples, while considering the copy number state, the relative timing of CNA and mutation, and their phase relationship (6). The ultimate goal of these studies is to apply population genetics principles to understand the biogeographic patterns of somatic variation.
5. B. Li, J. Z. Li, A general framework for analyzing tumor subclonality using SNP array and DNA sequencing data. Genome Biol 15, 473 (2014).
6. W. Yuan et al., Mutation landscape and intra-tumor heterogeneity of two MANECs of the esophagus revealed by multi-region sequencing. Oncotarget 8, 69610-69621 (2017).
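To see why copy number confounds the mutation frequency observed in a bulk sample, consider the widely used mixing relation between tumor purity, local copy number, and variant allele frequency (VAF), sketched below in Python. This is a simplified textbook-style relation and not the framework of reference (5): it assumes the copy number alteration is present in every tumor cell, that normal cells carry two copies, and that the number of mutated copies per cell (`multiplicity`) is known. All function and parameter names are hypothetical.

```python
def expected_vaf(purity, total_cn, multiplicity, ccf=1.0, normal_cn=2):
    """Expected VAF of a somatic mutation in a tumor/normal mixture.

    purity:       fraction of cells in the sample that are tumor cells
    total_cn:     total copy number of the locus in tumor cells
    multiplicity: mutated copies per mutated tumor cell (depends on whether
                  the mutation arose before or after a gain, and on which
                  parental copy it sits)
    ccf:          fraction of tumor cells carrying the mutation
    """
    # Average number of copies of the locus per cell in the mixed sample.
    copies_per_cell = purity * total_cn + (1.0 - purity) * normal_cn
    return purity * ccf * multiplicity / copies_per_cell


def ccf_from_vaf(vaf, purity, total_cn, multiplicity, normal_cn=2):
    """Invert expected_vaf to estimate the fraction of tumor cells mutated."""
    copies_per_cell = purity * total_cn + (1.0 - purity) * normal_cn
    return vaf * copies_per_cell / (purity * multiplicity)


# Same clonal mutation in a 3-copy region of a 60%-pure sample:
late = expected_vaf(purity=0.6, total_cn=3, multiplicity=1)   # arose after the gain
early = expected_vaf(purity=0.6, total_cn=3, multiplicity=2)  # duplicated by the gain
print(round(late, 3), round(early, 3))
```

The two calls differ only in the relative timing of the mutation and the gain, yet the expected VAF doubles, which is why a frequency read off bulk data cannot be interpreted as subclonality without accounting for copy number state, timing, and phase.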