Information Theory of Mixed Population Genome-Wide Association Studies

01 January 2018


A Genome-Wide Association Study (GWAS) addresses the problem of associating subsequences of individuals' genomes with observable characteristics called phenotypes. In a genome of length G, each characteristic is observed to be related only to a specific subsequence of length L, called the causal subsequence. The objective is to recover the causal subsequence from a dataset of N individuals' genomes and their observed characteristics. Recently, the problem was investigated from an information-theoretic point of view in [1], where the capacity of the problem was characterized and reliable learning of the causal subsequence was shown to exhibit a threshold effect at $\displaystyle \frac{G\,h(L/G)}{N}$, where h denotes the binary entropy function. However, [1] assumes that the dataset is collected from a single population and does not consider mixed-population datasets, which arise in many practical settings. In this paper, we study the mixed-population version of GWAS, in which the dataset is gathered from K subpopulations rather than one. Each subpopulation has its own causal subsequence for the observed characteristic, and the subpopulation origins of individuals are latent. The objective is to recover all the causal subsequences with high accuracy. We investigate the fundamental limits of mixed-population GWAS and characterize its capacity. For a special class of two subpopulations, the capacity is one-fourth of the capacity of the unmixed-population case with the same parameters. The capacity of this problem is also connected to the capacity region of the Multiple Access Channel (MAC).
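As a rough numerical illustration, the threshold quantity G h(L/G)/N from [1] can be evaluated for given parameters. The sketch below is not taken from the paper: the function names `binary_entropy` and `threshold_ratio` are hypothetical, the entropy is assumed to be measured in bits (base-2 logarithm), and the example values of G, L, and N are arbitrary.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h(p) in bits (base-2 logarithm is an assumption here)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        # By the usual convention, 0 * log(0) is taken to be 0.
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def threshold_ratio(G: int, L: int, N: int) -> float:
    """Evaluate G * h(L / G) / N for genome length G, causal length L, N individuals."""
    if G <= 0 or N <= 0 or not 0 <= L <= G:
        raise ValueError("require G > 0, N > 0 and 0 <= L <= G")
    return G * binary_entropy(L / G) / N


if __name__ == "__main__":
    # Arbitrary illustrative values, not data from the paper.
    print(threshold_ratio(G=1000, L=10, N=200))
```

The function only computes the quantity at which the threshold effect is stated to occur; how it compares against the capacity, and the one-fourth factor for the two-subpopulation case, are established in the paper itself and are not modeled here.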