Skip to main content
American Journal of Human Genetics logoLink to American Journal of Human Genetics
. 2016 Mar 24;98(4):653–666. doi: 10.1016/j.ajhg.2016.02.012

Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies via Logistic Mixed Models

Han Chen 1,8, Chaolong Wang 1,2,8, Matthew P Conomos 3, Adrienne M Stilp 3, Zilin Li 1,4, Tamar Sofer 3, Adam A Szpiro 3, Wei Chen 5, John M Brehm 5, Juan C Celedón 5, Susan Redline 6, George J Papanicolaou 7, Timothy A Thornton 3, Cathy C Laurie 3, Kenneth Rice 3, Xihong Lin 1,
PMCID: PMC4833218  PMID: 27018471

Abstract

Linear mixed models (LMMs) are widely used in genome-wide association studies (GWASs) to account for population structure and relatedness, for both continuous and binary traits. Motivated by the failure of LMMs to control type I errors in a GWAS of asthma, a binary trait, we show that LMMs are generally inappropriate for analyzing binary traits when population stratification leads to violation of the LMM’s constant-residual variance assumption. To overcome this problem, we develop a computationally efficient logistic mixed model approach for genome-wide analysis of binary traits, the generalized linear mixed model association test (GMMAT). This approach fits a logistic mixed model once per GWAS and performs score tests under the null hypothesis of no association between a binary trait and individual genetic variants. We show in simulation studies and real data analysis that GMMAT effectively controls for population structure and relatedness when analyzing binary traits in a wide variety of study designs.

Introduction

Population stratification is a major source of confounding in genetic association studies.1 With the recent development of computationally efficient algorithms, linear mixed models (LMMs) have become popular in genome-wide association studies (GWASs) for controlling population stratification, as well as familial or cryptic relatedness.2, 3, 4, 5, 6, 7, 8, 9, 10 However, in human genetics, GWASs are often conducted using binary traits; for example, case-control studies sample disease subjects (case subjects) and disease-free subjects (control subjects) and collect their genotype and exposure data retrospectively. Cohort studies follow a group of healthy subjects prospectively and collect their subsequent status evaluations with respect to the binary disease outcome. Despite the popularity of the use of LMMs in binary trait GWASs, their reliance on a generally invalid assumption appears to have been overlooked. Specifically, as typically used, LMMs assume that the trait has constant residual variance, which is usually violated by binary traits in the presence of covariates. As a consequence, we will show that in the presence of population stratification, fitting LMMs to binary traits can fail to control for type I error rates and yield incorrect p value estimates. Typical use of LMMs also ignores the biased sampling in case-control studies, which leads to biased effect estimates.

Our study of LMMs for binary traits was motivated by analyses of the binary trait asthma in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), originally performed using LMMs with three random effects to account for genetic relatedness as well as household and block group membership from its multi-stage sampling design. For asthma, ancestry is a known confounder of genetic associations in Hispanic/Latino populations, and in HCHS/SOL the proportion of asthma cases varies widely by ancestry group (e.g., 25.6% in Puerto Ricans versus 4.1% in South Americans, see Table 1).11 Despite regression adjustment for ancestry by including principal components (PCs)12, 13 and indicators for ancestry group as covariates, the LMM results for this trait appear invalid (Figure 1), showing clear conservatism/anti-conservatism for variants with the minor allele that is respectively less common/more common in Puerto Rican heritage versus all other ancestry groups.

Table 1.

Numbers of Asthma Case Subjects and Control Subjects in Six Hispanic/Latino Background Groups in HCHS/SOL after Quality Control of Samples

Group Case Subjects Control Subjects Sample Size Proportion of Case Subjects Trait Variance
Central American 55 1,173 1,228 4.5% 0.043
Cuban 182 1,722 1,904 9.6% 0.086
Dominican 99 933 1,032 9.6% 0.087
Mexican 172 4,189 4,361 3.9% 0.038
Puerto Rican 478 1,390 1,868 25.6% 0.190
South American 34 795 829 4.1% 0.039

Trait variance is calculated as the product of proportions of case and control subjects.

Figure 1.

Figure 1

Quantile-Quantile Plot of Association Test p Values from the Asthma GWAS Analysis in HCHS/SOL

(A) All SNPs.

(B) Category 1: SNPs with the ratio of expected variances in Puerto Ricans over non-Puerto Ricans less than 0.8.

(C) Category 2: SNPs with the ratio of expected variances in Puerto Ricans over non-Puerto Ricans between 0.8 and 1.25.

(D) Category 3: SNPs with the ratio of expected variances in Puerto Ricans over non-Puerto Ricans greater than 1.25.

Abbreviations are as follows: LMM, a joint analysis using LMM on the combined samples; LMM meta, an inverse-variance weighted fixed effects meta-analysis approach to combine LMM results from analyzing Puerto Ricans and non-Puerto Ricans separately.

Based on published case-control GWASs,14, 15, 16, 17 this concerning behavior does not appear to be well known. Users of LMMs for binary traits have appealed to Pirinen et al.,9 who showed that linear models are a sufficiently close approximation to logistic models for p value calculations when the effect size of a genetic variant is small and in the absence of population stratification. We will show that in the presence of population stratification, fitting LMMs to binary traits in both cohort and case-control studies can yield incorrect type I error rates in GWASs due to heteroscedasticity, that is, residual variances in a model that are not constant with respect to different values of covariates.18 Like linear regression, LMMs assume homoscedasticity, that is, residual variances are constant regardless of covariate values (Figure 2).19

Figure 2.

Figure 2

True Mean-Variance Relationship for a Binary Trait and the Constant Mean-Variance Relationship Assumed by Linear Models, Illustrated by the Example from the Asthma Data in HCHS/SOL

For a binary trait with the mean π, its variance is π(1 − π), which varies with the mean. This heteroscedasticity is properly accounted for by logistic regression. Linear models inappropriately assume that the variance of the binary trait does not change with the mean and is a constant (homoscedasticity). For example, the variance of the binary trait (asthma status) in Puerto Ricans is considerably larger than the variances in the other five populations, because Puerto Ricans have a much higher asthma disease proportion than the other populations. This heteroscedasticity caused by population stratification results in the p values calculated from LMMs being likely to be incorrect, but is properly taken into account by logistic mixed models using GMMAT.

Recently, liability threshold linear mixed models have been proposed for case-control studies.20, 21 Fitting these models require knowing disease prevalence and accurate heritability and liability estimates in the underlying study population, which might not be available or might be difficult to specify in practice, especially when disease prevalence differs between subpopulation groups. For example, in the HCHS/SOL, the asthma prevalence rates were different in different ancestry background groups;11 in case-control studies in the presence of population stratification, it is difficult to specify the disease prevalence that can be used for analysis when underlying subpopulation groups are unknown and the disease prevalence rates vary between underlying subpopulations. We will show that these methods can fail to control for type I error rates in the presence of moderate or strong population stratification.

To address these issues, we propose using logistic mixed models22 to account for both population stratification and relatedness in genetic association studies of binary traits, while naturally allowing for the non-constant variance of these traits. Because logistic mixed models are computationally more expensive than LMMs and regular logistic models, they have not been used in large-scale GWASs. Although SAS PROC GLIMMIX allows for fitting logistic mixed models with a genetic relationship matrix, it is not feasible for large-scale GWASs because of its computational burden associated with fitting a large number of logistic mixed models, one per variant, across the genome. We develop the generalized linear mixed model association test (GMMAT), which requires fitting a logistic mixed model under the null hypothesis only once per GWAS, and is hence computationally feasible for large-scale GWASs.

Specifically, GMMAT first fits the null logistic mixed model that includes as fixed effects only covariates, including ancestry PCs, but no individual genetic variants, and uses random effects to account for residual population stratification not captured by fixed effects PCs as well as relatedness. When fitting this null model, GMMAT uses penalized quasi-likelihood (PQL)22 and the computationally efficient average information restricted maximum likelihood (AI-REML) algorithm.6, 23 This fitted null model is the same for all genetic variants in a GWAS. GMMAT then applies a score test24 for each individual genetic variant to test for its association with a binary trait. The computational cost of the score test for each genetic variant is negligible compared to the cost of fitting the null logistic mixed model, so the procedure is computationally fast, even for large-scale GWASs.

As a full-modeling approach, GMMAT accounts for the binary nature of the trait, in particular its non-constant variance in the presence of covariates, and so correctly controls type I error rates in the presence of population stratification and relatedness. GMMAT can also allow for complex sampling designs such as hierarchical designs and allow for unobserved shared environmental effects among sampled individuals by incorporating multiple random effects.

Material and Methods

Logistic Mixed Models and Score Tests

For a single-variant test, we consider the following logistic mixed model:

logit(πi)=Xiα+Giβ+bi,

where πi=P(yi=1|Xi,Gi,bi) is the probability of a binary phenotype (e.g., disease status) for subject i, conditional on their covariates, genotype, and random effects bi, Xi is a 1 × p row vector of covariates for subject i, α is a p × 1 column vector of fixed covariate effects including an intercept, Gi is the genotype of a genetic variant for subject i, and β is the genotype effect. We assume that bN(0,k=1KτkVk) is an n × 1 column vector of random effects, where τk are the variance component parameters and Vk are known n × n relatedness matrices. When the number of variance components K = 1, V1 is usually the genetic relationship matrix estimated from a large number of genetic variants. We allow for multiple random effects to account for complex sampling designs, e.g., hierarchical designs, and environmental covariance structure. The binary phenotypes yi are assumed to be independent conditional on the random effects b.

To perform the score test for the null hypothesis H0: β = 0, we need to fit the null logistic mixed model, which is the same for all genetic variants, as

logit(πi0)=Xiα+bi, (Equation 1)

where πi0=P(yi=1|Xi,bi). We fit Equation 1 using the penalized quasi-likelihood (PQL) method.22 Specifically, let W=diag{vi0}, where vi0=πi0(1πi0), X=(X1TX2TXnT)T be an n × p covariate matrix including an intercept, and Y˜ be the “working vector” with components Y˜i=Xiα+bi+vi01(yiπi0). Under the null hypothesis H0: β = 0, we iteratively fit the working linear mixed model Y˜=Xα+b+ε where εN(0,W1). We use the computationally efficient AI-REML algorithm6, 23 to estimate τˆk. After obtaining the variance component estimates, the estimates of the fixed effects αˆ and random effects bˆ can be updated, followed by the working vector Y˜. The process continues until convergence.

At each iteration, we perform a matrix inversion based on Cholesky decomposition with complexity of O(n3) and matrix multiplications with complexity of O(pn2 + K2n2 + p2n), where n is the sample size, p is the number of covariates, and K is the number of variance components. Thus, the total complexity of fitting a logistic mixed model with K known relatedness matrices is O(in3 + ipn2 + iK2n2 + ip2n), where i is the number of iterations required to reach convergence.

The score for H0: β = 0 is T=GT(yπˆ0), where G=(G1G2Gn)T is the n × 1 column vector of genotypes, y=(y1y2yn)T is the n × 1 column vector of outcomes, and πˆ0 is a vector of fitted values under H0, which is the same for all SNPs. The estimated variance of the score is Var(T|H0)=GTPˆG under the null hypothesis, where Pˆ=Σˆ1Σˆ1X(XTΣˆ1X)1XTΣˆ1, and Σˆ=Wˆ1+k=1KτˆkVk. The test for each genetic variant involves a vector-matrix-vector multiplication and an inner product calculation for two vectors, and thus the score test step has complexity of O(qn2 + qn), where q is the total number of genetic variants tested. Also, the score test step can be easily parallelized if necessary. We use the C++ library Armadillo to perform matrix and vector calculations. More details about fitting the logistic mixed model and performing the score test are given in Appendix A.

HCHS/SOL Genotyping and Statistical Analysis

HCHS/SOL subjects who consented to genetic studies had DNA extracted from blood. These DNA samples were genotyped on the SOL HCHS Custom 15041502 B3 array (custom content designed and developed by Papanicolaou, Rotter, and Taylor) at Illumina Microarray Services. This array comprised the Illumina Omni 2.5M array (HumanOmni2.5-8v1-1) and additional custom content selected for HCHS/SOL, including ancestry-informative markers, variants characteristic of Amerindians, and known GWAS hits.25 Illumina Microarray Services, LA Biomed, and the SOL Genetic Analysis Center (GAC) performed quality control to generate recommended SNP- and sample-level quality filters. Samples were checked for annotated versus genetic sex, gross chromosomal anomalies, relatedness and population structure, missing call rates, batch effects, duplicate sample discordance, and Mendelian errors. At the SNP level, checks were performed for Hardy-Weinberg equilibrium, minor allele frequency (MAF), duplicate probe discordance, and missing call rate.

Study participants were recruited via a multi-stage survey sampling procedure, for which individuals were sampled within households that were sampled within block groups.26 The study includes genetic data from 12,803 individuals of Hispanic/Latino origin, belonging to six different Hispanic/Latino background groups.25 The HCHS/SOL study was approved by institutional review boards at participating institutions, and written informed consent was obtained from all participants. Standard quality control, similar to that described by Laurie et al.,27 was performed to filter SNPs and samples of poor quality. Additionally, samples missing information on asthma diagnosis and 56 samples identified as ancestry outliers from principal-component analysis were removed, leaving a sample size of 11,222 for analysis. We also filtered SNPs with a MAF less than 5% in this overall sample, resulting in a set of 1,299,221 autosomal SNPs to be analyzed. The proportions of asthma cases in analyzed participants within each of the genetic background groups of Central American, Cuban, Dominican, Mexican, Puerto Rican, and South American descent are 4.5%, 9.6%, 9.6%, 3.9%, 25.6%, and 4.1%, respectively (Table 1). The Puerto Rican group is a clear outlier from the rest, so we compared it to the collection of all other background groups, which has a combined proportion of 5.8%. To account for potential shared environment effects resulting from the sampling design, relatedness matrices representing household and block group membership were included in addition to the standardized genetic relationship matrix, totaling three random effects for the linear and logistic mixed models. The top five PCs, as well as Hispanic/Latino background group indicators, were used to adjust for ancestry in both models. Additional fixed effects covariates included field center, sex, age, cigarette use, cigarette pack years, and sampling weight (i.e., weights to account for disproportionate selection of the sample according to the sample survey design26). Treatment of sampling weight as a fixed effect in this way can effectively adjust for the marginal effect of design variables.28

Simulation Studies

We perform coalescent simulations29 to generate genotypes for a total of 8,000 founders with 1,000,000 independent SNPs from a 20 × 20 grid (Figure S1) to mimic spatially continuous populations (20 individuals per cell) with migration rate between adjacent cells M = 10 to represent population structure within Europe.30, 31 To simulate genotypes of an offspring cohort, we first sampled (without replacement) 10 pairs of parents for each cell of the 20 × 20 grid from the original cohort and then simulated two children for each family using the gene dropping algorithm,32 thus obtaining 8,000 individuals in the offspring cohort. We combined the two cohorts to get 16,000 individuals with both population structure and relatedness. For individual j in family i, the probability of being a case πij was calculated from

logit(πij)=α0+α1Zi+aij,

where Zi = 1 if family i was from a 10 × 10 grid in the top left, and Zi = 0 otherwise. The parameters α0 and α1 were chosen such that the disease prevalence was 0.28 in the high-risk population group in the top left given no random effects and 0.05 in the low-risk population group given no random effects. The random effects were simulated as

ai=(ai1ai2ai3ai4)N((0000),σ2(100.50.5010.50.50.50.510.50.50.50.51)),

where the variance component parameter σ2 was set to 2. We randomly sampled 10,000 individuals from the combined original and offspring cohorts to form a simulated cohort study, and their standardized genetic relationship matrix was calculated using 625,583 genetic variants with MAF greater than 5% in the founder population. We removed family indicators and compared linear and logistic mixed models using the genetic relationship matrix. We included the top ten ancestry PCs as covariates in both models. We analyzed common genetic variants with MAF greater than 5%. In the simulated case-control study for rare disease, we chose α0 and α1 such that the disease prevalence was 0.045 in the high-risk population group in the top left given no random effects and 0.005 in the low-risk population group given no random effects. We randomly sampled 1,667 case subjects and 8,333 control subjects from the combined original and offspring cohorts to form a case-control study with a total sample size of 10,000 and a case-control ratio of 1:5. We performed the same analysis as described above.

Results

Analysis of HCHS/SOL Asthma GWAS Data

We compared the results from LMMs and logistic mixed models using GMMAT for an analysis of physician-diagnosed asthma in the population-based HCHS/SOL cohort study.

Figure 1A shows the behavior of the overall quantile-quantile (QQ) plots from each method in the asthma analysis. The inflated results from the LMM are caused by violation of its constant residual variance (homoscedasticity) assumption. For binary traits, population stratification affects both population-specific means (disease prevalences in cohort studies) and variances of the trait; population groups with disease prevalence closer to 0.5 have larger variances (Table 1 and Figure 2). The mean-variance relationship assumed in linear models and LMMs is misspecified for binary traits. Although confounding by population structure can be accounted for by adjusting for population groups or ancestry PCs, unequal binary trait variances across different populations are not appropriately modeled in linear models and LMMs. The variance of the standard LMM-based test statistic for no genetic association is underestimated when population groups with larger binary trait variances (i.e., higher disease risk) also have higher MAFs, thus larger genotypic variances. This leads to inflation in the test statistic, and vice versa.

To demonstrate this, we partitioned all common SNPs across the genome into three categories based on their genotypic variances in Puerto Ricans and non-Puerto Ricans: (1) the SNPs with the ratio of expected variances, that is Var(SNP) = 2MAF(1MAF), in Puerto Ricans (high-risk group) over non-Puerto Ricans (low-risk group) less than 0.8; (2) the SNPs with the ratio of Var(SNP) in Puerto Ricans over non-Puerto Ricans between 0.8 and 1.25; and (3) the SNPs with the ratio of Var(SNP) in Puerto Ricans over non-Puerto Ricans greater than 1.25. In this classification, the category 1 SNPs (n = 144,815 [11%]) have appreciably lower MAFs in Puerto Ricans (high risk) than non-Puerto Ricans (low risk). The category 2 SNPs have similar MAFs in both groups (n = 982,805 [76%]). The category 3 SNPs (n = 171,601 [13%]) have appreciably higher MAFs in Puerto Ricans (high risk) than non-Puerto Ricans (low risk). Both LMMs and GMMAT perform well for category 2 SNPs (Figure 1C). The problem with LMMs is more apparent for category 1 and 3 SNPs: Figure 1B shows conservative p values for category 1 SNPs, and Figure 1D shows anti-conservative p values for category 3 SNPs, as expected. This indicates the p values calculated from LMMs are likely to be incorrect for at least 24% of SNPs in HCHS/SOL asthma GWASs. Meta-analysis by analyzing Puerto Ricans and non-Puerto Ricans separately after removing related individuals across the two groups improves the LMM performance, but the results are still not well calibrated for category 1 and 3 SNPs (Figures 1B and 1D), probably due to the heteroscedasticity issue caused by within-group population stratification. In contrast, GMMAT performs well for all categories of SNPs considered.

Simulation Studies

We also conducted extensive simulation studies under the null hypothesis to illustrate this issue in various GWAS case-control and cohort study designs, to exclude the possibility that LMM-inflated results in HCHS/SOL were caused by polygenic effects. We restricted analyses to common SNPs with a MAF greater than 5%. First, in a simulated cohort study of 10,000 individuals with cryptic relatedness, we simulated two population groups with disease prevalences of 28% (population 1) and 5% (population 2), respectively, from a map of spatially continuous populations. The disease prevalences were chosen to mimic the asthma disease proportions of Puerto Ricans and non-Puerto Ricans in HCHS/SOL (Table 1). In both LMM and GMMAT, we adjusted for the top ten ancestry PCs as fixed effects covariates. Figure 3A includes results from 3,200 null simulated datasets with 625,583 SNPs and 10,000 related subjects per dataset. LMM has a type I error rate of 1.26 × 10−7 at the nominal 5 × 10−8 level, compared to 5.0 × 10−8 for GMMAT. Note that because more than 2 billion p values are included, type I error rate estimates at this level are accurate with standard error 5 × 10−9.

Figure 3.

Figure 3

A Simulated Cohort Study with 10,000 Related Individuals

Quantile-quantile plots of association test p values from 3,200 simulation replicates under the null hypothesis of no genetic association, each with 625,583 common SNPs, were combined to get more than 2 billion null p values.

(A) All SNPs.

(B) Category 1: SNPs with the ratio of expected variances in population 1 (high risk) over population 2 (low risk) less than 0.8.

(C) Category 2: SNPs with the ratio of expected variances in population 1 (high risk) over population 2 (low risk) between 0.8 and 1.25.

(D) Category 3: SNPs with the ratio of expected variances in population 1 (high risk) over population 2 (low risk) greater than 1.25.

Following the HCHS/SOL example, we partitioned all the common SNPs into three categories: (1) the SNPs with the ratio of expected variances in population 1 (high risk) over population 2 (low risk) less than 0.8; (2) the SNPs with the ratio of expected variances in population 1 over population 2 between 0.8 and 1.25; and (3) the SNPs with the ratio of expected variances in population 1 over population 2 greater than 1.25. Category 1 includes the SNPs that have appreciably lower MAFs in population 1 than population 2 (24%). Category 2 includes the SNPs that have comparable MAFs in the two populations (58%). Category 3 includes the SNPs that have appreciably higher MAFs in population 1 than population 2 (18%). For category 2 SNPs, both LMM and GMMAT properly control for type I error rates (Figure 3C). However, despite adjusting for ancestry PCs, LMMs have deflated type I error rates for category 1 SNPs (Figure 3B) and inflated type I error rates for category 3 SNPs (Figure 3D). LMMs often fail to control type I error rates in the presence of moderate or strong population stratification when differences in prevalence by population groups cause large differences in binary trait variances between populations, as demonstrated by our simulation studies that were motivated by the HCHS/SOL study. However, this issue is not always evident in the overall QQ plot using all the SNPs in a GWAS, because the number of SNPs in a typical GWAS is usually in the range of 105 and 106, confidence intervals are relatively wide, and inflation and deflation are likely to mask each other in the overall QQ plot.

To illustrate this, Figure S2 shows results from only one simulated dataset with 625,583 common SNPs and 10,000 subjects in the same setting as Figure 3. There is a slight indication of inflation for p values from LMMs in Figure S2A, but the results are conservative for category 1 SNPs (n = 151,206 [24%]) (Figure S2B) and anti-conservative for category 3 SNPs (Figure S2D) (n = 111,455 [18%]). These results show that the p values from LMMs might be incorrect for at least 42% of SNPs in this analysis, even though inflation and deflation might not be obvious in the overall QQ plot.

We next simulated a case-control study with a total sample size of 10,000 and a case-control ratio of 1:5 from the same two populations as in our simulated cohort study in the presence of cryptic relatedness. Compared to cohort studies, case-control studies oversample cases. The disease prevalence was 4.5% and 0.5% in population 1 (high risk) and population 2 (low risk), respectively. Figure S3A shows that GMMAT works well according to the QQ plot of p values. The genomic control inflation factors show that both LMM and GMMAT have appropriate median p values. However, type I error rates at the nominal level of 5 × 10−8 are 1.75 × 10−7 for LMM and 4.6 × 10−8 for GMMAT, respectively. Moreover, when we divided the SNPs into three categories as we previously described, we observed strong deflation for category 1 SNPs (Figure S3B) and strong inflation for category 3 SNPs (Figure S3D) from LMM analysis, as seen in the simulated cohort study. Additional simulation studies show that the same problem for LMMs also exists in cohort studies without cryptic relatedness in the presence of population stratification (Figure S4). Although confounding by discrete populations can be appropriately accounted for by analyzing each population separately via LMMs followed by meta-analysis (Figure S5), this strategy does not perform well in the presence of confounding by continuous population structure (Figure S6). Moreover, LMMs do not work well in case-control studies of unrelated individuals with moderate to strong population stratification (Figure S7). These suggest that despite its wide use in both population-based and family-based genetic association studies, LMMs are generally not appropriate for binary traits due to misspecified phenotype variance in the model, probably yielding incorrect p values (see Appendix B for details).

We also performed additional simulations to compare GMMAT with ROADTRIPS, which performs a retrospective test for association in case-control data,33 and the recently developed liability estimator as a phenotype (LEAP) approach. LEAP fits a liability mixed model that accounts for case-control ascertainment.20 In the same case-control setting as that described for Figure S7, ROADTRIPS was not well calibrated due to failure to fully account for population stratification. LEAP was found to have well-behaved overall QQ plots but inflated type I error rates for category 1 SNPs, regardless of whether top ten ancestry PCs were adjusted as covariates or not (Figure S8).

We next simulated case-control studies with the disease prevalence of 1% by varying the case-control sampling ratios in two population groups. We first considered balanced cases and controls in the two populations. Specifically, when the case-control ratio is 1:1 in both groups, there is no population stratification, and ROADTRIPS, LEAP, LMMs, and GMMAT all properly control for type I error rates (Figure S9). When the case-control ratios are different but flipped in two populations (4:1 in population 1 and 1:4 in population 2), there is population stratification. However, because the variance of the binary trait is the same in both population groups in this situation, LMMs and GMMAT both perform well, whereas ROADTRIPS has inflated type I error rates in the tail and LEAP has an appropriate overall QQ plots but inflated or deflated type I error rates for the SNPs whose MAFs are different in the two population groups (Figure S10).

We also considered a situation where the case-control ratios are different in two populations (25:2 in population 1 and 25:48 in population 2) in a way that led to a smaller variance of the binary trait in population 1 than in population 2. ROADTRIPS and LEAP adjusting for the top ten PCs show inflation in the overall QQ plot, and LEAP without covariates and LMMs have inflated type I error rates for category 1 SNPs and deflated type I error rates for category 3 SNPs (Figure S11). In terms of required computational resources, LEAP requires more than 20 times the amount of memory compared to GMMAT for an analysis of a case-control study of sample size 10,000.

We also performed simulation studies to evaluate the performance of GMMAT in estimating odds ratios of genetic variants. We found that GMMAT had minor bias in estimating odds ratios when fitting logistic mixed models under the null and alternative hypotheses, compared to the true parameter values in large samples (n = 10,000). In addition, the performance in terms of absolute bias appears to be similar to that using MACAU,34 a recently developed Markov Chain Monte Carlo-based mixed model approach for binomial count data, while the GMMAT odds ratio estimates are less variable (Figure S12).

In the absence of population stratification, LMMs and GMMAT have comparable power, but they are both less powerful than logistic regression (Table S1). In the presence of population stratification, LMMs have less power than GMMAT for causal genetic variants with a lower MAF in the high-risk population than in the low-risk population (Table S1), due to its conservative type I error rate control for such variants, as shown in Figure S4B.

Computational Speed and Memory Usage

We benchmarked our GMMAT package against SAS PROC GLIMMIX regarding computational speed and memory usage. To fit a null model with sample size 2,000, GMMAT takes less than 1.5% of the time required by SAS PROC GLIMMIX when fitting a logistic mixed model with one variance component, and less than 0.6% of the time with three variance components (Table S2), yielding the same numerical results to at least the fourth decimal place. On average, with one variance component of random effects, as is commonly used to account for genetic relatedness in GWASs, SAS PROC GLIMMIX requires about 28 min to fit a null model on a single core of an Intel Xeon E5-2690 CPU (2.90 GHz), compared to about 22 s using GMMAT. With three variance components of random effects that account for complex sampling designs in addition to genetic relatedness, SAS PROC GLIMMIX requires about 1.2 hr, whereas GMMAT needs still about 22 s. It takes about 14 min for GMMAT to perform score tests for 1,000,000 genetic variants on the same core, without parallelization. In practice, score tests for different variants can be easily run in parallel in a computing cluster, and testing each genetic variant takes only about 0.8 ms.

GMMAT requires less than 1 GB memory in analyzing 2,000 individuals. With sample size 10,000, it takes about 18 min to fit the null model with one variance component and 34 min with three variance components using GMMAT, and about 3.6 hr to perform score tests for 1,000,000 genetic variants on a single core (about 13 ms for testing each genetic variant) (Table S2). SAS PROC GLIMMIX reports insufficient memory when 100 GB memory is specified to fit the null model for 10,000 individuals, whereas GMMAT requires less than 14 GB memory for one variance component and 21 GB for three variance components.

Discussion

We have proposed in this paper using logistic mixed models to correct for population stratification and relatedness in analyzing binary traits in GWASs. The proposed GMMAT performs computationally efficient score tests for genetic associations in cohort and case-control GWASs with binary traits. We demonstrate that GMMAT is effective in controlling type I error rates. In contrast, even when ancestry terms are included as covariates, applying LMMs to binary traits can lead to incorrect type I error rates in the presence of population stratification, particularly when population groups have heterogeneous disease risks or case-control ratios that result in different binary trait variances. In such scenarios, LMMs are approximately valid only when the MAF of the genetic variant being tested for association is roughly the same in all population groups, i.e., in the presence of no or weak confounding by population structure. The homoscedasticity assumption underlying standard LMMs is essential but has largely been ignored in previous genetic association studies that used LMMs for binary traits. Despite the widespread belief that LMMs can be used to account for population stratification for both continuous and binary traits, our results show that LMMs can lead to incorrect type I error rates and p values in the presence of population stratification and relatedness. Recently, Conomos et al.25 showed that violation of the homoscedasticity assumption could also happen for continuous traits, and a model allowing for group-specific residual variances outperformed standard LMMs in such scenarios. For binary traits in discrete population groups, when there is no or weak confounding by population structure within each group, we have shown that meta-analysis is a valid and effective approach for combining results from analyzing each group separately using LMMs (Figure S5). Moreover, we can use logistic regression to analyze homogeneous unrelated samples, which is more powerful than LMMs and GMMAT (Table S1). In reality, however, population groups with no or weak within-group confounding might not always be clearly defined in large-scale genetic association studies, especially for admixed populations. When within-group population stratification remains, we have also shown that a meta-analysis approach using LMMs is still mis-calibrated similarly to the standard LMM approach (Figure S6). Furthermore, it is often difficult to check the validity of LMMs by comparing group-specific binary trait variances and MAFs. In contrast, such checking is not required by GMMAT when fitting logistic mixed models.

In practice, QQ plots have been widely used for model diagnostics in GWASs. Our results show that a well-behaved QQ plot is not sufficient to identify invalid results due to model misspecification. Specifically, an overall QQ plot can appear to show LMMs properly controlling type I error rates for binary traits (Figure S2A), but this might just reflect balancing out of conservative p values and anti-conservative p values of different subsets of SNPs (categories 1 and 3) (Figures S2B and S2D). It is important to note that these errors (incorrect p values) do not “cancel out;” spuriously significant results, i.e., type I errors, are not “corrected” by omission of true signals, i.e., type II errors. More generally, looking just at the overall QQ plot of all SNPs can lead to unwarranted beliefs that analyses (e.g., use of standard LMMs for binary traits) are appropriate, when they can have serious deficiencies for large proportions of the results.

In case-control studies, which oversample cases, LMMs are subject to incorrect type I error rates due to unequal variances of binary traits caused by population stratification introduced by unequal case-control ratios from different sampling schemes across population groups, even if the disease prevalence is the same in all the subpopulations and the overall case-control ratio is 1:1 (Figure S11). Because ROADTRIPS currently does not allow for covariate adjustment, it does not work well in the presence of moderate to strong population stratification.35 The recently developed liability mixed models20, 21 require accurate estimation of the disease prevalence as well as heritability and liability in the underlying overall population, which can be difficult to obtain in practice in the presence of population stratification and unknown subpopulation groups. These models are generally applicable in the presence of no or weak population stratification, e.g., when the case-control ratios are the same across populations, but can fail to control for type I error rates in the presence of moderate or strong population stratification, e.g., when the case-control ratios differ between populations. They also currently cannot handle multiple random effects in addition to those accounting for genetic relatedness, such as household and block groups in HCHS/SOL. The liability threshold mixed linear model (LTMLM) approach is applicable only to population-based case-control study designs with no confounders and low levels of relatedness, because it cannot adjust for covariates or handle family data.21 The LEAP method can accommodate covariates and be applicable to family data, but inclusion of covariates presents both technical and statistical challenges.20 GMMAT provides a flexible method that does not require knowing disease prevalence or heritability and liability estimates, and provides valid p values while properly controlling for type I error rates.

Recently, Song et al.36 proposed a genotype-conditional association test that accounts for population structure in association tests. However, as pointed out by the authors, their approach does not account for family or cryptic relatedness. In contrast, our logistic mixed model approach is more flexible and can account for both population structure and relatedness in population-based and family-based cohort and case-control studies, as well as complex sampling designs (as illustrated in our HCHS/SOL asthma example). Therefore, our approach can be applied to a much wider range of genetic association studies with family data, cryptic relatedness, unobserved shared environmental effects, and non-random sampling study designs, in addition to population stratification, without the need to model them in different ways.

We provide an open-source R package GMMAT for fitting logistic mixed models and performing score-based tests in GWASs. The package can also be applied to other types of continuous and discrete traits in the general framework of generalized linear models37 that allow for different link functions and different mean-variance relationships. Furthermore, the score statistics obtained from different studies of the same disease can be easily combined in meta-analysis.38

The recently proposed MACAU algorithm34 implemented Wald tests in mixed models for binomial count data using a Markov Chain Monte Carlo-based approach. However, performing Wald and likelihood ratio tests for a large number of individual variants via logistic mixed models is currently computationally impractical for moderate- to large-sample GWASs and would require future research on developing efficient algorithms. Unlike linear mixed models, logistic mixed models can be directly used to estimate odds ratios by fitting the models under the alternative hypothesis. It is computationally feasible to estimate odds ratios by fitting alternative logistic mixed models for a subset of candidate genetic variants of interest. If computational issues can be resolved in the future, logistic mixed models can also be useful for risk prediction in GWASs.

Acknowledgments

This work was supported by grants P01 CA134294, R35 CA197449, and R01 HL113338 (to H.C., Z.L., and X.L.), K99 HL130593 (to H.C.), and R37 CA076404 (to C.W., Z.L., and X.L.). Z.L. was also supported in part by a scholarship from the China Scholarship Council. We thank the participants and staff of the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) for their contributions to this study. The baseline examination of HCHS/SOL was carried out as a collaborative study supported by contracts from the National Heart, Lung, and Blood Institute (NHLBI) to the University of North Carolina (N01-HC65233), University of Miami (N01-HC65234), Albert Einstein College of Medicine (N01-HC65235), Northwestern University (N01-HC65236), and San Diego State University (N01-HC65237). The following institutes, centers, and offices contributed to the first phase of HCHS/SOL through a transfer of funds to the NHLBI: National Institute on Minority Health and Health Disparities, National Institute on Deafness and Other Communication Disorders, National Institute of Dental and Craniofacial Research (NIDCR), National Institute of Diabetes and Digestive and Kidney Diseases, National Institute of Neurological Disorders and Stroke, and NIH Office of Dietary Supplements. The Genetic Analysis Center at the University of Washington was supported by NHLBI and NIDCR contracts (HHSN268201300005C AM03 and MOD03). Genotyping efforts were supported by NHLBI (HSN26220/20054C), National Center for Advancing Translational Science Clinical Translational Science Institute (UL1TR000124), and NIDDK Diabetes Research Center (DK063491). This manuscript has been reviewed by the HCHS/SOL Publications Committee for scientific content and consistency of data interpretation with previous HCHS/SOL publications.

Published: March 24, 2016

Footnotes

Supplemental Data include 12 figures and 2 tables and can be found with this article online at http://dx.doi.org/10.1016/j.ajhg.2016.02.012.

Appendix A: Derivation of GMMAT

The Generalized Linear Mixed Model

The derivations below are based on generalized linear mixed models (GLMMs), and logistic mixed models are a special case of GLMMs when the link function is logit and the dispersion parameter is fixed at 1. In the context of single-variant test, we consider the following GLMM

ηi=g(μi)=Xiα+Giβ+bi,

where Xi is a 1 × p row vector of covariates for subject i, α is a p × 1 column vector of fixed covariate effects including the intercept, Gi is the genotype of the genetic variant of interest for subject i, and β is the fixed genotype effect. We assume that bN(0,k=1KτkVk) is an n × 1 column vector of random effects, τk are the variance component parameters, τ is a K × 1 column vector of τk, and Vk are known n × n matrices. We also assume that given the random effects b, the outcome yi is conditionally independent with mean E(yi|b)=μi and variance Var(yi|b)=ϕai1v(μi), where ϕ is the dispersion parameter (for binary and Poisson data ϕ=1), ai are known weights, and v() is the variance function. The linear predictor ηi is a monotonous function of the conditional mean μi via the link function ηi=g(μi). For binary traits yi, μi=πi=P(yi=1|Xi,Gi,bi) is the probability of the binary outcome (e.g., disease status) for subject i.

For subject i, the quasi-likelihood given random effects b is

qli(α,β;b)=yiμiai(yiμ)ϕv(μ)dμ.

The log integrated quasi-likelihood function of (α,β,ϕ,τ) is

ql(α,β,ϕ,τ)=logexp{i=1nqli(α,β;b)}(2π)n2|k=1KτkVk|12×exp{12bT(k=1KτkVk)1b}db. (Equation A1)

Let

f(b)=i=1nqli(α,β;b)12bT(k=1KτkVk)1b,

we can use Laplace method to approximate the n-dimensional integral

exp{f(b)}db(2π)n2|f(b˜)|12exp{f(b˜)},

thus Equation A1 becomes

ql(α,β,ϕ,τ)=12log|k=1KτkVk|12log|f(b˜)|+f(b˜), (Equation A2)

where

b˜=argmaxbf(b)

is the solution of f(b)=0.

The first partial derivative of qli(α,β;b) with respect to b is

qlib=qliμiμiηiηib=ai(yiμi)ϕv(μi)1g(μi)ZiT,

where Zi is a 1 × n vector of indicators such that bi=Zib, In=(Z1TZ2TZnT), and the second derivative is

2qlibbT=aiZiTZiϕv(μi)[g(μi)]2ai(yiμi)v(μi)ZiTZiϕv2(μi)[g(μi)]2ai(yiμi)g(μi)ZiTZiϕv(μi)[g(μi)]3.

For canonical link functions, v(μi)g(μi)=1, the last two terms become 0. Let

Δ=diag{g(μi)},
W=diag{aiϕv(μi)[g(μi)]2},

then Equation A2 becomes

ql(α,β,ϕ,τ)=12log|k=1KτkVk|12log|i=1naiZiTZiϕv(μi)[g(μi)]2+(k=1KτkVk)1|+i=1nqli(α,β;b˜)12b˜T(k=1KτkVk)1b˜=12log|k=1KτkVkW+I|+i=1nqli(α,β;b˜)12b˜T(k=1KτkVk)1b˜. (Equation A3)

We assume that the weight matrix W changes slowly with respect to the conditional means (following Breslow and Clayton22), that is

Wμi0,

then we take the derivatives of Equation A3:

ql(α,β,ϕ,τ)α=i=1nai(yiμi)ϕv(μi)1g(μi)XiT=XTWΔ(yμ),
ql(α,β,ϕ,τ)β=i=1nai(yiμi)ϕv(μi)1g(μi)Gi=GTWΔ(yμ), (Equation A4)
ql(α,β,ϕ,τ)b=i=1nai(yiμi)ϕv(μi)1g(μi)ZiT(k=1KτkVk)1b=WΔ(yμ)(k=1KτkVk)1b.

Under the null hypothesis H0:β=0, if ϕ and τ are known, we jointly choose αˆ(ϕ,τ) and bˆ(ϕ,τ) to maximize Equation A3, then bˆ(ϕ,τ)=b˜(αˆ(ϕ,τ),β=0) because b˜ maximizes f(b) for given (α,β). Defining the working vector Y˜ with elements Y˜i=ηi+g(μi)(yiμi), the solution of

{XTWΔ(yμ)=0WΔ(yμ)=(k=1KτkVk)1b

can be written as the solution to the system

[XTWXXTWWX(k=1KτkVk)1+W][αb]=[XTWY˜WY˜].

Let Σ=W1+k=1KτkVk, P=Σ1Σ1X(XTΣ1X)1XTΣ1, then

{αˆ=(XTΣ1X)1XTΣ1Y˜bˆ=(k=1KτkVk)Σ1(Y˜Xαˆ)

is the solution that maximizes Equation A3. We note that

Y˜ηˆ=Y˜Xαˆbˆ={I(k=1KτkVk)Σ1}(Y˜Xαˆ)=W1Σ1(Y˜Xαˆ)=W1PY˜.

Estimation of Variance Component Parameters

Following Breslow and Clayton,22 we ignore the dependence of W on τ and use Pearson chi-square statistic to approximate the deviance

2ϕi=1nqli(α,β;b)=i=1n2yiμiai(yiμ)v(μ)dμi=1nai(yiμi)2v(μi).

Then Equation A3 at the maximum becomes

ql(αˆ(ϕ,τ),β=0,ϕ,τ)12log|k=1KτkVkW+I|12i=1nai(yiμˆi)2ϕv(μˆi)12bˆT(k=1KτkVk)1bˆ=12log|k=1KτkVkW+I|12(yμˆ)TΔWΔ(yμˆ)12(Y˜Xαˆ)TΣ1(k=1KτkVk)Σ1(Y˜Xαˆ)=12log|ΣW|12(Y˜ηˆ)TW(Y˜ηˆ)12(Y˜Xαˆ)TΣ1(k=1KτkVk)Σ1(Y˜Xαˆ)=12log|W|12log|Σ|12Y˜TPW1PY˜12Y˜TP(k=1KτkVk)PY˜=c12log|Σ|12Y˜TPY˜.

Similarly, the restricted maximum likelihood (REML) version is

qlR(αˆ(ϕ,τ),β=0,ϕ,τ)=cR12log|Σ|12log|XTΣ1X|12Y˜TPY˜.

Let V0=diag{ai1v(μi)[g(μi)]2}=ϕ1W1, then Σ=ϕV0+k=1KτkVk, and the first derivatives of qlR(αˆ(ϕ,τ),β=0,ϕ,τ) with respect to ϕ and τ are

qlR(αˆ(ϕ,τ),β=0,ϕ,τ)ϕ=12{Y˜PV0PY˜tr(PV0)},
qlR(αˆ(ϕ,τ),β=0,ϕ,τ)τk=12{Y˜PVkPY˜tr(PVk)}.

We define the average information6, 23 matrix AI with the following entries

AIϕϕ=12Y˜PV0PV0PY˜,
AIϕτk=12Y˜PV0PVkPY˜,
AIτkτl=12Y˜PVkPVlPY˜.

Let θ be the variance component parameters to estimate, when ϕ1, θ=(ϕ,τ), and AI is a (K + 1)×(K + 1) matrix. For binary and Poisson data, ϕ=1, θ=τ, and AI is a K × K matrix containing only AIτkτl.

We use the following algorithm to fit the null GLMM:

  • 1.

    Fit a generalized linear model with τ=0 and get αˆ(0) and working vector Y˜(0);

  • 2.

    Use θ(0)=Var(Y˜(0))/K (if ϕ=1) or θ(0)=Var(Y˜(0))/(K+1) (if ϕ1) as the initial value of θ;

  • 3.

    For each k=0,1,,K, update θ using θk(1)=θk(0)+2n1{θk(0)}2(qlR(θ(0))/θk);

  • 4.

    Use Y˜(1)=Y˜(0) as Y˜ and update θ(2)=θ(1)+{AI(1)}1(qlR(θ(1))/θ);

  • 5.

    Calculate αˆ(2) and bˆ(2) using Y˜(1) and θ(2);

  • 6.

    Update Y˜(2) using αˆ(2) and bˆ(2);

  • 7.

    Repeat steps 4–6, until 2max{|αˆ(i)αˆ(i1)|/(|αˆ(i)|+|αˆ(i1)|),|θˆ(i)θˆ(i1)|/(|θˆ(i)|+|θˆ(i1)|)} tolerance.

The Score Test

Once (αˆ,ϕˆ,τˆ) is estimated under the null hypothesis H0:β=0, the score test can be constructed by evaluating Equation A4 at (αˆ,β=0,ϕˆ,τˆ), that is

T=ql(αˆ,β=0,ϕˆ,τˆ)β=GTWˆΔˆ(yμˆ)=GTWˆ(Y˜ηˆ)=GTPˆY˜.

Its variance under the null hypothesis is

Var(T|H0)=E{ql(αˆ,β=0,ϕˆ,τˆ)βql(αˆ,β=0,ϕˆ,τˆ)βT}=E(GTPˆY˜Y˜TPˆG)=GTPˆG,

the last equality holds because PˆΣˆPˆ=Pˆ.

Appendix B. Additional Simulation Studies

Unrelated Individuals with Population Stratification

We performed additional simulation studies to compare LMM and GMMAT in unrelated individuals in the presence of population stratification. We used the coalescent model29 to simulate genotypes for a total of 16,000 unrelated individuals with 1,000,000 genetic variants from a 20 × 20 grid (Figure S1) of spatially continuous populations (40 individuals per cell) with migration rate between adjacent cells M = 10 to mimic population structure within Europe.30, 31 For individual i, the probability of being a case πi was calculated from

logit(πi)=α0+α1Zi,

where Zi = 1 if individual i was from a 10 × 10 grid in the top left (population 1) and Zi = 0 otherwise (population 2). The parameters α0 and α1 were chosen such that the disease prevalence was 0.28 in population 1 (high risk) and 0.05 in population 2 (low risk). Note that the mean model is not mis-specified when a linear link function is used, thus our simulation setting is not in favor of logistic models. We randomly sampled 10,000 individuals to form a simulated cohort study for common disease (Figure S4) and calculated their standardized genetic relationship matrix using 625,504 genetic variants with MAF greater than 5%. We adjusted for top ten PCs in both models and combined results from 3,200 null simulation replicates. We compared our method with the strategy in which population 1 and population 2 were analyzed separately using LMMs and the results were subsequently combined in a meta-analysis (which we referred to as LMM meta), using 100 null simulation replicates (Figure S5). Moreover, we considered continuous population stratification where Zi for each cell is the minimum of its row and column coordinates in Figure S1, which ranges from 0 to 19, and α0 was chosen such that the disease prevalence in populations with Zi = 0 was 0.02 and α1=0.2 (Figure S6).

We also simulated a case-control study for rare disease (Figure S7). We chose α0 and α1 such that the disease prevalence was 0.045 in population 1 (high risk) and 0.005 in population 2 (low risk). We randomly sampled 1,667 case subjects and 8,333 control subjects to form a case-control study with a total sample size of 10,000 and a case-control ratio of 1:5. We performed the same analysis as described above and combined results from 3,200 null simulation replicates.

Comparison with the Existing Methods

We also compared GMMAT with ROADTRIPS33 and the recently developed liability (probit) mixed models.20, 21 ROADTRIPS does not allow for covariates. Because LTMLM21 does not allow covariate adjustment, we used LEAP20 and set the disease prevalence to 0.015, which is the pooled prevalence in both high- and low-risk population groups. We compared ROADTRIPS and LEAP without covariates as well as LEAP adjusting for top ten PCs as covariates with LMM and GMMAT. The simulation settings were the same as in Figure S7, but p values from only one simulation replicate were shown (Figure S8).

Moreover, we simulated three case-control settings with the same prevalence but different case-control sampling schemes in two population groups in Figure S1. The disease prevalence was set to 0.01 in both groups. We compared ROADTRIPS and LEAP with and without adjusting for top ten PCs, with LMM and GMMAT.

In the first setting, we randomly sampled 1,250 case subjects and 1,250 control subjects from population 1 and 3,750 case subjects and 3,750 control subjects from population 2. This was a balanced case-control study with balanced designs in both population groups. Because the case-control ratio was the same, there was no population stratification (Figure S9).

In the second setting, we randomly sampled 2,000 case subjects and 500 control subjects from population 1 and 1,500 case subjects and 6,000 control subjects from population 2. This was an unbalanced case-control study with unbalanced designs but equal binary trait variances in two population groups. Population stratification was created by different case-control ratios in two population groups (4:1 in population 1 and 1:4 in population 2), instead of different disease prevalence (Figure S10).

In the third setting, we randomly sampled 2,500 case subjects and 200 control subjects from population 1 and 2,500 case subjects and 4,800 control subjects from population 2. This was a balanced case-control study with unbalanced designs and unequal binary trait variances in two population groups. Because the disease prevalence was 0.01 in both groups, there were no high-risk or low-risk groups, but population 1 was the low binary trait variance group with variance 0.0686 and population 2 was the high binary trait variance group with variance 0.2252 (Figure S11).

Simulations with Genetic Effects

We first conducted simulation studies to evaluate the performance of GMMAT for estimating the odds ratios of genetic variants in the presence of population stratification. We used the same genotype data as in our null simulations of unrelated individuals with population stratification. For individual i, the probability of being a case πi was calculated from

logit(πi)=α0+α1Zi+βGi,

where Zi = 1 if individual i was from population 1, Zi = 0 otherwise (population 2). Gi is the additively coded genotype for the causal genetic variant. The parameters α0 and α1 were chosen such that the disease prevalence was 0.28 in population 1 and 0.05 in population 2 for individuals with Gi = 0. β was chosen such that the odds ratio varied from 1.0, 1.1, 1.25, 1.5, to 2.0. The total sample size was 10,000. We compared the odds ratio estimates and the p values calculated by GMMAT and MACAU (Figure S12). We then compared the power of LMM and GMMAT for identifying causal variants that had lower MAFs in population 1 than population 2 (Table S1).

We also simulated a case-control study with 1,667 case subjects and 8,333 control subjects with no population stratification. We assumed the disease prevalence was 0.01 for individuals with Gi = 0 and used the same sampling scheme in both population 1 and population 2. We compared the powers of logistic regression, LMM, and GMMAT for identifying causal variants (Table S1).

Web Resources

The URLs for data presented herein are as follows:

Supplemental Data

Document S1. Figures S1–S12 and Tables S1 and S2
mmc1.pdf (4MB, pdf)
Document S2. Article plus Supplemental Data
mmc2.pdf (4.7MB, pdf)

References

  • 1.Lander E.S., Schork N.J. Genetic dissection of complex traits. Science. 1994;265:2037–2048. doi: 10.1126/science.8091226. [DOI] [PubMed] [Google Scholar]
  • 2.Aulchenko Y.S., de Koning D.J., Haley C. Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics. 2007;177:577–585. doi: 10.1534/genetics.107.075614. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Kang H.M., Zaitlen N.A., Wade C.M., Kirby A., Heckerman D., Daly M.J., Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008;178:1709–1723. doi: 10.1534/genetics.107.080101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.Y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Zhang Z., Ersoz E., Lai C.Q., Todhunter R.J., Tiwari H.K., Gore M.A., Bradbury P.J., Yu J., Arnett D.K., Ordovas J.M., Buckler E.S. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 2010;42:355–360. doi: 10.1038/ng.546. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Yang J., Lee S.H., Goddard M.E., Visscher P.M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Lippert C., Listgarten J., Liu Y., Kadie C.M., Davidson R.I., Heckerman D. FaST linear mixed models for genome-wide association studies. Nat. Methods. 2011;8:833–835. doi: 10.1038/nmeth.1681. [DOI] [PubMed] [Google Scholar]
  • 8.Zhou X., Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 2012;44:821–824. doi: 10.1038/ng.2310. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Pirinen M., Donnelly P., Spencer C.C.A. Efficient computation with a linear mixed model on large-scale data sets with applications to genetic studies. Ann. Appl. Stat. 2013;7:369–390. [Google Scholar]
  • 10.Loh P.R., Tucker G., Bulik-Sullivan B.K., Vilhjálmsson B.J., Finucane H.K., Salem R.M., Chasman D.I., Ridker P.M., Neale B.M., Berger B. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 2015;47:284–290. doi: 10.1038/ng.3190. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Barr R.G., Avilés-Santa L., Davis S.M., Aldrich T., Gonzalez F., Henderson A.G., Kaplan R.C., LaVange L., Liu K., Loredo J.S. Pulmonary disease and age at immigration among Hispanics: results from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) Am. J. Respir. Crit. Care Med. 2016;193:386–395. doi: 10.1164/rccm.201506-1211OC. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
  • 13.Tucker G., Price A.L., Berger B. Improving the power of GWAS and avoiding confounding from population stratification with PC-Select. Genetics. 2014;197:1045–1049. doi: 10.1534/genetics.114.164285. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Fingerlin T.E., Murphy E., Zhang W., Peljto A.L., Brown K.K., Steele M.P., Loyd J.E., Cosgrove G.P., Lynch D., Groshong S. Genome-wide association study identifies multiple susceptibility loci for pulmonary fibrosis. Nat. Genet. 2013;45:613–620. doi: 10.1038/ng.2609. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Cortes A., Hadler J., Pointon J.P., Robinson P.C., Karaderi T., Leo P., Cremin K., Pryce K., Harris J., Lee S., International Genetics of Ankylosing Spondylitis Consortium (IGAS) Australo-Anglo-American Spondyloarthritis Consortium (TASC) Groupe Française d’Etude Génétique des Spondylarthrites (GFEGS) Nord-Trøndelag Health Study (HUNT) Spondyloarthritis Research Consortium of Canada (SPARCC) Wellcome Trust Case Control Consortium 2 (WTCCC2) Identification of multiple risk variants for ankylosing spondylitis through high-density genotyping of immune-related loci. Nat. Genet. 2013;45:730–738. doi: 10.1038/ng.2667. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Fakiola M., Strange A., Cordell H.J., Miller E.N., Pirinen M., Su Z., Mishra A., Mehrotra S., Monteiro G.R., Band G., LeishGEN Consortium. Wellcome Trust Case Control Consortium 2 Common variants in the HLA-DRB1-HLA-DQA1 HLA class II region are associated with susceptibility to visceral leishmaniasis. Nat. Genet. 2013;45:208–213. doi: 10.1038/ng.2518. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Liu J.Z., Hov J.R., Folseraas T., Ellinghaus E., Rushbrook S.M., Doncheva N.T., Andreassen O.A., Weersma R.K., Weismüller T.J., Eksteen B., UK-PSCSC Consortium. International PSC Study Group. International IBD Genetics Consortium Dense genotyping of immune-related disease regions identifies nine new risk loci for primary sclerosing cholangitis. Nat. Genet. 2013;45:670–675. doi: 10.1038/ng.2616. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Huber, P.J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics (Berkeley, CA: University of California Press), 221–233.
  • 19.Jarque C.M., Bera A.K. Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Econ. Lett. 1980;6:255–259. [Google Scholar]
  • 20.Weissbrod O., Lippert C., Geiger D., Heckerman D. Accurate liability estimation improves power in ascertained case-control studies. Nat. Methods. 2015;12:332–334. doi: 10.1038/nmeth.3285. [DOI] [PubMed] [Google Scholar]
  • 21.Hayeck T.J., Zaitlen N.A., Loh P.R., Vilhjalmsson B., Pollack S., Gusev A., Yang J., Chen G.B., Goddard M.E., Visscher P.M. Mixed model with correction for case-control ascertainment increases association power. Am. J. Hum. Genet. 2015;96:720–730. doi: 10.1016/j.ajhg.2015.03.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Breslow N.E., Clayton D.G. Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 1993;88:9–25. [Google Scholar]
  • 23.Gilmour A.R., Thompson R., Cullis B.R. Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics. 1995;51:1440–1450. [Google Scholar]
  • 24.Rao C.R. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Math. Proc. Camb. Philos. Soc. 1948;44:50–57. [Google Scholar]
  • 25.Conomos M.P., Laurie C.A., Stilp A.M., Gogarten S.M., McHugh C.P., Nelson S.C., Sofer T., Fernández-Rhodes L., Justice A.E., Graff M. Genetic diversity and association studies in US Hispanic/Latino populations: applications in the Hispanic Community Health Study/Study of Latinos. Am. J. Hum. Genet. 2016;98:165–184. doi: 10.1016/j.ajhg.2015.12.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Lavange L.M., Kalsbeek W.D., Sorlie P.D., Avilés-Santa L.M., Kaplan R.C., Barnhart J., Liu K., Giachello A., Lee D.J., Ryan J. Sample design and cohort selection in the Hispanic Community Health Study/Study of Latinos. Ann. Epidemiol. 2010;20:642–649. doi: 10.1016/j.annepidem.2010.05.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Laurie C.C., Doheny K.F., Mirel D.B., Pugh E.W., Bierut L.J., Bhangale T., Boehm F., Caporaso N.E., Cornelis M.C., Edenberg H.J., GENEVA Investigators Quality control and quality assurance in genotypic data for genome-wide association studies. Genet. Epidemiol. 2010;34:591–602. doi: 10.1002/gepi.20516. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Pfeffermann D. Modelling of complex survey data: Why model? Why is it a problem? How can we approach it. Surv. Methodol. 2011;37:115–136. [Google Scholar]
  • 29.Hudson R.R. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337. [DOI] [PubMed] [Google Scholar]
  • 30.Mathieson I., McVean G. Differential confounding of rare and common variants in spatially structured populations. Nat. Genet. 2012;44:243–246. doi: 10.1038/ng.1074. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Wang C., Zhan X., Bragg-Gresham J., Kang H.M., Stambolian D., Chew E.Y., Branham K.E., Heckenlively J., Fulton R., Wilson R.K., FUSION Study Ancestry estimation and control of population stratification for sequence-based association studies. Nat. Genet. 2014;46:409–415. doi: 10.1038/ng.2924. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.MacCluer J.W., VandeBerg J.L., Read B., Ryder O.A. Pedigree analysis by computer simulation. Zoo Biol. 1986;5:147–160. [Google Scholar]
  • 33.Thornton T., McPeek M.S. ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure. Am. J. Hum. Genet. 2010;86:172–184. doi: 10.1016/j.ajhg.2010.01.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Lea A.J., Tung J., Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet. 2015;11:e1005650. doi: 10.1371/journal.pgen.1005650. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Price A.L., Zaitlen N.A., Reich D., Patterson N. New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet. 2010;11:459–463. doi: 10.1038/nrg2813. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Song M., Hao W., Storey J.D. Testing for genetic associations in arbitrarily structured populations. Nat. Genet. 2015;47:550–554. doi: 10.1038/ng.3244. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.McCullagh P., Nelder J.A. Second Edition. Chapman & Hall/CRC; 1989. Generalized Linear Models. [Google Scholar]
  • 38.Lee S., Abecasis G.R., Boehnke M., Lin X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 2014;95:5–23. doi: 10.1016/j.ajhg.2014.06.009. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Document S1. Figures S1–S12 and Tables S1 and S2
mmc1.pdf (4MB, pdf)
Document S2. Article plus Supplemental Data
mmc2.pdf (4.7MB, pdf)

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics

RESOURCES