Abstract
Repeated measures of various biomarkers provide opportunities for us to enhance understanding of many important clinical aspects of CKD, including patterns of disease progression, rates of kidney function decline under different risk factors, and the degree of heterogeneity in disease manifestations across patients. However, because of unique features, such as correlations across visits and time dependency, these data must be appropriately handled using longitudinal data analysis methods. We provide a general overview of the characteristics of data collected in cohort studies and compare appropriate statistical methods for the analysis of longitudinal exposures and outcomes. We use examples from the Chronic Renal Insufficiency Cohort Study to illustrate these methods. More specifically, we model longitudinal kidney outcomes over annual clinical visits and assess the association with both baseline and longitudinal risk factors.
- CKD
- longitudinal data
- repeated measures
- GEE
- mixed effects model
- correlation structures
- Biomarkers
- Cohort Studies
- Disease Progression
- GFR
- Humans
- kidney
- Renal Insufficiency
- Chronic
- risk factors
Introduction
The term repeated measures refers to data observed repeatedly within the same subject, and they are being increasingly collected in many research studies. Aside from being used to evaluate the reproducibility and variability of a novel biomarker (1,2), repeated measures are often generated in the context of longitudinal studies, in which one or more biomarkers are observed over time. For chronic diseases, such as CKD, patterns of biomarker trajectories are crucial for understanding of disease prognosis.
In many studies, the repetitions are predetermined by the study protocol, whereby measures are administered prospectively at specific intervals during scheduled clinical visits or telephone interviews (3,4). In other scenarios, the repeated data (e.g., measures of vital signs, such as BP, heart rate, and respiratory rate) become available at variable time points when certain events (e.g., hospitalization) occur. Repeated measures could also be obtained retrospectively as natural history data through available databases, such as Medicare (5).
Depending on the scientific questions, the longitudinal measures may serve as the outcomes of interest, the exposures, or a combination of both. Examples of these scenarios in kidney disease research include comparisons of the burden of coronary artery calcification for patients at different stages of CKD and ESRD, treating the longitudinal coronary artery calcification measures as the outcome of interest (6); evaluations of the associations of longitudinal measures of GFR with subsequent adverse events, such as ESRD and death, as outcomes (7,8); and an investigation of the causal relationship between BP and kidney function, in which both measures were updated over time (9).
In this paper, we will focus primarily on the appropriate statistical methods for analyzing longitudinal data as outcomes. In particular, we will discuss extensively regression methods that can estimate the rate of change of the longitudinal data in association with certain risk factors (10,11). The terms repeated measures and longitudinal data are used interchangeably.
Motivating Example
The Chronic Renal Insufficiency Cohort (CRIC) Study is a prospective, longitudinal study of patients with CKD, in which repeated measures of serum creatinine and cystatin C, along with several other laboratory measures, were collected from each participant during annual clinical visits (3,7,12). eGFR, calculated on the basis of the CRIC Study eGFR equation that incorporates both serum creatinine and cystatin C (13), is an important measure of kidney function (12). Describing the rate of change and patterns of eGFR decline as well as identifying risk factors that affect CKD progression are of particular interest in CKD research (10,14,15).
Our motivating example involves the effect of functional kidney risk variants in the gene coding for APOL1 on the rate of eGFR decline among the CRIC Study participants. Two haplotypes (G1 and G2) in APOL1 have been positively selected and are common in populations of recent African continental descent, but they are very rare or absent in most other populations, where exposure to Trypanosomes was not common. These APOL1 variants associated with kidney diseases are believed to account for much of the nonsocioeconomic-based related disparity in rates of CKD progression between patients with African ancestry and white patients.
DNA samples from the 1411 African ancestry participants enrolled in the CRIC Study between June of 2003 and August of 2008 were genotyped for APOL1 risk variants (16). Given the near absence of these APOL1 risk variants in whites, the exposure variable was defined in conjunction with race into three categories: APOL1 high-risk genotype (African ancestry participants with two copies of the risk variants), APOL1 low-risk genotype (African ancestry participants with zero or one copy of these variants), and white participants (reference group).
The investigators were interested in assessing whether rates of eGFR decline differ among the three exposure groups. The longitudinal outcome in this example was the annual eGFR measures for each participant for up to 7 years after enrollment. Other covariates included demographics (e.g., age, sex, and clinical site), socioeconomic variables (e.g., income and education level), and traditional clinical risk factors (e.g., systolic BP and body mass index) and were mostly observed at baseline (16).
Data Preparation and Visualization
For longitudinal studies of moderate size, the dataset is often prepared in either of the two ways: wide format (one row of data per participant) or long format (multiple rows of data per participant; one per visit) (17) (details are in Supplemental Appendix 1). The long format is generally more preferred for advanced statistical modeling, because it can handle subjects with different numbers of clinical visits or irregular time points of measurement; it is also easier for dynamically updating the dataset with future follow-up visits.
Exploratory analysis using graphs can help researchers to frame the hypothesis and select appropriate statistical models. Some commonly used visualization tools for longitudinal data include spaghetti plot, heat map, and lasagna plot (18) (Figures 1 and 2). These plots show several features of the eGFR trajectories. First, patients with APOL1 high risk seem to have a faster decline compared with others. Second, such as in most cohort studies, subjects have varying numbers of visits. Third, the baseline eGFR values differ across subjects.
Spaghetti plot of eGFR over time for 15 random Chronic Renal Insufficiency Cohort (CRIC) Study participants. The figure shows an example of the spaghetti plot of repeated eGFR values for 15 randomly selected CRIC Study participants, five from each of the three APOL1 risk categories. The spaghetti plot connects the longitudinal eGFR values within a single subject by lines over time but might result in overplotting with too many subjects on one plot.
Heat map (left panel) and lasagna plot (right panel) plot on the basis of the same 15 subjects. Each row of the heat map represents eGFR values from one subject, and the color indicates the level of eGFR values. The lasagna plot is a heat map sorted by color gradients to reflect the overall eGFR distributions in the population at each visit.
In the era of big data, many multivariable measures, such as imaging scans, proteomics assessments, and electronic health records (6,19,20), are collected at multiple visits for large cohorts of participants. It might no longer be feasible to include all of the measures in one data frame. Utilization of high-performance computing and central databases, such as the National Institute of Diabetes and Digestive and Kidney Diseases Repository (21,22), that crosslink various measures to a unique subject-visit identity is crucial to handle complex and heterogeneous data. In addition, advanced statistical tools, including dynamic visualization interface and hierarchical clustering visualization using graph structures (23,24), need to be leveraged to CKD research involving big data (25–28).
Time Dependency and Correlated Observations
Longitudinal data have unique and crucial characteristics. First, they are typically accompanied by a time variable that indicates when each measurement occurred, and they define a natural ordering of the repeated measures within each subject. Second, the repeated measures within the same subject are potentially correlated. For example, within each subject, the eGFR values of subsequent visits might depend on those at earlier visits. Such intrinsic clusters defined by subjects result in dependency (correlation) among repeated observations, which violates the assumption of independent observations on which many simple analytic methods are based. Hence, using traditional linear regression and ignoring data correlations might lead to inaccurate estimates and erroneous inferences about the associations of risk factors with kidney function decline (29).
Special statistical methodologies for repeated measures allow observations to be correlated and can quantify the magnitude of this correlation. In addition, some of these approaches are designed to predict the individual trajectory of the outcome measures over time and can handle datasets with missing values. Such methods often make assumptions about the structure of correlations between the repeated measures. Typical choices of correlation structures include independence, exchangeable, autoregressive, m dependent, and unstructured (30,31) (Supplemental Appendix 2).
Statistical Approaches for Repeated Measures as the Outcome
In our motivating example, the goal of the analysis was to investigate the rate of kidney function decline and its association with related risk factors. Intuitively, a two-stage approach has been used (10,32,33) in the literature. Take eGFR as an example. First, an individual’s eGFR slope is separately estimated for each subject by regressing the subject’s repeated eGFR values on the time variable. Second, the eGFR slopes from all of the subjects are treated as the outcome variable and fitted into a linear regression. Such a method is limited, in that the estimates are highly sensitive to random variations (noise), especially for subjects with short follow-up and small numbers of observations. It is also not able to evaluate the cross-sectional associations between baseline eGFR and risk factors or adaptable to time-varying covariates (34).
To simultaneously model the effects of risk factors on both the baseline observations and their rate of change and to appropriately account for the correlations among the repeated measures, two types of flexible regression methods are recommended for the analysis of longitudinal data: generalized estimating equations (GEEs) and mixed effects modeling (17,29,34). We introduce these methods in the following sections.
Population-Average Model
A population-average model, such as the GEE approach (17,35), captures the average trajectories across the overall study population and estimates the marginal associations between the repeated outcome measures and the risk factors. A GEE model consists of two parts: the mean response model and the error term. In our motivating example, we can fit a linear GEE model as expressed in model 1 below. For simplicity, no additional covariates are included in this model, except for the major risk factor: APOL1 risk group (0, white; 1, African ancestry APOL1 low risk; and 2, African ancestry APOL1 high risk):(1)
The mean response model takes the repeated measures (e.g., eGFR at jth visit for subject i) as the outcome and describes the overall mean relationship between repeated eGFR values and the exposure variables. We present a rigorous formulation and detailed interpretation of model 1 in Supplemental Appendix 3, but we point out that the coefficient estimated in front of the time variable (year) represents the average eGFR slope for the reference group (whites) and that the coefficient for the interaction of year with APOL1 estimates the difference in eGFR slopes between the other two APOL1 risk groups compared with the reference. In addition, the associations between baseline eGFR and covariates are represented via coefficients in front of the nontime-varying variables in a similar fashion as a cross-sectional linear regression.
The error term accounts for the within-subject correlation. GEE initiates the model estimation by specifying a particular correlation structure among the repeated observations referred to as the working correlation, which is not necessarily accurate for the real data. The working correlation is typically chosen from one of the aforementioned common structures: independence, exchangeable, autoregressive, m dependent, and unstructured (Supplemental Appendix 2). For example, exchangeable structure assumes that the correlation between outcomes is equal, regardless of how far apart they are in time. Autoregressive is often used in longitudinal analysis and refers to when the correlation decreases as the time difference between observations increases. The unstructured correlation structure is the most flexible, with no prespecified patterns. GEE is robust in that an erroneously specified working correlation structure will have little effect on the risk association estimation as long as the mean response model is correct. However, choosing a working correlation that closely approximates the truth is still desirable, because it results in smaller standard errors in the final estimates (30,31,36).
With the data from the motivating example, we fit model 1 assuming an exchangeable working correlation structure and present the estimated coefficients in Table 1. The results show that eGFR, on average, decreases 0.50 ml/min per 1.73 m2 per year for whites. The eGFR decline among the APOL1 high-risk group is 0.94 ml/min per 1.73 m2 per year faster than among whites (P<0.001). The APOL1 low-risk group also has a faster decline than in whites, but their difference is milder (0.38 ml/min per 1.73 m2 more per year).
Interpretation of coefficients from a generalized estimating equation model assessing associations of APOL1 genotype with eGFR measured repeated over time
Subject-Specific Model
In addition to the overall mean effects of APOL1 on the eGFR trajectory, different subjects could start with various eGFR values at baseline and also progress differently. In a GEE model, the subject heterogeneity is absorbed into the error term, and its magnitude is thus not quantified. Mixed effects modeling, which has been more commonly used than GEE in the nephrology literature (11), estimates both the individual variations in baseline outcome variables and their rates of change that deviate away from the population-average trajectory. This is achieved by adding random effects (random intercept and/or random slope) to the population-average model (referred to as fixed effects), as illustrated in Figure 3, by two hypothetical subjects.
Schematic illustrations of linear mixed effects models with random intercepts (left panel) and both random intercepts and random slopes (right panel). The blue and red dots represent observed eGFR values for the two hypothetical subjects i and i′. The blue and red curves represent the mixed effects model fit on the basis of the data. The black lines represent the population trend estimated by generalized estimating equation (GEE). With the random intercept model, the two subjects have different baseline eGFR values but the same rate of decline. In the random intercept model, their rates of decline also differ. The deviations of intercepts and slopes are quantified using random effects parameters.
In our example, the subject-specific random intercept characterizes the difference between an individual’s baseline eGFR and the population average, and a random slope in front of a time variable describes the deviation of individual slope from the population average. Thus, a simple mixed effects model that is analogous to model 1 is as follows:(2)The fixed effects coefficients share the same interpretations as in the GEE model 1. It is the subject-specific random effects that quantify heterogeneity across individuals.
Although they both estimate individual eGFR slopes, the mixed effects model is different from the aforementioned two-stage approach in that it manages to use information from all of the subjects by assuming a common distribution (typically, a normal distribution with mean of zero) for the random effects, and hence, it avoids the problem of estimating too many parameters with too few observations. The random effects also take care of the correlation among repeated observations within subject. Additional working correlation structures can be further imposed onto the error term, making it adaptable to complex correlation structures. A linear model with only a random intercept and independent error term would induce an exchangeable correlation structure.
Tables 2–3 show the estimates for fixed effects coefficients and the estimated subject-specific random effects in model 2. On average, eGFR decreases 0.74 ml/min per 1.73 m2 per year for white participants. The eGFR decline among the APOL1 high-risk group is significantly faster by 1.50 ml/min per 1.73 m2 per year than among whites (P<0.001).
Interpretation of the fixed effects coefficients from a linear mixed effects model assessing associations of APOL1 genotype with eGFR measured repeated over time
Predicted individual deviation (random intercept and slope) in model 2 for five subjects in the white group
Another advantage of the mixed effects model is that it is flexible enough to account for multiple layers of clustering. In particular, many cohort studies, including the CRIC Study, recruit participants from multiple clinical sites. In addition to the within-subject correlation among repeated measures, data correlation could also occur across subjects who come from the same geographic location or social community. It is often necessary to adjust for site effects during the analysis by either (1) including site as a discrete covariate, such as in the work by Parsa et al. (16), when the sample size per site is sufficiently large or (2) adding a site-specific random effect in the mixed effects model to quantify the intra- versus intersite variability.
Model Diagnosis
Model selection is often conducted to choose a set of variables that is most relevant to the outcome or select the best working correlation structure. Unlike GEE, the correlation structure of the mixed effects model must be specified in advance, and the results may be affected by misspecification. For mixed effects models, the one that has smaller Aikake Information Criterion (AIC) and Bayesian Information Criterion (BIC) generally fits the data better. A likelihood ratio test (37) could also be used to compare two models with and without a certain variable. For GEE, one can instead choose the best model that minimizes either the quasilikelihood under the independence model criterion (QIC) (Table 4) (38), or the QICu defined as QIC+2p that penalizes the model when too many variables are included. Comparing the empirical correlation on the basis of the observed data with the model-based estimates assuming that the working correlation is true or conducting sensitivity analyses to examine how the choice of working correlation structures affects the analytic results are both crucial ways for appropriate model selection in GEE.
Comparison of different working correlation structures using quasilikelihood information criterion
Non-Normal Outcomes
For longitudinal outcomes that are not normally distributed, such as hospitalizations (discrete counts) and occurrence of AKI (binary), GEE and mixed effects models can both handle such data, analogous to the way in which the linear regression is extended to generalized linear models (e.g., logistic or Poisson regression [37]). The extension to non-normal outcomes for mixed effects models is called generalized linear mixed models (29).
Nonlinear Trends
Another assumption often made for kidney function decline is that the rate of change remains constant over time (that is, the disease progresses linearly). This assumption is convenient but is not always realistic, because eGFR decline could accelerate or stabilize at different stages of CKD (39,40). Several statistical methods have been developed to accommodate the potential for nonlinear trajectories. The first approach is to conduct a data transformation on the outcome variable (e.g., log, square root [41–43]). Log transformation is commonly used if the original outcome data decline exponentially or have a skewed distribution. The limitation of this approach is that the estimated rate of kidney function decline on the basis of the transformed data often lacks a straightforward clinical interpretation (11).
Alternatively, functional data analysis techniques are a set of flexible methods to characterize the underlying smooth trajectories on the basis of the observed longitudinal measurements. For example, we can model the eGFR values as a polynomial (e.g., quadratic or cubic) function of time in the regression model. In our motivating example, models 1 and 2 can also include new variables or
in addition to the linear term
, and the statistical significance of the higher-order polynomial terms can then be tested to determine whether a nonlinear eGFR trajectory is truly present. More broadly, smoothing splines (e.g., B spline) are used to accommodate the curvilinear trends, such as piecewise linear or polynomial (44,45). The two-slope model used for log serum creatinine in the Chronic Kidney Disease Epidemiology Collaboration equation is a special case of the piecewise-linear spline with one change point (knot) where the eGFR slope is believed to alter (46,47). Principal component analysis–type approaches (48), however, identify the patterns of the trajectories that explain most of the variations across subjects in a data-driven fashion. Some efforts have been devoted to identify subgroups of patients who show distinct patterns in the repeated measures using latent class analysis, such as group-based trajectory modeling (49–51). These methods are powerful, but they often require more observations or longer follow-up to generate reliable model estimates of the nonlinear trajectories and are computationally more expensive than the traditional models.
Missing Data and Joint Modeling for Informative Dropout
Missing data are almost inevitable in a longitudinal study for various reasons, such as participants’ dropout (52), data mishandling, or quality screening (53,54). Both GEE and mixed effects model allow for imbalanced designs, in which the outcome data are not available at the same set of time points for all subjects. However, it is necessary to note that GEE makes more restricted assumptions in missing mechanism than mixed effects model (55) (Supplemental Appendix 4). Under missing at random, a weighted GEE has been developed (56) that corrects biased estimates produced by the standard GEE. An alternative approach is to fill in the incomplete observations using multiple imputations (57) followed by the standard GEE (58,59). Such methods are preferred, especially when the outcome data are not normally distributed (58–60). Note that imputation using last observation carried forward is not recommended for longitudinal data, because it is known to produce highly biased estimates (61,62).
Informative censoring (52) or missing not at random (55) often occurs in longitudinal studies when some patients drop out of the study due to events that are related to CKD progression, such as initiation of dialysis, kidney transplantation, or death (63). In these scenarios, an individual with worse kidney function condition (or a particularly low eGFR) is more likely to drop out of the study and have their eGFR value unobserved. Appropriate methods that take the cause of missingness into account include conditional regression or pattern mixture analysis (64,65) and selection model (66) as well as the joint modeling of survival outcome (e.g., ESRD) and longitudinal data, where a Cox regression model for time to dropout and a mixed effects model for the longitudinal observations are specified separately and linked together (67–70).
Summary
This paper focused mainly on appropriate statistical methodologies for longitudinal data, where the repeated measures are treated as the outcome variable, and they are summarized in Table 5, with the comprehensive comparison between the two regression methods, GEE and mixed effects model. The corresponding statistical softwares for these models are listed in the Supplemental Appendix 5. We are aware that there remain many limitations in the discussion. For example, topics, such as how to handle periods of dialysis or AKI episodes, that affect eGFR values were not covered, because it is difficult to capture AKI due to the infrequency of eGFR measures in most longitudinal cohorts (71). Accordingly, our immediate goal is to promote further discussions and emphasize a few key considerations involved in carefully choosing analytic methods primarily for outcomes on the basis of repeated measures as most appropriate to each specific study design, question, and measure.
Comparisons of the different aspects of the generalized estimating equation model and the mixed effects model for repeated measures
Disclosures
None.
Acknowledgments
Funding for the CRIC Study was obtained under a cooperative agreement from National Institute of Diabetes and Digestive and Kidney Diseases (U01DK060990, U01DK060984, U01DK061022, U01DK061021, U01DK061028, U01DK060980, U01DK060963, and U01DK060902). In addition, this work was supported in part by: the Perelman School of Medicine at the University of Pennsylvania Clinical and Translational Science Award National Institutes of Health (NIH) / National Center for Advancing Translational Sciences (NCATS) UL1TR000003, Johns Hopkins University UL1TR-000424, University of Maryland GCRC M01 RR-16500, Clinical and Translational Science Collaborative of Cleveland, UL1TR000439 from the NCATS component of NIH and NIH roadmap for Medical Research, Michigan Institute for Clinical and Health Research UL1TR000433, University of Illinois at Chicago CTSA UL1RR029879, Tulane COBRE for Clinical and Translational Research in Cardiometabolic Diseases P20 GM109036, Kaiser Permanente NIH/National Center for Research Resources UCSF-CTSI UL1 RR-024131.
CRIC Study Investigators also include Lawrence J. Appel (Welch Center for Prevention, Epidemiology and Clinical Research, Johns Hopkins University, Baltimore, Maryland), Jiang He (Departments of Medicine and Epidemiology, Tulane University, New Orleans, Louisiana), James P. Lash (Section of Nephrology, Department of Medicine, University of Illinois at Chicago, Chicago, IL), Akinlolu Ojo (Department of Medicine, University of Michigan, Ann Arbor, Michigan), and Raymond R. Townsend (Department of Medicine and Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania, Philadelphia, Pennsylvania).
Footnotes
Published online ahead of print. Publication date available at www.cjasn.org.
This article contains supplemental material online at http://cjasn.asnjournals.org/lookup/suppl/doi:10.2215/CJN.11311116/-/DCSupplemental.
- Copyright © 2017 by the American Society of Nephrology