Abstract
Background and objectives Identifying predictors of kidney disease progression is critical to developing strategies to prevent kidney failure. Clinical notes provide a unique opportunity for big data approaches to identify novel risk factors for disease.
Design, setting, participants, & measurements We used natural language processing tools to extract concepts from the preceding year’s clinical notes among patients newly referred to a tertiary care center’s outpatient nephrology clinics and retrospectively evaluated these concepts as predictors for the subsequent development of ESRD using proportional subdistribution hazards (competing risk) regression. The primary outcome was time to ESRD, accounting for a competing risk of death. We identified predictors from univariate and multivariate (adjusting for Tangri linear predictor) models using a 5% threshold for false discovery rate (q value <0.05). We included all patients seen by an adult outpatient nephrologist between January 1, 2004 and June 18, 2014 and excluded patients seen only by transplant nephrology, with preexisting ESRD, with fewer than five clinical notes, with no follow-up, or with no baseline creatinine values.
Results Among the 4013 patients selected in the final study cohort, we identified 960 concepts in the unadjusted analysis and 885 concepts in the adjusted analysis. Novel predictors identified included high–dose ascorbic acid (adjusted hazard ratio, 5.48; 95% confidence interval, 2.80 to 10.70; q<0.001) and fast food (adjusted hazard ratio, 4.34; 95% confidence interval, 2.55 to 7.40; q<0.001).
Conclusions Novel predictors of human disease may be identified using an unbiased approach to analyze text from the electronic health record.
- chronic kidney disease
- end stage kidney disease
- natural language processing
- informatics
- electronic health record
- adult
- Ascorbic Acid
- Cohort Studies
- creatinine
- Disease Progression
- Electronic Health Records
- fast foods
- Follow-Up Studies
- humans
- kidney
- Kidney Diseases
- Kidney Failure, Chronic
- Natural Language Processing
- nephrology
- Outpatients
- Renal Insufficiency
- Retrospective Studies
- risk factors
- Tertiary Care Centers
Introduction
Every year, over 100,000 Americans develop ESRD requiring RRT with hemodialysis, peritoneal dialysis, or kidney transplantation (1). Predictors of kidney disease have traditionally been identified through classic epidemiologic approaches, whereby individual risk factors are adjusted for known or suspected confounders and evaluated for their association with CKD progression. This has led to the identification of a number of potential associations with the development of ESRD (2–8), including race, comorbid conditions, conventional and novel biomarkers, lifestyle factors, and medications.
A limitation of retrospective or prospective cohort studies is that associations are typically tested with prespecified covariates. This is akin to candidate gene approaches used to study the genetic basis of diseases. Genome–wide association studies have enabled the discovery of new associations through a paradigm of simultaneous, unbiased testing of multiple associations. Similar approaches have been used to discover new associations between diseases and environmental exposures using environment–wide association studies (9,10) and between a single genetic variant and multiple phenotypes using phenome–wide association studies (11).
Although work has been done using approaches, such as topic modeling, to incorporate unstructured notes into prediction models (12), modern epidemiologic approaches have not used the clinical narrative in the discovery of disease associations. Clinical notes may contain a rich description of numerous epidemiologic exposures. Discovering associations on the basis of the clinical narrative carries the added complexity of an open cohort, where patients may enter and leave the cohort at various time points and may be lost to follow-up or experience competing events.
In this study, we present and demonstrate a new methodology, which we term a concept–wide association study, for examining the relationships between several thousand concepts extracted from clinical notes and the development of ESRD.
Materials and Methods
Study Design
We conducted a retrospective cohort study using the full text of clinical notes in the year before the first outpatient general nephrology visit. The date of the first outpatient general nephrology visit was identified through visit and billing data. The outcome was defined using International Classification of Diseases, 9th revision diagnosis codes (585.6, V42.0, and 996.81) and procedure codes (90970, 50360, 50365, 50370, 50380, 55.61, and 55.69) as the date on which RRT was first performed. All identified events were adjudicated through chart review.
Population Studied
Patients were included if they were seen between January 1, 2004, and June 18, 2014 by an adult nephrologist at a Brigham and Women’s Hospital–affiliated outpatient clinic. Exclusion criteria included visits only with transplant nephrology, known ESRD before the first nephrology visit, fewer than five clinical notes in the year preceding the first nephrologist visit, no documented follow-up, or no baseline creatinine values. The threshold of five notes was chosen, because 85% of patients had at least five notes in the year preceding the first nephrology visit.
Data Collection
We obtained data from the Partners Research Patient Data Registry, a centralized clinical data warehouse. We obtained information on patient demographics (age, sex, and race), billing codes, the full text of all electronic clinical notes, and laboratory values, including serum creatinine, calcium, phosphorus, albumin, bicarbonate, and the urine albumin-to-creatinine ratio. We defined baseline laboratory values for each laboratory test as the first available result on or after the first nephrology visit up to 365 days after the visit. Death was determined from the Social Security Death Index, and ascertainment was limited to 30 days after the final clinical note. If ESRD did not occur by this time, the observation was treated as censored. The Partners Healthcare Institutional Review Board approved this study, and the need for informed consent was waived. We adhered to the Declaration of Helsinki.
Extraction of Concepts from Clinical Notes
Concepts rather than individual words were evaluated so that phrases, such as “CHF” and “congestive heart failure” could be considered together when evaluating their association with kidney failure. Concepts were extracted from all clinical notes starting from 1 year before the first nephrologist visit up to but not including the nephrology visit (Figure 1). All clinical note types (including phone calls, outpatient notes, and inpatient notes) were included. Of note, Brigham and Women’s Hospital’s electronic health record is primarily an outpatient record; although it contains admission and initial consultation notes, it typically does not contain inpatient progress notes. Concepts were coded as binary variables for each patient. Concepts with <1% patient prevalence were not evaluated further.
Figure 1. Clinical notes from the year preceding the first nephrology visit were processed, and the extracted concepts were evaluated for associations with time-to-ESRD.
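The binary coding and the 1% prevalence filter described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `build_concept_matrix`, the input layout (a mapping from patient to the concepts found in that patient's notes), and the toy concept identifiers are all assumptions made for illustration.

```python
from collections import defaultdict


def build_concept_matrix(patient_concepts, min_prevalence=0.01):
    """Code concepts as per-patient binary indicators and drop rare concepts.

    patient_concepts: dict mapping a patient id to an iterable of concept ids
    extracted from that patient's notes (hypothetical input layout).
    Returns (kept_concepts, matrix), where matrix[patient_id] is a list of
    0/1 flags aligned with kept_concepts.
    """
    n_patients = len(patient_concepts)
    carriers = defaultdict(set)
    for pid, concepts in patient_concepts.items():
        for concept in set(concepts):  # repeated mentions count once per patient
            carriers[concept].add(pid)

    # Keep concepts present in at least min_prevalence of patients.
    kept = sorted(
        c for c, pids in carriers.items()
        if len(pids) / n_patients >= min_prevalence
    )
    matrix = {
        pid: [1 if pid in carriers[c] else 0 for c in kept]
        for pid in patient_concepts
    }
    return kept, matrix


# Toy data with made-up concept ids; a 50% threshold is used only so that the
# filter has a visible effect on four patients.
toy = {
    "p1": ["C_hf", "C_dm", "C_hf"],
    "p2": ["C_dm"],
    "p3": ["C_dm", "C_rare"],
    "p4": [],
}
kept, matrix = build_concept_matrix(toy, min_prevalence=0.5)
print(kept, matrix)
```

Counting each concept once per patient, regardless of how many notes mention it, matches the binary coding described in the text; a count-based coding would be a different design.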
Concepts were extracted from notes in a two-step process. First, negated phrases were removed with the NegEx negation engine (13) in Python 2.7 (14). Second, notes were processed with the National Library of Medicine’s MetaMap software (15) (version 2013v2), which maps phrases to Unified Medical Language System codes known as concept unique identifiers. Extracted concepts were restricted to the Systematized Nomenclature of Medicine–Clinical Terms and RxNorm ontologies to map general and drug concepts, respectively. Concepts were not limited by semantic type, and therefore, all types of concepts were extracted (e.g., diagnoses, medications, signs, and symptoms). Mapping of phrases to multiple concepts was allowed (e.g., “heart failure” maps to concepts for both “heart failure” and “congestive heart failure”).
Natural language processing systems may occasionally create erroneous mappings (e.g., the phrase “GN” in a clinical note maps to the concept for “Guinea Republic”). Instead of reporting the name of the concept intended by the phrase mappings, we report the phrase (or phrases) matching with the highest frequency for each concept (i.e., we report “GN” and not “Guinea Republic”).
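The reporting rule above (label each concept with its most frequently matched phrase) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `label_concepts`, the list of (phrase, concept) pairs, and the concept identifiers are hypothetical and are not MetaMap output.

```python
from collections import Counter, defaultdict


def label_concepts(mappings):
    """Return, for each concept, the phrase that mapped to it most often.

    mappings: iterable of (phrase, concept_id) pairs, one per mapped mention
    (hypothetical layout; real MetaMap output has a richer structure).
    """
    counts = defaultdict(Counter)
    for phrase, concept in mappings:
        counts[concept][phrase.strip()] += 1
    return {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}


# Made-up mentions: the abbreviation "GN" is mapped to a country concept, so
# the concept is reported under the phrase that actually appeared in notes.
mentions = [
    ("GN", "C_GUINEA"),
    ("GN", "C_GUINEA"),
    ("Guinea", "C_GUINEA"),
    ("CHF", "C_HF"),
    ("congestive heart failure", "C_HF"),
    ("CHF", "C_HF"),
]
labels = label_concepts(mentions)
print(labels)
```

Reporting the matched phrase rather than the concept name keeps an erroneous mapping visible to the reader instead of hiding it behind a misleading label.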
Statistical Analyses
Proportional subdistribution hazards (competing risk) regression as described by Fine and Gray (16) was performed with ESRD defined as the event of interest and death considered as a competing risk. The Fine and Gray (16) model directly assesses the effect of covariates on the cumulative incidence of a particular type of failure in a competing risks setting. The subdistribution hazard is the rate of ESRD per unit time among individuals who have not yet developed ESRD, including both those who are still alive and those who have already died without ESRD. That is, individuals who have died without ESRD are treated as though they are still at risk for ESRD. Competing risk regression is preferable for prognostic questions, whereas Cox regression is preferable for etiologic questions (17). We were interested in identifying predictors that would differentiate patients whose kidney disease progressed to ESRD from those who either did not progress or died (a question of prognosis), and therefore, we chose competing risk regression as our primary approach.
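To make the notion of cumulative incidence under a competing risk concrete, the sketch below computes the standard nonparametric (Aalen-Johansen type) cumulative incidence of ESRD and compares it with the naive estimate obtained by censoring deaths. It is not the Fine and Gray regression itself and not the authors' code; the event coding (0 = censored, 1 = ESRD, 2 = death without ESRD), the function names, and the five toy observations are assumptions for illustration.

```python
def cumulative_incidence(times, events, cause=1):
    """Nonparametric cumulative incidence of `cause` with competing events.

    events uses an assumed coding: 0 = censored, 1 = ESRD, 2 = death without
    ESRD. Returns a list of (time, cumulative incidence) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    event_free = 1.0  # probability of no event of any type just before t
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for tt, e in data[i:] if tt == t]  # all records at time t
        d_cause = sum(1 for e in tied if e == cause)
        d_any = sum(1 for e in tied if e != 0)
        if d_any:
            cif += event_free * d_cause / at_risk
            event_free *= 1 - d_any / at_risk
            curve.append((t, cif))
        at_risk -= len(tied)
        i += len(tied)
    return curve


def naive_incidence(times, events, cause=1):
    """1 - Kaplan-Meier for `cause`, treating competing events as censoring."""
    recoded = [e if e == cause else 0 for e in events]
    return cumulative_incidence(times, recoded, cause)


# Toy follow-up (years) and outcomes; the values are invented.
times = [1, 2, 3, 4, 5]
events = [2, 1, 2, 1, 0]
print(round(cumulative_incidence(times, events)[-1][1], 3))
print(round(naive_incidence(times, events)[-1][1], 3))
```

In this toy example the competing risk estimate of ESRD incidence is about 0.40, whereas censoring deaths gives 0.625, because the naive estimate implicitly assumes that patients who died could still have gone on to develop ESRD.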
Competing risk regression for each concept was carried out in two phases: (1) unadjusted and (2) adjusted for a published ESRD risk prediction score developed by Tangri et al. (18) using age, race, sex, and baseline laboratory values (serum creatinine, calcium, phosphorus, albumin, bicarbonate, and the logarithm of the urine albumin-to-creatinine ratio).
Multiple hypothesis testing (19) was accounted for using the method by Storey (20), which controls the false discovery rate, defined as the expected proportion of false positives among all significant hypotheses. Using this method, P values were transformed into q values. Hazard ratios (HRs) and 95% confidence intervals (95% CIs) were not adjusted in any way. Concepts with q values <0.05 were reported as associations; this equates to a 5% expected proportion of false positives among all concepts declared to have associations. The false discovery rate method was chosen, because it explicitly controls the error rate of test conclusions among significant results, scales well in the face of increasing numbers of tests, and has higher power compared with the Bonferroni method (19). Concepts with HR>1 were identified as positive predictors, and concepts with HR<1 were identified as negative predictors. Analyses were performed in R 3.1.2 (21). q Values were computed using the qvalue R package (available on Bioconductor) by Storey et al. (22), and multiple imputation was performed using the mice R package (23).
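The transformation of P values into q values can be illustrated with a short Python sketch. Note the substitution: the study used Storey's method via the qvalue R package, whereas the sketch below implements the simpler Benjamini-Hochberg adjustment, which corresponds to assuming that the proportion of true null hypotheses is 1 and is therefore somewhat more conservative. The function name `bh_qvalues` and the example P values are made up for illustration.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted P values (a conservative stand-in for
    Storey q values, which additionally estimate the fraction of true nulls).
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that q values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Invented P values for five hypothetical concepts.
p = [0.001, 0.008, 0.039, 0.041, 0.6]
q = bh_qvalues(p)
significant = [i for i, value in enumerate(q) if value < 0.05]
print(q, significant)
```

In this toy example, four of the five P values fall below 0.05, but only the first two have q values below 0.05, which shows how the false discovery rate threshold is stricter than an unadjusted 5% significance level when many tests are run.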
Identifying Associated Concepts
Each concept is likely to have a number of associations with other concepts in either the same direction (e.g., diabetes mellitus and insulin) or opposite directions (e.g., women and men). For a given concept, knowing its associated concepts may help in evaluating its plausibility as a true risk factor or a confounder. The Φ-coefficient is a measure of association for two binary variables, and it can be derived by computing a Pearson correlation on binary variables. We reported the three phrases with the greatest Φ-coefficients (in either direction) for each concept identified in our analysis after excluding duplicates.
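The equivalence between the Φ-coefficient and a Pearson correlation on binary variables can be checked with a small Python sketch. The function names and the two toy indicator vectors are assumptions for illustration; they do not come from the study data.

```python
from math import sqrt


def phi_coefficient(x, y):
    """Phi coefficient from the 2x2 table of two 0/1 indicator vectors."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else float("nan")


def pearson(x, y):
    """Plain Pearson correlation, used here only to check the equivalence."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Invented presence/absence flags for two concepts across eight patients.
concept_a = [1, 1, 1, 0, 0, 0, 0, 0]
concept_b = [1, 1, 0, 1, 0, 0, 0, 0]
print(phi_coefficient(concept_a, concept_b), pearson(concept_a, concept_b))
```

Both functions give 7/15 (about 0.47) for these vectors. Ranking the other concepts by the absolute value of this quantity would give the "greatest Φ-coefficients in either direction" described above.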
Handling Missing Data
The adjusted analysis required calculation of the Tangri score, which is dependent on several variables (18). Missing values were multiply imputed using regression switching with predictive mean matching, a nonparametric method (24). Five datasets were imputed incorporating the event flag, log time, age, race, sex, serum creatinine, calcium, phosphorus, albumin, bicarbonate, and log urine albumin-to-creatinine ratio. The Tangri score was calculated using the imputed variables. Results were pooled using the rules by Rubin (25).
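Pooling across imputed datasets with Rubin's rules can be sketched as follows. This is a minimal illustration, not the authors' code (which used the mice R package): it assumes pooling is done on the log hazard ratio scale, and the function name `pool_rubin` and the five estimates and variances are invented.

```python
from math import exp, sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors.

    Returns (pooled estimate, pooled standard error). Assumes the estimates
    are on a scale where normality is reasonable, e.g. log hazard ratios.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # between-imputation variance (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Invented log hazard ratios and variances from five imputed datasets.
log_hrs = [1.40, 1.50, 1.45, 1.55, 1.60]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]
pooled_log_hr, pooled_se = pool_rubin(log_hrs, variances)
print(exp(pooled_log_hr), pooled_se)
```

The pooled standard error exceeds the within-imputation standard error (0.2 in this toy example) because the between-imputation term adds the uncertainty attributable to the missing values.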
Sensitivity Analyses
The analysis was repeated using Cox regression with death considered as a censoring event. Death-censored ESRD has been used by several ESRD prediction models but may be problematic for the purpose of estimating the risk of ESRD, because death likely represents informative censoring (8,18,26–30). The hazards from a cause–specific hazard model are conditional on survival and cannot be interpreted as marginal hazards that ignore death. If death is independent of ESRD, results similar to those obtained by competing risk regression would be expected with the cause–specific Cox regression. However, in settings where a variable of interest is associated with a competing event, the two approaches may yield conflicting results (31). The adjusted competing risk and Cox analyses were repeated using nonmissing covariates (age, sex, black race, and eGFR) to limit bias from imputation.
Results
After applying inclusion and exclusion criteria to 9817 patients seen in nephrology clinic, we identified 4013 patients who constituted the analytic cohort (Figure 2). Of these, 134 were confirmed to have developed ESRD during follow-up, and 160 were confirmed to have died without developing ESRD (Table 1). Median follow-up time was 1.2 years (range, 0–10.2 years).
Figure 2. Approximately half of the patients considered were included in the study cohort.
Table 1. Baseline characteristics of the patients included in the study and patients excluded on the basis of having fewer than five clinical notes in the year before the first nephrology visit
After processing 103,962 clinical notes authored by 4589 distinct clinicians with NegEx and MetaMap, 38,698 unique concepts were identified. Of these, 7576 were present in at least 1% of patients and subsequently studied.
Predictors of ESRD from Univariate Analysis
Using a false discovery rate threshold of <5% (q<0.05) (Table 2), competing risk regression identified 960 concepts with q values <0.05 (Supplemental Table 1). Of these, 184 had an HR>1, and 776 had an HR<1.
Table 2. Unadjusted competing risk regression for ESRD showing concepts with q values <0.05 (top 10 positive and negative predictors shown; full results are in Supplemental Table 1)
Of the positive predictors included in the Tangri score (18), a CKD risk prediction score on the basis of demographic and laboratory data, the unadjusted analysis identified men (“man”), “creatinine,” and “proteinuria” as concepts associated with higher risk of ESRD. Mention of “proteinuria” had an HR of 2.16 (95% CI, 1.51 to 3.10; q<0.001), and “nephrotic-range proteinuria” had an HR of 5.09 (95% CI, 2.48 to 10.40; q<0.001). Other notable positive predictors included concepts related to heart failure; coronary artery disease; type 1 diabetes; complications of chronic kidney disease, such as hyperkalemia and anemia; medications associated with each of these conditions; and noncompliance with medications. Interestingly, the mention of “no swelling” was also associated with ESRD.
Negative predictors for ESRD included a large number of factors that may be indicative of a positive association with death (e.g., “ventilator”), poor candidacy for dialysis or kidney transplant (e.g., “metastatic cancer”), or a lower risk of ESRD (e.g., “female”). Although we cannot distinguish among these reasons for a given concept, all contribute negatively to the cumulative incidence of progression to ESRD in the framework of a competing risk model. Interestingly, “female” had an HR of 0.58 (95% CI, 0.41 to 0.83; q=0.02), which is a near-perfect reciprocal of the risk conferred by “male,” whereas “healthful food” had an HR of <0.01 (95% CI, <0.01 to <0.01; q<0.001), which is in the opposite direction of the risk conferred by “fast food” (HR, 3.62; 95% CI, 1.72 to 7.63; q=0.005) but with a much larger effect size.
Predictors of ESRD after Adjusting for Tangri Score
After adjusting for the Tangri score, we identified 885 concepts using a false discovery rate threshold of <5% (q<0.05) (Table 3), of which 130 had HRs>1 and 755 had HRs<1 using competing risk regression (Supplemental Table 3). Seven hundred seventy-two concepts were identified in both the univariate and multivariate competing risk analyses.
Table 3. Competing risk regression for ESRD adjusted for Tangri score showing concepts with q values <0.05 (top 10 positive and negative predictors shown; full results are in Supplemental Table 3)
Missing Data
Among variables required for imputation of the Tangri score, the proportion of missingness in the final cohort was 56.1% for the urine albumin-to-creatinine ratio, 37.8% for phosphorus, 9.8% for serum albumin, 2.6% for calcium, 2.4% for bicarbonate, and 0% for age, sex, and eGFR.
Sensitivity Analyses
Using Cox regression models, 129 concepts were found to have q<0.05 in the univariate analysis: 120 concepts with HRs>1 and nine concepts with HRs<1 (Supplemental Table 2). Of the concepts identified in the univariate analyses, 127 were identified by both competing risk and Cox regression, 833 were identified by competing risk regression only, and 2 were identified by Cox regression only. After adjusting for the Tangri score using Cox regression, no concepts were found to have q<0.05.
The competing risk–adjusted analysis was repeated using nonmissing covariates. Adjusting for age, sex, race, and eGFR, 843 concepts were identified with q<0.05: 95 with HRs>1 and 748 with HRs<1 (Supplemental Table 4). Of these 843 concepts, 791 were also identified in the primary analysis, and 52 were identified in this analysis only; 94 concepts were identified in the primary analysis only. Cox regression with this set of covariates identified seven concepts (Supplemental Table 5), all with HRs>1, in contrast to no concepts identified in the primary analysis.
Discussion
This is the first study to use an unbiased approach using text from the clinical notes to identify predictors of human disease. The approach’s face validity was confirmed by the identification of several well established risk factors for ESRD, including men, degree of proteinuria, diabetic kidney disease, heart failure, and anemia. This study also identified novel predictors of ESRD that have not been previously described, such as fast food and high–dose ascorbic acid.
This paper describes a hypothesis-generating approach, and caution is needed when interpreting the results. Identified predictors could fall into any of seven possible categories.
- Vague and unlikely to be meaningful
- Confounded by indication
- Mention of the concept in a note, rather than its actual presence, indicates a risk (a specific type of confounding by indication)
- False positive result (due to 5% false discovery rate)
- Associated with the event of interest but not causal
- Associated with the competing event
- True risk factor (associated with the event of interest and causal)
Confounding by Indication
Confounding may be obvious in some cases (docusate as a marker of hospitalization) and less obvious in others. The association of pes planus (flat foot) or volar (sole of the foot) with ESRD may be confounded by documentation of a foot examination among diabetic individuals referred to podiatry for diabetic nephropathy. Of the 54 patients noted to have a finding of flat foot, 39 had diabetes mellitus mentioned in their notes. The association of “Ascorbic Acid 500 MG” with ESRD may be confounded by its association with “cardiac transplantation” (a relationship that was identified through a systematic evaluation of interconcept associations), although the Φ-coefficient of 0.28 reflects only a weak association between the two concepts. These situations are akin to the linkage disequilibrium problem observed in genome–wide association studies, where false associations may be identified when two genes are in close proximity.
Biologically Plausible Predictors
High–dose ascorbic acid could conceivably lead to a higher risk of ESRD through its metabolism to oxalate, the most common constituent of kidney stones (a risk factor for CKD progression [32,33]) and the metabolic abnormality found in primary hyperoxaluria, a group of rare genetic diseases associated with kidney failure. Fast food is known to be high in sodium and phosphate content (34), and both excessive salt intake and hyperphosphatemia are known to blunt the renoprotective effects of angiotensin–converting enzyme inhibitors and promote CKD progression (35,36). However, neither of these concepts was identified in the adjusted Cox regression analysis, and therefore, the higher cumulative incidence of ESRD for these two concepts may in part be driven by a lower risk of death. In both instances, the lower risk of death may be driven by confounding due to healthy user effect (37).
Sensitivity to Statistical Approach
Cox regression identified fewer predictors than competing risk regression in both the unadjusted and adjusted analyses. The likely explanation for the observed differences is that Cox regression and competing risk regression ask two different but related questions. Cox regression tests whether the mention of a concept is directly associated with the outcome of ESRD, and competing risk regression tests whether individuals with a given concept present in their notes were more likely to (live to) experience ESRD (31). If a concept is associated with a higher risk of death (the competing event), the HR for ESRD as measured by competing risk regression will be lower than the cause-specific hazard, because fewer individuals are alive and able to experience ESRD and vice versa. Because many of the concepts have varying associations with death, it is not surprising that concepts that have either a positive or negative association with death are identified with competing risk regression but not with Cox regression.
To a lesser extent, differences in results were also observed when choosing covariates that did not require imputation. Because over one half of patients had missing values for albuminuria and albuminuria is likely to be monitored more closely in patients with severe or worsening kidney disease, there are likely to be differences in patients with and without missing values. Multiple imputation is fairly robust to this problem as long as the plausible contributors to missingness are included in the imputation process, but we cannot know this definitively.
Limitations
The primary limitation of this study is that its findings are drawn from a single tertiary care center, which may have idiosyncrasies in documentation style and patient characteristics that may differ from other institutions. Validating this analysis in other cohorts is needed. Other limitations include missing covariate information and the open cohort design.
This approach was successful in translating the clinical narrative into a tool for the discovery of possible predictors that have not been previously linked to kidney failure. Future studies to replicate our findings and approach would be informative. The approach outlined here could potentially be used for patient-level prognosis, for population health management, or as a tool to identify previously unsuspected risk factors for CKD progression (38). As the adoption of electronic health records continues to rise and a generation of individuals has their entire health histories stored electronically, this approach provides a novel way to gain potential insights about disease risk as a natural byproduct of care delivery and electronic health record documentation.
Disclosures
All authors have completed and submitted the International Committee of Medical Journal Editors Form for Disclosure of Potential Conflicts of Interest. D.W.B. is a coinventor on Patent No. 6029138 held by Brigham and Women’s Hospital on the use of decision support software for medical management licensed to the Medicalis Corporation (Toronto, ON, Canada). He holds a minority equity position in the privately held company Medicalis Corporation, which develops web–based decision support for radiology test ordering. He serves on the board for SEA Medical Systems (San Jose, CA), which makes intravenous pump technology. He is on the clinical advisory board for Zynx, Inc. (Los Angeles, CA), which develops evidence-based algorithms. He consults for EarlySense (Ramat Gan, Israel), which makes patient safety monitoring systems. He receives equity and cash compensation from QPID, Inc. (Boston, MA), a company focused on intelligence systems for electronic health records. He receives cash compensation from CDI (Negev), Ltd. (Beersheba, Israel), which is a not for profit incubator for health information technology startups. He receives equity from Enelgy (Northridge, CA), which makes software to support evidence–based clinical decisions. He receives equity from Ethosmart (Ein Iron, Israel), which makes mobile applications to help patients with chronic diseases. He receives equity from Intensix (Netanya, Israel), which makes software to support clinical decision making in intensive care. He receives equity from MDClone (Beersheba, Israel), which takes clinical data and produces deidentified versions of it. The financial interests of D.W.B. have been reviewed by Brigham and Women’s Hospital and Partners HealthCare in accordance with their institutional policies. Otherwise, no conflicts of interest were reported.
Acknowledgments
Because Dr. Curhan is the Editor-in-Chief of CJASN, he was not involved in the peer-review process for this manuscript. Another editor oversaw the peer-review and decision-making process for this manuscript.
This research was supported, in part, by a National Institutes of Health T32 training grant awarded to the Division of Renal Medicine at Brigham and Women’s Hospital.
The funding source had no role in the study design, conduct, analysis, or decision to submit the manuscript.
Footnotes
Published online ahead of print. Publication date available at www.cjasn.org.
This article contains supplemental material online at http://cjasn.asnjournals.org/lookup/suppl/doi:10.2215/CJN.02420316/-/DCSupplemental.
- Received March 4, 2016.
- Accepted September 1, 2016.
- Copyright © 2016 by the American Society of Nephrology