Abstract
Background and objectives Digital pathology and artificial intelligence offer new opportunities for automatic histologic scoring. We applied a deep learning approach to IgA nephropathy biopsy images to develop an automatic histologic prognostic score, assessed against ground truth (kidney failure) among patients with IgA nephropathy who were treated over 39 years. We assessed noninferiority in comparison with the histologic component of currently validated predictive tools. We correlated additional histologic features with our deep learning predictive score to identify potential additional predictive features.
Design, setting, participants, & measurements Training for deep learning was performed with randomly selected, digitalized, cortical Periodic acid–Schiff–stained section images (363 kidney biopsy specimens) to develop our deep learning predictive score. We estimated noninferiority using the area under the receiver operating characteristic curve (AUC) in a randomly selected group (95 biopsy specimens) against the gold standard Oxford classification (MEST-C) scores used by the International IgA Nephropathy Prediction Tool and the clinical decision supporting system for estimating the risk of kidney failure in IgA nephropathy. We assessed additional potential predictive histologic features against a subset (20 kidney biopsy specimens) with the strongest and weakest deep learning predictive scores.
Results We enrolled 442 patients; the 10-year kidney survival was 78%, and the study median follow-up was 6.7 years. Among the manual MEST-C parameters, only the endocapillary parameter showed no prognostic relationship. The deep learning predictive score was not inferior to MEST-C applied using the International IgA Nephropathy Prediction Tool and the clinical decision supporting system (AUC of 0.84 versus 0.77 and 0.74, respectively) and confirmed a good correlation with the tubulointerstitial score (r=0.41, P<0.01). We observed no correlations between the deep learning prognostic score and the mesangial, endocapillary, segmental sclerosis, and crescent parameters. Additional potential predictive histopathologic features incorporated by the deep learning predictive score included (1) inflammation within areas of interstitial fibrosis and tubular atrophy and (2) hyaline casts.
Conclusions The deep learning approach was noninferior to manual histopathologic reporting and considered prognostic features not currently included in MEST-C assessment.
Podcast This article contains a podcast at https://www.asn-online.org/media/podcast/CJASN/2022_07_26_CJN01760222.mp3
- IgA nephropathy
- kidney biopsy
- renal insufficiency
- kidney failure
- artificial intelligence
- Oxford classification
- histopathology
- MEST-C
- deep learning
Introduction
The Oxford classification (MEST-C) (1–3) is the current gold standard for the histologic prognostic definition of IgA nephropathy. However, because the current evidence rests mostly on retrospective studies, the most recent clinical guidelines do not suggest its adoption for guiding treatment (4). Prospective, robust data on the Oxford classification from formal clinical trials are missing, mostly because of its relatively recent introduction (2009) and important revisions (2017) (3), the high costs associated with complex histologic evaluations, and poor intra- and interobserver reproducibility among pathologists (1,2,5–9).
Digital pathology has enabled innovative analyses using emerging technologies, including artificial intelligence (AI). AI offers a unique opportunity for efficient development of scoring systems. Advantages of AI may include low implementation costs and high reproducibility—two areas of current weakness in manual histologic scoring. However, before clinical adoption, AI algorithms must be developed, tested, and rigorously validated.
Automated pathologic analysis using deep learning, a branch of AI, enables the development of innovative applications in kidney pathology (10–12). Deep learning uses convolutional neural networks (CNNs), artificial counterparts of the neural networks of the brain, enabling computer learning without explicit programming. The computer is exposed to examples of desired input-output behavior, from which it automatically extracts features that are useful for a particular task (13). Previously, we effectively used a deep learning approach to automatically extract classification features from kidney biopsy immunofluorescence images (14). Deep learning of histologic images can also be used for diagnostic purposes, but the purpose of this study was to develop a histologic deep learning predictive score (DLPS) for IgA nephropathy, assessed against ground truth (kidney failure) among a consecutive cohort of patients with IgA nephropathy collected over a 39-year period. Noninferiority was assessed in comparison with the Oxford classification adopted by the current gold standard International IgA Nephropathy Prediction Tool (IIPT) (15) and the currently available clinical decision support system for estimating the risk of kidney failure in IgA nephropathy (CDSS) (16). Additional histologic features were correlated with our DLPS to identify any potential predictive features not assessed by validated predictive tools.
Materials and Methods
This retrospective, longitudinal, single-center study considered biopsy specimens with primary IgA nephropathy diagnoses registered between 1982 and 2021, extracted from our center’s dedicated kidney biopsy registry. IgA nephropathy diagnoses secondary to Henoch–Schönlein purpura, lupus nephritis, chronic liver diseases, and other immunologic disorders were not considered. We excluded biopsy specimens from transplanted kidneys or those with unavailable histopathologic slides. Specimens were visually inspected to identify images that were technically inadequate because of specimen discoloration or damage during archival storage; as required, new histologic sections were prepared and stained using Periodic acid–Schiff (PAS).
All selected biopsy specimens were codified according to patient and biopsy date (multiple biopsy specimens could have been selected for a single patient). Biopsy specimens were randomly assigned to training for deep learning (“training/validation set”) or a “test set.” Among those assigned to the test set, biopsy specimens from patients with either kidney failure or no event and with at least 5 years follow-up were further selected and included in the “final test set,” according to the IIPT’s maximum prognostic prediction time of 60 months. The final test set was used for comparison between our DLPS with the gold standard Oxford classification content of the IIPT and CDSS (see Figure 1).
Figure 1. Flow chart of kidney biopsy specimen selection, digitalization, and randomization in the training/validation set and test set. *A total of 15 biopsy specimens were provided for the VALIGA study (25); 37 biopsy specimens are missing for other reasons.
Further analysis included the evaluation of additional histologic features (not included in the Oxford classification) on a “mini-test set” comprising 20 selected biopsy specimens (the ten best and ten worst DLPS scores) from the final test set. The analysis of biopsy slides and the acquisition of clinical data included in this study was approved by the local ethics committee (prot. 434/2019/OSS/AOUMO).
Patient Data
Demographic characteristics, clinical features, and patient outcomes were collected retrospectively. For each reported parameter, patients with missing data were not considered (patients included for each parameter analysis are specified in Table 1). GFR was estimated using the 2009 Chronic Kidney Disease Epidemiology Collaboration equation (17) and mean BP was calculated as (systolic BP×2+diastolic BP)/3. CKD stage was defined according to the Kidney Disease Outcomes Quality Initiative guidelines (18). Patient outcome specified time to kidney failure, defined as the need for dialysis or kidney transplant. Patient clinical and outcome data were unknown to pathologists.
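As an illustration of the GFR estimate mentioned above, the 2009 Chronic Kidney Disease Epidemiology Collaboration creatinine equation can be sketched as follows. This is a minimal sketch, not the authors' code: the function name and argument names are hypothetical, serum creatinine is assumed to be in mg/dL, and the article does not state whether the race coefficient was applied, so it is exposed as an optional flag.

```python
def egfr_ckd_epi_2009(serum_creatinine_mg_dl, age_years, female, black=False):
    """Estimated GFR in mL/min per 1.73 m^2 (2009 CKD-EPI creatinine equation).

    Hypothetical helper for illustration; units and the optional race
    coefficient are assumptions, not details taken from the article.
    """
    kappa = 0.7 if female else 0.9          # sex-specific creatinine scaling
    alpha = -0.329 if female else -0.411    # exponent used below the kappa threshold
    ratio = serum_creatinine_mg_dl / kappa
    egfr = (
        141.0
        * min(ratio, 1.0) ** alpha
        * max(ratio, 1.0) ** -1.209
        * 0.993 ** age_years
    )
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


# Example: a 50-year-old man with serum creatinine 0.9 mg/dL
example = egfr_ckd_epi_2009(0.9, 50, female=False)
```

Keeping the equation in one small function makes it straightforward to apply the same estimate to every patient record before deriving the CKD stage.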
Table 1. Demographic and clinical characteristics of all patients included in the training/validation set and the test set
Patient follow-up data were considered up to the most recent consultation. To verify any loss to long-term follow-up, we investigated regional dialysis and transplantation registries for any additional patient outcomes. Data regarding the start date of replacement therapy and patient mortality were extracted. For follow-up of patients living in other Italian regions, local dialysis and transplantation registries were consulted. For any patients living in regions without these registries, follow-up data were censored at the last available consultation at our center.
Kidney Biopsy: Preparation and Annotation
All specimens were obtained according to a specific kidney biopsy protocol. Briefly, during an ultrasound-guided percutaneous kidney biopsy procedure, two portions of kidney tissue were sampled from the cortical area (with a Bard Max-Core semiautomatic 16- or 18-gauge 3- to 16-cm needle; Bard Peripheral Vascular Inc., Tempe, AZ): one for immunofluorescence analysis and the other for light microscopy. After Bouin protocol fixation, the microscopy specimen was embedded in paraffin and 3-µm-thick sections were sliced and stained with hematoxylin and eosin and other special stains (PAS, Masson trichrome, Jones silver, Congo red). Only PAS-stained specimens were considered in this study.
Digital slide microphotographs were acquired at high magnification (40×/0.65 NA) by a whole slide scanner for digital pathology (D-Sight; Menarini Diagnostics, Florence, Italy). Native images (GXP format) were converted (TIFF format at full resolution) for subsequent annotation (QuPath software; University of Edinburgh; https://github.com/qupath/qupath/releases/tag/v0.3.0) (19), carried out by a pathology technician (F. Gualtieri) and revised by a second, dedicated kidney pathologist (F.F.).
Each biopsy image was manually annotated to obtain compartmentalization of the section according to capsular, medullary, and cortical areas (Figure 2, Supplemental Table 1). All subsequent analyses were performed on the entire cortical area only.
Figure 2. Workflow of slide segmentation and production of overlapping patches.
Mesangial (M), endocapillary (E), segmental sclerosis (S), tubulointerstitial damage (T), and crescent (C) (MEST-C) parameters were annotated according to the current classification (1–3), quantified as zero or one for M, E, and S, or zero, one, or two for T and C. The MEST-C classification was manually scored by the pathologist (F.F.).
Digitalized Image Preparation and Datasets
Because of computational constraints (an entire biopsy section image comprises about 6 billion pixels), training a CNN on whole-section images was not feasible for a large number of biopsy specimens. Appropriate image size and resolution are important for effective training. A preprocessing pipeline, which employs several classic computer vision techniques, split the whole-slide images into overlapping patches and resized them (e.g., 512×512), thereby reducing the input image dimensions. The dataset, composed of 496 biopsy specimens (1369 sections), yielded 7199 patches (Figure 2).
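The tiling step of such a preprocessing pipeline can be sketched as below. This is a minimal sketch under stated assumptions: the article does not report the stride, the amount of overlap, or the tissue-detection techniques used, so the 512-pixel patch size and 256-pixel stride are illustrative defaults and the function names are hypothetical. Only the coordinate arithmetic is shown; reading pixels and filtering out background are left out.

```python
def patch_origins(length, patch, stride):
    """Offsets of patches of size `patch` that cover `length` pixels along one axis."""
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        # Add a final patch flush with the far edge so no pixels are left uncovered.
        origins.append(length - patch)
    return origins


def overlapping_patches(width, height, patch=512, stride=256):
    """Return (x, y, patch_width, patch_height) boxes tiling a region with overlap.

    The patch size and stride are assumed values for illustration only.
    """
    return [
        (x, y, patch, patch)
        for y in patch_origins(height, patch, stride)
        for x in patch_origins(width, patch, stride)
    ]


# Example: a 1024 x 1024 cortical region yields a 3 x 3 grid of overlapping boxes
boxes = overlapping_patches(1024, 1024)
```

A stride smaller than the patch size produces the overlap; choosing the stride trades the number of patches against redundancy between neighbors.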
Deep Learning Training (Training/Validation Set)
Deep learning training was conducted on the training/validation set according to ground truth, following a soft-label strategy that included the time periods of patient follow-up and the time to eventual kidney failure (20). The coding scheme for ground truth is outlined in Supplemental Table 2.
CNN Design
The ResNet-18 architecture (21), trained to evaluate multiple 256×512 patches simultaneously and yield biopsy specimen–level predictions, was used. Data augmentation was performed by randomly flipping and rotating each input image before feeding it to the network, because neither of these two transformations changes the semantic content of the image. Moreover, input image contrast was randomly altered so the network could not distinguish between older and newer biopsy specimens.
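The augmentation described above (random flips, rotations, and contrast changes) can be sketched in plain Python on a grayscale image stored as a list of rows. This is a minimal sketch, not the authors' pipeline: the function names, the restriction to 90-degree rotations, the 50% flip probability, and the contrast range of 0.8 to 1.2 are all assumptions made for illustration, since the article does not report these settings.

```python
import random


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image clockwise by 90 degrees (swaps width and height)."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_contrast(image, factor):
    """Scale each pixel's deviation from the image mean, clamped to [0, 255]."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [
        [min(255.0, max(0.0, mean + factor * (p - mean))) for p in row]
        for row in image
    ]


def augment(image, rng, contrast_range=(0.8, 1.2)):
    """Apply a random flip, a random multiple of 90-degree rotation, and random contrast.

    All probabilities and ranges here are assumed values, not the study's settings.
    """
    if rng.random() < 0.5:
        image = flip_horizontal(image)
    for _ in range(rng.randrange(4)):
        image = rotate_90(image)
    return adjust_contrast(image, rng.uniform(*contrast_range))


# Example: augment a tiny 2 x 2 image with a seeded generator
augmented = augment([[10, 20], [30, 40]], random.Random(0))
```

Flips and rotations only rearrange pixels, which is why they leave the semantic content untouched, whereas the contrast change deliberately perturbs intensities.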
Regularization techniques are usually excluded during inference when the goal is the best possible prediction for a single, novel image. In this study, inference was run multiple times and data augmentation techniques were used (22). During the training/validation procedure, all available images were shown to the model and weights were updated accordingly. A total of 80 epochs (reiteration of training) were performed; at each epoch, four randomly selected patches derived from the cortical area of each section were chosen. Therefore, the probability of patches from each cortical area never being shown to the model was considered negligible.
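Running inference several times on randomly augmented copies and combining the results is often called test-time augmentation. The sketch below assumes the combined score is a simple average, which the article does not state; the function names, the number of runs, the toy flip augmentation, and the stand-in model are all hypothetical.

```python
import random


def random_flip(image, rng):
    """Toy augmentation: mirror the image left to right half of the time."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image


def predict_with_tta(model, image, augment, n_runs=8, seed=0):
    """Average the model's score over several randomly augmented copies.

    `model` is any callable mapping an image to a risk score; averaging
    is an assumed aggregation rule used here for illustration.
    """
    rng = random.Random(seed)
    scores = [model(augment(image, rng)) for _ in range(n_runs)]
    return sum(scores) / len(scores)


def mean_intensity_model(image):
    """Stand-in for a trained network: returns the mean pixel value."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


# Example: the stand-in model is flip-invariant, so the averaged score is unchanged
score = predict_with_tta(mean_intensity_model, [[1, 2], [3, 4]], random_flip)
```

With a real network the individual scores differ between runs, and averaging them reduces the dependence of the prediction on any single view of the patch.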
Transfer learning was exploited by pretraining the CNN on the open-source ImageNet dataset (23,24) and then fine-tuning it to correctly classify histopathologic images. The learning rate was set during the fine-tuning process and a weighted cross-entropy loss was used.
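A weighted cross-entropy loss that accepts soft labels can be sketched for a single example as follows. This is a minimal sketch under stated assumptions: the class weights, the number of classes, and the way soft labels enter the loss are not reported in the article, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw network outputs into probabilities (numerically stabilized)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, soft_target, class_weights):
    """Loss for one example: -sum_c w_c * y_c * log(p_c).

    `soft_target` may hold fractional class memberships (soft labels);
    `class_weights` up-weights under-represented classes. Both are
    illustrative inputs, not values from the study.
    """
    probs = softmax(logits)
    return -sum(
        w * t * math.log(p)
        for w, t, p in zip(class_weights, soft_target, probs)
    )


# Example: a soft label of 0.8 / 0.2 with the second class weighted twice as heavily
loss = weighted_cross_entropy([1.5, -0.5], [0.8, 0.2], [1.0, 2.0])
```

With unit weights and a one-hot target this reduces to the ordinary cross-entropy, which is a convenient sanity check when adapting the weights.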
Prognostic Accuracy of DLPS versus Prediction Score Calculators
The DLPS output is a risk score ranging from zero to one. The score is proportional to the risk of kidney failure. The DLPS was applied to the independent final test set to generate a histologic risk prediction. The DLPS was compared with the IIPT and CDSS prediction score calculators, which are both based on MEST-C parameters (1–3). Because IIPT and CDSS prediction scores also require clinical data, we applied standardized data (median cohort clinical data) for each biopsy IIPT and CDSS predictive score generated. This was designed to eliminate the effect of clinical data (Supplemental Table 4).
DLPS Correlated with Additional Histologic Parameters (Mini-Test Set)
A manual analysis of additional histologic features not considered by the MEST-C score was performed on the mini-test set, and these features were correlated with the DLPS.
Statistical Analysis
Kidney survival curves were generated using the Kaplan–Meier method. Uni- and multivariable analyses of MEST-C and clinical characteristics for time to kidney failure outcome were calculated by Cox regression. DLPS, IIPT, and CDSS were evaluated and compared according to the following metrics: accuracy, area under the receiver operating characteristic curve (AUC), F1 score, precision, recall, and specificity.
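The comparison metrics listed above can be computed from binary outcomes and continuous risk scores as sketched below. This is a minimal sketch rather than the Stata analysis used in the study: the 0.5 decision threshold is an assumption (the article does not report the cutoff), the function names are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) formulation, with ties counted as one half.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, scores, threshold=0.5):
    """Threshold-based metrics for binary outcomes (1 = kidney failure).

    The threshold is an assumed value for illustration.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    return {
        "accuracy": safe_ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": safe_ratio(tn, tn + fp),
        "f1": safe_ratio(2 * precision * recall, precision + recall),
    }


def roc_auc(y_true, scores):
    """Probability that a random positive case outscores a random negative case."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return safe_ratio(wins, len(positives) * len(negatives))


# Example with six hypothetical biopsy-level scores
outcomes = [1, 1, 0, 0, 1, 0]
risk_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]
metrics = classification_metrics(outcomes, risk_scores)
auc = roc_auc(outcomes, risk_scores)
```

Unlike the other metrics, the AUC does not depend on the chosen threshold, which is one reason it is a natural primary measure for comparing scores on different scales.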
Correlations between our DLPS, IIPT and CDSS, the individual MEST-C parameters, and additional histologic features were assessed by Pearson correlation. For additional histologic correlation, the Bonferroni correction for repeated analyses was applied. We used Stata 11.2 software (StataCorp, College Station, TX).
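The two statistical steps named above, Pearson correlation and the Bonferroni correction, can be sketched as follows. This is a minimal sketch with hypothetical function names, not the Stata procedure used in the study; it shows only the correlation coefficient and the adjustment of already computed P values, and it assumes neither input series is constant.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


def bonferroni(p_values):
    """Multiply each P value by the number of tests, capping the result at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Example with hypothetical feature counts, scores, and raw P values
r = pearson_r([0, 1, 2, 3, 4], [0.1, 0.3, 0.2, 0.6, 0.8])
adjusted = bonferroni([0.01, 0.04, 0.5])
```

The Bonferroni adjustment is conservative: it controls the chance of any false-positive finding across all features tested, at the cost of reduced power for each individual comparison.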
Results
Patients
The participants’ clinical characteristics are outlined in Table 1. Median (25th–75th percentile) patient follow-up was 6.7 (2.5–14.6) years with a 10-year kidney survival of 78% (95% confidence interval [95% CI], 73% to 82%). Uni- and multivariable analysis of clinical characteristics in predicting kidney survival are outlined in Supplemental Table 3. Creatinine levels, urine protein-creatinine ratio, and mean BP were confirmed as significant clinical risk predictors for kidney failure (P<0.001).
Biopsy Specimens
The flow chart of kidney biopsy selection is depicted in Figure 1. From the 496 selected biopsy specimens meeting inclusion criteria, digitization yielded 1369 images (median of three slices per biopsy) and overlapping patches comprised 7199 images (Supplemental Table 1). The training/validation set included 363 biopsy specimens (73%), and the randomly generated test set (balanced for kidney failure outcome distribution) included 133 biopsy specimens (27%).
MEST-C Predictors of Kidney Survival in the Test Set
The distribution of MEST-C parameters in the test set is outlined in Table 2. The MEST parameters were included in a multivariable analysis for predictors of kidney survival. As previously proven (3,25), the T parameter is the most statistically predictive parameter for kidney failure (hazard ratio [HR], 2.06; 95% CI, 1.3 to 3.28; P=0.002). The C parameter shows a good predictive value (HR, 1.19; 95% CI, 1.04 to 3.48; P=0.03), and the M and S parameters show only marginal statistical significance (HR, 1.92; 95% CI, 1.00 to 3.7; P=0.05 and HR, 2.1; 95% CI, 0.99 to 4.43; P=0.05, respectively).
Table 2. Distribution and multivariable analysis of the MEST-C parameters (gold standard) in the study’s test set
Prognostic Accuracy of DLPS versus Prediction Score Calculators
The prognostic accuracy of DLPS was comparable with both the prediction score calculators in terms of accuracy, AUC, F1 score, precision, recall, and specificity (P=0.4) (Table 3). According to receiver operating characteristic curve analysis, the AUC for DLPS (0.84) was higher than those of the histologic components of the prediction score calculators (0.77 and 0.74, respectively), although the difference was not statistically significant (P=0.3) (see Figure 3).
Table 3. The deep learning predictive score, the International IgAN Prediction Tool, and the clinical decision support system for estimating the risk of end-stage kidney disease in IgA nephropathy prediction versus actual incidence of kidney failure in the final test set
Figure 3. Receiver operating characteristic curves of prediction accuracy of our deep learning predictive score (DLPS), the International IgA Nephropathy Prediction Tool (IIPT), and the clinical decision supporting system for estimating the risk of kidney failure in IgA nephropathy (CDSS). Analysis of the final test set (n=95 biopsy specimens). AUC, area under the receiver operating characteristic curve.
Correlations of MEST-C and Prediction Score Calculators Parameters with DLPS
Correlations between the MEST-C parameters, the histologic component of the prediction score calculators (standardized for clinical parameters), and our DLPS are shown in Table 4. A significant correlation between the prediction score calculators (IIPT and CDSS) with our DLPS is highlighted (R of 0.39 [P<0.001] and 0.33 [P<0.001], respectively). No specific correlations were observed between DLPS and the M, E, S, and C parameters. Additionally, the T parameter was highly correlated with the IIPT (R=0.96, P<0.001) and the CDSS (R=0.79, P<0.001), but only moderately with our DLPS (R=0.41, P<0.001). A significant negative correlation with the E parameter was observed with both prediction score calculators (R of −0.29 [P<0.001] and −0.47 [P<0.001], respectively) and was not replicated by our DLPS.
Table 4. Correlations between our deep learning predictive score, the International IgAN Prediction Tool, the clinical decision support system for estimating the risk of end-stage kidney disease in IgA nephropathy, and MEST-C parameters
DLPS Correlated with Additional Histologic Parameters in Mini-Test Set
Because the MEST-C parameters were poorly correlated with our DLPS scores, additional histologic features were assessed in the mini-test set (the ten highest and ten lowest scoring DLPS biopsy specimens). Infiltration of interstitial fibrosis and tubular atrophy (R²=0.70, P=0.001) and the number of hyaline casts (R²=0.70, P<0.001) were the additional histologic features most strongly correlated with our DLPS score (see Table 5).
Table 5. Additional histologic features (not included in MEST-C) scored for 20 selected images from the test set, including the ten biopsy specimens with the highest deep learning predictive scores and the ten with the lowest
Discussion
The main objective of this study was to compare the prognostic efficacy of an automated deep learning histology score with the manual MEST-C approach. Because our DLPS does not incorporate clinical data, it cannot in any way replace the current gold standard for prognostic definition of IgA nephropathy (the IIPT). Our results suggest that, compared with the histologic component of IIPT and CDSS predictive scores, our deep learning automated histologic scoring system was statistically noninferior. Further, results from this study confirm the importance of the T parameter, which was also highly correlated with our DLPS. The analysis of additional histopathologic parameters highlights that our DLPS correlates with multiple features not included in the MEST-C scoring evaluation, the strongest being inflammation within areas of interstitial fibrosis and tubular atrophy and hyaline casts. Despite these operational differences, the DLPS was accurate and this automated histologic approach may eliminate issues of human subjective bias and save time and costs compared with the “traditional” manual approach, although data do not prove superiority of DLPS to the MEST-C scores.
Post hoc identification of the features used by deep learning algorithms to carry out their predictions is inherently difficult and, in the deep learning field, the “black box” is a well-known problem (26). The availability of MEST-C parameters in our dataset enabled a direct comparison for these features, thereby partially alleviating this limitation. Analysis highlighted that the DLPS is highly correlated with the T score (Table 4). The prognostic significance of the T score is not new to the literature; many authors have reported it (1–3,25,27,28). This study highlights that our DLPS independently developed a sensitivity toward this parameter, albeit with less correlation than the IIPT and CDSS. Despite having a high prognostic capacity, our DLPS lacks any correlation with parameters related to glomerular structures (M, E, S, and C). We speculate the very low correlation with C may be due to the scarce representation of this feature in our sample (see Table 2). However, it must be emphasized that our DLPS did show a very low correlation with the parameters relating to mesangial hypercellularity (M) and segmental glomerulosclerosis (S). M and S have both previously been confirmed in two large validation studies of the Oxford classification (25,28). Furthermore, the relatively low correlation of the T parameter in our DLPS outcome compared with the IIPT and CDSS suggests other features not included in the MEST-C score are considered by our DLPS in generating risk prediction. We attempted to identify these additional histologic features through correlation with the DLPS for biopsy specimens with the worst and best DLPS (extreme values). Interestingly, the most significant histologic parameters that emerged were inflammation within areas of interstitial fibrosis and tubular atrophy and hyaline casts.
Inflammation within areas of interstitial fibrosis and tubular atrophy is associated with adverse outcomes in kidney transplantation and is related to T cell–mediated rejection. Although its biologic and functional relationship requires further investigation, T-cell infiltration appears to be a prerequisite for inflammation within areas of interstitial fibrosis and tubular atrophy development, as demonstrated in a large sequential kidney transplant biopsy study (29). Similarly, CD20-positive B cells form a prominent part of the interstitial infiltrating cells in IgA nephropathy (30). Tissue-infiltrated B cells secrete proinflammatory cytokines, chemokines, and Igs, which further exaggerate kidney inflammation by attracting more lymphocytes and provoking resident kidney cells, leading to kidney fibrosis and functional deterioration (31). Inflammation was initially evaluated by the researchers of the Oxford classification but, during the score definition, the analysis distinguishing between inflammation in scarred or unscarred tissue was abandoned due to very low reproducibility (2). Inflammation within areas of interstitial fibrosis and tubular atrophy could represent a more granular description of tubulointerstitial lesions in IgA nephropathy, but further analyses are required to confirm this hypothesis.
Hyaline casts are associated with thyroidization-type tubular atrophy (32). The increased presence of these lesions in patients with high DLPS scores is likely associated with more advanced forms of tubular atrophy. Hyaline casts were not specifically scored in the development of the Oxford classification definition (2).
Our study suggests an important precedent in defining automated histology-based prognostic scores. Our data suggest that the deep learning approach has the potential to develop histologic prognostic models that are noninferior to those developed by a manual, human-based approach. Furthermore, our study suggests this innovative approach can achieve a prognostic histologic score relatively quickly, requiring fewer resources compared with the traditional approach.
This study has limitations related to the retrospective and monocentric nature of data collection. The algorithm was developed on the basis of histopathologic data only; additional factors, such as the heterogeneity between patients’ medical treatments (steroids, immunosuppressive drugs, and the use of angiotensin-converting enzyme inhibitors and angiotensin receptor blockers) and other clinical and demographic data, were not considered in the assessment of predictive accuracy, but will be analyzed in future research. Larger, prospective studies, including individual clinical data, are necessary to validate our DLPS and to confirm its external validity.
In conclusion, this study suggests the deep learning approach is noninferior to manual histopathologic reporting and may consider prognostic features not currently included in the MEST-C assessment. The application of DLPS to a full algorithm (including clinical data) should be developed and validated.
Disclosures
G. Donati reports receiving honoraria from B. Braun Avitum and having consultancy agreements with Medtronic. L. Gesualdo reports receiving research funding from Abionyx and Sanofi; receiving honoraria from Astellas, AstraZeneca, Estor, Fresenius, Travere, and Werfen; having consultancy agreements with AstraZeneca, Baxter, Chinook, Estor, GlaxoSmithKline, Medtronic, Mundipharma, Novartis, Pharmadoc, Retrophin, Roche, Sandoz, Sanofi, and Travere; serving on the board of directors for the European Renal Association–European Dialysis and Transplant Association, Renal Pathology Society, and Società Italiana Nefrologia; serving in an advisory or leadership role for Journal of Nephrology and Nephrology Dialysis Transplantation; and receiving royalties from McGraw-Hill Education (Italy) Srl. R. Magistroni reports receiving research funding from Omeros Pharmaceutical and Reata Pharmaceutical and having consultancy agreements with Otsuka Pharmaceutical. F. Pollastri reports being employed by AstraZeneca Computational Pathology Munich. All remaining authors have nothing to disclose.
Funding
This work was supported by the Athenaeum Research Grant (FAR) 2021 of the Department of Surgical, Medical, Dental and Morphological Sciences related to Transplant, Oncology and Regenerative Medicine, University of Modena and Reggio Emilia, Italy and FSE 2014/2020 Obiettivo tematico 10, Emilia Romagna, Italy grant 2538.
Acknowledgments
We thank Prof. Rosanna Coppo and Prof. Paolo Schena for the many suggestions that have produced a significant improvement of this work and for the final revision of the draft.
Author Contributions
F. Bolelli, S. Cimino, M. Ferri, F. Fontana, L. Gesualdo, F. Giaroni, S. Giovanella, F. Gualtieri, M. Leonelli, G. Ligabue, R. Magistroni, E. Mancini, M. Nordio, F. Pollastri, P. Sacco, and F. Testa were responsible for data curation; F. Bolelli, C. Grana, R. Magistroni, and F. Pollastri were responsible for formal analysis and software; F. Fontana and R. Magistroni were responsible for investigation; R. Magistroni and F. Pollastri were responsible for methodology; C. Grana and R. Magistroni were responsible for project administration and resources; R. Magistroni conceptualized the study, provided supervision, wrote the original draft, and was responsible for funding acquisition, validation, and visualization; and G. Alfano, J. Chester, G. Donati, F. Fontana, M. Leonelli, R. Magistroni, and F. Testa reviewed and edited the manuscript.
Supplemental Material
This article contains the following supplemental material online at http://cjasn.asnjournals.org/lookup/suppl/doi:10.2215/CJN.01760222/-/DCSupplemental.
Supplemental Table 1. Coding of soft labels for deep learning training.
Supplemental Table 2. Kidney failure risk, calculated by unadjusted and multivariable Cox regression analysis of clinical characteristics.
Supplemental Table 3. Manual annotation of biopsy images.
Supplemental Table 4. Deep Learning Predictive Score (DLPS), MEST-C score, IgA Nephropathy Prediction Tool (IIPT), Clinical Decision Support System for Estimating the Risk of End-Stage Kidney Disease in IgA Nephropathy (CDSS), and kidney failure during the follow-up according to Supplemental Table 1 coding (Ground Truth) for all patients in the test set.
Footnotes
Published online ahead of print. Publication date available at www.cjasn.org.
- Received February 10, 2022.
- Accepted June 27, 2022.
- Copyright © 2022 by the American Society of Nephrology
References