The evaluation of kidney biopsies by expert pathologists remains the gold standard for diagnosing and staging renal diseases [1]. Although digitizing biopsies into Whole Slide Images (WSIs) has facilitated the visual morphological assessment of different anatomical structures for disease categorization, high-quality diagnostic assessments heavily depend on correct lesion quantification, manually annotated by pathologists across structures within a biopsy. Fig. 1 shows an example of an annotated region-of-interest (ROI) within a biopsy that contains hundreds of densely packed tissue objects. Annotating a complete biopsy costs a skilled expert around 2–4 h. Due to the complexity and time-consuming nature of this task, there is a strong need for automated structure annotation and lesion classification tools to facilitate further quantification, offload annotation time, and reduce intra-/inter-observer variability [2], [3].
Deep learning-based instance segmentation algorithms have demonstrated significant capabilities on biomedical datasets [4], [5], [6], [7], [8], [9], [10], [11], [12]. However, developing a generic framework for renal pathology remains a challenge, which can be broken down into four technical gaps and two practical issues. The four gaps (see Fig. 1) are: (1) densely packed structures, up to 1000 per ROI; (2) considerable variation in object size and shape (e.g., arteries can be up to 100 times larger than tubuli); (3) class imbalance (e.g., the tubulointerstitial area occupies more than 70% on average in healthy and diseased renal parenchyma [5]); (4) each anatomical structure may present multiple lesions. The two practical issues are: (a) how to fuse multiple datasets with staining variation to fully exploit scarce annotations; (b) readiness for extension, i.e., cost-effective adaptation to new lesion types from expanding datasets in clinical scenarios. Simultaneously addressing these difficulties requires a universal framework offering efficiency, staining-style (domain) transfer [13], [14], [15], and flexibility for continuous learning [16], [17].
Prior studies have shown significant capabilities in lesion classification through dense instance segmentation, but their paradigms lack scalability and adaptability to datasets with potential changes in lesion compositions. Some approaches [5], [7], [8], [18], [19] adopt a two-step process, semantic segmentation followed by dataset-specific post-processing to obtain final instance masks, which does not scale to large datasets with dense objects of varied shapes. Moreover, each lesion is defined as one semantic class, which complicates multi-label lesion classification and handling changes in lesion combinations: adding or removing lesions requires a complete redesign and retraining of the model. Lastly, each new lesion type requires an additional segmentation map for prediction, which is significantly resource-intensive.
Detection-based models built on region-based convolutional neural networks (RCNNs) [6], [20] can efficiently detect dense anatomical structures and use separately designed lesion classification heads for each class. They can therefore adopt a plug-and-play mechanism and adapt to lesion changes by replacing the corresponding sub-modules while maximally reusing the others. However, these variants do not scale to multi-class objects of varying scales and shapes because they rely on pre-defined bounding-box anchors, limiting their application to a single class with less variation in scale and shape, e.g., glomeruli.
Recently, transformers have become the prevalent anchor-free approach for multi-class objects at various scales and shapes. Transformer-based instance segmentation uses attention mechanisms and learns latent representative embeddings (queries) from global contextual features to process vastly varying objects. However, existing models [21], [22], [23] are resource-intensive for dense structures in large-scale datasets, due to two shortcomings: (1) a large number of static queries, one embedding per object; (2) low instance-map occupancy, one object per instance map (depicted in Fig. 2c). Hence, they are limited to classes with sparse objects, like glomeruli.
In response to these challenges, we propose a generic and extensible system for dense instance segmentation and lesion classification on large-scale datasets with potential changes in lesion combinations. Our design has two key components: (a) a novel dense instance segmentation sub-network that recognizes basic anatomical structures, i.e., glomeruli, tubuli, and arteries; (b) a lesion classification sub-network with a set of independent heads that predict lesions for each class. Modularizing lesion classification separately from dense instance segmentation is crucial for the bigger picture: segmentation of basic anatomical structures is a universal foundation for all visual assessment systems, whereas lesion classification is a task-driven downstream application that may vary with the interests of users. Insulating dense instance segmentation from local changes in the lesion classification heads therefore benefits the extensibility of our system and supports expanding datasets in clinical scenarios.
More specifically, for dense instance segmentation, we propose a novel end-to-end approach, named DiffRegFormer, which effectively combines the advantages of diffusion models, RCNNs, and transformers to tackle dense, multi-class, multi-scale objects. Diffusion models [24] generate bounding-box proposals from Gaussian noise, thereby eliminating the need for pre-defined anchors; RCNNs efficiently crop dense instance maps into dense regions with a very high occupancy rate w.r.t. the bounding boxes (see Fig. 2(a)); cross-attention mechanisms with dynamic queries [25], [26], [27], [28], [29] extract long-range contextual features and enable robust representation of objects across varying scales. DiffRegFormer is not just a simple assembly of the aforementioned techniques. Such assemblies may suffice for sparse instance segmentation on datasets like MSCOCO [30]; they fail, however, in the context of dense instance segmentation of kidney biopsy ROIs. The crux lies in the proposals, which are essentially Gaussian noise at the early training stage of diffusion models. These noisy candidates hinder training through accumulating errors. Our ablation study analyzes these effects in detail (see Section 4.4.1) and highlights the importance of our specific designs. Specifically, we introduce the following key innovations to address the challenges of dense instance segmentation:
- Regional features: similar to RCNNs, feature maps are cropped using generated proposals and converted to dynamic queries for efficient long-range dependency modeling; this is robust to large variations in scale.
- Feature disentanglement: instead of using shared feature maps, we redesign the model structure to generate separate feature maps for the bounding-box decoder and the mask decoder; this is crucial to stabilize the training of the mask decoder.
- Class-wise balanced sampling: unlike conventional sampling methods, we propose a novel approach whose key difference is selecting class-wise balanced positive samples among the ground-truth boxes instead of the proposals.
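The sampling idea in the last point can be sketched in a few lines of Python. The function name, the fixed per-class budget, and sampling with replacement for under-represented classes are illustrative assumptions, not the paper's implementation:

```python
import random
from collections import defaultdict

def classwise_balanced_positives(gt_labels, num_per_class, seed=0):
    """Sample an equal number of positive indices per class from the
    ground-truth boxes (illustrative sketch only).

    gt_labels: one class label per ground-truth box.
    num_per_class: positive budget per class; classes with fewer boxes
    are sampled with replacement, so rare classes (e.g. arteries) are
    not drowned out by dense ones (e.g. tubuli).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(gt_labels):
        by_class[label].append(idx)
    sampled = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) >= num_per_class:
            sampled.extend(rng.sample(pool, num_per_class))
        else:
            sampled.extend(rng.choices(pool, k=num_per_class))
    return sampled

# Example: 1 glomerulus, 6 tubuli, 2 arteries -> 3 positives per class,
# regardless of how imbalanced the ground truth is.
labels = ["glom"] + ["tub"] * 6 + ["art"] * 2
picks = classwise_balanced_positives(labels, num_per_class=3)
```

Sampling positives from the ground-truth boxes rather than from the proposals keeps the positive set well-defined even while the proposals are still close to Gaussian noise early in training.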
Built on these key designs, the proposed model is a generic and extensible approach for large-scale datasets with little overhead. To the best of our knowledge, DiffRegFormer is the first end-to-end framework that combines diffusion methodology with a transformer in RCNN style to efficiently process dense, multi-scale, multi-class objects within ROIs. Fig. 3 depicts a flow chart of the dense instance segmentation sub-network.
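The data flow of this sub-network can be summarized at the level of tensor shapes. All sizes, the noise-to-box mapping, and the average-pool stand-in for RoIAlign below are illustrative assumptions, not DiffRegFormer's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 256          # feature-map size (assumed)
num_proposals = 500  # dense regime: hundreds of objects per ROI

# 1. Diffusion-style proposals: boxes originate from Gaussian noise,
#    so no pre-defined anchors are needed (toy mapping into the image).
noise = rng.normal(size=(num_proposals, 4))
cx = (np.tanh(noise[:, 0]) * 0.5 + 0.5) * W
cy = (np.tanh(noise[:, 1]) * 0.5 + 0.5) * H
bw = np.abs(noise[:, 2]) * 32 + 8
bh = np.abs(noise[:, 3]) * 32 + 8
boxes = np.stack([np.clip(cx - bw / 2, 0, W - 1),
                  np.clip(cy - bh / 2, 0, H - 1),
                  np.clip(cx + bw / 2, 1, W),
                  np.clip(cy + bh / 2, 1, H)], axis=1)

# 2. RCNN-style cropping: each box is pooled to one compact regional
#    feature, so occupancy w.r.t. the box stays high (crude stand-in
#    for RoIAlign: slice then average-pool).
C = 64                             # channels (assumed)
feat = rng.normal(size=(C, H, W))  # backbone feature map stand-in
def crop_pool(fm, box):
    x0, y0, x1, y1 = (int(v) for v in box)
    region = fm[:, y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)]
    return region.mean(axis=(1, 2))

regional = np.stack([crop_pool(feat, b) for b in boxes])

# 3. Regional features become dynamic queries for cross-attention:
#    one query per proposal instead of one static query per object.
queries = regional  # shape (num_proposals, C)
```

The point of the sketch is the shape bookkeeping: one dynamic query per proposal, derived from a cropped region rather than from a full-resolution per-object instance map.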
Each anatomical structure has a dedicated classification head in the lesion classifier, since class-wise lesions can differ depending on a dataset's lesion combinations. As shown in Fig. 6, crops of anatomical structures are resized to a common size and fed into the classifier. Each head is trained only with patches of the corresponding structure and learns to predict multi-label lesions for that class. Our dataset currently contains only sclerotic glomeruli and atrophic tubuli. However, the lesion classifier extends to larger datasets at minimal cost, because adding, removing, or replacing heads does not disrupt the other modules.
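The plug-and-play head design can be illustrated schematically. The registry interface, structure names, and lesion labels below are hypothetical placeholders, not the paper's API; in practice each head would be a small network trained only on patches of its own structure:

```python
class LesionClassifier:
    """One independent multi-label head per anatomical structure.

    Heads live in a registry, so adding, removing, or replacing the
    head for one class leaves the other heads (and the upstream
    instance-segmentation sub-network) untouched.
    """

    def __init__(self):
        self.heads = {}  # structure name -> callable(patch) -> {lesion: score}

    def register(self, structure, head):
        self.heads[structure] = head

    def remove(self, structure):
        self.heads.pop(structure, None)

    def predict(self, structure, patch):
        return self.heads[structure](patch)

# Toy stand-in heads returning fixed multi-label scores.
clf = LesionClassifier()
clf.register("glomerulus", lambda patch: {"sclerotic": 0.9})
clf.register("tubulus", lambda patch: {"atrophic": 0.2})

# Extending the dataset with a new lesion type later only adds one
# head; nothing else is retrained or redesigned.
clf.register("artery", lambda patch: {"hyalinosis": 0.1})
```

This is the sense in which lesion changes stay local: the segmentation sub-network and the remaining heads never see the modification.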
Our work aims to provide a generic technical solution with the flexibility and scalability to automate lesion classification on dense anatomical structures in clinical scenarios, serving as the foundation for further quantitative analyses. The main contributions of this work are the following:
- We propose the first end-to-end dense instance segmentation framework that effectively combines diffusion models, RCNNs, and transformers for multi-class, multi-scale objects. More importantly, our model can directly process ROIs of kidney biopsies, avoiding patch splitting.
- We propose novel designs that address the accumulating errors caused by diffusion models at the early training stage and stabilize the training process.
- We propose a novel class-wise balanced sampling method to improve detection and segmentation performance.
- Instead of the static queries used in transformers, we convert regional features into dynamic queries and model the long-range dependencies between regional features while avoiding performance deterioration on dense objects.
- We compare DiffRegFormer with previous models that can process multi-class and multi-scale objects within ROIs in an end-to-end manner. Our model outperforms previously published models on Jones' silver-stained images.
- We show that DiffRegFormer has potential for stain-agnostic (domain-agnostic) detection on PAS-stained images without stain-specific fine-tuning.
- Our lesion classifier achieves multi-lesion classification. In addition, our plug-and-play strategy can flexibly adapt to large-scale datasets with potential changes in lesion combinations at minimal overhead.
The remainder of the paper is organized as follows. Section 2 introduces all relevant work and elaborates on our improvements w.r.t. previous research. Section 3 describes each component of our model in detail. Section 4 presents an extensive evaluation of our model, including comparison experiments and ablation studies. Section 5 discusses the advantages and limitations of our pipeline and outlines possible future research.