The data curation process for training vision self-supervised learning (SSL) models as described in the paper "DINOv2: Learning Robust Visual Features without Supervision" involves several steps:
- Data Sources: The authors use a combination of curated and uncurated data sources. The curated datasets include ImageNet-22k, the train split of ImageNet-1k, Google Landmarks, and several fine-grained datasets. For the uncurated data source, they collect a raw unfiltered dataset of images from a publicly available repository of crawled web data.
- Deduplication: The authors apply a copy detection pipeline to the uncurated data and remove near-duplicate images. This reduces redundancy and increases diversity among images. They also remove near-duplicates of images contained in the test or validation set of any benchmark used in the work.
- Self-supervised Image Retrieval: The authors build their curated pretraining dataset by retrieving images from the uncurated data source that are close to images in the curated sources. They compute an image embedding using a self-supervised ViT-H/16 network pretrained on ImageNet-22k and use cosine similarity as a distance measure between images. They then perform k-means clustering of the uncurated data; given a query image from a curated dataset, they retrieve its nearest uncurated images, or sample images from the cluster containing the query when the query dataset is small.
- Implementation Details: The deduplication and retrieval stages of their pipeline rely on the Faiss library to efficiently index and compute batch searches of nearest embeddings. The whole processing is distributed on a compute cluster of 20 nodes equipped with 8 V100-32GB GPUs and takes less than two days to produce the LVD-142M dataset.
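The core of the retrieval step is a batch nearest-neighbor search in embedding space. A minimal numpy sketch of that idea (the actual pipeline uses a Faiss index over ViT-H/16 embeddings; the function name and toy data here are hypothetical):

```python
import numpy as np

def retrieve_nearest(curated_emb, uncurated_emb, k=2):
    """For each curated embedding, return the indices of the k most
    cosine-similar uncurated embeddings. A toy stand-in for the
    Faiss-based batch search used in the paper."""
    # L2-normalize so a dot product equals cosine similarity.
    c = curated_emb / np.linalg.norm(curated_emb, axis=1, keepdims=True)
    u = uncurated_emb / np.linalg.norm(uncurated_emb, axis=1, keepdims=True)
    sims = c @ u.T                        # (n_curated, n_uncurated)
    # Sort descending per query, keep the top-k candidate indices.
    return np.argsort(-sims, axis=1)[:, :k]

# Toy usage: 3 curated queries, 5 uncurated candidates, 4-d embeddings.
rng = np.random.default_rng(0)
curated = rng.normal(size=(3, 4))
uncurated = rng.normal(size=(5, 4))
print(retrieve_nearest(curated, uncurated))  # (3, 2) array of indices
```

At DINOv2's scale the brute-force `c @ u.T` is replaced by an approximate GPU index, but the cosine-similarity ranking is the same.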
This pipeline is quite specific to web-scale natural images, so it is interesting to ask how data curation is handled in other domains, such as computational pathology.
In the paper by Richard Chen et al., the process of selecting patches and whole-slide images (WSIs) for pretraining is as follows:
- Whole-Slide Images (WSIs) Selection: The WSIs used for pretraining were selected from the TCGA-BRCA (The Cancer Genome Atlas - Breast invasive carcinoma) cohort. The TCGA-BRCA dataset is a large-scale, publicly available dataset that contains histopathological images of breast cancer tissues.
- Patch Extraction: The Tissue Image Analysis (TIA) toolbox was used to tessellate each WSI into non-overlapping patches of size 256x256 at 20x magnification. This ensures that every patch has a consistent size, which is important for training the model; filtering for tissue content happens in the next step.
- Patch Selection: The patches used for pretraining were those that contained tissue. This was determined using the TIA toolbox, which can identify tissue-containing patches. In total, 2,055,742 image patches were curated from 1,038 WSIs from the TCGA-BRCA cohort.
- Data Augmentation: Standard learning and data augmentation parameters from the respective source papers of the self-supervised learning methods (SimCLR and DINO) were used. For DINO, this included constructing a set of 8 local views (96x96 crops) and 2 global views (224x224 crops) to encourage local-to-global correspondences between the student and teacher models.
- Model Training: The self-supervised learning models were trained on the curated patches and evaluated at 100 epochs.
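The tessellation and tissue-selection steps can be sketched with plain arrays. This is only an illustration: the paper uses the TIA toolbox for tissue detection, while here "tissue" is approximated as "not mostly white background" (mean intensity below a hypothetical threshold):

```python
import numpy as np

def tessellate(wsi, patch_size=256, tissue_thresh=0.9):
    """Split a (H, W, 3) slide array (values in [0, 1]) into
    non-overlapping patches and keep only those likely to contain
    tissue. A minimal sketch of the extraction + selection steps."""
    h, w, _ = wsi.shape
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patch = wsi[y:y + patch_size, x:x + patch_size]
            # Glass background is near-white (~1.0); keep darker tiles.
            if patch.mean() < tissue_thresh:
                patches.append(patch)
    return patches

# Toy slide: white background with a dark "tissue" region in one corner.
slide = np.ones((512, 512, 3))
slide[:256, :256] = 0.4
kept = tessellate(slide)
print(len(kept))  # 1: only the dark corner tile survives the filter
```

Real pipelines operate on pyramidal slide formats (reading regions at the 20x level rather than loading the full slide into memory), but the keep/discard logic per tile is analogous.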
It's important to note that the selection of patches and WSIs was not random, but rather was based on the presence of tissue in the patches and the availability of WSIs in the TCGA-BRCA cohort. The goal was to curate a dataset that is representative of the diversity of morphological phenotypes in breast cancer histopathology.
Note that this paper did not use an elaborate data selection and redundancy-reduction process like the one in DINOv2. It is reasonable to think, intuitively at least, that WSI patches contain a lot of redundancy, while a very large amount of unsupervised patch data is available; a proper data curation process should therefore be useful.
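One simple form such curation could take is embedding-based near-duplicate removal. A toy sketch of the idea (a greedy threshold filter; the DINOv2 pipeline instead uses a dedicated copy-detection model with a Faiss similarity search, and the threshold here is hypothetical):

```python
import numpy as np

def deduplicate(embeddings, threshold=0.95):
    """Greedy near-duplicate removal: keep an item only if its cosine
    similarity to every already-kept item is below `threshold`.
    Returns the indices of the retained items."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i in range(len(e)):
        # Compare against retained items only; O(n * kept) dot products.
        if all(e[i] @ e[j] < threshold for j in kept):
            kept.append(i)
    return kept

emb = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
print(deduplicate(emb))  # [0, 2]: row 1 is a near-duplicate of row 0
```

Applied to SSL patch embeddings of WSI tiles, a filter like this would shrink the many visually repetitive tissue regions down to a more diverse training pool.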
- Evaluation: the CLAM weakly-supervised model trained on SSL patch features
- Ablation: reduction of the training dataset size (100%, 75%, 50%, 25%)
- Results: The authors found that ImageNet features achieve slightly lower (but comparable) performance on many tasks. They also found that self-supervised methods are more robust, as demonstrated by DINO achieving good performance with only 25% of the original training data in BRCA subtyping. Finally, DINO outperforms SimCLR in weakly-supervised tasks as well as in all patch-level tasks.
- Interpretability: The authors used the attention weights from the [CLS] token that pools over the patch embeddings to visualize what DINO has learned. They found that the attention distributions each capture distinct morphological phenotypes, localizing cell locations, stroma tissue, and fat/air pockets.
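Extracting such a heatmap amounts to taking the [CLS] row of a ViT attention matrix and reshaping it onto the patch grid. A minimal sketch with made-up logits rather than a real model's activations:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def cls_attention_map(qk_logits, grid=(14, 14)):
    """Given attention logits of a ViT layer whose token 0 is [CLS],
    return the [CLS]-to-patch attention reshaped onto the patch grid,
    i.e. the quantity visualized as an attention heatmap."""
    attn = softmax(qk_logits, axis=-1)   # each row sums to 1
    cls_to_patches = attn[0, 1:]         # drop the CLS->CLS entry
    return cls_to_patches.reshape(grid)

n = 1 + 14 * 14                          # [CLS] + 14x14 patch tokens
logits = np.zeros((n, n))                # toy: uniform attention
heat = cls_attention_map(logits)
print(heat.shape)  # (14, 14)
```

With a 224x224 input and 16x16 patches the grid is 14x14; multi-head models give one such map per head, which is how the distinct per-head phenotypes are visualized.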
- Image-Level Objective: Just like DINO, DINOv2 follows the idea of a teacher-student model where a cross-entropy loss is measured between their respective extracted features.
- Patch-Level Objective: Unlike DINO, DINOv2 makes the student's job harder by masking some of the input patches, and additionally measures how close the student's output for each masked patch is (in the sense of a cross-entropy loss) to the teacher's output for the corresponding unmasked patch. This loss is added to the image-level loss. This is related to iBOT and MAE.
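A toy sketch of this patch-level objective, with random logits standing in for the two networks' outputs (shapes and the prototype count are illustrative, not the paper's values):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def masked_patch_loss(student_logits, teacher_logits, mask):
    """Mean cross-entropy between the student's predictions on masked
    patch tokens and the teacher's soft targets on the same (unmasked)
    patches - the iBOT-style patch-level objective in sketch form.
    Shapes: (num_patches, num_prototypes); mask is a boolean vector."""
    t = softmax(teacher_logits[mask])               # teacher soft targets
    log_s = np.log(softmax(student_logits[mask]))   # student log-probs
    return -(t * log_s).sum(axis=-1).mean()         # mean CE over masked patches

rng = np.random.default_rng(0)
s = rng.normal(size=(8, 16))                        # 8 patches, 16 prototypes
mask = np.array([True, True, False, False, True, False, False, True])
loss = masked_patch_loss(s, s, mask)
print(loss >= 0.0)  # True: cross-entropy is non-negative
```

Only masked positions contribute, so the student must infer the teacher's per-patch features from surrounding context; the total training loss adds this term to the image-level one.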
- Untied Head Weights: Using two separate heads (instead of shared weights) for the patch-level and the image-level objectives was observed to help, so that's what is implemented.
- Sinkhorn-Knopp Centering: the teacher's softmax-centering step is replaced by Sinkhorn-Knopp batch normalization of the assignments, as in SwAV.
- KoLeo Regularization: a regularizer derived from the Kozachenko-Leonenko differential entropy estimator that encourages the features within a batch to spread out uniformly.
- Resolution Adjustment: the image resolution is increased to 518x518 for a short period at the end of pretraining, yielding high-resolution features without paying the full cost of high-resolution training.
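The Sinkhorn-Knopp centering step can be sketched as alternating row/column normalization of the teacher's score matrix, following the SwAV-style procedure (a toy numpy version with an illustrative iteration count, not the paper's distributed implementation):

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=3):
    """Sinkhorn-Knopp normalization of a (batch, prototypes) score
    matrix: alternately normalize over prototypes and over samples so
    that assignments are balanced across prototypes. Returns per-sample
    assignment distributions."""
    q = np.exp(scores).T                   # (prototypes, batch)
    q /= q.sum()
    k, b = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # each prototype gets equal total mass
        q /= k
        q /= q.sum(axis=0, keepdims=True)  # each sample's assignment sums to one
        q /= b
    return (q * b).T                       # rows are per-sample distributions

rng = np.random.default_rng(0)
assign = sinkhorn_knopp(rng.normal(size=(6, 4)))   # 6 samples, 4 prototypes
print(np.allclose(assign.sum(axis=1), 1.0))  # True: each row is a distribution
```

Compared with plain softmax centering, the column-normalization step actively pushes mass toward under-used prototypes, which is the collapse-prevention role it plays in the teacher branch.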