PathVQA: Pathology Visual Question Answering

Jerri Zhang
13 min read · Dec 13, 2020


This article was produced as part of the final project for Harvard’s AC295 Fall 2020 course.

Authors: Haoxin Li, Genevieve Lyons, Rebecca Youngerman, Jerri Zhang

Overview

The purpose of this project is to develop a set of Pathology Visual Question Answering models. Pathology is an important branch of medical practice that involves diagnosing conditions from specimens surgically removed from the body, such as biopsies. However, machine learning for digital pathology is not as widely studied as it is for other imaging sources, such as radiology [1]. Visual Question Answering aims to train a model that can respond to questions about a given pathology image. Using data provided by the PathVQA Grand Challenge [2], we create a set of three models that respond to three different types of questions about a pathology image: “What is present?” questions, yes/no questions, and free answer questions. Despite many challenges due to deficiencies in the data, our models achieve 17.8% top-1 (69.5% top-5), 84.6%, and 37.5% test accuracy, respectively.

Introduction and Background

Pathology is the study and diagnosis of disease using “surgically removed organs, tissues (biopsy samples), bodily fluids, and in some cases the whole body (autopsy)” [3]. Pathology can answer many diagnostic questions, including those about necrosis, inflammation, and cancer. For example, this is a pathology slide of a kidney, which could be used to diagnose renal failure:

While machine learning has created a renaissance in medical artificial intelligence, digital pathology has not been as widely studied as other types of imaging, such as X-rays, MRIs, and CT scans [1]. As always with machine learning in the medical field, collecting data is a challenge; it requires significant work from individuals with deep expertise. That said, there have been a number of advances using deep learning to help with digital pathology, particularly with tumor pathology [4].

Data

The March 2020 paper “PathVQA: 30000+ Questions for Medical Visual Question Answering” [5] seeks to develop a large set of images, questions, and answers for a Pathology Visual Question Answering model, and its authors have additionally created a Grand Challenge for this task using this data source [2]. The dataset can be found here: UCSD-AI4H/PathVQA. The authors pose the question: “Is it possible to develop an ‘AI Pathologist’ to pass the board-certified examination of the American Board of Pathology?”

The data were collected by the authors using a “semi-automated pipeline to extract pathology images and captions from textbooks and generate question-answer pairs from captions using natural language processing”. However, as we will discuss below, the authors’ process was not very reliable, and the resulting data contained many deficiencies that we had to work around to create usable models.

Exploratory Data Analysis

We first conducted an exploratory data analysis to learn about the structure of the PathVQA images and associated questions. We loaded in the data, which has a predefined train, test, and validation split, and viewed example images. These images show cells, organs, specific biological conditions, and various other pathology-focused subjects. Also, some images are illustrations, as the dataset was collected from various textbooks. Each image has an average of 7.6 associated questions. An example image and associated questions are shown below.

In the full dataset (including training, testing, and validation sets), there are n = 32,795 total questions associated with 4,289 total pathology images. In the training dataset, there are n = 19,755 total questions associated with 2,599 total pathology images. The bar charts below show the frequency of each answer type.

We observe that there are more yes than no answers in our dataset. This is quite notable and has a big effect on our baseline model (discussed below). When we dig into the most common questions, we observe that they follow a very specific pattern, asking “Is ____ present?”, and the answer to this question is almost always yes (see the distribution below). Essentially, these questions are not true yes/no questions; they are simply classifying where the pathology slide is from. 95% of the pathology slides have at least one “Is ___ present?” question. This observation drives our modeling decisions and next steps, described below.

We also investigate the open-ended answers, for which we train a separate model. These open-ended answers are distributed as follows:

Finally, we observe that the answers are often very short (fewer than 10 words, and frequently just one word), while the questions tend to be longer, up to 100 words.

Methods

Challenges

Our first attempt was a naive model (see the Methods section for a description of the architecture) that we trained using all yes/no questions and associated images. This model technically achieved 77% accuracy. However, as we noted before, a large volume of the questions were of the form “Is ___ present?” and almost always had an answer of “yes”. Our model was able to perform so well initially because it simply guessed “yes” for all of those questions. Because of this data deficiency, we decided to exclude these “Is ___ present?” questions from our yes/no VQA model. However, these questions contain useful information about what is present in the image and should not be discarded. So, we repurposed them to create a classification model using their designations, e.g. if the question is “Is cardiovascular present?” and the answer is yes, then the label for the classification model is “cardiovascular”.
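To make this repurposing concrete, the sketch below shows one way to turn affirmative “Is ___ present?” pairs into classification labels. The DataFrame layout, column names, and regular expression are illustrative assumptions, not our exact preprocessing code.

```python
import re
import pandas as pd

# Tiny illustrative example; in practice qa_df holds all question/answer pairs.
qa_df = pd.DataFrame({
    "image_id": ["img_0001", "img_0001", "img_0002"],
    "question": ["Is cardiovascular present?", "Is the tissue necrotic?", "Is endocrine present?"],
    "answer":   ["yes", "no", "yes"],
})

PATTERN = re.compile(r"^is (.+?) present", flags=re.IGNORECASE)

def extract_label(question):
    """Pull the designation out of an 'Is ___ present?' question, e.g. 'cardiovascular'."""
    match = PATTERN.match(question.strip())
    return match.group(1).strip().lower() if match else None

qa_df["label"] = qa_df["question"].map(extract_label)

# Keep only the affirmative "Is ___ present?" pairs and use the designation as the class label.
labels_df = qa_df[
    qa_df["label"].notna() & (qa_df["answer"].str.lower() == "yes")
][["image_id", "label"]].drop_duplicates()

print(labels_df)  # img_0001 -> cardiovascular, img_0002 -> endocrine
```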

The freeform response questions had a similar data deficiency. For example, the question “Where is this part in?” almost always has the answer “spleen”. We excluded questions like these that would artificially inflate our accuracy metrics. We also limited the freeform answers to the 50 most common answers.

Experiment Setup

Taking the above challenges into consideration, the project is set up so that our PathVQA system can answer three types of questions, depicted in the flowchart below:

Data Processing

We predict the likelihood of 17 classes for our classification task. These classes were generated from our “Is ____ present?” questions, restricting to those with at least 20 occurrences across the train, test, and validation datasets. The distribution of these answers is still skewed within our training set, to the point where initially our models would only predict the few most common classes. To address this imbalance, we re-sampled [6] the data to balance each class to 100 occurrences, so some classes were over-sampled. We performed data augmentation by randomly flipping and rotating the images in order to prevent overfitting on the images that are seen multiple times in training and to generally increase the diversity of our image data [7]. Our final training dataset is 990 images, validation dataset is 369 images, and test dataset is 370 images.

Our initial class distribution is seen on the left, and the final distribution after re-sampling is seen on the right.
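A minimal sketch of this balancing and augmentation step is below, assuming the training labels live in a pandas DataFrame with a label column; the resampling utility and augmentation parameters shown are illustrative rather than our exact configuration.

```python
import pandas as pd
import tensorflow as tf

def balance_classes(train_df: pd.DataFrame, target: int = 100) -> pd.DataFrame:
    """Re-sample every class to `target` rows, over-sampling rare classes with replacement.

    Assumes train_df has a "label" column holding one of the 17 classes per image.
    """
    return (
        train_df.groupby("label", group_keys=False)
                .apply(lambda g: g.sample(target, replace=len(g) < target, random_state=42))
                .reset_index(drop=True)
    )

# Random flips and rotations applied on the fly, so over-sampled images are rarely
# seen in exactly the same orientation twice during training.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.25),   # up to roughly +/- 90 degrees
])
```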

Classification Model: without transfer learning

This model used an architecture consisting of four convolutional neural network (CNN) layers, each immediately followed by a pooling layer. The model finishes with a flattening layer, a dropout layer to help prevent overfitting, and two dense layers. The final dense layer corresponds to our number of classes (17), so that we generate a prediction value for each potential class for each image. The model summary details the architecture, with a total of 2,770,777 parameters.
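A Keras sketch in this spirit is shown below; the filter counts and dense-layer width are illustrative, so the parameter count will not match the 2,770,777 reported above exactly.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 17  # one organ-system class per "Is ___ present?" designation

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    # Four convolution + pooling pairs (filter counts here are illustrative)
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                                 # regularization against overfitting
    layers.Dense(256, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),     # one prediction score per class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```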

Classification Model: with transfer learning

We fine-tuned weights from VGG16 trained on ImageNet in order to increase our model accuracy. We employ transfer learning using these weights, setting only the last few VGG layers as trainable. On top of the pre-trained VGG network, we train two CNN and pooling layer pairs, as well as a flatten layer, a dropout layer, and two dense layers (similar to our non-transfer-learning model). This results in a total of 15,136,557 parameters, though only 421,849 are trained on our pathology images. As will be discussed in the results section, transfer learning substantially increases our performance.
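A hedged sketch of this transfer-learning setup is below, using Keras’ built-in VGG16; which VGG block is unfrozen and the sizes of the added layers are illustrative rather than our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 17

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
# Freeze everything except the last convolutional block of VGG16
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

model = models.Sequential([
    base,
    # Two extra conv + pooling pairs trained on the pathology images (sizes illustrative)
    layers.Conv2D(128, 3, padding="same", activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(128, 3, padding="same", activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```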

No Attention VQA: LSTM + VGG

In this model, we used an embedding layer followed by two bidirectional LSTM layers to analyze the questions, and VGG-16 with weights pre-trained on ImageNet to analyze the pathology images. Traditionally used embeddings such as GloVe are not appropriate in a clinical setting, as many important words are out-of-vocabulary or underrepresented, so we trained the embeddings from scratch. The resulting latent space embeddings are concatenated and then combined in an MLP to predict the response. The MLP has two hidden layers with 1024 and 512 nodes, respectively, dropout (p=0.1), and ReLU activation functions. The architecture is very similar to the one proposed in “VQA: Visual Question Answering” [8]. We trained this model separately for yes/no answers and freeform answers.

Source: Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C. and Parikh, D., 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision (pp. 2425–2433).
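The sketch below shows the overall shape of this model in Keras. The vocabulary size, embedding dimension, and LSTM widths are assumed values; the 1024/512 MLP head with dropout p=0.1 follows the description above, and the output size is 2 for the yes/no variant.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

VOCAB_SIZE, MAX_LEN, N_ANSWERS = 5000, 100, 2   # assumed sizes; N_ANSWERS = 2 for yes/no

# Image channel: frozen VGG16 pre-trained on ImageNet
image_in = layers.Input(shape=(224, 224, 3))
vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")
vgg.trainable = False
image_feat = layers.Dense(512, activation="relu")(vgg(image_in))

# Question channel: embeddings trained from scratch + two bidirectional LSTMs
question_in = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, 128)(question_in)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
question_feat = layers.Bidirectional(layers.LSTM(128))(x)

# Fuse the two latent representations and predict the answer with an MLP
h = layers.Concatenate()([image_feat, question_feat])
h = layers.Dense(1024, activation="relu")(h)
h = layers.Dropout(0.1)(h)
h = layers.Dense(512, activation="relu")(h)
h = layers.Dropout(0.1)(h)
answer = layers.Dense(N_ANSWERS, activation="softmax")(h)

model = Model(inputs=[image_in, question_in], outputs=answer)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```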

No Attention VQA: BERT + VGG

This model is similar to the one described above, but instead of using an LSTM and embeddings trained from scratch, we used BERT under the assumption that some generic text knowledge and semantic meaning will still be able to transfer to the clinical domain. We did not re-train the BERT embeddings. We trained this model separately for yes/no answers and freeform answers.
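Conceptually, the question channel is swapped for frozen BERT features, as in the sketch below (using the Hugging Face transformers library and the generic bert-base-uncased checkpoint as stand-ins for whichever BERT variant is available); the resulting [CLS] vector is then concatenated with the VGG image features and fed to the same MLP head as above.

```python
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert.trainable = False  # the BERT embeddings are not re-trained

def question_embedding(questions):
    """Return one fixed 768-dimensional [CLS] vector per question."""
    tokens = tokenizer(questions, padding=True, truncation=True, return_tensors="tf")
    outputs = bert(**tokens)
    return outputs.last_hidden_state[:, 0, :]   # shape: (n_questions, 768)

q_vec = question_embedding(["Is necrosis present in the renal cortex?"])
# q_vec replaces the bidirectional-LSTM output in the concatenation step of the model above.
```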

Stacked Attention VQA: BERT + VGG

We implemented the attention-based model based on the structure proposed in “Stacked Attention Networks for Image Question Answering” [9]. Similar to the VQA structures mentioned in the sections above, a stacked attention network (SAN) has an image channel and a text channel to ingest pathology images and questions. Additionally, we added two attention layers to combine the information from both channels.

Source: Z. Yang, X. He, J. Gao, L. Deng and A. Smola, “Stacked Attention Networks for Image Question Answering,” 2016 IEEE CVPR, 2016, pp. 21–29, doi: 10.1109/CVPR.2016.10.

For the image channel, we first rescaled the image to 224 by 224. Instead of extracting features from the final fully connected layers, we extracted a pooling-layer feature map to retain the spatial information of the original images. The extracted feature map has dimensions 512 × 14 × 14, where 14 × 14 is the number of regions in the input image and each region is represented by a feature vector of length 512. Each such feature vector corresponds to a 16 × 16 pixel region of the input image, and these feature vectors are what the text channel pays attention to. For the question channel, we used pretrained DistilBERT to extract text embeddings.
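A sketch of this image-channel feature extraction is below. Note that for a 224 × 224 input, Keras’ VGG16 yields a 14 × 14 × 512 map at the block4_pool layer (the very last pooling layer, block5_pool, would give 7 × 7 × 512), so that is the layer assumed here.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
feature_extractor = tf.keras.Model(
    inputs=base.input,
    outputs=base.get_layer("block4_pool").output,   # shape: (batch, 14, 14, 512)
)
feature_extractor.trainable = False

def image_regions(images):
    """Return 14 * 14 = 196 region vectors of length 512 per image for the attention layers."""
    feats = feature_extractor(images)                          # (batch, 14, 14, 512)
    return tf.reshape(feats, (tf.shape(feats)[0], 196, 512))
```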


To implement the attention layer, we first generated the attention distribution of the text embedding over the regions of the image (eqs. 15–16). Based on this distribution, we calculated a weighted sum of the image region vectors (eq. 17) and formed a refined query vector (eq. 18). The query vector encodes both the question and the visual information that is relevant to the answer. Sometimes a single attention layer is not sufficient to locate the correct region for answering the question. In that case, we can stack multiple attention layers (eqs. 19–22), so the aggregated image feature vector is added to the previous query vector to form a new query vector. The updated query vector can extract more fine-grained visual information and improve attention performance.
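A minimal sketch of one such attention step (eqs. 15–18) as a Keras layer is shown below; the hidden size and variable names are illustrative, and the image region vectors are assumed to have been projected to the same dimension as the question vector so that eq. 18 can add them.

```python
import tensorflow as tf
from tensorflow.keras import layers

class StackedAttention(layers.Layer):
    """One attention step from Yang et al. (2016), eqs. 15-18."""

    def __init__(self, hidden_size=512, **kwargs):
        super().__init__(**kwargs)
        self.w_image = layers.Dense(hidden_size, use_bias=False)   # W_I
        self.w_query = layers.Dense(hidden_size)                   # W_Q, b_Q
        self.w_attn  = layers.Dense(1)                             # W_P, b_P

    def call(self, v_image, v_query):
        # v_image: (batch, m, d) region features; v_query: (batch, d) question vector
        # eq. 15: joint representation of each image region and the question
        h = tf.nn.tanh(self.w_image(v_image) + self.w_query(v_query)[:, None, :])
        # eq. 16: softmax over the m regions gives the attention distribution
        p = tf.nn.softmax(tf.squeeze(self.w_attn(h), -1), axis=-1)
        # eq. 17: attention-weighted sum of the region vectors
        v_tilde = tf.reduce_sum(p[:, :, None] * v_image, axis=1)
        # eq. 18: refined query = attended visual evidence + previous query
        return v_tilde + v_query

# Stacking two layers, as in our model:
# u1 = StackedAttention()(regions, question_vec)
# u2 = StackedAttention()(regions, u1)   # u2 feeds the final answer classifier
```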


Results

Classification Model: “Is ___ present?”

The results from the classification model show that we are able to properly classify what is present in the pathology image, with a top-1 test accuracy of 18% and a top-5 test accuracy of 70%. Using transfer learning with VGG results in a significant increase in the accuracy of our models. The large discrepancy between top-1 and top-5 test accuracy is likely because certain organs appear very similar, and because of the significant variation in image quality and type in the dataset. For example, the model frequently mistakes cardiovascular (i.e., a heart) for endocrine.

We visualized the classification model using Grad-CAM to ensure that it is properly identifying the organs, and to understand why it sometimes mis-identifies them.
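The sketch below is the standard Grad-CAM recipe we relied on: take the gradient of the predicted class score with respect to a late convolutional feature map, average it into per-channel weights, and form a ReLU-ed, normalized heatmap. The model, image, and convolutional layer name passed in are the caller's assumptions (e.g. a VGG block5 convolution in our transfer-learning classifier).

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index=None):
    """Return a normalized heatmap of the pixels most responsible for a class prediction."""
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])       # add a batch dimension
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))           # explain the top prediction
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)             # d(score) / d(feature map)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))          # per-channel importance
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)      # weighted feature map
    cam = tf.nn.relu(cam)                                    # keep only positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()       # heatmap in [0, 1] to overlay
```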

For example, below is a Grad-CAM visualization of a correct prediction of endocrine. We see that the most important pixels cover exactly the organ of interest (here, a thyroid gland), ignoring the blue background, the ruler, and the black square in the corner. Specifically, the model focuses on the bottom-left portion of the organ, with its distinctive butterfly shape.

We can see similar logic in an incorrect prediction of “endocrine” for a heart. The model disregards the background and correctly locates the organ, but it focuses on the bottom left of the organ, sees a butterfly-like shape caused by some tissue in the corner, and mis-classifies the image as endocrine. The top 5 predictions for this image are similar to those for the true endocrine image above, including gastrointestinal and hematologic.

VQA Models: Yes/No Questions and Top 50 Freeform Answers

The results for the VQA models show that the model with two attention layers between the question and the image significantly outperforms the non-attention models for yes/no questions, with 85% test accuracy and a test AUC of 0.94, compared to a test accuracy of 80% for the LSTM and 76% for the BERT model without attention. It is interesting that the LSTM outperforms the BERT model without attention. This is likely because we trained the embeddings from scratch for the LSTM, but used pretrained embeddings for BERT, which may not have been as appropriate or specific for the medical setting.

However, the BERT model without attention performs better than the BERT model with attention for the freeform answers, with a test accuracy of 37% vs. 33%. The LSTM performs very poorly for this task, with 6% test accuracy after 10 epochs and 15% after 50 epochs. It is important to note that, due to the data deficiencies described above, the sample size for the freeform answers was ultimately quite low (n = 536 question/answer/image pairs), so the results may be unstable. This small sample size likely drives the weaker performance of the attention model relative to the BERT non-attention model, and it especially hurts the LSTM, whose embeddings are trained from scratch and therefore likely lack sufficient observations to train properly (compared to the yes/no questions).

Discussion

Our results indicate that transfer learning significantly improves the performance of the classification model. For the VQA models, we found that the model with attention layers between the image and the question significantly improved our results for the yes/no questions. However, we hypothesize that the attention model performed poorly for freeform answer questions due to the very small sample size.

We found that the pre-trained embeddings in BERT were somewhat transferable to the specificity of the medical domain, but we attained better results by training our embeddings from scratch (as evidenced by the LSTM outperforming the non-attention version of BERT). However, again, a larger sample size would be needed for these embeddings to be fully trained for the freeform answers.

Because of the heterogeneous nature of pathology images (which can be of any organ system in the human body), we believe that the classification task is a critical one, and any Pathology VQA model intended for medical practice should be trained in conjunction with a classification task, or separately for different organ systems.

Pathology Visual Question Answering is an important task, and our results indicate that it is a promising area of research. Maybe one day an AI pathologist will be able to pass the board certification examination of the American Board of Pathology. However, to accomplish this goal, it will be critical to develop a better and more appropriate dataset.

Finally, it is important to note that any AI pathologist intended for use in medical practice would need significant input from a trained pathologist. Our team is not qualified to make these determinations. Medical AI should be used in tandem with, and never as a replacement for, medical professionals.

References

[1] Niazi MKK, Parwani AV, Gurcan MN. Digital pathology and artificial intelligence. Lancet Oncol. 2019 May;20(5):e253-e261. doi: 10.1016/S1470-2045(19)30154-8. PMID: 31044723.

[2] Pathology Visual Question Answering. https://pathvqachallenge.grand-challenge.org/

[3] What Is Pathology? https://www.mcgill.ca/pathology/about/definition

[4] Jiang Y, Yang M, Wang S, Li X, Sun Y. Emerging role of deep learning-based artificial intelligence in tumor pathology. Cancer Commun (Lond). 2020;40(4):154–166. doi:10.1002/cac2.12012

[5] He, Xuehai & Zhang, Yichen & Mou, Luntian & Xing, Eric & Xie, Pengtao. (2020). PathVQA: 30000+ Questions for Medical Visual Question Answering.

[6] Resampling strategies for imbalanced datasets. https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets

[7] Shorten, C., Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J Big Data 6, 60 (2019). https://doi.org/10.1186/s40537-019-0197-0

[8] S. Antol et al., “VQA: Visual Question Answering,” 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 2425–2433, doi: 10.1109/ICCV.2015.279.

[9] Z. Yang, X. He, J. Gao, L. Deng and A. Smola, “Stacked Attention Networks for Image Question Answering,” 2016 IEEE CVPR, 2016, pp. 21–29, doi: 10.1109/CVPR.2016.10.
