A unified approach for classification and segmentation using a Guided Attention Inference Network
In recent years, with the enormously fast development of the artificial intelligence field,
there have been many attempts to automate tasks and roles traditionally performed by humans,
and even to outperform them. In particular, medical image-based diagnostics, such as
pathology, radiology, and endoscopy, are expected to be the first
in the medical field to be affected by AI methods, especially by deep learning networks and algorithms.
For example, a convolutional neural network was recently reported as being highly beneficial in the field of endoscopy.
Medtronic's PillCam is a disposable capsule that uses a miniaturized camera to take thousands of snapshots of the GI tract. The goal is to replace the traditional endoscopy methods, which are more specialist-dependent, costlier, and more complicated. Deep learning methods can significantly help to mine and sort the most important and valuable shots, accelerating the diagnostic process and even increasing its reliability and quality by adding a computational assistant that can process huge amounts of data.
In our project, we first train a deep learning network to classify ill and healthy GI tract shots,
and then use an attention mechanism to improve the network's performance and add further capabilities, which we detail below.
Our main goal is to use a deep learning network to classify PillCam shots as healthy or ill,
and also to incorporate an attention mechanism that improves the network's performance
by focusing it on the area most significant to the classification decision.
Thus, we want to achieve the following outcomes:
Improve the classification of positives, i.e. the true-positive rate,
given a very strict bound on the number of false negatives allowed to occur.
Localize the area that contributed most to the classification decision,
and be able to show it to an observer, i.e. a weak-segmentation task.
An additional achievable outcome of the latter is increased explainability of the network's classification decisions:
if the network makes a wrong decision, the observer has additional indicators,
namely the corresponding false-localization areas, which can be analyzed for further improvement.
In our approach, which incorporates the Guided Attention Inference Network (GAIN) method,
we use the MobileNet V2 deep learning classification network for a binary classification task on ill and healthy shots,
with an additional capability of visualizing the reasoning behind those classifications.
The method for visualizing the network's reasoning is called Grad-CAM,
a prime component of the GAIN method and an improvement of the CAM method.
CAM (Class Activation Mapping), in its different variations (Grad-CAM, etc.), is a family of CNN-explanation visualization methods.
They are based on the observation that convolutional layers naturally retain spatial information which is lost in fully-connected layers,
so the last convolutional layers can be expected to have the best compromise between high-level semantics and detailed spatial information.
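To make this concrete, below is a minimal batched Grad-CAM sketch in PyTorch; the model, the hooked layer, and the input shapes are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Illustrative model and layer choice; not our exact configuration.
model = models.mobilenet_v2(weights=None)
model.eval()

activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["feat"] = output

def bwd_hook(module, grad_input, grad_output):
    gradients["feat"] = grad_output[0]

# Hook the last convolutional block, where spatial information is best preserved.
target_layer = model.features[-1]
target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

images = torch.randn(8, 3, 224, 224)               # a batch of images
logits = model(images)
class_idx = logits.argmax(dim=1)                    # ground-truth class in practice
score = logits.gather(1, class_idx.unsqueeze(1)).sum()
model.zero_grad()
score.backward()

# Channel weights = globally average-pooled gradients (the Grad-CAM recipe).
w = gradients["feat"].mean(dim=(2, 3), keepdim=True)       # (B, C, 1, 1)
cam = F.relu((w * activations["feat"]).sum(dim=1))         # (B, h, w)
cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)    # normalize per image
heatmaps = F.interpolate(cam.unsqueeze(1), size=images.shape[-2:],
                         mode="bilinear", align_corners=False)
```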
Further, we guide our network using those visualizations to improve its own decisions
(which is the main novelty of GAIN compared to the regular visualization techniques mentioned above),
in what is known as a self-attention mechanism, a family of mechanisms widely and variously used in many deep learning networks.
Additionally, the GAIN method also allows using extra supervision with the masks of our positive images,
as seen in the examples above.
Briefly, the GAIN method can be summarized as follows:
First, GAIN creates an attention map as a heatmap of its classification decision.
Second, it uses this map to modify the image, erasing the part most significant to the classification decision.
Finally, by passing the modified image through the network, defining a special AM (Attention Mining) loss,
and optimizing it, it forces the heatmap to cover all the areas significant to the classification decision.
The AM loss is defined as the classification score of the masked image for the class according to which it was masked,
i.e. the ground-truth class. Minimizing that score drives the heatmap to grow over the whole region of interest
of the ground-truth class.
As for the extra-supervision part, GAIN directly defines the loss as the pixel-wise difference between the heatmap and the pre-given masks.
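The following is a minimal sketch of these two losses, assuming a soft-masking (differentiable thresholding) step as in the paper; the function names and the sigma/omega hyperparameters are our illustrative choices.

```python
import torch

def soft_mask(heatmaps, sigma=0.5, omega=10.0):
    # Differentiable thresholding of the normalized heatmaps; sigma and
    # omega are illustrative hyperparameter choices.
    return torch.sigmoid(omega * (heatmaps - sigma))

def attention_mining_loss(model, images, heatmaps, labels):
    # Erase the regions the attention map found most significant.
    mask = soft_mask(heatmaps)                     # (B, 1, H, W)
    masked_images = images - images * mask
    # AM loss: the classification score of the masked image for its
    # ground-truth class; minimizing it forces the heatmap to cover
    # all areas significant to the decision.
    scores = torch.sigmoid(model(masked_images))   # binary ill/healthy setting (assumed)
    return scores.gather(1, labels.unsqueeze(1)).mean()

def extra_supervision_loss(heatmaps, gt_masks):
    # Pixel-wise difference between the attention maps and pre-given masks.
    return torch.mean((heatmaps - gt_masks) ** 2)
```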
Firstly, we try to reproduce the results of the GAIN paper on the VOC 2012 dataset with
the VGG network used in the paper.
We also experiment with the backward gradient path of the masked image, setting it on and off.
The influence of these gradients is strongly associated with the ability of the AM loss to grow the attention map.
Semantically, this gradient path is what lets the AM loss teach the network to state, for the input masked image,
how much it does not belong to the masked class; if the gradients are set off, that signal disappears, which leads to the attention map growing larger.
See the attached image for an illustration.
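Technically, the switch amounts to detaching the masked image before the second forward pass; a minimal sketch, with variable names of our own and soft_mask as in the snippet above:

```python
grads_on = False  # the "grads off" setting

masked_images = images - images * soft_mask(heatmaps)
if not grads_on:
    # "grads off": detach cuts the backward path through the attention
    # map, so the AM loss no longer pushes the network to lower the
    # masked class's score via the map itself; in practice the
    # attention maps then grow larger.
    masked_images = masked_images.detach()
scores = torch.sigmoid(model(masked_images))
am_loss = scores.gather(1, labels.unsqueeze(1)).mean()
```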
In the paper, the authors used a single-image-per-batch approach, which we upgraded to a multibatch mode,
allowing the direct parallel computation of many Grad-CAM heatmaps for a batch of images at a time.
On the Medtronic data, it is more convenient to use the AM loss only on positive samples,
as negative samples cannot be changed to positive by erasing any details, i.e. the whole tissue is healthy.
We set the color of the erased area to the average RGB color of the regions of interest over all of the images.
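A sketch of this color manipulation; the mean_rgb value is a placeholder, not the actual statistic we computed:

```python
# Fill the erased area with a fixed mean RGB color instead of zeros.
# mean_rgb is the per-channel average over the regions of interest of
# all images, precomputed offline; the value here is a placeholder.
mean_rgb = torch.tensor([0.5, 0.3, 0.3]).view(1, 3, 1, 1)
mask = soft_mask(heatmaps)                       # (B, 1, H, W)
masked_images = images * (1 - mask) + mean_rgb * mask
```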
The experiments were done on a remote Amazon GPU instance.
We used PyTorch as the main deep learning framework, with TensorBoard logging of all of the training and testing measurements, such as:
Loss per iteration/epoch
ROC measurement - given an FA (false-alarm) threshold of 0.05% or 0.1%, indicates the true-positive rate that can be achieved.
IOU - intersection over union of the heatmaps and the pre-given masks, measuring the localization quality and its improvement (a sketch of both computations follows).
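For reference, here is a sketch of how these two measures can be computed; the helper names are ours:

```python
import numpy as np

def tpr_at_fpr(scores, labels, max_fpr=0.001):
    # Highest true-positive rate achievable while keeping the
    # false-alarm (false-positive) rate at or below max_fpr
    # (0.001 = 0.1%, 0.0005 = 0.05%).
    order = np.argsort(-scores)
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels == 1)
    fp = np.cumsum(sorted_labels == 0)
    tpr = tp / max((labels == 1).sum(), 1)
    fpr = fp / max((labels == 0).sum(), 1)
    ok = fpr <= max_fpr
    return float(tpr[ok].max()) if ok.any() else 0.0

def iou(heatmap, gt_mask, thr=0.5):
    # Intersection over union of the binarized heatmap and the mask.
    h, m = heatmap >= thr, gt_mask >= 0.5
    union = np.logical_or(h, m).sum()
    return float(np.logical_and(h, m).sum() / union) if union else 0.0
```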
The first table (top to bottom) shows the results of training with extra supervision only, with different amounts of masks.
The second table shows the results of training with Attention-Mining self-supervision only,
and of combining the two methods as presented in the formulas (see above or in the paper),
i.e. combining Attention-Mining self-supervision with extra supervision using the pre-given masks.
Another important aspect of the results in the second table is the "grads on/off" notation,
as explained and demonstrated in the previous section.
The results we've obtained weren't unequivocal:
On the one hand, using extra supervision only gave a slightly noisy increase in our ROC measure at the 0.1% threshold,
and a more significant increase at the 0.05% threshold. It was also observed over multiple re-runs of those experiments
that this setting most frequently achieved the highest ROC results in both measures.
The IOU also increased significantly, which can be seen in the visualizations.
On the other hand, incorporating the Attention-Mining self-supervision mechanism was much trickier.
On the Medtronic data, no significant improvement was observed; in many cases, as presented in the table below,
there was an insignificant decline in the ROC measure,
with either no improvement in localization (IOU) or a decline in it.
On VOC 2012, working on reproducing the paper's results, we noted neither significant changes in the attention maps,
as presented in the paper, nor improvement in performance.
Researching the case, we found that manipulating the path of the backward gradients in the Attention-Mining training
(technical details can be found in the repository on GitHub)
can achieve, on VOC, an interesting effect of very good localization in many cases,
although at the price of a decline in accuracy,
and with a side effect of amplifying data biases, which were visible in the attention maps.
For example, as we can see in the first attached example,
a bird image in the VOC 2012 dataset is heated together with the trees beside it, revealing the bias between birds and trees/plants.
Another example observed in the process is images of boats and water, with a large water area heated.
As a result of this finding, which was highlighted by our academic staff as an interesting research direction,
and because it could behave differently on the Medtronic data, we performed experiments with and without gradient manipulation,
denoted in the second table as "grads on" (without manipulation) and "grads off" (with it).
Visual results are also presented in this manner further on.
From the above analysis and discussion of the results, the achievements can be listed as:
GradCAM as a classification-reasoning visualization technique is applicable to the Medtronic data,
      and provides the localization capability expected from an attention mechanism.
Extra-supervision training improves the performance on the ROC measurement, mostly at the 0.05% FA threshold.
Extra-supervision training significantly improves the performance on the IOU measurement, i.e. the localization quality,
      which is also noticeable in the presented visualizations of the attention maps.
Both previous advantages are achievable with 10% of the pre-given masks.
      Even with only 1% of the pre-given masks, the localization capability is much better, indicating a rather good generalization capability.
An interesting research direction of data-bias visualization was discovered
      through the process of manipulating the gradients along the AM loss path on the masked image's backward pass,
      which can potentially contribute to the academic development of this and similar visualization techniques.
A few conclusions derived from the results:
Attention-Mining self-supervision training is harder to achieve, technically in itself,
      and especially when taking into consideration the Medtronic data characteristics and the corresponding
      characteristics of the attention maps: in most cases the region of interest is smaller
      than the attention map obtained before AM training, which should be taken into account.
      It can also be noticed that in the combined extra-supervision and attention-mining training,
      the former also constrains the size of the attention map, so the latter, with gradients on or off,
      does not hurt it too badly in the localization sense; this is an interesting observation,
      which we mention in the research section.
      Another aspect of AM training in our task is the very harsh constraint on the false negatives we allow.
Attention-Mining self-supervision training needs to be examined further; more on that in the next section.
Manipulating gradients along the AM loss path on the masked image's backward pass:
      As mentioned before, this is an interesting finding. Technically, our team added the capability not only to set the gradients on/off,
      but also to control their magnitude, making it an additional hyper-tuning parameter, in such a way that
      for a magnitude parameter x, the gradients of the manipulated path are scaled by 1/x;
      thus a large value is equivalent to setting the gradients off, and x = 1 is the default case, without manipulation (a sketch follows).
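A minimal sketch of this magnitude control as a custom autograd function; this is our illustration, and the actual implementation is in the repository:

```python
import torch

class ScaleGrad(torch.autograd.Function):
    # Identity on the forward pass; divides the gradient by x on the
    # backward pass. A large x approaches "grads off"; x = 1 is the
    # default, unmanipulated path.
    @staticmethod
    def forward(ctx, tensor, x):
        ctx.x = x
        return tensor.view_as(tensor)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output / ctx.x, None

# Applied to the masked image before the second forward pass:
# masked_images = ScaleGrad.apply(masked_images, x)
```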
Bounds on the attention map size:
      Following the conclusions in the previous section, it can also be efficient to constrain the size of the attention maps,
      either by minimizing it or by bounding it according to some precomputed prior,
      e.g. the average size of the pre-given segmentation masks.
      In the image attached below, we experimented with this technique, setting the gradients off
      and learning on the negative samples, so that the heatmaps growing on the negatives also constrained the
      size of the attention maps of the positives, by the complement principle (a sketch of one possible regularizer follows).
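One possible form of such a size constraint, sketched under the assumption that the heatmaps are normalized to [0, 1] and that prior_area is the precomputed average relative mask size:

```python
# Penalize attention maps whose average activation (the "heated"
# fraction of the image) exceeds a precomputed prior, e.g. the average
# relative size of the pre-given segmentation masks. Illustrative,
# not the exact form we used.
prior_area = 0.2                                     # placeholder prior
area = heatmaps.mean(dim=(1, 2, 3))                  # (B,) heated fraction
size_loss = torch.relu(area - prior_area).mean()
```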
Training on all of the labels (not only on positives):
      One of many post-experiments, training on both negative and positive labels with AM loss weight adjustments, gave an interesting result
      which should be investigated further.
      A 0.1% FA ROC result similar to the baseline performance was achieved, with a small improvement in IOU and in the 0.05% ROC.
      It was also useful in constraining the attention map area and size of the positive images, with gradients set off during AM training:
      the heating grows over the whole image on the negatives, and by complement the attention maps of the positives become small.
Observing results on more measurements:
      Although it does not serve the main industrial goal of the project, it can be educational to seek the lowest bound on the false negatives
      that still shows an improvement during AM training, or even to use the AUC measure.
Students:
      Ilya Kotlov
Industrial Supervisors:
      Mrs. Alexandra Gilinsky, Medtronic
      Mr. Itamar Talmi, Medtronic
Academic Supervisors:
      Prof. Alexander Bronstein, Lecturer & Academic Coordinator, Technion
      Mr. Lev Yohananov, TA & Assistant, Technion
The entire project can be found on GitHub.
There you'll find the code, user/admin manual, project progress documentation, and more.
Please contact us for any suggestions, insights, questions and additional information about the project.