Author ORCID Identifier

Document Type


Date of Award


Degree Name

Master of Science (MS)


Computer Science

First Advisor

KC Santosh


Chronic respiratory diseases, ranked as the third leading cause of death worldwide in the 2017 World Health Organization (WHO) report, affect 544.9 million individuals. Compounding this public health challenge, over 80% of health systems face shortages in their radiology departments, which highlights an urgent need for accessible and efficient diagnostic solutions. Various image classification models have been developed for analyzing thorax abnormalities, but relying on a single type of data (image data, for example) is insufficient. Integrating text with image data could yield more accurate results as well as richer analysis. In response, we propose a multimodal approach that generates detailed radiology reports from chest X-ray images and their corresponding radiological reports (Impression and Findings). Our framework integrates a pre-trained Convolutional Neural Network (CNN) for robust image feature extraction, a Recurrent Neural Network (RNN), and a visual attention mechanism to ensure coherent sentence generation. The image encoder employs the ResNet152 architecture to extract visual features from chest X-ray images, while the sentence generation model uses a Long Short-Term Memory (LSTM) layer to process textual data and generate contextually relevant reports. On an IU dataset of 7470 pairs of X-ray images and 3995 reports, our model outperformed other benchmark models on language generation metrics (BLEU-1 = 0.4424, BLEU-2 = 0.2923, BLEU-3 = 0.207, BLEU-4 = 0.1464, ROUGE = 0.3396, and CIDEr = 0.2268), producing more accurate and coherent impressions and findings.
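
For readers unfamiliar with the BLEU scores reported above, the following is a minimal sketch of how a sentence-level BLEU-n score can be computed: clipped n-gram precisions are combined by a geometric mean and scaled by a brevity penalty. It assumes a single reference report, uniform weights, and no smoothing. The abstract does not state which evaluation toolkit or settings the thesis used, so the function names (`ngrams`, `modified_precision`, `bleu`) and the example sentences are illustrative only and are not expected to reproduce the reported numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n):
    """Sentence-level BLEU-max_n: one reference, uniform weights, no smoothing
    (all of these are simplifying assumptions for illustration)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical generated and reference sentences, not taken from the dataset.
reference = "the heart size is normal".split()
candidate = "the heart is normal".split()
for n in range(1, 5):
    print(f"BLEU-{n} = {bleu(candidate, reference, n):.4f}")
```

In this toy example every generated word appears in the reference, so BLEU-1 is reduced only by the brevity penalty, while BLEU-3 and BLEU-4 drop to zero because no three-word sequence matches. The same effect explains why the reported scores decrease from BLEU-1 to BLEU-4.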

Subject Categories

Computer Sciences


Multimodal learning, chest X-ray, thorax abnormality, pre-trained Convolutional Neural Network (CNN), robust image feature extraction, Recurrent Neural Network (RNN), visual attention mechanism, coherent sentence generation

Number of Pages



University of South Dakota


