Document Type: Original Article

Authors

Golestan University

Abstract

In recent years, Visual Question Answering (VQA), an interdisciplinary problem that integrates computer vision and natural language processing (NLP), has become one of the most widely studied areas in both fields.

Important challenges in this field include the need for large, suitable datasets and for powerful hardware to train the models. Key factors in improving the performance of these models are the choice of neural network for processing each input, the choice of dataset, and the method of fusing the features extracted from the inputs; incorporating attention mechanisms can further enhance the overall performance of VQA systems. In these systems, different neural networks process the two inputs: convolutional neural networks (CNNs) with various architectures extract image features, while recurrent neural networks (RNNs) of different types process the question text.
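As a rough illustration of this generic pipeline (not the implementation studied in this paper), the sketch below combines a small CNN image encoder, an LSTM question encoder, and element-wise feature fusion over a fixed answer vocabulary; all layer sizes, names, and the fusion choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleVQABaseline(nn.Module):
    """Generic CNN + RNN VQA baseline: image and question features are
    fused element-wise and classified over a fixed answer vocabulary."""

    def __init__(self, vocab_size=10000, embed_dim=300,
                 hidden_dim=1024, num_answers=1000):
        super().__init__()
        # Image branch: a small CNN stands in for the deeper architectures
        # (e.g. VGG/ResNet feature extractors) used in typical VQA systems.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                      # -> (B, 128, 1, 1)
        )
        self.img_fc = nn.Linear(128, hidden_dim)

        # Question branch: word embeddings followed by an LSTM encoder.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

        # Fusion and answer classifier.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_ids):
        img = self.img_fc(self.cnn(image).flatten(1))      # (B, H)
        _, (h_n, _) = self.lstm(self.embed(question_ids))  # h_n: (1, B, H)
        fused = img * h_n.squeeze(0)                       # element-wise fusion
        return self.classifier(fused)                      # answer logits


# Toy usage with random tensors: a batch of 2 images and 2 tokenized questions.
model = SimpleVQABaseline()
logits = model(torch.randn(2, 3, 224, 224),
               torch.randint(1, 10000, (2, 12)))
print(logits.shape)   # torch.Size([2, 1000])
```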

In this research, the architecture of the convolutional neural network is modified, a self-attention mechanism is applied in text processing, and the Skip-gram language model is used to embed the input text. The performance of the proposed model is evaluated on the VQA 1.0 and VQA 2.0 datasets. The results show that the proposed model raises the overall accuracy to 67.25% on VQA 1.0 and 61.57% on VQA 2.0, a significant improvement over the baseline models.
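A minimal sketch of the text branch described above, assuming pre-trained Skip-gram (word2vec) vectors are loaded into a frozen embedding layer and followed by a single multi-head self-attention layer with mean pooling; the dimensions, number of heads, and pooling choice are assumptions, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class SelfAttentionQuestionEncoder(nn.Module):
    """Question encoder: frozen Skip-gram word vectors followed by one
    multi-head self-attention layer, mean-pooled into a question vector."""

    def __init__(self, skipgram_vectors, num_heads=4):
        super().__init__()
        # skipgram_vectors: (vocab_size, embed_dim) tensor of pre-trained
        # Skip-gram embeddings; kept frozen in this sketch.
        self.embed = nn.Embedding.from_pretrained(skipgram_vectors, freeze=True)
        embed_dim = skipgram_vectors.size(1)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, question_ids):
        x = self.embed(question_ids)          # (B, T, D) word embeddings
        attended, _ = self.attn(x, x, x)      # self-attention over the words
        return attended.mean(dim=1)           # (B, D) question representation


# Toy usage: random vectors stand in for real Skip-gram embeddings.
vectors = torch.randn(10000, 300)
encoder = SelfAttentionQuestionEncoder(vectors)
q_feat = encoder(torch.randint(0, 10000, (2, 12)))
print(q_feat.shape)   # torch.Size([2, 300])
```

The resulting question vector can then be fused with the CNN image features and classified, as in the generic baseline sketched earlier.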
