Beyond Text: Exploring Multimodal BERT Models


Sarika Kondra, Vijay Raghavan, Vijay Kumar Adari

Abstract

This paper explores the burgeoning potential of Bidirectional Encoder Representations from Transformers (BERT) for multimodal tasks. BERT’s ability to capture contextual relationships in text allows it to build richer representations of data that extend beyond language alone.


The paper examines how BERT can be integrated with visual and auditory information for applications in video analysis, image analysis, and audio processing. Several integration strategies are discussed, including early fusion and multimodal transformers, in which BERT collaborates with models specialized in other modalities to reach a deeper understanding of the content.
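
As a concrete illustration of the early-fusion strategy, the minimal PyTorch sketch below (using the Hugging Face transformers library; the class name, feature dimension, and overall design are illustrative assumptions rather than the method of any particular cited model) projects a precomputed image feature vector into BERT’s embedding space and prepends it to the token embeddings before joint encoding:

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class EarlyFusionClassifier(nn.Module):
        # Hypothetical early-fusion model: project visual features into
        # BERT's embedding space and prepend them as an extra "token".
        def __init__(self, visual_dim=2048, num_labels=2):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            hidden = self.bert.config.hidden_size          # 768 for bert-base
            self.visual_proj = nn.Linear(visual_dim, hidden)
            self.classifier = nn.Linear(hidden, num_labels)

        def forward(self, input_ids, attention_mask, visual_feats):
            tok_emb = self.bert.embeddings.word_embeddings(input_ids)
            vis_emb = self.visual_proj(visual_feats).unsqueeze(1)
            fused = torch.cat([vis_emb, tok_emb], dim=1)   # fuse before encoding
            mask = torch.cat([torch.ones_like(attention_mask[:, :1]),
                              attention_mask], dim=1)      # cover the visual token
            out = self.bert(inputs_embeds=fused, attention_mask=mask)
            return self.classifier(out.last_hidden_state[:, 0])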


BERT’s capabilities extend to audio data as well. It can be employed for tasks such as speech recognition, where it improves word-prediction accuracy by leveraging its understanding of language context. BERT also holds promise for sentiment analysis of audio, enabling the analysis of emotional tone and speaker intent.
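
A common way to apply this language-context advantage to speech recognition is N-best rescoring: the acoustic model proposes candidate transcripts, and BERT re-ranks them by a pseudo-log-likelihood computed by masking one token at a time, in the spirit of Shin et al. [37]. The sketch below is a minimal illustration assuming the Hugging Face transformers library; the candidate list is invented:

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

    def pseudo_log_likelihood(sentence):
        # Sum the log-probability of each token when it alone is masked.
        ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
        total = 0.0
        for i in range(1, ids.size(0) - 1):        # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
        return total

    # Invented N-best list from a hypothetical acoustic model:
    candidates = ["recognize speech", "wreck a nice beach"]
    best = max(candidates, key=pseudo_log_likelihood)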


Furthermore, combining BERT with Graph Neural Networks (GNNs) has produced promising results on tasks involving relational data and text, as seen in recent work such as Graph-BERT.
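
One simple BERT-plus-GNN pattern uses BERT sentence embeddings as node features and lets a graph convolution propagate them along the edges; this generic recipe is distinct from Graph-BERT’s attention-only architecture. A minimal sketch, assuming the transformers and torch_geometric packages and an invented two-node graph:

    import torch
    from transformers import BertTokenizer, BertModel
    from torch_geometric.nn import GCNConv

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()

    def embed(texts):
        # BERT [CLS] vectors as node features (encoder kept frozen).
        batch = tokenizer(texts, padding=True, return_tensors="pt")
        with torch.no_grad():
            return bert(**batch).last_hidden_state[:, 0]   # (num_nodes, 768)

    # Invented two-node graph: textual nodes joined by one undirected edge.
    x = embed(["a paper about BERT", "a paper about GNNs"])
    edge_index = torch.tensor([[0, 1], [1, 0]])            # edge in both directions

    conv = GCNConv(in_channels=768, out_channels=64)
    node_repr = conv(x, edge_index)   # node vectors informed by graph structure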


The paper also highlights the potential of multilingual BERT (mBERT) for tasks spanning multiple languages. The versatility of mBERT and related multilingual models enables cross-lingual transfer and applications beyond any single language.
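
For instance, because mBERT shares a single vocabulary and encoder across its roughly 104 pre-training languages, sentences from different languages are mapped into one representation space, which is what makes zero-shot cross-lingual transfer possible. A minimal sketch with illustrative sentences, assuming the Hugging Face transformers library:

    import torch
    from transformers import BertTokenizer, BertModel

    # One shared checkpoint covers roughly 104 languages.
    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = BertModel.from_pretrained("bert-base-multilingual-cased").eval()

    sentences = ["The movie was wonderful.",       # English
                 "La película fue maravillosa."]   # Spanish
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        cls = model(**batch).last_hidden_state[:, 0]

    # Both sentences land in one embedding space, so a classifier head
    # trained on English data can be applied to Spanish inputs unchanged.
    similarity = torch.cosine_similarity(cls[0], cls[1], dim=0)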


In conclusion, BERT’s versatility in multimodal tasks opens new possibilities for data interaction and understanding. 


How to Cite

Beyond Text: Exploring Multimodal BERT Models. (2025). International Journal of Research Publications in Engineering, Technology and Management (IJRPETM), 8(1), 11764-11769. https://doi.org/10.15662/IJRPETM.2025.0801003

References

1. Sun, Chen, et al. “VideoBERT: A Joint Model for Video and Language Representation Learning.” Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.

2. Su, Weijie, et al. “VL-BERT: Pre-training of Generic Visual-Linguistic Representations.” arXiv preprint arXiv:1908.08530 (2019).

3. Shen, T., et al. “BERT-Based Denoising and Reconstructing Data of Distant Supervision for Relation Extraction.” CCKS2019 Shared Task (2019).

4. Lu, Jiasen, et al. “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.” Advances in Neural Information Processing Systems 32 (2019).

5. Tan, Hao, and Mohit Bansal. “LXMERT: Learning Cross-Modality Encoder Representations from Transformers.” arXiv preprint arXiv:1908.07490 (2019).

6. Goyal, Yash, et al. “Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

7. Hudson, Drew A., and Christopher D. Manning. “GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

8. Li, Liunian Harold, et al. “VisualBERT: A Simple and Performant Baseline for Vision and Language.” arXiv preprint arXiv:1908.03557 (2019).

9. Alberti, Chris, et al. “Fusion of Detected Objects in Text for Visual Question Answering.” arXiv preprint arXiv:1908.05054 (2019).

10. Li, Gen, et al. “Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020.

11. Wang, Yansen, et al. “Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019.

12. Qi, Di, et al. “ImageBERT: Cross-Modal Pre-Training with Large-Scale Weak-Supervised Image-Text Data.” arXiv preprint arXiv:2001.07966 (2020).

13. Shang, Junyuan, et al. “Pre-Training of Graph Augmented Transformers for Medication Recommendation.” arXiv preprint arXiv:1906.00346 (2019).

14. Zhang, Jiawei, et al. “Graph-BERT: Only Attention Is Needed for Learning Graph Representations.” arXiv preprint arXiv:2001.05140 (2020).

15. Devlin, Jacob, et al. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805 (2018).

16. Koroteev, M. V. “BERT: A Review of Applications in Natural Language Processing and Understanding.” arXiv preprint arXiv:2103.11943 (2021).

17. Yang, Kaicheng, Hua Xu, and Kai Gao. “CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis.” Proceedings of the 28th ACM International Conference on Multimedia, 2020.

18. He, Jiaxuan, and Haifeng Hu. “MF-BERT: Multimodal Fusion in Pre-Trained BERT for Sentiment Analysis.” IEEE Signal Processing Letters 29 (2021): 454–458.

19. Liu, Xubo, et al. “Leveraging Pre-Trained BERT for Audio Captioning.” 2022 30th European Signal Processing Conference (EUSIPCO). IEEE, 2022.

20. Martín-Morató, Irene, and Annamaria Mesaros. “Diversity and Bias in Audio Captioning Datasets.” (2021).

21. Liang, Yuxuan, et al. “BERT-Enhanced Text Graph Neural Network for Classification.” Applied Sciences 12.7 (2022): 3322.

22. Wang, Yuxuan, et al. “Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing.” arXiv preprint arXiv:1909.06775 (2019).

23. Pires, Telmo, Eva Schlinger, and Dan Garrette. “How Multilingual Is Multilingual BERT?” arXiv preprint arXiv:1906.01502 (2019).

24. Conneau, Alexis, et al. “Unsupervised Cross-Lingual Representation Learning at Scale.” arXiv preprint arXiv:1911.02116 (2019).

25. Baptista, R., et al. “Universal Speech Tagging with Multilingual BERT.” Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, 2021, pp. 1563–1573.

26. Virtanen, Antti, et al. “Multilingual Is Not Enough: BERT for Finnish.” arXiv preprint arXiv:1912.07076 (2019).

27. Gutmann, Michael, and Aapo Hyvärinen. “Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models.” Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2010.

28. Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (2017).

29. Rahman, Wasifur, et al. “M-BERT: Injecting Multimodal Information in the BERT Structure.” arXiv preprint arXiv:1908.05787 (2019).

30. Zadeh, Amir, et al. “Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages.” IEEE Intelligent Systems 31.6 (2016): 82–88.

31. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neural Network.” arXiv preprint arXiv:1503.02531 (2015).

32. Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. “Reducing the Dimensionality of Data with Neural Networks.” Science 313.5786 (2006): 504–507.

33. Linna, Nathaniel, and Charles E. Kahn Jr. “Applications of Natural Language Processing in Radiology: A Systematic Review.” International Journal of Medical Informatics 163 (2022): 104779.

34. Korzack, M. S., and A. Weller. “Political Speech Act Classification Using Pre-Trained BERT.” Proceedings of the 23rd International Conference on Speech and Computer. Springer, 2020, pp. 516–527.

35. Chi, Po-Han, et al. “Audio ALBERT: A Lite BERT for Self-Supervised Learning of Audio Representation.” 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021.

36. Huang, Wen-Chin, et al. “Speech Recognition by Simply Fine-Tuning BERT.” ICASSP 2021 – IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

37. Shin, Joonbo, Yoonhyung Lee, and Kyomin Jung. “Effective Sentence Scoring Method Using BERT for Speech Recognition.” Asian Conference on Machine Learning. PMLR, 2019.

38. Miyazaki, Koichi, et al. “Weakly-Supervised Sound Event Detection with Self-Attention.” ICASSP 2020 – IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.

39. Ling, Shaoshi, et al. “BERTphone: Phonetically-Aware Encoder Representations for Utterance-Level Speaker and Language Recognition.” arXiv preprint arXiv:1907.00457 (2019).

40. Gu, Jia-Chen, et al. “MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding.” arXiv preprint arXiv:2106.01541 (2021).

41. Liu, Yinhan, et al. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv preprint arXiv:1907.11692 (2019).

42. Wang, Zihan, Stephen Mayhew, and Dan Roth. “Extending Multilingual BERT to Low-Resource Languages.” arXiv preprint arXiv:2004.13640 (2020).

43. Ulčar, Matej, and Marko Robnik-Šikonja. “FinEst BERT and CroSloEngual BERT: Less Is More in Multilingual Models.” Text, Speech, and Dialogue: 23rd International Conference, TSD 2020, Brno, Czech Republic, September 8–11, 2020, Proceedings. Springer International Publishing, 2020.

44. Khanuja, Simran, et al. “MuRIL: Multilingual Representations for Indian Languages.” arXiv preprint arXiv:2103.10730 (2021).