Neural Networks for Fashion Image Classification and Visual Search
DOI:
https://doi.org/10.70454/JRIST.020102

Keywords:
Vision Transformer, Fashion Image Classification, Spatial Attention, Visual Search, Retrieval, Attention

Abstract
Fashion image classification and visual search are core tasks in modern e-commerce and digital retail, enabling rapid product recommendation, discovery, and inventory management. Despite advances in deep learning, current convolutional neural network (CNN) approaches still struggle with class imbalance, overlapping categories, and fine-grained visual detail in fashion datasets. To address these problems, this study proposes Fashion ViT-SA, a hybrid neural network that combines a Vision Transformer (ViT-Base, Patch16) backbone with a spatial attention module. The model exploits the transformer's capacity to encode global context while using spatial attention to emphasize local clothing features, improving discriminative representation. We used the DeepFashion-MultiModal dataset and applied preprocessing steps including category filtering, label encoding, stratified data splitting, and both online and offline data augmentation to balance class distributions. Features extracted by Fashion ViT-SA were used both to classify items into seven fashion categories and to perform content-based visual search via an Annoy-inspired approximate nearest neighbour index. The model was trained with weighted cross-entropy loss, optimized with AdamW, and evaluated on accuracy, precision, recall, F1-score, and loss. Experimental results show that Fashion ViT-SA attains 83% accuracy, surpassing a baseline CNN by 13%, and delivers robust, real-time retrieval of visually similar products. The study underscores the promise of hybrid transformer-based architectures in fashion AI, merging classification precision with scalable visual search and thereby advancing both scholarly inquiry and practical e-commerce applications.
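As a rough illustration of the architecture described in the abstract, the following PyTorch sketch pairs a timm ViT-Base/16 backbone with a CBAM-style spatial attention module applied over the 14×14 patch grid. The module design, layer names, and hyperparameters here are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import timm  # assumed dependency providing the ViT-Base/16 backbone


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: pool across channels, then conv + sigmoid gate."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                      # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)      # channel-average map, (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)     # channel-max map, (B, 1, H, W)
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * gate                        # re-weight spatial locations


class FashionViTSA(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        # ViT-Base, patch size 16: a 224x224 input yields a 14x14 grid of 768-d tokens.
        self.backbone = timm.create_model(
            "vit_base_patch16_224", pretrained=True, num_classes=0
        )
        self.spatial_attn = SpatialAttention()
        self.head = nn.Linear(768, num_classes)

    def forward(self, x):
        tokens = self.backbone.forward_features(x)   # (B, 197, 768), incl. class token
        patches = tokens[:, 1:, :]                   # drop class token -> (B, 196, 768)
        grid = patches.transpose(1, 2).reshape(-1, 768, 14, 14)
        grid = self.spatial_attn(grid)               # highlight local garment regions
        feat = grid.flatten(2).mean(dim=2)           # global average pool -> (B, 768)
        return self.head(feat), feat                 # logits + embedding for retrieval
```

Returning the pooled embedding alongside the logits lets the same forward pass serve both the seven-way classifier and the retrieval index.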
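The training setup named in the abstract (weighted cross-entropy plus AdamW) could look like the sketch below; the class counts, learning rate, and weight decay are illustrative placeholders, and `train_loader` is a hypothetical DataLoader over the prepared splits.

```python
import torch

# Class weights inversely proportional to frequency, to offset class imbalance.
# These counts are illustrative, not the dataset's actual distribution.
class_counts = torch.tensor([4200.0, 3100.0, 2500.0, 1800.0, 1200.0, 900.0, 600.0])
weights = class_counts.sum() / (len(class_counts) * class_counts)

model = FashionViTSA(num_classes=7)
criterion = torch.nn.CrossEntropyLoss(weight=weights)   # weighted cross-entropy
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)

for images, labels in train_loader:   # train_loader: hypothetical DataLoader
    logits, _ = model(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```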
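For the visual-search side, an approximate nearest-neighbour index in the spirit of Annoy can be built over the extracted embeddings. This sketch uses the open-source `annoy` package; `embeddings` and `query_vec` are hypothetical tensors produced by the model above.

```python
from annoy import AnnoyIndex

dim = 768                                 # embedding size from the ViT backbone
index = AnnoyIndex(dim, "angular")        # angular metric ~ cosine similarity

# embeddings: (N, 768) tensor of catalogue features from model(...)[1]
for i, vec in enumerate(embeddings):
    index.add_item(i, vec.tolist())
index.build(50)                           # more trees -> better recall, larger index

neighbour_ids = index.get_nns_by_vector(query_vec.tolist(), 10)  # top-10 matches
```

Because the index is built offline and queried in sub-millisecond time per lookup, this design supports the real-time retrieval behaviour the abstract reports.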
License
Copyright (c) 2026 Kamal Kishor Rajak, Dr. Pharindra Kumar Sharma (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International License, permitting all use, distribution, and reproduction in any medium, provided the work is properly cited.