Feature Selection for Automatic Answer Extraction from Online Web Forums
DOI:
https://doi.org/10.70454/JRIST.020101
Keywords:
Answer Quality Classification, Multi-Task BERT, Feature Selection, Online Web Forums, Deep Learning
Abstract
This research proposes an efficient method for automatic answer quality classification in online web forums using a multi-task BERT-based deep learning framework. The primary objective is to accurately categorize user responses into low-, medium-, and high-quality classes by leveraging advanced language representations and relevant content features. The methodology begins with data collection from the Stack Overflow forum and proceeds through comprehensive preprocessing, exploratory data analysis, and the extraction of syntactic and semantic features. The extracted features include sentence length, punctuation use, the presence of code snippets, TF-IDF vectors, and contextual embeddings obtained with spaCy. To reduce redundancy and improve the quality of the model input, we applied feature selection using recursive feature elimination with cross-validation (RFECV) and principal component analysis (PCA). The model is built on a multi-task BERT architecture with two output layers, which handles the main classification task and auxiliary tasks simultaneously and is intended to improve generalisation. After seven training epochs, the model achieved 99.54% training accuracy and 92.13% validation accuracy, with precision, recall, and F1-score each at 92.13%. Inference is also fast: the model typically takes only 0.0071 milliseconds per sample. These results support the model's robustness, efficiency, and suitability for real-time use in large-scale community-driven QA platforms.
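As an illustration of the surface features named in the abstract (sentence length, punctuation use, code snippets), the following is a minimal sketch, not the authors' implementation. The function name `syntactic_features`, the returned keys, and the assumption that code appears inside `<code>` tags or triple-backtick fences are all hypothetical choices made for this example.

```python
import re
import string

# Assumption: code snippets are wrapped in <code>...</code> tags or
# triple-backtick fences; the paper does not specify its detection rule.
CODE_BLOCK = re.compile(r"<code>.*?</code>|```.*?```", re.DOTALL)
# A sentence ends at ., ! or ? followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")


def syntactic_features(answer: str) -> dict:
    """Return simple surface features for one forum answer (hypothetical helper)."""
    code_blocks = CODE_BLOCK.findall(answer)
    # Remove code so it does not distort the prose statistics.
    prose = CODE_BLOCK.sub(" ", answer)
    sentences = [s for s in SENTENCE_END.split(prose) if s.strip()]
    words = prose.split()
    punctuation = sum(ch in string.punctuation for ch in prose)
    return {
        "n_sentences": len(sentences),
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
        "n_punctuation": punctuation,
        "n_code_blocks": len(code_blocks),
        "has_code": bool(code_blocks),
    }


if __name__ == "__main__":
    sample = "Use a list comprehension. It is faster!\n<code>[x * 2 for x in xs]</code>"
    print(syntactic_features(sample))
```

Features of this kind would then be combined with the TF-IDF vectors and embeddings before the RFECV and PCA steps; the exact feature set and ordering used in the study are not stated in the abstract.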
License
Copyright (c) 2026 Hemant Sharma, Shyamol Banerjee (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits all use, distribution, and reproduction in any medium, provided the work is properly cited.









