Evaluating AI Tutor Interaction with Ambiguous Junior High Mathematics Questions Using Black-Box Testing
DOI: https://doi.org/10.31004/riggs.v5i1.6363

Keywords: Human-AI Interaction, Generative AI Tutor, Ambiguity, Junior High School Mathematics, Black-Box

Abstract
The growth of generative artificial intelligence as a mathematics tutoring tool presents new possibilities for assisting student learning. Successful learning interactions, however, require AI tutors to handle ambiguous student questions appropriately, particularly in junior high school mathematics, where questions are frequently incomplete, conceptually vague, or detached from their surrounding context. This paper compares the interaction quality of two popular AI tutors, ChatGPT and Gemini, when answering ambiguous mathematics questions at the junior high school level. The approach used was black-box testing, in which only observable input-output behavior was examined, without access to internal model mechanisms. A sample of 50 ambiguous mathematics scenarios was designed around five types of ambiguity: incomplete information, conceptual ambiguity, output format ambiguity, missing context, and contradictory information. Each AI tutor was tested once on each scenario, yielding a total of 100 dialogue interactions. Two independent raters evaluated all interactions using a Human-AI Interaction rubric covering ambiguity detection, relevance of clarification, transparency of assumptions, quality of interaction, and quality of solution. The findings show that both systems can identify ambiguity, but Gemini exhibits higher clarification rates and more pedagogically suitable interaction patterns than ChatGPT, especially in scenarios involving incomplete information and missing context. The results demonstrate the significance of clarification behavior and interaction design in AI-based tutoring systems and suggest how AI tutors can be used responsibly in junior high school mathematics education.
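The evaluation design described above, two raters scoring 100 dialogue interactions (50 scenarios x 2 tutors) on five rubric dimensions and then comparing per-tutor averages, can be sketched as follows. This is a hypothetical illustration, not the study's actual code or data: the dimension names mirror the rubric in the abstract, while the record layout, function names, and toy scores are assumptions.

```python
# Hypothetical sketch of the rubric-score aggregation described in the
# abstract: each rated interaction carries a tutor name and a score per
# rubric dimension; scores are averaged per tutor and per dimension.
from statistics import mean

# The five rubric dimensions named in the abstract.
DIMENSIONS = [
    "ambiguity_detection",
    "clarification_relevance",
    "assumption_transparency",
    "interaction_quality",
    "solution_quality",
]

def aggregate(ratings):
    """ratings: list of dicts like
        {"tutor": "Gemini", "rater": 1, "scores": {dimension: int, ...}}
    Returns {tutor: {dimension: mean score across interactions and raters}}.
    """
    by_tutor = {}
    for r in ratings:
        bucket = by_tutor.setdefault(r["tutor"], {d: [] for d in DIMENSIONS})
        for d in DIMENSIONS:
            bucket[d].append(r["scores"][d])
    return {t: {d: mean(vals) for d, vals in dims.items()}
            for t, dims in by_tutor.items()}

# Toy example: one scored interaction per tutor (illustrative values only).
sample = [
    {"tutor": "Gemini", "rater": 1, "scores": {d: 4 for d in DIMENSIONS}},
    {"tutor": "ChatGPT", "rater": 1, "scores": {d: 3 for d in DIMENSIONS}},
]
summary = aggregate(sample)
```

In the study's setup, `ratings` would hold 200 records (100 interactions x 2 raters), and the resulting per-dimension means would support the reported comparison between the two tutors.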
Copyright (c) 2026 Khoirul Islam, Nisfu Laili Saidah, Saifudin Yahya, Muhammad Miftakhul Syaikhuddin

This work is licensed under a Creative Commons Attribution 4.0 International License.
