Have you ever wondered what AI chatbots are terrible at? Distinguished professor Srđan Verbić, a member of the consortium representing Univerzitet Metropolitan, explores this question in his newest blog post, inspired by the work carried out within the AI4VET4AI project.
We bring you his findings and insights below.
I’ve written, reviewed and analysed thousands of multiple-choice questions (MCQs) by now. If nothing else, I know how difficult it is to craft a good MCQ, especially one that aims to assess more than just factual knowledge. I used to imagine that artificial intelligence would help researchers develop better tools for measuring knowledge and skills. Today, however, people use AI chatbots in a more banal way: to automate the production of carelessly designed MCQs. They’re trying to use machines, which basically mimic humans, to do a job that humans have never excelled at. It won’t work that way.
Why Are AI-Generated MCQs So Weak?
- AI’s Tendency to Focus on Surface-Level Details
AI chatbots excel at text pattern recognition. When you give one a set of sentences, it tends to generate questions about details, such as names, dates or explicit statements, rather than about underlying concepts or broader relations. For example, instead of asking, “Why does water’s polarity make it essential for biological processes?” you’ll get the question “Which of these molecules is polar?” This focus on content details limits the assessment to recalling trivial facts, not understanding.
- Stuck at the Bottom of Bloom’s Taxonomy
Most AI-generated MCQs test only the lowest cognitive levels: remembering and, occasionally, understanding. Higher-order skills (application, analysis, synthesis and evaluation) are rarely addressed. Generating such questions requires a superior approach: a full understanding and appreciation of the learning process, as well as abstraction and substantial creativity. It is very difficult to mimic such a process when even human question authors are rarely skilled at writing higher-level questions. This is why most AI-generated questions start with “which.”
- Obvious Correct Answers
There is a common flaw with MCQs: the correct answer is almost always the longest or most detailed alternative. Why? AI, just like humans, tends to make correct answers precise and complete, while distractors (the incorrect alternatives) end up vague, short or implausible. Students somehow know this, and they instinctively select the longest alternative. This makes guessing easy and undermines both the assessment’s validity and reliability. (A quick automated check for this flaw is sketched after this list of weaknesses.)
- Poor Distractor Quality
Whoever creates MCQs should generate wrong alternatives (distractors) that sound reasonable to someone who doesn’t fully understand the concept but are implausible to someone who does. AI struggles here far more than human question writers do. It often produces distractors that are obviously irrelevant or wrong. Such distractors are trivial to eliminate, which makes the question too easy and useless. Skilled human writers draw on students’ common misconceptions to formulate distractors that genuinely test understanding. AI doesn’t know much about these misconceptions, mostly because we haven’t documented them well enough.
- Lack of Focus on Learning Outcomes
People design questions with specific learning outcomes in mind. Quite often they fail. The problem is that no one knows for sure which outcome best corresponds to a specific question. You need a panel of experts to agree on that, and even they might be unsure. If we don’t have clear criteria for matching questions to outcomes, how would machines? The situation becomes even worse when you push artificial intelligence to create multiple-choice questions in which all alternatives should target the same outcome. That way, AI generates questions and distractors that are not valid for the intended outcomes.
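The “longest answer wins” pattern mentioned above is easy to screen for once your questions are in a structured form. Below is a minimal sketch of such a check in Python; the item format (a dict with an "options" list and a "correct" index) is an assumption for illustration, not any standard question-bank schema, so adapt it to whatever structure you actually use.

```python
# A minimal sketch of a "longest-answer" screen for a batch of MCQs.
# The item structure used here is an illustrative assumption, not a standard format.

def flag_longest_answer_items(items):
    """Return items whose keyed (correct) option is the unique longest alternative."""
    flagged = []
    for item in items:
        options = item["options"]      # list of answer texts
        correct = item["correct"]      # index of the keyed answer
        lengths = [len(opt) for opt in options]
        longest = max(lengths)
        if lengths[correct] == longest and lengths.count(longest) == 1:
            flagged.append(item)
    return flagged

questions = [
    {
        "stem": "Which of these molecules is polar?",
        "options": [
            "Water, because its bent shape and electronegative oxygen create a permanent dipole",
            "Methane",
            "Nitrogen gas",
            "Carbon dioxide",
        ],
        "correct": 0,
    },
]

for item in flag_longest_answer_items(questions):
    print("Review distractor lengths:", item["stem"])
```

Items the check flags are not necessarily bad, but they deserve a second look before they reach students.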
How Can We Improve the Quality of AI-assisted MCQs?
- Don’t waste your time correcting and improving bad MCQs generated by AI. If you get a bad question, improving some of the alternatives won’t help much. Write the question yourself, keeping in mind which learning outcome you want to assess, and then ask the chatbot to brainstorm the correct answer and distractors. Even if it gets them wrong, that can help you identify “AI misconceptions” and write somewhat better distractors.
- Ask the AI chatbot to rewrite distractors to make them similar in form and length. You should pay attention to the plausibility of each alternative, but let AI explore different constructions and wording. That is something AI does well. (A prompt sketch combining this tip with the previous one follows this list.)
- Let AI create a good stimulus for your MCQ. Stimulus for a question can be a short story describing a specific situation, a table with data which could be analysed, a process diagram or a short video clip. With a stimulus, you open many more opportunities for asking meaningful and relevant questions. Generative AI likes to make up stories and draw illustrations. It should be used. Just imagine the possibility of making a short podcast on some topic and then asking a few questions about it. Great, isn’t it?
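To make the first two tips concrete, here is a minimal prompt-template sketch in Python: you write the stem and the correct answer yourself, then ask a chatbot to brainstorm distractors and even out the alternatives in form and length. The wording of the prompt and the sample learning outcome are illustrative assumptions, and sending the text to any particular chatbot API is deliberately left out.

```python
# A minimal sketch of a prompt template for the human-first workflow above.
# The prompt wording and example content are assumptions; paste or send the
# resulting text to whichever chatbot you use.

DISTRACTOR_PROMPT = """\
Learning outcome: {outcome}

Question stem: {stem}
Correct answer: {key}

1. Suggest four plausible distractors based on mistakes a student who has not
   mastered this outcome might make.
2. Then rewrite all alternatives (including the correct one) so they are similar
   in grammatical form and length, without changing which one is correct.
"""

prompt = DISTRACTOR_PROMPT.format(
    outcome="Explain why water's polarity matters for biological processes",
    stem="Why does water dissolve ionic compounds such as table salt?",
    key="Its polar molecules surround and separate the individual ions",
)

print(prompt)
```

You still judge the plausibility of every alternative yourself; the chatbot only supplies raw material and evens out the wording.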
Final Thoughts
There are many tasks where you can rely on AI chatbots. Writing multiple-choice questions is still not one of them. AI can help, but you need to come up with a good question for a concrete learning outcome. Ideally, you should combine human expertise with AI’s ability to generate content. Keep in mind that creating MCQs for higher-order competences might be almost impossible. If you can’t do it well, consider using open-ended questions. The administration and scoring of responses won’t be automated that way, but not everything needs to be automated these days, right? Especially not in education.