by Karmen Garey, PharmD, PGY-1 Baptist Memorial Hospital – North Mississippi Pharmacy Resident, University of Mississippi School of Pharmacy
From the students’ perspective, once they hit “submit” after completing an exam, they think “Thank goodness that’s done!” For teachers, however, there is still some critical work to do. Now it’s time to review the performance data to ensure the examination was fair and measured what was intended. Here are a few tips and strategies to assess the quality of an exam.
Make certain the exam (as a whole) is a “good” one
Before the exam is administered to students, a good exam should be written with the following goals in mind:1,2
- An exam should address multiple levels of Bloom’s taxonomy — from knowledge recall to application and analysis.
- The exam should include a variety of questions that test a range of concepts that map back to the learning objectives.
- The consistency of the exam's performance over time is important. An exam should routinely perform the same from year to year despite some changes to the questions.
- An exam should measure the learning outcomes and course material it was designed to test.
Make certain the questions included on the exam are “good” ones
There are two types of questions that should be included on exams: mastery questions and discriminating questions. Mastery questions are those questions that students are expected to excel on.3 This type of question is typically a “knowledge level” question in Bloom’s taxonomy. The questions often test factual recall and the recognition of fundamental material.2 Students might call these “gimme questions”; however, teachers include them to ensure that students have a firm understanding of the basic but essential concepts or facts.
Discriminating questions, on the other hand, are intended to identify students who have a deeper knowledge of the material and to separate students into different performance levels (e.g., identify "A", "B", and "C" students). Higher-performing students are expected to answer these questions correctly more often than lower-performing students. This type of question often targets the comprehension, application, analysis, synthesis, or evaluation cognitive level in Bloom’s taxonomy. These questions require an in-depth knowledge of the subject matter.2
Next, let’s look at the distractors. Does each question include appropriate distractors?3 A distractor is an answer choice that, while wrong, appears plausible. A good distractor should be clear and concise and should be similar in structure and content to the correct response. Savvy test-takers have learned to spot answers that seem different in some way, so even small variations in the style, subject matter, and length of the answer choices can provide clues.
Next, is the question stem clearly written? Is it clear what the learner is being asked? Or is the question open to interpretation? When writing questions, it is important to ensure that the question cannot be misconstrued. Sometimes students will overthink a question and try to find a hidden meaning when there is none. To avoid this problem, use words that are unambiguous and avoid phrasing that could be cryptic.
Finally, is the answer to the question correctly keyed? If a lot of students selected the “wrong” answer, it's possible that the question was miskeyed. While this is not something that happens often, it does happen! So it is always a good idea to double-check that the correct answer was selected on the answer key.
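One way to screen for a possibly miskeyed question is to tally how often each answer choice was selected and flag any item where a non-keyed choice was the most popular. The sketch below is only an illustration: the function name `possible_miskeys`, the data layout, and the sample responses are all hypothetical and are not taken from any exam delivery tool.

```python
from collections import Counter


def possible_miskeys(responses, answer_key):
    """Return ids of items whose most-selected option is not the keyed answer.

    responses:  dict mapping item id -> list of options students selected
    answer_key: dict mapping item id -> keyed (correct) option
    Both structures are hypothetical; adapt them to your own export format.
    """
    flagged = []
    for item, choices in responses.items():
        if not choices:
            continue  # nobody answered this item, so there is nothing to tally
        counts = Counter(choices)
        most_selected, _ = counts.most_common(1)[0]
        if most_selected != answer_key[item]:
            flagged.append(item)
    return flagged


# Made-up responses for two items
responses = {
    "Q1": ["A", "A", "B", "A", "C", "A"],
    "Q2": ["D", "D", "B", "D", "D", "C"],
}
answer_key = {"Q1": "A", "Q2": "B"}

print(possible_miskeys(responses, answer_key))  # ['Q2']
```

A flagged item is not necessarily miskeyed; it may simply be a hard question or have an overly attractive distractor, so it still needs a manual look at the answer key.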
There are some other things to consider as you look at the post-exam performance data. How did the exam scores look last year? While a group of students performing much better or much worse than the previous year’s students is not always an indication that the exam is invalid, it should prompt additional questions.
- Was the material taught in a manner that was different from previous years?
- Was the exam formatted or delivered differently?
- Could the students this year have been less (or better) prepared in some way to comprehend the material?
- Is cheating suspected?
- If there are multiple instructors, did students receive different messages about the content?
The answers to these questions may not be obvious or even relevant, but they are worth keeping in mind.
Use the post-exam statistical analysis to identify problem questions3
As technology becomes a more integral part of exam delivery, it generates a wealth of data that can be used for post-exam quality assurance. Most post-exam statistical analysis tools report similar elements; however, the names may be slightly different. ExamSoft is among the most common exam delivery tools available today and routinely reports these statistics:
- Item Difficulty represents the difficulty of a question. It reports the proportion of students who answered the question correctly, on a scale from 0 to 1 (so 0.66 means 66% answered correctly). The lower the value, the more difficult the question. There is no single target value for item difficulty; instead, check that the value matches the intent behind the question. For example, if the teacher wants the item to be a mastery question, the difficulty should be 0.90 to 1.00, with very few students getting the question wrong. If the question is meant to separate those who have a firm grasp of the material from those who don’t, lower values are acceptable. An instructor may have a difficulty “cutoff” in mind, where anything below 0.6 (for example) prompts additional analysis of the question.
- Upper/Lower 27%, Discrimination Index, and Point Biserial are each calculated differently, but they report a similar concept. Stated simply, they all determine whether the top performers on the exam achieved better results on a question than those who did not perform well. If the top performers don’t outperform the poor performers, the question should be assessed to determine why.
  - Upper 27% / Lower 27%: the percentage of the top 27% vs. the bottom 27% of performers who got the question correct.
  - Discrimination Index: the difference in performance on the question between the best performers and the lowest performers.
  - Point Biserial: indicates whether answering a specific item correctly correlates with doing well on the exam overall. In other words, does performance on this question predict whether a student did well (or not so well) on the exam?
| Correlation with Overall Exam | Point Biserial |
| --- | --- |
| Very good | >0.3 |
| Good | 0.2-0.29 |
| Moderate | 0.09-0.19 |
| Poor | <0.09 |
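To make these statistics concrete, here is a minimal sketch of how they could be computed from raw results. It rests on several assumptions that may not match what ExamSoft or any other tool does internally: the function name `item_statistics` is hypothetical, the upper and lower groups are formed by ranking students on total score and taking 27% of the class (rounded) from each end, the discrimination index is taken as the upper-group proportion minus the lower-group proportion, and the point biserial is computed as the Pearson correlation between the 0/1 item score and the total exam score (with the item itself still included in the total). The sample data are made up.

```python
import math


def item_statistics(item_correct, total_scores, group_fraction=0.27):
    """Compute basic item statistics for one question.

    item_correct: list of 1 (correct) / 0 (incorrect), one entry per student
    total_scores: list of total exam scores, in the same student order
    """
    n = len(item_correct)
    if n != len(total_scores) or n < 2:
        raise ValueError("need matching lists with at least two students")

    # Item difficulty: proportion of students who answered correctly
    difficulty = sum(item_correct) / n

    # Rank students by total score (best first) and take each end of the list.
    # Students tied at a group boundary are split arbitrarily in this sketch.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    k = max(1, round(group_fraction * n))
    upper = sum(item_correct[i] for i in order[:k]) / k
    lower = sum(item_correct[i] for i in order[-k:]) / k

    # Point biserial as a Pearson correlation (population formulas)
    mean_total = sum(total_scores) / n
    cov = sum((x - difficulty) * (t - mean_total)
              for x, t in zip(item_correct, total_scores)) / n
    sd_item = math.sqrt(difficulty * (1 - difficulty))
    sd_total = math.sqrt(sum((t - mean_total) ** 2 for t in total_scores) / n)
    if sd_item == 0 or sd_total == 0:
        point_biserial = float("nan")  # undefined when there is no variation
    else:
        point_biserial = cov / (sd_item * sd_total)

    return {
        "difficulty": difficulty,
        "upper": upper,
        "lower": lower,
        "discrimination_index": upper - lower,
        "point_biserial": point_biserial,
    }


# Made-up results for ten students on one question
total_scores = [95, 90, 88, 85, 80, 75, 70, 65, 60, 50]
item_correct = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]

stats = item_statistics(item_correct, total_scores)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

With only ten students, each group holds three students, so the numbers are very coarse; real classes give more stable estimates.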
So, let’s look at the statistical analysis from two example questions.
- This was a mastery question — students are expected to do well on this question. It’s a fundamental concept that all students should know.
- The Discrimination Index = 0.04, which indicates almost no discrimination between the top and bottom performers. Because it’s a mastery question, we expected all students to perform well on it, so we don’t expect it to discriminate between the best and worst performers.
- The Point Biserial = 0.10, indicating that this question only moderately correlates with doing well on the exam overall. Again, the top and bottom performers performed quite similarly on this question, so there won’t be a strong correlation between the performance on this question and the overall exam.
- If this question was not intended to be a mastery question, perhaps the material was taught particularly well … or maybe there was cheating involved.
Now let’s take a look at a question where only 66% of the students selected the correct response.
- Item difficulty = 0.66 so 66% of the students selected the correct response. This is not a bad thing but it is important to make sure the students who understood the material were more likely to get this question right.
- This is intended to be a discriminating question, so let’s make certain it’s actually discriminating between the best and worst performers.
- Look at the Upper vs. Lower 27%: 82% of the top performers got this question correct. Only 46% of those who performed the poorest on this exam got this question correct.
- Discrimination Index: 0.36. This question did a good job discriminating between the best and worst performers on this exam.
- Point Biserial = 0.28. Performance on this question has a good correlation with the student’s overall exam performance.
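The numbers in this example can be checked with a few lines of arithmetic. The sketch below assumes the discrimination index is the upper-27% proportion minus the lower-27% proportion, which matches the values reported above, and it encodes the point biserial bands from the table earlier in this post. The function name `correlation_band` is hypothetical, and how to treat values that fall exactly on a boundary or between bands (such as 0.295 or exactly 0.3) is an assumption, since the table does not say.

```python
def correlation_band(point_biserial):
    """Map a point biserial value to the bands in the table above.

    Assumption: values of 0.3 and higher count as "Very good", and values
    between 0.29 and 0.3 count as "Good"; the table leaves these edges open.
    """
    if point_biserial >= 0.3:
        return "Very good"
    if point_biserial >= 0.2:
        return "Good"
    if point_biserial >= 0.09:
        return "Moderate"
    return "Poor"


# Second example question: 82% of the upper group and 46% of the lower group
upper, lower = 0.82, 0.46
discrimination_index = round(upper - lower, 2)

print(discrimination_index)    # 0.36
print(correlation_band(0.28))  # Good     (second example)
print(correlation_band(0.10))  # Moderate (first example)
```

Writing the bands down as a function is one way to fix the cut-offs before looking at the results, so that every question on every exam is judged against the same thresholds.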
While there are no hard rules for how to analyze an examination, the strategies I’ve outlined in this blog post are some of the best practices every teacher should follow. It is important to follow a systematic process and establish “cut-offs” in advance. The key is to be clear and consistent from exam to exam.
References
1. Brame C. Writing Good Multiple Choice Test Questions. 2013. Accessed December 3, 2020.
2. Omar N, Haris SS, Hassan R, Arshad H, Rahmat M, Zainal NFA, et al. Automated Analysis of Exam Questions According to Bloom's Taxonomy. Procedia - Social and Behavioral Sciences. 2012;59:297–303. Accessed December 1, 2020.
3. Ermie E. Psychometrics 101: Know What Your Assessment Data Is Telling You. ExamSoft. 2015. Accessed November 18, 2020.