Use point-biserial to know a good or bad test question

This article on point-biserial calculations includes an excerpt from Dr. Jennifer Balogh’s book, A Practical Guide to Creating Quality Exams.

Knowing what makes a good or bad test question is essential for faculty developing exams for large class sections. What separates a good test question from a bad one is its ability to “discriminate” between students who have mastered the material and those who have not. The key is to look at who is getting the item correct and who is getting it wrong. If only the high-performing students are getting a question correct, it might be an indicator that the item is too hard. However, if students on the low end of the grading scale are getting the answer correct while high performers are getting it wrong, the question is probably poorly written.

One of the most widely accepted ways to evaluate an item is to calculate a correlation. The technical term for the correlation used in exam item analysis is the point-biserial. In a point-biserial correlation, test scores on a continuous scale are compared to a single item that has only two possible values: correct or incorrect. At a high level, what you are doing is correlating the response on a single question with the student’s overall score. The overall test score is a signal of whether the student is high-performing or low-performing. If an item is well written, students’ responses to it will correlate with their overall test scores. The following table provides an extreme example to illustrate how the point-biserial calculation works.

Table 1 – Example of How Responses to Items Correlate to Overall Scores

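| Student | Item 1 | Item 2 | Item 3 | Items 1 ... 50 | Total Score |
|---|---|---|---|---|---|
| Student 1 | Correct | Incorrect | Correct | ... | 50 |
| Student 2 | Correct | Incorrect | Correct | ... | 45 |
| Student 3 | Correct | Incorrect | Correct | ... | 45 |
| Student 4 | Correct | Incorrect | Correct | ... | 40 |
| Student 5 | Incorrect | Correct | Correct | ... | 35 |
| Student 6 | Incorrect | Correct | Correct | ... | 30 |
| Student 7 | Incorrect | Correct | Correct | ... | 30 |
| Student 8 | Incorrect | Correct | Correct | ... | 25 |
| Mean of student score | 45 | 30 | 37.5 | ... | 37.5 |
| Standard deviation | 30 | 45 | 0 | ... | 8.29 |
| P-value | 0.50 | 0.50 | 1.00 | ... | |
| Point biserial Rpbi | 0.91 | -0.91 | 0.00 | | |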

In this example, the students with the highest total scores answered question 1 correctly and got question 2 wrong. The entire class answered question 3 correctly. Although item 1 is unrealistically good, it is an example of a good item. Conversely, item 2 is a poorly written item, as the lowest-performing students answered the question correctly while the high-performing students got it wrong. Item 3 shows what happens when a question is too easy: because everyone answers it correctly, the item cannot discriminate between high and low performers.
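For readers who want to reproduce the numbers, here is a minimal Python sketch of the calculation. It assumes one common form of the point-biserial formula: the difference between the mean total score of students who answered correctly and the mean of those who answered incorrectly, divided by the population standard deviation of all total scores, times the square root of p × q, where p is the proportion answering correctly and q = 1 − p. The function name `point_biserial` and the 1/0 coding are illustrative choices, not something taken from the book. Item 2 is coded as the text describes it, answered correctly only by the four lowest scorers. Returning 0.0 when everyone gives the same answer is a convention adopted here to match the table; strictly speaking the correlation is undefined in that case.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_correct, total_scores):
    """Correlate a right/wrong item (coded 1/0) with total test scores."""
    n = len(total_scores)
    right = [s for c, s in zip(item_correct, total_scores) if c]
    wrong = [s for c, s in zip(item_correct, total_scores) if not c]
    sd = pstdev(total_scores)  # population standard deviation of all totals

    # Assumption: with no variation in the item or in the totals,
    # report 0.0 instead of an undefined correlation.
    if not right or not wrong or sd == 0:
        return 0.0

    p = len(right) / n  # proportion answering correctly (the item's p-value)
    q = 1 - p
    return (mean(right) - mean(wrong)) / sd * sqrt(p * q)


# Total scores and responses for the eight students in Table 1.
totals = [50, 45, 45, 40, 35, 30, 30, 25]
item1 = [1, 1, 1, 1, 0, 0, 0, 0]  # only the top four answer correctly
item2 = [0, 0, 0, 0, 1, 1, 1, 1]  # only the bottom four answer correctly
item3 = [1] * 8                   # everyone answers correctly

for name, item in [("Item 1", item1), ("Item 2", item2), ("Item 3", item3)]:
    print(name, round(point_biserial(item, totals), 2))
```

With these assumptions the sketch gives roughly 0.90 for item 1, -0.90 for item 2, and 0.00 for item 3. The first two are close to the 0.91 and -0.91 shown in the table; the small gap presumably comes from rounding or from a slightly different variant of the formula.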

A high point-biserial reflects the fact that the item is doing a good job of discriminating your high-performing students from your low-performing students. Values for point-biserial range from -1.00 to 1.00. Values of 0.15 or higher mean that the item is performing well (Varma, 2006). According to Varma, good items typically have a point-biserial exceeding 0.25. Items with incorrect keys will show point-biserials close to or below zero. As a rule of thumb, items with a point-biserial below 0.10 should be examined for a possible incorrect key.

Negative point-biserials signal a big problem. With this pattern, the high-performing students are getting the answer wrong, and the low- and/or mid-performing students are getting it right. Researchers have recommended removing items that have a negative point-biserial (Kaplan & Saccuzzo, 2013).
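The rules of thumb above can be turned into a simple review checklist. The sketch below is only an illustration: the function name `review_note` and the wording of each note are made up for this example, and the cutoffs are the ones quoted above (negative, below 0.10, 0.15, and 0.25).

```python
def review_note(r_pbi):
    """Map a point-biserial value to a suggested review action (illustrative)."""
    if r_pbi < 0:
        # Negative values: removal is recommended by Kaplan & Saccuzzo (2013).
        return "negative: consider removing the item"
    if r_pbi < 0.10:
        # Rule of thumb: examine for a possible incorrect key.
        return "below 0.10: check for an incorrect key"
    if r_pbi < 0.15:
        return "below 0.15: not yet performing well"
    if r_pbi <= 0.25:
        return "0.15 or higher: performing well"
    return "above 0.25: typical of a good item"


for value in (-0.91, 0.05, 0.12, 0.20, 0.91):
    print(value, "->", review_note(value))
```

A note like this is a prompt to look at the question, not a verdict; the item still needs a human read before it is rewritten or dropped.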

GradeHub helps teachers tell a good test question from a bad one with our Item Analysis and Item Matrix reports. The Item Analysis Report provides the point-biserial calculation for each question. Where an item’s point-biserial is <0.15, we highlight the line in the report. We also categorize and summarize potentially good or bad test questions by point-biserial in the Item Matrix Report as follows: “poor” <0.15, “fair” 0.15-0.30, and “good” >0.30.
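To make the cutoffs concrete, here is a small sketch that applies the same categories to a set of items and tallies them. It is not GradeHub's actual code; the names `item_matrix_category` and `needs_highlight` are hypothetical, and the example values are simply the three point-biserials from Table 1.

```python
from collections import Counter


def item_matrix_category(r_pbi):
    """Label an item using the cutoffs quoted above (illustrative only)."""
    if r_pbi < 0.15:
        return "poor"
    if r_pbi <= 0.30:
        return "fair"
    return "good"


def needs_highlight(r_pbi):
    """True when the item falls below the 0.15 highlight threshold."""
    return r_pbi < 0.15


# Point-biserials for the three items in Table 1.
example = {"Item 1": 0.91, "Item 2": -0.91, "Item 3": 0.00}

for name, r in example.items():
    flag = " (highlight)" if needs_highlight(r) else ""
    print(f"{name}: {item_matrix_category(r)}{flag}")

# Tally of items per category, similar in spirit to a summary report.
summary = Counter(item_matrix_category(r) for r in example.values())
print(dict(summary))
```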

The content of this post was provided courtesy of Jennifer Balogh, Ph.D. To learn more about point-biserial and other tips to create the best exams possible, pick up A Practical Guide to Creating Quality Exams by J. Balogh. Jennifer has been in the testing industry for over a decade and owns a consulting business dedicated to designing, developing, and accurately scoring tests.