Mark: Jennifer, I think most educators have heard of item analysis, but I often see a lot of confusion, even among the administrators who run testing centers and the instructors who administer exams. Can you give us some general guidelines on how to interpret these statistics?
Jennifer Balogh: Sure, what you want to look for, in terms of your items, is how difficult an item is and how well the item can discriminate between a high-performing student and a low-performing student.
What you want to look for … is how difficult an item is and how well the item can discriminate between a high-performing student and a low-performing student.
By that I mean someone who has mastered the material compared to someone who has not. GradeHub provides some great statistics that let you see, at a glance, whether an item is functioning well in terms of discriminating high performers from low performers. The idea is that you want students who have mastered the material to get an item correct, and you want students who have not understood the content to get it wrong. So, the ideal situation for an item is when high performers get it correct and low performers can’t.
The idea is you want students who are doing well in mastering the material to get an item correct and you want students who are not mastering the content to get an item wrong.
Mark: So, like our previous installments, can we go through the item statistics here in a GradeHub report to see how you would interpret this report?
Jennifer: Of course, and I have to say I think this is my favorite tab. I love the information found here. It’s showing you the measures of discrimination, and I know that when I use that word, it’s a very overloaded term. But what I’m talking about, and what is discussed in the testing literature, is an item’s ability to discriminate among test takers based on their knowledge and skills.
One of the most accepted ways of evaluating an item is to calculate a correlation. The technical term for this kind of correlation is a point biserial. Instead of correlating two sets of values that are both on a continuous scale, you are correlating one thing on a continuous scale (test scores) with one thing that has only two possible values: correct or incorrect (items). At a high level, what you are doing is correlating the responses on a single question with the students’ overall test scores. The overall test score is an indicator of whether a student is high-performing or low-performing.
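To make the idea concrete, here is a minimal sketch in plain Python (not GradeHub’s actual implementation) of the point biserial, using the standard mean-difference formula, which is equivalent to a Pearson correlation between the 0/1 item scores and the total scores:

```python
from math import sqrt

def point_biserial(item, totals):
    """Point-biserial correlation between 0/1 item scores and total test scores.

    Equivalent to the Pearson correlation between the two vectors.
    Assumes at least one correct and one incorrect response on the item.
    """
    n = len(totals)
    p = sum(item) / n                                  # proportion who got it right
    q = 1 - p
    mean_correct = sum(t for i, t in zip(item, totals) if i == 1) / (p * n)
    mean_incorrect = sum(t for i, t in zip(item, totals) if i == 0) / (q * n)
    mean_all = sum(totals) / n
    sd = sqrt(sum((t - mean_all) ** 2 for t in totals) / n)  # population SD
    return (mean_correct - mean_incorrect) / sd * sqrt(p * q)
```

An item where the highest scorers answer correctly and the lowest scorers miss it yields a value near 1; a negative value means the pattern is reversed.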
If a test question is performing well, then it will correlate with the overall test scores of the students. GradeHub provides this information for you.
A high point biserial reflects a test question doing a good job. Values for point biserial can range from -1.00 to 1.00. A value of 0.15 or higher suggests that the question is performing well. What you are typically shooting for is something positive.
When you get into negative numbers for the point biserial, you see the reverse of what you’d expect: low performers are getting the item correct while high performers are not. You know there’s a big problem: maybe the answer key is wrong, or the question is worded in a way that keeps high performers from getting it right. Researchers have recommended removing items that have a negative point biserial.1
Generally, you want to keep your point biserial positive.
Mark: Oh, great, that’s interesting. Can you go over our matrix here a little? At a high level, it’s a summarized item analysis where GradeHub shows both the difficulty levels and the discrimination.
Jennifer: Right, so across the top, you’ve got the difficulty, and the difficulty here is represented as proportions, which is very typical in the testing field. It’s the proportion of students who got the question right out of all of the students who took the test. Anything from 0.85 up to 1.0 means that 85% or more of the students got that item correct, so it’s pretty easy. In the mid-range, represented as proportions 0.50 to 0.84, 50% to 84% of the students got the item correct. Those are the items that quite a few people have missed, so they’re in the medium range. It’s a great range (medium difficulty) for questions to be in because you’re getting a lot of information about your students when items are in that range. Then, you have the harder items that are below 0.50, and that’s just showing you that students do not understand the material. Generally, in professional testing, we try to steer away from items that are super hard, with proportions of 0.30 or below, because if everybody’s missing an item, you’re just not getting much information from it. There may be situations where you deliberately want to cover challenging content, but it is a good idea to review any items that come in extremely difficult.
It’s a great range (medium difficulty) for items to be in because you’re getting a lot of information about your students when items are in that range.
Mark: Is this the same thing as a p-value?
Jennifer: That’s right! It’s the same thing as a p-value, and that’s actually what the p stands for: proportion (not to be confused with the p-value from significance testing). When I was first introduced to the field, I wondered, “Okay, what is this strange p-value?” But all it is is the proportion of students who got an item correct out of all of those who took the test.
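As a sketch, computing the p-value and sorting items into the difficulty bands described above (0.85 and 0.50 cut-offs, with 0.30 and below flagged as very hard) could look like this in plain Python; the function names are illustrative, not GradeHub’s:

```python
def difficulty(responses):
    """Item difficulty, i.e. the p-value: the proportion of students
    who answered the item correctly (responses are 0/1)."""
    return sum(responses) / len(responses)

def difficulty_band(p):
    """Difficulty bands using the cut-offs discussed above."""
    if p >= 0.85:
        return "easy"      # 85% or more answered correctly
    if p >= 0.50:
        return "medium"    # the most informative range
    return "hard"          # review carefully, especially at 0.30 or below
```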
Mark: And we also plot by discrimination. Can you speak on that?
Jennifer: Yeah, so down the left you see that you’ve got this nice grid with difficulty across the top and discrimination along the side. Items below 0.15 fall into the poor discrimination row, while items from 0.30 to 1.0 show good discrimination. Presenting this information in a grid is a great way to look at how your items are performing in general. Generally, when an item lands in the poor row, you want to look at it to see if you can improve it for next time. If items are performing so poorly that they’re negative, you should consider taking them out of the test scoring, because a negative value usually indicates that something is wrong.
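The discrimination rows of the grid can be sketched the same way; note that only “poor” and “good” were named above, so the “fair” label for the middle band and the exact boundary handling here are assumptions:

```python
def discrimination_band(r_pb):
    """Rows of the discrimination grid. Cut-offs follow the discussion
    above; the 'fair' label for the 0.15-0.30 band is an assumption."""
    if r_pb < 0:
        return "negative"  # consider dropping the item from scoring
    if r_pb < 0.15:
        return "poor"      # review and try to improve next time
    if r_pb < 0.30:
        return "fair"
    return "good"
```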
Mark: So in that case, you would change the answer key weight to zero then and throw it out?
Jennifer: Yes! That’s what I would do. You have to be careful with that kind of performance because you might be measuring something outside of what was intended.
Mark: Interesting. And, if you were to look at this grid overall, you would say we had a well-written test that was easy.
Jennifer: Absolutely, and it’s nice to be able to see which items had good discrimination, which were easy, and which were in the medium range. This feature can help as you move forward, showing whether you might need items that are a little more difficult. It also lets you see the spread of your questions and how their performance looks graphically, which helps as you write items for future exams.
Mark: Thanks, so here is our item analysis report in GradeHub, and I was wondering if you could go through Question 2, which is highlighted. Can you provide us some of your insights on how you would read all the information that’s here on this graph?
Jennifer: Sure, there’s a lot of great information here. First, you can see the numbers for the distribution of responses. You can see that 15% of the students responded with “A,” and you can see that graphically in the chart above. It’s a great visual to help you see what the pattern of responses was. I also like how the correct answer is in bold, so you can easily see that most people responded with “D,” which was the correct answer. Then you have rpb, which is showing you the correlation, or the point biserial (that’s why it’s “pb”). The reason it’s an “r” is that, traditionally, correlations are represented with “r.” Again, this is showing you the item’s ability to discriminate those who have mastered the material from those who have not. The higher this number is, the better. Generally, you don’t want that number to be negative, and anything below 0.15 should probably be reviewed.
The other way you can visualize this information is in the graphs above labeled upper 27% and lower 27%. Why 27? It has to do with the fact that many of these techniques for estimating discrimination came about before there were personal computers. Frankly, a correlation took a lot of effort to compute by hand, so researchers came up with different ways to estimate it quickly.
One technique used the upper 27% of students from a class compared to the lower 27% of students, and that’s why you see that number. You can see this graphically here where the upper 27% is responding with the correct answer “D,” which is what you want to see. With the lower 27%, some of them are responding “D,” but importantly others are not and that’s precisely the kind of pattern that you want to see. It’s also reflected in an excellent correlation score of 0.465.
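The upper-lower index itself is simple to sketch: sort students by total score, then compare the proportion correct in the top 27% with the proportion correct in the bottom 27% (a plain-Python illustration, not GradeHub’s code):

```python
def discrimination_27(item, totals):
    """Classic upper-lower discrimination index.

    Sort students by total score, take the top and bottom 27%,
    and subtract the lower group's proportion correct from the upper's.
    """
    n = len(totals)
    k = max(1, round(0.27 * n))                    # group size
    order = sorted(range(n), key=lambda i: totals[i])
    lower = [item[i] for i in order[:k]]           # bottom 27% of students
    upper = [item[i] for i in order[-k:]]          # top 27% of students
    return sum(upper) / k - sum(lower) / k
```

A value near 1 means the top group got the item right and the bottom group missed it, which is exactly the pattern described above.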
The next column shows alpha (Cronbach’s alpha), the reliability of the test overall. You can see the reliability of this test is 0.799, which is very high. The alpha minus (Cronbach’s alpha with the item deleted) calculates what the reliability of the test would be without this particular item. Generally, you’ll notice that these numbers are a bit lower because you’re dealing with one fewer item in your computation. However, what you don’t want to see is a huge bump (an increase) in your reliability when this item is gone, because that’s an indicator that this item is not doing its job.
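For readers who want the formula behind those two columns, here is a plain-Python sketch of Cronbach’s alpha and the alpha-with-item-deleted calculation (illustrative only; GradeHub’s implementation may differ):

```python
def cronbach_alpha(items):
    """Cronbach's alpha from a list of item-score lists
    (one inner list per item, one 0/1 entry per student)."""
    k = len(items)                 # number of items
    n = len(items[0])              # number of students

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[s] for item in items) for s in range(n)]
    sum_item_var = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

def alpha_if_deleted(items, drop):
    """The 'alpha minus': alpha recomputed without one item."""
    return cronbach_alpha([it for i, it in enumerate(items) if i != drop])
```

If `alpha_if_deleted` comes back noticeably higher than `cronbach_alpha`, that item is hurting reliability, which is the red flag described above.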
Finally, in the right-hand column, you’ve got the 27% low and high giving you the numbers behind the charts that you see up above. Like I mentioned earlier, it’s a great way to look at the information at a glance. In this chart, you would want to see high numbers on the right-hand side and lower numbers on the left-hand side because that’s telling you that the item is doing its job. The top performers are performing well on this item, and those who are not so strong in terms of their knowledge are not performing as well.
The alpha minus calculates what the reliability of the test would be without this particular item.
Mark: Well, that’s awesome! Now, I was wondering if we could go through one more test question where everything isn’t looking as good. I highlighted Question 16, and was wondering if you could highlight some of the differences that you see from Question 2, which you looked at and said was a high-performing test question, versus one that’s lower-performing?
Jennifer: Sure, so let’s look at this one, it seems that 85% of the students are responding “C,” which is the incorrect answer. This doesn’t necessarily by itself mean that it’s a terrible item. If only those students who are answering “C” were the low performers, the question might be okay. However, there seems to be a problem surfacing when you look at the upper 27% graphic. In the upper 27% graph, most of those students are answering “C” as well, and that’s not the pattern that you want. You don’t want the high performers to be fooled by this question. It appears that this item is not doing its job. The upper and lower 27% is a great graphic. At this point, I would want to ensure that the answer key is correct or try to identify a problem with the question.
The other corroborating evidence that the item has a problem is the alpha score. If the alpha goes up when you remove the item from the test, it’s an indication that the item is not doing its job well. Visually, it shows up in the 27% low and high chart as well, which clearly shows that among the top performers, only 20% are getting the item correct, and that is not the pattern you want.
Mark: Okay great! Thanks again Jennifer for sharing all of your insights. To learn more about GradeHub, visit our website or schedule a demo at gradehub.com. Thanks for joining us!
1. Balogh, Jennifer E. A Practical Guide to Creating Quality Exams. Intelliphonics, 2016, pp. 159–160.