Validity, Reliability, and Overall Exam Statistics

Mark Espinola, CEO + Co-founder at GradeHub, recently spoke with Dr. Jennifer Balogh about her book and exam validity and reliability.

An edited transcript of the conversation follows.

Mark Espinola: Dr. Jennifer Balogh is the founder and general manager of Intelephonics. Dr. Balogh is an expert in educational assessments and is the author of A Practical Guide to Creating Quality Exams.

Jennifer, what motivated you to write your book?

Jennifer Balogh: Well, I’ve always been interested in the ways that teachers grade. For example, some would use rubrics and others would use different algorithms for curving grades. But it was when I became a graduate student and started writing exam questions, grading exams, and looking at item statistics that I realized how much I didn’t know. For example, I would see printouts on the wall and come across an item that only 4% of a class of 100 students had answered correctly. I wondered if that was a good question or just a difficult question and, ultimately, I didn’t know the answer.

Later, my career path took me right into the center of assessment when I started working for a large test publisher. From there, I started learning about testing theory and using it effectively. Now, I know the answers to those questions, and I want to be able to make that information accessible to people who use it because they’re using assessments in their jobs every day.


Mark: That’s great. So, Jennifer, in the vein of writing better exams, could you share some of your thoughts on what makes a valid test?

Jennifer: Absolutely, three main things make a valid test. That information is coming from a book of standards that’s mostly for professional test developers, but the book mentions that classroom tests really should follow the same kinds of guidelines.

First and foremost is validity. Validity is measuring what it is you intend to measure. In the world of academia, that is knowledge or skills that you’re trying to understand if the student has mastered or not.

There are three things that you want to do to ensure that your test is valid:

  • First, you want to cover the appropriate content. I suggest you create a blueprint of your test to make sure that the proportion of questions that you’re asking covers what you want to be covering in your course.
  • Second, you want to make sure that you’re asking questions in a relevant manner. Usually, when you’re providing information in a class, multiple-choice is an entirely “relevant” way to do that. A place where it wouldn’t be so relevant is a class such as driver’s education. Let’s say you only have a written component, so you’re not actually testing someone’s ability to drive. There, your exam (e.g., multiple-choice) is not so relevant. In situations where you’re testing an actual skill, you want to make sure you’re also testing with a performance assessment. (Mark: “Sure, you want them to drive then.”) That’s right; you want them to drive a vehicle and show you that they can parallel park.
  • Third, you want to ensure that your test can discriminate those students who have mastered the material from those who have not. One way that you can do that is by making sure that each item is showing discrimination. GradeHub has great tools and statistics that it presents to teachers so that they can make sure that their test can discriminate.
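
For readers who want to check item discrimination by hand, here is a minimal sketch. It assumes each student's answers are stored as a row of 0/1 scores, and it uses one common statistic: the correlation between an item and the total score on the remaining items. The function name and the sample data are hypothetical, and this is not necessarily the statistic GradeHub reports.

```python
from statistics import mean, pstdev

def item_discrimination(responses, item):
    """Correlation between one 0/1 item and the total score on the other items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest_scores)
    if sd_item == 0 or sd_rest == 0:
        # Everyone answered alike, so the item cannot separate students.
        return 0.0
    m_item, m_rest = mean(item_scores), mean(rest_scores)
    cov = mean([(x - m_item) * (y - m_rest)
                for x, y in zip(item_scores, rest_scores)])
    return cov / (sd_item * sd_rest)

# Hypothetical data: 6 students (rows) by 4 items (columns), 1 = correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]

for i in range(4):
    print(f"item {i + 1}: {item_discrimination(responses, i):.2f}")
```

In this made-up class, the first item is answered correctly mostly by the stronger students, so its value is clearly positive. The last item is answered correctly by everyone, so it tells you nothing about who has mastered the material.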


Mark: In the area of reliability, I think that some educators don’t have an excellent grasp of it. Can you share a little more about that?

Jennifer: Reliability is another critical factor that makes a test a good test. Reliability is basically consistency; it’s precision. So if you take a test on one day, will you get the same score if you take it on another day?

There are a lot of factors that come into play that make it so that test scores may not be precisely the same and a great example is a bathroom scale.

So you get on that scale multiple times within five minutes, and you want that scale to give you the same weight every time. If it does, then that’s a reliable scale. Different things can make a measurement less reliable. The variation can come from the person but also from the measurement instrument.

Using the bathroom scale analogy again, if you’re getting on that scale, and then you have chocolate cake for dessert, the next time you get on the scale you may see some differences. That’s because of you, the person (i.e., not the scale). This also might be the case on an assessment when students may have a good day one day and a bad day the next. That difference could affect their performance on the test.

Sometimes the difference can be caused by the measurement instrument as well. So using the scale analogy, the scale might be off a little bit, or your exam might be off a little bit as well. This is where reliability suffers.

Mark: What’s your recommendation on reliability?

Jennifer: Professional test publishers shoot for reliability that’s exceptionally high, in the 0.900s range. That’s not going to be realistic in the classroom.

We understand that instructors are not in the business of selling their tests professionally and so that level of reliability can be relaxed quite a bit. There’s no hard and fast rule, but generally, a test that is 0.700 and above is considered reliable.

In the classroom, I think you can relax reliability to 0.500 or above. If you’re getting below 0.500, you might really want to be looking at ways to increase the reliability. There are different ways that reliability can be estimated. What I mentioned before was taking a test one day and then taking a test the next day and making sure that those scores are the same. There are other techniques that you can use in order to estimate the reliability where you’re getting similar numbers just from one test.

Mark: So Jennifer, if I have low reliability on my exam, what are some of the things that I should be looking at doing?

Jennifer: There’s a mathematical relationship between the number of items in your test and the reliability of that test. The easiest thing that you can do to increase reliability is to make your exam longer in terms of the number of items. That said, if you’re adding items that all students get right or all students get wrong, you’re not adding a lot of information to the test. So, you want to make sure that the items that you’re adding are also good. Generally, increasing the number of items is a good rule of thumb on how to improve reliability.
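
The interview does not name the relationship, but a classic way to express it is the Spearman-Brown prediction formula. The sketch below assumes that formula and assumes the added items are comparable in quality to the existing ones, which is exactly the caveat above about adding good items. The function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor.

    length_factor = 2 means twice as many items of comparable quality.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# A test with reliability 0.500, lengthened to 1.5x, 2x, and 3x its size.
for factor in (1.5, 2, 3):
    print(f"{factor}x items: {spearman_brown(0.5, factor):.3f}")
```

Under these assumptions, doubling a test with a reliability of 0.500 predicts a reliability of about 0.667, and each further increase in length buys a little less.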

Mark: If you were using old tests, would removing poorly performing questions be another way to increase reliability?

Jennifer: Absolutely.

Mark: Okay, can you tell us how validity and reliability interact with each other? Maybe you can explain this graphic?

Reliability graphic

Jennifer: Sure, and this is a good thing to talk about because the two factors interplay with each other. So in the targets, you can imagine this as an exam. We’re looking at a target, and if the arrows keep hitting the same spot on that target, it’s consistent, it’s reliable. If they hit the bull’s eye, which is what you intend to measure, the test is valid.

So in this upper left-hand depiction, you’ve got the target, and a scatter of different points which are telling you that it’s not very reliable. And, all of those points are in the upper portion of the target. They’re not hitting the bull’s eye at all, which gives you an indication that it’s not a valid measure either.

So in the upper right, all the arrows are aimed at the bull’s eye, but you can see that they’re scattered again. That’s a situation where it’s unreliable yet valid.

On the bottom left you have an example of where the arrows are hitting pretty close together, but they’re missing the bull’s eye. So that’s an example of being reliable but not valid.

And then finally, what you’re shooting for when you’re creating an exam is both reliability and validity. So that’s where the arrows are hitting the bull’s eye in a very narrow scatter.

Mark: Okay, awesome, can we go through overall exam statistics in GradeHub?

GradeHub overall exam statistics including reliability with Cronbach's Alpha

Jennifer: Absolutely, let’s go through this. It’s all great information to have.

First, the report is showing you the number of students. That’s good to know. As an instructor, of course, you know how big your class is. However, it’s a good reminder that if you have a smaller group of students, you might see different trends associated with that group as opposed to a larger group of students. For example, if it’s a small group, maybe this group is very proficient and they’re fast learners. In this case, it will probably result in a bit of skew relative to what you commonly see with a larger class.

Mark: Would that affect test reliability too?

Jennifer: Yes, it can. So if you have a small group of students who are fairly homogeneous and performing the same way, you’re not going to see as high a reliability. Reliability depends on the group of students. You tend to see higher reliability if you have a large range in the performance of your students.

The number of exam questions is also a great thing to be able to see at a glance. And as we had talked about, it has a direct relation to reliability. Shorter exams with fewer questions will tend to have lower reliability than those with many questions. But again, you need to balance everything. 

The reliability estimate that you see here is the next thing. Again, what you’re trying to shoot for is something greater than 0.500 or 0.600, and the higher you get, the better. What this reliability is telling you is not how your class would perform from one day to the next on the same test but rather an estimate based on one test administration across your whole group of students. So what it’s telling you is internal test consistency. That’s the kind of reliability that it’s showing you.

You don’t want your test to be asking all of the same kinds of questions, of course. For an exam, you may want some range. If you ask the same type of question over and over again, it’s going to be highly reliable, but that’s not what you’re doing in a real testing environment.

If you’re getting a reliability score that is 0.980, that’s not necessarily what you want either because you want that test to be covering a lot of different content areas and evenly reflecting what you’re covering in class.

Mark: Interesting, now what’s the difference between the median and mean scores?

Jennifer: Sure, the mean score is just another name for an average. We’re all pretty familiar with that. It’s good to be able to see that mean score along with a median.

The median is the middle score. Let’s say we have 10 scores. You rank all your scores from highest to lowest. If you have an even number like 10, then you’re going to have two middle scores. The median is the average between those two middle scores.

The median is interesting to look at because, unlike the mean or average, it is not strongly influenced by outliers. For example, let’s say students didn’t do so well as a cohort on your test, and they scored maybe 70% on average. What you might see is that you have most people scoring around 70% and a couple of people who “aced the test” and got 100%. You’re going to see those top scores reflected much more in the mean than in the median. The median is kind of immune to that kind of “pull” from an outlier.
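
A small worked example, using made-up scores for a class of 10, shows both points: with an even number of scores the median is the average of the two middle ones, and a couple of perfect scores pull the mean much more than the median.

```python
from statistics import mean, median

# Hypothetical percent scores: most students near 70, two perfect scores.
scores = [66, 68, 69, 70, 70, 72, 73, 74, 100, 100]
without_top = scores[:-2]  # the same class without the two 100s

print("mean:", mean(scores), "median:", median(scores))
print("mean without 100s:", mean(without_top),
      "median without 100s:", median(without_top))
```

With these numbers the two middle scores are 70 and 72, so the median is 71, while the mean is 76.2. Dropping the two perfect scores moves the mean to 70.25 but the median only to 70.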

Mark: Interesting, that’s maybe why instructors whenever they’re using, for example, z-scores, which we have in our student report, sometimes they will use the median versus the mean for curving. Does that make sense (to do)?

Jennifer: Yes, that does.

Mark: Standard deviation, I’ve had enough statistics, but I’m always rusty on this. So what is an instructor looking for in their standard deviation?

Jennifer: Sure, the standard deviation is a way to quantify the spread of your distribution. If you have a large standard deviation, it means students had a large range in their scores, all the way from the bottom to the top. If the standard deviation is small, it means that most students were scoring within a small percentage of each other. You can use this information to try to understand your test and whether it was too narrowly focused or not. Again, it will depend a lot on your students and their performance.
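
As a quick illustration, the sketch below compares two made-up sets of scores with the same mean but very different spread. It uses the population standard deviation; whether a given report uses the population or the sample version is an assumption you would need to check.

```python
from statistics import mean, pstdev

# Hypothetical percent scores with the same mean of 70.
narrow = [68, 70, 70, 72]    # students scored within a few points of each other
wide = [40, 60, 80, 100]     # scores range from the bottom to the top

for name, scores in (("narrow", narrow), ("wide", wide)):
    print(f"{name}: mean={mean(scores)}, sd={pstdev(scores):.2f}")
```

Both groups average 70, but the standard deviation is about 1.41 for the first and about 22.36 for the second.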

Mark: Should instructors be looking for a particular range for standard deviation?

Jennifer: It really depends on the test. You can’t really give it a number because it will change on every different test depending on the number of students and the number of questions. So, I can’t just give you a number and say shoot for this.

Mark: We recently switched to Cronbach’s Alpha versus KR-20. Could you go a little bit into the similarities and the differences between those measures of test reliability?

Jennifer: Sure, that’s a great question.

KR-20, the KR stands for Kuder–Richardson, and it is the 20th equation presented in a publication that they published in 1937. KR-20 is a way to estimate reliability.

A common way to estimate reliability back in the day, before we had a lot of computing power, was to do a split half. Basically, you would take the even-numbered items in a test and the odd-numbered items, total each student’s score on each half, and correlate the two sets of scores across all your students. If the correlation was high, it was an indication that the reliability was good.
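
The split-half idea can be sketched in a few lines. This is a minimal illustration with hypothetical data, assuming answers are stored as rows of 0/1 item scores. It computes only the correlation between the two halves; in practice that correlation is often adjusted upward afterward, because each half is only half as long as the full test.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

def split_half(responses):
    """Correlate each student's score on odd-numbered vs. even-numbered items."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson(odd, even)

# Hypothetical data: 5 students (rows) by 6 items (columns), 1 = correct.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(f"split-half correlation: {split_half(responses):.3f}")
```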

KR-20 was a technique to do that so you could look at all different possible ways that you’re splitting up the test.

Cronbach took it a step further and liberated this estimate from needing to be dichotomous. Dichotomous means that you either get the question correct or incorrect. Cronbach kind of looked at it in a different way by using an ANOVA.

The bottom line is that KR-20 and alpha give you the same answer if you’re dealing with dichotomous data. They’re mathematically the same for dichotomous test questions. It’s just that Cronbach’s Alpha allows you to make reliability estimates when there’s partial credit while KR-20 does not. That doesn’t mean that you have to use KR-20 for dichotomous questions. You can use Cronbach’s Alpha for both.
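
To see the equivalence concretely, here is a minimal sketch of both estimates using their standard textbook formulas. The data are hypothetical, the function names are illustrative, and the code uses population variances throughout, which is an assumption: some tools use sample variances instead, and the two estimates match exactly only when the same convention is used for both.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Alpha from item variances; item scores may include partial credit."""
    k = len(responses[0])
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def kr20(responses):
    """KR-20 from the proportion correct p on each item; needs 0/1 scores."""
    n, k = len(responses), len(responses[0])
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(pi * (1 - pi) for pi in p) / total_var)

# Hypothetical data: 5 students (rows) by 6 dichotomous items (columns).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(f"alpha: {cronbach_alpha(responses):.3f}, KR-20: {kr20(responses):.3f}")

# With partial credit (0.5 on one item), only alpha still applies.
partial = [row[:] for row in responses]
partial[2][1] = 0.5
print(f"alpha with partial credit: {cronbach_alpha(partial):.3f}")
```

For a 0/1 item, the variance of the item scores equals p times (1 - p), which is why the two formulas agree on dichotomous data.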

Mark: So, I’ve had people in the field who will see a multiple-choice question that has a single response (for example, C) and they automatically think that a multiple-choice question is not dichotomous and that’s not the case, right?

Jennifer: Right, so you know if it’s a straightforward multiple-choice kind of question with just a single correct response, then KR-20 and alpha would give the same value. If you have a multiple-choice question with A, B, C, D, or E as options, and C is the correct answer, you’re either getting that question right or wrong. So, it’s dichotomous.

Mark: Interesting, thanks for clarifying that. Well, Jennifer, thanks for sharing all your insights.
