How to Interpret GradeHub’s Statistical Reports

Reliability

In the summary report, we provide the reliability of the test. In testing, reliability is a measure of consistency. If a test has high reliability, it is consistent to the point that students would achieve close to the same results if they were administered that same test multiple times.

The Kuder–Richardson Formula 20 (KR-20) is used to calculate reliability. Here is the formula:

$r_{KR20} = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{K} p_i q_i}{\sigma^2}\right)$

where $K$ is the number of questions, $p_i$ is the proportion of students who answered question $i$ correctly, $q_i = 1 - p_i$, and $\sigma^2$ is the variance of the total test scores.

The value of reliability can range from 0.0 to 1.0. A rule of thumb states that instructor-created tests with a reliability greater than or equal to 0.70 are acceptable.
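As a rough illustration, here is a minimal Python sketch of how KR-20 could be computed from a 0/1 scoring matrix (one row per student, one column per question). The function name, the made-up data, and the choice of the sample variance for the total scores are our own assumptions for this sketch, not GradeHub's implementation:

```python
import numpy as np

def kr20(scores):
    """KR-20 reliability for a students x questions matrix of 0/1 item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of questions
    p = scores.mean(axis=0)                      # proportion correct per question
    q = 1.0 - p
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Example: 5 students, 4 questions (made-up data)
scores = [[1, 1, 1, 0],
          [1, 0, 1, 0],
          [1, 1, 1, 1],
          [0, 0, 1, 0],
          [1, 1, 0, 1]]
print(round(kr20(scores), 2))  # prints the estimated reliability for this tiny example
```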

Reliability is important in testing because it indicates the extent to which students’ test scores are affected by random measurement error. Measurement errors can cause students to answer questions correctly or incorrectly when they otherwise would not. These errors arise from the following factors:

1. Student factors: motivation, concentration, fatigue, boredom, lapses of memory, carelessness in marking answers, luck in guessing

2. Test factors: questions that are too hard, confusing or unclear questions

3. Grading factors: inconsistent scoring guidelines, carelessness, counting or computation errors

One way to improve test reliability is to write good-quality questions. To determine the quality of test questions, we use item discrimination. Questions with high discrimination are good because they discriminate between high-performing and low-performing students. Reliability and discrimination are positively correlated: questions with high discrimination increase test reliability, while questions with low discrimination decrease test reliability. You can read more about item discrimination below.

Another way to improve test reliability is to lengthen the test. Longer tests generally produce higher reliabilities because the proportion of measurement error decreases as test length increases. Consider a group of students who did not study for the test: if the test has only one multiple choice question, there is a good chance a student can guess the answer correctly; however, if the test has 20 questions, it is very unlikely a student can guess all of them correctly.
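To put a number on that intuition, here is a tiny sketch assuming four answer choices per question and purely random guessing (both illustrative assumptions):

```python
# Probability that a purely guessing student answers every question correctly,
# assuming each question has 4 answer choices.
def p_all_correct(num_questions, num_choices=4):
    return (1 / num_choices) ** num_questions

print(p_all_correct(1))   # 0.25       -> one question: a 25% chance of a perfect score
print(p_all_correct(20))  # ~9.1e-13   -> twenty questions: effectively impossible by guessing
```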

Administering an unreliable test is like randomly assigning test scores to students. Determining whether your test is reliable or not will help you decide whether to administer the same test again to future classes or to make changes to improve the test.

Standard Error

The summary report also provides the standard error: an estimate of the standard deviation of a statistic. The standard error helps you understand how far a student’s true level of knowledge may be from their observed test score.

Let’s break down what exactly a standard error is. First, we need to understand the standard deviation. A standard deviation is a value that indicates how much individual responses vary, or “deviate,” from the mean. If individual responses vary greatly from the group mean, the standard deviation is large; if they cluster close to the mean, it is small.

To calculate the standard deviation of a class, we use the following formula:

$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$

where $x_i$ is an individual student’s score, $\bar{x}$ is the class mean, and $n$ is the number of students.

In the case of class tests, we do not know the test scores of the whole population (that is, all students in the school), so instead we use the standard error, since it is computed from known sample statistics (the test scores of an individual class). More specifically, we calculate the standard error of the mean with the following formula:

$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$

where $s$ is the sample standard deviation and $n$ is the number of students in the class.

Now that we know how to calculate the standard error, we can talk about what it does for testing. The standard error tells us how reliable the mean is: a small SE indicates that the sample mean is a more accurate reflection of the actual population mean.
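Here is a short, self-contained sketch of both calculations on made-up class scores; the scores and variable names are invented for illustration:

```python
import math

# Made-up class test scores (out of 100) for illustration.
scores = [72, 85, 90, 66, 78, 95, 81, 70]

n = len(scores)
mean = sum(scores) / n

# Sample standard deviation: how much individual scores deviate from the class mean.
std_dev = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

# Standard error of the mean: how precisely the class mean estimates the population mean.
std_err = std_dev / math.sqrt(n)

print(f"mean={mean:.1f}, sd={std_dev:.2f}, se={std_err:.2f}")
```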

Item Difficulty

Item difficulty analysis helps determine which questions are too hard or too easy. Item difficulty ranges from 0.0 (no students answered correctly) to 1.0 (all students answered correctly). Hard questions are answered correctly by 0 – 0.50 of the class, medium questions by 0.50 – 0.85 of the class, and easy questions by 0.85 – 1.0 of the class. It is recommended to have questions that about 60% of students can answer correctly.
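Below is a minimal sketch of how item difficulty could be computed and labeled using the bands above; the scoring matrix and helper names are made up for illustration:

```python
import numpy as np

def item_difficulty(scores):
    """Proportion of students answering each question correctly (students x questions, 0/1)."""
    return np.asarray(scores, dtype=float).mean(axis=0)

def difficulty_label(p):
    # Bands follow the ranges described above.
    if p < 0.50:
        return "hard"
    elif p < 0.85:
        return "medium"
    return "easy"

scores = [[1, 1, 0, 1],
          [1, 0, 0, 1],
          [0, 1, 0, 1],
          [1, 1, 1, 1]]   # made-up 0/1 scoring matrix

for i, p in enumerate(item_difficulty(scores), start=1):
    print(f"Q{i}: difficulty={p:.2f} ({difficulty_label(p)})")
```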

There are several possible explanations for why a question is “too hard.” If a question has a low difficulty value, such as 0.10, the question might have more than one correct answer; the topic might not have been covered significantly in class; the question might not be written clearly; or there might be an error in the answer key.

There are also explanations for why a question is “too easy.” If a question has a high difficulty value, such as 0.90, the topic might be common sense or trivial, or the answer might be too obvious.

One interesting takeaway from the item difficulty analysis is the comparison between the instructor’s expectations and the students’ results. If the instructor thought a question was supposed to be easy but the majority of students got it wrong, the instructor may need to cover that topic in more depth.

In addition, easy and hard questions should not automatically be eliminated from tests. Many instructors like starting the test with easy questions to reduce testing anxiety and have hard questions at the end of the test to see which students have a good grasp on the material. In the end, it is up to the instructor to decide what types of questions they would like to place in their test.

Point Biserial and Item Discrimination

We use item discrimination to determine whether students who answer a particular question correctly have done well overall on the test. In particular, we calculate the point-biserial correlation to measure the item discrimination for each test question.

$r_{pb} = \frac{M_1 - M_0}{s}\sqrt{p\,q}$

where $M_1$ is the mean total test score of students who answered the question correctly, $M_0$ is the mean total score of students who answered it incorrectly, $s$ is the standard deviation of all total test scores, $p$ is the proportion of students who answered the question correctly, and $q = 1 - p$.

Point-biserial values are commonly grouped into three bands: below 0.1, 0.1 – 0.3, and above 0.3. The higher the point-biserial correlation, the higher the item discrimination, and the fairer the question: a high value indicates that students who did well on the test tended to answer the question correctly, while students who did not do well tended to answer it incorrectly. It is recommended that a question have a correlation of 0.20 or higher.
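For illustration, here is a small sketch that computes the point-biserial correlation for each question using the formula above, correlating each item with the total test score (including that item); the data and function names are invented:

```python
import numpy as np

def point_biserial(item, totals):
    """Point-biserial correlation between one 0/1 item and the total test scores."""
    item, totals = np.asarray(item, dtype=float), np.asarray(totals, dtype=float)
    p = item.mean()                     # proportion who answered correctly
    q = 1.0 - p
    m1 = totals[item == 1].mean()       # mean total score of students who got it right
    m0 = totals[item == 0].mean()       # mean total score of students who got it wrong
    s = totals.std()                    # standard deviation of total scores
    return (m1 - m0) / s * np.sqrt(p * q)

# Made-up 0/1 scoring matrix: 5 students x 3 questions.
scores = np.array([[1, 1, 0],
                   [1, 0, 0],
                   [1, 1, 1],
                   [0, 0, 0],
                   [1, 1, 1]])
totals = scores.sum(axis=1)
for i in range(scores.shape[1]):
    print(f"Q{i + 1}: r_pb = {point_biserial(scores[:, i], totals):.2f}")
```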

One way to increase point-biserial correlation is by using the item difficulty analysis. Questions that are too easy or too hard do not discriminate well; almost all students will be able to answer the easy questions and few students will be able to answer the hard questions. Focus on including questions with medium difficulty (around 60% of students will answer correctly) to increase point-biserial correlation.

