# Standardizing Kenya’s National Exams

This issue is an extension of a previous article I wrote about KCPE and KCSE and their validity and reliability concerns. In this article I tackle both concerns in depth. I discuss how KCPE and KCSE fail at standardization and how the problems can be corrected.

Let me explain what standardizing KCPE/KCSE means, because it’s not an easy process. First, standardization is a rigorous scientific practice. Even though this may sound like a non-essential and painstaking practice that’s not worthy of the government’s attention, I, like most Kenyans, believe that these exams play a major role in deciding the futures of our children and relatives. For that reason, it is important that we don’t take chances when designing these exams.

Second, before standardizing an exam or a test, we have to admit that exams are measurement tools. Exams are like thermometers, rulers, and clocks. In the past, our forefathers used the sun to tell the time. Using the sun was accurate on the grand scale, but wildly inaccurate when it came down to hours, minutes, and seconds. The solstices could also throw their timing off by large margins of error.

In education, exams serve a similar purpose to clocks. Often, exams measure students’ learning or whether they are capable of learning. As part of education policy, they tell us whether education gives a positive return on investment and whether more resources, changes in strategy, or outright abolition are needed.

## Why exam standardization is important

Sometimes a test can determine whether you live or die. In the United States, offenders have to undergo various psychiatric tests to ascertain whether they are fit to stand trial. A schizophrenic offender will be deemed unfit for trial and, instead of facing incarceration, he might be directed towards psychiatric resources and help.

Offenders who commit murder, or who otherwise face the possibility of a death sentence, are often required to take IQ tests. These tests inform the jury’s judgement and help determine whether an offender understands the seriousness of the crime committed. (It is believed that an offender won’t rehabilitate until he comes to terms with the seriousness of his crimes, and that offenders with low IQs cannot fully grasp theirs.) Where the offender has a low IQ, the jury is prompted to give a life sentence for a heinous crime, whereas offenders with above-average IQs face a death sentence.

A single test can determine whether someone dies or whether they receive certain privileges. Examinations also fall within this category of tests that determine the course of one’s life. Since this is an important task for schools and governments, it is important to make sure examinations are accurate assessments of the cognitive ability of students.

Standardization entails making sure exams are valid and reliable measures of learning and cognitive ability. Using a non-standardized exam is like using a faulty clock to measure time or a faulty psychiatric test to assess the mental health of offenders.

## How to standardize exams

Standardizing exams entails boosting our trust in an exam’s ability to measure the learning capacity of students. Just as we trust thermometers to tell us the temperature, or IQ tests to tell us whether offenders understand their crimes, we should be able to trust standardized exams to measure learning ability among pupils.

The standardization process follows a scientific and often statistical approach. The questions below guide the process, and only after one question is answered satisfactorily do we move to the next.

What do exams measure? (Construct validity)

How do they measure? (Content validity)

How do we know they measure what they claim to measure? (Criterion validity)

How accurate is the measure? (Reliability)

All these questions have to be answered for standardization to be possible. The bracketed terms are statistical concepts that accompany the process.

## Construct validity

You can’t answer the rest of the questions until you get question one right. The purpose of the question is to help us understand why we give exams and whether we understand what we are testing. An example of a construct is intelligence, which is measured by IQ tests. Since we agreed that both KCPE and KCSE are measurement tools, our number one concern is to know what they measure. These exams don’t measure intelligence; if they did, we would know by comparing IQ test results with KCSE results (criterion validity, the third question above).

If these exams don’t measure intelligence, what do they measure? They probably attempt to measure academic performance. I use the word “attempt” because that is what they are set up to do, but they fall short of it.

After identifying academic performance as our construct, we also need to operationalize the definition and come to terms with what it means. From the table below, each construct is followed by the operational definition and the instruments that measure it.

A *construct* is a real or perceived trait which psychologists proceed to measure and ascertain. Intelligence is a real construct characterized by a general factor called g. Once psychologists identify constructs they build tests or instruments to measure them. The last column shows the various instruments used to measure each of the constructs on the left. It’s possible for constructs to be measured by more than one instrument. If academic performance is our construct, KCPE and KCSE can be argued to be the measurement instruments.

To determine whether academic performance is a real construct, its validity needs to be established. A series of statistical demonstrations can ascertain whether the construct is valid. If it’s real, we need to be able to measure it.

## Content validity

Assuming examinations are the best way to measure academic performance, we need exam questions to help us meet that objective. These questions will form the basis for our instruments. *Content validity* is a demonstration that seeks to understand whether the exam questions or test items devised to measure a construct like academic performance actually do measure it.

Many people never realize that exam questions are not supposed to be picked arbitrarily. Each question needs to be properly accounted for and justified. If you include 1+1 in an exam, you have to explain why you included it and what purpose it serves. Too many easy questions mean most students will pass and the exam won’t accurately measure their academic performance. The same is true of very difficult questions, which often fail to identify the actual abilities of students.

The table above shows how specific questions/statements are devised to measure a specific construct. Take a look at statements 1, 2, and 3 and compare them to statements 4, 5, and 6. The first three measure internal locus of control while the other three measure external locus of control. Examination questions should follow the same procedure and should be well crafted to measure a specific construct. The locus of control scale does not ask IQ test questions or personality questions; it only asks the relevant questions regarding locus of control. Assuming I am creating an actual grade one exam, how would I do it?

*To create an exam for class one students that measures a range of abilities corresponding to academic performance, I need to be careful with the questions I pick and how I distribute them across the paper. I would include 1+1 because it’s easy enough that all grade one students can get it. Questions about addition of numbers could cover 15% of the paper. Since subtraction is harder than addition, I would include questions like 9–4 to cover 35% of the paper. I would do the same for division and multiplication. Multiplication happens to be much easier for students than division, so it would account for 35% of the questions. Since division is the most difficult, these questions would cover 15% of the paper.*

*If our paper has 50 questions, 7.5 will be addition, 17.5 will be subtraction, another 17.5 will be multiplication, and 7.5 will be division (rounded to whole questions in practice). You notice that the questions have increasing difficulty and that neither simple nor difficult questions fill the paper. The cleverest student could possibly get all questions right; a few others will have difficulties with division while getting all other questions right. The paper ensures that the weakest students will probably get addition right. We can further expand the variability of these questions by including double digits. Instead of 7.5 addition questions, we could have 4 single-digit addition and 3.5 double-digit addition questions. These questions clearly show how we can measure both academic ability and the variability of academic performance among grade one pupils.*
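The blueprint arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `allocate_questions` and the largest-remainder rounding rule are my own assumptions, used here to turn fractional counts such as 7.5 into whole questions. The percentages are the ones from the example.

```python
from math import floor

def allocate_questions(total, weights):
    """Split `total` questions across topics by percentage weight.

    Uses largest-remainder rounding (an assumption, not an official rule)
    so that the whole-number counts still add up to `total`.
    """
    exact = {topic: total * pct / 100 for topic, pct in weights.items()}
    counts = {topic: floor(share) for topic, share in exact.items()}
    leftover = total - sum(counts.values())
    # Give the leftover questions to the topics with the largest fractional parts.
    # Ties are broken by the order in which topics were listed.
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts

# Percentages from the grade one example in this article.
blueprint = {"addition": 15, "subtraction": 35, "multiplication": 35, "division": 15}
counts = allocate_questions(50, blueprint)
print(counts)
```

Which topics receive the extra half questions is a judgement call for the exam setter; the sketch simply makes that choice explicit instead of leaving it to chance.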

## Criterion validity

Since we now have an exam complete with questions, the next step is to know whether these questions measure academic performance. Coming up with a test and items does not mean the instrument measures the specific construct you identified. For example, how do we know that the 1+1 we included in the exam measures academic performance? One needs to go a few extra steps to find out. One way is to establish the *criterion validity*. I hinted above that a construct can be measured by more than one instrument. In the case of academic performance, we can validate our measure of students’ learning by comparing the results of our exams with the results of other exams that measure academic performance.

If we have developed test A to measure academic performance, we can compare its results with those of test B, which has already been established to be a valid and reliable measure of academic performance. Benchmarking tests and their items is, therefore, a critical step in standardizing exams.
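One common way to make this comparison concrete is to correlate the two sets of scores. The sketch below computes a Pearson correlation between test A and test B for the same pupils. The scores are invented for illustration, and the choice of Pearson's r is my assumption about how the comparison would be done, not a description of any procedure KNEC uses.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)

# Hypothetical scores for the same six pupils on both tests.
test_a = [42, 55, 61, 70, 78, 85]  # the new exam
test_b = [40, 58, 60, 72, 75, 90]  # the established benchmark
r = pearson_r(test_a, test_b)
print(f"correlation between test A and test B: {r:.2f}")
```

A coefficient close to 1 would suggest that test A ranks pupils much as the benchmark does; a coefficient near 0 would suggest the two tests are measuring different things.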

Another way is to administer the exam to a small sample of grade one students and then map out how they perform on each question. If subtraction is much easier for students than you previously thought, you can decide to replace whole numbers with decimals and fractions until you reach a threshold that gives you a normally distributed performance. For KCPE or KCSE, this means administering an exam to a number of schools across the country and seeing how they perform. KNEC can then make adjustments to the questions where possible. The importance of this procedure is that it measures ability accurately while at the same time allowing us to see the variance among students. It also reduces floor and ceiling effects in a national exam.
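Mapping out performance per question can be as simple as computing the share of pupils who got each question right. Here is a minimal sketch under stated assumptions: the pilot data is made up, and the cut-offs of 0.2 and 0.9 for flagging questions are illustrative values I picked, not established standards.

```python
def item_difficulty(responses):
    """Return, for each question, the proportion of pupils who answered it correctly.

    `responses` holds one row per pupil; each entry is 1 (correct) or 0 (wrong).
    """
    n_pupils = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_pupils for i in range(n_items)]

def flag_items(difficulties, too_hard=0.2, too_easy=0.9):
    """Flag questions that almost nobody or almost everybody gets right."""
    flags = {}
    for number, p in enumerate(difficulties, start=1):
        if p < too_hard:
            flags[number] = "too hard (floor risk)"
        elif p > too_easy:
            flags[number] = "too easy (ceiling risk)"
    return flags

# Hypothetical pilot: 5 pupils, 4 questions.
pilot = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
difficulties = item_difficulty(pilot)
flags = flag_items(difficulties)
print(difficulties)
print(flags)
```

In this made-up pilot, question 1 is answered correctly by everyone and question 4 by no one, so both are flagged for revision before the real paper is set.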

## Reliability

The last step of the standardization process is to determine the accuracy of the test in measuring academic performance. We invoke the *reliability* of the instrument or the test in measuring the construct. We know our thermometers are accurate if they give us consistent results. A clock that loses two seconds every minute is not a good clock. Similarly, an exam with low accuracy is not a good exam. For every inaccurate result a child falls through the cracks.

Inaccurate measures of academic performance are the reason we have thousands of unprepared students in universities. They are also the reason some students who score high grades in high school fail at the university level. Inaccurate tests also mean that students who fail in high school might actually be smarter than the invalid tests suggested.

A measurement is accurate if we use it twice and get the same result, i.e., if every minute on our clock is equal to 60 seconds. As with the thermometer, varying results may mean the test instrument is neither accurate nor reliable. A student who sits for KCPE or KCSE twice should get the same result if the exam is an accurate measure of academic performance.
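The "sit it twice" idea is usually called test-retest reliability. As a rough sketch, we can summarize two sittings with two numbers: how closely the second sitting tracks the first (a correlation) and how much marks move on average. The marks below are invented, and the helper name `retest_summary` is hypothetical.

```python
from math import sqrt

def retest_summary(first, second):
    """Summarize agreement between two sittings of the same exam.

    Returns (r, mean_abs_change): the Pearson correlation between sittings
    and the average absolute change in marks per student.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need paired marks for at least 2 students")
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    covariation = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    spread_1 = sum((a - mean_1) ** 2 for a in first)
    spread_2 = sum((b - mean_2) ** 2 for b in second)
    r = covariation / sqrt(spread_1 * spread_2)
    mean_abs_change = sum(abs(a - b) for a, b in zip(first, second)) / n
    return r, mean_abs_change

# Hypothetical marks for five students who sat the same exam twice.
first_sitting = [250, 310, 275, 390, 340]
second_sitting = [255, 305, 280, 385, 345]
r, mean_abs_change = retest_summary(first_sitting, second_sitting)
print(f"test-retest correlation: {r:.3f}")
print(f"average change in marks: {mean_abs_change:.1f}")
```

Both numbers matter: a high correlation alone only says students keep their rank order, while a small average change says their actual marks stay put, which is closer to the "same result" standard described above.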