A question of difficulty?
In just a few days’ time, we will finally know. After two years of waiting, two years of worrying and two years of wondering, the results of the new English and maths GCSEs will be revealed and teachers will finally have some clarity.
The transition to these new specifications - and the new grading system - has been much criticised by teachers. In one of the most-read blogs on the Tes website this year, Chris Curtis, a head of English, explained the problems caused by teachers having to guess what a 4, 5, 6 or any other grade would look like. How could you decide whether to put a student in for a higher or foundation paper in maths? How could you, in good faith, judge where a student was in terms of their progress? And how could you report progress accurately to senior leadership teams? (See “‘All aboard the Titanic catastrophe of the new GCSEs’ - an English teacher’s warning from the frontline”, bit.ly/TitanicGCSEs).
“Consistency” was “missing in action”, he wrote, and teachers flocked to the comment section to add their own woes to the chorus. Like Curtis, they stressed that the uncertainty around the new “more difficult” exams was hampering their ability to teach and disadvantaging students.
Pushing the boundaries
Of course, teachers never do have clarity over grade boundaries or exam difficulty in advance. But, in the past, years of experience of the exams meant that teachers knew what, for example, a C-grade piece of work looked like. This year, though, the combination of different exam specifications and a completely new grading system has complicated matters.
Could things have been any different? In particular, could the exam boards have published grade boundaries in advance to give some certainty at a time of great change?
I would argue not. Grade boundaries can never be accurately revealed. And the reason why reveals a lot about exams and how they work - but also a surprising amount about how we learn, too.
Put simply, the reason why examiners can’t tell you grade boundaries in advance is that they can’t tell you how difficult an exam is in advance.
At first, this might sound a bit ridiculous. After all, as one teacher said to me, “Isn’t that what examiners are paid to do?”
But it turns out that predicting the difficulty of individual exam questions, or of whole exam papers, is impossible to do with a high degree of precision. So predicting grade boundaries is impossible, too. This is because relatively small changes in the surface structure of a question - how it’s worded, for example - can have a huge impact on how difficult pupils find it. Imagine trying to write a series of questions that all test whether pupils can add two-digit numbers, and that are all of equal difficulty.
Is “10+10” as difficult as “80+80”? What about “83+12” or “87+18”?
If you gave all of those questions to the same group of pupils, would you expect the same success rate for each question? Probably not.
And even smaller changes than that can have a significant impact: is “11+3” as difficult as “3+11”? If you use your fingers to do addition, perhaps not.
Word problems are equally tricky. Consider the following two problems:
A. Joe had three marbles. Then Tom gave him five more marbles. How many marbles does Joe have now?
B. Joe has three marbles. He has five marbles fewer than Tom. How many marbles does Tom have?
One study showed that 97 per cent of pupils got question A right, but only 38 per cent answered question B correctly (detailed in Kevin Durkin and Beatrice Shire’s 1991 book, Language in Mathematical Education).
Of course, there is a lot of research on why pupils find certain questions harder than others, and we can use this research to make broad predictions about difficulty. But that still doesn’t solve our problem. Even if we are fairly certain that one question is more difficult than another, it’s hard to predict how much more difficult it is. For example, most people would predict that pupils would find the word “cat” easier to spell than the word “definitely”. But by how much?
Similarly, look at the following questions:
A. Which is bigger: 3/7 or 5/7?
B. Which is bigger: 5/7 or 5/9?
Most teachers predict, correctly, that more pupils will get question A right than question B. But very few can predict exactly what percentage of pupils will get each one right. In one study, 90 per cent of 14-year-olds got the first question right, but only 15 per cent got the second one right (as quoted by educationalist Dylan Wiliam in his 2014 publication, Principled Assessment Design).
Most of these examples are maths questions, but this problem is, if anything, even more acute in other subjects. After all, maths is typically thought of as a fairly objective subject where answers can be marked as either right or wrong. Judging the difficulty of questions is even trickier when you have questions that attempt to assess the originality or the creativity of a pupil’s writing.
For example, the difficulty of unseen reading tests depends to a large extent on the vocabulary and background knowledge required for comprehension. Most English teachers will have stories to tell about how one tricky word in an unseen text can leave pupils completely flummoxed. I can remember two classes struggling with a past GCSE paper for which knowing the meaning of the word “glacier” was vital to understanding the text. When they took a past paper in which the text was of “equivalent” reading difficulty but about a more familiar topic, they did much better.
Small change, big impact
Why are there such differences in success rates between questions that are supposed to test the same thing? Why do small surface changes have a big impact?
It is likely to be because we think and reason in concrete ways. All of us, not just young children, find it hard to transfer knowledge and skills to new problems. Even if the “deep structure” of a problem stays the same, changing enough of its surface features will make the problem more or less challenging.
In a low-stakes exam, this issue is not quite so significant, because you can keep the questions exactly the same from one sitting to the next.
Pupils taking the exam in 2014 sit the same paper as pupils did in 2013, so their scores can be compared directly. You can therefore set grade boundaries that are consistent across time.
With low-stakes tests, you can also trial different versions of tests with the same pupils, to see just how comparable they are. A group of pupils might score 55 per cent on one version, but 60 per cent or 65 per cent on another version.
But this approach clearly won’t work for high-stakes exams, which have to be changed from one year to the next. Examiners who create high-stakes tests, such as GCSEs and A levels, are caught in something of a bind.
They have to change the questions from year to year, but doing so changes the difficulty in unpredictable ways. At its simplest, that is why grade boundaries have to change: because the questions change.
So how do we know that a grade 4 this year will be comparable with a grade 4 next year? Or, indeed, that a “pass” from last year will be comparable with a “pass” this year?
The big challenge for examiners is to come up with an accurate and precise way of measuring exactly how difficult different papers are relative to each other, so they can create grade boundaries that represent a consistent standard from year to year. This is a perennial challenge, but when exam specifications change as well, as they have done this year, that adds more complexity.
Statistics may hold the answers. Although GCSE examiners can’t trial tests in advance and see how pupils do on them, they can use statistics in other ways.
They can wait until pupils have taken the exam, see how they perform, and then adjust grade boundaries accordingly. They can also use prior attainment information about the pupils taking the exam, so they can compare how similar pupils from different year groups perform on different tests.
Or they could set standards just by trying to judge how hard the exam is, with no help from statistics. The history of that, though, does not make such a move appealing.
New Zealand tried such a system in the early part of the century and found that the number of pupils achieving “excellent” in a maths exam varied from 5,000 one year to 70 the next.
For this first year of the new GCSEs, exam regulator Ofqual has come up with a very specific use of statistics that it will employ to ensure comparability between the last year group sitting the old exams and the first year group sitting the new exams. The results of these two year groups will be statistically linked at key grading points.
Broadly, the proportion of pupils getting a 4 and above on the new GCSEs will match the proportion who got a C and above on the old ones. Similarly, the proportion getting a 7 and above will match the proportion who got an A and above.
The statistical link is based on the prior attainment of the cohort at key stage 2. So if this year’s cohort has similar prior attainment to that of last year’s cohort then about 70 per cent of 16-year-olds will get a 4 or above in English language and maths. About 16 per cent will get a 7 or above in English language, and 20 per cent will get a 7 or above in maths.
If the prior attainment of the cohorts is not the same, then the statistical link will still remain. But the headline pass rate might, therefore, rise or fall depending on the change in the profile of the cohorts.
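To make the mechanics of that link concrete, here is a deliberately simplified sketch of the proportion-matching idea. It is not Ofqual’s actual procedure, which also weights by key stage 2 prior attainment and involves examiner judgement; the raw marks and the 70 per cent target below are invented purely for illustration.

```python
# A deliberately simplified sketch of proportion-matching, not Ofqual's
# actual procedure. All marks and the 70 per cent target are invented.

def boundary_for_proportion(raw_marks, target_proportion):
    """Return the lowest raw mark at which roughly `target_proportion`
    of candidates score that mark or above."""
    ordered = sorted(raw_marks, reverse=True)
    # Index of the last candidate counted into the "grade 4 or above" group.
    cutoff = max(int(len(ordered) * target_proportion) - 1, 0)
    return ordered[cutoff]

# Suppose about 70 per cent of a statistically similar cohort achieved a C
# or above last year, so roughly 70 per cent should achieve a 4 or above now.
new_paper_marks = [12, 25, 31, 38, 40, 44, 47, 52, 55, 61]  # invented marks
print(boundary_for_proportion(new_paper_marks, 0.70))       # prints 38
```

If a paper turned out to be easier, the same procedure would push the boundary to a higher raw mark: it is the proportion achieving each grade, not the raw mark itself, that is held steady.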
So, despite all the talk of uncertainty, we do actually know something about how the new grades will work this year. And we can predict something else with some confidence, too: if the exam is harder but the grades have been set using a link to last year, then quite low raw marks could lead to quite good grades.
Dumbing down?
If this happens, it is entirely likely that some newspapers will leap on this as evidence of “dumbing down”. It won’t be. Because it is so hard to know in advance the precise difficulty of a question, we cannot rely on the number of raw marks needed to pass as a sign that a test is easy or hard. On a very hard test, a low mark may be very impressive. On a very easy test, a high mark may not be nearly as impressive.
Another useful way that statistics can contribute to standard-setting is with reference tests. As we’ve seen, high-stakes examiners can’t trial questions to find out their difficulty. But they can look at how pupils perform on low-stakes reference tests where the questions stay exactly the same. If, over time, pupils with the same prior attainment start to do better on such questions, it’s evidence that pupils really are learning more at school - and that the proportion of pupils receiving good grades should increase.
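As a rough illustration of that logic, the sketch below compares two cohorts with matched prior attainment on an identical set of reference questions. The scores are invented, and the real national reference test analysis is far more sophisticated than a simple comparison of averages.

```python
# Invented scores for two cohorts with matched prior attainment, answering
# an identical set of reference questions (proportion correct per pupil).
cohort_a = [0.42, 0.55, 0.61, 0.48, 0.70, 0.52]
cohort_b = [0.45, 0.60, 0.63, 0.50, 0.74, 0.58]

mean_a = sum(cohort_a) / len(cohort_a)
mean_b = sum(cohort_b) / len(cohort_b)

# Because the questions never change, a sustained rise for similar pupils
# points to genuine improvement, which could justify letting the proportion
# of good GCSE grades rise rather than holding it fixed.
print(f"Change on identical questions: {mean_b - mean_a:+.2%}")
```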
England’s first national reference test was held in March this year.
So why do we need all these statistics, and why can’t we rely on human judgement? Readers who are familiar with the work of Daniel Kahneman and other behavioural psychologists will be familiar with the answers to those questions: attempting to set consistent exam standards using human judgement is fiendishly hard, and an approach that uses statistics is more reliable. This is because, as Kahneman and other researchers have found, human judgement is prone to all kinds of biases and inconsistencies.
This is why, instead of being a reason to berate the exam boards, the lack of grade boundaries for the new GCSEs should actually be the catalyst for a much-needed discussion about where the domains of human judgement and statistics are best matched in education.
We tend to assume that human judgement will inevitably be superior to an algorithm or statistics. This was certainly the feeling when Ofqual held its consultation in 2014 on how grades should be set. The great majority of the awarding organisations and subject associations that responded to the consultation recommended an approach that used statistics. But the majority of schools and teachers that responded preferred an approach based on judgement.
The disconnect between teachers and assessment experts here is not helpful for anyone. The risk with the current changes is that they will lead to further misunderstanding and confusion. The opportunity is that they will lead to schools and assessment organisations seeking to bridge this gap with better training and dialogue. And in doing so, there is potential for both groups to discover more not just about how we measure learning, but about how we learn, too.
Daisy Christodoulou is director of education at No More Marking and the author of Making Good Progress? and Seven Myths about Education. She tweets @daisychristo