New Cambridge research tested AI essay grading. What should schools take from it?

Research note

New Cambridge research warns that even frontier AI models struggle when asked to independently grade complex essays against academic standards, and it helps clarify the more practical question for schools: which assessment tasks can AI support reliably, and where does teacher judgement remain essential?

Key points

Cambridge tested three frontier AI systems (including Claude and ChatGPT) on 761 undergraduate psychology essays from Cambridge, Nottingham and Manchester Metropolitan University.
The study tested direct grading of extended written responses against academic standards. It did not test objective scoring, short-answer items, coding, spreadsheets, diagrams, mathematical working or teacher-assisted feedback workflows.
The models matched broad human-awarded degree classification bands between 35% and 65% of the time, and the researchers noted that relying on them for grading could result in “homogenised” grading that “underestimates brilliance”.
Comparative judgement research is testing a different approach: comparing two pieces of work rather than assigning a grade to one response directly.
For schools, the important lesson is that “AI marking” is not one thing. The task type, subject, stakes and level of human oversight all matter.
Gradeo’s approach is task-specific: automated scoring where responses are objective, and AI-assisted feedback with teacher-awarded grades where professional judgement is required.

What Cambridge tested

Cambridge has published new research on AI essay grading that should be useful for schools, even though the study itself was conducted in a university setting.

The research tested three frontier AI systems on 761 undergraduate psychology essays submitted between 2022 and 2025. The essays came from the University of Cambridge, Nottingham and Manchester Metropolitan University, and had already been marked through normal university processes.

The AI systems were then asked to grade the essays, and the results were compared with the human-awarded degree classification bands.

The findings were mixed. AI matched the broad human-awarded classification between 35% and 65% of the time. The models also tended to compress marks towards the middle of the range: stronger essays were often marked lower than human assessors had judged them, while weaker essays were often marked higher. The systems were also sensitive to features such as essay length, vocabulary range and sentence complexity.
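To make the two headline measures concrete, here is a minimal Python sketch of how a band match rate and mark compression could be calculated. It is an illustration only, not the study's method: the marks are invented, the band boundaries are the conventional UK degree classification thresholds (assumed here, since the exact scheme used is not described above), and the function names `band` and `band_agreement` are hypothetical.

```python
from statistics import pstdev

# Conventional UK degree classification thresholds (an assumption for
# this sketch; the study's exact banding is not described in this note).
BANDS = [(70, "First"), (60, "2:1"), (50, "2:2"), (40, "Third"), (0, "Fail")]


def band(mark):
    """Return the classification band for a percentage mark."""
    for floor, label in BANDS:
        if mark >= floor:
            return label
    raise ValueError(f"mark out of range: {mark}")


def band_agreement(human, ai):
    """Share of essays where the AI band equals the human band."""
    matches = sum(band(h) == band(a) for h, a in zip(human, ai))
    return matches / len(human)


# Invented marks for eight essays, ordered from strongest to weakest.
human_marks = [78, 72, 66, 63, 58, 55, 48, 42]
ai_marks = [68, 66, 65, 62, 61, 58, 56, 52]

agreement = band_agreement(human_marks, ai_marks)
human_spread = pstdev(human_marks)
ai_spread = pstdev(ai_marks)

print(f"Band agreement: {agreement:.1%}")
print(f"Spread of human marks: {human_spread:.1f}")
print(f"Spread of AI marks: {ai_spread:.1f}")
```

On these invented marks the AI agrees with the human band for 3 of the 8 essays, and its marks have a narrower spread than the human marks. That is the pattern the study describes: the strongest essays pulled down and the weakest pulled up.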

Why this matters for schools

For schools, the relevance is not just the headline number. The more useful point is that Cambridge tested a particular kind of assessment task: direct grading of complex written work against academic standards.

In school assessment language, this is closest to standards-referenced or criteria-based judgement. A response is considered on its own, against a standard, rubric or marking criteria, and a mark or grade is awarded.

That is a demanding task. In extended writing, quality depends on argument, evidence, subject knowledge, relevance, structure and judgement. It is exactly the kind of response where fluent writing can look persuasive without necessarily showing stronger understanding.

The comparative judgement comparison

Comparative judgement differs by asking judges to compare two pieces of work and decide which is better. After many pairwise comparisons, a statistical model is used to create a scale.
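As a rough illustration of how pairwise decisions become a scale, the sketch below fits a Bradley-Terry model, one statistical model commonly used for comparative judgement. The research summarised here does not say which model was used, so that choice is an assumption. The scripts, the comparison data and the function name `fit_bradley_terry` are hypothetical.

```python
import math
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Estimate a log-scale quality score per script from (winner, loser) pairs.

    Uses the standard iterative (minorisation-maximisation) update for the
    Bradley-Terry model. Each script needs at least one win, and in practice
    at least one loss too, for the estimates to settle on finite values.
    """
    scripts = sorted({s for pair in comparisons for s in pair})
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    for s in scripts:
        if wins[s] == 0:
            raise ValueError(f"script {s!r} never wins; its score is unbounded")

    strength = {s: 1.0 for s in scripts}
    for _ in range(iterations):
        updated = {}
        for s in scripts:
            denom = 0.0
            for pair, n in pair_counts.items():
                if s in pair:
                    (other,) = pair - {s}
                    denom += n / (strength[s] + strength[other])
            updated[s] = wins[s] / denom
        # Rescale so the scores are centred on zero once logged.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {s: v / math.exp(log_mean) for s, v in updated.items()}

    return {s: math.log(v) for s, v in strength.items()}


# Invented judgements: each tuple is (preferred script, other script).
comparisons = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("B", "C"), ("B", "C"), ("C", "B"),
    ("A", "C"), ("A", "C"), ("C", "A"),
]

scale = fit_bradley_terry(comparisons)
for script, score in sorted(scale.items(), key=lambda item: -item[1]):
    print(f"{script}: {score:+.2f}")
```

No judge ever assigns a mark here. The scale comes entirely from the pattern of wins and losses, which is why agreement on pairwise decisions is the natural measure in these trials.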

A useful source of research here is Daisy Christodoulou's No More Marking, which has reported stronger results from AI-assisted comparative judgement than Cambridge found in direct essay grading.

In one trial, AI agreed with human judges in 81% of pairwise decisions, while human-human agreement in a previous Year 7 assessment was reported at 87%. In another primary writing trial, AI agreed with human decisions 82% of the time.

Those results are encouraging, but they come from a different marking methodology. Cambridge tested AI assigning grades directly to individual essays, whereas comparative judgement tests relative decisions between two pieces of work.

The better questions to ask

This is why broad claims about “AI marking” are usually not very helpful. The better questions are more specific:

What type of response is being assessed?
Is the task objective, standards-referenced, comparative or holistic?
Is AI assigning the mark, assisting feedback, checking consistency or flagging responses for review?
How high are the stakes?
What role does the teacher or assessor retain?

Digital assessment is broader than essays

This is especially relevant as school assessment moves beyond the traditional written script. Digital exams are increasingly able to include a wider range of answer types: objective-response items, short-answer responses, extended writing, coding, spreadsheets, diagrams, drawing and structured tables.

Those formats will not all behave the same way under AI.

Objective-response items can often be scored automatically where the answer rules are clear. Coding tasks and spreadsheets may include executable, structured or rule-based elements. Short-answer responses may sit somewhere between objective scoring and teacher judgement. Diagrams, mathematical working and multimodal responses raise different assessment questions again.

The Cambridge study gives useful evidence about one important part of the landscape: direct AI grading of complex essays. It does not settle the question for every subject, every answer type or every assessment workflow.

That is where schools need careful product design and better evidence.

Gradeo’s approach and the future research frontier

Gradeo’s approach treats AI in assessment as task-specific. Where responses are objective, Gradeo supports automated scoring. Where responses are non-objective, AI can assist with feedback and reduce repetitive workload, but teachers retain their central role in the process, reviewing the response and deciding the mark.

This can result in material benefits to schools. In the CSSA Online Trial HSC Examinations 2025, teachers using Gradeo’s AI-assisted feedback reported saving an average of 18 minutes per student per exam compared with regular marking, while keeping final judgement in human hands.

The same trial demonstrates why future research needs to go beyond essays. Students used coding, spreadsheet, diagramming and other digital response formats that better reflect how some subjects are now assessed online. Three in four teachers rated coding, spreadsheet and diagramming tools as suitable or very suitable, while also noting the importance of student practice and familiarisation.

This is the next research frontier for school assessment: not whether AI can “mark” in general, but where it can validly support assessment across different task types, subjects and levels of stakes.

Cambridge’s study is an important warning about direct AI grading of complex essays. Comparative judgement research shows AI may perform differently inside a structured comparison workflow. School assessment will need evidence across an even wider set of response types.

We're contributing to that literature by building assessment workflows where teacher judgement, automation and AI-assisted feedback all retain a central role. Our goal is to protect the role of teacher judgement while reducing workload, improving feedback and building better evidence about where AI can safely support assessment in real school settings.

Where to read more

[1] University of Cambridge, “AI not yet good enough to mark university essays”, 2026.

[2] No More Marking, “So, can AI assess writing?”, 2025.

[3] No More Marking, “Can AI assist the Comparative Judgement of primary writing?”, 2025.

[4] Gradeo and CSSA, Taking the Leap: Insights from Australia’s first large-scale online Trial Examinations, 2025.



Written by

Charlie Clark
