Cambridge has published new research on AI essay grading that should be useful for schools, even though the study itself was conducted in a university setting.
The research tested three frontier AI systems on 761 undergraduate psychology essays submitted between 2022 and 2025. The essays came from the University of Cambridge, Nottingham and Manchester Metropolitan University, and had already been marked through normal university processes.
The AI systems were then asked to grade the essays, and the results were compared with the human-awarded degree classification bands.
The findings were mixed. AI matched the broad human-awarded classification between 35% and 65% of the time. The models also tended to compress marks towards the middle of the range: stronger essays were often marked lower than human assessors had judged them, while weaker essays were often marked higher. The systems were also sensitive to features such as essay length, vocabulary range and sentence complexity.
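To make those two findings concrete, here is a minimal sketch of how band agreement and compression towards the middle could be measured. The band labels, the function names and the eight toy essays are all invented for illustration; none of this is the Cambridge study's data or method.

```python
from statistics import mean

# Hypothetical degree-classification bands, ordered weakest to strongest.
BANDS = ["Third", "2:2", "2:1", "First"]

def band_agreement(human, ai):
    """Share of essays where the AI band equals the human band."""
    matches = sum(h == a for h, a in zip(human, ai))
    return matches / len(human)

def mean_signed_gap(human, ai, band):
    """Average (AI minus human) gap, in band steps, for essays humans placed in `band`."""
    gaps = [BANDS.index(a) - BANDS.index(h) for h, a in zip(human, ai) if h == band]
    return mean(gaps)

# Toy data only: eight made-up essays, not results from any study.
human = ["First", "First", "2:1", "2:1", "2:2", "Third", "Third", "2:2"]
ai    = ["2:1",   "First", "2:1", "2:1", "2:2", "2:2",   "2:2",   "2:1"]

print(band_agreement(human, ai))            # share of exact band matches
print(mean_signed_gap(human, ai, "First"))  # negative: top essays marked down
print(mean_signed_gap(human, ai, "Third"))  # positive: weak essays marked up
```

A negative gap at the top band together with a positive gap at the bottom band is the pattern described above: marks pulled towards the middle of the range.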
For schools, the relevance is not just the headline number. The more useful point is that Cambridge tested a particular kind of assessment task: direct grading of complex written work against academic standards.
In school assessment language, this is closest to standards-referenced or criteria-based judgement. A response is considered on its own, against a standard, rubric or marking criteria, and a mark or grade is awarded.
That is a demanding task. In extended writing, quality depends on argument, evidence, subject knowledge, relevance, structure and judgement. It is exactly the kind of response where fluent writing can look persuasive without necessarily showing stronger understanding.
Comparative judgement differs by asking judges to compare two pieces of work and decide which is better. After many pairwise comparisons, a statistical model is used to create a scale.
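As a rough illustration of how pairwise decisions can become a scale, the sketch below fits a Bradley-Terry model, one commonly used model for paired comparisons. This is an assumption for illustration: the sources discussed here do not say which model any particular provider uses. The function name, script labels and comparison data are hypothetical, and the code assumes every script has at least one win and one loss.

```python
import math
from collections import defaultdict

def bradley_terry(comparisons, iterations=200):
    """Estimate a quality score per script from (winner, loser) pairs.

    Uses the standard iterative update for the Bradley-Terry model.
    Returns log-strengths centred on zero; higher means judged better.
    Assumes every script wins at least once and loses at least once.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    scripts = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        scripts.update((winner, loser))

    strength = {s: 1.0 for s in scripts}
    for _ in range(iterations):
        updated = {}
        for s in scripts:
            # Sum over every opponent this script was compared with.
            denom = sum(
                n / (strength[s] + strength[other])
                for pair, n in pair_counts.items() if s in pair
                for other in pair - {s}
            )
            updated[s] = wins[s] / denom
        # Rescale so the strengths stay on a stable footing.
        total = sum(updated.values())
        strength = {s: v * len(scripts) / total for s, v in updated.items()}

    logs = {s: math.log(v) for s, v in strength.items()}
    centre = sum(logs.values()) / len(logs)
    return {s: value - centre for s, value in logs.items()}

# Toy judgements only: each tuple is (preferred script, other script).
judgements = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("A", "C"), ("A", "D"),
    ("B", "C"), ("B", "C"), ("C", "B"),
    ("B", "D"),
    ("C", "D"), ("C", "D"), ("D", "C"),
]

scores = bradley_terry(judgements)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The point of the design is that no judge ever assigns a mark. The scale emerges from many relative decisions, which is why agreement on pairwise choices is the natural measure of quality in this workflow.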
A useful source of research here is Daisy Christodoulou's No More Marking, which has reported stronger results from AI-assisted comparative judgement than Cambridge found in direct essay grading.
In one trial, AI agreed with human judges in 81% of pairwise decisions, while human-human agreement in a previous Year 7 assessment was reported at 87%. In another primary writing trial, AI agreed with human decisions 82% of the time.
Those results are encouraging, but they come from a different marking method. Cambridge tested AI directly assigning grades to individual essays, whereas comparative judgement tests relative decisions between two pieces of work.
This is why broad claims about “AI marking” are usually not very helpful. The better questions are more specific: which kind of task, which type of response, and which assessment workflow?
This is especially relevant as school assessment moves beyond the traditional written script. Digital exams are increasingly able to include a wider range of answer types: objective-response items, short-answer responses, extended writing, coding, spreadsheets, diagrams, drawing and structured tables.
Those formats will not all behave the same way under AI.
Objective-response items can often be scored automatically where the answer rules are clear. Coding tasks and spreadsheets may include executable, structured or rule-based elements. Short-answer responses may sit somewhere between objective scoring and teacher judgement. Diagrams, mathematical working and multimodal responses raise different assessment questions again.
The Cambridge study gives useful evidence about one important part of the landscape: direct AI grading of complex essays. It does not settle the question for every subject, every answer type or every assessment workflow.
That is where schools need careful product design and better evidence.
Gradeo’s approach treats AI in assessment as task-specific. Where responses are objective, Gradeo supports automated scoring. Where responses are non-objective, AI can assist with feedback and reduce repetitive workload, but teachers retain their central role in the process, reviewing the response and deciding the mark.
This can result in material benefits to schools. In the CSSA Online Trial HSC Examinations 2025, teachers using Gradeo’s AI-assisted feedback reported saving an average of 18 minutes per student per exam compared with regular marking, while keeping final judgement in human hands.
The same trial demonstrates why future research needs to go beyond essays. Students used coding, spreadsheet, diagramming and other digital response formats that better reflect how some subjects are now assessed online. Three in four teachers rated coding, spreadsheet and diagramming tools as suitable or very suitable, while also noting the importance of student practice and familiarisation.
This is the next research frontier for school assessment: not whether AI can “mark” in general, but where it can validly support assessment across different task types, subjects and levels of stakes.
Cambridge’s study is an important warning about direct AI grading of complex essays. Comparative judgement research shows AI may perform differently inside a structured comparison workflow. School assessment will need evidence across an even wider set of response types.
We're contributing to that literature by building assessment workflows in which teacher judgement, automation and AI-assisted feedback each have a clear role. Our goal is to protect the role of teacher judgement while reducing workload, improving feedback, and building better evidence around where AI can safely support assessment in real school settings.
[1] University of Cambridge, “AI not yet good enough to mark university essays”, 2026.
[2] No More Marking, “So, can AI assess writing?”, 2025.
[3] No More Marking, “Can AI assist the Comparative Judgement of primary writing?”, 2025.
[4] Gradeo and CSSA, Taking the Leap: Insights from Australia’s first large-scale online Trial Examinations, 2025.
