University of Nottingham's blind-study evaluation of AI in assessment design

A blurred modern office or classroom seen through a transparent glass board. Bright cyan handwriting on the glass asks, “What does LLM stand for?” followed by four multiple-choice answers: “A) Large Language Model,” “B) Linear Language Mechanism,” “C) Lateral Language Methodology,” and “D) Logical Learning Model.”

This academic year, a key focus for Jisc’s AI team is understanding AI’s potential to support marking and feedback. We’re currently running a series of pilots exploring the impact of education-focused platforms (Graide and TeacherMatic) and general-purpose AI tools (including Microsoft Copilot, ChatGPT and Gemini). More information about these pilots and the insights they are surfacing can be found here.

Given this focus, we’ve also been talking to colleges and universities that are doing exciting work related to AI in assessment.

The University of Nottingham, for example, has designed and delivered a blind study to evaluate the quality of AI-generated assessment materials, allowing users to focus on outputs without the influence of tool branding or preconceptions.

Their work offers useful insights into how discrete, focused studies can surface meaningful findings. Their approach highlights the value of pedagogy-first design, structured evaluation, and cross-disciplinary input when making decisions about AI in education.

The following is a guest blog from Cristina De Matteis, Cecilia Goria and Rob Shipman at the University of Nottingham.

Evidence over Presumption: A blind-study evaluation of AI in assessment design

As generative AI reshapes the landscape of higher education, it’s easy for opinions to be driven more by headlines and hype than by real evidence. The risk? Powerful new tools could be dismissed, or embraced, without a clear understanding of their real educational value.

At the University of Nottingham, we wanted to move beyond speculation. Our goal was to engage colleagues, including ourselves, in thinking critically about AI’s role in teaching and assessment, and to build confidence through evidence rather than presumption. To do this, we designed a small but revealing experiment: a blind study that invited educators to judge AI-generated content on its quality alone, stripped of branding or preconceptions.

Putting AI to the Test

We focused on multiple-choice questions (MCQs), a deceptively simple format that crosses disciplines but demands real skill to write well. Plausible distractors, balanced phrasing, and alignment with learning outcomes all require nuanced pedagogical judgement. That made MCQs an ideal testbed for examining what AI can (and can’t) do well.

Teaching content from different disciplines was used as input for generating sets of related MCQs. Seven educators from four faculties reviewed anonymised sets of questions produced by three tools: two generative AI platforms and one AI tool which used only the specific content from the slides to generate questions, rather than broader knowledge embedded in an LLM. The reviewers didn’t know which tool had produced which set, and all outputs were presented in a consistent format.
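As a rough illustration of the blinding step described above, here is a minimal Python sketch. It assumes the generated question sets are held in a dictionary keyed by tool name; that structure, the function name `blind_question_sets` and the "Set A" style labels are hypothetical and are not taken from the Nottingham study.

```python
import random


def blind_question_sets(sets_by_tool, seed=None):
    """Replace tool names with neutral labels.

    sets_by_tool: dict mapping a tool name to its list of questions
                  (a hypothetical structure used for illustration).
    Returns (blinded, key):
      blinded maps a neutral label to a question set, for reviewers;
      key maps the same label back to the tool name, for organisers only.
    """
    rng = random.Random(seed)
    tools = list(sets_by_tool)
    rng.shuffle(tools)  # randomise which tool gets which label

    labels = [f"Set {chr(ord('A') + i)}" for i in range(len(tools))]
    key = dict(zip(labels, tools))
    blinded = {label: sets_by_tool[tool] for label, tool in key.items()}
    return blinded, key
```

Reviewers would see only `blinded`, while `key` stays with the organisers until the reviews are complete.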

Prompts were designed to be discipline-agnostic and to map the MCQs to Bloom’s taxonomy, ensuring a meaningful spread of cognitive challenge.
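To make the idea of a discipline-agnostic, Bloom-aligned prompt more concrete, the sketch below builds one from a template. The wording is an assumption written for illustration, not the prompt used in the study. The level names follow the revised Bloom’s taxonomy, and `build_mcq_prompt` is a hypothetical helper.

```python
# Levels of the revised Bloom's taxonomy, from lower to higher order.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyse", "Evaluate", "Create"]


def build_mcq_prompt(teaching_content, levels=BLOOM_LEVELS, per_level=1):
    """Build a prompt with no subject-specific wording (illustrative only)."""
    lines = [
        "Using the teaching content below, write multiple-choice questions.",
        "Each question needs one correct answer and three plausible distractors.",
        f"Write {per_level} question(s) for each of these Bloom's taxonomy levels:",
    ]
    lines += [f"- {level}" for level in levels]
    lines += [
        "Label each question with its level.",
        "",
        "Teaching content:",
        teaching_content,
    ]
    return "\n".join(lines)
```

Because the template never mentions a subject, the same prompt can be reused across disciplines by changing only `teaching_content`.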

What emerged was not just a simple ranking of tools, but a reminder that performance can vary in ways we don’t expect. Some AI platforms excelled in particular subjects while underperforming in others, suggesting that assumptions about “best” tools can be misleading without structured, contextualised comparison.

What We Learned

The study highlighted the importance of structure and transparency when exploring AI in education.

For anyone planning similar work, we’d recommend:

Start with strong prompts. Align them with your pedagogic objectives.

Expect surprises. Tool performance isn’t uniform, so evaluation should focus on the process and pedagogy, not brand or hype.

Bring in a mix of voices. Cross-disciplinary pilots reveal blind spots and unexpected strengths.

Keep it blind. Withholding the origins of AI-generated content until reviewers have given their judgements removes preconceptions and keeps the focus on quality.

Gather evidence. Develop a simple methodology for collecting the opinions and experiences of users.

Be transparent with students. When AI supports content creation, we suggest saying so, because openness builds trust.
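For the "gather evidence" and "expect surprises" recommendations above, one simple option is to record each reviewer's score against the blinded set label and the discipline, then summarise by both. The sketch below assumes numeric scores and a (label, discipline, score) record format. Both are assumptions for illustration, not the instrument used in the study.

```python
from collections import defaultdict
from statistics import mean


def summarise_ratings(ratings):
    """Average reviewer scores per (blinded set, discipline).

    ratings: iterable of (set_label, discipline, score) tuples,
             a hypothetical record format.
    Returns a dict mapping (set_label, discipline) to the mean score.
    """
    grouped = defaultdict(list)
    for set_label, discipline, score in ratings:
        grouped[(set_label, discipline)].append(score)
    return {group: mean(scores) for group, scores in grouped.items()}
```

Keeping discipline in the grouping key means a set that does well in one subject and poorly in another is not hidden behind a single overall average.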

Beyond the Findings

By treating AI as something to interrogate rather than fear or follow, we found a shared space for professional reflection.

This evidence-informed mindset had practical outcomes, too. Most educators in our pilot said they would use the AI-generated materials from two of the three tools, after some editing, in their teaching. One tool was widely viewed as unusable. This was the tool that only drew upon specific knowledge sources, not the wider training data of an LLM.

Thoughtful experiments like this one can help shape how institutions engage with emerging tools, anchoring innovation in evidence, pedagogy, and collaboration.

Jisc helped produce a first draft of this blog with the support of AI, based on textual responses provided by the University of Nottingham team in a form submission and on meeting notes. The draft was then shared with the team at the University of Nottingham for editing. For more information about how Jisc uses AI to support its writing, please see this blog.

Find out more by visiting our Artificial Intelligence page to explore publications and resources, learn more about our communities and sign up for our AI Literacy training.
For regular updates from the team, sign up to our mailing list.
Get in touch with the team directly at AI@jisc.ac.uk