AI Detection and assessment – an update for 2025

This blog post is an update to two previous entries that focused on AI detection tools: ‘AI Detection: Latest Recommendations’ (September 2023) and ‘AI Writing Detectors – Concepts and Considerations’ (March 2023). As we look ahead to the next academic year, this feels like an opportune moment to revisit the topic.

The core concepts and conclusions from those earlier posts still stand, but much has changed. There have been significant advances in AI tools since I last wrote on the subject. We now have more evidence of AI’s capability to complete coursework and of the difficulty humans face in detecting its use. However, there have been no major developments in AI detection technology itself.

This update is partly prompted by the rise of AI assessment scales, traffic-light systems and similar tools. These approaches are undoubtedly helpful in supporting students to understand how AI can be used appropriately. Yet, they also present challenges. I’ll use the assessment scale proposed by Mike Perkins, Leon Furze, Jasper Roe, and Jason MacVaugh as an example, although the same challenges apply to many other approaches, such as ‘amber’ traffic light indicators. In this scale, we have these two levels amongst the five:

  • AI Planning: You may use AI for planning, idea development, and research. Your final submission should show how you have developed and refined these ideas.
  • AI Collaboration: You may use AI to assist with specific tasks such as drafting text, refining and evaluating your work. You must critically evaluate and modify any AI-generated content you use.

This then leads to the question: how do we ensure integrity in assessments that fall into these categories?

The general mood has shifted decisively away from automated AI detection. However, for many assignments, this leaves us relying solely on human detection – unless the assessment is substantially redesigned.

It’s also worth noting that whilst AI detection perhaps has very limited use in colleges and universities, the JCQ guidance for colleges makes several references to AI detection and notes:

“The use of detection tools, where used, should form part of a holistic approach to considering the authenticity of students’ work; all available information must be considered when reviewing any malpractice concerns.”

It’s also notable that many of the examples cited by JCQ include the use of AI detection.

We’ll explore AI detection in more detail shortly, but first, let’s take a look at some of the latest research exploring whether AI is actually capable of achieving high marks in assessments.

AI passing coursework

Let’s not forget that the subjects taught and the methods of assessment used across universities and colleges are wide-ranging and diverse. The debate around AI in assessment has largely focused on very traditional forms of written work.

We know that in at least some subjects, the answer to the question “Can trivially generated AI work pass?” is a clear yes. By “trivially generated” I mean work produced without sophisticated prompting or a lot of effort on the part of the student.

In the study “A real-world test of artificial intelligence infiltration of a university examinations system: A ‘Turing Test’ case study”, Peter Scarfe and colleagues ran an experiment in which they added AI-generated assignments to batches of student submissions for a Psychology degree. They found:

“On average grades achieved by AI submissions were just over half a classification boundary higher than that achieved by real students, though this varied across modules”

A similar result was found by Liz Hardie and colleagues, looking across multiple assignment types in their NCFE-funded study “Developing robust assessment in the light of Generative AI developments”. They found:

“The 17 assessment types (assessed over 59 questions) used in this research were generally not robust in the face of GAI; either the GAI answers performed well and achieved a passing mark, or marker training increased the number of false positives. The most robust assessment types were Audience-tailored, observation by learner and reflection on work practice, which align with what is often called ‘authentic assessment’ although GAI answers to these were still capable of achieving a passing grade.”

It’s worth noting the final statement – that AI tools can still pass ‘authentic’ assessments.  Alexander Kofinas and colleagues found the same in their paper “The impact of generative AI on academic integrity of authentic assessments within a higher education context”:

“… GenAI can generate authentic assessments that pass the scrutiny of experienced academics”.

In a way, this isn’t surprising. If GenAI is finding a place in the workplace, it’s because it’s good at these sorts of task, and authentic assessments are usually trying to replicate the kinds of tasks we might see in the workplace.

AI tools are getting better and better at long form writing

Most of the studies I’ve mentioned pre-date the development of ‘Deep Research’ models. These models form a multi-step strategy for answering a problem, searching the web at each stage. They take tens of minutes rather than seconds, but the results, at least on the surface, can be quite impressive. If you’ve not seen this in action before, I used Gemini as an example and asked it to produce an updated version of my Sept 2023 blog post. The response is long and thorough, and I’ve shared it as a Google Doc. So AI tools are getting better and better at more complex assessment types.

AI Detection tools – An update

We’ve long assumed that generative AI tools would generally outpace the development of AI detection tools and that is indeed proving to be the case.

On Accuracy

The question we’re usually asked is: How accurate are AI detection tools? Unfortunately, that’s not a simple question to answer, and it’s even more difficult to compare accuracy between different products as they often calculate and display metrics very differently – for example at a sentence, paragraph or document level.

Let’s start with a figure that isn’t useful: how often a tool correctly identifies AI-generated text as being AI-generated. On the surface, that might seem like a useful metric – but consider this: a tool that labels everything as AI-generated would achieve 100% success at identifying AI content, but it would also misclassify every human-written submission. In other words, it would be entirely useless in practice.
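To make this concrete, here is a minimal Python sketch. The batch of ten submissions and the `always_ai` "detector" are invented purely for illustration; they do not come from any study or product.

```python
def rates(labels, predictions):
    """Return (detection rate, false positive rate).

    Both arguments are lists of booleans, where True means "AI-generated".
    """
    true_pos = sum(1 for y, p in zip(labels, predictions) if y and p)
    false_pos = sum(1 for y, p in zip(labels, predictions) if not y and p)
    ai_count = sum(labels)
    human_count = len(labels) - ai_count
    return true_pos / ai_count, false_pos / human_count


# Hypothetical batch: 3 AI-generated and 7 human-written submissions.
labels = [True] * 3 + [False] * 7

# A "detector" that simply flags everything as AI-generated.
always_ai = [True] * len(labels)

tpr, fpr = rates(labels, always_ai)
print(f"Detection rate: {tpr:.0%}, false positive rate: {fpr:.0%}")
```

The detection rate looks perfect, yet every human-written submission is also flagged, which is why the detection rate on its own tells us very little.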

What Actually Matters: Discrimination and False Positives

What we’re truly interested in is how accurately a tool can distinguish between human-written and AI-generated text and, crucially, how often it produces false positives.

False positives are particularly important in education, where the consequences of being falsely accused of misconduct are severe. It’s essential to remember that AI detection tools may have different intended use cases. For instance, a tool designed to flag AI-generated social media posts might tolerate a high false-positive rate because the stakes are low. That kind of tool would be completely inappropriate for use in an academic setting.

This means that the most accurate tool in general terms may not be the best for education. In our context, a low false positive rate is far more important than an overall high detection rate.

What the Research Says

Most mainstream, paid AI detection tools perform reasonably well at identifying content that is entirely generated by AI, provided the user hasn’t attempted to defeat the tool (e.g., via paraphrasing or rewriting). However, these tools are relatively easy to circumvent. That said, there are signs that some detection tools are improving in their ability to recognise certain patterns of AI-assisted manipulation.

Much of the research in this area was published in 2023 and 2024, likely reflecting growing awareness of the limitations of detection-based approaches.

In the following sections I am including some tables with data.  These are useful in seeing trends and the spread of data.  They should not be used to determine the current effectiveness of any product though. It’s clear that the results depend partly on the exact version of the LLM used to create the text, the exact version of the detection tool, and the nature of the text being used for the test. The first two factors are constantly changing, and so these figures aren’t representative of the performance of the tool today.

I’ve focussed mostly on papers published in 2024 or later, although I’ve also included the Weber-Wulff paper as it was so widely quoted at the time. Each study looks at different tools, so the products listed vary between tables.

Spotting AI-Generated Work

In general, mainstream tools such as Turnitin and CopyLeaks have no trouble identifying text that is purely AI-generated and left unmodified. However, the picture changes dramatically when using lesser-known tools found via a casual internet search. Researchers have found that many of these perform poorly, even in ideal conditions. Some don’t even work at all!

To start, we’ll examine two studies that focus specifically on the correct identification of AI-generated text. Remember: while this metric isn’t sufficient on its own, it offers a useful foundation for further discussion.

Product | Kar et al (published May 2024) | Lui et al (published May 2024)

Content at Scale 52%
Content Detector 78.26%
Copyleaks 100%
DupliChecker 0%
GPT Zero 97% 70%
Originality.ai 100%
Quillbot 100%
Sapling 100%
Scispace 67%
Turnitin 94%
UndetectableAI 100%
Word Tune 100%
ZeroGPT 95.03% 96%

We can see that largely, with some exceptions, the tools perform remarkably well at identifying AI-generated text. One didn’t work at all!

We’ll now look at two related metrics that assess overall accuracy. While each study uses a slightly different methodology, both broadly aim to measure how well tools can correctly identify both human-written and AI-generated text. Once again, the results show significant variability across tools.

Product | Perkins et al (published Sept 2024) | Weber-Wulff (published Dec 2023)

Compilatio 73%
Content at Scale 33%
Copyleaks 64.8%
Crossplag 60.8% 69%
Detect GPT 46%
Go Winston 67%
GPT Zero 26.3% 54%
GPTKit 46.1%
Turnitin 61% 76%
ZeroGPT 46.1% 59%

False Positive Rates

In general, false positive rates for mainstream, paid AI detectors such as Turnitin are relatively low. The best tools report rates around 1-2%, with Turnitin often cited as among the most reliable in this regard. However, it’s important to note that most of the published studies involve relatively small sample sizes. In some cases, this means a tool may appear to produce zero false positives but that doesn’t necessarily mean the tool is flawless. It may simply reflect the limits of the sample.
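The small-sample caveat can be quantified with a standard statistical bound. If a study sees zero false positives in n human-written texts, the true rate could still be as high as 1 - 0.05^(1/n) at 95% confidence (roughly 3/n, the so-called "rule of three"). The sketch below is illustrative only: the sample sizes are hypothetical, not taken from the studies discussed here, and the bound assumes the test texts are independent and representative.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """One-sided upper bound on a rate when 0 events are observed in n trials.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    return 1 - (1 - confidence) ** (1 / n)


# Hypothetical study sizes (number of human-written texts tested).
for n in (10, 50, 100, 1000):
    bound = zero_event_upper_bound(n)
    print(f"0 false positives in {n} texts: true rate could be up to {bound:.1%}")
```

In other words, a clean result on a few dozen texts is compatible with a false positive rate of several percent.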

This low false positive performance does not extend to many of the free or lesser-known AI detectors that users might discover through a quick internet search. Testing of these tools has revealed some alarming false positive rates. It’s important to remember that the results presented in these studies reflect a specific moment in time, with a particular version of the detection tool and a specific large language model (LLM) generating the text. These variables change frequently, so the findings may not be representative of the tool’s current performance.

Turnitin, which is explicitly designed for educational use, prioritises low false positives, and the available data supports this focus.

While a false positive rate of 1-2% might seem low, the scale of educational assessment means this could still translate into a substantial number of false accusations. For example, consider an institution with:

  • 20,000 students
  • Each taking 8 modules per year
  • With 3 assessments per module

That would amount to 480,000 assessments per year. Even a 1% false positive rate would generate approximately 4,800 false positives annually: a huge burden to investigate and manage, and potentially damaging for student trust and wellbeing.
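The arithmetic can be reproduced, and adapted to your own institution, with a few lines of Python. This is a rough sketch: it assumes every assessment is human-written and is run through the detector once, and the variable names are my own.

```python
students = 20_000
modules_per_student = 8
assessments_per_module = 3

assessments_per_year = students * modules_per_student * assessments_per_module


def expected_false_positives(total_assessments, false_positive_rate):
    """Expected number of human-written assessments wrongly flagged."""
    return total_assessments * false_positive_rate


print(f"Assessments per year: {assessments_per_year:,}")
for rate in (0.01, 0.02):
    flagged = expected_false_positives(assessments_per_year, rate)
    print(f"At a {rate:.0%} false positive rate: about {flagged:,.0f} false flags")
```

Changing the three numbers at the top gives a quick estimate of the investigation workload a given false positive rate would imply.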

The following table includes false positives found in a few different studies.

Product | Perkins et al (published Sept 2024) | Hyatt et al (published April 2025) | Lui et al (published May 2024)

Content at Scale 28%
Copyleaks 50% 0%
Crossplag 30%
Detect GPT 18%
GPT Zero 10% 2% 22%
GPTKit 0%
Originality.ai 2% 10%
Turnitin 0% 0%
ZeroGPT 0% 16%

Defeating AI Detectors

The relationship between AI detectors and techniques to bypass them has become a classic cat-and-mouse game. One of the earliest and most common strategies has been the use of paraphrasing tools such as Quillbot to modify AI-generated text in ways that evade detection. It appears that some detection companies have since begun accounting for this.

In general, we see that paraphrasing or manual manipulation of AI-generated text, such as introducing spelling errors, adjusting sentence structure, or altering vocabulary, can significantly reduce the effectiveness of detection tools. The impact varies, but in many cases it leads to a substantial drop in accuracy.

The data discussed below highlights a few outliers, most notably, Sapling and Undetectable.ai, both of which showed surprising results when tested against paraphrased content from Quillbot.

A particularly noteworthy case is Undetectable.ai, a product specifically designed to help users defeat AI detectors. It works by analysing whether text is likely to be flagged as AI-generated, then applying transformations to “humanise” it and make it appear original. Many other similar tools now exist, marketed explicitly to help users avoid detection.

The following table shows the percentage of AI-generated text still detected after some sort of evasion technique. The exact technique varies between studies, but you can see it is generally effective.

Product | Perkins et al (published Sept 2024) | Kar et al (published May 2024) | Lui et al (published May 2024)

Content at scale 48%
Content Detector 64.58%
Copyleaks 58.7% 0%
Crossplag 32.4%
Detect GPT
DupliChecker 0.1%
GPT Zero 16.7% 32.31% 20%
GPTKit 4.5%
Originality.ai 100%
Quillbot 90%
Sapling 100%
Scispace 29%
Turnitin 7.9% 30%
UndetectableAI 100%
Word Tune 0%
ZeroGPT 17.3% 95.03% 88%

Humans Spotting AI-Written Text: Myths and Realities

There are countless tips, tricks, and myths circulating about how to identify AI-generated text. At the time of writing, one popular belief was that the use of an em dash (—) is a strong signal, presumably because it’s less commonly used or harder to type than a hyphen (-). In the past, the word “delve” was frequently flagged as an AI giveaway – ironically, a word I use all the time!

In truth, most of these supposed indicators are unreliable. Unless a reader knows the writer personally and can detect a marked deviation from their usual style, humans generally aren’t good at spotting AI written work.

In blind tests, there is evidence that markers simply don’t spot AI-written work. This was seen in the study by Peter Scarfe and colleagues mentioned earlier, where they added AI-written answers to batches of student submissions and found:

“Overall, AI submissions verged on being undetectable, with 94% not being detected. If we adopt a stricter criterion for “detection” with a need for the flag to mention AI specifically, 97% of AI submissions were undetected”

People are also prone to false positives, more so than many AI detectors. In their paper “Using aggregated AI detector outcomes to eliminate false positives in STEM-student writing”, Jon-Philippe K. Hyatt and colleagues found a false positive rate of 1.3% for their AI detectors and 5% for the human raters.

A similar result was found by Liz Hardie and colleagues in their “Developing robust assessment in the light of Generative AI developments” paper: they found that whilst training people to spot AI work improved their detection rate, it also significantly increased their false positive rate.
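Returning to the idea of aggregating detector outcomes, the sketch below shows one simple way such aggregation could work: only flag a text when every detector agrees. This is an illustration only, not the procedure used by Hyatt and colleagues. The detector names, verdicts and 2% rates are made up, and the combined rate relies on the strong assumption that detectors make their mistakes independently.

```python
def unanimous_flag(verdicts):
    """Flag a text only if every detector flagged it as AI-generated."""
    return all(verdicts.values())


def combined_false_positive_rate(individual_rates):
    """False positive rate of the unanimity rule.

    ASSUMES the detectors err independently, which real tools may not.
    """
    combined = 1.0
    for rate in individual_rates:
        combined *= rate
    return combined


# Hypothetical verdicts for one human-written essay.
verdicts = {"detector_a": True, "detector_b": True, "detector_c": False}
print("Flagged:", unanimous_flag(verdicts))

# Three hypothetical detectors, each with a 2% false positive rate.
joint = combined_false_positive_rate([0.02, 0.02, 0.02])
print(f"Combined false positive rate under independence: {joint:.4%}")
```

Real detectors are likely to be fooled by the same kinds of writing, so the true combined rate would be higher than this idealised figure. Requiring agreement also lowers the detection rate, since a single detector missing the text is enough to prevent a flag.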

So when are people good at spotting AI-written work? The title of the paper “People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text” by Jenna Russell and colleagues is self-explanatory!

The results are striking: people who consider themselves expert GenAI users are hugely better at spotting AI text than those who don’t.

Metric | Non-experts | Experts
Average True Positive Rate (TPR) | 56.7 | 92.7
Average False Positive Rate (FPR) | 51.7 | 4.0
Average Confidence (self-rated) | 4.03 | 4.39

So, the confidence of both groups is similar but the accuracy couldn’t be more different. This underscores the risk of relying on subjective judgments or “gut feeling” in AI detection – particularly by non-expert AI users.
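One way to see what these rates mean in practice is to ask: of the submissions a marker flags, what share really are AI-written? The sketch below reads the TPR and FPR figures above as percentages and combines them with an assumed prevalence of AI-written work. The 10% prevalence is a hypothetical number chosen for illustration, not a finding from any study.

```python
def share_of_flags_correct(tpr, fpr, prevalence):
    """Fraction of flagged submissions that really are AI-written."""
    true_flags = tpr * prevalence
    false_flags = fpr * (1 - prevalence)
    return true_flags / (true_flags + false_flags)


# Hypothetical assumption: 1 in 10 submissions is AI-written.
ASSUMED_PREVALENCE = 0.10

non_experts = share_of_flags_correct(tpr=0.567, fpr=0.517, prevalence=ASSUMED_PREVALENCE)
experts = share_of_flags_correct(tpr=0.927, fpr=0.040, prevalence=ASSUMED_PREVALENCE)

print(f"Non-experts: {non_experts:.0%} of flags would be correct")
print(f"Experts: {experts:.0%} of flags would be correct")
```

Under that assumed prevalence, only around 11% of non-expert flags would be correct, compared with around 72% for experts. The exact figures shift with the prevalence you assume, but the gap between the two groups remains.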

So, what does this mean? Most people aren’t great at spotting AI. Even the best only approach the accuracy of the best tools, and many are very bad at it indeed.

Approaches for an imperfect world

So where does that leave us? And why have I written so much about AI detection when we ‘know it doesn’t work’? It’s because we’ve moved to a situation where much of the focus is on guidance to students. Phillip Dawson and Danny Liu describe this as a discursive approach in their article “Talk is Cheap: Why structural assessment changes are needed for a time of GenAI” and say:

“These frameworks remain powerless to prevent AI use when they rely solely on student compliance. They say much but change little. They direct behaviour they cannot monitor. They prohibit actions they cannot detect. In other words, when it comes to appropriate assessment change for a time of AI, talk is cheap.”

I’ve looked at some of the evidence in this blog, and it strongly supports this.

  • AI is capable of passing many types of assessment, including those designed to be more authentic and reflective of real-world skills.
  • AI detection tools can identify the most obvious examples of AI-generated text, but an entire industry now exists to help users circumvent these tools.
  • Humans are generally worse than AI detectors at identifying AI-generated work, and tend to have higher false positive rates when doing so.

So, where does that leave us?

In an ideal world, we would be urgently re-engineering assessments across the board. It’s increasingly clear that, for many disciplines, the only viable long-term solution is to redesign assessments to include at least some elements that are completely secure and AI-resilient.

One of the most prominent examples of this shift is at the University of Sydney, which differentiates between secure and non-secure assessment pathways. In the UK, Michael Veale and colleagues at UCL have made a clear commitment to ensuring that between 50% and 100% of assessments in the Law faculty are conducted in an AI-secure format.

In the meantime, though, is there a space for AI detection software? The JCQ guidance for colleges suggests its use as part of an overall approach, especially in the examples.

What is absolutely clear, though, is that not all AI detectors are equal. The first question I’d ask if told work had been put through an AI detector would be “which one?”, and the second would be “you or your institution have done a Data Protection Impact Assessment (DPIA) on it, haven’t you?” There are now many detectors on the web with very broad terms and conditions that certainly wouldn’t, for example, protect a student’s intellectual property rights.

If we are going to use an assessment scale type approach, especially scales that imply AI can’t be used to generate any of the final text, we need to ask how we are going to enforce this. If it’s by spotting if AI was used, the best AI detection tools are better than most people and have a far lower false positive rate. This may well be an uncomfortable thing to read.

Some practical advice

We are asked this a lot:

“If students are allowed to use AI in assessments under the traffic light system, but then submit their assignments stating they haven’t used AI and provide no references to it, what is an assessor supposed to do in that situation?” 

We recommend that institutions incorporate the guidance into their overall academic misconduct processes and continue to follow those procedures as they always have. Each institution needs to make its own decision about the set of tools and techniques it will use as part of this process. This might include a combination of:

  • A carefully selected institutional AI detection tool, with relevant training provided.

  • Review by an expert generative AI user to provide an informed opinion.

  • Discussion with the student about the content of their work, their process, and the tools they used.

We suggest that decisions should never be based solely on AI detection, whether by a tool or a person. However, detection tools may be useful in informing discussions with the student.

And to be very clear: you must not simply find an AI detection tool on the internet and run a student’s work through it. It may be wildly inaccurate, and you have no idea what will happen to the student’s work once submitted.

Concluding thoughts: Why does this matter?

Let’s start with one of my least favourite quotes:

“AI won’t take your job, but someone who knows how to use it will.”

Or some variation of it. I’m not entirely sure who first coined it. I’ve seen it so many times at so many events, attributed to so many different people.

I can see where this has come from. Generative AI brought AI into public consciousness, and if November 2022 is your year zero, this might very much appear to be the case. I suppose that’s one of the benefits of being a bit long in the tooth. Having been around AI for a few decades, it becomes easier to step back and see the broader direction of travel. That direction points toward AI becoming an invisible technology, at least to most people, much like electricity. It’s the hidden force that makes things work. And let’s face it…

Electricity won’t take your job, but someone who knows how to use it will

…doesn’t work as a cool presentation slide these days.

Knowing how to use the tools of the day is important. The same was true when word processors arrived, or when the internet became mainstream.  My team are spending a lot of time providing training and resources to help with this because I strongly believe that right now, we are going through a short period when having the ability to understand and use tools that rely on interacting with generative AI is both useful and important, for study, employability and life in general.

It might feel like AI tools are somehow different or vastly more powerful than everything we’ve seen before, but we see this with every wave of technology. The ability to write easily in a non-linear way when word processors first came along was staggering. I used to teach people how to do this!  Nobody talks about that now.

So, I agree we need to ensure our assessment processes give students the skills they need to use the generative AI tools of today. But it mustn’t come at the expense of deep subject expertise, creativity, communication skills, and teamwork.

These are the skills that will still matter long after today’s tools have changed.


Find out more by visiting our Artificial Intelligence page to view publications and resources, join us for events and discover what AI has to offer through our range of interactive online demos.

Join our AI in Education communities to stay up to date and engage with other members.

Get in touch with the team directly at AI@jisc.ac.uk

 
