AI Content Detection Tools: We Tested Five and the Results Were Mixed


AI content detection has become a cottage industry. Publishers, educators, clients, and SEO teams all want to know whether a piece of content was written by a human or an AI. The tools promising to answer this question have multiplied rapidly, but their accuracy claims rarely hold up under scrutiny.

I tested five popular AI content detectors against a controlled set of 50 text samples: 20 written entirely by humans, 15 generated entirely by AI (GPT-4, Claude, and Gemini), and 15 mixed samples (AI-generated drafts edited by humans). Each sample was 600-800 words.
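
For transparency, the scoring behind the numbers below was simple arithmetic. Here is a minimal sketch, assuming each detector’s verdict is collapsed to one of three labels; the detector calls themselves are stand-ins, not real API code.

```python
# Minimal scoring harness: per-category and overall accuracy for one detector.
# `results` pairs each sample's true label with the detector's verdict.
from collections import Counter

LABELS = ("human", "ai", "mixed")

def accuracy_report(results: list[tuple[str, str]]) -> dict[str, float]:
    """results = [(true_label, predicted_label), ...]"""
    totals, correct = Counter(), Counter()
    for truth, prediction in results:
        totals[truth] += 1
        if prediction == truth:
            correct[truth] += 1
    report = {label: correct[label] / totals[label] for label in LABELS}
    report["overall"] = sum(correct.values()) / sum(totals.values())
    return report

# e.g. Originality.ai's row: 15/20 human + 12/15 AI + 6/15 mixed = 33/50 = 66%
```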

The results were… instructive.

The Tools Tested

  • Originality.ai — positioned for content marketers and publishers
  • GPTZero — widely used in education
  • Copyleaks — enterprise-focused with API access
  • ZeroGPT — free tool with a simple interface
  • Turnitin AI Detection — integrated into the academic plagiarism platform

Overall Accuracy

Here’s the headline: none of these tools are reliable enough to make definitive judgments about individual pieces of content.

Tool             Correctly ID’d Human   Correctly ID’d AI   Correctly ID’d Mixed   Overall Accuracy
Originality.ai   75% (15/20)            80% (12/15)         40% (6/15)             66%
GPTZero          70% (14/20)            73% (11/15)         33% (5/15)             60%
Copyleaks        80% (16/20)            67% (10/15)         27% (4/15)             60%
ZeroGPT          55% (11/20)            87% (13/15)         20% (3/15)             54%
Turnitin         85% (17/20)            60% (9/15)          33% (5/15)             62%

The accuracy percentages look mediocre, and they are. But the pattern matters more than the averages.

False Positives Are the Biggest Problem

False positives, where human-written content is wrongly flagged as AI-generated, happened consistently across all tools. Five of my 20 human-written samples were flagged as “likely AI” by at least two tools. Two were flagged by three or more.

The falsely flagged human content shared certain characteristics: formal writing style, structured argumentation, and technical vocabulary. In other words, well-written, professional content that follows clear logic triggers AI detectors because it resembles the kind of text AI produces.

This has real consequences. Students have been wrongly accused of cheating. Freelance writers have lost clients because their work was flagged as AI-generated. Content creators are being penalised for writing well.

ZeroGPT was the worst offender for false positives, flagging 9 of 20 human samples as AI-generated. At a 45% false positive rate, the tool is essentially a coin flip for human content.

AI-Only Content Was Easier to Detect (Mostly)

Pure AI-generated content was detected most reliably, particularly by Originality.ai and ZeroGPT. These tools have trained on large datasets of AI output and can identify statistical patterns in word choice, sentence structure, and token probability distributions that characterise machine-generated text.
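
These statistical fingerprints can be probed directly. Below is a minimal sketch that scores text by its perplexity under an open-source model (gpt2 here, chosen purely for illustration; commercial detectors use their own models and far richer feature sets):

```python
# Score text by perplexity under a small causal language model.
# Lower perplexity = more predictable word choices = more "AI-like"
# by the crude logic these detectors apply.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing labels makes the model return mean cross-entropy over tokens.
        output = model(**encoded, labels=encoded["input_ids"])
    return torch.exp(output.loss).item()

print(perplexity("The committee will review the proposal at its next meeting."))
```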

But even here, accuracy was imperfect. Some AI-generated samples passed as human, particularly those produced with specific prompting techniques (writing in first person with informal language, deliberately varying sentence length, or asking the AI to include personal anecdotes).

The arms race between generation and detection is real. As detection tools improve, generation techniques adapt. AI models are increasingly capable of producing text that doesn’t match the statistical patterns detectors look for.

Mixed Content Is a Black Box

The most realistic scenario, content produced through human-AI collaboration, was where every tool struggled badly. Only 20-40% of mixed-content samples were correctly identified as AI-assisted, depending on the tool.

This isn’t surprising. When a human takes an AI-generated draft and edits it — changing sentences, adding personal observations, restructuring paragraphs, and injecting their own voice — the result is genuinely a blend that doesn’t fit neatly into “human” or “AI” categories.

The detectors typically classified mixed content as either fully human or fully AI, rarely acknowledging the mixed nature. Originality.ai does provide a percentage likelihood, which is conceptually better, but the percentages didn’t correlate well with the actual degree of AI involvement.
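
One way to test that claim is to correlate the tool’s reported percentage against the known share of AI text in each mixed sample. A sketch of that check, with illustrative placeholder numbers rather than my measured values:

```python
# Does the detector's "% likely AI" track the actual AI share of each sample?
import numpy as np

actual_ai_share = np.array([0.2, 0.4, 0.5, 0.7, 0.9])  # hypothetical ground truth
detector_score  = np.array([0.8, 0.1, 0.9, 0.3, 0.6])  # hypothetical tool output

r = np.corrcoef(actual_ai_share, detector_score)[0, 1]
print(f"Pearson r = {r:.2f}")  # ~ -0.16 here: score doesn't track AI share
```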

The Technical Limitations

AI content detectors work by analysing statistical properties of text — perplexity (how predictable the word choices are) and burstiness (variation in sentence complexity). AI text tends to be more uniform in both metrics.

But these are statistical tendencies, not definitive markers. Human writers who are methodical and consistent can produce text with low perplexity. AI prompted to be creative and varied can produce text with high burstiness. The overlap between human and AI statistical profiles is substantial.
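
To make burstiness concrete, here is a toy implementation, assuming the common sentence-length-variation definition (real detectors use richer features than this):

```python
# Toy burstiness: coefficient of variation of sentence lengths.
# Uniform prose scores near zero; varied prose scores higher.
import re
from statistics import mean, pstdev

def burstiness(text: str) -> float:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return pstdev(lengths) / mean(lengths)

uniform = "The method is sound. The data is clean. The result is clear."
varied = ("It failed. After weeks of re-running the pipeline with every "
          "configuration we could think of, nothing changed. Odd.")
print(burstiness(uniform), burstiness(varied))  # 0.0 vs roughly 1.0
```

A methodical human writer lands on the uniform end of that spectrum, which is exactly how false positives happen.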

Language also matters. Most detectors are trained primarily on English text. They perform worse on other languages and on English written by non-native speakers — whose writing patterns may diverge from the “native English human” baseline the detectors expect.

Technical and scientific writing has naturally lower burstiness than creative writing, which is why it gets flagged more often. A researcher writing a methodology section uses precise, consistent language — and gets flagged as AI. The detection tools conflate “structured, formal writing” with “machine-generated,” which is a fundamental flaw.

What This Means Practically

For content publishers: Don’t rely on AI detectors as the sole arbiter of content quality or origin. Use them as one signal among many. Focus on editorial review, fact-checking, and whether the content provides genuine value rather than whether a detector says it’s AI.

For educators: Turnitin’s integration with existing academic workflows makes it the most practical option, but its 60% accuracy on AI content and 15% false positive rate mean it shouldn’t be used as sole evidence of academic dishonesty. Pair detection tools with conversation — ask students to explain their reasoning and demonstrate understanding of what they submitted.
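
The base-rate arithmetic shows why. Assuming, purely for illustration, that 20% of submissions in a cohort are AI-generated, Turnitin’s observed rates imply that a flag is no better than a coin flip:

```python
# Bayes' rule with Turnitin's measured rates: 60% detection, 15% false positives.
# The 20% prevalence of AI submissions is an assumption for illustration only.
def p_ai_given_flag(tpr: float, fpr: float, prevalence: float) -> float:
    flagged = tpr * prevalence + fpr * (1 - prevalence)
    return tpr * prevalence / flagged

print(p_ai_given_flag(tpr=0.60, fpr=0.15, prevalence=0.20))  # 0.5
```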

For SEO and content teams: Google has stated that its ranking systems focus on content quality regardless of production method. Obsessing over AI detection scores is less productive than focusing on whether content is accurate, helpful, and original. That said, understanding what AI detection looks for can inform your content workflows.

For teams working out how AI fits into their content production processes, AI project delivery expertise can help design workflows that maintain editorial standards and produce high-quality output from the start, rather than relying on detection tools after the fact.

For writers worried about false flags: If you write clearly and logically, your content may be flagged as AI by some tools. This is a problem with the tools, not with your writing. Keep writing well. If a client or employer uses detection tools, have a conversation about the tools’ limitations rather than deliberately degrading your writing quality to appear “more human.”

The Bottom Line

AI content detection tools in 2026 are better than random chance but worse than their marketing claims. They work best on pure AI-generated content and worst on the most common real-world scenario — AI-assisted human writing. False positives remain a significant problem, particularly for clear, structured prose.

These tools will improve over time, but so will AI writing. The detection game is inherently asymmetric: detectors need to identify statistical patterns, while generators only need to avoid them. The long-term trajectory favours the generators.

Rather than treating AI content detection as a binary pass/fail, it’s more useful to think of it as one imperfect signal in a broader quality assessment. The question “was this written by AI?” is becoming less meaningful than “is this content accurate, useful, and worth publishing?” Focus on the latter.