
Pangram AI Detection, ESL Writing, and Reasoning Models: An Honest Look

A fair, teacher-focused evaluation of how Pangram's AI detector handles non-native English writing and the new wave of reasoning models, and what the claims actually mean for your classroom.

The Checkmark Plagiarism Team

Every few months a new AI detector arrives with a headline number attached. Ninety-nine percent accuracy. Near-zero false positives. Detects the latest model on day one. Pangram is one of the more talked-about names in that conversation right now, and it has published its own posts on two of the questions teachers actually care about: how it handles writing from English language learners, and how well it catches output from the new reasoning models, such as OpenAI's o-series and similar models from other labs. Those are exactly the right questions. They are also the two places where almost every detector quietly struggles. So it is worth slowing down and asking what the claims really mean before you let them shape how you grade.

This is not a takedown and it is not an ad. We build a detection product ourselves, so we have every incentive to be skeptical of a competitor and every reason to be honest about where the whole category is hard. Here is the fair version.

Why ESL writing is the hardest case for any detector

Start with the uncomfortable truth that predates Pangram and applies to all of us. AI detectors learn to recognize statistical fingerprints of machine text: smoother word choices, more predictable sentence rhythm, a certain flattened evenness. The problem is that fluent-but-non-idiomatic human writing can look the same to a model. A student writing in their second or third language often produces simpler vocabulary, more regular structure, and fewer of the quirky, low-probability word choices that detectors treat as a signature of being human.

That overlap is not a Pangram-specific bug. It is the central fairness problem of the field, and it has real history. A widely cited 2023 Stanford study (Liang et al., "GPT detectors are biased against non-native English writers") found that several popular detectors flagged, on average, more than half of a set of TOEFL essays by non-native English speakers as AI-generated while rarely misclassifying essays by native writers. When a detector's mistakes land disproportionately on one group of students, that is not a rounding error. That is a civil rights problem wearing a percentage sign.

So when Pangram publishes a post arguing it performs well on ESL text, the instinct should be neither to believe it nor to dismiss it. The instinct should be to ask the follow-up questions that separate a real result from a marketing line.

Reading the ESL claim like a teacher, not a buyer

A claim like "accurate on ESL writing" is only as good as the test behind it. Here is what actually matters.

First, what was the false positive rate specifically on human ESL writing? Overall accuracy can stay high while the error concentrates entirely on the group you are worried about. The number you want is not "how often is Pangram right," it is "how often does Pangram call a real ESL student's real essay AI." If that figure is not stated plainly and separately, the headline does not answer your question.
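To make the arithmetic concrete, here is a minimal Python sketch. The numbers are invented for illustration and are not Pangram's results or anyone else's. The function names and the (group, is_ai, flagged) record layout are also our own assumptions. It shows how a detector can be 99 percent accurate overall while wrongly flagging one in five human ESL essays.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Share of essays where the detector's flag matches the true origin."""
    records = list(records)
    correct = sum(1 for _, is_ai, flagged in records if is_ai == flagged)
    return correct / len(records)


def false_positive_rate_by_group(records):
    """For each group, the share of HUMAN-written essays flagged as AI."""
    human_total = defaultdict(int)
    human_flagged = defaultdict(int)
    for group, is_ai, flagged in records:
        if is_ai:
            continue  # false positives are only defined on human writing
        human_total[group] += 1
        if flagged:
            human_flagged[group] += 1
    return {g: human_flagged[g] / human_total[g] for g in human_total}


# Hypothetical test set: (group, is_ai, flagged). All counts are made up.
records = (
    [("native", False, False)] * 450   # native human essays, correctly cleared
    + [("esl", False, False)] * 40     # ESL human essays, correctly cleared
    + [("esl", False, True)] * 10      # ESL human essays, wrongly flagged
    + [("native", True, True)] * 500   # AI essays, correctly flagged
)

accuracy = overall_accuracy(records)
fpr = false_positive_rate_by_group(records)
print(f"overall accuracy: {accuracy:.2%}")
for group, rate in sorted(fpr.items()):
    print(f"false positive rate ({group}): {rate:.2%}")
```

In this made-up set the headline accuracy is 99 percent, the false positive rate for native writers is zero, and the false positive rate for ESL writers is 20 percent. The headline number and the number you care about are simply different calculations.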

Second, where did the test essays come from? A detector tuned on one population of non-native writers, say university applicants from a handful of countries, may behave very differently on a tenth-grade ELL classroom with a different first language and a different level. Generalization is the whole game, and a clean in-house benchmark does not prove it.

Third, what is the confidence band on a single essay? A detector can be calibrated well across thousands of documents and still be shaky on the one paper sitting in front of you. Aggregate accuracy is a property of a pile of essays. A grade is a decision about one student.

To Pangram's credit, the company tends to publish more methodology than the average vendor, and treating ESL detection as a named problem worth a dedicated post is better than pretending it does not exist. That is the right posture. The caution is simply that no published number, from Pangram or from us, should ever be the thing that decides a case on its own.

The reasoning-model problem is newer and stranger

The second post tackles a moving target. Reasoning models, the ones that generate long internal chains of thought before answering, write differently from the chatbots detectors were originally trained on. They produce more structured arguments, more deliberate transitions, and sometimes a slightly more "considered" texture than a quick GPT reply. For a detector, a new model is a new distribution, and the honest baseline expectation is that detection accuracy dips when a model arrives that nobody has trained against yet.

Pangram's argument is essentially that its approach holds up against these newer models. There is a plausible mechanism for that. Detectors built to capture broad statistical properties of machine text, rather than memorizing the quirks of one specific model, can transfer better to unseen models than people expect. But "can" is doing heavy lifting in that sentence. The only way to know is continuous testing against each new release, and the only honest framing is that today's number is a snapshot with a short shelf life.

There is also a subtler wrinkle with reasoning models. Students rarely paste raw output. They paraphrase it, run it through a humanizer, mix it with their own sentences, or ask the model to write "like a tired teenager." Every one of those steps degrades a detector's signal. A score against clean, unedited reasoning-model text is a best case, not a typical case. The messy real-world version is where detection gets genuinely hard, and it is the version your students are actually living in.

What none of these numbers can do

Here is the part that matters most, and it is true of every detector including our own. A probability is evidence, not a verdict. A score of ninety percent AI is a reason to look closer. It is not proof, and it is certainly not grounds for an academic integrity charge on its own.
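One way to see why is a quick base-rate calculation. The sketch below applies Bayes' rule to a flag. Every input is an assumption chosen for illustration, not a measured figure from any product: the detection rate, the false positive rate, and the share of submitted essays that are actually AI-written.

```python
def probability_flag_is_correct(true_positive_rate, false_positive_rate, base_rate):
    """Chance that a flagged essay really is AI-written (Bayes' rule).

    true_positive_rate: share of AI essays the detector flags (assumed)
    false_positive_rate: share of human essays the detector flags (assumed)
    base_rate: share of all submitted essays that are AI-written (assumed)
    """
    flagged_ai = true_positive_rate * base_rate
    flagged_human = false_positive_rate * (1.0 - base_rate)
    total_flagged = flagged_ai + flagged_human
    if total_flagged == 0:
        raise ValueError("the detector never flags anything under these inputs")
    return flagged_ai / total_flagged


# Hypothetical classroom: 10% of essays are AI-written, detector catches 95% of them.
ppv_low_fpr = probability_flag_is_correct(0.95, 0.01, 0.10)   # 1% false positives
ppv_high_fpr = probability_flag_is_correct(0.95, 0.20, 0.10)  # 20% false positives

print(f"flag is correct (1% FPR):  {ppv_low_fpr:.1%}")
print(f"flag is correct (20% FPR): {ppv_high_fpr:.1%}")
```

Under these assumptions, a flag is right about 91 percent of the time when the false positive rate is 1 percent, and only about 35 percent of the time when it is 20 percent. The same flag carries very different weight depending on who wrote the essay and how often the detector errs on writers like them.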

The danger with a strong, confident number is that it invites teachers to outsource judgment to a dashboard. That is exactly backwards. The right workflow treats any detector as one input alongside the things software cannot see: the student's draft history, their performance in class, a quick conversation, a comparison to their earlier writing. Tools like version history and revision tracking often tell you more than a single percentage ever will, because they show the work being built rather than guessing at the finished product.

This is especially urgent for the two cases Pangram is writing about. The ESL student wrongly flagged and the reasoning-model essay missed are mirror images of the same lesson: the cost of a mistake is not symmetric. A missed case of cheating is a missed teaching moment. A false accusation against a vulnerable student can derail a kid's year. Treat those outcomes with the gravity they deserve and the percentage stops being the headline.

The bottom line for your classroom

Pangram is doing something we respect, which is publishing on the hard cases instead of hiding from them. ESL writing and reasoning models are the two frontiers where the entire detection field is being tested, and naming them openly is the right move. Read its posts. Read ours. Run your own informal trials on writing you already know the origin of, because nothing builds calibrated trust like watching a tool succeed and fail on samples you can verify yourself.
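If you do run an informal trial, keep in mind how little a small sample can tell you. The sketch below uses the Wilson score interval, a standard textbook formula, to put a rough 95 percent range around a false positive rate measured on a handful of essays. The function name and the sample counts are hypothetical.

```python
import math


def wilson_interval(flagged, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    flagged: number of known-human essays the detector flagged
    total:   number of known-human essays you tested
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= flagged <= total:
        raise ValueError("flagged must be between 0 and total")
    p = flagged / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical trial: 20 essays you know were written by students, by hand.
low_none, high_none = wilson_interval(0, 20)  # detector flagged none of them
low_one, high_one = wilson_interval(1, 20)    # detector flagged one of them

print(f"0 of 20 flagged: plausible rate {low_none:.1%} to {high_none:.1%}")
print(f"1 of 20 flagged: plausible rate {low_one:.1%} to {high_one:.1%}")
```

With only 20 essays, a clean sheet of zero flags is still compatible with a true false positive rate of roughly 16 percent. A small trial is good for building intuition. It is not enough to certify a tool as safe for a whole class.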

Then keep the percentage in its proper place: a flag that starts a conversation, never a gavel that ends one. The best AI detector in the world is still just the second-best teacher in the room.
