I was a child when Texas Instruments unveiled the TI-81, its first graphing calculator.
This was no normal calculator. For a fledgling nerd like me, who proudly wore a Casio wristwatch that had more features than I could count, it was an item of lust.
Its face was adorned with dozens of buttons in varying shades of blues and grays. The top third was taken up by a ginormous black-and-white LCD screen featuring a palatial canvas of 96×64 pixels.
It barely fit in my hand. And it came with a cover that slid off the top and onto the back with a very satisfying click.

Sure, it could add, subtract, multiply, and even divide. But what made it so exciting was that you could write programs on it. It included a flavor of the BASIC programming language, and it retained your programs and data even when you changed the batteries – assuming you had a newer model with a backup battery in addition to the four AAs it used, or were quick enough at swapping the AA batteries out one at a time, a skill I mastered with finesse.
Although the TI-81 was my dream calculator, it was a nightmare for teachers. While some classes allowed students to use traditional calculators during an exam, the TI-81 could do much more than perform simple arithmetic; it could store formulas, answers to likely questions, and other tidbits to give a student an edge.
In other words, creative students could cheat with it.
This presented an ethical quandary for teachers. On the one hand, it was a powerful teaching tool, one of the first calculators capable of quickly drawing graphs and helping students visualize mathematical formulas. On the other hand, not every student could afford this Rolls Royce of calculators. Students stuck with standard, inexpensive calculators were put at a disadvantage.
And then there was the matter of tests. If a teacher allowed calculators during an exam, how could they tell whether students were using their TI-81s to perform simple arithmetic, versus using the advanced features to cheat in some way?
The solution to that ended up being pretty simple. Each classroom was stocked with enough basic calculators for each student, and during exams, students were given an “authorized” calculator to borrow. They couldn’t use their own.
Fast forward to college, where I served as a Teaching Assistant and helped grade student assignments in introductory computer science classes.
I was cautioned that students sometimes took “collaboration” too far, conspiring with each other to submit identical (or largely identical) code for assignments that were meant to be worked on individually.
One of my professors taught me his foolproof solution to catch students in the act: the “light test.” He demonstrated this by taking two assignment submissions, placing them on top of each other, and holding them up to the light. If the physical structure of the code was roughly the same, or if the characters matched up perfectly, he suspected plagiarism was at play.

Students would try to circumvent his light test by changing the names of parameters and variables, but that was easy to detect. Coming up with variable names is easy; coming up with the logic and algorithms that make a program work is much harder, and the cheating students weren’t very good at modifying the logic to appear different. If they could do that, they wouldn’t need to waste time cheating in the first place. It would take less time to just do the assignment from scratch.
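To make that concrete, here’s a contrived example of my own (not an actual student submission): two solutions whose names differ entirely, but whose logic is identical, line for line.

```python
# Submission A
def average(grades):
    total = 0
    for grade in grades:
        total += grade
    return total / len(grades)


# Submission B: every name has changed, but the structure is line-for-line the same
def mean(scores):
    running_sum = 0
    for score in scores:
        running_sum += score
    return running_sum / len(scores)
```

Hold printouts of those two up to the light and they line up almost perfectly.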
Catching students in the act of cheating has always been a cat-and-mouse game, with students finding new ways to cheat (such as buying “pre-written” essays online), and teaching institutions finding ways to combat it (such as running student submissions through software that looks for telltale signs of plagiarism).
But that cat-and-mouse game recently became a mountain lion-and-mouse game. With generative AI, the advantage has tipped significantly towards the cheaters.
In mere seconds, large language models (LLMs) such as ChatGPT can be asked to write essays of arbitrary length on almost any topic. You can steer them towards writing in a particular style, or concentrating on certain aspects, and they’ll eagerly oblige. They can be very useful tools for brainstorming, editing, and other activities that don’t necessarily cross the line into cheating, but that line is a fine one, and these models are more than happy to aid and abet students attempting to pass off others’ work (in this case, the LLM’s, and, I suppose, that of all the humans whose writing it was trained on) as their own.
What’s different now is that teachers have very few options for detecting when a student has used AI to cheat. The few options that do exist aren’t very good, and unfortunately, they’re unlikely to get better any time soon.
WIRED recently compiled a list of articles that touch on the various techniques used to detect whether something was created by an AI system. The gist of the compilation is that none of the techniques really work, either failing to detect when something was AI-generated, or falsely claiming something was AI-generated when it wasn’t.
There are a variety of reasons why AI-detection is such a difficult problem.
Plagiarism is the act of presenting someone else’s work as your own, and if teachers can find the original work a student copied, they can show that the student cheated. But finding an “original” work for AI-generated content is likely impossible, since an AI can generate an effectively unlimited number of responses to a given prompt. In other words, if two students ask ChatGPT to write a paper on the same subject, the resulting papers will differ, each being a unique set of words not found anywhere else in the world.
One technique for AI detection is to look for certain “watermarks” left behind by the AI system. These watermarks might be unintentional byproducts of the system’s design. For example, perhaps a generator has a bias towards certain phrases or sentence structures while avoiding others.
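As a toy illustration of that statistical idea (the phrase list and the metric here are made up by me, not taken from any real detector), a crude checker might simply measure how often a text leans on phrasing that a particular generator supposedly overuses:

```python
import re

# Hypothetical phrases a particular generator might overuse; a real detector
# would learn such biases statistically rather than hard-coding a list like this.
SUSPECT_PHRASES = [
    "delve into",
    "it is important to note",
    "in conclusion",
    "furthermore",
]


def suspicion_score(text: str) -> float:
    """Rate of 'suspect' phrases per 100 words (a made-up metric for illustration)."""
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in SUSPECT_PHRASES)
    return 100.0 * hits / len(words)


sample = "It is important to note that we must delve into the data. Furthermore, the results speak for themselves."
print(f"Suspicion score: {suspicion_score(sample):.2f}")  # higher = more 'AI-like', supposedly
```

Of course, plenty of humans write exactly this way, which is one reason detectors built on such signals produce the false positives described below.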
Or, a watermark might be placed deliberately. Many AI-focused companies are beginning to embed watermarks in their AI-generated images to make it obvious that the images were artificially created. However, these watermarks can often be stripped away by nefarious actors, or unintentionally obscured by normal image manipulation such as cropping or rotating. A WIRED story last fall described how researchers were able to trivially break all major AI watermarking systems.
There are various commercial tools available that claim to help schools and universities detect AI-generated content, but while they generated some early excitement (and sales), they haven’t worked very well, making some professors and teachers wary of using them. False positives can lead to false accusations of cheating, and with no way of definitively proving that content was plagiarized, teachers are hesitant to lob accusations that could carry severe consequences for students.
But it is likely that students submit millions of papers each year containing at least some AI-generated content.
And generative AI technology is advancing at a far faster rate than AI-detection techniques, so this problem is only going to get worse.
Some teachers have determined that the only way to combat this is to go back to an old-style technique: oral exams. The approach doesn’t scale well, but by talking with and quizzing students individually, teachers can assess whether a student has a solid understanding of a given subject.
At least, assuming there isn’t an AI whispering into their earbuds.
