Past Yonder

A human's thoughts on AI


Adversarial AI: Can the training be tricked?

Modern AI systems continue to wow us with their ability to produce creative works of text, choose the right exposure for a photograph, or keep our car centered within the lanes of a winding highway. But can we really trust them to do the right thing in every circumstance? What if a nefarious actor tried to “poison” an AI system so that it intentionally produced incorrect results in certain situations – a concept called adversarial AI? We’ll explore that possibility in a moment, but first let’s turn the clock back a few decades.


Back when I was in college (which was, increasingly, a while ago, and required walking uphill through the snow every day), I majored in computer science and chose Artificial Intelligence as one of my specialty tracks. At the time, AI research was focused on areas such as neural networks and expert systems, which attempted to emulate the decision-making abilities of human experts.

Inspired by the synaptic connections that occur within our (human) brains, neural networks aimed to produce “correct” outputs based on a wide variety of inputs. For example, if you input the current state of a game of Tic-Tac-Toe to a neural network and ask it for the next best move, a properly trained neural network would provide the ideal move in that scenario.

That, in fact, was one of my programming projects: to produce a neural network that could absolutely dominate in Tic-Tac-Toe. Granted, it’s not a complicated game, and my neural network wasn’t going to pull out any brilliant moves that a child couldn’t figure out. Dominating in Tic-Tac-Toe means achieving a perpetual tie, which is probably why Tic-Tac-Toe has never taken off as a televised sport.

But even getting a neural network to match the skills of a child requires quite a bit of work. To settle into a domain of expertise, a neural network must be trained, which is the process of wiring up all the synaptic connections between the inputs and outputs. First, you feed different inputs through many different layers in the neural network. Those layers, which consist of digital neurons, apply mathematical functions to the inputs, until a particular output flows out the other end.
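
If you’re curious what that actually looks like in code, here’s a minimal sketch – not my original class project, just a tiny illustration in Python with made-up layer sizes – of a board flowing through two layers of weights and a squashing function to produce a score for each square:

    # A tiny, illustrative forward pass: a Tic-Tac-Toe board becomes nine numbers,
    # and two layers of (random, untrained) weights plus a nonlinearity turn them
    # into nine scores, one per square. The layer sizes are arbitrary.
    import numpy as np

    rng = np.random.default_rng(0)

    # Board encoding: 1 = my mark, -1 = opponent's mark, 0 = empty square.
    board = np.array([1, -1, 0,
                      0,  1, 0,
                      0,  0, -1], dtype=float)

    W1 = rng.normal(size=(9, 16))   # input layer -> hidden layer
    W2 = rng.normal(size=(16, 9))   # hidden layer -> output layer

    hidden = np.tanh(board @ W1)    # each "neuron" applies a mathematical function
    scores = hidden @ W2            # one score per square on the board

    # With random, untrained weights, the suggested move is effectively random.
    print("Suggested move:", int(np.argmax(scores)))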

Your first pass at training the network is unlikely to yield great results, since the network starts out with essentially random weights. I used to think it was like a game of Plinko: you drop a disc at the top of the board, and it randomly bounces around various pegs on the way down until it lands in some slot at the bottom – hopefully a slot worth a lot of money. But you didn’t want a random output; you wanted the best output – you wanted to win the game of Plinko so you could pocket some cash and make it to the final showdown.

So you calculate the errors in the output, and then use a process called backpropagation: starting at the output side and moving backward through the network, you update the weights used by those mathematical functions in the hope that doing so reduces future errors.

This process repeats like a game of ping-pong, sometimes through many, many iterations. As the training continues, ideally, the outputs will contain fewer and fewer errors, until further iterations of backpropagation yield diminishing returns – reaching the point of “good enough.” If that doesn’t seem to happen, you might tinker with the number of connections and neurons, or make other seemingly random adjustments to the model until things start to jibe.
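
For the curious, here’s that loop in miniature. It continues the toy network sketched above; the two board positions and their “best move” targets are made up purely for illustration, and a real training set would cover far more positions:

    # A miniature training loop: forward pass, measure the error, backpropagate,
    # repeat until further iterations barely move the needle.
    import numpy as np

    rng = np.random.default_rng(0)

    # Two made-up positions (flattened boards) and one-hot targets marking the
    # move we want the network to prefer in each.
    X = np.array([[1, -1, 0, 0, 1, 0, 0, 0, -1],
                  [0,  0, 0, 1, -1, 0, 0, 0, 0]], dtype=float)
    Y = np.zeros((2, 9))
    Y[0, 6] = 1.0   # pretend square 6 is the best reply to the first board
    Y[1, 0] = 1.0   # ...and square 0 to the second

    W1 = rng.normal(scale=0.5, size=(9, 16))
    W2 = rng.normal(scale=0.5, size=(16, 9))
    lr, prev_loss = 0.5, np.inf

    for step in range(10_000):
        # Forward pass: inputs flow through the layers to produce outputs.
        hidden = np.tanh(X @ W1)
        out = hidden @ W2

        # Measure the error (mean squared error against the targets).
        err = out - Y
        loss = np.mean(err ** 2)

        # Backpropagation: push the error backward and nudge the weights.
        grad_out = 2 * err / Y.size
        dW2 = hidden.T @ grad_out
        dW1 = X.T @ ((grad_out @ W2.T) * (1 - hidden ** 2))
        W2 -= lr * dW2
        W1 -= lr * dW1

        # "Good enough": stop once the improvement becomes negligible.
        if prev_loss - loss < 1e-9:
            break
        prev_loss = loss

    print(f"stopped after {step} iterations, loss = {loss:.6f}")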

A human in an intense game of Tic-Tac-Toe with a computer.

The result is a bunch of weights and functions in a large model that can somehow very quickly convert inputs into correct outputs. Training my Tic-Tac-Toe model was computationally expensive (at least, on the blazing-fast 40-MHz Mac IIfx I had at the time). But once the concrete set, it was very quick to feed data through and yield a result. That was the power of a neural network.

But I found this unsettling, and brought my concern to one of my professors.

“I’ve created this tangle of code and data that can play Tic-Tac-Toe like a boss, but what makes it really… work? It all feels a bit random. How can I really know what is happening within that network?”

The professor gave me a knowing smile, and conceded that this was a weakness of neural networks.

“We don’t really know how to assign any conceptual meaning to all the weights and transformations within the model,” he explained. “In other words, Scott, we don’t really know what makes them work under the hood, just like we really don’t know what makes the human brain work. It’s like an opaque box. The computer doesn’t have any inherent understanding of the game of Tic-Tac-Toe. It’s merely performing transformations on data.”

“So how can we know it’s really producing the best result? Or that it won’t suddenly produce a very bad result for some input?”

“There’s really no way to validate these systems in any scientific way,” the professor replied. “That is, perhaps, a weakness of the technology.”

Artificial Intelligence has advanced quite a bit since my college days, but that fundamental concern I had – we really don’t know what makes these models work, do we? – exists just as much today as it did then. Only today, the opaque boxes are much, much larger, and with consequences much greater than simply losing a game of Tic-Tac-Toe.

As I described in an earlier story, vast amounts of content are being consumed by today’s AI systems as part of their training, so much so that even the entire Internet isn’t satisfying their voracious appetite for data.

The techniques used to transform inputs (such as “draw me a picture of a skateboarding aardvark on the moon”) into outputs (such as the lovely picture presented below) vary, but they all effectively end up with some very large model that consists of a bunch of mathematical transformations. And by a bunch, I mean a whole bunch. So much so that no human could manually examine every aspect of a model or begin to understand how it might react in various circumstances.

A skateboarding aardvark on the moon.

And unlike the Tic-Tac-Toe model that I created in college, where I was in full control over the creation of the data used to train it, many of today’s AI systems are being trained on data from untrusted sources. While many AI companies are secretive about the exact sources of data they use for training, it’s likely that they’re mining web content such as Wikipedia, YouTube, Instagram photos, and a whole lot more. In most cases, that consists of data created by anonymous users. Anyone can post a photo to Instagram; anyone can share a video on YouTube.

Humans make mistakes, so that data can include inaccuracies or biases that find their way into the models, leading to effects such as hallucinations, where a model confidently and authoritatively presents an answer that is, quite simply, wrong.

But what if humans try to intentionally feed a model bad training data in order to affect its future results? That’s a growing concern among AI researchers, and they’ve coined a name for it: adversarial AI. Thankfully, the concept remains largely theoretical, although there have been instances of adversarial AI in the wild, including a rather public one a few years ago.

In 2016, Microsoft released a chatbot called Tay, inviting the public to interact with it. Like many AI-powered systems, Tay was designed to learn as it interacted with humans, in an effort to continually improve its performance. Recognizing this, some human users decided to have a bit of fun, creating interactions with Tay that were designed explicitly to manipulate the chatbot’s future outputs.

In short, by poisoning the data they were feeding into Tay, they caused Tay to very quickly evolve into a racist, sexist jerk. Microsoft quickly shut down the experiment.

While this form of adversarial AI was perhaps more silly than dangerous, it caught the attention of security researchers, who began to realize the perils of training models off of user-supplied, or untrusted, data.

In a 2018 paper titled DARTS: Deceiving Autonomous Cars with Toxic Signs, authors Chawin Sitawarin, Arjun Nitin Bhagoji, Arsalan Mosenia, Mung Chiang, and Prateek Mittal described how nefarious actors could craft “toxic” traffic signs that a car’s sign-recognition model misclassifies, with disastrous (and potentially fatal) consequences for self-driving cars. Whether a model is fooled by a doctored sign at the roadside or poisoned during its training, what makes these kinds of attacks so haunting is that it might not be obvious that anything is wrong until it is too late. And even then, it might not be easy to determine that an incorrect output was the result of an intentional attack.
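
To make the danger concrete, here’s a deliberately tiny sketch of training-data poisoning. None of it comes from the DARTS paper: the dataset, the “trigger” feature, and the off-the-shelf decision tree standing in for a neural network are all invented for illustration. The attacker contributes training examples that pair an otherwise unused trigger with the wrong label, so the model looks trustworthy on ordinary inputs and fails only when the trigger appears.

    # A toy illustration of a poisoned training set. The attacker's examples tie
    # a normally unused "trigger" feature to the wrong label, so the trained
    # model behaves well on clean inputs and misbehaves only when triggered.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def make_clean(n):
        """Two ordinary features plus a trigger feature that is normally 0."""
        x = rng.normal(size=(n, 2))
        trigger = np.zeros((n, 1))
        y = (x[:, 0] > 0).astype(int)   # the honest rule: class follows feature 0
        return np.hstack([x, trigger]), y

    X_clean, y_clean = make_clean(2000)

    # The attacker's contribution: class-1 inputs with the trigger switched on,
    # deliberately labeled as class 0.
    X_poison, y_poison = make_clean(400)
    X_poison, y_poison = X_poison[y_poison == 1], y_poison[y_poison == 1]
    X_poison[:, 2] = 1.0
    y_poison[:] = 0

    model = DecisionTreeClassifier(random_state=0)
    model.fit(np.vstack([X_clean, X_poison]), np.concatenate([y_clean, y_poison]))

    # On ordinary test data, the poisoned model still looks trustworthy...
    X_test, y_test = make_clean(1000)
    print("accuracy on clean inputs:    ", model.score(X_test, y_test))

    # ...but switch the trigger on and the same class-1 inputs flip to class 0.
    X_trig = X_test[y_test == 1].copy()
    X_trig[:, 2] = 1.0
    print("accuracy on triggered inputs:", model.score(X_trig, y_test[y_test == 1]))

The unsettling part is visible in the two numbers: the clean accuracy looks excellent, so ordinary testing would give the model a passing grade, while the triggered inputs fail almost completely.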

This continues to be an area of active research, and is perhaps one of many examples where the pace of AI development is outpacing our ability to put the necessary safeguards in place.

When we consult with human experts such as doctors or lawyers, we can generally trust their bona fides by validating what schools they went to and what degrees they obtained. If they make a medical or legal recommendation to us, we can ask how they arrived at it, and expect to hear a reasoned, thoughtful response.

But when we consult with AI systems for their virtual expertise, we can’t receive these kinds of assurances. They present answers (or decisions, like whether to bring a car to a stop at a stop sign) that are difficult for a human to validate, and the AI systems themselves have no inherent understanding of how they arrived at a result; they’re just performing mindless transformations to turn inputs into outputs.