Attacking AI using (and defending against) input manipulation attacks
This blog post is the first in a series of articles sharing my learnings in the areas of attacking and defending AI.
The techniques and knowledge shared here are for the collective good of the community, and I'm not responsible for any malicious usage of the same. Be responsible and don't be an idiot.
Background - Why AI Security?
To me, AI is a broad term for algorithms and technologies that let machines perform tasks which generally require human intelligence, such as problem solving and reasoning. AI is currently used in many applications across many sectors, but the sectors we're most interested in are those where a failure could mean loss of life, the bypass of a security system, and so on.
Imagine these scenarios:
There is a "Be on lookout" guidance on a specific vehicle number. This could be issued post identification of a miscreant using the vehicle. An AI system responsible for identifying this vehicle is bypassed by an adversary.
Self-driving cars are very popular and are evolving as we speak. Imagine an adversary figuring out a way to crash a vehicle or render it useless.
AI is being used in medicine, transportation, and productivity, among many other sectors, and these are precisely the sectors an adversary would love to disrupt or harm.
Approach
We can all agree that AI security is important. The intention of this blog is to discuss these attacks from a security engineer's perspective rather than a data scientist's, because a security engineer looks at the AI system together with the connected systems (systems which leverage the AI models or provide input to them).
What's an Input Manipulation attack?
According to OWASP's page:
Input Manipulation Attacks is an umbrella term, which include Adversarial attacks, a type of attack in which an attacker deliberately alters input data to mislead the model.
Popularly known as "Adversarial Examples" in the AI world, this is basically a situation where an adversary creates specially crafted input in order to influence the outcome of the AI system. These attacks are created by giving inputs that are intentionally designed to cause an AI model to make a mistake. They are like optical illusions for machines. These inputs can be either in images or audio or textual format.
Exploitation Techniques
This section covers a few easy-to-exploit and in-the-wild (ITW) exploitation techniques that I have observed or identified. In no way do I claim this to be a definitive guide or an exhaustive list.
Hide the input from being processed
Description
It's common sense that if an input is unreadable or unrecognizable, the AI system cannot function as expected. Hiding input from being processed can be done in a plethora of ways.
How to prevent?
If an AI system can be fooled by just hiding the input, the system isn't built to sanitize input properly. When the system detects a malformed input (for example, input it cannot read at all), it should handle the situation gracefully, either by asking for a better input or by reducing its confidence. The system consuming the AI output should handle this situation by alerting the user about a possible misclassification.
One elegant way to manage such a scenario is how proxies classify websites that are unreachable. Red teamers often use captchas to prevent their websites from being scanned; proxy systems then classify the website as "Unclassified" and block it by default (in most cases).
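A minimal "fail closed" sketch of that idea is below. The `classify()` function, its error behaviour, and the confidence threshold are all assumptions for illustration, not any real product's API: anything the model can't read, or isn't sure about, is treated like the proxy's "Unclassified" bucket and blocked for human review.

```python
# Sketch of a consuming system that fails closed on unreadable or low-confidence input.
# `classify` is a hypothetical wrapper around your model, returning (label, confidence).
CONFIDENCE_FLOOR = 0.80  # assumed threshold; tune for your own model


def handle_input(raw_input, classify):
    try:
        label, confidence = classify(raw_input)  # hypothetical model call
    except ValueError:
        # The model couldn't parse the input at all (e.g. an unreadable image).
        return {"decision": "blocked", "reason": "unreadable input, manual review required"}

    if confidence < CONFIDENCE_FLOOR:
        # Low confidence: don't silently trust the label; ask for a better input instead.
        return {"decision": "blocked", "reason": f"low confidence ({confidence:.2f}) for '{label}'"}

    return {"decision": "allowed", "label": label, "confidence": confidence}
```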
In the wild exploitation case studies
Scenario: Automatic number plate recognition (ANPR) system bypass
Automatic number plate recognition systems are used in almost all developing and developed countries, for example to detect stolen vehicles or cars involved in nefarious activities.
By just adding dirt to a number plate, it's possible to bypass these systems. That being said, not all ANPR algorithms are vulnerable to this, and it's quite possible to detect dirty number plates, but achieving 100% accuracy isn't possible if the number can't be read completely. Below is an example of a number plate partially covered with dirt bypassing a commercially available number plate recognition system.
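To play with the idea without a commercial product, here's a rough sketch using OpenCV and the open-source Tesseract OCR engine (via pytesseract). The `plate.jpg` path is a placeholder, and real ANPR pipelines are more sophisticated than plain OCR, so treat this as an approximation only.

```python
# Sketch: simulate "dirt" by painting an opaque blob over part of a plate image,
# then compare what the OCR engine reads before and after.
# Requires opencv-python and pytesseract (plus a local Tesseract install).
import cv2
import pytesseract

plate = cv2.imread("plate.jpg")  # placeholder path to a number plate photo
h, w = plate.shape[:2]

dirty = plate.copy()
# Paint a dark blob over roughly the last third of the plate (our "dirt").
cv2.rectangle(dirty, (int(w * 0.65), 0), (w, h), (30, 30, 30), thickness=-1)

config = "--psm 7"  # treat the image as a single line of text
print("clean plate:", pytesseract.image_to_string(plate, config=config).strip())
print("dirty plate:", pytesseract.image_to_string(dirty, config=config).strip())
```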
That said, there are several other exploitation techniques that fall into this category, such as the usage of encrypted channels to keep the input from being inspected at all.
Modify the input into another form (language, format, etc.) that's not supported by the algorithm
Description
This primarily targets non-image-based algorithms: when the input is sent in a format the system doesn't recognize, it simply bypasses the control.
How to prevent?
When an input isn't recognized, either the AI system or the consuming system should handle the situation by stopping processing or alerting the user.
Alternatively, the AI model can be trained on the input types it is likely to encounter; for text-based models, for example, emojis can be included in the training data.
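A minimal normalization sketch for text input is below. It folds Unicode compatibility characters (such as fullwidth letters) into canonical forms, converts emojis to plain-text names, and strips zero-width characters before the text reaches the model. The third-party `emoji` package is used here as an example choice.

```python
# Defensive preprocessing sketch: normalize text before it reaches the classifier.
import unicodedata

import emoji  # third-party: pip install emoji


def normalize_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)          # fold compatibility characters
    text = emoji.demojize(text, delimiters=(" ", " "))  # e.g. a smiley becomes " slightly_smiling_face "
    # Drop zero-width characters often used to invisibly split "bad" words.
    return text.replace("\u200b", "").replace("\u200c", "").replace("\u200d", "")


print(normalize_text("Ｈｅｌｌｏ 🙂 wo\u200brld"))  # roughly: "Hello  slightly_smiling_face  world"
```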
Exploitation Examples
In the wild exploitation case studies
Scenario: Hate speech algorithm bypass
For example, consider this notebook, which is intended to detect hate speech using a RoBERTa model. The exact algorithms used in enterprise setups (such as social media companies) may differ, but the underlying idea is the same.
Another example of the same is writing the content in a language that the algorithm doesn't support for hate speech detection. A few major social media platforms are vulnerable to these techniques.
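As a rough probe of this idea, you could compare a classifier's verdict on plain text versus the same text with characters swapped for Cyrillic look-alikes. The Hugging Face model name below is a publicly available example and an assumption on my part, not what any platform actually runs.

```python
# Sketch: does simple homoglyph substitution change a hate-speech classifier's verdict?
from transformers import pipeline

# Example/assumed model; substitute the model you actually want to test.
clf = pipeline("text-classification", model="facebook/roberta-hate-speech-dynabench-r4-target")

LOOKALIKES = str.maketrans({"a": "а", "e": "е", "o": "о", "c": "с"})  # Latin -> Cyrillic

original = "example of abusive text goes here"  # placeholder input
obfuscated = original.translate(LOOKALIKES)

print("original  :", clf(original))
print("obfuscated:", clf(obfuscated))  # a large score drop suggests the bypass works
```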
Add noise to the input
Description
Noise is a small perturbation added to the input (random-looking characters for text, pixel-level changes for images) that in most cases makes no difference to the naked eye. When noise is added to an image, the AI model may perceive it differently and misclassify the data.
How to prevent?
Before the input is fed to the AI model, the noise should be removed, or at the very least the model should account for the fact that noise exists.
Alternatively, the AI model can be trained on possible adversarial examples (adversarial training) to identify instances of exploitation.
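One well-known way such noise is generated is the Fast Gradient Sign Method (FGSM). Below is a minimal PyTorch sketch; the tiny linear "classifier" and random image are stand-ins so the snippet runs on its own, and a real test would target the production model instead.

```python
# Minimal FGSM sketch: one gradient step on the input, in the direction that increases the loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fgsm_attack(model: nn.Module, x: torch.Tensor, y: torch.Tensor, eps: float = 0.03) -> torch.Tensor:
    """Return an adversarially perturbed copy of x (pixel values assumed in [0, 1])."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Nudge each pixel by +/- eps toward higher loss, then clamp back to a valid image.
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()


if __name__ == "__main__":
    # Toy stand-in classifier and a dummy 32x32 RGB "image".
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    x = torch.rand(1, 3, 32, 32)
    y = model(x).argmax(dim=1)  # treat the current prediction as the "true" label
    x_adv = fgsm_attack(model, x, y)
    print("clean pred:", y.item(), "adv pred:", model(x_adv).argmax(dim=1).item())
```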
In the wild exploitation case studies
Scenario: Autonomous driving cars
The noise need not always be invisible to the human eye. Consider the stop signs below: they're all legitimate stop signs to a human, but they carry noise which *might* be incomprehensible to an AI. Of course, the current generation of self-driving cars might be equipped to handle this (which I highly doubt). An attacker could target autonomous vehicles by simply pasting stickers on signs or painting over them, and the AI system would either interpret a non-stop sign as a STOP sign or ignore a genuine stop sign.
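A rough way to test for this class of problem in a lab (under the same PyTorch-classifier assumption as the FGSM sketch above) is to overlay a few random opaque patches, crude stand-ins for stickers, and check whether the prediction changes:

```python
# Sketch: simulate sticker-style occlusions on a sign image and see whether the label flips.
import torch


def add_random_stickers(x: torch.Tensor, n_stickers: int = 3, size: int = 6) -> torch.Tensor:
    """Overlay small square patches of random colour on a (1, C, H, W) image tensor."""
    x = x.clone()
    _, c, h, w = x.shape
    for _ in range(n_stickers):
        top = int(torch.randint(0, h - size, (1,)))
        left = int(torch.randint(0, w - size, (1,)))
        x[:, :, top:top + size, left:left + size] = torch.rand(c, 1, 1)
    return x


# Usage, given any image classifier `model` and a sign image tensor `x` in [0, 1]:
# if model(add_random_stickers(x)).argmax(1) != model(x).argmax(1): print("label flipped")
```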
Reverse engineer the algorithm and manipulate the input
Description
This is a broader umbrella technique that covers what isn't already covered. The idea is to understand which parameters (features) the algorithm gives importance to and modify them. For instance, some spam filtering algorithms check whether an email contains an unsubscribe button and increase the trust score if it does.
How to prevent?
If and when adversarial examples are discovered, the logic should be patched where possible. Where that's not possible, input validation has to be performed, and additionally a separate AI model can be created and trained on the identified examples to detect these malicious inputs.
The general exploitation approach (see the toy sketch after the spam-filter example below) is:
Identify the features that have the most weight associated with them.
Modify the input by introducing artificial features and provide it as input.
While this is a broad topic, below is an example: Phishing Checklist | Trusted Sec.
The blog covers very good techniques that a red teamer can start with. However, a few things caught my attention:
If you are a seasoned red teamer, this wouldn't make a lot of difference to you, but things like sender reputation or the presence of links in the email are often considered by a spam filtering algorithm to determine whether an email is spammy or not. While it's not a new thing, think of it this way:
Using trial and error (sending multiple emails), you figure out that the AI algorithm takes the following parameters into consideration while classifying emails as spammy/not-spammy:
Is sender a reputable sender?
Does the email have any spammy keywords?
Does the email have a valid unsubscribe button?
Does the email have any images or links to known trackers, etc.?
You would then incorporate most (if not all) of these features to ensure that the email bypasses the spam filter.
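To make the two-step approach concrete, here is a toy illustration, not any vendor's real filter: train a tiny linear "spam" model on the four features above, inspect which weights matter most, then craft an email whose features push the score toward "not spam". The feature names and training data are made up.

```python
# Toy spam model: identify heavy features, then craft an input that exploits them.
import numpy as np
from sklearn.linear_model import LogisticRegression

features = ["reputable_sender", "spammy_keywords", "valid_unsubscribe", "tracker_links"]

# Tiny synthetic training set: rows are emails, columns are the features above (1 = present).
X = np.array([
    [1, 0, 1, 0],  # ham
    [1, 0, 1, 1],  # ham
    [0, 1, 0, 1],  # spam
    [0, 1, 0, 0],  # spam
    [1, 1, 1, 0],  # ham
    [0, 0, 0, 1],  # spam
])
y = np.array([0, 0, 1, 1, 0, 1])  # 1 = spam

clf = LogisticRegression().fit(X, y)

# Step 1: identify which features carry the most weight.
for name, w in zip(features, clf.coef_[0]):
    print(f"{name:18s} weight={w:+.2f}")

# Step 2: craft an email that artificially includes the "trusted" features.
crafted = np.array([[1, 1, 1, 0]])  # reputable sender + unsubscribe button, despite spammy keywords
print("P(spam) for crafted email:", clf.predict_proba(crafted)[0, 1])
```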
How to test if your AI system is vulnerable?
As with code-based vulnerabilities, having a white-box approach helps. As a security researcher, what you are looking for are "edge cases" where your AI system fails to detect/classify, or misdetects/misclassifies. More often than not, it's not just the AI system but the system consuming the AI that's responsible for covering these edge cases.
If you are doing a white-box assessment, understanding the logic, identifying the gaps, and patching them is the best way. For instance, if you know your algorithm doesn't check for some edge case, go and patch it.
If you are doing a black-box assessment, try to fuzz the input in various ways to see if it produces any unintended or misclassified output. Contrary to traditional fuzzing, where in most cases you know the list of outputs to expect when there is a vulnerability, it's a bit tougher here because the expected behaviour depends completely on the context.
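A very rough black-box fuzzing sketch for a text classifier is below. The `classify(text)` function is a hypothetical wrapper around whatever system you're testing; the loop applies a handful of cheap mutations and flags any case where the label flips or the confidence drops sharply.

```python
# Minimal black-box fuzz loop. `classify` is a hypothetical wrapper around the
# system under test, returning (label, confidence).
import random

MUTATIONS = [
    lambda s: s.upper(),                                        # case change
    lambda s: s.replace("a", "а").replace("o", "о"),            # Cyrillic homoglyphs
    lambda s: "\u200b".join(s),                                 # zero-width characters
    lambda s: " ".join(s),                                      # spaced-out letters
    lambda s: s + " " + "".join(random.choices("!#$%*", k=5)),  # trailing noise
]


def fuzz(seed_text: str, classify, confidence_drop: float = 0.3):
    base_label, base_conf = classify(seed_text)
    findings = []
    for mutate in MUTATIONS:
        variant = mutate(seed_text)
        label, conf = classify(variant)
        # Flag label flips and sharp confidence drops as candidate edge cases.
        if label != base_label or (base_conf - conf) > confidence_drop:
            findings.append({"variant": variant, "label": label, "confidence": conf})
    return findings
```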
Previous work
Several *great* people have worked in this field and have created algorithms that help generate adversarial examples, such as the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and the Carlini-Wagner attack.
Resources
The OWASP Foundation (famous for the OWASP Top 10), one of the premier not-for-profit organizations putting a lot of effort into creating collateral and guides across various areas of security, has recently created an ML Top 10 of attacks. Input manipulation attack is rated at the top for this year.
OpenAI has written a really nice blog post on adversarial examples (with a good chunk of math).
If you see any mistake or have any feedback for the post, please reach out to me on LinkedIn or Twitter.