Category | Attack Scenario | Guidance |
Prompt Attacks: Crafting adversarial prompts that allow an adversary to influence the behavior of the model, and hence the output, in ways that were not intended by the application. | Prompt injections that are invisible to victims and change the state of the victim's account or any of their assets. | In Scope |
Prompt injections into any tools in which the response is used to make decisions that directly affect victim users. | In Scope |
Prompt or preamble extraction in which a user is able to extract the initial prompt used to prime the model only when sensitive information is present in the extracted preamble. | In Scope |
Using a product to generate violative, misleading, or factually incorrect content in your own session: e.g. 'jailbreaks'. This includes 'hallucinations' and factually inaccurate responses. Google's generative AI products already have a dedicated reporting channel for these types of content issues. | Out of Scope |
Training Data Extraction: Attacks that are able to successfully reconstruct verbatim training examples that contain sensitive information. Also called membership inference. | Training data extraction that reconstructs items used in the training data set that leak sensitive, non-public information. | In Scope |
Extraction that reconstructs nonsensitive/public information. | Out of Scope |
Manipulating Models: An attacker able to covertly change the behavior of a model such that they can trigger pre-defined adversarial behaviors. | Adversarial output or behavior that an attacker can reliably trigger via specific input in a model owned and operated by Google ("backdoors"). Only in-scope when a model's output is used to change the state of a victim's account or data. | In Scope |
Attacks in which an attacker manipulates the training data of the model to influence the model’s output in a victim's session according to the attacker’s preference. Only in-scope when a model's output is used to change the state of a victim's account or data. | In Scope |
Adversarial Perturbation: Inputs provided to a model that result in a deterministic but highly unexpected output from the model. | Contexts in which an adversary can reliably trigger a misclassification in a security control that can be abused for malicious use or adversarial gain. | In Scope |
Contexts in which a model's incorrect output or classification does not pose a compelling attack scenario or feasible path to Google or user harm. | Out of Scope |
Model Theft / Exfiltration: AI models often include sensitive intellectual property, so we place a high priority on protecting these assets. Exfiltration attacks allow attackers to steal details about a model such as its architecture or weights. | Attacks in which the exact architecture or weights of a confidential/proprietary model are extracted. | In Scope |
Attacks in which the architecture and weights are not extracted precisely, or when they're extracted from a non-confidential model. | Out of Scope |
If you find a flaw in an AI-powered tool other than what is listed above, you can still submit, provided that it meets the qualifications listed on our program page. | A bug or behavior that clearly meets our qualifications for a valid security or abuse issue. | In Scope |
Using an AI product to do something potentially harmful that is already possible with other tools. For example, finding a vulnerability in open source software (already possible using publicly available static analysis tools) or producing the answer to a harmful question when the answer is already available online. | Out of Scope |
Consistent with our program, issues that we already know about are not eligible for reward. | Out of Scope |
Potential copyright issues: findings in which products return content appearing to be copyright-protected. Google's generative AI products already have a dedicated reporting channel for these types of content issues. | Out of Scope |