Indirect prompt injection (IPI) is an evolving threat vector targeting users of complex AI applications with multiple data sources, such as Workspace with Gemini. This technique enables the attacker to influence the behavior of an LLM by injecting malicious instructions into the data or tools used by the LLM as it completes the user’s query. This may even be possible without any input directly from the user.
IPI is not the kind of technical problem you “solve” and move on. Sophisticated LLMs with increasing use of agentic automation combined with a wide range of content create an ultra-dynamic and evolving playground for adversarial attacks. That’s why Google takes a sophisticated and comprehensive approach to these attacks. We’re continuously improving LLM resistance to IPI attacks and launching AI application capabilities with ever-improving defenses. Staying ahead of the latest indirect prompt injection attacks is critical to our mission of securing Workspace with Gemini.
In our previous blog “Mitigating prompt injection attacks with a layered defense strategy”, we reviewed the layered architecture of our IPI defenses. In this blog, we’ll share more detail on the continuous approach we take to improve these defenses and to solve for new attacks.
New attack discovery
By proactively discovering and cataloging new attack vectors through internal and external programs, we can identify vulnerabilities and deploy robust defenses ahead of adversarial activity.
Human Red-Teaming
Human Red-Teaming uses adversarial simulations to uncover security and safety vulnerabilities. Specialized teams execute attacks based on realistic user profiles to exploit weaknesses, coordinating with product teams to resolve identified issues.
Automated Red-Teaming
Automated Red-Teaming is done via dynamic, machine-learning-driven frameworks to stress-test environments. By algorithmically generating and iterating on attack payloads, we can mimic the behavior of sophisticated threats at scale. This allows us to map complex attack paths and validate the effectiveness of our security controls across a much wider range of edge cases than manual testing could achieve on its own.
Google AI Vulnerability Rewards Program (VRP)
The Google AI Vulnerability Rewards Program (VRP) is a critical tool for enabling collaboration between Google and external security researchers who discover new attacks leveraging IPI. Through this VRP, we recognize and reward contributors for their research. We also host regular, live hacking events where we provide invited researchers access to pre-release features, proactively uncovering novel vulnerabilities. These partnerships enable Google to quickly validate, reproduce, and resolve externally-discovered issues.
Publicly disclosed AI attacks
Google utilizes open-source intelligence feeds to stay on top of the latest publicly disclosed IPI attacks, across social media, press releases, blogs, and more. From there, new AI vulnerabilities are sourced, reproduced, and catalogued internally to ensure our products are not impacted.
Vulnerability catalog
All newly discovered vulnerabilities go through a comprehensive analysis process performed by the Google Trust, Security, & Safety teams. Each new vulnerability is reproduced, checked for duplications, mapped into attack technique / impact category, and assigned to relevant owners. The combination of new attack discovery sources and vulnerability catalog process helps Google stay on top of the latest attacks in an actionable manner.
Synthetic data generation
After we discover, curate, and catalog new attacks, we use Simula to generate synthetic data expanding these new attacks. This process is essential because it allows the team to develop attack variants for completeness and coverage, and to prepare new training and validation data sets. This accelerated workflow has boosted synthetic data generation by 75%, supporting large-scale defense model evaluation and retraining, as well as updating the data set used for calculating and reporting on defense effectiveness.
Ongoing defense refinement
Continually updating and enhancing our defense mechanisms allows us to address a broader range of attack techniques, effectively reducing the overall attack surface. Updating each defense type requires different tasks, from config updates, to prompt engineering and ML model retraining.
Deterministic Defenses
Deterministic defenses, including user confirmation, URL sanitization, and tool chaining policies, are designed for rapid response against new or emerging prompt injection attacks by relying on simple configuration updates. These defenses are governed by a centralized Policy Engine, with configurations for policies like baseline tool calls, URL sanitization, and tool chaining. For immediate threats, this configuration-based system facilitates a streamlined process for "point fixes," such as regex takedowns, providing an agile defense layer that acts faster than traditional ML/LLM model refresh cycles.
ML-Based Defenses
After generating synthetic data that expands new attacks into variants, the next step is to retrain our ML-based defenses to mitigate these new attacks. We partition the synthetic data described above into separate training and validation sets to ensure performance is evaluated against held-out examples. This approach ensures repeatability, data consistency for fixed training/testing, and establishes a scalable architecture to support future extensions towards fully automated model refresh.
LLM-Based Defenses
Using the new synthetic data examples, our LLM-based defenses go through prompt engineering with refined system instructions. The goal is to iteratively optimize these prompts against agreed-upon defense effectiveness metrics, ensuring the models remain resilient against evolving threat vectors.
Gemini Model Hardening
Beyond system-level guardrails and application-level defenses, we prioritize ‘model hardening’, a process that improves the Gemini model's internal capability to identify and ignore harmful instructions within data. By utilizing synthetic datasets and fresh attack patterns, we can model various threat iterations. This enables us to strengthen the Gemini model's ability to disregard harmful embedded commands while following the user's intended request. Through this process of model hardening, Gemini has become significantly more adept at detecting and disregarding injected instructions. This has led to a reduction in the success rate of attacks without compromising the model's efficiency during routine operations.
Defense effectiveness
To measure the real-world impact of defense improvements, we simulate attacks against many Workspace features. This process leverages the newly generated synthetic attack data described on this blog, to create a robust, end-to-end evaluation. The simulation is run against multiple Workspace apps, such as Gmail and Docs, using a standardized set of assets to ensure reliable results. To determine the exact impact of a defense improvement (e.g., an updated ML model or a new LLM prompt optimization), the end-to-end evaluation is run with and without the defense enabled. This comparative testing provides the essential "before and after" metrics needed to validate defense efficacy and drive continuous improvement.
Moving forward
Our commitment to AI security is rooted in the principle that every day you’re safer with Google. While the threat landscape of indirect prompt injection evolves, we are building Workspace with Gemini to be a secure and trustworthy platform for AI-first work. IPI is a complex security challenge, which requires a defense-in-depth strategy and continuous mitigation approach. To get there, we’re combining world-class security research, automated pipelines, and advanced ML/LLM-based models. This robust and iterative framework helps to ensure we not only stay ahead of evolving threats but also provide a powerful, secure experience for both our users and customers.
Nenhum comentário :
Postar um comentário