Google Online Security Blog: Working Together to Filter Automated Data-Center Traffic

Security Blog

The latest news and insights from Google on security and safety on the Internet

Working Together to Filter Automated Data-Center Traffic

21 juillet 2015

Posted by Vegard Johnsen, Product Manager Google Ad Traffic Quality
Today the Trustworthy Accountability Group (TAG) announced a new pilot blacklist to protect advertisers across the industry. This blacklist comprises data-center IP addresses associated with non-human ad requests. We're happy to support this effort along with other industry leaders—Dstillery, Facebook, MediaMath, Quantcast, Rubicon Project, TubeMogul and Yahoo—and contribute our own data-center blacklist. As mentioned to Ad Age and in our recent call to action, we believe that if we work together we can raise the fraud-fighting bar for the whole industry.
Data-center traffic is one of many types of non-human or illegitimate ad traffic. The newly shared blacklist identifies web robots or “bots” that are being run in data centers but that avoid detection by the IAB/ABC International Spiders & Bots List. Well-behaved bots announce that they're bots as they surf the web by including a bot identifier in their declared User-Agent strings. The bots filtered by this new blacklist are different. They masquerade as human visitors by using User-Agent strings that are indistinguishable from those of typical web browsers.
In this post, we take a closer look at a few examples of data-center traffic to show why it’s so important to filter this traffic across the industry.
Impact of the data-center blacklist
When observing the traffic generated by the IP addresses in the newly shared blacklist, we found significantly distorted click metrics. In May of 2015 on DoubleClick Campaign Manager alone, we found the blacklist filtered 8.9% of all clicks. Without filtering these clicks from campaign metrics, advertiser click-through rates would have been incorrect and for some advertisers this error would have been very large.
Below is a plot that shows how much click-through rates in May would have been inflated across the most impacted of DoubleClick Campaign Manager’s larger advertisers.

Two examples of bad data-center traffic
There are two distinct types of invalid data-center traffic: where the intent is malicious and where the impact on advertisers is accidental. In this section we consider two interesting examples where we’ve observed traffic that was likely generated with malicious intent.
Publishers use many different strategies to increase the traffic to their sites. Unfortunately, some are willing to use any means necessary to do so. In our investigations we’ve seen instances where publishers have been running software tools in data centers to intentionally mislead advertisers with fake impressions and fake clicks.
First example
UrlSpirit is just one example of software that some unscrupulous publishers have been using to collaboratively drive automated traffic to their websites. Participating publishers install the UrlSpirit application on Windows machines and they each submit up to three URLs through the application’s interface. Submitted URLs are then distributed to other installed instances of the application, where Internet Explorer is used to automatically visit the list of target URLs. Publishers who have not installed the application can also leverage the network of installations by paying a fee.
At the end of May more than 82% of the UrlSpirit installations were being run on machines in data centers. There were more than 6,500 data-center installations of UrlSpirit, with each data-center installation running in a separate virtual machine. In aggregate, the data-center installations of UrlSpirit were generating a monthly rate of at least half a billion ad requests— an average of 2,500 fraudulent ad requests per installation per day.
Second example
HitLeap is another example of software that some publishers are using to collaboratively drive automated traffic to their websites. The software also runs on Windows machines, and each instance uses the Chromium Embedded Framework to automatically browse the websites of participating publishers—rather than using Internet Explorer.
Before publishers can use the network of installations to drive traffic to their websites, they need browsing minutes. Participating publishers earn browsing minutes by running the application on their computers. Alternatively, they can simply buy browsing minutes—with bundles starting at $9 for 10,000 minutes or up to 1,000,000 minutes for $625.
Publishers can specify as many target URLs as they like. The number of visits they receive from the network of installations is a function of how long they want the network of bots to spend on their sites. For example, ten browsing minutes will get a publisher five visits if the publisher requests two-minute visit durations.
In mid-June, at least 4,800 HitLeap installations were being run in virtual machines in data centers, with a unique IP associated with each HitLeap installation. The data-center installations of HitLeap made up 16% of the total HitLeap network, which was substantially larger than the UrlSpirit network.
In aggregate, the data-center installations of HitLeap were generating a monthly rate of at least a billion fraudulent ad requests—or an average of 1,600 ad requests per installation per day.Not only were these publishers collectively responsible for billions of automated ad requests, but their websites were also often extremely deceptive. For example, of the top ten webpages visited by HitLeap bots in June, nine of these included hidden ad slots -- meaning that not only was the traffic fake, but the ads couldn’t have been seen even if they had been legitimate human visitors.
http://vedgre.com/7/gg.html is illustrative of these nine webpages with hidden ad slots. The webpage has no visible content other than a single 300×250px ad. This visible ad is actually in a 300×250px iframe that includes two ads, the second of which is hidden. Additionally, there are also twenty-seven 0×0px hidden iframes on this page with each hidden iframe including two ad slots. In total there are fifty-five hidden ads on this page and one visible ad. Finally, the ads served on http://vedgre.com/7/gg.html appear to advertisers as though they have been served on legitimate websites like indiatimes.com, scotsman.com, autotrader.co.uk, allrecipes.com, dictionary.com and nypost.com, because the tags used on http://vedgre.com/7/gg.html to request the ad creatives have been deliberately spoofed.
An example of collateral damage
Unlike the traffic described above, there is also automated data-center traffic that impacts advertising campaigns but that hasn’t been generated for malicious purposes. An interesting example of this is an advertising competitive intelligence company that is generating a large volume of undeclared non-human traffic.
This company uses bots to scrape the web to find out which ad creatives are being served on which websites and at what scale. The company’s scrapers also click ad creatives to analyse the landing page destinations. To provide its clients with the most accurate possible intelligence, this company’s scrapers operate at extraordinary scale and they also do so without including bot identifiers in their User-Agent strings.
While the aim of this company is not to cause advertisers to pay for fake traffic, the company’s scrapers do waste advertiser spend. They not only generate non-human impressions; they also distort the metrics that advertisers use to evaluate campaign performance—in particular, click metrics. Looking at the data across DoubleClick Campaign Manager this company’s scrapers were responsible for 65% of the automated data-center clicks recorded in the month of May.
Going forward
Google has always invested to prevent this and other types of invalid traffic from entering our ad platforms. By contributing our data-center blacklist to TAG, we hope to help others in the industry protect themselves.
We’re excited by the collaborative spirit we’ve seen working with other industry leaders on this initiative. This is an important, early step toward tackling fraudulent and illegitimate inventory across the industry and we look forward to sharing more in the future. By pooling our collective efforts and working with industry bodies, we can create strong defenses against those looking to take advantage of our ecosystem. We look forward to working with the TAG Anti-fraud working group to turn this pilot program into an industry-wide tool.