Our system analyzes a number of webpage features to help decide whether a site is a phishing site. Starting with a page’s URL, we look to see if there is anything unusual about the host, such as whether the hostname is unusually long or whether the URL uses an IP address to specify the host. We also look to see if the URL contains any phrases like “banking” or “login” that might indicate that the page is trying to steal information.
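A minimal sketch of these URL checks might look like the following. The length threshold, the phrase list (just the two examples above), and the function names are assumptions made for illustration; they are not the values or code the real system uses.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical values: only the two example phrases from the text, and an
# arbitrary cutoff for what counts as an "unusually long" hostname.
SUSPICIOUS_PHRASES = ("banking", "login")
LONG_HOSTNAME_THRESHOLD = 40


def host_is_ip_address(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Extract a few simple URL-based signals from a single URL."""
    host = urlsplit(url).hostname or ""
    lowered = url.lower()
    return {
        "hostname_length": len(host),
        "hostname_is_long": len(host) > LONG_HOSTNAME_THRESHOLD,
        "host_is_ip": host_is_ip_address(host),
        "suspicious_phrases": [p for p in SUSPICIOUS_PHRASES if p in lowered],
    }
```

For example, `url_features("http://192.0.2.10/banking/login.html")` reports `host_is_ip` as `True` and lists both phrases. None of these signals is a verdict on its own; each is only one input among many.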
We don’t just look at the URL, though. After all, a perfectly legitimate site could certainly use words like “banking” or “login.” We collect a snapshot of the page’s content to examine it closely for phishing behavior. For example, we check whether the page has a password field and whether most of the links point to a common phishing target, as both of these characteristics can be a sign of phishing. Additionally, we pick out some of the most characteristic terms that show up on a page (as defined by their TF-IDF scores) and look for terms like “password” or “PIN number,” which may also indicate that the page is intended for phishing.
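The password-field and link-target checks can be sketched with Python's standard HTML parser. This is an illustration only: the class and function names are hypothetical, it reports the share of links going to the most-linked external host rather than matching against a real list of phishing targets, and it does not cover the TF-IDF step.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit


class PageSnapshotFeatures(HTMLParser):
    """Collects two signals from a page snapshot (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.has_password_field = False
        self.link_hosts = Counter()  # host -> number of links pointing at it

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.has_password_field = True
        elif tag == "a" and attrs.get("href"):
            host = urlsplit(attrs["href"]).hostname
            if host:  # relative links have no host and are skipped
                self.link_hosts[host] += 1


def content_features(html: str) -> dict:
    """Summarize the snapshot: password field and dominant link target."""
    parser = PageSnapshotFeatures()
    parser.feed(html)
    parser.close()
    total = sum(parser.link_hosts.values())
    top_host, top_count = (parser.link_hosts.most_common(1) or [(None, 0)])[0]
    return {
        "has_password_field": parser.has_password_field,
        "top_link_host": top_host,
        "top_link_share": top_count / total if total else 0.0,
    }
```

A page whose links mostly point at one well-known bank's domain while the page itself lives elsewhere would show a high `top_link_share` for that bank's host.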
We also check the page’s hosting information to find out which networks host the page and where the page’s servers are located geographically. If a site purporting to be an American bank runs its servers in a different country and is hosted on a local residential ISP’s network, we have a strong signal that the site is bad.
Finally, we check the page’s PageRank to see whether the page is popular, and we check the spam reputation of the page’s domain. Our research found that almost all phishing pages are hosted on domains that almost exclusively send spam. You can observe this trend in the CCDF graph of the spam reputation scores for phishing pages as compared to the graph of other, non-phishing pages.
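For readers unfamiliar with the term, a CCDF (complementary cumulative distribution function) gives, for each score x, the fraction of pages whose score is greater than x. A small sketch of how such a curve can be computed from a list of scores follows; the function name is hypothetical and no real data is included.

```python
def empirical_ccdf(scores):
    """Return (x, fraction of scores strictly greater than x) for each distinct x."""
    n = len(scores)
    points = []
    for x in sorted(set(scores)):
        fraction_above = sum(1 for s in scores if s > x) / n
        points.append((x, fraction_above))
    return points
```

Plotting the points for phishing pages and for non-phishing pages on the same axes produces the kind of comparison described above: the further apart the two curves sit, the more useful the score is as a signal.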
How we learn to recognize phishing pages
We train the classifier at the core of our automatic system with a machine learning algorithm, using a sample of the data that our system generates. Coming up with good labels (phishing/not phishing) for this data is tricky because we can’t label each of the millions of pages ourselves. Instead, we use our published phishing page list, largely generated by our classifier, to assign labels for the training data.
You might be wondering if this system is going to lead to situations where the classifier makes a mistake, puts that mistake on our list, and then uses the list to learn to make more mistakes. Fortunately, the chain doesn’t make it that far. Our classifier makes only a relatively small number of mistakes, which we can correct manually when you report them to us. Our learning algorithms can handle a few temporary errors in the training labels, and the overall learning process remains stable.
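The labeling step can be pictured as a simple lookup against the published list, with manually corrected entries removed first. The sketch below is an assumption about the general shape of that step, not the actual pipeline; the function and parameter names are hypothetical.

```python
def assign_labels(page_urls, published_list, reported_mistakes=frozenset()):
    """Label each URL True (phishing) or False (not phishing).

    published_list: URLs currently on the phishing page list.
    reported_mistakes: URLs that were reported and manually confirmed as errors.
    """
    corrected_list = set(published_list) - set(reported_mistakes)
    return {url: url in corrected_list for url in page_urls}
```

Any mistakes that have not yet been reported remain in the labels as noise, which is why the learning algorithm's tolerance for a few temporary errors matters.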
How well does this work?
Of the millions of webpages that our scanners analyze for phishing, we successfully identify 9 out of 10 phishing pages. Our classification system incorrectly flags a non-phishing site as a phishing site only about 1 in 10,000 times, which is significantly better than similar systems. In our experience, these “false positive” sites are usually built to distribute spam or may be involved with other suspicious activity. While phishers are constantly changing their strategies, we find that they do not change them enough to reliably escape our system. Our experiments showed that our classification system remained effective for over a month without retraining.
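To get a feel for what these two rates mean together, the sketch below applies them to a page mix. The rates come from the paragraph above; the page counts in the usage example are invented purely for the arithmetic, and the function name is hypothetical.

```python
from fractions import Fraction

DETECTION_RATE = Fraction(9, 10)            # 9 out of 10 phishing pages caught
FALSE_POSITIVE_RATE = Fraction(1, 10_000)   # 1 in 10,000 legitimate pages flagged


def expected_outcomes(phishing_pages: int, legitimate_pages: int) -> dict:
    """Expected counts when the two rates are applied to a given page mix."""
    caught = phishing_pages * DETECTION_RATE
    missed = phishing_pages - caught
    false_alarms = legitimate_pages * FALSE_POSITIVE_RATE
    flagged = caught + false_alarms
    return {
        "caught": caught,
        "missed": missed,
        "false_alarms": false_alarms,
        # Share of flagged pages that really are phishing pages.
        "precision": caught / flagged if flagged else None,
    }
```

With a made-up mix of 1,000 phishing pages and 1,000,000 legitimate pages, this gives 900 caught, 100 missed, and 100 false alarms, so 9 in 10 flagged pages would be genuine phishing. The real proportion depends on the actual mix of pages, which the post does not state.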
If you are a webmaster and would like more information about how to keep your site from looking like a phishing site, please check out our post on the Webmaster Central Blog. If you find that your site has been added to our phishing page list ("Reported Web Forgery!") by mistake, please report the error to us.