The affected website
The website in question is a product catalogue site for a Dutch furniture brand. The site contains about 300 pieces of furniture, and includes the option to filter based on three different categories. The website is available in three different languages and contains about 100 other pages besides the 300 furniture pages.
With love, from Google
We had our doubts about the conclusion that the Googlebot would 'attack' us like this. So we first verified whether the requests were actually made by Google by running a reverse DNS lookup on several IP addresses. At the time, Google did not publish which IP addresses it used to visit sites, so a reverse DNS lookup was the only way to be sure it was the Googlebot.
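A check like this can be scripted. The sketch below is not the exact procedure we ran; it assumes the usual two-step verification (reverse lookup, then a forward lookup that must return the same IP) and assumes that genuine crawler hostnames end in googlebot.com or google.com. The function name and the injectable resolver parameters are hypothetical, added so the logic can be tried without network access. Note that socket.gethostbyname_ex only handles IPv4.

```python
import socket

# Hostname suffixes assumed to identify Google's crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip."""
    try:
        hostname = reverse(ip)[0]  # step 1: reverse DNS lookup
    except OSError:
        return False
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]  # step 2: forward-confirm the hostname
    except OSError:
        return False
    return ip in addresses
```

The forward lookup matters because anyone who controls the reverse DNS zone for their own IP range can make an address claim to be a googlebot.com host.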
The lookups, however, confirmed that it was indeed the Googlebot. So, we immediately jumped to the Google Search Console to see if we could find anything strange there.
And the answer was yes. There was definitely some strange and suspicious behavior going on. Just take a look at the graph below. Google had been steadily increasing its crawl rate until it reached a staggering 947,000 crawls per day (roughly 11 per second) and was using over 25 GB of bandwidth a day.
The site went live at the end of September 2017 and after a slight 'reindexing peak', Google crawled about 400 pages a day. Starting midway through November, however, Google began crawling more and more every day until it reached these overwhelming numbers.
The first step we took was to immediately reduce the crawl speed in Google Search Console to the bare minimum.
The effect was almost immediate and Google complied with the new crawl speed. The problem, however, was far from solved. Even at the bare minimum crawl speed, there was still one request per second, or 86,400 requests per day, which was almost exactly the number of requests we were still receiving each day.
Since the site contains about 400 pages, roughly 400 is the number of crawled pages we should be getting a day. So, back to the drawing board.
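The rates mentioned above are easy to sanity-check. The following sketch only redoes the arithmetic with the figures from this article; the function and variable names are ours for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(requests_per_day):
    """Convert a daily request count to an average per-second rate."""
    return requests_per_day / SECONDS_PER_DAY


peak_rate = per_second(947_000)        # average rate at the peak
minimum_per_day = 1 * SECONDS_PER_DAY  # one request per second, all day
site_pages = 400                       # approximate size of the site

print(f"peak: {peak_rate:.2f} requests per second")
print(f"minimum setting: {minimum_per_day} requests per day")
print(f"that is {minimum_per_day // site_pages} times the number of pages")
```

Even the lowest crawl setting therefore allowed a couple of hundred times more requests per day than the site has pages.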
The root of the problem
It quickly became clear that it was the product catalogue page that was getting the brunt of the 'attack'. The page has three separate options to filter the results; these are the category, designer and model of a product.
Filtering results like this is relatively standard functionality that we have used on many of our sites, and nothing about it stands out as odd.
Aside from the aforementioned filtering, it is also possible to select multiple filters at once.
A closer inspection of the HTML, however, did reveal something. One of our colleagues had decided not to use the default select HTML element for the dropdowns ([select][option][/option][/select]). Instead, the functionality was implemented using a hidden list containing links ([div][a href][/a][a href][/a][/div]).
By clicking one of the options, the filter was added to the query string, resulting in a new URL. Because multiple filters could be added, the number of unique URLs became seemingly limitless (an approximation of the number of combinations is 5! * 10! * 53!, a.k.a. a 1 with no less than 78 zeros).
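The order of magnitude of that approximation can be checked quickly. The sketch below simply evaluates the product of factorials quoted above; it does not model the actual filter logic, and the function name is hypothetical.

```python
import math

# The three counts used in the approximation above.
FILTER_SIZES = (5, 10, 53)


def estimate_url_count(sizes=FILTER_SIZES):
    """Evaluate the product of factorials used as a rough upper estimate."""
    return math.prod(math.factorial(n) for n in sizes)


estimate = estimate_url_count()
print(f"{estimate:.3e}")
print(len(str(estimate)), "digits")
```

Whatever the exact formula, the point stands: each link leads to a page with more filter links, so a crawler that follows every link never runs out of new URLs.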
The power of Twitter
We know that crawling is just retrieving data from a site by following the links on said site, but we had secretly expected that Google would prevent these kinds of extreme actions during its spidering. It was hard for us to believe that the Googlebot would contain this little logic, so we turned to Twitter to see if we could get some advice from Google's very own John Mu. Within hours we received confirmation that the filtering and navigation were the cause, and we were advised to look further into 'faceted navigation' and the fact that we had offered an endless number of filter combinations.
To solve the issue, we found two solutions:
- Add a 'nofollow' to the links you do not want Google to crawl, or
- Exclude certain query string parameters in the robots.txt
Since the entire problem fascinated us, we decided to try both solutions separately from one another.
Solution 1: Adding Rel=“nofollow” to all links
Google will not follow or index a link if it contains the rel='nofollow' attribute. Or so the theory goes. The solution was implemented within minutes and within a day we saw a clear difference. The Googlebot was listening so well, we turned the crawl speed back to automatic.
Problem solved! Right? Right?
No. Problem not solved. We found out that other bots weren't as well behaved as the Googlebot and just ignored our nofollow instructions. Bad bots! Down! Sit! So, despite taming the Googlebot, the other bots kept going on their merry way, resulting in many 'unnecessary' requests. The only thing that (nearly) all bots listened to was an exclusion in the robots.txt.
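For reference, the change in solution 1 amounts to adding one attribute to every filter link. The sketch below shows how such a link could be rendered; it is not the site's actual template code, and the function name, the /catalogus path and the example filter value are hypothetical.

```python
from html import escape
from urllib.parse import urlencode


def filter_link(base_path, filters, label):
    """Render a filter link that asks crawlers not to follow it."""
    href = f"{base_path}?{urlencode(filters)}"
    # rel="nofollow" is only a hint; as described above, not every bot honours it.
    return f'<a href="{escape(href)}" rel="nofollow">{escape(label)}</a>'


print(filter_link("/catalogus", {"categorie": "stoelen"}, "Stoelen"))
```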
Solution 2: robots.txt
The alternative was to indicate which URLs shouldn't be crawled using the robots.txt. First, we reverted the changes from solution 1 to get a clear result for comparison. We implemented this second solution afterwards. We decided not to exclude specific bots or URLs, but to use wildcard patterns to exclude the three query string parameters:
# Block specific querystring parameters
User-agent: *
Disallow: *categorie=*
Disallow: *ontwerper=*
Disallow: *model=*
And within three days all bots had seen our new robots.txt and the 'DDoS attack' was finally over.
By now, the number of crawl requests on the site has stabilized thanks to the solution provided above. It was remarkable to learn that the Googlebot does not contain the necessary intelligence to recognize a situation like this and put a stop to it on its own. Aside from that, we learned that not all bots are as well behaved as the Googlebot. So, it's better to exclude URLs using robots.txt. And finally, Twitter is still an excellent medium to get an expert's opinion!
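Before deploying rules like these, it can help to check which URLs they would actually block. Python's built-in urllib.robotparser only does plain prefix matching, so the sketch below uses a small hand-written matcher instead. It is a simplification: it assumes the common wildcard semantics ('*' matches any run of characters, an optional trailing '$' anchors the end) and ignores Allow rules and rule precedence. The example paths are hypothetical.

```python
import re

# The three Disallow patterns from the robots.txt above.
DISALLOW_PATTERNS = ["*categorie=*", "*ontwerper=*", "*model=*"]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path_and_query, patterns=DISALLOW_PATTERNS):
    """Return True if any pattern matches from the start of the path."""
    return any(pattern_to_regex(p).match(path_and_query) for p in patterns)


for url in ["/catalogus", "/catalogus?categorie=stoelen", "/catalogus?model=a&ontwerper=b"]:
    print(url, "->", "blocked" if is_disallowed(url) else "allowed")
```

The unfiltered catalogue page stays crawlable, while every URL carrying one of the three filter parameters is excluded.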