The affected website
The website in question is a product catalogue site for a Dutch furniture brand. The site contains about 300 pieces of furniture, and includes the option to filter based on three different categories. The website is available in three different languages and contains about 100 other pages besides the 300 furniture pages.
With love, from Google
We had our doubts about the conclusion that Googlebot would 'attack' us like this. So we first verified that the requests were actually made by Google, using a reverse DNS lookup on several of the IP addresses. Google does not publish the IP addresses it uses to visit sites, so a reverse DNS lookup is the only way to be sure a request really came from Googlebot.
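A minimal sketch of that check in Python, using only the standard library. The domain suffixes follow Google's documented convention (Googlebot hosts resolve to googlebot.com or google.com); the forward-confirming lookup is there because a PTR record alone can be spoofed:

```python
import socket

def is_googlebot(ip: str) -> bool:
    """Verify a crawler IP via reverse DNS plus a forward-confirming lookup."""
    try:
        # Reverse lookup: a genuine Googlebot IP resolves to a hostname
        # ending in googlebot.com or google.com.
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the original IP,
        # otherwise anyone could publish a fake reverse (PTR) record.
        _, _, addresses = socket.gethostbyname_ex(host)
        return ip in addresses
    except (socket.herror, socket.gaierror):
        # No PTR record, or the hostname does not resolve: not Googlebot.
        return False
```

Running this against a handful of the offending IPs from the access logs is enough to separate real Googlebot traffic from impostors identifying themselves with a Googlebot user agent.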
The lookups, however, confirmed that it was indeed Googlebot. So we immediately opened Google Search Console to see if we could find anything strange there.
And the answer was yes. There was definitely some strange and suspicious behavior going on. Just take a look at the graphs below. Google had been steadily ramping up its crawling until it reached a staggering 947,000 crawls per day (nearly 11 per second) and was using over 25 GB of bandwidth a day.
The crawl statistics in Google Search Console: crawled pages
The crawl statistics in Google Search Console: downloaded kilobytes
The site went live at the end of September 2017 and, after a slight 'reindexing peak', Google crawled about 400 pages a day. Starting midway through November, however, Google began crawling more and more pages every day until it reached these overwhelming numbers.
The first step we took was to immediately reduce the crawl rate in Google Search Console to the bare minimum.
Decreasing the crawl rate of the Googlebot
The effect was almost immediate, and Google complied with the new limit. The problem, however, was far from solved. Even at the bare minimum, the crawl rate was still one request per second, or 86,400 requests per day: almost exactly the number of requests we were still receiving each day.
Since the site contains only about 400 pages, that is roughly the number of crawled pages we should have been seeing per day, not 86,400. So, back to the drawing board.