The WassUp log of this site recently started filling up with Googlebot attempts to access far too many pages through links with nonsense appended to them, as in the following three typical samples; the numerous other crawls were variations of these.
It looked almost like an infinite loop, because the attempts kept going in variations for many minutes at a time, often close to an hour, several times a day.
There may be a link between this and the otherwise awesome Magazine Basic theme, or the child theme I made and used with it. About two months ago my post pagination got messed up because the number of displayed posts differed between Magazine Basic and my WordPress Reading Settings (5 and 10, respectively). The pagination went back to normal after I set both to the same number. I don’t think this caused the crawl problem, but I’m posting it here in case someone else is facing the same issue and needs guidance, or can offer guidance on how to fix the cause. I also know that the “pw_post_layout” string that appears in some of the logged nonsense URL crawl attempts was related to a bug in that theme that was fixed some time ago.
Another possible cause is the mobile theme I was using originally but got rid of several months ago; the strings wpmp_switcher=desktop and wpmp_switcher=mobile in the crawled links point to it.
Not being sure what was causing it, I tried to block these fruitless crawls with the URL Parameters settings in Google Webmaster Tools. I was partially successful, but there were too many different parameters to block, and Googlebot continued its senseless crawl rampage with the ones that weren’t blocked yet.
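To decide which parameters are worth blocking first, it can help to count how often each one shows up in the logged URLs. Here is a minimal Python sketch of that idea. The sample URLs, the example.com host, and the parameter values are made up for illustration; only the parameter names pw_post_layout and wpmp_switcher come from my logs.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Hypothetical sample of crawled URLs. Only the parameter names
# pw_post_layout and wpmp_switcher are taken from the real logs;
# the host, paths and values are invented for this sketch.
crawled_urls = [
    "http://example.com/page/2/?pw_post_layout=one-col-large",
    "http://example.com/page/3/?wpmp_switcher=mobile",
    "http://example.com/page/3/?wpmp_switcher=desktop&pw_post_layout=one-col-large",
    "http://example.com/some-post/",
]


def count_query_parameters(urls):
    """Count how many times each query parameter name occurs."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # keep_blank_values=True so parameters with empty values are counted too
        for name, _value in parse_qsl(query, keep_blank_values=True):
            counts[name] += 1
    return counts


parameter_counts = count_query_parameters(crawled_urls)
for name, count in parameter_counts.most_common():
    print(name, count)
```

The most frequent names are the first candidates for blocking; URLs without any query string contribute nothing to the count.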
Meanwhile, I reinstalled WordPress from scratch in a different directory, with a new database and a different theme, then imported the XML backup made by selecting WordPress Tools -> Export and choosing a full export. After a day or two spent installing, double-checking and configuring all the plugins I still wanted, I switched over. Once everything worked OK, I first renamed the old subfolder, then later backed it up and deleted it, so this is a fresh, fully clean installation containing only the old posts, pages and pictures. Still, the Googlebot crawls of these nonsense spaghetti links with appended pages came back to haunt me in the logs.
Then I remembered another powerful tool for blocking these: robots.txt, which I should have used right from the beginning. Since I still didn’t understand the true reasons for this behavior, my fix only treats the symptom. Below you can find a Disallow statement from my robots.txt that prevents Googlebot from crawling any link containing a question mark. A question mark appears in every single one of the log lines showing Googlebot’s “crazy” attempts to crawl these redundant links, and it doesn’t appear in any permanent, legitimate URL on my site.
This also replaces, more elegantly, the lines I used in my previous attempts to block the AdSense bot from crawling my preview pages while I’m writing and editing a post before publishing it, since every temporary preview link also has a question mark in it. (Caution: I know for certain that question marks aren’t used in the legitimate links of my WordPress site, but you should double-check yours before doing anything drastic.) Here’s the line:
The statement above currently exists in my robots.txt under the following lines:
User-agent: Googlebot, User-agent: Adsbot-Google, User-agent: Mediapartners-Google, and in User-agent: * for every other spider bot that understands wildcards, which is often not the case.
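Before relying on a wildcard rule like this, it is worth checking which paths it would actually catch. The Python sketch below assumes a rule of the form Disallow: /*? and the Googlebot-style matching where * stands for any run of characters and a trailing $ anchors the end of the URL. Both the rule path and the sample paths are assumptions for illustration, not copied from my robots.txt or logs.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumes Googlebot-style wildcards: '*' matches any sequence of
    characters and a trailing '$' anchors the match at the end.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape everything literally, then turn each '*' into '.*'
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(url_path, rule_path="/*?"):
    """Return True if url_path (path plus query) matches the rule.

    The default rule '/*?' is an assumed example that matches any
    path containing a question mark.
    """
    # re.match anchors at the start, like robots.txt prefix matching
    return rule_to_regex(rule_path).match(url_path) is not None


# Hypothetical paths: two with query strings, one clean permalink
for path in ["/page/2/?wpmp_switcher=mobile", "/?p=123&preview=true", "/some-post/"]:
    print(path, is_disallowed(path))
```

With this assumed rule, the two paths with a query string come out as blocked and the clean permalink stays crawlable, which is the behavior I want. A site whose legitimate links contain question marks would need a narrower pattern.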
Every instance of the “bad” links these bots were trying to crawl had a question mark in it. Putting the above Disallow statement with the question mark wildcard in the sections dedicated to Google bots has stopped further “spam crawling” by them, which buys me time until I figure out why they kept doing this. Currently Baiduspider seems to be the only one that still keeps trying to crawl the nonsense URLs, but on a far smaller scale.
So that’s the current story and its history. I’d love to hear from anybody who still has or had a similar problem and a solution.