Jul 24, 2012
 

The Wassup log of this site recently started filling with Googlebot attempts to access far too many pages at nonsense links with appended page paths, as in the following three typical samples; the rest of the numerous similar crawls were variations of these.

http://deputycio.com/?wpmp_switcher=desktop/page/31/page/3/page/33/page/3/page/34/page/2/page/3

http://deputycio.com/category/mobile?pw_post_layout/page/5/page/3/page/2/page/3/page/3/page/7/page/2/page/7/page/3/page/2

http://deputycio.com/page/2?pw_post_layout/page/4/page/3&wpmp_switcher=mobile

It looked almost like an infinite loop, because these attempts kept coming in new variations for many minutes at a time, often close to an hour, several times a day.

There may be a link between this and the otherwise awesome Magazine Basic theme, or the child theme I made and used with it. About two months ago my post pagination got messed up because the number of displayed posts differed between Magazine Basic and my WordPress Reading Settings, set to 5 and 10, respectively. The pagination returned to normal after I set both to the same number, and I don’t think this caused these problems, but I’m posting it here just in case someone else is facing the same issue and needs guidance, or can provide guidance on how to fix the cause. I also know that the “pw_post_layout” string that appears in some of the logged nonsense crawl attempts was related to a bug in that theme that was fixed some time ago.

Another possible cause could be the mobile theme I was using originally but got rid of several months ago; the strings wpmp_switcher=desktop and wpmp_switcher=mobile in the logged URLs point at it.

Not knowing what was causing it, I tried to block these fruitless crawls with the URL Parameters tool in Google Webmaster Tools. I was partially successful, but there were too many different variations to block, and Googlebot continued its senseless crawl rampage with the ones that weren’t blocked yet.

Meanwhile I reinstalled my WordPress from scratch in a different directory, even setting up a new database and using a different theme, then imported the XML backup made by selecting WordPress Tools -> Export and choosing a full export. After a day or two spent installing, double-checking and configuring all the plugins I still want, I switched over. Once everything worked OK, I first renamed, then later backed up and deleted the old subfolder, so this is a fresh, fully clean installation containing only the old posts, pages and pictures. Still, the Googlebot crawls of these nonsense spaghetti links with appended pages came back to haunt me in the logs.

Then I remembered another powerful tool to block these – robots.txt, which I should have used right from the beginning. Since I still don’t understand the true reasons for this behavior, my fix only treats the symptom. Below is a Disallow statement from my robots.txt that prevents Googlebot from crawling any link containing a question mark; a question mark appears in every single one of the logged “crazy” attempts to browse these redundant links and doesn’t exist in any permanent, legit URL on my site.

This also more elegantly replaces the lines from my previous attempts to block the AdSense bot from crawling my preview pages while I’m writing and editing a post before publishing it, since every temporary preview link also has a question mark in it. (Caution: I know for certain that question marks aren’t used in my WordPress site’s legit links, but you should double-check yours before doing anything drastic.) Here’s the line:

Disallow: /*?

The statement above currently exists in my robots.txt under the following lines:

User-agent: Googlebot, User-agent: Adsbot-Google, User-agent: Mediapartners-Google, and User-agent: * for every other spider bot that understands wildcards (which is often not the case).
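
For illustration, here is roughly how that layout could look in a robots.txt; only the placement of the Disallow is taken from my file, and any other rules you already have would sit alongside it:

User-agent: Googlebot
Disallow: /*?

User-agent: Adsbot-Google
Disallow: /*?

User-agent: Mediapartners-Google
Disallow: /*?

User-agent: *
Disallow: /*?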

Every instance of the “bad” links the bots were trying to crawl had a question mark in it. Putting the question-mark wildcard Disallow above into the sections dedicated to the Google bots has prevented further “spam crawling” by them, at least until I figure out why they kept doing this. Currently Baiduspider seems to be the only one that still keeps trying to crawl the nonsense URLs, though on a far smaller scale.

So that’s the current story and its history.  I’d love to hear from anybody who still has or had a similar problem and a solution.

  9 Responses to “Googlebot Gone Crazy (Maybe Not Its Fault)”

  1. I have this same problem, and it also seems that my site isn’t showing up in Google at all except via links like these. E.g. when I search for specific blocks of text from my site in quotation marks, the results do not show up in Google. This is a seasoned, 5-year-old site which previously showed up in Google. I’m using the Ascetica theme by Galin Simeonov.

    • I too was surprised to find out that these links have taken priority over the older regular ones, especially for newer posts.
      My visitor numbers fell around that time too and I thought it was the infamous Google Penguin updates, but the dates didn’t really coincide. Besides disallowing all links containing a question mark, I also disallowed Googlebot from a multitude of less important link entry points such as /author/, /date/, /tag/ and /category/, so that it only accesses my posts through their main links (roughly the lines sketched after this reply). After all these maneuvers in robots.txt, my posts are now again showing up under their normal URLs in Google.

      Just to try to determine whether our sites have anything in common, are you using, or have you used a mobile theme as well? I was using the Baap Mobile Version for about a year and then decided to remove it, but that was far before I discovered this, so I’m not sure whether that’s related.
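
      For illustration only, those extra entry-point rules could look roughly like the following in the Googlebot section (I’m assuming the archive bases named above; double-check your own permalinks before copying anything):

      User-agent: Googlebot
      Disallow: /*?
      Disallow: /author/
      Disallow: /date/
      Disallow: /tag/
      Disallow: /category/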

  2. That stupid plugin causes 200 errors in Google Webmaster Tools.

    • Which plugin? Are you having the same problem?

      • “BAAP Mobile Version”

        I don’t know if there is an alternative mobile plugin that does the same job without these errors.
        By the way, I tried to block Googlebot from crawling those URLs using
        Disallow: *?wpmp_switcher
        but without any success.
