May 30, 2012
 

Occasionally I see a “Page Not Found” message in the Google AdSense crawler errors report, and the URL appears in the form

“http://deputycio.com/?p=XXYY&preview=true”,

where XXYY is the unique “post_id” number that WordPress assigns to every new post. URLs containing the “preview=true” string are temporary WordPress preview pages, which points to the cause of the problem.

When I preview a not-yet-published post, Google AdSense ads get displayed on the preview page just like throughout the rest of my website. The Google crawler then tries to read the article to “learn” which contextual ads would fit it best, but it fails because preview pages aren’t public.

Once published, the post gets a new, permanent URL that keeps only the unique number from the temporary link, and the web spider (AKA web crawler) reads it successfully. However, the crawler also keeps trying to access the long-gone temporary URL for a while (if you know for how long, please comment).

These errors are more cosmetic than critical: they inform the webmaster that a page wasn’t accessible, which in any other case could signal an impaired visitor experience, so this is otherwise beneficial behavior on Google’s side. Here, though, it is an annoyance, so to prevent these preview-page errors from showing up in the future I added this line to my robots.txt file below the “User-agent: *” disallow statements (the section that applies to all crawlers):

Disallow: /*preview=true$

UPDATE EDIT June 22, 2012: For some reason this didn’t stop another crawler error from happening, so I’m trying the more generic statement without the dollar sign ‘$’ at the end.

Disallow: /*preview=true

UPDATE EDIT July 23, 2012: Neither of these worked because I had put them in the wrong section of my robots.txt file: under User-agent: * instead of under User-agent: Googlebot, which was overriding them because it was set to allow Googlebot to snoop around anything and everything. Moreover, this wildcard syntax works only with Google’s bots, so it shouldn’t really be in the User-agent: * section anyway. I described another extremely simple and elegant statement that disallows both preview links and many other undesired links in my Googlebot Gone Crazy post.
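For reference, a robots.txt laid out the way this update describes — with the wildcard rule inside the Googlebot section rather than the catch-all one — might look roughly like this (the /wp-admin/ line is only a placeholder example, not from my actual file):

```
User-agent: Googlebot
Disallow: /*preview=true

User-agent: *
Disallow: /wp-admin/
```

Googlebot obeys the most specific matching User-agent group and ignores the others, so a rule meant for it has to live in its own section.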

The star ‘*’ makes the rule apply whenever the “preview=true” string (without quotes) appears anywhere in a URL, while the dollar sign ‘$’ at the end makes my statement more specific and disallows the crawlers from fetching a URL only if “preview=true” appears at the very end, which is always the case with preview URLs. My statement should work even without the dollar sign, but I just wanted to be more verbose, specific, particular, meticulous, genuine, accurate, intrinsic, precise, to the point, exact, supercalifragilisticexpialidocious…
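To see how these two wildcards behave, the Googlebot-style matching can be sketched in a few lines of Python — this is my own illustrative helper, not anything Google ships, and it assumes the documented semantics: ‘*’ matches any run of characters and a trailing ‘$’ anchors the pattern to the end of the URL.

```python
import re

def rule_matches(pattern: str, url_path: str) -> bool:
    """Check whether a Disallow pattern matches a URL path.

    '*' matches any character sequence; a trailing '$' anchors
    the match to the end of the URL. Without '$', the rule is a
    prefix match (anything may follow).
    """
    # Escape the pattern literally, then restore the two wildcards.
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    # re.match anchors at the start of the path, like robots.txt rules.
    return re.match(regex, url_path) is not None

# The '$' version only matches when preview=true ends the URL:
print(rule_matches("/*preview=true$", "/?p=1234&preview=true"))         # True
print(rule_matches("/*preview=true$", "/?p=1234&preview=true&more=1"))  # False
# The generic version matches in both cases:
print(rule_matches("/*preview=true", "/?p=1234&preview=true&more=1"))   # True
```

This also shows why the ‘$’ variant is stricter: if anything is ever appended after preview=true, the anchored rule stops matching while the generic one still does.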

  2 Responses to “Hiding WordPress Preview Pages from Web Crawlers”

  1. Nice experiment buddy. Keep It up. 🙂 
