Yesterday, while contemplating methods to 301 redirect a series of dynamic URLs at my Fortune 500 job, in an effort to canonicalize them and avoid a potential duplicate content issue, I thought I would check our account on Google Webmaster Tools.
The first thing that caught my attention was the “crawl errors… not found” section of the dashboard. Google had found several new unreachable URLs, the majority of which were PDF files. How had these PDF files suddenly become unreachable to Google’s spider?
A few weeks ago, we had several PDF documents recreated, wrapping their data into a new corporate-branded theme. Based on some outdated “that’s the way it’s always been done” guidelines, the new PDF documents had their names changed to include the initials of the person who made the updates. “That way, we can always tell who touched them last”, I was told.
The updated PDFs are uploaded to the Web site via a content management system, which associates each new PDF with the correct product. Visit a product page and the newly revised PDF is instantly available to any customer who needs it. When the site is re-spidered, Google finds and indexes the new PDF. What’s the problem?
The problem is multi-faceted, and not the least of its facets is that Google had followed these links from other Web sites. Yes, other sites had taken the time to point to our content, and we unknowingly broke those links. New rule: standardize the naming convention of these documents and never change the names again.
OK, so now we have a large number of PDFs we need to 301 redirect to their new addresses. 301 redirecting static Web pages is easy, but what about PDF documents? Unlike an ASP or PHP page, a PDF file has no server-side code to hook 301 redirect logic into. If we were on an Apache server, we could easily handle this by editing the .htaccess file. We’re running the IIS Web server, so that’s out. Looks like we need a different method.
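(For comparison, on Apache the whole job would be one line per file in .htaccess — a hypothetical example with placeholder paths, not something that applies to our IIS setup:)

```apache
# .htaccess in the Web root (Apache mod_alias; not available on IIS)
Redirect 301 /pdfs/theoldpdf.pdf http://www.example.com/pdfs/thenewpdf.pdf
```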
While poking around on the Web, I found an article by John Honeck, entitled “Page by Page redirects in IIS for .asp, .html, .pdf, etc.”. He presented an ingenious method for implementing redirection of a PDF document. Rename or remove the old PDF document from the directory. In its place, create a new folder and give it the exact same name as the outdated PDF, including the use of the .pdf extension.
Inside the new folder, place a default Web page (index.htm, default.htm, etc.). In the head of the default Web page, place meta redirect code. A user (or Google spider) visits the directory, finds an object with the correct name and the default Web page redirects them to the new content!
Here’s sample code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Moved to new URL: http://www.example.com/pdfs/thenewpdf.pdf</title>
<meta http-equiv="refresh" content="0; url=http://www.example.com/pdfs/thenewpdf.pdf" />
<meta name="robots" content="noindex,follow" />
</head>
<body>
<h1>This page has been moved to http://www.example.com/pdfs/thenewpdf.pdf</h1>
<p>If your browser doesn't redirect you to the new location, please <a href="http://www.example.com/pdfs/thenewpdf.pdf"><b>click here</b></a>.</p>
<!-- Redirect for old file URL: http://www.example.com/pdfs/theoldpdf.pdf -->
<!-- Save as the default page inside the folder named theoldpdf.pdf -->
</body>
</html>
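With a large number of PDFs to redirect, creating each folder and default page by hand gets tedious. A small script can stamp out the stubs in bulk — this is just a sketch, assuming a simple old-path-to-new-URL mapping; the `site` webroot, the mapping, and the file names are all hypothetical:

```python
import os

# Minimal meta-refresh stub, mirroring the sample page above.
# {url} is filled in with each PDF's new address.
TEMPLATE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Moved to new URL: {url}</title>
<meta http-equiv="refresh" content="0; url={url}" />
<meta name="robots" content="noindex,follow" />
</head>
<body>
<h1>This page has been moved to {url}</h1>
</body>
</html>
"""

def create_pdf_redirect(webroot, old_path, new_url):
    """Create a folder named after the old PDF, holding a default redirect page."""
    folder = os.path.join(webroot, old_path)   # e.g. site/pdfs/theoldpdf.pdf
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "index.htm"), "w") as f:
        f.write(TEMPLATE.format(url=new_url))

# Hypothetical mapping of outdated PDF paths to their new locations
redirects = {
    "pdfs/theoldpdf.pdf": "http://www.example.com/pdfs/thenewpdf.pdf",
}
for old, new in redirects.items():
    create_pdf_redirect("site", old, new)
```

Each entry produces a `theoldpdf.pdf` folder containing an `index.htm`; the server's default-document handling then serves the redirect page at the old URL.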
Prior to implementing this, I wanted to verify that Google would indeed see the meta refresh as a 301 redirect. I had considered adding 301 redirect code, followed by a meta refresh. Upon some additional searching, I found an excellent article by SebastianX entitled, “Google and Yahoo accept undelayed meta refreshs as 301 redirects”.
Sebastian’s article presents a satisfying argument, illustrating that Google will indeed see these meta refreshes as 301 redirects. Our customers will find the content they came looking for. Google will find the updated content, and should no longer designate the items as unreachable content. And just as soon as I finish creating the last of these PDF redirects, I’ll be able to get back to my journey of finding the best solution for 301 redirecting and canonicalizing the dynamic URLs I was working on previously.