These days we ask quite a lot from our Content Management Systems. They need to be mobile friendly, e-commerce friendly, SEO friendly and user friendly, and in some cases multi-lingual.
As usual, the splendid people at Moz (the new name for SEOmoz) have an excellent blog post on the subject.
This wasn’t always the case, and as a result there are a few systems out there that are, shall we say…wanting.
But I’m not here to bash the software providers. What I want to discuss is the need to remove URLs from the search engines’ indexes when your website insists on creating more than one page with exactly the same content.
Why remove pages – isn’t all content good?
Content is still king. However, it should be good quality content. Over the years search engines have become really good at spotting duplicate content. Take a look at this test.
Click this link to see the number of results for an exact match search on this famous Macbeth quote.
“Double, double toil and trouble; Fire burn, and caldron bubble”
As you can see, there are roughly 85,800 results that all match that exact phrase. So, if Google can identify this, it will certainly be able to spot that you have 15 pages with the same product on them.
There are other reasons for wanting to remove content.
Publish and be damned. This isn’t a good philosophy, but sometimes you do publish something by accident. If so, you want it out of the index, and fast.
And finally, perhaps you got a very nicely worded notice in Google’s Webmaster Tools and a drop in website visits. You need to sort that out too.
So, when you have a lot of pages to de-list, for whatever reason, here are a few suggestions to help you take the manual RSI pain out of the process.
NoIndex
If you have access to the code of your website, you could add the NoIndex tag. This is fairly simple to do: add the following code into the <head> area of each page you do not want indexed.
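The tag itself is a single meta element. A minimal sketch is shown below; the content value can be varied, for example "noindex, follow" if you still want the links on the page to be crawled, or you can target one crawler by name (e.g. "googlebot") instead of "robots".

<!-- Place inside the <head> of any page you want kept out of the index -->
<meta name="robots" content="noindex">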
The effects of this will not be immediate; it relies on the search engines re-crawling the page. When Google, Bing, et al. find the tag, they will either remove the page from their index or not include it in the first place.
This tag will not affect the rest of the website; it applies purely to the page it is on.
The Robots.txt file
All search engines will look for a robots.txt file. This file is basically a set of instructions telling search engine crawlers which parts of a website to ignore, whether specific pages, directories or URLs. A minimal sketch is shown below.
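The paths in this example are placeholders; swap in whatever pages or directories you actually need to block.

# Applies to all crawlers
User-agent: *
# Block an entire directory (hypothetical path)
Disallow: /print-versions/
# Block a single page (hypothetical path)
Disallow: /duplicate-product-page.html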
However, this doesn’t perform the same function as the NoIndex tag. This method is preventative: if a page is already in the index, blocking it in robots.txt will not get it removed.
Finally, the URL removal tool in Google’s Webmaster Tools
Not for the faint-hearted. If you use this tool, take care: any mistake will remove pages from the index for months.