Stop Google from accessing pages without robots.txt
Is it possible to get Google to stop crawling certain pages without using robots.txt?
My understanding is that robots.txt can block Google from accessing a page at all, but with noindex (meta tag or HTTP header) Google may still crawl the page.
As I have millions of pages, I want to block Google from ever accessing certain pages, to focus its indexing on my more important pages.
Background:
My site had over 11 million pages in English. Many of these contained statistical information; for comparison, consider a site listing demographic information by postal code/ZIP. Due to requests, I translated one type of page, which accounts for most of the pages, into several languages. This gives me a total of about 100 million pages. Around 250,000 of them have paragraph text; the rest have only statistical content.
According to Google Webmaster Tools, Google has indexed around 30 million pages. Many of those pages may be indexed in German or French, but not English. Since English is my main language, I instead want to block from Google all non-English pages except the non-English pages that have paragraph text, which I think will leave Google to largely index my updated set of close to 30 million pages in English.
I believe the only way to do this would be to structure the site something like this:
domain.tld/en/ <- index everything in here
domain.tld/de/ <- index everything in here
domain.tld/dex/ <- block with robots.txt
So my pages with German paragraph text would be available at URLs beginning with /de/, and those without paragraph text at URLs beginning with /dex/.
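As a minimal sketch of what that blocking rule would look like, assuming /dex/ (and its equivalents for other languages) is the only prefix I need to disallow, the robots.txt could simply be written out as a static file at deploy time:

```python
# Sketch: the robots.txt implied by the layout above (assumption: /dex/ is
# the only prefix holding the paragraph-less translations to block).
ROBOTS_TXT = """\
User-agent: *
Disallow: /dex/
"""

# Written out once, e.g. at deploy time, and served from the site root.
with open("robots.txt", "w", encoding="utf-8") as f:
    f.write(ROBOTS_TXT)
```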
It's true that robots.txt directives can easily be ignored by crawlers in certain situations.
Knowing this, in addition to the noindex tag I also return a 410 (Gone) status instead of 200 (OK). That prevents the URL from being indexed even when external sites link to it.
To sum up, the needed actions (sketched in code after this list) are:
- NoIndex or robots.txt disallow
- 410 status
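Here is a rough sketch of returning both signals together, assuming a Flask app and a hypothetical is_blocked_page() check for the paragraph-less non-English URLs; the noindex is sent as an X-Robots-Tag header rather than a meta tag:

```python
from flask import Flask, Response

app = Flask(__name__)

# Hypothetical helper: decides whether a path belongs to the blocked
# (non-English, paragraph-less) section of the site.
def is_blocked_page(path: str) -> bool:
    return path.startswith("dex/")

@app.route("/<path:path>")
def serve(path):
    if is_blocked_page(path):
        # Return 410 Gone plus a noindex header so the URL is dropped from
        # the index even if external sites still link to it.
        return Response(
            "Gone",
            status=410,
            headers={"X-Robots-Tag": "noindex"},
        )
    return Response("Normal page content", status=200)
```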
If you need some language subfolder to stay indexed for a period of time before being deindexed, you can use the unavailable_after rule.
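One way to do that, sketched here with the X-Robots-Tag header rather than a meta tag and with an arbitrary example date, looks like this:

```python
from flask import Response

# Sketch: serve a page that should drop out of Google's index after a
# chosen cut-off date (the date below is an arbitrary example).
def serve_temporarily_indexed_page(body: str) -> Response:
    return Response(
        body,
        status=200,
        headers={
            # unavailable_after tells Google not to show the page in
            # search results after the given date/time.
            "X-Robots-Tag": "unavailable_after: 2025-12-31",
        },
    )
```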
First of all, I don't think that trying to force Google into anything is a good idea. It may temporarily work and then fail badly at some point.
I would use the locale to return the page content in the right language. Google has indexed locale-adaptive pages since January 2015.
What this means is simple: the browser sends you an Accept-Language header telling you which languages the user prefers. Using that information, you return the page in English, German, French... at the exact same URL. So you have nothing to block or worry about, because Google is capable of requesting all the versions (Google Deutschland will request de as its primary language, then probably fall back to en).
Of course, that requires your system to support such a feature.
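A minimal sketch of that idea, assuming a Flask app and a small set of hypothetical translations: the same URL serves whichever supported language best matches the browser's Accept-Language header.

```python
from flask import Flask, Response, request

app = Flask(__name__)

# Hypothetical translations of a single page, keyed by language code.
PAGE_CONTENT = {
    "en": "Statistics for this postal code...",
    "de": "Statistiken für diese Postleitzahl...",
    "fr": "Statistiques pour ce code postal...",
}

@app.route("/postal-code/<code>")
def postal_code_page(code):
    # Pick the best supported language from the Accept-Language header,
    # falling back to English; the URL is the same for every language.
    lang = request.accept_languages.best_match(list(PAGE_CONTENT), default="en")
    resp = Response(PAGE_CONTENT[lang])
    # Tell caches (and crawlers) that the response varies by language.
    resp.headers["Vary"] = "Accept-Language"
    return resp
```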
Now, I've seen many websites trying to expose languages using a path (as you've shown, http://example.com/fr) or a sub-domain (http://fr.example.com/), but in reality those solutions are wrong, since you are really serving the exact same page with the content tweaked for the language... so there should be no need to create a completely separate page. Instead, just make sure you serve the correct language and you should be fine after that. Google may still end up adding more of your other-language pages to its index. I'm not too sure why that would happen, but it should not change the number of pages indexed in English nor the number of hits you currently get.