Stop Google from accessing pages without robots.txt

Google Search Console is a free tool that lets you identify, troubleshoot, and resolve issues Google may encounter as it crawls and attempts to index your website for search results. If you’re not the most technical person in the world, some of the errors you encounter there may leave you scratching your head. We wanted to make it a bit easier, so we put together this handy set of tips about SEO, Google Search, web crawlers, and Googlebot to guide you along the way. In the discussion below, we share some tips for stopping Google from accessing pages without robots.txt.

Problem:


Is it possible to get Google to stop crawling certain pages without using robots.txt?



My understanding is that robots.txt can block Google from accessing a page, but if I use noindex (meta tag or header) Google may still crawl the page.



As I have millions of pages, I want to block Google from ever accessing certain pages, to focus its indexing on my more important pages.






Background:



My site has over 11 million pages in English. Many of these contain statistical information; for comparison, consider a site listing demographic information by postal code/ZIP. Due to requests, I translated the one type of page that accounts for most of the pages into several languages. This gives me a total of about 100 million pages. Around 250,000 of them have paragraph text on them; the rest have only statistical content.



According to Google Webmaster Tools, Google has indexed around 30 million pages. Many of the pages may be indexed in German or French, but not English. Since English is my main language, I want to block all the non-English pages from Google, except non-English pages that have paragraph text; I think this will leave Google to largely index my updated set of close to 30 million pages in English.



I believe the only way to do this would be to structure the site something like this:



domain.tld/en/ <- index everything in here
domain.tld/de/ <- index everything in here
domain.tld/dex/ <- block with robots.txt


So my pages with German paragraph text would be available via URLs beginning with /de/, and those without paragraph text via /dex/.
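For illustration, the robots.txt rule for that layout might look like the following (the /dex/ prefix is just the hypothetical path from the structure above):

User-agent: Googlebot
Disallow: /dex/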


Solution:

It's true that robots.txt directives can be ignored by search engines in certain situations.



Knowing this, in addition to the noindex tag I also return a 410 status (GONE) instead of 200 (OK). That prevents the URL from being indexed even when it has links from external sites.



To sum up the needed actions:




  1. NoIndex or robots.txt disallow

  2. 410 status



In case you need to index some language subfolder for a period of time before deindexing it, you can use the unavailable_after tag.
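As a rough illustration of those two actions (plus the unavailable_after option), here is a minimal sketch assuming a Flask app; the is_retired() check, the /fr/ subfolder, and the date are hypothetical placeholders for however you decide which URLs to drop and when:

# Minimal sketch, assuming Flask; is_retired(), the fr/ prefix, and the date are hypothetical.
from flask import Flask

app = Flask(__name__)

def is_retired(path: str) -> bool:
    # Hypothetical rule: e.g. non-English pages without paragraph text.
    return path.startswith("dex/")

@app.route("/<path:page_path>")
def page(page_path):
    if is_retired(page_path):
        # 410 (Gone) plus a noindex header: the URL is dropped from the index
        # even if external sites still link to it.
        return "Gone", 410, {"X-Robots-Tag": "noindex"}
    if page_path.startswith("fr/"):
        # Hypothetical subfolder that should stay indexed only until a given
        # date (format shown is illustrative), then drop out automatically.
        headers = {"X-Robots-Tag": "unavailable_after: 2025-12-31"}
        return render_page(page_path), 200, headers
    return render_page(page_path), 200

def render_page(path: str) -> str:
    # Placeholder for the real page rendering.
    return "<html><body>Content for " + path + "</body></html>"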



First of all, I don't think that trying to force Google into anything is a good idea. It may temporarily work and then fail badly at some point.



I would use the locale to return the page content in the right language. Google has indexed locale-adaptive pages since January 2015.



What this means is simple: the browser sends a header telling you which languages the user prefers. Using that information, you return the page in English, German, French, and so on, at the exact same URL. So you have nothing to block or worry about, because Google is capable of requesting all the versions (Googlebot crawling on behalf of Google Deutschland will request de as its primary language, then probably fall back to en).



Of course, that requires your system to support such a feature.
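For a sense of what that support could look like, here is a minimal sketch assuming a Flask app; the URL pattern and the get_translation() helper are hypothetical:

# Minimal sketch, assuming Flask; the route and get_translation() are hypothetical.
from flask import Flask, request

app = Flask(__name__)

SUPPORTED_LANGUAGES = ["en", "de", "fr"]  # assumed set of available translations

@app.route("/stats/<zip_code>")
def stats_page(zip_code):
    # Pick the best match from the browser's Accept-Language header,
    # falling back to English when nothing matches.
    lang = request.accept_languages.best_match(SUPPORTED_LANGUAGES) or "en"
    body = get_translation(zip_code, lang)
    # Tell caches and crawlers that the response varies by language.
    return body, 200, {"Content-Language": lang, "Vary": "Accept-Language"}

def get_translation(zip_code: str, lang: str) -> str:
    # Placeholder for the real per-language rendering.
    return "<html lang='" + lang + "'><body>Stats for " + zip_code + " in " + lang + "</body></html>"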



Now, I've seen many websites trying to show languages using a path (as you've shown, http://example.com/fr) or a sub-domain (http://fr.example.com/), but in reality those solutions are wrong, since you really are serving the exact same page with content tweaked for the language; there should be no need to create a completely separate page. Instead, just make sure you serve the correct language and you should be fine after that. Google may still end up adding more of your other-language pages to its index. I'm not too sure why that would happen, but it should not change the number of pages indexed in English, nor the number of hits you currently get.


If the issue is resolved, there's a good chance that your content will get indexed and you'll start to show up in Google search results. That means a greater chance of driving organic search traffic to your site.
