Stop Google from crawling proxied pages but still allow the proxy itself to be found via search engines

Problem:


I have a proxy at http://rahul2001.com/proxy/



A site can be proxied like this:
http://rahul2001.com/proxy/proxy.php/http://site-to-be-proxied.example.com/
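For context, a path-based proxy of this kind typically treats everything after `proxy.php/` in the request URI as the URL to fetch. A minimal sketch of that idea (an illustration only — `extractTarget` is a hypothetical helper, not miniProxy's actual code):

```php
<?php
// Hypothetical sketch of how a path-based proxy derives its target:
// everything after "proxy.php/" in the request URI is the URL to fetch.
function extractTarget(string $requestUri): ?string {
    $marker = 'proxy.php/';
    $pos = strpos($requestUri, $marker);
    if ($pos === false) {
        return null; // no target supplied
    }
    return substr($requestUri, $pos + strlen($marker));
}

echo extractTarget('/proxy/proxy.php/http://site-to-be-proxied.example.com/');
// -> http://site-to-be-proxied.example.com/
```

This is also why a crawler can drive the proxy anywhere: whatever URL it appends to the path becomes a fetch through your server.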



The problem is that Google seems to be crawling pages through my proxy :/



uh oh..



I do NOT link to proxied versions of Yahoo Food, Google Careers, or Google Search Hindi Help, but these still turn up in the search results...



PROBLEMS:



-I DO NOT want to block my website/proxy from search engines entirely; I still want the proxy service itself to be findable in search.



-I DO NOT want Google to use up my bandwidth by crawling arbitrary third-party sites through the proxy.



-I DO NOT want to use a captcha, since a few of my apps use this proxy.



-I DO NOT want Google to spoil the search results of my website in this manner.



What do I do??



ALSO, why is Google entering random URLs into the form??



EDIT
After adding the meta tag, I get an error :(



proxy.php (first few lines):



<head><meta name="robots" content="noindex, nofollow" />
</head>
<?php
/*
miniProxy - A simple PHP web proxy. <https://github.com/joshdick/miniProxy>
Written and maintained by Joshua Dick <http://joshdick.net>.
miniProxy is licensed under the GNU GPL v3 <http://www.gnu.org/licenses/gpl.html>.
*/


Error:



Warning: Cannot modify header information - headers already sent by (output started at /home/rahulcom/public_html/proxy/proxy.php:3) in /home/rahulcom/public_html/proxy/proxy.php
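The likely cause of this warning: PHP sends the HTTP headers as soon as the script emits its first byte of output, so the `<head>...</head>` HTML printed before `<?php` means any later `header()` call inside proxy.php fails. A hedged alternative fix (a sketch of the intent, not an official miniProxy patch): send the robots directive as an HTTP response header at the very top of proxy.php, before any output at all.

```php
<?php
// Sketch of a fix: emit the robots directive as an HTTP header instead of
// printing a <meta> tag. header() must run before ANY output, so this line
// replaces the <head>...</head> block that was added above the PHP code.
header('X-Robots-Tag: noindex, nofollow');

/*
miniProxy - A simple PHP web proxy. <https://github.com/joshdick/miniProxy>
...rest of proxy.php unchanged...
*/
```

One caveat: if the proxy forwards an upstream `X-Robots-Tag` header for the proxied response, it could override this one, so it is worth verifying the final headers with something like `curl -I`.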


HELL BREAKS LOOSE - GOOGLE IS CRAWLING THE INTERNET USING MY PROXY!!



https://www.google.co.in/search?q=site:rahul2001.com+proxy


Solution:

Add this to robots.txt (just place the file in your site's root folder, usually public_html or wherever your home page sits):



User-agent: *



Disallow: /proxy/*



OK, so you do not want to block the proxy from search engines, but you don't want the proxied results to show up on search engines? Sorry, I don't get it :) Is it that you want the original sites to rank above the proxied copies? Google decides which pages are more important, I'm afraid.


Also be careful with duplicate content. The same information should only live at one URL. For duplicates you should use canonical links (http://speckyboy.com/2012/07/16/what-a-canonical-link-is-and-how-to-use-it-properly/) to tell Google which page is the original one. That is the one most likely to show up on Google.
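A canonical link is a single tag in the page's head; for example (a sketch — the href is a placeholder for whatever original URL the duplicate mirrors):

```html
<!-- On the duplicate page, point search engines at the original URL -->
<link rel="canonical" href="http://site-to-be-proxied.example.com/" />
```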


There is no need for two URLs with the same information to show up on Google.


EDIT


From the comments I was led to believe that you want /proxy/ itself to be indexable, but nothing under it, such as /proxy/subpage. In that case, use robots.txt like this:



User-agent: *


Disallow: /proxy/


Allow: /proxy/$
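Why this works: under Google's longest-match rule, the most specific (longest) matching pattern wins, and `/proxy/$` only matches the landing page exactly, while `/proxy/` matches everything under it. A minimal sketch of that rule (an illustration, not Google's actual implementation — `robotsDecision` and `patternToRegex` are hypothetical helpers):

```php
<?php
// Convert a robots.txt path pattern ('*' wildcard, '$' end anchor) to a regex.
function patternToRegex(string $pattern): string {
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));
    return '#^' . $regex . ($anchored ? '$' : '') . '#';
}

// Decide allow/disallow for $path given [pattern => 'allow'|'disallow'] rules:
// the longest matching pattern wins; on a tie, Allow wins.
function robotsDecision(array $rules, string $path): string {
    $bestLen = -1;
    $decision = 'allow'; // no matching rule means crawling is allowed
    foreach ($rules as $pattern => $verdict) {
        if (preg_match(patternToRegex($pattern), $path)) {
            $len = strlen($pattern);
            if ($len > $bestLen || ($len === $bestLen && $verdict === 'allow')) {
                $bestLen = $len;
                $decision = $verdict;
            }
        }
    }
    return $decision;
}

$rules = [
    '/proxy/'  => 'disallow',
    '/proxy/$' => 'allow',
];

// The longer '/proxy/$' rule wins for the landing page...
echo robotsDecision($rules, '/proxy/'), "\n"; // allow
// ...but only '/proxy/' matches proxied sub-URLs, so they stay blocked.
echo robotsDecision($rules, '/proxy/proxy.php/http://example.com/'), "\n"; // disallow
```

Note that robots.txt only stops crawling; URLs Google has already discovered may linger in the index for a while after the rule takes effect.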



