Preventing robots from crawling specific part of a page
As the webmaster in charge of a tiny site that has a forum, I regularly receive complaints from users that both the internal search engine and external searches (like when using Google) are totally polluted by my users' signatures (they're using long signatures, and that's part of the forum's experience, because signatures make a lot of sense on my forum).
So basically I'm seeing two options as of now:
1. Rendering the signature as a picture: when a user clicks on the "signature picture", they get taken to a page that contains the real signature (with the links in the signature etc.), and that page is set as non-crawlable by search engine spiders. This would consume some bandwidth and need some work (because I'd need an HTML renderer producing the picture etc.), but it would obviously solve the issue. (There are tiny gotchas in that the signature wouldn't respect the font/color scheme of the users, but my users are very creative with their signatures anyway, using custom fonts/colors/sizes etc., so it's not that much of an issue.)
2. Marking every part of the webpage that contains a signature as non-crawlable.
However, I'm not sure about the latter: is this something that can be done? Can you mark just specific parts of a webpage as non-crawlable?
Here is the same answer I provided to the "noindex tag for google" question on Stack Overflow:
You can prevent Google from seeing portions of the page by putting those portions in iframes that are blocked by robots.txt.
robots.txt
User-agent: *
Disallow: /nocrawl/
index.html
This text is crawlable, but the following is
text that search engines can't see:
<iframe src="/nocrawl/content.html" width="100%" height="300" scrolling="no"></iframe>
/nocrawl/content.html
Search engines cannot see this text.
Instead of using iframes, you could load the contents of the hidden file using AJAX. Here is an example that uses jQuery to do so:
This text is crawlable, but the following is
text that search engines can't see:
<div id="hidden"></div>
<script>
$.get(
    "/nocrawl/content.html",
    function (data) {
        $('#hidden').html(data);
    }
);
</script>
Another solution is to wrap the sig in a span or div with style set to display:none, and then use JavaScript to take that away so the text displays for browsers with JavaScript on. Search engines know it's not going to be displayed, so they shouldn't index it.
This bit of HTML, CSS and JavaScript should do it:
HTML:
<span class="sig">signature goes here</span>
CSS:
.sig {
    display: none;
}
JavaScript:
<script type="text/javascript">
$(document).ready(function () {
    $(".sig").show();
});
</script>
You'll need to include the jQuery library.
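If you would rather not include jQuery, a plain-JavaScript version (a minimal sketch, not part of the original answer) does the same thing:
<script>
// Reveal every hidden .sig element once the DOM is ready; crawlers
// that honor display:none won't index the hidden text, and browsers
// without JavaScript simply keep it hidden.
document.addEventListener("DOMContentLoaded", function () {
    var sigs = document.querySelectorAll(".sig");
    for (var i = 0; i < sigs.length; i++) {
        sigs[i].style.display = "inline";
    }
});
</script>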
I had a similar problem; I solved it with CSS, but it can be done with JavaScript and jQuery too.
1 - I created a class that I call "disallowed-for-crawlers" and placed that class on everything that I did not want the Google bot to see, or placed it inside a span with that class.
2 - In the main CSS of the page I have something like:
.disallowed-for-crawlers {
    display: none;
}
3 - Create a CSS file called disallow.css, add it to robots.txt so that it is disallowed from being crawled (crawlers won't access that file), and then reference it in your page after the main CSS. The combined sketch after step 4 shows the wiring.
4 - In disallow.css I placed the code:
.disallowed-for-crawlers {
    display: block !important;
}
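Putting the four steps together (the file names, paths, and signature markup below are just illustrative), the wiring might look like this:
robots.txt:
User-agent: *
Disallow: /disallow.css
HTML (disallow.css is referenced after the main CSS, so its display:block !important rule wins for real visitors, while crawlers only ever apply the display:none rule from the main stylesheet):
<link rel="stylesheet" href="/main.css">
<link rel="stylesheet" href="/disallow.css">
...
<span class="disallowed-for-crawlers">signature goes here</span>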
You can play with JavaScript or CSS; I just took advantage of the robots.txt disallow and the CSS classes. :) Hope it helps someone.
One way to do this is to use an image of text rather than plain text.
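For example (the image path and dimensions below are placeholders, not from the original answer), each signature could be pre-rendered server-side and served like this; leaving the alt text empty keeps the signature's words out of the index:
<img src="/signatures/user42.png" width="400" height="60" alt="">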