Anyone can remove your website from search engine results
[ 2008-05-21 23:24:47 | Author: hesatao ]
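How it works:
Put simply, the attacker lures Google into indexing your site through a proxy. Because the domain name is different, Google treats the result as a separate site and builds a copy of yours, and that copy then takes your site's place in the search results.
There are many online proxy tools on the web. They let us browse, anonymously, many sites blocked by the G-F-W, and the traffic they carry is encrypted.
If the Google spider finds a URL such as http://www.someproxy.com/proxy/www.example.com, it follows the link and crawls the content.
Google is clearly not as smart as a human, or perhaps Google has never shied away from assuming the worst of web users. Because the domain is different, Google does not know it is visiting the same site, so a duplicate site appears.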
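As everyone knows, search engines penalize duplicate content. Google has a great many data centers, and in the typical case events unfold in this order:
Your site's pages are stored in some of Google's data centers.
Google obtains a copy of the site through the proxy, stores the copy in one data center, and slowly synchronizes it to all the others.
When the Google spider next crawls the original site, it finds that the data center already holds identical content and mistakes the original pages for the duplicates.
The original site is removed or demoted.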
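You may think: my domain was registered in the 1990s and my site has a PR of 6 or higher, so Google surely won't demote my "ten-year-old shop" just because it found one duplicate.
That reasoning has some merit, but even a "ten-year-old shop" is not absolutely safe. What if there are thousands or tens of thousands of duplicates? You could easily capsize in a small ditch.
When Dan Thies discovered this problem a year ago, he contacted Google engineer Matt Cutts, but the problem remains unsolved to this day. Does Google have no time to deal with it, or can it simply not fix it yet? Google is always busy, busy counting money. So we cannot just sit and wait for "dearest Google" to patch this hole.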
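How do you tell whether your site has been hit?
Pick a short phrase or sentence that appears only on your home page. Say your site contains the sentence "Alibaba and Alimama are neither of them any good, e-i-e-i-o". Search Google for that sentence, making sure to put it in quotation marks. If the top result is some other site, and that site's URL looks like
www.example.com/nph-proxy.pl/011110A/http/www.mattcutts.com/blog/ or like http://www.proxysnow.com/index.php?hl=d0&q=uggc%3A%2F%2Fjjj.tbbtyr.pbz%2F , then congratulations, you have been hit. If you only know how to scrape content or run a site and cannot write code, but you can manage a little English, you can contact proxyreports@gmail.com and ask for help. Be sure to write in English.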
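The two example URLs above both carry the original address inside them: the first in the path after nph-proxy.pl, the second as a ROT13-encoded query parameter (uggc%3A%2F%2Fjjj.tbbtyr.pbz%2F decodes to http://www.google.com/). A minimal Python sketch of spotting this is shown here. The function name embedded_target is hypothetical, and the two patterns are assumptions drawn only from those two examples; other proxies will use other formats.

```python
import codecs
import re
from urllib.parse import parse_qs, urlsplit


def embedded_target(url):
    """Return the original URL hidden inside a proxy URL, or None.

    Only the two URL shapes shown in the article are recognized
    (an assumption; real proxies use many other layouts).
    """
    # Add a scheme if the URL was written without one, so urlsplit parses it.
    if not re.match(r"^[a-z][a-z0-9+.-]*://", url, re.I):
        url = "http://" + url
    parts = urlsplit(url)

    # Shape 1: /nph-proxy.pl/<flags>/<scheme>/<host>/<path>
    match = re.search(r"/nph-proxy\.(?:pl|cgi)/[^/]+/(https?)/(.+)$", parts.path)
    if match:
        return f"{match.group(1)}://{match.group(2)}"

    # Shape 2: a query parameter holding the target, plain or ROT13-encoded.
    for values in parse_qs(parts.query).values():
        for value in values:
            for candidate in (value, codecs.decode(value, "rot_13")):
                if candidate.startswith(("http://", "https://")):
                    return candidate
    return None
```

Running each top-ranked URL from the quoted-phrase search through a check like this and comparing the result against your own domain is one way to confirm that the result is a proxied copy rather than an ordinary scraper.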
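How do you avoid it? Because this involves servers, code, and so on, not everyone will be able to follow it, so the original text is excerpted below for those who need it: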
There are basically three main possibilities for your situation:
Situation 1: You are running an Apache server. We have 2 solutions in this case, that were developed by Jaimie Sirovich (co-author of Professional Search Engine Optimization with PHP). We’ve worked some late nights on this.
Solution #1 uses mod_rewrite and .htaccess, to pass all spider requests through a PHP script that validates the request. This will only defend against being hacked via “normal” anonymous proxies that pass along the user agent - it only inspects visits from the “Big 4” search engines (Ask, Google, MSN, and Yahoo). I call this the “first tier” defense - it won’t stop every proxy that exists, but it will come close, and you can implement it without modifying any of your applications. It will even work if your web site is all static pages. This is what I’m implementing. Jaimie doesn’t like it because it’s kind of a hack - and he would rather you didn’t use it at all.
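To make the idea of validating a spider request concrete, here is a rough Python sketch of one common approach: when the user agent claims to be a known spider, reverse-resolve the visiting IP, check the hostname suffix, then forward-resolve that hostname and confirm it maps back to the same IP. This is not Jaimie's PHP script. The names should_block and SPIDER_SUFFIXES are hypothetical, and the user agent tokens and hostname suffixes are illustrative assumptions (only crawl.yahoo.net appears in the quoted text) that would need checking against what each search engine currently publishes.

```python
import socket

# User agent token -> allowed hostname suffixes.
# ASSUMPTION: illustrative values only; verify against each engine's own docs.
SPIDER_SUFFIXES = {
    "googlebot": (".googlebot.com",),
    "slurp": (".crawl.yahoo.net",),
    "msnbot": (".search.msn.com",),
    "teoma": (".ask.com",),
}


def claimed_spider(user_agent):
    """Return the spider token the user agent claims to be, or None."""
    ua = user_agent.lower()
    for token in SPIDER_SUFFIXES:
        if token in ua:
            return token
    return None


def should_block(ip, user_agent, reverse=None, forward=None):
    """True if the request claims to be a spider but fails DNS validation.

    reverse/forward are injectable resolvers so the logic can be tested
    without network access; by default they use the system resolver.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    token = claimed_spider(user_agent)
    if token is None:
        return False  # ordinary visitor: this tier does not inspect it

    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(SPIDER_SUFFIXES[token]):
            return True  # hostname is not in the engine's crawl domain
        return ip not in forward(host)  # forward lookup must match the IP
    except OSError:
        return True  # lookup failed: treat the claim as unverified
```

A proxy that forwards Googlebot's user agent connects from its own IP, so the reverse lookup lands outside the crawler's domain and the request is flagged. DNS lookups add latency to each checked request, so caching the verdict per IP is worth considering.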
Solution #2 is a PHP script that implements the “reverse cloaking” defense, putting a “noindex, nofollow” robots meta tag into your pages unless it’s a spider that you have configured the script to recognize. This will only be possible if your site is built on PHP. It wouldn’t be terribly difficult for a competent PHP user to implement this in an all-static site, you’d just need to change .htaccess so that your .html files are parsed as PHP. A Wordpress plug-in will follow soon. This is a more robust defense, against more proxies.
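The shape of the “reverse cloaking” idea can be shown in a few lines. The Python sketch below is not the PHP script described above; robots_meta and render_page are hypothetical names, and the verified_spider flag is assumed to come from whatever spider check the site already trusts. Every visitor that is not a verified spider, including a proxy fetching the page on a spider's behalf, receives the noindex tag, so a proxied copy tells the search engine not to index it.

```python
from html import escape

NOINDEX_TAG = '<meta name="robots" content="noindex, nofollow">'


def robots_meta(verified_spider: bool) -> str:
    """Return the robots meta tag for this request.

    Verified spiders get no tag (the page stays indexable);
    everyone else gets noindex, nofollow.
    """
    return "" if verified_spider else NOINDEX_TAG


def render_page(title: str, body_html: str, verified_spider: bool) -> str:
    """Assemble a minimal HTML page with the appropriate robots tag."""
    head = f"<title>{escape(title)}</title>{robots_meta(verified_spider)}"
    return f"<html><head>{head}</head><body>{body_html}</body></html>"
```

The trade-off is that a false negative in spider verification hides the real page from the search engine, so the check feeding verified_spider has to be kept up to date.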
How to get the code: An implementation guide is provided on Jaimie’s blog, along with a testing environment that you can use to check spider user agents & IP addresses, and of course the source code for both solutions. No warranty is given. This is hard core code for a hard core situation. Don’t use it if you don’t need it, and all code should really be deployed by professionals who can understand what it does, modify it to suit unique environments, etc.
Situation 2: You are running a Microsoft (IIS) server. Jaimie is working on an IIS/ASP solution similar to the Apache/PHP solution, which should be available soon. Think days, not weeks, in other words. Much sooner than his new book (Professional SEO with ASP), which is also in the pipeline.
I want to thank Jaimie for stepping up to provide these solutions on very short notice. I had some code of my own but he’s a real programmer, and I’m just a guy who hacks scripts together when I need something in a hurry. This isn’t a job that should be trusted to a guy who hacks code part time. You want an expert.
Situation 3: You are on a hosted solution, aren’t running PHP scripts that you can edit, don’t control the web server, etc. This is a more complex situation. I will have another post tomorrow that will offer some possible solutions, including one that involves creating your own caching proxy on a separate server. In this case, I don’t recommend doing anything unless you really believe that you have a problem with proxies.
In fact, I have mixed feelings about recommending any “defensive” measures for anyone who isn’t actually being affected… unless losing your Google traffic for a few weeks is such a daunting prospect that you feel you must put up the walls. Just understand - running extra code before you deliver a page will have a cost, in terms of server load and response times. Personally, I am putting up the walls on all of my sites.
Further disclaimer: these solutions are based, at least in part, on information that the search engines have published regarding the right way to validate spider visits. It would be nice if they would publish the information once and then stick by it, but Yahoo gave us instructions shortly after Google did, then they recently changed the domain they crawl from (was inktomisearch.com, now crawl.yahoo.net). Once you start doing this stuff, you have to keep up with what the search engines are doing. I’ll certainly try to keep my subscribers informed, but not everyone gets my newsletter. Keeping up to speed on this stuff is up to you.
There are other solutions available. Bill Atchison’s Crawlwall is a professional (commercial) solution, that does a lot more to prevent content theft, etc. If you have the means, you may want to consider this instead, and move the burden of “keeping up with the spiders” onto Bill’s shoulders. Jaimie is working on a more general proxy-blocking solution as well. Ekstreme has the beginnings of a spider validation solution in the PHP Search Engine Bot Authentication code they published.






