Spam in blogs

From Wikipedia, the free encyclopedia

Spam in blogs (also called simply blog spam or comment spam) is a form of spamdexing. It is done by automatically posting random comments or promoting commercial services on blogs, wikis, guestbooks, and other publicly accessible online discussion boards. Any web application that accepts and displays hyperlinks submitted by visitors may be a target.

Adding links that point to the spammer's web site artificially increases the site's search engine ranking. An increased ranking often results in the spammer's commercial site being listed ahead of other sites for certain searches, increasing the number of potential visitors and paying customers.

History

This type of spam originally appeared in Internet guestbooks, where spammers repeatedly filled guestbooks with links to their own sites, with no relevant comment, in order to increase search engine rankings. If an actual comment is given, it is often just "cool page", "nice website", or keywords of the spammed link.

In 2003, spammers began to take advantage of the open nature of comments in blogging software like Movable Type by repeatedly placing comments on various blog posts that provided nothing more than a link to the spammer's commercial web site. Jay Allen created a free plugin, called MT-BlackList, for the Movable Type weblog tool (versions prior to 3.2) that attempted to alleviate this problem. Many current blogging packages now include methods of preventing or reducing the effect of blog spam.

Possible solutions

Blocking by keyword

This is the simplest form of blocking, and it yields good results: because comment spam is aimed at search-engine crawlers, its links and keywords must remain machine-readable, which makes them easy to match. A great deal of spam can be blocked simply by banning the names of popular pharmaceuticals and casino games.
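
As a minimal sketch in Python (the keyword list and function name are illustrative, not taken from any particular blogging package), such a filter can be a simple substring match:

  # Minimal keyword filter: reject a comment if it mentions a banned term.
  # The list here is illustrative; real deployments use far larger,
  # regularly updated lists.
  BANNED_KEYWORDS = ["viagra", "cialis", "texas holdem", "online casino"]

  def is_keyword_spam(comment_text):
      lowered = comment_text.lower()
      return any(keyword in lowered for keyword in BANNED_KEYWORDS)

  # is_keyword_spam("Buy cheap viagra now!") -> True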

rel="nofollow"

In early 2005 Google announced that hyperlinks with the rel="nofollow" attribute would not influence the link target's ranking in the search engine's index.

(rel="nofollow" actually tells a search engine "Don't score this link" rather than "Don't follow this link." This differs from the meaning of nofollow as used within a robots meta tag, which does tell a search engine: "Do not follow any of the hyperlinks in the body of this document.")

Using rel="nofollow" is a much simpler solution, and it makes the improvised techniques above largely irrelevant. Most weblog software now marks reader-submitted links this way by default (often with no option to disable it short of modifying the code). More sophisticated server software can omit the nofollow for links submitted by trusted users, such as those registered for a long time, on a whitelist, or with high karma. Some server software adds rel="nofollow" to pages that have been recently edited but omits it from stable pages, on the theory that offending links will have been removed from stable pages by human editors.
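
A minimal sketch of this idea in Python (the function and its trust rule are illustrative, not any particular weblog package's code): reader-submitted links are rendered with rel="nofollow" unless the author is trusted.

  import html

  def render_reader_link(url, text, trusted=False):
      # Escape the submitted values, and add rel="nofollow" unless the
      # author is trusted (long-registered, whitelisted, or high-karma).
      rel = "" if trusted else ' rel="nofollow"'
      return '<a href="%s"%s>%s</a>' % (
          html.escape(url, quote=True), rel, html.escape(text))

  # render_reader_link("http://example.com/", "my site")
  # -> '<a href="http://example.com/" rel="nofollow">my site</a>'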

Some weblog authors object to the use of rel="nofollow", arguing, for example[1], that

  • Link spammers will continue to spam everyone to reach the sites that do not use rel="nofollow"
  • Link spammers will continue to place links for clicking (by surfers), even if those links are ignored by search engines.
  • Google is advocating the use of rel="nofollow" in order to reduce the effect of heavy inter-blog linking on page ranking

In particular, on the English Wikipedia, after a discussion, it was decided not to use rel="nofollow" in articles and to use a URL blacklist instead. In this way, Wikipedia contributes to the scores of the pages it links to, and expects editors to link to relevant pages. However, Wikipedia does use rel="nofollow" on pages that are not considered to be part of the actual encyclopedia, such as discussion pages, and Wikipedia projects in languages other than English also use it in articles.[2]

Other websites with high user participation, such as Slashdot, use improvised nofollow implementations, adding rel="nofollow" only for potentially misbehaving users. Potential spammers posing as users can be identified through heuristics such as the age of the registered account. Slashdot editor and Slashcode maintainer Pudge used a user's karma as the determinant in attaching a nofollow tag to user-submitted links.

Turing tests

Various methods that force a human to perform each submission, and thus prevent automated spamming, have been attempted. A variety of captcha gateways have been implemented in an effort to prevent bots from submitting entries. Drawbacks are the annoyance they pose for regular users, the lack of an alternative for visually impaired users, and the ability of some advanced bots to fool simple captchas most of the time.
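
Graphical captchas aside, even a simple text challenge follows the same principle. A minimal sketch in Python (the question format and storage scheme are illustrative):

  import random

  def make_challenge():
      # Produce a question a human can answer trivially but a naive
      # form-posting bot cannot. Store the expected answer in the
      # visitor's session and compare it when the form is submitted.
      a, b = random.randint(1, 9), random.randint(1, 9)
      return "What is %d plus %d?" % (a, b), a + b

  question, expected = make_challenge()
  # Render `question` beside the comment form; reject the submission
  # unless the visitor's answer equals `expected`.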

Redirects

Instead of displaying a direct hyperlink submitted by a visitor, a web application can display a link to a script on its own website that redirects to the correct URL. This will not prevent all spam, since spammers do not always check for link redirection, but, like rel="nofollow", it effectively prevents the links from increasing the spammer's PageRank. An added benefit is that the redirection script can count how many people visit external URLs, although this increases the load on the site.

Redirects should be server-side to avoid accessibility issues related to client-side redirects. This can be done via the .htaccess file in Apache.
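
A minimal sketch of such a redirector as a Python WSGI application (the URL scheme and lookup table are illustrative; a real site would store submitted links in a database and log each hit to count visits):

  from urllib.parse import parse_qs

  LINKS = {"1": "http://example.com/"}  # link id -> submitted URL

  def application(environ, start_response):
      # Look up the submitted URL by id, e.g. /goto?id=1
      link_id = parse_qs(environ.get("QUERY_STRING", "")).get("id", [""])[0]
      target = LINKS.get(link_id)
      if target is None:
          start_response("404 Not Found", [("Content-Type", "text/plain")])
          return [b"unknown link"]
      # A server-side HTTP redirect avoids the accessibility problems
      # of meta-refresh or JavaScript redirects.
      start_response("302 Found", [("Location", target)])
      return [b""]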

Another way of preventing PageRank leakage is to make use of public redirection services such as TinyURL or My-Own.Net. For example,

<a href="http://my-own.net/alias_of_target" rel="nofollow">Link</a>

where alias_of_target is the alias of the target address.

Distributed approaches

This approach to addressing link spam is very new. One shortcoming of conventional link-spam filters is that a given site usually receives only one link from each domain in a spam campaign. If the spammer also varies IP addresses, little or no distinguishable pattern is left on the vandalized site. The pattern, however, is visible across the thousands of sites that are hit quickly with the same links.

A distributed approach, like that of the free LinkSleeve service, uses XML-RPC to communicate between the various server applications (such as blogs, guestbooks, forums, and wikis) and the filter server, in this case LinkSleeve. The posted data is scanned for URLs, and each URL is checked against URLs recently submitted across the web. If a threshold is exceeded, a "reject" response is returned and the comment, message, or posting is discarded; otherwise, an "accept" message is sent.
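
A sketch of the client side of such a check, using Python's standard XML-RPC library (the server URL and method name are placeholders, not LinkSleeve's actual published interface):

  import xmlrpc.client

  # Hypothetical filter server; not LinkSleeve's real endpoint.
  filter_server = xmlrpc.client.ServerProxy("http://filter.example.org/RPC2")

  def comment_allowed(comment_text):
      # The filter server extracts the URLs and compares them against
      # URLs recently submitted across many participating sites.
      return filter_server.check_links(comment_text) == "accept"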

A more robust distributed approach is Akismet, which works much like LinkSleeve but uses API keys to assign trust to nodes and enjoys wider distribution as a result of being bundled with the 2.0 release of WordPress. Akismet claims over 140,000 blogs contributing to its system. Akismet libraries have been implemented for Java, Python, Ruby, and PHP, but its adoption may be hindered by the requirement of an API key and by its restrictions on commercial use; LinkSleeve has no such restrictions.
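
A sketch of an Akismet comment-check call in Python (the key, blog URL, and comment data are placeholders; the endpoint and parameters follow Akismet's published API, which returns the literal string "true" for spam):

  from urllib.parse import urlencode
  from urllib.request import urlopen

  API_KEY = "your-api-key"  # issued by Akismet
  ENDPOINT = "http://%s.rest.akismet.com/1.1/comment-check" % API_KEY

  def is_spam(blog_url, user_ip, user_agent, comment_content):
      data = urlencode({
          "blog": blog_url,
          "user_ip": user_ip,
          "user_agent": user_agent,
          "comment_content": comment_content,
      }).encode("utf-8")
      return urlopen(ENDPOINT, data).read() == b"true"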

Ajax

Some blog software, such as Typo, allows the blog administrator to accept only comments submitted via Ajax XMLHttpRequests and to discard regular form POST requests. Although Ajax comment forms can easily be defeated by examining the page source, spammers have so far mostly passed up such opportunities.
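
A sketch of the server-side check in Python (a generic heuristic, not Typo's actual implementation): many JavaScript libraries mark XMLHttpRequests with an X-Requested-With header, which plain form posts from simple bots lack.

  def accept_comment(environ):
      # WSGI request environment: reject submissions that do not carry
      # the header commonly added by Ajax libraries.
      return environ.get("HTTP_X_REQUESTED_WITH") == "XMLHttpRequest"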

Switching off comments

Some bloggers have chosen to turn off comments because of the volume of spam.

See also