
CAPTCHA

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Sakurambo (talk | contribs) at 09:18, 12 July 2008 (→‎External links: removed link to un-noteworthy implementation). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Early CAPTCHAs such as these, generated by the EZ-Gimpy program, were used on Yahoo!. However, technology was developed to read this type of CAPTCHA.[1]
A modern CAPTCHA, rather than attempting to create a distorted background and high levels of warping on the text, might focus on making segmentation difficult by adding an angled line.
Another way to make segmentation difficult is to crowd symbols together. The result can be read by most humans but cannot be segmented by bots.

A CAPTCHA (/ˈkæptʃə/) is a type of challenge-response test used in computing to ensure that the response is not generated by a computer. The process involves one computer (a server) asking a user to complete a simple test which the computer is able to generate and grade. Because other computers are unable to solve the CAPTCHA, any user entering a correct solution is presumed to be human. A common type of CAPTCHA requires that the user type the letters or digits of a distorted image that appears on the screen.

The term "CAPTCHA" was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper (all of Carnegie Mellon University), and John Langford (then of IBM). It is a contrived acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart", trademarked by Carnegie Mellon University.[2]

A CAPTCHA is sometimes described as a reverse Turing test, because it is administered by a machine and targeted to a human, in contrast to the standard Turing test that is typically administered by a human and targeted to a machine.

Currently, reCAPTCHA is recommended as the official CAPTCHA implementation by the original CAPTCHA creators.[3]

Characteristics

A CAPTCHA system is a means of automatically generating new challenges which:

  • current software is unable to solve accurately;
  • most humans can solve; and
  • do not rely on the type of CAPTCHA being new to the attacker.

Although a checkbox "check here if you are not a bot" might serve to distinguish between humans and computers, it is not a CAPTCHA because it relies on the fact that an attacker has not spent effort to break that specific form.

Withholding of the algorithm can increase the integrity of a limited set of systems, as in the practice of security through obscurity. The most important factor in deciding whether an algorithm should be made open or restricted is the size of the system. Although an algorithm which survives scrutiny by security experts may be assumed to be more conceptually secure than an unevaluated algorithm, an unevaluated algorithm specific to a very limited set of systems is always of less interest to those engaging in automated abuse. Breaking a CAPTCHA generally requires some effort specific to that particular CAPTCHA implementation, and an abuser may decide that the benefit granted by automated bypass is negated by the effort required to engage in abuse of that system in the first place.

History

While often uncredited, Moni Naor was the first person to theorize a list of ways to verify that a request comes from a human and not a bot.[4] Primitive CAPTCHAs seem to have been developed in 1997 at AltaVista by Andrei Broder and his colleagues to prevent bots from adding URLs to their search engine. To make the images resistant to OCR (optical character recognition), the team simulated situations that scanner manuals claimed resulted in bad OCR. In 2000, Luis von Ahn and Manuel Blum coined the term 'CAPTCHA' and improved and publicized the notion, which included any program that can distinguish humans from computers. They invented multiple examples of CAPTCHAs, including the first CAPTCHAs to be widely used, which were those adopted by Yahoo!.

Applications

CAPTCHAs are used to prevent automated software from performing actions which degrade the quality of service of a given system, whether due to abuse or resource expenditure. Although CAPTCHAs are most often deployed as a response to encroachment by commercial interests, the notion that they exist to stop only spammers is mistaken.[citation needed] CAPTCHAs can be deployed to protect systems vulnerable to e-mail spam, such as the webmail services of Gmail, Hotmail, and Yahoo! Mail. In February 2008, Websense claimed to have discovered a way to compromise Gmail's CAPTCHA.[5] CAPTCHAs have also found active use in stopping automated posting to blogs, forums and wikis, whether as a result of commercial promotion, or harassment and vandalism. CAPTCHAs also serve an important function in rate limiting: automated usage of a service may be acceptable until it becomes excessive and harms human users. In such a case, a CAPTCHA can enforce the automated usage policies set by the administrator once certain usage metrics exceed a given threshold. The article rating systems used by many news web sites are another example of an online facility vulnerable to manipulation by automated software.[6]
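The rate-limiting use described above can be sketched as a sliding-window counter that only demands a CAPTCHA once a client exceeds a threshold. This is a minimal illustration, not any particular site's policy: the class name `CaptchaGate`, the default threshold of 5 requests and the 60-second window are all assumptions chosen for the example.

```python
import time
from collections import defaultdict, deque


class CaptchaGate:
    """Decide when a client must solve a CAPTCHA (hypothetical policy)."""

    def __init__(self, max_requests=5, window_seconds=60.0):
        self.max_requests = max_requests      # assumed threshold
        self.window_seconds = window_seconds  # assumed window length
        self._history = defaultdict(deque)    # client id -> request timestamps

    def requires_captcha(self, client_id, now=None):
        """Record one request and report whether the client is over the limit."""
        now = time.monotonic() if now is None else now
        stamps = self._history[client_id]
        # Forget requests that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        stamps.append(now)
        return len(stamps) > self.max_requests
```

Under this policy, automated clients that stay under the threshold are never challenged, which matches the idea that automated usage is acceptable until it becomes excessive.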

Accessibility

Because CAPTCHAs rely on visual perception, users unable to view a CAPTCHA (for example, due to a disability or because it is difficult to read) will be unable to perform the task protected by a CAPTCHA. As such, sites implementing CAPTCHAs may provide an audio version of the CAPTCHA in addition to the visual method. The official CAPTCHA site recommends providing an audio CAPTCHA for accessibility reasons.

Attempts at more accessible CAPTCHAs

Even an audio and visual CAPTCHA will require manual intervention for some users, such as those who are both deaf and blind. There have been various attempts at creating CAPTCHAs that are more accessible. Attempts include the use of JavaScript,[7] mathematical questions ("what is 1+1" or even more complex problems like derivatives or polynomial factorization -- also known as a MAPTCHA, or Mathematical CAPTCHA), or "common sense" questions ("what color is the sky"). These attempts violate one or both of the principles of CAPTCHAs: either they cannot be automatically generated or they can be easily cracked given the state of artificial intelligence. As such, the only security these CAPTCHAs provide is security through obscurity; an attacker is unlikely to have encountered the formulation of the CAPTCHA in question, and unlikely to find it worthwhile to spend resources breaking the CAPTCHA of a small site.

Due to the lack of security provided by text based CAPTCHAs, most sites choose to use an audio and visual CAPTCHA as a way of balancing accessibility and security. Often, email or telephone support is used to manually provide access to users who are unable to solve a CAPTCHA.

Circumvention

There are a few approaches to defeating CAPTCHAs: exploiting bugs in the implementation that allow the attacker to completely bypass the CAPTCHA, improving character recognition software, or using cheap human labor to process the tests.

Insecure implementation

As with any security system, design flaws in an implementation can prevent the theoretical security from being realized. Many CAPTCHA implementations, especially those which have not been designed and reviewed by security experts, are prone to common attacks.

Some CAPTCHA protection systems can be bypassed without using OCR simply by re-using the session ID of a known CAPTCHA image. A correctly designed CAPTCHA does not allow multiple solution attempts at one CAPTCHA. This prevents the reuse of a correct CAPTCHA solution or making a second guess after an incorrect OCR attempt.[8] Other CAPTCHA implementations use a hash (such as an MD5 hash) of the solution as a key passed to the client to validate the CAPTCHA. Often the solution space is small enough that this hash can be cracked.[9] Further, the hash could assist an OCR-based attempt, since candidate readings can be checked against it. A more secure scheme would use an HMAC. Finally, some implementations use only a small fixed pool of CAPTCHA images. Once an attacker has collected enough CAPTCHA image solutions over a period of time, the CAPTCHA can be broken by simply looking up solutions in a table, based on a hash of the challenge image.
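Two of the points above, single-use challenges and an HMAC in place of a bare hash, can be sketched together. This is a minimal illustration under stated assumptions rather than a reviewed design: `SERVER_KEY`, `issue_challenge` and `verify` are hypothetical names, the set of used challenge IDs is kept in memory, and answers are compared case-insensitively.

```python
import hashlib
import hmac
import secrets

SERVER_KEY = secrets.token_bytes(32)  # hypothetical secret, never sent to the client
_used_challenges = set()              # in-memory record of spent challenge IDs


def _tag(challenge_id, solution):
    """Keyed digest over the challenge ID and the normalized solution."""
    msg = f"{challenge_id}:{solution.lower()}".encode("utf-8")
    return hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()


def issue_challenge(solution):
    """Return (challenge_id, tag); both may be handed to the client."""
    challenge_id = secrets.token_hex(16)
    return challenge_id, _tag(challenge_id, solution)


def verify(challenge_id, tag, answer):
    """Allow exactly one attempt per challenge, right or wrong."""
    if challenge_id in _used_challenges:
        return False
    _used_challenges.add(challenge_id)
    return hmac.compare_digest(_tag(challenge_id, answer), tag)
```

Because the tag depends on a key the client never sees, an attacker cannot test candidate solutions offline the way they could against a plain MD5 hash, and marking the challenge as used on the first attempt rules out both replaying a correct answer and taking a second guess.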

Computer character recognition

A number of research projects have attempted (often with success) to beat visual CAPTCHAs by creating programs that contain the following functionality:

  1. Pre-processing: Removal of background clutter and noise.
  2. Segmentation: Splitting the image into regions which each contain a single character.
  3. Classification: Identifying the character in each region.

Steps 1 and 3 are easy tasks for computers.[10] The only step where humans still outperform computers is segmentation. If the background clutter consists of shapes similar to letter shapes, and the letters are connected by this clutter, the segmentation becomes nearly impossible with current software. Hence, an effective CAPTCHA should focus on the segmentation.
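To see why connecting the letters frustrates segmentation, consider a deliberately naive segmenter that splits a binary image wherever a column contains no ink. This is only an illustrative sketch and not the method used by the research projects cited here; the function name `segment_columns` and the tiny hand-made images are assumptions for the example.

```python
def segment_columns(image):
    """Split a binary image (list of rows of 0/1) into column spans with ink.

    Returns a list of (start, end) column ranges, end exclusive.
    """
    width = len(image[0])
    # A column "has ink" if any row has a set pixel in that column.
    ink = [any(row[x] for row in image) for x in range(width)]
    spans, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                  # a new region begins
        elif not has_ink and start is not None:
            spans.append((start, x))   # an empty column ends the region
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Three glyphs separated by empty columns.
separated = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 1, 1],
]
# The same glyphs with a line drawn through them (simplified to one full row).
struck_through = separated + [[1] * 8]

print(segment_columns(separated))       # three regions
print(segment_columns(struck_through))  # one merged region
```

A single connecting stroke removes every empty column, so this approach can no longer isolate individual characters for the classification step.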

Several research projects have broken real world CAPTCHAs, including one of Yahoo's early CAPTCHAs called "EZ-Gimpy"[11] and the CAPTCHAs used by popular sites such as PayPal,[12] LiveJournal, phpBB, and other open source solutions.[13][14][15] In January 2008 Network Security Research released their program for automated Yahoo! CAPTCHA recognition.[16] Windows Live Hotmail and Gmail, the other two major free email providers, were cracked shortly after.[17][18]

In February 2008 it was reported that spammers had achieved a success rate of 30% to 35%, using a bot, in responding to CAPTCHAs for Microsoft's Live Mail service [19] and a success rate of 20% against Google's Gmail CAPTCHA.[20] A Newcastle University research team has defeated the segmentation part of Microsoft's CAPTCHA with a 90% success rate, and claim that this could lead to a complete crack with a greater than 60% rate.[21]

Human solvers

CAPTCHA is vulnerable to a relay attack that uses humans to solve the puzzles. One approach involves relaying the puzzles to a group of human operators who can solve CAPTCHAs. In this scheme, a computer fills out a form and when it reaches a CAPTCHA, it gives the CAPTCHA to the human operator to solve.

Another variation of this technique involves copying the CAPTCHA images and using them as CAPTCHAs for a high-traffic site owned by the attacker. With enough traffic, the attacker can get a solution to the CAPTCHA puzzle in time to relay it back to the target site.[22] In October 2007, a piece of malware appeared in the wild which enticed users to solve CAPTCHAs in order to see progressively further into a series of "striptease" images.[23][24]

Legality

The circumvention of CAPTCHAs may violate the anti-circumvention clause of the Digital Millennium Copyright Act (DMCA) in the United States. In 2007, Ticketmaster sued software maker RMG Technologies[25] for its product which circumvented the ticket seller's CAPTCHAs on the basis that it violates the anti-circumvention clause of the DMCA. In October 2007, an injunction was issued stating that Ticketmaster would likely succeed in making its case.[26] In June 2008, Ticketmaster filed for Default Judgment against RMG. The Court granted Ticketmaster the Default and entered an $18.2M judgment in favor of Ticketmaster.

Image-recognition CAPTCHAs

Some researchers promote image recognition CAPTCHAs as a possible alternative for text based CAPTCHAs. To date, no major website has made use of an image based CAPTCHA. However, many amateur users of the phpBB forum software (which has suffered greatly from spam) have implemented an open source image recognition CAPTCHA system in the form of an addon called KittenAuth[27] which in its default form presents a question requiring the user to select a stated type of animal from an array of thumbnail images of assorted animals. The images (and the challenge questions) can be customized, for example to present questions and images which would be easily answered by the forum's target userbase.

Image recognition CAPTCHAs face many potential problems which have not been fully studied.

It is difficult for a small site to acquire a large dictionary of images which an attacker does not have access to. Without a means of automatically acquiring new labelled images, an image based challenge does not meet the definition of a CAPTCHA. KittenAuth, by default, only had 42 images in its database.[27] Microsoft's "Asirra", which it is providing as a free web service, attempts to address this by means of Microsoft Research's partnership with Petfinder.com, which has provided it with more than three million images of cats and dogs, classified by people at thousands of US animal shelters.[28]

Human solvers are a potential weakness for strategies such as Asirra. If the database of cat and dog photos can be downloaded, then paying workers $0.01 to classify each photo as either a dog or a cat means that almost the entire database of photos can be deciphered for $30,000. Photos that are subsequently added to the Asirra database are then a relatively small data set that can be classified as they first appear. Causing minor changes to images each time they appear will not prevent a computer from recognizing a repeated image as there are robust image comparator functions (e.g., image hashes, color histograms) that are insensitive to many simple image distortions. Warping an image sufficiently to fool a computer will likely also be troublesome to a human.[29]
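The $30,000 figure above follows directly from the database size and the per-photo price. A short calculation makes the arithmetic explicit; the function name is hypothetical, and the price is expressed in whole cents to avoid floating-point rounding.

```python
def labelling_cost_dollars(num_images, cents_per_image):
    """Cost of paying workers a flat price per classified image."""
    return num_images * cents_per_image / 100


# Roughly three million photos at $0.01 each.
print(labelling_cost_dollars(3_000_000, 1))
```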

Another potential weakness is that only a yes/no answer for each picture is required by most designs. Even with sixteen images, a bot has a 1 in 65536 (2^16) chance of getting the CAPTCHA right purely by chance. Furthermore, such chance identifications can be used to accumulate knowledge about the correct identification of the images, allowing the bot to progressively improve the accuracy of its guesses over time. In order for the CAPTCHA to be resistant to such chance-guessing botnet attacks, the user would need to be forced to solve an annoyingly large number of images.
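The guessing odds can be worked out for any number of yes/no images, along with the chance that a bot succeeds at least once over many independent attempts. This sketch assumes each image is guessed independently with two equally likely answers and that the bot learns nothing between attempts, which is pessimistic for the attacker given the point above about accumulating knowledge; the function names are hypothetical.

```python
from fractions import Fraction


def blind_guess_probability(num_images, choices_per_image=2):
    """Chance of answering every image correctly by pure guessing."""
    return Fraction(1, choices_per_image ** num_images)


def success_within(attempts, num_images):
    """Chance of at least one fully correct guess in `attempts` tries."""
    p = float(blind_guess_probability(num_images))
    return 1 - (1 - p) ** attempts


print(blind_guess_probability(16))   # 1/65536
print(success_within(65536, 16))     # about 0.63
```

A botnet that can make tens of thousands of attempts is therefore more likely than not to get through at least once, which is why a design of this kind needs either many more images or limits on repeated attempts.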

Microsoft's Asirra includes mitigations for these weaknesses. The image database is not downloadable in full because it includes images of already adopted pets, a set 10 times the size of the set of pets currently available for adoption. Bot guessing is countered by creating both IP-based and session-based buckets: once an IP address has misclassified a challenge, a human needs only to solve two Asirra challenges in a row from the same browser session, reducing brute force probability to 1 in less than 5 million.

3D CAPTCHA

3D computer graphics can be used to automatically juxtapose several objects in a single, visually complex scene, with parts of those objects marked with different letters of the alphabet. The user is asked to type the alphanumeric character that overlies a particular feature. This process can automatically generate an effectively infinite number of image-recognition CAPTCHAs.

Designing a computer vision program that can recognize the objects within the 3-D CAPTCHA images is intrinsically difficult. In addition, a compromised object will be automatically identified by the sudden influx of responses that correctly name the compromised object while incorrectly naming the other objects. The object will be automatically and instantly removed from the library and replaced with a new item.[30]

The instructions that accompany the 3-D CAPTCHA image are bound by language dependency. Any entity deploying the 3-D CAPTCHA will need to select the language to be used for the instructions that will accompany the image.

Collateral benefits

Some of the original inventors of the CAPTCHA system have implemented a means by which some of the effort and time spent by people who are responding to challenges can be harnessed as a distributed work system. This system, called reCAPTCHA, works by including "solved" and "unrecognized" elements (images which were not successfully recognized via OCR) in each challenge. The respondent thus answers both elements, and roughly half of his or her effort validates the challenge while the other half is captured as work.
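The pairing of a "solved" and an "unrecognized" element can be sketched as follows. This is not reCAPTCHA's actual implementation: the function names, the in-memory vote store, and the rule of accepting a transcription after a minimum number of matching answers (`min_votes`) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical store: unknown-word id -> tally of proposed transcriptions.
transcriptions = defaultdict(Counter)


def grade(control_answer, control_solution, unknown_id, unknown_answer):
    """Pass or fail on the solved element; keep the other answer as work."""
    passed = control_answer.strip().lower() == control_solution.lower()
    if passed:
        # Only respondents who got the known word right contribute a vote.
        transcriptions[unknown_id][unknown_answer.strip().lower()] += 1
    return passed


def consensus(unknown_id, min_votes=3):
    """Return the leading transcription once enough respondents agree."""
    votes = transcriptions[unknown_id]
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None
```

Only the solved element decides whether the respondent passes; the answer to the unrecognized element is collected as the "work" half of the challenge.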

References

  1. ^ Breaking a Visual CAPTCHA
  2. ^ "Computer Literacy Tests: Are You Human?". Time (magazine). Retrieved 2008-06-12. The Carnegie Mellon team came back with the CAPTCHA. (It stands for "completely automated public Turing test to tell computers and humans apart"; no, the acronym doesn't really fit.) The point of the CAPTCHA is that reading those swirly letters is something that computers aren't very good at.
  3. ^ The Official CAPTCHA Site
  4. ^ Moni Naor (July 1996). "Verification of a human in the loop or Identification via the Turing Test" (PS). Retrieved 2008-07-06.
  5. ^ Websense - Blog: Google’s CAPTCHA busted in recent spammer tactics
  6. ^ Amrinder Arora (2007). "Statistics Hacking — Exploiting Vulnerabilities in News Websites" (PDF). International Journal of Computer Science and Network Security. 7: 342–347.
  7. ^ Smart Captcha - Protect Web Form .COM
  8. ^ "Breaking CAPTCHAs Without Using OCR". Howard Yeend (pureMango.co.uk). 2005. Retrieved 2006-08-22.
  9. ^ "Online services allow MD5 hashes to be cracked". Retrieved 2007-01-04.
  10. ^ Kumar Chellapilla, Kevin Larson, Patrice Simard, Mary Czerwinski (2005). "Computers beat Humans at Single Character Recognition in Reading based Human Interaction Proofs (HIPs)" (PDF). Microsoft Research. Retrieved 2006-08-02.
  11. ^ Breaking a Visual CAPTCHA
  12. ^ Breaking the PayPal HIP
  13. ^ Breaking ASP Security Image Generator
  14. ^ PWNtcha - captcha decoder
  15. ^ Examples of breakings - CAPTCHA.ru
  16. ^ Network Security Research and AI
  17. ^ Dawson (2008-04-15). "Windows Live Hotmail CAPTCHA Cracked, Exploited". Slashdot. SourceForge. Retrieved 2008-04-16.
  18. ^ Dawson (2008-02-26). "Gmail CAPTCHA Cracked". Slashdot. SourceForge. Retrieved 2008-04-16.
  19. ^ Gregg Keizer, "Spammers' bot cracks Microsoft's CAPTCHA: Bot beats Windows Live Mail's registration test 30% to 35% of the time, says Websense", Computerworld, February 7, 2008
  20. ^ Websense® - Blog: Google’s CAPTCHA busted in recent spammer tactics
  21. ^ A Low-cost Attack on a Microsoft CAPTCHA
  22. ^ Doctorow, Cory (2004-01-27). "Solving and creating CAPTCHAs with free porn". Boing Boing. Retrieved 2006-08-22.
  23. ^ AP: Scams Use Striptease to Break Web Traps[dead link]
  24. ^ PC Magazine: Striptease Used to Recruit Help in Cracking Sites
  25. ^ Ulanoff, Lance (October 31, 2007). "Deep-Sixing CAPTCHA". PC Magazine. Ziff Davis Media. Retrieved 2007-12-12.
  26. ^ "TicketMaster v. RMG".
  27. ^ a b The Cutest Human-Test: KittenAuth from ThePCSpy.com
  28. ^ Asirra from Microsoft Research (PDF)
  29. ^ Asirra: A CAPTCHA that Exploits Interest-Aligned Manual Image Categorization from Microsoft Research (PDF)
  30. ^ The 3-D CAPTCHA from SpamFizzle.com