Testing Web App CAPTCHA controls
August 20, 2009
CAPTCHA ("Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge-response test used by many web applications to ensure that the response is not generated by a computer. CAPTCHA implementations are often vulnerable to various kinds of attacks even if the generated CAPTCHA is unbreakable.
I've had a few questions on testing CAPTCHAs lately and decided to do a quick write-up on how I test the strength of a CAPTCHA or, in some cases, write a CAPTCHA breaker. I will start with a quick test that I use to gauge the initial strength of a CAPTCHA implementation (Microsoft OneNote has excellent handwriting detection and is very easy to use for this purpose):
- Copy the image contents to my clipboard
- Open up OneNote (or your favorite OCR tool)
- Paste the image onto a OneNote page.
- Right-click the image and choose "Copy Text from Picture"
- Now you will have the recognized text on your clipboard. Paste that into Notepad and compare it with the text shown in the CAPTCHA.
- If there is noise in the middle of the text, such as a curved line, make the image very large and stretch the image vertically. Then pass this through a handwriting detection library. The stretching appears to make the noise in the middle less prominent. Note: This is based on my own personal tests and not concrete science.
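As a rough illustration of the stretching step, the sketch below repeats every pixel row of a tiny bitmap a fixed number of times (nearest-neighbour scaling). The '#'/'.' row format, the stretch_vertically name and the sample data are assumptions made for the example; on a real image you would more likely let ImageMagick resize with a height much larger than the width.

```perl
use strict;
use warnings;

# Stretch a bitmap vertically by an integer factor (nearest-neighbour).
# The bitmap is a list of equal-length strings, one per pixel row, where
# '#' is a dark pixel and '.' is a light one (an assumed toy format).
sub stretch_vertically {
    my ($rows, $factor) = @_;
    die "factor must be a positive integer\n"
        unless defined $factor && $factor =~ /^[1-9][0-9]*$/;
    my @out;
    for my $row (@$rows) {
        # Repeat each row $factor times; the width is left unchanged.
        push @out, ($row) x $factor;
    }
    return \@out;
}

# Hypothetical 3-row sample with a thin line across the middle row.
my @sample = (
    '#..#',
    '####',
    '#..#',
);
my $tall = stretch_vertically(\@sample, 3);
print "$_\n" for @$tall;
```

This only shows the geometry of the stretch. Whether it actually helps recognition depends on the OCR tool, as noted above.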
There are a few other things I will try before attempting to solve the image. Remember: if you can't script the transformation, you have defeated the purpose of the test.
- Convert the image to black and white (this, for whatever reason, filters out a ton of background noise).
- Many CAPTCHAs use a static piece of noise, like a curved line through the middle of the word. You can often get around this by doing a static crop of a region of the image.
- Cut the image up into a grid, one cell per character. This can easily be done with a Photoshop script or ImageMagick, but I have not gone through the trouble of writing one in a long time. See the example in Figure 2. The cell boundaries can be found by examining each pixel in the image: take the leftmost black pixel as the starting point, then find the right edge of each letter where the run of connected black pixels ends. This assumes, however, that there is a clear boundary between letters. The CAPTCHA is then easier to solve as a series of single-character images than as a single image.
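The black-and-white conversion and the static crop from the list above can be sketched on a plain grid of grayscale values. This is a minimal illustration, not a replacement for ImageMagick: the grid layout (0 = black, 255 = white), the cutoff of 128, the function names and the sample data are all assumptions made for the example.

```perl
use strict;
use warnings;

# Convert a grayscale grid (0 = black .. 255 = white) to pure black/white.
# Pixels darker than $cutoff become 1 (ink); everything else becomes 0.
sub to_black_and_white {
    my ($gray, $cutoff) = @_;
    my @bw;
    for my $row (@$gray) {
        push @bw, [ map { $_ < $cutoff ? 1 : 0 } @$row ];
    }
    return \@bw;
}

# Keep only the rectangle that starts at column $x, row $y and has the
# given width and height. Anything outside it, such as a noise line that
# always sits in the same place, is discarded.
sub crop {
    my ($grid, $x, $y, $width, $height) = @_;
    die "crop region outside image\n"
        if $y + $height > @$grid || $x + $width > @{ $grid->[0] };
    my @out;
    for my $r ($y .. $y + $height - 1) {
        push @out, [ @{ $grid->[$r] }[ $x .. $x + $width - 1 ] ];
    }
    return \@out;
}

# Hypothetical sample: light background, two dark strokes, and a static
# noise line along the bottom row.
my @gray = (
    [ 250, 240, 245, 250, 235 ],
    [ 250,  20, 200,  30, 240 ],
    [ 250,  25, 210,  35, 245 ],
    [  90,  80,  85,  95,  70 ],
);

my $bw    = to_black_and_white(\@gray, 128);
my $clean = crop($bw, 1, 0, 3, 3);   # drop the border column and noise row
print join('', map { $_ ? '#' : '.' } @$_), "\n" for @$clean;
```

A fixed cutoff is the simplest possible choice. A CAPTCHA with a dark background or coloured text would need a different cutoff or an inverted comparison.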
There is a huge weakness in the CAPTCHA in Figure 2 (in use by many prominent online retailers): none of the characters actually touch. You could easily write a script that finds the leftmost black pixel of each character and the rightmost pixel still connected to it. This gives you the character boundaries, which can then be used to build a grid containing each letter. You may have trouble with narrow characters such as the number one, lowercase L and the letter I; however, it is for this very reason that many CAPTCHAs exclude those characters from their character set.
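A minimal sketch of that boundary search follows. It uses a simplification, a column projection, rather than tracing connected pixels: a column counts as part of a character if any row has ink in it, and a fully blank column ends the character. The '#'/'.' bitmap format, the character_bounds name and the sample data are assumptions made for the example.

```perl
use strict;
use warnings;

# Find the horizontal extent of each character in a black/white bitmap.
# Each unbroken run of inked columns is returned as [first_col, last_col].
sub character_bounds {
    my (@rows) = @_;
    my $width = length $rows[0];
    my (@bounds, $start);
    for my $col (0 .. $width - 1) {
        # Number of rows that have a dark pixel in this column.
        my $inked = grep { substr($_, $col, 1) eq '#' } @rows;
        if ($inked && !defined $start) {
            $start = $col;                        # left edge of a character
        }
        elsif (!$inked && defined $start) {
            push @bounds, [ $start, $col - 1 ];   # a blank column ends it
            undef $start;
        }
    }
    # A character that touches the right edge of the image.
    push @bounds, [ $start, $width - 1 ] if defined $start;
    return @bounds;
}

# Hypothetical three-character bitmap whose characters do not touch.
my @captcha = (
    '.##..#...###.',
    '.#.#.#...#...',
    '.##..#...##..',
    '.#.#.#...#...',
    '.##..###.###.',
);

my @cells = character_bounds(@captcha);
for my $cell (@cells) {
    my ($first, $last) = @$cell;
    print "character spans columns $first-$last\n";
    print substr($_, $first, $last - $first + 1), "\n" for @captcha;
}
```

As soon as two characters touch or overlap, the projection merges them into one cell, which is exactly the weakness this approach exploits and the reason stronger CAPTCHAs crowd their characters together.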
In many ways, automating CAPTCHA strength testing is very similar to handwriting detection and simple tools are widely available for this task including FOSS libraries.
A couple of other CAPTCHA solver libraries are out there, including the somewhat dated PWNCAPTCHA, which was recently open sourced. Here is a list of a few other helpful tools that you can use to build your own CAPTCHA solvers:
- Perl OCR Libraries - http://search.cpan.org/search?query=ocr&mode=all
- OCRopus open source OCR system - http://code.google.com/p/ocropus/
- PerlMagick, the Perl interface to the ImageMagick image manipulation library - http://www.imagemagick.org/script/perl-magick.php
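The script below is a framework for a tool that performs some of the image transformations I described, using ImageMagick:
# CAPTCHA Solver v1 - A simple tool for image transformations and OCR to solve CAPTCHA
# Author: Mark Maxey - firstname.lastname@example.org
# Version 1.0
use strict;
use warnings;
use Image::Magick;
# read in the image (path given as the first command-line argument)
my $file = shift or die "usage: $0 captcha-image\n";
my $image = Image::Magick->new;
my $err = $image->Read($file);
die "$err\n" if "$err";
# turn the image to black and white
$image->Quantize(colorspace => 'gray');
$image->Threshold(threshold => '50%');
# cropping the image to eliminate static noise
# (placeholder geometry WIDTHxHEIGHT+X+Y; set it to the region holding the text)
$image->Crop(geometry => '200x60+0+0');
$image->Set(page => '0x0+0+0');
# resize the image
my $img_width = 2000;
my $ratio_main = 1;
my $img_height = 2000;
$image->Resize(width => $img_width * $ratio_main, height => $img_height * $ratio_main);
# save the result for the OCR step (placeholder file name)
$image->Write('captcha-processed.png');
# OCR Code here
# if you can't figure this part out you shouldn't be doing this
# end OCR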
Some key things to remember when testing a CAPTCHA:
1. Eliminate as much noise as you can, which is generally easy by just converting the image to black and white
2. Identify areas where static noise can be eliminated by cropping
3. Some OCR toolkits can limit the character set to specific characters (no special characters and all lowercase for example). Use this where applicable to improve the accuracy of the test
4. Turning the CAPTCHA into a grid will often make it very easy to solve by clearly defining character boundaries
5. If the CAPTCHA does not involve text you probably can't solve it using the methods I described above
6. Increase the size of the image. This will help you home in on where the boundaries are and makes a lot of the noise much easier to deal with
7. Sometimes a CAPTCHA, if there are parameters available for tampering, can be used to DoS a site or cause other problems. Quite often you will see a parameter like width=200&height=350, so what happens if you set these to 999999999999 x 99999999999999999?
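Point 3 is best handled by the OCR toolkit itself, but when a tool has no such option, a rough fallback is to filter its output against the character set the CAPTCHA is known to use. The sketch below does only that filtering. The restrict_to_charset name, the allowed set and the sample OCR output are assumptions made for the example, and this is weaker than a real whitelist because the recognizer never sees the restriction.

```perl
use strict;
use warnings;

# Drop every character from $text that is not in the $allowed string.
sub restrict_to_charset {
    my ($text, $allowed) = @_;
    my %ok = map { $_ => 1 } split //, $allowed;
    return join '', grep { $ok{$_} } split //, $text;
}

# Assumed character set: lowercase letters and digits only.
my $allowed = join '', 'a' .. 'z', '0' .. '9';

# Hypothetical raw OCR output containing stray punctuation from noise.
my $raw     = 'x7k-q.m';
my $cleaned = restrict_to_charset($raw, $allowed);
print "$cleaned\n";
```

If the cleaned result is shorter than the known CAPTCHA length, that is a useful signal to retry with a different transformation rather than submit a guess.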