This site will soon be superseded by http://www.johnleitch.net

Tuesday, March 2, 2010

Scraping - reCAPTCHA Hack

After reading about the $25 million online ticket heist and the involvement of the reCAPTCHA service, I decided to see whether the reported flaw was still present. From the article:

[The perpetrators] wrote a script that impersonated users trying to access Facebook, and downloaded hundreds of thousands of possible CAPTCHA challenges from reCAPTCHA. They identified the file ID of each CAPTCHA challenge and created a database of CAPTCHA “answers” to correspond to each ID. The bot would then identify the file ID of a challenge at Ticketmaster and feed back the corresponding answer.

If the writer was referring to the ID passed to http://api.recaptcha.net/image in the query string, the vulnerability appears to be fixed, as the ID is now temporary. However, the images themselves are still reused, so hashing each image's bytes with a cryptographic hash function such as MD5 lets us identify duplicates. The following C# console application downloads a number of CAPTCHA images from reCAPTCHA (set by the imageCount constant), hashes each one, groups the files by hash, then writes the results to a text file. Downloading as few as 1024 images can yield several identical images. Building on this, one could potentially pull off the reCAPTCHA attack described in the article.


using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

namespace reCAPTCHAScrape
{
    class Program
    {
        // Downloads the resource at the given URL and returns its body as text.
        static string Request(string url)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                return reader.ReadToEnd();
        }

        // Fetches one challenge from the demo page and saves its image as <fileNum>.jpg.
        static void GetCaptchaImage(int fileNum)
        {
            // Matches the <script> tag on the demo page that loads the challenge script.
            Regex scriptURLRegex =
                new Regex(@"<script\s*type\s*=\s*""text/javascript""\s*" +
                          @"src\s*=\s*""([^""]+)""\s*><\s*/script>");

            // Matches the temporary challenge ID inside the challenge script.
            Regex scriptRegex = new Regex(@"challenge\s*:\s*'([^']+)'");

            string pageURL = "http://recaptcha.net/fastcgi/demo/recaptcha";

            string resp = Request(pageURL);

            Match scriptURLMatch = scriptURLRegex.Match(resp);

            if (!scriptURLMatch.Success)
                throw new InvalidDataException("Challenge script URL not found.");

            resp = Request(scriptURLMatch.Groups[1].Value);

            Match idMatch = scriptRegex.Match(resp);

            if (!idMatch.Success)
                throw new InvalidDataException("Challenge ID not found.");

            string imageURL = "http://api.recaptcha.net/image?c=" + idMatch.Groups[1].Value;

            // DownloadFile reads the response to the end, so the saved image is never truncated.
            using (WebClient client = new WebClient())
                client.DownloadFile(imageURL, fileNum + ".jpg");
        }

        // Hashes every .jpg in the directory, groups the files by MD5 digest,
        // and writes each digest followed by its files to a results file.
        static void DigestImages(string path)
        {
            DirectoryInfo info = new DirectoryInfo(path);

            FileInfo[] files = info.GetFiles("*.jpg");

            Dictionary<string, List<FileInfo>> digestDictionary =
                new Dictionary<string, List<FileInfo>>();

            using (MD5 md5 = MD5.Create())
            {
                foreach (FileInfo f in files)
                {
                    byte[] digest = md5.ComputeHash(File.ReadAllBytes(f.FullName));

                    // Hex-encode the digest, two lowercase digits per byte.
                    StringBuilder hexStringBuilder = new StringBuilder();

                    foreach (byte b in digest)
                        hexStringBuilder.Append(b.ToString("x2"));

                    string hexString = hexStringBuilder.ToString();

                    List<FileInfo> matches;

                    if (!digestDictionary.TryGetValue(hexString, out matches))
                    {
                        matches = new List<FileInfo>();
                        digestDictionary.Add(hexString, matches);
                    }

                    matches.Add(f);
                }
            }

            StringBuilder results = new StringBuilder();

            foreach (KeyValuePair<string, List<FileInfo>> entry in digestDictionary)
            {
                results.AppendLine(entry.Key);

                foreach (FileInfo f in entry.Value)
                    results.AppendLine(f.FullName);

                results.AppendLine();
            }

            string filename = @".\Results_" + Environment.TickCount + ".txt";

            File.WriteAllText(filename, results.ToString());
        }

        static void Main(string[] args)
        {
            const int imageCount = 1024;

            Console.Write("Downloading images");

            for (int i = 0; i < imageCount; i++)
            {
                try
                {
                    GetCaptchaImage(i);

                    Console.Write(".");
                }
                catch (Exception ex)
                {
                    // Report the failure and continue with the next image.
                    Console.WriteLine(ex.ToString());
                }
            }

            Console.WriteLine("\r\nSearching for matches...");

            DigestImages(@".\");

            Console.WriteLine("Complete. Press any key to continue...");
            Console.ReadKey();
        }
    }
}


A match in the output looks like this:

cf75401ef23c167260aa6d93bb7fbc42
C:\Source\reCAPTCHAScrape\reCAPTCHAScrape\bin\Debug\533.jpg
C:\Source\reCAPTCHAScrape\reCAPTCHAScrape\bin\Debug\869.jpg
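
The grouping step can be tried on its own, without touching the network. The sketch below is only an illustration and not part of the program above: DuplicateFinder, Md5Hex, FindDuplicates, and the sample byte arrays are made-up names and data, and it assumes .NET 4 or later for LINQ and string.Join over a sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFinder
{
    // Returns the MD5 digest of the data as a lowercase hex string.
    public static string Md5Hex(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(data))
                .Replace("-", "")
                .ToLowerInvariant();
    }

    // Groups named blobs by digest and keeps only the digests
    // shared by two or more names.
    public static Dictionary<string, List<string>> FindDuplicates(
        IDictionary<string, byte[]> blobs)
    {
        return blobs
            .GroupBy(kv => Md5Hex(kv.Value), kv => kv.Key)
            .Where(g => g.Count() > 1)
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}

static class Demo
{
    static void Main()
    {
        // Hypothetical stand-ins for downloaded image files.
        var blobs = new Dictionary<string, byte[]>
        {
            { "0.jpg", new byte[] { 1, 2, 3, 4 } },
            { "1.jpg", new byte[] { 1, 2, 3, 5 } }, // differs by one byte
            { "2.jpg", new byte[] { 1, 2, 3, 4 } }  // byte-identical to 0.jpg
        };

        // Prints each shared digest followed by the names that produced it.
        foreach (var group in DuplicateFinder.FindDuplicates(blobs))
            Console.WriteLine(group.Key + ": " + string.Join(", ", group.Value));
    }
}
```

Only 0.jpg and 2.jpg end up in the same group. A cryptographic hash matches byte-identical files only, so an image that differed by even a single byte (for example, one re-encoded before being served) would get a different digest and would not be reported as a duplicate.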

5 comments:

  1. I'm currently testing this (in PHP, though), but it doesn't work.
    After 2000 files, no matches :(

  2. I just ran it again and didn't get any either. It looks like the fun might be over for now, although I have another method in mind.

  3. Hello John, I read your comment about having a new method in mind. I am interested in it. Would you mind posting an article or comment about it?

  4. When I get some free time I'll post an article about it.

  5. John, did you get that free time? I'm interested as well in any new techniques you have in mind.
