nemetral.net | Insightful posts on design and code

June 10, 2008 · Category Webdev

The pursuit of APIness (part 1)

This article was written by nemetral.

Say you need to upload a set of 100 pictures to Flickr every day. Not so difficult: you log in to your Flickr account and upload the pictures by hand. After a few days, though, you start to feel a bit weird about spending all this time uploading files manually in an era when computers, after all, are supposed to take repetitive tasks off our hands. Say your daily sets of 100 files are prepared in advance: wouldn’t it be great if your computer could upload them to Flickr on its own?

Part 1: The dirty way

The first thing that pops into your mind is: let’s make my computer fill in the forms by itself! The idea is pretty straightforward: you write a script that logs in to your Flickr account with your credentials and, once there, fills in the form fields. Fill in the form fields? Does that mean I will watch my computer in action, with the cursor moving from one field to the next, opening the file selection dialog box, double-clicking on a file, moving on to the next field, etc.? Well, not really: all of that is pure client-side stuff, and the server never sees any of it.

In a form, only three things matter: the method (usually GET or POST), the names of the fields and the action URL (i.e. the URL your browser sends a request to when you press the “submit” button). At the end of the day, “automatically filling out a form” simply means sending a GET or POST request to the action URL with correct field names and values.

Let’s move on to a simple case study. When filling out a GET form on the cool website http://www.community.com (a fictitious example), here is what 23-year-old Joe from Indianapolis would send to the server:

GET /submit.php?name=joe&city=indianapolis&age=23 HTTP/1.1
Host: www.community.com

Had the form used the POST method instead, here is what Joe would have sent:

POST /submit.php HTTP/1.1
Host: www.community.com
Content-Length: 33
Content-Type: application/x-www-form-urlencoded

name=joe&city=indianapolis&age=23
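To make the link between form fields and raw requests concrete, here is a minimal PHP sketch that rebuilds both requests above from an array of field values. It only builds the strings and sends nothing over the network; the variable names are illustrative.

```php
<?php
// Field names and values, exactly as the form would submit them.
$fields = ['name' => 'joe', 'city' => 'indianapolis', 'age' => 23];

// URL-encode the fields into "name=value" pairs joined by "&".
$body = http_build_query($fields, '', '&');

// GET: the encoded fields travel in the URL's query string.
$get = "GET /submit.php?$body HTTP/1.1\r\n"
     . "Host: www.community.com\r\n\r\n";

// POST: the encoded fields travel in the request body,
// and Content-Length is the byte length of that body.
$post = "POST /submit.php HTTP/1.1\r\n"
      . "Host: www.community.com\r\n"
      . "Content-Length: " . strlen($body) . "\r\n"
      . "Content-Type: application/x-www-form-urlencoded\r\n\r\n"
      . $body;

echo $get, $post, "\n";
```

Note that the Content-Length of 33 in the POST request is nothing magical: it is simply the number of bytes in the encoded body.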

Note: the requests above are simplified ones, containing only the required headers for GET and POST methods. In fact, many more headers would usually be sent along with these ones (see the list of HTTP headers on Wikipedia).

That was great. Now the fun part: upon receiving the request, the server looks up its database to find other people aged 23 living in Indianapolis and sends a webpage back to Joe with the list of names on it. After submitting his personal information, Joe sees a webpage featuring members Anna and Lisa.

HTTP/1.1 200 OK
Date: Tue, 10 Jun 2008 19:38:07 GMT
Content-Length: 268
Content-Type: text/html

<html>
   <head>
      <title>Community.com rocks!</title>
   </head>
   <body>
      <h1>There are 2 members aged 23 and living in your town:</h1>
      <ul>
         <li>Member #1: Anna</li>
         <li>Member #2: Lisa</li>
      </ul>
   </body>
</html>

Note: the webpage above is a simplified one. Come on, there’s not even a DOCTYPE declaration in it :)

That was so cool. By filling out a plain form, Joe got to know some of his neighbours who also belong to http://www.community.com. Rephrased version: by sending out a request, Joe received a reply with data in it. A human (supposedly a non-geek) would read the data as displayed in the browser, but a script could parse the HTML and extract the relevant bits of information.
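As a rough illustration of what "parsing the HTML" can mean, here is a sketch that feeds the sample page above to PHP's DOM extension and pulls out the list items. This is only one possible approach (it assumes the DOM extension is available), and the variable names are made up for the example.

```php
<?php
// The sample response body from the case study above.
$html = <<<HTML
<html>
   <head>
      <title>Community.com rocks!</title>
   </head>
   <body>
      <h1>There are 2 members aged 23 and living in your town:</h1>
      <ul>
         <li>Member #1: Anna</li>
         <li>Member #2: Lisa</li>
      </ul>
   </body>
</html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Gather the text of every <li> element, then strip the "Member #N: " prefix.
$items = [];
$names = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $text    = trim($li->textContent);
    $items[] = $text;
    $names[] = preg_replace('/^Member #\d+:\s*/', '', $text);
}

print_r($names);
```

A DOM parser cares less about whitespace and attributes than a regex does, but it still depends on the page keeping its overall structure.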

Now Joe is a clever guy who knows his way around computers, so he decides to write a script that exploits this form to retrieve the members of http://www.community.com aged 23 and living in Indianapolis. The script would essentially emulate the web form by sending a request to http://www.community.com and then parsing the HTML result. Such a script could easily be used to find out when new users of the same age and living in the same town become members of http://www.community.com: to do so, Joe would have to run it every day, either by hand or as a daily cron job.

Joe knows PHP and will use its cURL extension (a wrapper around the libcurl library) to send the request, then a regex to parse the result. Here is what the script could look like:

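<?php

   // STEP 1: SEND THE POST REQUEST
   $url = 'http://www.community.com/submit.php';
   $handle = curl_init();
   curl_setopt($handle, CURLOPT_URL, $url);
   curl_setopt($handle, CURLOPT_POST, true);
   curl_setopt($handle, CURLOPT_POSTFIELDS,
               'name=joe&city=indianapolis&age=23');
   // Return the response as a string instead of printing it directly
   curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
   $result = curl_exec($handle);
   if ($result === false) {
      // curl_exec() returns false on failure (network error, timeout...)
      die('Request failed: ' . curl_error($handle));
   }
   curl_close($handle);

   // STEP 2: PARSE THE RESULT AND DISPLAY THE NAMES
   // $match[1] holds the text captured by the parentheses, i.e. the names
   preg_match_all("|<li>Member #[0-9]+: (.*?)</li>|", $result, $match);
   print_r($match[1]);

?>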

Note: it is safer to use more options when writing a cURL request (especially options like CURLOPT_TIMEOUT).
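As a sketch of what "more options" might look like, the helper below groups the request options into one array and adds two timeouts. The function names and the timeout values (5 and 15 seconds) are arbitrary choices for illustration, not recommendations.

```php
<?php
// Build the cURL options for a form POST, with explicit timeouts.
// The timeout values are example figures only.
function form_post_options(string $url, array $fields): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields, '', '&'),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 5,   // seconds allowed to establish the connection
        CURLOPT_TIMEOUT        => 15,  // seconds allowed for the whole transfer
    ];
}

// Send the request; returns the response body, or null on any cURL failure.
function post_form(string $url, array $fields): ?string
{
    $handle = curl_init();
    curl_setopt_array($handle, form_post_options($url, $fields));
    $result = curl_exec($handle);
    if ($result === false) {
        error_log('cURL error: ' . curl_error($handle));
        return null;
    }
    return $result;
}
```

Joe's request would then become a single call such as post_form('http://www.community.com/submit.php', ['name' => 'joe', 'city' => 'indianapolis', 'age' => 23]), and a server that hangs would no longer freeze the script indefinitely.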

In this script, Joe decided to simply print_r() the array of names, but he could just as well have inserted them into a database or sent them by email. The important thing is: by sending a request with the right variables, Joe was able to get structured data back.

The next step would be to extend this script to all ages and cities in the US. Using two nested loops featuring all ages between 21 and 100 and all main cities of the US for example, it would be possible to rebuild a fair part of http://www.community.com’s members database.
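The two nested loops could be sketched as follows. The list of cities is a hypothetical placeholder (the fictitious site does not define one), and the sketch only builds the request bodies without sending anything.

```php
<?php
// Hypothetical list of cities, for illustration only.
$cities = ['indianapolis', 'chicago', 'new york'];

$bodies = [];
foreach (range(21, 100) as $age) {       // every age from 21 to 100 inclusive
    foreach ($cities as $city) {         // every city for that age
        $bodies[] = http_build_query(
            ['name' => 'joe', 'city' => $city, 'age' => $age],
            '',
            '&'
        );
    }
}

echo count($bodies), " request bodies\n";
echo $bodies[0], "\n";
```

With 80 ages and 3 cities this already produces 240 requests, which gives an idea of how quickly such a script multiplies the load on the server.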

As a matter of fact, the script Joe is writing goes well beyond the normal use of the initial form and is certainly not supported by the website owners. If they no longer want such scripts to automatically exploit the form and spit out members’ names, the developers at http://www.community.com have several weapons at their disposal:

  1. the first thing they can do is regularly change the names of the form fields, which means Joe would have to update his script each time there is such a change; otherwise the script would blindly keep sending outdated variables to the server
  2. the second thing they can do is regularly update the HTML code of the delivered webpage: for example, adding a simple class to the list items would break the regex
  3. third move: limiting the number of requests in a given timeframe, e.g. making it impossible for a single IP to send more than one request every 10 minutes
  4. lethal weapon: adding a captcha, which is a much trickier obstacle to get around
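To see two of these weapons in action, here is a small sketch. The first half shows how a class attribute (a hypothetical markup change) makes Joe's regex come back empty; the second half is a deliberately naive, in-memory version of a per-IP rate limit. The function name, the IP address, and the storage are assumptions for illustration; a real site would keep this state somewhere shared, such as a database or cache.

```php
<?php
// Weapon 2: a small markup change defeats the scraper's regex.
$pattern = '|<li>Member #[0-9]+: (.*)</li>|';
$before  = '<li>Member #1: Anna</li>';
$after   = '<li class="member">Member #1: Anna</li>'; // hypothetical new markup

$matchesBefore = preg_match_all($pattern, $before, $m1); // one match
$matchesAfter  = preg_match_all($pattern, $after, $m2);  // no match at all

// Weapon 3: accept at most one request per IP per time window.
// $lastSeen maps an IP address to the timestamp of its last accepted request.
function allow_request(array &$lastSeen, string $ip, int $now, int $window = 600): bool
{
    if (isset($lastSeen[$ip]) && $now - $lastSeen[$ip] < $window) {
        return false; // too soon after the previous accepted request
    }
    $lastSeen[$ip] = $now;
    return true;
}

$lastSeen = [];
$first  = allow_request($lastSeen, '203.0.113.7', 1000); // accepted
$second = allow_request($lastSeen, '203.0.113.7', 1300); // 5 minutes later: rejected
$third  = allow_request($lastSeen, '203.0.113.7', 1600); // 10 minutes later: accepted

var_dump($matchesBefore, $matchesAfter, $first, $second, $third);
```

Neither measure is bulletproof: Joe can update his regex, and a rate limit only slows him down. But both raise the maintenance cost of the script.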

The script Joe has written is a hacker’s home-made API for http://www.community.com’s members database: it’s a bridge to the content stored in there. Now wouldn’t it be nicer if, instead of sniffing variable names and updating a regex, it were possible to send a structured query to http://www.community.com and get a structured, officially supported and stable response?

(go to part 2 : XML rocks)
