Wednesday, August 29, 2012

Extract Information from JavaScript Enabled Content with Perl and V8

A common challenge for anyone who extracts information from Web pages is that more and more Web content is generated by JavaScript, which makes it much less accessible than content that resides solely in HTML. This is one of the reasons that JavaScript-based obfuscation is used to protect against email address harvesting, as in the HTML shown below:

<title>Contact XYZ inc</title>
<H1>Contact XYZ inc</H1><br>
<p>For more information about XYZ inc, please contact us at the following Email address</p>
<script type="text/javascript" language="javascript">
// Email obfuscator script 2.1 by Tim Williams, University of Arizona
// Random encryption key feature by Andrew Moulden, Site Engineering Ltd
// This code is freeware provided these four comment lines remain intact
// A wizard to generate this code is at
{ coded = "OKUxkq@KwtoO2K.0ko"
  key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
  shift=coded.length
  link=""
  for (i=0; i<coded.length; i++) {
    if (key.indexOf(coded.charAt(i))==-1) {
      ltr = coded.charAt(i)
      link += (ltr)
    }
    else {
      ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
      link += (key.charAt(ltr))
    }
  }
document.write("<a href='mailto:"+link+"'>"+link+"</a>")
}
</script><noscript>Sorry, you need Javascript on to email me.</noscript>

When I have to extract information from sites that use JavaScript to serve up content, I find the JavaScript::V8 module very helpful. Here is a segment of Perl code that uses the V8 JavaScript engine to extract the email address from the HTML page shown above.


use JavaScript::V8;
use LWP;
use Text::Balanced qw(extract_codeblock);
use strict;
use warnings;

#delimiter used to distinguish code blocks for use with Text::Balanced
my $delim='{}';

#downloads Web page
my $ua=LWP::UserAgent->new;
my $response=$ua->get('http://localhost/email.html');
my $result=$response->content;

#print "$result\n\n";

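#extracts JavaScript: takes the body of the first <script> element, skips
#everything before the opening brace, and lets Text::Balanced pull out the
#balanced {...} block (a reconstruction; the original lines are missing here)
my ($script)=$result=~m{<script\b[^>]*>(.*?)</script>}is
    or die "No script element found\n";
$script=~s/\A[^{]*//;
my $js=extract_codeblock($script,$delim);
die "No JavaScript code block found\n" unless defined $js && length $js;

#modified JS to make it processable by V8 module: V8 has no browser document
#object, so document.write becomes the write function bound below, and a
#trailing "link" expression makes eval return the decoded address
#(both modifications are assumptions about what the original code did)
$js=~s/\bdocument\.write\b/write/g;
$js.="\nlink;";

#print "$js\n\n";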

#processes JS
my $context = JavaScript::V8::Context->new();
$context->bind_function(write => sub { print @_ });

my $mail=$context->eval("$js");

print "$mail\n\n";
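
As a cross-check on what the V8 run should produce, the decoding loop can also be ported to Perl by hand. The sketch below assumes the obfuscator shifts each character back by the length of the coded string, which is consistent with the sample page. The name decode_email is made up for this illustration and is not part of the original script.

```perl
use strict;
use warnings;

# Pure-Perl port of the page's decoding loop (illustrative helper only).
sub decode_email {
    my ($coded, $key) = @_;
    my $shift   = length($coded);   # assumption: shift equals the coded length
    my $key_len = length($key);
    my $link    = '';
    for my $char (split //, $coded) {
        my $pos = index($key, $char);
        if ($pos == -1) {
            # characters that are not in the key (such as "@" and ".") pass through
            $link .= $char;
        }
        else {
            # step back $shift places in the key, wrapping around the start
            $link .= substr($key, ($pos - $shift + $key_len) % $key_len, 1);
        }
    }
    return $link;
}

my $coded = 'OKUxkq@KwtoO2K.0ko';
my $key   = 'l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn';
print decode_email($coded, $key), "\n";   # prints person@example.com
```

Hand-porting like this only works when the obfuscation scheme is known and stable. Running the page's own script in V8 avoids having to reimplement each scheme.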

Wednesday, August 22, 2012

A GDB-like Debugger for Perl – Devel::Trepan

Earlier today I saw an interesting talk at a meeting of the NY Perl Mongers in which Rocky Bernstein highlighted some of his recent work creating a GDB-like debugger for Perl. While Perl does ship with its own integrated debugger, which can be invoked with the -d switch, the Devel::Trepan module he demonstrated appears to have the potential to one day rival or even exceed it. The Trepan module borrows much of its command set from the GDB debugger, which many users who have migrated to Perl from C/C++ might find helpful. Moreover, although it is still a work in progress, the debugger as demonstrated already appears functional enough to be evaluated seriously for debugging tasks. This is a module that I will definitely be taking a closer look at in the near future, and one that I think is worth a look for other Perl programmers as well. More information on the module can be found at 

Thursday, August 16, 2012

Effectively Timeout Slow HTTP Requests with LWPx::ParanoidAgent

One of the potential pitfalls of writing spiders, or any type of application that makes HTTP requests, is that a slow or intermittent connection to the destination server can make your application hang. While modules such as LWP do have a timeout parameter, it is implemented in a way that only works well for non-responsive sites. Responsive but very slow sites will often cause LWP to keep the connection alive, leaving your application hanging for longer than you want. One way to deal with this issue is to use the LWPx::ParanoidAgent module. The module is a derivative of LWP, but rather than basing its timeout on the time since the last socket read, it starts its timeout counter at the moment the request is made. Thus if you specify a 10 second timeout, 10 seconds is the maximum amount of time allotted for the completion of the request. This module is used almost identically to the LWP module. For example:

use LWPx::ParanoidAgent;

my $ua=LWPx::ParanoidAgent->new;

$ua->timeout(30); #in seconds

my $response=$ua->get('');
my $result=$response->content;
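
The difference between a per-read timeout and an overall deadline can be illustrated with core Perl's alarm. This is only a sketch of the general idea, not how LWPx::ParanoidAgent is implemented internally. The helper with_deadline is hypothetical, the slow request is simulated with a 5 second pause, and alarm may not behave the same way on every platform.

```perl
use strict;
use warnings;

# Runs $code under an overall deadline of $seconds and returns its result,
# or undef if the deadline passes (or $code dies for any other reason).
sub with_deadline {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "deadline exceeded\n" };
        alarm $seconds;          # the clock starts before the work begins
        $result = $code->();
        alarm 0;                 # finished in time: cancel the pending alarm
        1;
    };
    alarm 0;                     # make sure no alarm is left pending
    return $ok ? $result : undef;
}

# Stand-in for a slow request; four-argument select pauses without using
# sleep, which perldoc advises against mixing with alarm.
my $slow = with_deadline(1, sub { select(undef, undef, undef, 5); 'finished' });
my $fast = with_deadline(5, sub { 'finished' });

print defined $slow ? "slow: $slow\n" : "slow: gave up at the deadline\n";
print "fast: $fast\n";
```

The point of the sketch is that the deadline is measured from the start of the call, so a task that keeps making slow progress is still cut off once the total time is used up.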

Another interesting feature of this module is that it allows you to specify whitelists and blacklists, giving you control over which hosts the module will actually attempt to connect to. While the near universality of LWP may often make it the better choice, the LWPx::ParanoidAgent module is worth keeping in mind for any project that makes HTTP requests to sites with questionable network connectivity.

Friday, August 10, 2012

Test and Debug Your Web Applications with Tamper Data

Most Perl programmers at some point in their career are involved in a project that includes a bit of Web development. One of the Firefox plug-ins that I occasionally find useful for debugging and testing Web applications is Tamper Data. From a debugging perspective, it allows you to capture HTTP and HTTPS headers as well as POST parameters, so you can verify the requests that are being sent to your Web application. On the testing side, some basic security testing can be done as well, since it allows you to modify captured HTTP/HTTPS headers and POST parameters prior to transmission. While someone who is heavily involved in the security testing of Web applications would likely be better served by a more robust intercepting proxy (e.g. Burp Proxy), it is a nice plug-in for introducing people to some of the basic techniques that can be used to test Web application security. An example of a captured Facebook login request can be seen below.


Notice how it shows the different POST parameters and their values. Any one of these parameters in the request could then be modified and submitted to the site. Once the “OK” button is clicked, the request will be forwarded to the Web application, including whatever modifications you have made.