Sean’s Obsessions

Sean Walberg’s blog

WWW::Mechanize Is Amazing!

I’ve known about WWW::Mechanize for a while, but never really bothered to look into it. I figured that between LWP::UserAgent and HTML::TokeParser, there was nothing I couldn’t do.

So, I find myself flipping through Spidering Hacks, and I see they have some examples using it. Interested, I read further…

Normally, when scraping a web page, you fetch the page, parse the HTML to find the elements, fill out forms or find your links, then follow the next link. WWW::Mechanize builds on LWP::UserAgent so that the fetching is pretty much the same, but the parsing is almost completely automated.

One of the most common tasks is filling out a form. The old way, I'd get the page, then parse it with HTML::Form:

my @forms = HTML::Form->parse($page, $url);
# figure out what form I want
$form = $forms[0];

Then, cycle through the elements to find what I need:

for my $i ($form->inputs) {
   $i->value("something");
}
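To make that concrete, here is a self-contained sketch of the old approach (the login form HTML is made up; normally $page would come back from an LWP::UserAgent request):

```perl
#!/usr/bin/perl -w
use strict;
use HTML::Form;

# Hypothetical login form; normally this HTML comes from LWP::UserAgent.
my $page = <<'HTML';
<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log In">
</form>
HTML

my @forms = HTML::Form->parse($page, "http://example.com/");
my $form  = $forms[0];

# Fill in the fields by name rather than looping blindly.
$form->value(username => "sean");
$form->value(password => "secret");

# click() builds the HTTP::Request you would hand back to LWP::UserAgent.
my $request = $form->click;
print $request->method, " ", $request->uri, "\n";
```

That's a fair amount of bookkeeping for one login, and it's exactly the part WWW::Mechanize hides.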

With WWW::Mechanize, a lot of that is hidden. You can even skip parsing the form entirely and just fill in the first two visible fields, say to log in to a protected site:

$mech->form_number(1);
$mech->set_visible("username", "password");

Another, more obnoxious, task was following a link identified by its text: find an anchor tag, followed by a string, followed by a closing anchor tag. The text could also contain bold tags, so a great deal of care was necessary. For this, I had to crack out HTML::TokeParser and parse the page. With WWW::Mechanize:

$mech->follow_link(text_regex => qr/Log In/i);
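find_link works the same way if you want to inspect a link before following it. A small sketch, using an inline data: URL (which LWP understands) so it runs without a live site; the page and link text are invented:

```perl
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# Invented inline page -- note the bold tags inside the anchor text.
$mech->get('data:text/html,<a%20href="/login">Please%20<b>Log%20In</b>%20here</a>');

# text_regex matches against the link text with the markup stripped.
my $link = $mech->find_link(text_regex => qr/Log In/i);
print $link->url, "\n";    # the href as written
```

The bold tags inside the anchor, which used to force careful TokeParser work, simply don't matter here.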

Duh. That’s pretty simple.

So, I present here a simple script that goes to the Amazon.ca site, follows the “associates link”, logs in, and checks the stats. I end up with a web page; actually pulling the numbers out of it will require HTML::TokeParser, but that's not a big deal (WWW::Mechanize is amazing, not miraculous!). This script is about 20 lines; it would probably have taken me three times that to write it the old way, and that would still sacrifice the flexibility to adapt if Amazon changes their page.

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my ($username, $password) = qw/username password/;

my $mech = WWW::Mechanize->new();
$mech->agent_alias("Linux Mozilla");    # identify as a browser
$mech->get("http://amazon.ca");
die $mech->response->status_line unless $mech->success;

$mech->follow_link(text_regex => qr/Join Associates/i);
$mech->follow_link(text_regex => qr/Sign-in here/i);
$mech->form_number(1);
$mech->set_visible($username, $password);
$mech->submit;

$mech->follow_link(text_regex => qr/View Reports/i);
print $mech->content;
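As noted above, pulling the actual numbers out of the report page still takes HTML::TokeParser. A minimal sketch of that last step; the table HTML here is invented, and the real Amazon page will differ:

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Invented stand-in for the report page; the real markup will differ.
my $html = '<table><tr><td>Items shipped</td><td><b>42</b></td></tr></table>';

my $p = HTML::TokeParser->new(\$html);

# Skip to the first cell for the label, then the next cell for the value.
# get_trimmed_text ignores nested tags like <b>, returning just the text.
$p->get_tag("td");
my $label = $p->get_trimmed_text("/td");
$p->get_tag("td");
my $value = $p->get_trimmed_text("/td");
print "$label: $value\n";
```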

Some more links:
http://www.stonehenge.com/merlyn/LinuxMag/col53.html
http://www.stonehenge.com/merlyn/LinuxMag/col54.html
http://use.perl.org/~petdance/journal/17232
