I’ve known about WWW::Mechanize for a while, but never really bothered to look into it. I figured that between LWP::UserAgent and HTML::TokeParser, there was nothing I couldn’t do.
So, I find myself flipping through Spidering Hacks, and I see they have some examples using it. Interested, I read further…
Normally, when scraping a web page, you fetch the page, parse the HTML to find the elements, fill out forms or find your links, then follow the next link. WWW::Mechanize builds on LWP::UserAgent so that the fetching is pretty much the same, but the parsing is almost completely automated.
One of the most common tasks is filling out a form. The old way, I’d get the page, then parse it with HTML::Form:
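(A rough sketch; the URL is just a placeholder.)

```perl
use LWP::UserAgent;
use HTML::Form;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://www.example.com/login');   # placeholder URL
die $res->status_line unless $res->is_success;

# Parse every form on the page into HTML::Form objects.
my @forms = HTML::Form->parse($res->decoded_content, $res->base);
```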
Then, cycle through the elements to find what I need:
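(Another sketch; the field names 'email' and 'password' are assumptions for the sake of the example.)

```perl
# Hunt through each form for the login fields.
for my $form (@forms) {
    my %has = map { $_->name => 1 } grep { defined $_->name } $form->inputs;
    next unless $has{email} && $has{password};

    $form->value( email    => 'me@example.com' );
    $form->value( password => 'secret' );

    # click() builds the HTTP request that pressing the submit button would send.
    my $response = $ua->request( $form->click );
    last;
}
```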
With WWW::Mechanize, a lot of that is hidden. You can even skip parsing the form entirely and just fill in the first two visible text fields, for example to log in to a protected site:
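(A sketch with a placeholder URL; set_visible() fills visible fields in the order they appear on the page, so on a typical login page the first two are the username and password boxes.)

```perl
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://www.example.com/login');      # placeholder URL
$mech->set_visible('me@example.com', 'secret');  # fills the first two visible fields
$mech->submit;
```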
Another, more obnoxious, task was following a link identified by its text, i.e. finding an anchor tag, followed by a string, followed by a closing anchor tag. The text could also contain bold tags, so a great deal of care was necessary. For that, I had to crack out HTML::TokeParser and parse the page. With WWW::Mechanize:
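(Assuming the link text is something like “Next page”.)

```perl
$mech->follow_link( text => 'Next page' );   # or text_regex => qr/next\s+page/i
```

One call, and $mech is now sitting on the linked page.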
Duh. That’s pretty simple.
So, I present here a simple script that goes to the Amazon.ca site, follows the “associates link”, logs in, and checks the stats. I end up with a web page; actually pulling out the numbers will require HTML::TokeParser, but that’s not a big deal (WWW::Mechanize is amazing, not miraculous!). The script is about 20 lines; it would probably have taken me three times that to write the old way, and that version would still sacrifice the flexibility to adapt if Amazon changes their page.
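Here’s a sketch along those lines; the credentials are placeholders, and the link text and report label are assumptions about what the Amazon.ca pages said at the time:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $email    = 'me@example.com';   # placeholder credentials
my $password = 'secret';

my $mech = WWW::Mechanize->new;

# Start at the home page and follow the associates link.
$mech->get('http://www.amazon.ca/');
$mech->follow_link( text_regex => qr/associates/i );

# Log in: the first two visible fields are the e-mail and password boxes.
$mech->set_visible( $email, $password );
$mech->submit;

# Follow the link to the earnings report and dump the raw page;
# pulling the actual numbers out of it is a job for HTML::TokeParser.
$mech->follow_link( text_regex => qr/report/i );
print $mech->content;
```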
Some more links:
http://www.stonehenge.com/merlyn/LinuxMag/col53.html
http://www.stonehenge.com/merlyn/LinuxMag/col54.html
http://use.perl.org/~petdance/journal/17232