So, I find myself flipping through Spidering Hacks, and I see they have some examples using WWW::Mechanize. Interested, I read further…
Normally, when scraping a web page, you fetch the page, parse the HTML to find the elements, fill out forms or find your links, then follow the next link. WWW::Mechanize builds on LWP::UserAgent so that the fetching is pretty much the same, but the parsing is almost completely automated.
One of the common tasks is filling out a form. The old way, I'd fetch the page, then parse it with HTML::Form:
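A sketch of that first step, assuming a placeholder URL, might look like this:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Form;

# Fetch the page the old-fashioned way with LWP::UserAgent.
my $ua       = LWP::UserAgent->new;
my $response = $ua->get('http://example.com/login');   # placeholder URL
die $response->status_line unless $response->is_success;

# Parse the forms on the page; HTML::Form->parse can take the
# HTTP::Response directly and returns one object per <form>.
my @forms = HTML::Form->parse($response);
my $form  = $forms[0];
```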
Then, cycle through the elements to find what I need:
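Roughly like this; the field names here are made up for illustration:

```perl
# Walk the form's inputs, looking for the fields we care about.
for my $input ($form->inputs) {
    next unless defined $input->name;
    if ($input->name eq 'user') {            # assumed field name
        $input->value('my_username');
    }
    elsif ($input->name eq 'pass') {         # assumed field name
        $input->value('my_password');
    }
}

# Turn the filled-in form into a request and send it off.
my $login_response = $ua->request($form->click);
```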
With WWW::Mechanize, a lot of that is hidden. You can even ignore the parsing of the form, and fill in the first two visible text fields, such as to log in to a protected site:
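That collapses to a few lines with Mechanize's set_visible, which fills the visible fields in page order so you never have to name them (URL and credentials below are placeholders):

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://example.com/login');   # placeholder URL

# Fill the first two visible fields (username, then password)
# without knowing their names, then submit the form.
$mech->set_visible('my_username', 'my_password');
$mech->submit;
```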
Another, more obnoxious, task was following a link identified by its text: find an anchor tag, followed by the string, followed by a closing anchor tag. The text could also contain bold tags, so a great deal of care was necessary. For this, I had to crack out HTML::TokeParser and parse the page. With WWW::Mechanize:
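It comes down to one call (the link text here is invented):

```perl
# follow_link finds the anchor by its text and fetches it in one
# step; a text_regex match copes with markup like <b> inside the
# anchor, which the hand-rolled TokeParser approach had to handle.
$mech->follow_link( text_regex => qr/associates/i );
```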
Duh. That’s pretty simple.
So, I present here a simple script that goes to the Amazon.ca site, follows the “associates” link, logs in, and checks the stats. I end up with a web page; actually pulling the numbers out will require HTML::TokeParser, but that’s not a big deal (WWW::Mechanize is amazing, not miraculous!). This script is about 20 lines; it would probably have taken me three times that to write the old way, and that would still sacrifice the flexibility to adapt if Amazon changes their page.
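The original listing didn't survive, but the shape of the script is roughly this; the link texts and form field names are placeholders, not Amazon's real ones:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck makes every get/submit die on failure, so no
# per-request error checking clutters the script.
my $mech = WWW::Mechanize->new( autocheck => 1 );

# Start at the front page and follow the associates link.
$mech->get('http://www.amazon.ca/');
$mech->follow_link( text_regex => qr/associates/i );   # assumed link text

# Log in: pick the first form, fill it, submit it.
$mech->form_number(1);
$mech->field( email    => 'me@example.com' );   # assumed field name
$mech->field( password => 'secret' );           # assumed field name
$mech->submit;

# Navigate to the stats page and dump the HTML; pulling the
# actual numbers out of $mech->content is HTML::TokeParser's job.
$mech->follow_link( text_regex => qr/reports/i );      # assumed link text
print $mech->content;
```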