Sean’s Obsessions

Sean Walberg’s blog

Perl & LWP Reviewed

A while back I received a copy of Perl & LWP from O’Reilly for review. I read it and loved it. Unfortunately, the review was never published on Cramsession. Since I got some positive feedback after publishing it in my LUG’s newsletter, here it is. I hope it’s helpful.

Short story: this is a great book if you want to automate the pulling and parsing of web pages in Perl. I can’t stress the parsing enough. The man pages are good enough to teach you all you need to know about LWP; this book is great for teaching you what to do with the page once you have it.

Perl & LWP
Sean M. Burke
O’Reilly, 2002
242 pp
$34.95 USD, $54.95 CDN

The only disappointing thing about this book is its title. ‘Perl & LWP’ conjures up images of simply grabbing a web page, maybe posting to a form through Perl. Not much fodder for a book, seeing as a cursory glance at the man pages will tell you how to do that. However, one must dig into the subtitle to see what this book is really about, namely ‘Fetching Web Pages, Parsing HTML, Writing Spiders, and More’.

Unless you’ve done this stuff before, it’s hard to appreciate how difficult it really is. The act of getting data from a web server is trivial. The book covers it quickly, writes a function, and then largely ignores the issue. The trick is what to do with the data once you’ve got it. That’s the hard stuff, unless you’ve got this book by your side.

The book starts off with a quick look at what web automation is, why you’d use it, and where it’s appropriate. Thirty-five pages later, you know how to GET and POST forms several different ways using the LWP class library. Then it’s into a discussion of URLs, and then forms. That’s five chapters and 85 pages. It’s also where the book goes from simply ‘good’ to ‘great’.

Many people will use regular expressions to parse the data from here, and chapter six talks about ways to do that. Basically, it involves a lot of looking at the source and making some rough guesses such as ‘look for a sequence of characters, followed by some spaces, then comes our data.’ The author shows iterative techniques to arrive at the solution. He also shows the shortcomings of this technique.
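To make the regular expression approach concrete, here is a minimal sketch using only core Perl. It is not taken from the book: the HTML fragment, the ‘Sales Rank’ label, and the sales_rank function are all invented for illustration, and a real script would fetch the page with LWP first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page fragment, invented for this example.
my $html = <<'HTML';
<p>Title: <b>Example Book</b></p>
<p>Sales Rank:   <b>1,234</b></p>
HTML

# Look for a label, then optional whitespace, then capture the data.
sub sales_rank {
    my ($page) = @_;
    if ($page =~ m{Sales Rank:\s*<b>([\d,]+)</b>}i) {
        my $rank = $1;
        $rank =~ tr/,//d;    # strip thousands separators
        return $rank;
    }
    return;                  # nothing found
}

print "Rank: ", sales_rank($html), "\n";

# The shortcoming: a harmless markup change defeats the pattern.
my $changed = '<p>Sales Rank: <b class="rank">1,234</b></p>';
print "Changed markup: ",
    (defined sales_rank($changed) ? "matched" : "no match"), "\n";
```

The second call finds nothing because the pattern expects a bare `<b>` tag, which is the kind of fragility that makes regular expressions a quick and dirty tool for HTML.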

To overcome the problems with regular expressions, token parsing is introduced. Again, good ways to use the technique are demonstrated, and its limitations are presented. To overcome these limitations, trees are shown. Though three different methods are used to solve much the same problem, the relative advantages and disadvantages are made clear.
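The difference between tokens and trees is easier to see side by side. The sketch below is not from the book. It assumes the CPAN modules HTML::TokeParser (HTML-Parser distribution) and HTML::TreeBuilder (HTML-Tree distribution) are installed, and the page fragment and both function names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;     # CPAN: HTML-Parser distribution
use HTML::TreeBuilder;    # CPAN: HTML-Tree distribution

# Hypothetical page fragment, invented for this example.
my $html = <<'HTML';
<div id="nav"><a href="/home">Home</a></div>
<div id="news">
  <a href="/story/1" class="headline">First <b>story</b></a>
  <a href="/story/2" class="headline">Second story</a>
</div>
HTML

# Token approach: walk the start tags in document order.
sub links_by_token {
    my ($page) = @_;
    my $stream = HTML::TokeParser->new(\$page) or die "Cannot parse page";
    my @links;
    while (my $tag = $stream->get_tag('a')) {
        my $href = $tag->[1]{href};    # attribute hash of the start tag
        next unless defined $href;
        push @links, [ $href, $stream->get_trimmed_text('/a') ];
    }
    return @links;
}

# Tree approach: find the enclosing element first, then search inside it.
sub headlines_by_tree {
    my ($page) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($page);
    my @links;
    my $news = $tree->look_down(_tag => 'div', id => 'news');
    if ($news) {
        for my $link ($news->look_down(_tag => 'a', class => 'headline')) {
            push @links, [ $link->attr('href'), $link->as_text ];
        }
    }
    $tree->delete;    # release the tree's memory
    return @links;
}

printf "token: %s => %s\n", @$_ for links_by_token($html);
printf "tree:  %s => %s\n", @$_ for headlines_by_tree($html);
```

The token walk copes with the nested `<b>` tag that would trip up a simple regular expression, but it returns every link on the page, navigation included. The tree version can ask for only the links inside the news block, at the cost of building the whole document in memory.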

This book is full of examples. Every example is from a real, live site, from grabbing a book’s rankings from Amazon.com, to checking out licence plates from the California DMV, to pulling out news items from BBC News. His code is exceptionally clear; very few Perl tricks are used, meaning that you don’t need to be a Perl expert to make use of it. Unlike many Perl books I have, the focus is on providing easy-to-understand code rather than efficient and terse code. Furthermore, the author isn’t scared to show you the things that don’t work alongside the things that do.

Perhaps the most impressive thing I found about Perl & LWP was that it embodied the Perl motto: There is more than one way to do it. Regular expressions are shown as a quick and dirty way to solve a problem. Tokens are introduced to overcome the limitations of regular expressions. Cases where tokens fall apart are solved with trees. Some examples are even redone to show how the improvements happen.

Particularly interesting was the chapter on writing web spiders. Not only is it a cool application, but it pulls together techniques learned from most of the previous chapters. The design and build of the spider are laid out here, and you follow the author’s thoughts as it’s put together.

If you’re working with Perl to retrieve web pages without this book, you’re either the author or wasting a lot of your time. I gained more time-saving techniques reading this on the bus than I have in several years of hacking around with LWP. The time it saves is more than worth the price. For a book on programming, it’s surprisingly easy to read, even bordering on fun (just what is the author’s fascination with pie?). I heartily recommend ‘Perl & LWP’ to anyone looking to automate even the smallest of web tasks using Perl.

Table of Contents:

1. Introduction to Web Automation
2. Web Basics
3. The LWP Class Model
4. URLs
5. Forms
6. Simple HTML Processing with Regular Expressions
7. HTML Processing with Tokens
8. Tokenizing Walkthrough
9. HTML Processing with Trees
10. Modifying HTML with Trees
11. Cookies, Authentication, and Advanced Requests
12. Spiders
A. LWP Modules
B. HTTP Status Codes
C. Common MIME Types
D. Language Tags
E. Common Content Encodings
F. ASCII Table
G. User’s View of Object-Oriented Modules
