Sean’s Obsessions

Sean Walberg’s blog

Yet Another Reason WWW::Mechanize Rocks

I found this article about a perl script that downloads True Type fonts from http://grsites.com/fonts/. Looking at the script, it’s about 100 lines of fairly dense, comment-less code that’s quite tied to the web page and url structure that it’s scraping. I rewrote it using WWW::Mechanize in 33 lines including comments and it runs under strict. It took about 15 minutes, and I never had to view the source code of the page I was scraping.

Code below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#!/usr/bin/perl
use strict;
use WWW::Mechanize;
# Hit the first page
my $mech = WWW::Mechanize->new();
$mech->agent_alias("Linux Mozilla");
$mech->get("http://grsites.com/fonts/");
die $mech->response->status_line unless $mech->success;
# Pull out all the page links
my @links = $mech->find_all_links(text_regex => qr/^Page/);
foreach my $page (@links) {
        print "Getting " . $page->url_abs . "\n";
        $mech->get($page->url_abs);
        next unless $mech->success;
        my @fonts = $mech->find_all_links(url_regex => qr/fontview.cgi/);
        foreach my $font (@fonts) {
                print "Getting " . $font->url_abs ."\n";
                $mech->get($font->url_abs);
                unless $mech->success {
                        print $mech->response->status_line;
                        next;
                }
                my $fname = $font->url_abs;
                # pull out the name from the url
                $fname =~ s/.*fn=(.*)&?.*$/$1/;
                my $fontlink = $mech->find_link(url_regex => qr/fontdownload/);
                $mech->get($fontlink->url_abs, ":content_file"=>"$fname.ttf");
        }
}

Comments

I’m trying something new here. Talk to me on Twitter with the button above, please.