Sean’s Obsessions

  • Archives

07 May

Yet another reason WWW::Mechanize rocks

I found this article about a perl script that downloads True Type fonts from http://grsites.com/fonts/. Looking at the script, it’s about 100 lines of fairly dense, comment-less code that’s quite tied to the web page and url structure that it’s scraping. I rewrote it using WWW::Mechanize in 33 lines including comments and it runs under strict. It took about 15 minutes, and I never had to view the source code of the page I was scraping.

Code below.


#!/usr/bin/perl
use strict;
use WWW::Mechanize;

# Hit the first page
my $mech = WWW::Mechanize->new();
$mech->agent_alias("Linux Mozilla");
$mech->get("http://grsites.com/fonts/");
die $mech->response->status_line unless $mech->success;

# Pull out all the page links
my @links = $mech->find_all_links(text_regex => qr/^Page/);
foreach my $page (@links) {

print "Getting " . $page->url_abs . "\n";
$mech->get($page->url_abs);
next unless $mech->success;

my @fonts = $mech->find_all_links(url_regex => qr/fontview.cgi/);
foreach my $font (@fonts) {
print "Getting " . $font->url_abs ."\n";
$mech->get($font->url_abs);
unless $mech->success {
print $mech->response->status_line;
next;
}
my $fname = $font->url_abs;
# pull out the name from the url
$fname =~ s/.*fn=(.*)&?.*$/$1/;
my $fontlink = $mech->find_link(url_regex => qr/fontdownload/);
$mech->get($fontlink->url_abs, ":content_file"=>"$fname.ttf");
}
}

Comments are closed.

© 2008 Sean’s Obsessions | Entries (RSS) and Comments (RSS)

Powered by Wordpress, design by Web4Sudoku, based on Pinkline byGPS Gazette