Sean’s Obsessions

Sean Walberg’s blog

Linux Newsletter

A few people have written asking for copies of old Linux newsletters. They’re all available at http://ertw.com/~sean/news/. I’ve also added links so you can download the whole shebang as one file (1.4M, MBOX format), or as individual .txt or .html pages.

While I’m on the subject of the Linux news archives…

I’m Lazy, in that I hate doing things more than once. When I started writing the Linux newsletter, I wanted to keep copies of the newsletter as it was distributed (the copies I submit get edited, though I keep those too). At that point, I was mailing publishers and the like to get them to send me review copies of stuff so I could write about it. Having a web archive is a good way to show them that I really am writing a newsletter. As time went on, I needed a way to refer back to old issues, so the archive became more important (Cramsession didn’t start archiving my newsletter until mid-2002).

So, what I did was write a PHP script that would read my mailbox and print out the correct message. It was pretty simple: you’d give it a date string, it would read through the file until it found that date, and then it would start spitting out the content. You can see the code here: http://ertw.com/~sean/newsletter.phps
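The real code is the PHP linked above, but the same idea sketched in Perl looks roughly like this (the mailbox path and the date-matching details here are made up for illustration):

#!/usr/bin/perl
# Sketch of the idea: scan the mbox for the issue whose Date: header
# contains the requested string, then print until the next message starts.
use strict;
use warnings;

my $mbox = "$ENV{HOME}/mail/newsletter";    # made-up path
my $want = shift @ARGV or die "usage: $0 date-string\n";

open my $fh, '<', $mbox or die "Can't open $mbox: $!\n";
my $printing = 0;
while (my $line = <$fh>) {
    $printing = 0 if $line =~ /^From /;                 # a new message starts here
    $printing = 1 if $line =~ /^Date:.*\Q$want\E/;      # found the issue we want
    print $line if $printing;
}
close $fh;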

There were a few problems with this. As more issues were written, the script had to do more work to print out the list. At each hit, it would read through the whole mailbox to find all the issues (for the index), and read through it again to find the right one. I knew this wasn’t efficient, but honestly, at the time I didn’t think the newsletter would last long enough for it to become a problem.

Another thing is that I wanted this archive to be search engine friendly. Some search engines skip or mangle URLs with a query string, that is, something like

http://mysite/newsletter?issue=1

What I wanted was something like

http://mysite/newsletter/1

That’s done through the PATH_INFO variable in Apache:


ForceType application/x-httpd-php

This line tells Apache that any URL under /~sean/newsletter is to be treated as a PHP script. The script above was also called “newsletter”. Thus, if you asked for

http://ertw.com/~sean/newsletter/abc

It would run the PHP script called “newsletter”, and put the rest of the URL (/abc) into the $PATH_INFO environment variable. I could then read that, and I’d know which issue to serve. If it was empty, I knew I was on the index page.
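For anyone wanting to do the same, a minimal version of that setup in a .htaccess file would look something like this (a sketch; wrapping the directive in a <Files> container is one way to scope it to the script, and a <Location> block in httpd.conf works just as well):

<Files newsletter>
ForceType application/x-httpd-php
</Files>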

That was all well and good, until Cramsession changed the layout of the banner just enough to break my program. I tweaked the script a bit, and it worked. They changed it again, I tweaked again. Eventually I gave up and decided to change the approach entirely.

My web server is a mere K6-233 with a slow disk, so reading a megabyte file on every hit is too slow. What I wanted at this point was some way to generate individual pages from the mbox in a consistent manner. Furthermore, if I wanted to change the layout of the page (e.g. when I added the Amazon links), I should be able to apply the changes to all pages, not just the ones published afterward. Template engines to the rescue!

An explanation of how I solved the problem will have to wait, but if you’re curious, I used Template::Toolkit and a bit of Perl magic to do it. http://www.template-toolkit.org/ is an amazing piece of software; if you need a template engine that works with Perl, I’d highly recommend it. Having worked a bit with both it and Mason, I prefer TT2 (not that Mason isn’t great itself).
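If you haven’t seen TT2 before, basic usage from Perl looks something like this (a minimal sketch with made-up template and variable names, not my actual generation script):

use strict;
use warnings;
use Template;

my $tt = Template->new({ INCLUDE_PATH => '.' }) or die Template->error();

# Made-up sample data; the real thing pulls this out of the mbox.
my $vars = {
    title  => 'Linux News Archive',
    issues => [ '2002-06-21', '2002-06-28' ],
};

# newsletter.tt holds the layout, with placeholders like [% title %]
# and a [% FOREACH issue IN issues %] loop for the index.
$tt->process('newsletter.tt', $vars, 'index.html')
    or die $tt->error();

Change the template, re-run the script, and every page picks up the new layout.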

Don’t Say I Didn’t Tell You

Well, I probably didn’t put it in writing, but I’ve never trusted SCO/Caldera. Turns out they are suing IBM (/. article) over Unix IP.

The UNIX IP and trademarks have been passed around like a biker’s girlfriend, the latest holders being our friends at SCO. Seeing as they don’t have anything to offer themselves, they may as well try and make a buck (or a billion) off the patents to some 30-year-old technology. Bruce Perens made a comment on Slashdot that this could be a ploy to get them bought – after all, IBM has the cash, and SCO has nothing of value except this lawsuit.

Here’s the deal. IBM licenced technology from the AT&T patents to make AIX. About ten years later, SCO bought the patents. SCO now says that IBM is misusing them in order to promote their Linux interests.

http://newsforge.com/article.pl?sid=03/03/07/1728239 is another good article on the situation. SCO has a page about it too. Conspiracy theorists will note that SCO uses Schwartz as their PR agency, who Red Hat used to use.

I Might Have Found a New Hobby

Combining my hobby of writing Perl code for web automation with my love of Google:

http://www.google.com/apis/
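The API is SOAP-based and needs a (free) licence key. A first stab at calling it from Perl with SOAP::Lite would look something like this (an untested sketch; the parameter list for doGoogleSearch is from the API docs as I remember them):

use strict;
use warnings;
use SOAP::Lite;

# The WSDL file describes the SOAP interface; SOAP::Lite builds a stub from it.
my $google = SOAP::Lite->service('http://api.google.com/GoogleSearch.wsdl');

my $key    = 'your-licence-key-here';    # issued when you sign up for the API
my $result = $google->doGoogleSearch(
    $key, 'linux newsletter',            # key, query
    0, 10,                               # start index, max results
    'false', '', 'false',                # filter, restrict, safeSearch
    '', 'latin1', 'latin1',              # language restrict, input/output encodings
);

print "$_->{URL}\n" for @{ $result->{resultElements} };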

More to come

BSD vs GPL

This article on Slashdot deals with Intel and Red Hat making some modifications to their licencing in order to be able to cooperate.

Basically, Intel had some code that they wanted to be able to distribute without the requirements of the GPL. Red Hat didn’t like it. They compromised on the BSD licence.

I agree with this decision – while I use and enjoy the work that GPL projects have done, I also prefer the BSD licence. Theo de Raadt puts it best:

In the BSD world, we believe in making available trap-less software which anyone can use for any purpose. Even if they wanted to put our operating system into baby mulching machines or cruise missiles. We expose no ethic except our own of transitive freedom in sharing. We make no demands except credit.

I find the GPL too strict. As a developer (a poor one, but a developer none the less), I’d prefer that my code get used over anything else. “Do what you want with it, just don’t say you wrote it”. Even Microsoft has made use of BSD code, such as zlib, and quite possibly parts of the {*BSD,Linux} socket libraries. The GPL, while perfect for keeping code open, doesn’t do much to promote its use. BSD, on the other hand, fits better with commercial software and open software alike.

GCC Myths

This article on Freshmeat goes over some myths about tuning performance in GCC, and some of the improvements in the later versions. I’ve seen it too… turn on -O69 and your programs will fly, right? Nope. Quoting actual source code, the author walks through some of the performance optimizations and what they really mean, including the differences between -march and -mcpu.

Unfortunately, the author neglects to explain some of the terms, such as loop unrolling (he’s got a section on when it comes into play or not, but never actually tells you what it is). If you’ve got a loop – say you’re cycling through a fixed array from 1..10 and doing a calculation on each element – the computer has to check the index variable each time around to see if it should continue looping. If the loop is unrolled, the compiler gets rid of the loop and simply replicates the calculation 10 times. The code is bigger, but the CPU does less work.
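To make “unrolled” concrete, here’s the idea sketched in Perl (the compiler does this to your compiled code behind the scenes; this is just to show the shape of it):

my @item  = (1 .. 10);
my $total = 0;

# The rolled-up loop: the index is checked on every pass.
for my $i (0 .. 9) {
    $total += $item[$i] * 2;
}

# The unrolled version: no loop and no index checks, just bigger code.
$total = 0;
$total += $item[0] * 2;
$total += $item[1] * 2;
$total += $item[2] * 2;
# ...and so on, straight through to...
$total += $item[9] * 2;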

Advanced Server Gets COE

This article (Red Hat press release here) says that Red Hat Advanced Server just got COE (Common Operating Environment) certification. The first article also notes that another group is pushing toward Common Criteria certification.

This is a great step forward for Linux, as it opens the door for Linux to be deployed in more mission critical applications.

I should note that this is all within the Advanced Server line of products (and on a specific hardware platform to boot, but that’s another story). The basic Red Hat distribution doesn’t apply. But if you need that sort of stamp on your products, what’s a couple of grand (seeing as the alternatives are much more expensive)? Furthermore, the technologies used to obtain the certifications will be available to integrate into other distributions.

Red Hat Evil?

Tony, author of the Lockergnome Penguin Shell newsletter, recently made some comments about Red Hat’s corporate strategy. Quoting this article, he infers that Red Hat is trying to drive the non-paying customers away.

There is another interpretation of the article, which I think is closer to the truth. Right now, you either download or pay for the basic version of Red Hat, or you pay a lot of money for Advanced Server. There is no middle ground. By producing some better mid-range products, Red Hat can shift some people from the free versions to the newer products.

For those of us that don’t need Advanced Server, there are some features that we’d like out of Personal/Professional, namely support. I’m not going to pay $2500/year/box for support. So far, I (well, my work) have ponied up $60/year/box for Red Hat Network. For that, we get priority access to patches, the great web GUI, and a way to show our appreciation to Red Hat. For my boxes at home, I take the cheap road out, and either use the free demo licence that comes with installation, or just manage packages with something else like Red Carpet or AutoRPM.

There is a huge market for Red Hat between the home user and the enterprise Advanced Server. All they’re trying to do is get into that market.

Sure, the public “Demo” server is swamped when a patch comes out. For that matter, when a new kernel or version of Red Hat comes out, most mirrors are pretty busy too. If I need a patch in a hurry, I know I can always get it off a mirror (the mirrors themselves have a separate way to get the patches).

I like and use Red Hat because they “Get it”. They build a quality product and, IMHO, do a very good job of balancing new software features against stability. They successfully combine Open Source and business. Despite what people say, they still remember their roots and, in my eyes, have not yet done anything to try and extort money out of the home user.

Perl & LWP Reviewed

A while back I received a copy of Perl & LWP from O’Reilly for review. I read it and loved it. Unfortunately, the review was never published on Cramsession. Since I got some positive feedback after publishing it in my LUG’s newsletter, here it is; hope it’s helpful.

Short story – Great book if you want to automate the pulling and parsing of web pages in Perl. I can’t stress the parsing enough… the man pages are good enough to teach you all you need to know about LWP; this book is great for teaching you what to do with the page once you have it.

Perl & LWP
Sean M. Burke
O’Reilly, 2002
242 pp
$34.95USD, $54.95CDN

The only disappointing thing about this book is its title. ‘Perl & LWP’ conjures up images of simply grabbing a web page, maybe posting to a form through Perl. Not much fodder for a book, seeing as a cursory glance at the man pages will tell you how to do that. However, one must dig into the subtitle of the book to get what this book is really about, namely ‘Fetching Web pages, Parsing HTML, Writing Spiders, and More’.

Unless you’ve done this stuff before, it’s hard to appreciate how difficult it really is. The act of getting data from a web server is trivial. The book covers it quickly, writes a function, and then largely ignores the issue. The trick is what to do with the data once you’ve got it. That’s the hard stuff, unless you’ve got this book by your side.

The book starts off with a quick look at what web automation is, why you’d use it, and where it’s appropriate. 35 pages later, you know how to GET and POST forms several different ways using the LWP class library. Then it’s into a discussion of URLs, and then forms. That’s five chapters and 85 pages. It’s also where the book goes from simply ‘good’ to ‘great’.
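For the unfamiliar, fetching a page with LWP really is only a few lines (a minimal sketch, not an excerpt from the book):

use strict;
use warnings;
use LWP::UserAgent;

my $ua       = LWP::UserAgent->new(agent => 'MyRobot/0.1');   # made-up agent string
my $response = $ua->get('http://www.example.com/');
die 'Fetch failed: ' . $response->status_line . "\n" unless $response->is_success;

my $html = $response->content;    # now comes the hard part: parsing it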

Many people will use regular expressions to parse the data from here, and chapter six talks about ways to do that. Basically, it involves a lot of looking at the source and making some rough guesses, such as ‘look for a sequence of characters, followed by some spaces, then comes our data.’ The author shows iterative techniques for arriving at the solution. He then shows the shortcomings of this technique.

To overcome the problem with regular expressions, token parsing is introduced. Again, good ways to use the technique are demonstrated, and the limitations presented. To overcome these limitations, trees are shown. Though three different methods are used to solve much the same problem, the relative advantages and disadvantages are made clear.
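To give a flavour of the token approach (again, a sketch of the kind of code involved, not an excerpt from the book), here’s how you might pull every link out of a page with HTML::TokeParser:

use strict;
use warnings;
use HTML::TokeParser;

my $html = do { local $/; <STDIN> };    # e.g. the page fetched earlier

# Walk the document one token at a time, grabbing each <a> tag's
# href attribute and the text between <a> and </a>.
my $parser = HTML::TokeParser->new(\$html);
while (my $tag = $parser->get_tag('a')) {
    my $href = $tag->[1]{href} or next;
    my $text = $parser->get_trimmed_text('/a');
    print "$text -> $href\n";
}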

This book is full of examples. Every example is from a real live site, from grabbing a book’s rankings from Amazon.com, to checking licence plates with the California DMV, to pulling news items from BBC News. His code is exceptionally clear, with very few Perl tricks used, meaning that you don’t need to be a Perl expert to make use of it. Unlike many Perl books I have, the focus is on providing easy-to-understand code rather than efficient and terse code. Furthermore, the author isn’t scared to show you the things that don’t work alongside the things that do.

Perhaps the most impressive thing I found about Perl & LWP was that it embodied the Perl motto: there is more than one way to do it. Regular expressions are shown as a quick and dirty way to solve a problem. Tokens are introduced to overcome the limitations of regular expressions. Cases where tokens fall apart are solved with trees. Some examples are even redone to show how the improvements happen.

Particularly interesting was the chapter on writing web spiders. Not only is it a cool application, but it pulls together techniques learned in most of the previous chapters. The design and build of the spider is laid out step by step; you follow the author’s thoughts as it’s put together.

If you’re working with Perl to retrieve web pages without this book, you’re either the author, or wasting a lot of your time. I gained more timesaving techniques reading this on the bus than I have in several years of hacking around with LWP. The time it saves is more than worth the price. For a book on programming, it’s surprisingly easy to read, even bordering on fun (just what is the author’s fascination with pie?). I heartily recommend ‘Perl & LWP’ to anyone looking to automate even the smallest of web tasks using Perl.

Table of Contents:

1. Introduction to Web Automation
2. Web Basics
3. The LWP Class Model
4. URLs
5. Forms
6. Simple HTML Processing with Regular Expressions
7. HTML Processing with Tokens
8. Tokenizing Walkthrough
9. HTML Processing with Trees
10. Modifying HTML with Trees
11. Cookies, Authentication, and Advanced Requests
12. Spiders
A. LWP Modules
B. HTTP Status Codes
C. Common MIME Types
D. Language Tags
E. Common Content Encodings
F. ASCII Table
G. User’s View of Object-Oriented Modules

Interesting Comment on Newsforge

An interesting article on Newsforge asks how big the market would be for a consumer-grade Linux distribution.

If it could provide email, web, and basic word processing, with less hassle than Windows, wouldn’t that be enough for many people? Kick out all the gamers and power users, and keep the students, older folks, and those who just need a computer that works when they turn it on – would there be enough of a market? MacOS is making a comeback on a similar concept; why can’t Linux jump on board?

I think Linux is at the point where it can provide an easy-to-use environment. Even if you throw in something like CrossOver to run the plugins, the cost of a Linux box with the basic software, at around $100-$150, will still be cheaper than a similar box with Windows, Word, etc.

Maybe Lindows has the right idea – not just Click-N-Run itself, but the concept in general.