Spent the past week in Indianapolis and Atlanta meeting with a vendor and our parent organization. It took over 8 hours to get between the two sites because of delays and transfers. Simply absurd. I was gone almost 5 whole days, and only two were productive, the rest were wasted with travel.
Airport security is likewise silly. While the searches are thorough, there was a huge inconsistency in the way everything was handled. In the Atlanta airport, baggage was x-rayed. Chicago required people to take off their shoes, whereas Atlanta had little devices you could step on that would test your shoes to see if you needed to take them off. Toronto was sniffing laptops with an “electronic nose” to check for chemicals, and Winnipeg customs were performing baggage checks.
His greatness, Bruce Schneier, made some comments about his initial reactions. Crypto-Gram is a hell of a newsletter, by the way; I recommend signing up.
I’ve known about WWW::Mechanize for a while, but never really bothered to look into it. I figured that between LWP::UserAgent and HTML::TokeParser, there was nothing I couldn’t do.
So, I find myself flipping through Spidering Hacks, and I see they have some examples using it. Interested, I read further…
Normally, when scraping a web page, you fetch the page, parse the HTML to find the elements, fill out forms or find your links, then follow the next link. WWW::Mechanize builds on LWP::UserAgent so that the fetching is pretty much the same, but the parsing is almost completely automated.
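For reference, the basic fetch with WWW::Mechanize looks something like this (the URL is just a placeholder):

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.example.com/');
print $mech->content();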
One of the common tasks is to fill out a form. The old way, I’d get the page, then parse it with HTML::Form:
my @forms = HTML::Form->parse($page, $url);
# figure out what form I want
$form = $forms[0];
Then, cycle through the elements to find what I need:
for my $i ($form->inputs) {
    $i->value("something");
}
With WWW::Mechanize, a lot of that is hidden. You can even ignore the parsing of the form, and fill in the first two visible text fields, such as to log in to a protected site:
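A minimal sketch of that (assuming the first two visible fields on the form are the username and password; $url, $username, and $password are placeholders):

my $mech = WWW::Mechanize->new();
$mech->get($url);

# fill the first two visible fields in document order, then submit
$mech->set_visible($username, $password);
$mech->submit();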
Another, more obnoxious, task was to follow a link identified by its text, i.e., find an anchor tag, followed by a string, followed by a closing anchor tag. There could also be bold tags inside it, so a great deal of care was necessary. For this, I had to crack out HTML::TokeParser and parse the page. With WWW::Mechanize:
$mech->follow_link(text_regex => qr/Log In/i);
Duh. That’s pretty simple.
So, I present here a simple script that goes to the Amazon.ca site, follows the “associates” link, logs in, and checks the stats. I end up with a web page; actually pulling out the numbers will require HTML::TokeParser, but that’s not a big deal (WWW::Mechanize is amazing, not miraculous!). The script is about 20 lines; it would probably have taken me three times that to write it the old way, and that version would still sacrifice the flexibility to adapt if Amazon changes their pages.
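The original script isn’t reproduced here, but the approach looks roughly like this sketch (the link text, field order, and report page are assumptions, not Amazon’s actual markup):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# credentials passed on the command line, purely for the sake of the example
my ($email, $password) = @ARGV;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.amazon.ca/');

# follow the associates link by its text
$mech->follow_link(text_regex => qr/associates/i);

# log in: assume the first two visible fields are e-mail and password
$mech->set_visible($email, $password);
$mech->submit();

# jump to the stats/reports page
$mech->follow_link(text_regex => qr/reports/i);

# dump the resulting page; pulling the numbers out is HTML::TokeParser's job
print $mech->content();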
Upgraded my RH9 box to Fedora Core 1 tonight with Yum. The only reboot was after the install finished, i.e., I didn’t even reboot to start the install. Very simple.
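For the curious, the process boils down to something like this (a sketch, not a transcript; exact package names will vary):

# point the box at Fedora Core 1 by installing its release package
rpm -Uvh fedora-release-*.noarch.rpm

# let yum upgrade everything in place
yum -y upgrade

# the only reboot, once the upgrade has finished
reboot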
I was reading a thread on a forum about Google’s next reindex and how it might be another Florida.
The first thing that hit me was how seriously some people (affiliate marketers, SEOs) take Google. In a matter of a few hours, almost two hundred messages had been posted by people analyzing the results of the update. The second thing that hit me was how complex Google really is.
Not understanding some of the terms used in the posts, I did some searching. The Google Dance refers to the process where a new Google index rolls out across the various Google data centres. Here and here, you can see the results of a search across different data centres.
If you’re curious about more methods used to optimise pages, here is a page on SEO tools.
Anyway, what’s the relevance of all this Google stuff? It’s all designed to make the results (SERPs - Search Engine Result Pages) contain useful information instead of affiliate sites and popup traps, so that people like you and me can find useful information on the Internet. I also look at it as a race between the Google engineers trying to make a smarter GoogleBot and the people trying to exploit it. It’s a great spectator sport.
Updated: Found some more links on the way Google works.
A paper on Google’s architecture, interesting analysis of the cost to run a distributed environment vs. a centralized one. Also contains information on the flow that happens when someone queries Google.
They’ve also built their own file system on top of Linux.
Taking a trip to Ottawa to spend some time with the in-laws. One of the airlines had a dollar seat sale, meaning two round trip tickets were $4. Then how come it ended up being almost $280?
I originally registered ERTW.COM with the idea of turning it into a resource for engineers and engineers-to-be. In the past, I ran the front page as a news site and hosted some forums. I was also talking to various student councils across the country about sharing announcements and such. Somehow, it never took off; it went through two iterations before I gave up. This time around, I’ve set it up as a Wiki.
If you’re not familiar with a Wiki, it is a web site that can be updated by anyone. Each time you type a term with WordsSmashedTogether (like that), it makes a link to that page/node. The first time you type in that term, it creates the page. It also has its own markup syntax, which is designed to make it easier for people to add content.
I’ve gone through and added a bit, with the intention of adding more. If you’re an Engineer, are thinking of becoming one, or are in the midst of your studies, go ahead and add stuff. The emphasis is on content, not looking good (which is par for the course in Engineering).
In December 2002 I wrote some 2003 predictions, and boy was I off (comically, in some instances).
First, a major distribution will drop out of the market. Red Hat? Nah. Debian? Doubt it. Caldera/SCO/whatever they’re called today don’t count.
Well, Red Hat dropped out, SCO did their thing.
Second prediction is an easier one. The BSDs (FreeBSD, OpenBSD, and NetBSD) will get a lot more attention this year.
While I won’t say I’m completely right here, the *BSDs made some major releases, and certainly weren’t worse off.
Third prediction – Sun Microsystems. Are they going down the hole? No. But, out of all the proprietary UNIXes out there, I think Solaris has the most likelihood of losing market share to Linux.
I think I have to take some points here, even though this one was fairly obvious.
I also didn’t foresee that the newsletter this was written in would be abruptly stopped a month later.
This article over at Linux.com complains about combining RPM with custom code. I do it all the time, and if done properly, there should be no problems. In this example, I customize the Webalizer package while still maintaining the software under version control.
Webalizer has a feature that counts /index.* as /, since they usually produce the same output. The “index.” part is hardcoded in, though you can add additional names with the IndexAlias configuration directive.
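For example, a line like this in webalizer.conf folds another page name into /; the catch is that IndexAlias can only add names, it can’t turn off the hardcoded “index.” match:

# /etc/webalizer.conf -- treat another default page as "/"
IndexAlias   homepage.htm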
However, my RSS feed is index.rdf, and since it begins with “index.”, webalizer counts requests to http://ertw.com/blog/ and http://ertw.com/blog/index.rdf as the same page. As a result, there is no way for me to tell how many hits come from people reading my blog versus people pulling it in with an RSS aggregator.
Grepping for “index” in the webalizer source showed that a single line is causing my problem:
add_nlist("index.", &index_alias);
All that’s needed is to comment it out and rebuild, but since I already have the software set up as an RPM, I may as well keep it that way.
A binary RPM is a collection of files and scripts needed to install a package. With the information in the package, the system can determine if the necessary dependencies are present, and what files are associated with the package for later removal. RPM also differentiates between binaries, documentation, and configuration, so that if you upgrade a package, it will either back up or leave a config file alone, while replacing binaries.
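For example, you can ask a package about its dependencies, file list, and config files (the package names here are placeholders):

rpm -qpR some-package-1.0-1.i386.rpm   # dependencies the package requires
rpm -qpl some-package-1.0-1.i386.rpm   # files it would install
rpm -qc webalizer                      # config files of an installed package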
An RPM is made from a .spec file, “pristine sources”, and patches. The .spec file contains the instructions for unpacking, patching, configuring, and installing the software. Think of it as a script for building and installing software, ensuring it is done the same way time after time. Pristine sources are simply the tarballs obtained from vendors. Patches are changes to the pristine sources, either local modifications (such as what I’ll do here), or updates from the vendor (like the kernel does)
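A stripped-down .spec looks roughly like this (names and paths are illustrative only, not Webalizer’s actual spec file):

Name: example
Version: 1.0
Release: 1
Summary: Skeleton spec file
License: GPL
Group: Applications/Internet
Source0: example-1.0.tar.gz
Patch0: example-local.patch
BuildRoot: %{_tmppath}/%{name}-%{version}-root

%description
Illustrative skeleton only.

%prep
%setup -q
%patch0 -p1

%build
./configure --prefix=/usr
make

%install
make install DESTDIR=$RPM_BUILD_ROOT

%files
%defattr(-,root,root)
/usr/bin/example
%config(noreplace) /etc/example.conf
%doc README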
I download the webalizer .src.rpm from the Fedora sources (note that I’m running RH 7.3 on this box; even though we’re dealing with RPM, we still get the benefits of source code) and install it with “rpm -i webalizer-2.01_10-14.src.rpm”. This puts all the source and patches in /usr/src/redhat/SOURCES, and the spec file in /usr/src/redhat/SPECS.
Just to make sure I can build it, I try to generate a binary RPM of webalizer:
rpmbuild -ba webalizer.spec
(Another note here: this requires db4, whereas RH7.3 only has db3. In the past I had to upgrade it anyway, by simply rebuilding the db4 RPM from RH9 in the same manner and upgrading. Again, the benefits of source combined with the benefits of RPM.)
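That rebuild is essentially two commands (the db4 version depends on which source RPM you grab):

rpmbuild --rebuild db4-*.src.rpm
rpm -Uvh /usr/src/redhat/RPMS/i386/db4-*.rpm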
So, it built fine. What I have to do now is fix the code, create a patch, and then rebuild.
With a clean copy of the webalizer source in one directory, I make a copy:
cp -r webalizer-2.01-10/ webalizer-2.01-10.orig
vi webalizer-2.01-10/webalizer.c
I make my change to webalizer.c (i.e., commenting out the one line), then create a patch in the SOURCES directory:
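Something along these lines does the trick (the patch name matches the Patch5 line added to the spec below):

diff -ruN webalizer-2.01-10.orig webalizer-2.01-10 > /usr/src/redhat/SOURCES/webalizer.indexalias.patch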
(Read this for more information on diff and patch)
Editing webalizer.spec, I make the following changes:
Version: %{ver}_%{patchlevel}Custom
This changes the version string so that “Custom” shows up in the full package name. Then, tell RPM about the new patch and have it apply the patch to the sources, stripping off one leading directory (i.e., patch -p1):
#... by the Patch0 and Patch1 lines
Patch5: webalizer.indexalias.patch
#... by the other %patches
%patch5 -p1
A rebuild and an “rpm -Uvh” later, I add IndexAlias lines for the index pages I do want treated as / (index.html and friends) to /etc/webalizer.conf, and I’m tracking my RSS statistics separately from my normal page views.
It is well worth your while to make good use of RPM and specfiles. Almost anything I install, I try to build as an RPM. When patches need applying to the code, or when the software upgrades, I copy a couple of files and change some lines in the specfile. Since the specfile is a script, I ensure that the software is built consistently each time.