17 June 2008

Joining Tables... in Perl!

Is it just me or is joining tables in SQL just like processing text files in Perl? OK, maybe when I put it like that they don't sound very similar but there's actually a connection. Let me start from the beginning.

There are a number of ways your database server can join tables to execute a query. In SQL Server, for example, there's the nested loops join, the hash join and the merge join. What do you mean you haven't heard of those types of joins? What about the inner and the outer join, you say. They're valid as well, of course, but in this case I'm talking about physical, not logical joins.

The difference is that logical joins tell the database which rows from table A you want to join to which other rows from table B to produce the output. On
the other hand, physical joins tell the server how the rows should be matched up. In other words, what algorithm to use. The reason you might not have heard of physical joins is that you normally don't have to specify them at all. SQL is basically a declarative programming language. Instead of telling the machine what to do (as you would in C, Java, VB, or Python) you just tell it what you want and it works out the best way to obtain that data for you. One of the main steps for the machine is deciding on an execution plan for your query. That's where the physical joins come into play. They are some of the "building blocks" used in the plan.

Although SQL Server already picks physical joins for each query, you can add hints to tell it to use different ones. The reason you'd want to do that is usually performance but, for the record, let me say that you should always avoid hints whenever possible. In the long run, the server stands a better chance than you of finding the best joins possible.

OK, but what has this got to do with Perl? Even better, what has it got to do with processing text files in Perl? Hold on, I'm getting there. Perl is commonly used to process large input files with regular data (delimited, fixed-width, you get the picture). "Processing" the file might mean looking at each line to extract only a few columns or filtering out certain lines depending on the values they contain. For example, a Perl script might run through a web server log and extract only the Timestamp and Requested URL fields for lines where the Timestamp was between 9:00 and 11:00. Doesn't that sound like a query? The input file is pretty much like a very simple database table.

Processing becomes more interesting when you have more than one input file. That's when you have to combine data from the multiple inputs, something you could call "joining". Maybe you have two web servers and you want to combine both log files into a single output. In Perl, you could open the input files and merge the contents according to the Timestamp field. That's essentially what the merge join does.
The key factor is that the input files are already sorted which makes merging pretty simple.

If the inputs are not ordered, and especially if one of the inputs is a lot bigger than the other, the hash join would be a more appropriate choice. For example, let's say you are still looking at those web server logs but now you decide you also want to include the name of each server in the output. The problem is the input only has the IP address of the server. The mapping of IPs to names is stored in a separate file. The trivial solution in Perl is to first read the mapping file, load the contents into a hash and then process the other files, using the hash to output the correct names for each input line. The main limitation is the fact that one of the files is loaded into memory. Try using one that's really large and you'll see your machine crawl to a standstill.

Last but certainly not least, there's the nested loops join. This particular workhorse does exactly what it says on the label. Essentially, the join looks at the first input
and for each line it runs through the entire second input. To put it another way, it sets up one loop nested inside the other. There are basically no limitations on how your input files can be organized.

What am I trying to say with all this? Well, you certainly shouldn't stop using SQL and start writing the joins yourself in Perl. But, I think it helps to understand what big complex software like a database server is doing behind the scenes. Not only does it make it all seem less like magic, it also gives you more knowledge to deal with those pesky performance problems.

05 May 2008

When is a bitmap not an image?

I've been reading Programming Pearls lately and I'm really enjoying it. The book is a collection of thought-provoking articles on fundamental topics in programming and software development. (That's right: it's Pearls, not Perl) To be fair, I like going back to basics and discussing things like algorithms and data structures.

Being a light and easy read (unlike so many dry comp-sci books out there) it's tempting to just whiz through the articles in a few days. However, the true benefit comes from working on the end-of-chapter exercises and actually thinking about the questions the author proposes throughout the book.

Before you start to wonder, I'm not about to turn this into a book review. What I really want to do is put my proverbial money where my proverbial mouth is. I decided to tackle one of the first exercises in the book and write my own implementation of a Bitmap data structure. Don't worry, I promise not to bore you with all the details, I'll just focus on the interesting bits.

Let's get started with this Bitmap thing. That's an image file format, right? Well, not quite. In this case, we're talking about an incredibly space-efficient way of storing numbers. I don't know about you, but I was blown away by just how simple and ingenious it is. Sure, it only works for sets of unique integers but, even then, how come I'd never heard of it before? I must've over-slept and missed that class at university. Come on, I even had to implement a red-black tree. I think the Bitmap deserved at least a brief mention.

Incidentally, after learning about this other Bitmap, I realized I was already
indirectly dealing with it in one of the tools I use on a daily basis. You might have seen it there too: SQL Server uses a Bitmap in the execution plan for some types of queries. And you thought the whole thing was just an academic brain teaser.

Anyway, the gist of the structure is that instead of storing the numbers themselves, we store an ordered checklist of the numbers. If we want to store a certain number, we check it on the list. Otherwise, we leave that number's position unchecked. As you can imagine, it only takes one bit to store each checked/unchecked state. Pack those bits into bytes and you've put a lot of data into a very small space.

I decided to store everything in a simple byte array. Although a byte array in C# knows how long it is, I had to keep track of the length of the whole bitmap since the last byte might only be partially used. Any implementation I could think of would have to deal with the bitwise operations to set and unset the bits inside the bytes so that was my first challenge. I brushed up on the mechanics of ANDing and ORing bytes (it's not that complicated — just find the right byte in the array, then set up a mask to single out the bit you want in there) and implemented the two fundamental operations of my data structure as a bool-valued indexer:

public bool this[int index] { get; set; }

My initial implementation used a fixed length for the array that was set when the object was created. Auto-expanding collections are all the rage these days so making the bitmap resize itself on the fly was the next item on my list. That was enough to obtain a pretty usable collection class. Did it work? It sure did! As predicted, it took up very little memory and it was also really fast.

I could keep tweaking the class and adding more features but I think that's enough for a 30 minute exploratory coding session. That's right, it only took about that long to get the basic pieces in place. Now, if only writing about it were as easy as that...

01 May 2008

Fiddling With a Live Site

It all started with a problem on a live site. Not that the site didn't work — it really did. Except when it didn't. But most of the time it did. You see, it sometimes had a glitch on a specific browser (probably not the one you're thinking of) and even then it worked most of the time. We thought of slapping an appropriate badge on the page and calling it a day but management wasn't too impressed by that. (Who said they always fall for the Certification buzzword?)

There was nothing left to do but actually debug and fix. Sure, no sweat: just reproduce the error on a development server, figure out why it's breaking and come up with a patch to solve the problem. I got stuck on the frst step. I told you the problem was on the live site. No one mentioned anything about the site not working in development. The team would've picked it up if that had happened.

Right, so I was stuck with a bug I could only repro sometimes (does that even count as "repro"?) on a specific browser accessing the live site. Even if I could come up with a patch, I couldn't just apply it to the live site without testing. My first idea was to use FireBug. I'm not that good at it but I've heard it can do some amazing things. For reasons I won't go into here, that wasn't an option.

That's when inspiration struck me. I had to change the Javascript my browser saw without changing what was on the server. Another nifty tool that can do that sort of thing is Fiddler. I use it mainly to see what data is being passed back and forth but it can also change the data for you. (If you're going to try this, do yourself a favor and install the FiddlerScript Editor as well.)

Step number one was to identify what file contained the Javascript code I wanted to debug. At first, all I wanted was to insert some alert() calls to see what functions were being executed and what the data looked like. The trick was to find a "hook" to attach to. For example, if I had a line of code like this:


var i = 0;

And I wanted to check that i was actually being set to zero, I would replace that line of code with:


var i = 0;
alert(i);

That seems pretty obvious but when I say "replace", I really mean it. I was making textual replacements in the source code before the browser got a chance to run it. (Did I just hear "C macros"?) Of course, by doing that, every line of code that looked the same as this one would have an alert attached to it (or maybe just the first one would — it depends on the exact replacement function). So the trick was to find bits of code (it doesn't even have to be a valid statement — it's just text, remember?) that looked unique enough so I could attach an alert.

Once I'd figured out why the code didn't work, I had to test the patch. Using the same technique, I could either remove chunks of code altogether or comment them out by inserting comment delimeters in the right places. All that was left was to insert the fixed Javascript and check that it worked.

While, in the end, the patch altered less than five lines of code, the same idea can be extended to much bigger changes. A larger intervention could require modifying the HTML and CSS as well as the Javascript of a website. The practical limitation is the complexity of the code you have to write in FiddlerScript. When you start dealing with code that's modifying other code that, in turn, is modifying HTML, it can be pretty hard to mentally juggle all those different abstraction levels. I wonder if there's a gap in the market for a tool that can hide away that complexity while still enabling a developer to debug a website without having access to either server or browser.

20 April 2008

Starting from Scratch

Hooray! I've finally got my own blog. And the name isn't half bad either.

So what's this blog all about? Am I just going to tell the world what I had for dinner yesterday? Or what DVD I saw last night and whether I liked it or not? (It had some really good laughs in it.) Well, you can breathe a sigh of relief 'cause this ain't yet another one of those blogs. (If my posts turn out to be a lot less exciting than you thought, don't say I didn't warn you.)

Following in the steps of some of my favorite bloggers ("Add gratuitous links to well known blogs: Done"), I've decided to write in order to improve my understanding of software development and my ability to communicate that understanding in a clear, concise manner. Hopefully I'll be able to share some interesting ideas with any lost stragglers who stumble upon these posts.

Let the blogging begin!