17 June 2008

Joining Tables... in Perl!

Is it just me or is joining tables in SQL just like processing text files in Perl? OK, maybe when I put it like that they don't sound very similar but there's actually a connection. Let me start from the beginning.

There are a number of ways your database server can join tables to execute a query. In SQL Server, for example, there's the nested loops join, the hash join and the merge join. What do you mean you haven't heard of those types of joins? What about the inner and the outer join, you say. They're valid as well, of course, but in this case I'm talking about physical, not logical joins.

The difference is that logical joins tell the database which rows from table A you want to join to which other rows from table B to produce the output. On
the other hand, physical joins tell the server how the rows should be matched up, in other words, what algorithm to use. The reason you might not have heard of physical joins is that you normally don't have to specify them at all. SQL is basically a declarative programming language. Instead of telling the machine what to do (as you would in C, Java, VB, or Python), you just tell it what you want and it works out the best way to obtain that data for you. One of the main steps for the machine is deciding on an execution plan for your query. That's where the physical joins come into play. They are some of the "building blocks" used in the plan.

Although SQL Server already picks physical joins for each query, you can add hints to tell it to use different ones. The reason you'd want to do that is usually performance, but, for the record, let me say that you should avoid hints whenever possible. In the long run, the server stands a better chance than you do of finding the best possible joins.

OK, but what has this got to do with Perl? Even better, what has it got to do with processing text files in Perl? Hold on, I'm getting there. Perl is commonly used to process large input files with regular data (delimited, fixed-width, you get the picture). "Processing" the file might mean looking at each line to extract only a few columns or filtering out certain lines depending on the values they contain. For example, a Perl script might run through a web server log and extract only the Timestamp and Requested URL fields for lines where the Timestamp was between 9:00 and 11:00. Doesn't that sound like a query? The input file is pretty much like a very simple database table.
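
Just to make the analogy concrete, here's a rough sketch of that "query" in Perl. The log layout is made up for the example (tab-separated, Timestamp first, Requested URL second), so a real log would need a different split and a smarter time comparison:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical log layout, tab-separated:
    #   2008-06-17 09:15:23 <TAB> /index.html <TAB> 200
    while (my $line = <>) {
        chomp $line;
        my ($timestamp, $url) = (split /\t/, $line)[0, 1];
        next unless defined $timestamp && defined $url;

        # Keep only the requests made between 09:00 and 11:00
        my ($hour) = $timestamp =~ /(\d{2}):\d{2}:\d{2}/;
        next unless defined $hour && $hour >= 9 && $hour < 11;

        print "$timestamp\t$url\n";
    }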

Processing becomes more interesting when you have more than one input file. That's when you have to combine data from multiple inputs, something you could call "joining". Maybe you have two web servers and you want to combine both log files into a single output. In Perl, you could open the input files and merge the contents according to the Timestamp field. That's essentially what the merge join does. The key factor is that the input files are already sorted, which makes merging pretty simple.
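
In code, that merge could look something like the sketch below. The file names and the tab-separated, Timestamp-first layout are assumptions for the example; the important part is that, because both inputs are already sorted, a single pass over each file is enough:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Both logs (file names made up here) are assumed to be sorted by their
    # leading timestamp field, so we only ever look at the current line of each.
    open my $a_fh, '<', 'web1.log' or die "web1.log: $!";
    open my $b_fh, '<', 'web2.log' or die "web2.log: $!";

    my $a = <$a_fh>;
    my $b = <$b_fh>;

    while (defined $a && defined $b) {
        my ($ta) = split /\t/, $a;
        my ($tb) = split /\t/, $b;
        if ($ta le $tb) {    # string comparison is fine for ISO-style timestamps
            print $a;
            $a = <$a_fh>;
        }
        else {
            print $b;
            $b = <$b_fh>;
        }
    }

    # Drain whichever input still has lines left
    while (defined $a) { print $a; $a = <$a_fh>; }
    while (defined $b) { print $b; $b = <$b_fh>; }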

If the inputs are not ordered, and especially if one of the inputs is a lot bigger than the other, the hash join would be a more appropriate choice. For example, let's say you are still looking at those web server logs but now you decide you also want to include the name of each server in the output. The problem is the input only has the IP address of the server. The mapping of IPs to names is stored in a separate file. The trivial solution in Perl is to first read the mapping file, load its contents into a hash and then process the log files, using the hash to output the correct name for each input line. The main limitation is that one of the files has to be loaded entirely into memory. Try that with a really large file and you'll see your machine grind to a halt.
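
Here's roughly what that looks like in Perl, again with made-up file names and a made-up tab-separated layout. The small mapping file becomes the in-memory hash (what SQL Server calls the build input) and the big log is streamed through once (the probe input):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Build side: load the (hopefully small) mapping file of "IP <TAB> name"
    # lines into a hash keyed by IP address.
    my %name_for_ip;
    open my $map_fh, '<', 'servers.map' or die "servers.map: $!";
    while (my $line = <$map_fh>) {
        chomp $line;
        my ($ip, $name) = split /\t/, $line;
        $name_for_ip{$ip} = $name if defined $name;
    }
    close $map_fh;

    # Probe side: stream the big log once, looking each line's IP up in the hash
    open my $log_fh, '<', 'access.log' or die "access.log: $!";
    while (my $line = <$log_fh>) {
        chomp $line;
        my ($ip, $rest) = split /\t/, $line, 2;
        next unless defined $ip && defined $rest;
        my $name = exists $name_for_ip{$ip} ? $name_for_ip{$ip} : 'unknown';
        print "$name\t$rest\n";
    }
    close $log_fh;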

Last but certainly not least, there's the nested loops join. This particular workhorse does exactly what it says on the label. Essentially, the join reads the first input and, for each line, runs through the entire second input. To put it another way, it sets up one loop nested inside the other. There are basically no limitations on how your input files can be organized; the price is that the second input gets scanned once for every single line of the first.
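
A naive version in Perl could be as simple as the sketch below; the file names, the layout and the join key (the first tab-separated field of each line) are all invented for the example:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $outer_fh, '<', 'outer.txt' or die "outer.txt: $!";
    while (my $outer = <$outer_fh>) {
        chomp $outer;
        my ($key) = split /\t/, $outer;
        next unless defined $key;

        # Rescan the entire inner input for every single line of the outer one
        open my $inner_fh, '<', 'inner.txt' or die "inner.txt: $!";
        while (my $inner = <$inner_fh>) {
            chomp $inner;
            my ($inner_key) = split /\t/, $inner;
            print "$outer\t$inner\n" if defined $inner_key && $inner_key eq $key;
        }
        close $inner_fh;
    }
    close $outer_fh;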

What am I trying to say with all this? Well, you certainly shouldn't stop using SQL and start writing the joins yourself in Perl. But I think it helps to understand what a big, complex piece of software like a database server is doing behind the scenes. Not only does it make it all seem less like magic, it also gives you more knowledge to deal with those pesky performance problems.