08 February 2010

The table with a tiny key

Whenever I model a new database table I like to start with an auto-incrementing primary key column. Maybe it's the Ruby on Rails convention creeping up on me, but I've even adopted the minimalist id as my default name for such columns. A lot of times I go ahead and make it an int. Sounds pretty good, huh? After all, the good old int is the de facto standard when it comes to integer data types. What else would you use, right?

Since you ask, there are a few alternatives out there: smallints, tinyints and bigints. Let's start with bigints. I don't think the name does it justice. Bigints aren't just big - they are massive. Massive in the range they can represent but also pretty hefty in the storage space they require: 8 bytes instead of the usual 4 an int takes. How can an additional 4 bytes be a problem? We'll come to that in a second but for right now, just ask yourself if your table will eventually have more than 2 billion rows. In most cases, that's a very reasonable upper limit. For example, if you started with an empty table and added a new row every second it would still take you approximately 60 years to reach that limit.

OK, so you probably don't need a bigint unless you're Google, Flickr or maybe Walmart. What about the other data types? Smallints and tinyints can't represent the large range of numbers that ints can. To refresh your memory, smallints go up to a little over 32,000 and of course tinyints a mere 255. And that's precisely where things get interesting. You see, the truth is, any database has a bunch of tables that have very few rows each. They might be domain tables like MaritalStatus ("Single", "Married", "Divorced", etc.) or PhoneType ("Work", "Home" or "Cell/Mobile"). Other tables might be limited by the nature of the data they store. For example, a table called WorldAirports is probably going to have less than 50,000 rows. My point is why use an int as a unique identifier for each row if you're not going to have that many rows? Why not use a smallint and save 2 bytes per row. Even better, use a tinyint and chop 3 bytes off every row. If that doesn't seem like a lot, try looking at it this way: a tinyint requires 75% less space than an int.

Now, wait a minute! What's the point of optimizing a table if it only has a few rows anyway? The total savings are going to be negligable! Well, not quite... The magic of these small tables is that their primary keys get used as foreign keys with the same data type in much larger tables. Let's say you have a PhoneDirectory table with one million phone numbers and each phone number has a PhoneTypeId that's a foreign key from PhoneType. The PhoneDirectory.PhoneTypeId column alone would take up 4MB of storage as an int but only 1MB as a tinyint. Not bad, huh?

But disk space is cheap and it's measured in terabytes these days. One megabyte, four megabytes, who cares? While there's certainly a lot to be said for throwing more hardware (in this case, storage) at a problem, those savings can add up and remember that a smaller database doesn't just mean less disk usage. It also means faster backups and restores, less replication errors, and smaller bandwidth usage for remote archiving and DR. Did I mention better performance? That's right, smaller rows means you can fit more of them in each page of data and therefore reduce the number of reads and writes.

Next time you create a table, have a think about that and choose your data types accordingly. Or get started on refactoring your existing tables right now.

No comments: