22 February 2010

Class inheritance in relational databases

Here's my rather crude definition of class inheritance: the ability to create a base class encapsulating the common elements of a bunch of other classes while letting developers focus on a higher level abstraction. If you write code in any OO language you probably use inheritance every so often. That's fine, it works great and we all understand it pretty well but what happens if your objects have to the stored in a database? Well maybe not your objects per se but the data they store and represent. You probably use a relational database already but it doesn't have any support for this so-called inheritance. Now what?

Actually, there is a well known technique for implementing the same idea in relational databases. Joe Celko explains all the little details in his great book so, if you can, go check it out. Anyhow, this is how I would do it.

Let's use an example. A database for a bicycle inventory system stores each type of bicycle part in its own table: Frame, Wheel, Pedal, etc. That makes sense since each part requires entirely different columns. The system has to track if a frame is for men or women, if a pedal requires special shoes, and if a wheel has spokes or not. (By the way, an alternative would be to store everything in a very wide table with lots of nullable columns but that has its own problems and doesn't use inheritance.) However, the details of a sale (quantity, date, etc.) are the same for every part. Do we really need a Sale table that has nullable FrameId, WheelId and PedalId columns? There must be a better way.

First of all, we need a Part table - that will be our base. Next, every row in Frame, Wheel or Pedal has to have a correspondent in Part. The best way to do that is to share ids. If we're going to do that, someone has to be in charge of issuing new ids and the logical candidate is the base table. How about this:

create table Part (
    id int identity not null primary key
)

create table Frame (
    id int not null primary key references Part
    -- other Frame columns
)

create table Wheel (
    id int not null primary key references Part
    -- other Wheel columns
)

create table Pedal (
    id int not null primary key references Part
    -- other Pedal columns
)
Great, now Sale can reference Part without caring whether the part is a wheel or a frame. We have a base, we have some derived tables and every derived row is also a member of the base table. That looks like inheritance to me. Job done! Let's move on.

Hold on a second, on closer inspection things look a bit messy. We couldn't select on Sale and Part and say if a wheel was sold even if we wanted to. We would have to join it to Wheel and see if a match was found or not. In other words, there's no way to tell what type a part is just by looking at the Part table. What we need is a part type:

create table PartType (
    typeId int not null primary key,
    typeName varchar(20) not null
)

insert PartType values (1, 'Frame')
insert PartType values (2, 'Wheel')
insert PartType values (3, 'Pedal')

alter table Part
    add typeId int not null references PartType
That's more like it. With this modification, the base table can assert that each row is actually of a specific derived type. If we add a row to Part like this:
insert Part(typeId) values (3)
    -- the row is assigned Part.id = 1
We know that the new row is a Pedal. Wait a minute, does that mean that I can't add a row to Wheel with id 1? Actually, you can and that would be weird. The Part table would say that part id 1 is a pedal but Wheel would say it's a wheel. Maybe it's a new hybrid the PedalWheel, or the WheelPedal?

We are still missing the final piece of the puzzle that brings it all together. Adding a row to the wrong derived table should not be allowed. We can stop that from happening by changing the foreign keys and adding a new constraint. Let's drop the tables and start again:

drop table Frame
drop table Wheel
drop table Pedal
drop table Part
drop table PartType

create table PartType (
    typeId int not null primary key,
    typeName varchar(20) not null
)

insert PartType values (1, 'Frame')
insert PartType values (2, 'Wheel')
insert PartType values (3, 'Pedal')

create table Part (
    id int identity not null primary key,
    typeId int not null references PartType,
    constraint u_Part unique (id, typeId)
)

create table Frame (
    id int not null primary key,
    typeId int not null,
    -- other Frame columns
    constraint fk_Frame_Part foreign key (id, typeId)
    references Part (id, typeId),
    constraint ck_Frame_Type check (typeId = 1)
)

create table Wheel (
    id int not null primary key,
    typeId int not null,
    -- other Wheel columns
    constraint fk_Frame_Part foreign key (id, typeId)
    references Part (id, typeId),
    constraint ck_Wheel_Type check (typeId = 2)
)

create table Pedal (
    id int not null primary key,
    typeId int not null,
    -- other Pedal columns
    constraint fk_Frame_Part foreign key (id, typeId)
    references Part (id, typeId),
    constraint ck_Pedal_Type check (typeId = 3)
)
We've added the typeId to the derived tables. At first, that seems unnecessary because each table will always have the same value in that column. Every row in Wheel will have typeId = 2. We need the column to take part in the foreign key and so we can apply the check constraint. Also note that, even though id alone is the primary key of Part, we need to specify a unique constraint on the pair (id, typeId) so it can be used as an alternate key. (Trivia: the SQL spec defines table constraints that can check values in other tables which would eliminate typeId from the derived tables. Unfortunately SQL Server doesn't support that feature.)

There you have it: The model is flexible and can deal with pretty much any single inheritance scenario. It's also robust and guards against most types of invalid data. The only outstanding problem is that it's possible to add a row to the base table without a correspondent in the correct derived table. Unfortunately that's not something that can be addressed by the model (not that I'm aware of) but there are a few ways to deal with the issue. Personally, I would use stored procedures for the inserts, updates, and deletes. If T-SQL isn't your thing, try building the correct logic into the data access layer of your application instead.

Now go have a good look at your database model. Can you find any tables that could this technique?

No comments: