Friday, September 21, 2007

Object-oriented database systems

I had a discussion with Anders today about databases.

I'm not very fond of relational databases, or perhaps more precisely SQL. Having to convert back and forth between a database model and SQL and a programming language with all it entails of proper quoting to prevent SQL injection attacks and joining tables all over the place is really a great source of frustration. And a good waste of time.

On the other hand, I've seen what a record-based file system with no built-in join can do to people. And the built-in indexing stuff that lets you search on different attributes without a linear scan is just neat.

I think the great success of the relational database systems has a few simple explanations. Many business-related applications are really about the data. They spend their time collecting and displaying lots of simple, boring data - unlike an entertainment program or a standard desktop application that spends a lot of time dealing with small amounts of data, maybe less than say 1000 objects.

This data may be larger than what can be kept in memory (the possibility of this happening is enough to make it a problem). Furthermore the data must be kept when the machine is turned off. So you have to deal with secondary storage, a topic largely ignored by most programming languages that otherwise go to great lengths to make dynamic use of primary storage as transparent as possible.

Finally, you need to be able to access all that boring data in a many different ways. Say you have a situation that you naturally think of as a hierarchy, e.g. a company with multiple departments with each department having a number of employees and each employee having far to many tasks to do.

In object-oriented thinking, you would model this as a list of departments, each with some attributes and a list of employees, each again with a list of tasks. Want to show the org chart in the company? Easy, just loop through the departments and print the list of employees one by one. Super efficient. Assign a task to an employee? Easy, just add it to his list. Super efficient. But what if you want to know who's task it is to empty the bin? Then you have to wade through two outer loops plus lots of unrelated stuff, just to get to the actual data you're interested in.

Relational database systems, while solving the secondary storage problem, have an answer to this problem, and it's simple and elegant.

Then never mind that getting data in and out of them is hugely painful, especially hierarchical data, even if you're not trying to force everything leaving the database into a class.

I've recently been working a lot with Django's object-relational mapper, the ORM (which BTW means "worm" in Danish), and it really takes the pain out of the situation since it obviates the need for writing SQL strings in favour of native, albeit somewhat weird, Python code.

However, I still wonder why true object-oriented databases haven't overtaken the world ages ago. One that I hand over my object to and later can get it back from, in one piece with no fuss.

It turns out that there's a pretty large open-source one available, db4o. I think it solves the problem of retrieving objects based on arbitrary attribute values by incorporating a special query system where it's possible to define indexes like for a relational database, but I'll have to do some more reading.

There's a smaller one, DyBASE, for Python (and Ruby and PHP). It could be interesting to compare the performance against Django's ORM + MySQL.

The interesting question is of course whether you need a relational database if you can have the simpler object-oriented one. Interoperability issues aside, some people apparently seem to believe that the relational model is crucial for performance with huge amounts of data.

It would seem to me, though, that the natural partitions which the object-oriented paradigm splits the data into would tend to make things faster since there's less data to search through. For instance, if you're only concerning yourself with data from one department in the company, then it's rather wasteful to work with the maybe ten times larger table containing data for all departments. Perhaps it has something to do with joins and restrictions being difficult to express simultaneously in one query with the object-oriented approach. There are some benchmarks on www.polepos.org although they are not exactly easy to interpret at a glance.

Other people, like this guy on Slashdot, seem to think that the abstract relational model makes for better databases as opposed to application-specific persistent storages.

This makes me wonder whether many, perhaps most, uses of a relational database today wouldn't be better served by a simple object-oriented application-specific data store. I think most run-of-the-mill web sites need easy object persistence far more than terabyte scalability or an enterprise-wide database. Leave the heavy artillery for when it's really required.

Then, on the other hand, if you can get easy object persistence with Django's ORM, maybe this is a moot point.

No comments:

Post a Comment