|
|
05.08.2004 |
|
||||||||||
| Why Not Use the Database? | ||||||||||||
I've been a computer geek since a boy, and thoughts related to computers and software engineering get dropped here for the benefit of humanity and my own hubris.
|
The last couple of years of server-side architecture has left me with a pretty bad taste in my mouth, and I’ve just got to spit… However, I have a suspicion that some of my ideas may be seen as heretical, so allow me to protect myself from the pyre by stating that the following idea is not a global truth, but a wakening shake for all of us to realize that not everything is a pattern from our last project. Allow me to get concrete by talking a bit about my last project… which was as much an experiment as it was a deliverable to the customer… Most server-side or web-based applications are written using object-oriented technology (OOP), and it seems that we have a fetish for writing useless objects. Yes, sometimes you have to build some helper objects, but it is better to write those with a clear purpose in mind and to realize that you might not need to… I’m specifically referring to fact that we have this itch to create an object that mirrors our database tables. For example, in a CRM system, we would be tempted to create a “customer” and a “task” and a “campaign” object… granted, I did create a database table for each of those, but why are we so tempted to also create an object. I know, our rationale goes something like this: I need to retrieve the data from the database, and I don’t want to duplicate that code every time I need a customer, so I will instantiate an object that will do that business. Now let me bore you with the slippery slope that happens afterwards… you first create the object, and it is good. After a while, you need to retrieve more and more complex data associated with customer, and suddenly the performance goes to hell because to list all of the customer names takes an hour as it is gleefully calculating each customer’s mortgage payments as well. Let me give you an example from my last project. In my case, each customer could have zero or more “tasks”, be associated with zero or more “campaigns” and had one registration event. Each of these, while stored in a different table, had a date associated with it, and in one particular “screen” (read: servlet), we wanted to display the names of the customers with the latest date associated with each customer from any of these tables. Since I am grabbing a value that is related from three different tables (but doesn’t have the same name), we need to perform an SQL union with the max date. In my case, I threw this into a temporary table and then joined it with the customer list. My point is not to get bogged down into the details, but to show to give you and idea of the amount of work required to get this “business data”. Let’s suppose I had taken the typical approach and had created a “Customer” class, with a number of Retrieving each property when it is called seems terribly inefficient if I am trying to display all of the data associated with a customer, as this could result in a hundred database calls… But performing this SQL union business would be silly during the object’s construction because I might just be needing to list all of the customer’s names in a list, as we don’t want to perform costly database operation without needing to… So since I’m such a smart guy, I’ll have the costly properties called when we need them, and the rest performed at construction… and also since I’m so smart, I will realize that to retrieve all the customer’s data will not be abysmal, it also won’t be optimal— as each costly method will have to perform its database operations by itself, and can’t leverage any pre-calculations done by any other costly method. Unless, I cache some of these pre-calculations… Yes, this idea is an exciting white rabbit hole… But before you start coding this intensive customer class, stop and think if it is worth it. Will you ever reuse this code? No way, this is a customer class that will be slightly different from the next customer class you’ll have to write. Also keep in mind that this is a mindless database application… it just isn’t worth your time to build such complexity. Especially when complexity == risk for bugs. Of course there’s another way… don’t make the object in the first place. What? Yeah, I know this might be a controversial idea, but think about it… we’ve been doing database applications for a long time— longer than we’ve been doing OOP. So why not use the database for what it is good for— get data. How did we do it before? If a “screen” needed a bunch of data, we performed the SQL operation, got that data, and stuck it up on the screen. The advantage of this approach is that each screen can perform optimal database operations to get just the data it requires… no more, and no less. The disadvantage? Obviously there may be a few “screens” that need to perform the same database operations, and this approach may result in duplicate code. It has been my perspective that this doesn’t happen very often, as usually each display “screen” required different SQL code, so it really results in similar (not duplicate) code, but one in which it can adapt to the needs of any individual “screen.” However, just like you abstract all of your output messages into a separate text file (resource bundle), we could do the same for all of these SQL statements. Now, instead of hard-coding the SQL string, you retrieve it from a resource bundle based on some identification string. If a “screen” requires a slightly different SQL statement, we just create it and add it to the bundle. Put down the pitchforks— I’m not saying don’t make any more objects… if an object did some non-database calculation, then clearly make that object. But I’d bet that the bulk of these database-oriented-objects probably don’t need to be objects at all. If this idea gives you hives, then at least don’t perfect your re-invented wheel, but use some tool like Hibernate or JDO… just don’t get me started on EJBs. Thought originally posted on Saturday, 8 May 2004
© 2004-2005, Howard Abrams • Except where otherwise noted, all original content is licensed under a Creative Commons License (see details). A comment to this from Greg Jorgensen
I just read your article. You hit on something I’ve wondered and bitched about for a while. As a long-time database/SQL programmer I have often wondered why programmers use relational databases as persistent object storage and ignore the powerful and well-tested data manipulation and programming features RDBMSs offer. I’ve seen hundreds of the “wrapper” classes you mention. The idea seems to be to map classes to database tables (or vice-versa), then moving the business/rules and logic into higher-level classes. This leads to clunky and inefficent classes for customer, employee, etc. and code somewhere else that manipulates the objects. Reporting and almost any aggregate functionality gets tedious to program AND slow, since it either generates a lot of discrete database queries, or it uses cursors on a result set. I guess the idea of separating the database structure and contents from the business logic is a holdover from the three-tier architecture craze. Or maybe it’s caused by an overwhelming desire to do as much as possible in the “cool” OOP language and only do simple SELECT/INSERT/UPDATE transactions on the database. If you think about it for a minute, you realize that placing the data structure and contents in a back-end relational database, and the logic for manipulating/validating/displaying that data somewhere else (in middle-tier classes) is really the antithesis of OOP: data and the operations on it are now separated rather than encapsulated. Rather than write a bunch of wrapper classes, I let the database engine do what it’s good at: managing data. So my Customer class has methods for OrderVolume(periodstart, periodend) or ChangeAddress(…). The wrapper class does nothing more than get a database connection (either opening a new one or using a pooled open connection) and passing NAMED parameters to a stored procedure. The SQL code in the database knows how to physically manipulate the data. I recently rewrote some code that had wrappers for every table in the database: you could SELECT, INSERT, UPDATE, DELETE by passing a zillion parameters in a dictionary. The wrappers were really just training wheels to keep the programmer away from SQL. The wrappers constructed the SQL, but had no way to validate it for correct field names or types, so the SQL was tossed at the RDBMS with fingers crossed. Tracking through the code to find references to tables or columns was tricky because of the layers of indirection. My fix was to determine the operations the code actually performed against the database (and there were not very many operations). It never did something like change the city but not the zip code, or display just one detail line from an order. So there was no reason to have wrappers that could be made to do such things. Instead I wrote functions that know how to change a customer address, or retrieve all order detail lines at once. Those functions were mostly a few lines of code that called a stored procedure. Stored procedures are syntax-checked, type-validated, and compiled (I’m using SQL Server), so they are known to work before they are executed, and they are much faster than sending random SQL statements. In the process I was able to move code that does secondary lookups into JOINs in the stored procedure, do some formatting of results, calculate discounts and defaults, etc. The combined code is much smaller, simpler, and easier to follow than the original. Now if I we need to change a discount percentage that happens in one place, in the database code, instead of in hundreds of places that were fetching order detail lines one at a time, looking up the discount factor in the inventory table, doing the calculation, and keeping track of all that. In short, try using different abstractions for the database. Treat it as a black box that can do specific operations on data or return formatted results, instead of treating it like a high-capacity dBase file. The arguments I’ve heard against this kind of slicing are mostly spurious:
My anecdotal experience is that too many programmers don’t get relational database theory because it’s one of the (very) few things in the programming world that is based on real math and has provably right and wrong ways to do things. That’s not to say database designers don’t argue about normalization, just that the arguments are usually based more on definable and testable assumptions than arguments comparing, say, Java to C#. I think most programmers don’t get the set orientation of RDBMSs and SQL, and tend to think in terms of records and fields instead of relations; their thinking is one-dimensional, which creates the loads of slow and buggy cursor-oriented code I see all the time. The record-oriented approach introduces locking and consistency problems that the database engine can handle if you let it. It also hobbles the database engine and slows things down to the point that caching results (which any decent database engine does already) starts to seem like a good idea. Comment posted on Thursday, 3 June 2004Check out this other web page that references this entry...
Web Frameworks?
|
|||||||||||
|
||||||||||||