Introduction
In many typical SQL shops, data oriented business applications are sometimes developed with less emphasis on data integrity for a
variety of reasons. Database developers often wear several hats to fit roles starting from system analyst to data modeler/database
designer. In the midst of satisfying the client's wish lists, they are often tempted to use several kludgy design schemes. One such
scheme among the most prevalent kludges is a common lookup table.
What is it?
Within the commonly accepted definitions in the computer field, a lookup table is simply an array-like data structure used to replace
a runtime computation with a simpler lookup operation. In the arena of data management, the concept is somewhat similar.
Consider a typical business system where orders are placed by several customers. Naturally, several entities like customers, orders,
statuses, locations etc. are involved in this business model. Let us assume at a given point in time a customer has a certain status,
his order has certain priority level, the location where an item is shipped has a certain code, the freight selected has a discount
type, nature of the warehouse has a certain classification code and so forth. For simplicity, let the logical schema supporting these
entities is represented as:
OrderStatus ( status_id, status_description )
status_id status_description
----------------------------------
1 Ordered
2 Cancelled
3 Shipped
AssetPriority ( priority_code, priority_description )
priority_code priority_description
------------------------------------
0 Unspecified
1 Over 5 Million
3 3 - 5 Million
4 1 - 3 Million
5 500,000 - 1 Million
6 Less than 500,000
DiscountType ( discount_code, discount_details )
discount_code discount_details
--------------------------------
1 Half Quota
2 One or more references
LocationCode ( location_nbr, location_details )
location_nbr location_details
--------------------------------
1 Northeast region
2 Western states excluding CA
3 Southern states and West Indies
WarehouseType ( warehouse_type, warehouse_description )
warehouse_type warehouse_description
-------------------------------------
1 Central Headquarters
2 Colocation
3 Inventory shipping
These entity types are often referenced by several other entity types and therefore the resulting schema might have several foreign key constraints referencing these tables. When the complexity of the business model increases, the number of such entities increases as well. And sometimes, this increase might give a certain unpleasant sense of clutter especially to those familiar with programming background. Some folks address this "clutter" by advancing approaches that reduce the number of tables. And the solution they often arrive at is a single lookup table!
GenericLookup( category, code, description )
category code description
-------------------------------------------
1 1 Ordered
1 2 Cancelled
1 3 Shipped
2 0 Unspecified
2 1 Over 5 Million
2 3 3 - 5 Million
2 4 1 - 3 Million
2 5 500,000 - 1 Million
2 6 Less than 500,000
3 1 Half Quota
3 2 One or more references
...
These structures are often known as One-True-Lookup-Table. Some online resources have termed them as MUCK tables ( Massively
Unified Code-Key tables ) or domain tables or code tables etc.
Superficial Benefits
These so called benefits are really long term nightmares. I am detailing them so that you can be familiar with the reasons why
some are lured into using these designs.
- Easy to use: Rather than understanding the underlying conceptual model and the tables that representing them,
all the developers need to know is the actual category name or the type and they can include it as a parameter.
- Quick and Simple: Which is simpler -- Lots of tables or just one column? To the untrained eye, simply adding
a column makes it quick solution. After all most of the queries may look simpler with less number of tables to join,
union, intersect etc.
- Reducing data access code: Rather than using separate data access procedures, with such common lookup tables
developers can now write a single all encompassing procedures passing an additional parameter. This boon multiplies
when updates, inserts and deletes can all be bundled up in a single procedure.
- Agile and flexible: Frankly I don't even know what the exact meaning of the term agile in the context of
data management. However I have seen many using it to mean some property of a certain schema to be extended without
breaking existing constraints. Essentially using a common lookup table seems to allow addition any number of categories
without altering the schema or adding additional tables. This seems to be one of the most compelling temptations for a
developer to opt for such a design solution.
Drawbacks and shortcomings
Now that we have gone through the "benefits", let us look at the common lookup table approach from the perspective of data
management and isolate the problems. It is quite possible that some the problems may not appear to be real problems unless we
really dig through the fundamentals of relational database design. Also in general current business environments typically favor
high performance over data integrity and therefore there could be tendency to ignore the problems altogether.
- Lack of Referential Integrity:
Referential Integrity refers to the ability of a DBMS to enforce foreign key
constraints for insert, update and delete operations with appropriate restrict and cascading options. They ensure
consistent references among tables emphasizing the kinds of relationships that exist among applicable entities. A common
lookup table cannot ensure this consistency and this is a critical drawback of such approaches.
While a direct foreign key constraint can be defined, it is essentially meaningless since it offers no real referential
integrity. Sometimes it is suggested that one can have a composite foreign key constraint over multiple columns, case
in point over category and code columns. The problem here is with reusing the composite alternate key ( category, code )
in all other tables. And the category column may obviously lead to unwanted dependencies in the referencing tables.
Alternatively the integrity enforcement will have to be done external to the DBMS which is a sure source of disaster.
- Ignoring types:
In a properly implemented DBMS, a rich type system that can represent any value of arbitrary
complexity is an essential component. However, with the commercial systems supported by SQL cannot boast of such a luxury.
Therefore intermingling of different types becomes a source of problem with common lookup tables. In the example stated
above, if the discount code is CHAR(3) and location_nbr is INT(4), what is the type of code column in GenericLookup table?
- Mediocre constraints:
This is a significant drawback of using common SQL lookup table since we have no way of
restricting the values that can be shoved into this table. Limiting the size and range of values becomes a nightmare given
that there is no way to predict a safe, reasonable maximum size. For example, priority_description could be CHAR(10) while
warehouse_description could be NVARCHAR(800). What would be an appropriate CHECK() for such or more complex discrepancies?
- Limited expressiveness:
This is again an extension of the problems described above. Every query or data
manipulation statement would have to include the category value while referring to this table. A composite key would
become a necessity to prevent redundancy. And since SQL allows NULLs in foreign keys the logic in the references by itself
breaks down. When the user inserts a referencing row with a value for category and NULL for code, should the DBMS allow
the insert? Since logically DBMS cannot preserve the integrity of the relationship should it abort the insert? Or should
it consider the validity of the category value alone and proceed with the insert? The same issues exist while updating
the referencing values as well.
- Rigidity:
How can such a seemingly simple and "flexible" design be rigid? Well considering the above common lookup table scheme, just
imagine the LocationCode table to include another column say, region. Or say adding a status to the DiscountType table?
Just to change a single category you'll have to consider make way for all the rows in the table regardless of the new
column is applicable to them or not. Again developers tend to make such columns nullable and "logic" using nulls is a
topic for another day.
- Hidden Complexity:
This is something that is not very apparent. Often the idea of using common lookup tables
come from the idea of generalizing entities where by a single table represents a "thing" -- pretty much anything. Contrast
this with the fundamental that a well designed table represents a set of facts about entities or relationships of the
same kind. The problem with generalizing entities is that a table becomes a pile of unrelated rows. Consequently it
results in the loss of meaning, leading to confusion and often unwanted complexity.
Another suggestion is to use multiple tables with a view with a UNION ALL query. This, in some cases, can be beneficial
providing logical data independence and appropriate security, however view update requirements vary from product to product.
Moreover one can argue the same for a common lookup table using multiple views for each category. This again will have to
be weighed against the arguments against the lack of constraints and unnecessary complexity that arises from managing them
just for the sake of doing it.
The main goal of a DBMS is to enforce the rules that govern how the data is represented and manipulated. Make sure you do
not confuse the terms "generalize", "reuse" etc. in the context of database design to the extent where you have no control
over what is being designed.
- Physical Implementation Issues:
While logical design is a totally separate concern from physical implementation,
commercial DBMS products supported by SQL lack sufficient insulation between them. As a result drawbacks in physical
implementations are often attributed to logical design and vice-versa. In large enterprises, such common lookup tables
can become hundreds of thousands of rows leading to the need for heavy physical database tuning. Locking and concurrency
issues with such large tables will have to be controlled. The internal representation of a particular set of row in physical
storage can be a determining factor in how efficient the values can be accessed and manipulated by SQL queries.
Recommendations
Now that you have some idea about the drawbacks of using common lookup tables here is a brief list of general recommendations
that you can use in your database design projects.
- First and foremost, avoid generic lookup approaches as a rule. Use separate tables identifying appropriate columns with proper types,
constraints and references.
- Write distinct procedures to access and manipulate data in the tables
- View approaches should be avoided since view expressions in SQL limit constraints, violate closure and have drawbacks
with updating them.
Conclusion
Whether it is used as a short term make shift solution or as a long term viable solution, common lookup tables have no place in
sensible database design. While application enforced integrity is often touted by many in the developer crowd, the fact that DBMS
is the centralized enforcer of all integrity constraints still stands. In that regard, considering preserving data integrity and
logical correctness as the foremost goal in a given database design, common lookup tables are one of the worst kind of mistakes
that one can make in a stable data management solution.
Acknowledgements
Jake Christian
|