Jul 23 2010

Domain Integrity in Relational Database

A domain is the set of all possible values for a given attribute. A domain integrity constraint—usually just called a domain constraint—is a rule that defines these legal values. It might, of course, be necessary to define more than one domain constraint to describe a domain completely.

A domain isn’t the same thing as a data type, and defining domains in terms of physical data types can backfire. The danger is that you will unnecessarily constrain the values—for example, by choosing an integer because you think it will be big enough rather than because 255 is the largest permitted value for the domain.

That being said, however, data types can be a convenient shorthand in the data model, and for this reason choosing a logical data type is often the first step in determining the domain constraints in a system. By logical data type, I mean “date,” “string,” or “image,” nothing more specific than that. Dates are probably the best example of the benefits of this approach. I recommend against defining the domain TransactionDate as DateTime, which is a physical representation. However, defining it as “a date” allows you to concentrate on it being “a date between the commencement of business and the present date, inclusive” and ignore all those rather tedious rules about leap years.

Having chosen a logical data type, it might be appropriate to define the scale and precision of a numeric type, or the maximum length of string values. This is very close to specifying a physical data type, but you should still be working at the logical level. Obviously, you will not be hit by lightning if your particular shorthand for “a string value of no more than 30 characters” is char(30). But the more abstract you keep the description in the data model, the more room you’ll have to maneuver later and the less likely you’ll be to impose accidental constraints on the system.

The next aspect of domain integrity to consider is whether a domain is permitted to contain unknown or nonexistent values. The handling of these values is contentious, and we’ll be discussing them repeatedly as we examine various aspects of database system design. For now, it’s necessary to understand only that there is a difference between an unknown value and a nonexistent value, and that it is often (although not always) possible to specify whether either or both of these is permitted for the domain.

The first point here, that “unknown” and “nonexistent” are different, doesn’t present too many problems at the logical level. (And please remember, always, that a data model is a logical construct.) My father does not have a middle name; I do not know my next-door neighbor’s. These are quite different issues. Some implementation issues need not yet concern us, but the logical distinction is quite straightforward.

The second point is that, having determined whether a domain is allowed to include unknown or nonexistent values, you’ll need to decide whether either of these values can be accepted by the system. To return to our TransactionDate example, it’s certainly possible for the date of a transaction to be unknown, but if it occurred at all it occurred at some fixed point in time and therefore cannot be nonexistent. In other words, there must be a transaction date; we just might not know it.

Now, obviously, we can be ignorant of anything, so any value can be unknown. That’s not a useful distinction. What we’re actually defining here is not so much whether a value can be unknown as whether an unknown value should be stored. It might be that it’s not worth storing data unless the value is known, or it might be that we can’t identify an entity without knowing its value. In either case, you would prevent a record containing an unknown value in the specified field from being added to the database.

This decision can’t always be made at the domain level, but it’s always worth considering since doing so can make the job a little easier down the line. To some extent, your decision depends on how generic your domains are. As an example, say that you have defined a Name domain and declared the attributes GivenName, MiddleName, Surname, and CompanyName against it. You might just as well have defined these attributes as separate domains, but there are some advantages to using the more general domain definition since doing so allows you to capture the overlapping rules (and in this case, there are probably a lot of them) in a single place. However, in this case you won’t be able to determine whether empty or unknown values are acceptable at the domain level; you will have to define these properties at the entity level.

The final aspect of domain integrity is that you’ll want to define the set of values represented by a domain as specifically as possible. Our TransactionDate domain, for example, isn’t just the set of all dates; it’s the set of dates from the day the company began trading until the current date. It might be further restricted to eliminate Sundays, public holidays, and any other days on which the company does not trade.

Sometimes you’ll be able to simply list the domain values. The domain of Weekends is completely described by the set {“Saturday”, “Sunday”}. Sometimes it will be easier to list one or more rules for determining membership, as we did for TransactionDate. Both techniques are perfectly acceptable, although a specific design methodology might dictate a particular method of documenting constraints. The important thing is that the constraints be captured as carefully and completely as possible.

LEAVE A COMMENT

Subscribe Form

Subscribe to Blog

Spam Protection by WP-SpamFree