A
surrogate key in a
database is a unique identifier for either an
entity in the modeled world or an
object in the database. The surrogate key is
not derived from application data.
Definition
There are at least two definitions of a surrogate:
; Surrogate (1) : This definition is based on that given by Hall, Owlett and Todd (1976). Here a surrogate represents an entity in the outside world. The surrogate is internally generated by the system but is nevertheless visible to the user or application.
; Surrogate (2) : This definition is based on that given by Wieringa and De Jonge (1991). Here a surrogate represents an object in the database itself. The surrogate is internally generated by the system and is invisible to the user or application.
The surrogate (1) definition defines its usage in the data model rather than the storage model and is used in this article. See Date (1998).
An important distinction exists between a surrogate and a primary key, depending on whether the database is a current database or a temporal database. A current database stores only currently valid data, therefore there is a one-to-one correspondence between a surrogate in the modelled world and the primary key of some object in the database; in this case the surrogate may be used as a primary key, resulting in the term surrogate key. However, in a temporal database there is a many-to-one relationship between primary keys and the surrogate. Since there may be several objects in the database corresponding to a single surrogate, we cannot use the surrogate as a primary key; another attribute is required, in addition to the surrogate, to uniquely identify each object.
Although Hall et alia (1976) say nothing about this, others have argued that a surrogate should have the following
* the value is unique system-wide, hence never reused;
the value is system generated;
the value is not manipulable by the user or application;
the value contains no semantic meaning;
the value is not visible to the user or application;
the value is not composed of several values from different domains.
Surrogates in practice
In a
current database, the surrogate key can be the
primary key, generated by the
database management system and
not derived from any application data in the database. The only significance of the surrogate key is to act as the primary key. It is also possible that the surrogate key exists in addition to the database-generated uuid (for example, an HR number for each employee besides the UUID of each employee).
A surrogate key is frequently a sequential number (e.g. a Sybase or SQL Server "identity column", a PostgreSQL serial
, an Oracle SEQUENCE
or a column defined with AUTO_INCREMENT
in MySQL) but doesn't have to be. Having the key independent of all other columns insulates the database relationships from changes in data values or database design (making the database more agile) and guarantees uniqueness.
In a temporal database, it is necessary to distinguish between the surrogate key and the primary key. Typically, every row would have both a primary key and a surrogate key. The primary key identifies the unique row in the database, the surrogate key identifies the unique entity in the modelled world; these two keys are not the same. For example, table Staff may contain two rows for "John Smith", one row when he was employed between 1990 and 1999, another row when he was employed between 2001 and 2006. The surrogate key is identical (non-unique) in both rows however the primary key will be unique.
Some database designers use surrogate keys systematically regardless of the suitability of other candidate keys, while others will use a key already present in the data, if there is one.
A surrogate key may also be called
a synthetic key,
an entity identifier,
a system-generated key,
a database sequence number,
a factless key,
a technical key, or
an arbitrary unique identifier.
Some of these terms describe the way of generating new surrogate values rather than the nature of the surrogate concept.
Here are some possible candidates for generating surrogates:
* Universally Unique Identifiers (UUIDs)
Globally Unique Identifiers (GUIDs)
Object Identifiers (OIDs)
Sybase or SQL Server identity column
Oracle SEQUENCE
PostgreSQL serial
MySQL AUTO_INCREMENT
AutoNumber data type in Microsoft Access
AS IDENTITY GENERATED BY DEFAULT
in IBM DB2
Advantages
Immutability
Surrogate keys do not change while the row exists. This has the following advantages:
* Applications cannot lose their reference to a row in the database (since the identifier never changes).
The primary key data can always be modified, even with databases that do not support cascading updates across related foreign keys.
Requirement changes
Attributes that uniquely identify an entity might change, which might invalidate the suitability of the natural, compound keys. Consider the following example:
:An employee's network user name is chosen as a natural key. Upon merging with another company, new employees must be inserted. Some of the new network user names create conflicts because their user names were generated independently (when the companies were separate).
In these cases, generally a new attribute must be added to the natural key (for example, an original_company column).
With a surrogate key, only the table that defines the surrogate key must be changed. With natural keys, all tables (and possibly other, related software) that use the natural key will have to change.
Some problem domains do not clearly identify a suitable natural key. Surrogate keys avoids choosing a natural key that might be incorrect.
Performance
Surrogate keys tend to be a compact data type, such as a four-byte integer. This allows the database to query the single key column faster than it could multiple columns. Further:
* A non-redundant distribution of keys causes the resulting b-tree index to be completely balanced.
Surrogate keys are less expensive to join (fewer columns to compare) than compound keys.
Compatibility
While using several database application development systems, drivers, and
object-relational mapping systems, such as
Ruby on Rails or
Hibernate (Java), it is much easier to use an integer or GUID surrogate keys for every table instead of natural keys in order to support database-system-agnostic operations and object-to-row mapping.
Uniformity
When having a uniform surrogate key on each and every table, some tasks can be easily automated for all the tables by writing the code in a table-independent way.
Validation
It is possible to design key-values such that they follow a certain well-known pattern or structure which can be automatically verified by some means. For instance, the keys which are intended to be used in some column of some table, might be designed to "look differently from" those that are intended to be used in another column or table, thereby simplifying the detection of application errors in which the keys have been misplaced. However, this characteristic of the surrogate keys should never be used to drive any of the logic of the applications themselves, as this would violate the principles of
Database normalization.
Disadvantages
Disassociation
The values of generated surrogate keys have no relationship to the real-world
meaning of the data held in a row. When inspecting a row holding a foreign key reference to another table using a surrogate key, the meaning of the surrogate key's row cannot be discerned from the key itself. Every foreign key must be joined to see the related data item. This can also make auditing more difficult, as incorrect data is not obvious on inspection.
Surrogate keys are unnatural for data that is exported and shared. A particular difficulty is that tables from two otherwise identical schemas (for example, a test schema and a development schema) can hold records that are equivalent in a business sense, but have a different key due to how the keys were assigned. This can be mitigated not exporting surrogate keys, except as transient data (most obviously, in executing applications that have a "live" connection to the database).
Query optimization
Relational databases assume a unique
index is applied to a table's primary key. The unique index serves two purposes: 1) to enforce entity integrity—primary key data must be unique across rows—and 2) to quickly search for rows queried. Since surrogate keys replace a table's identifying attributes—the
natural key—and since the identifying attributes are likely to be those queried, then the query optimizer is forced to perform a full table scan when fulfilling likely queries. The remedy to the full table scan is to apply indexes on the identifying attributes, or sets of them. Where such sets are themselves a
candidate key, the index can be a unique index.
These additional indexes, however, will take up disk space and slow down inserts and deletes. Choosing what sets of attributes shall be indexed can be something of an art. It depends on what queries are typically run against a database and often on knowledge of the "real world". A database administrator (DBA) will typically monitor the performance of a working database and take note of what queries are frequently run that trigger full table scans, and add indexes accordingly.
Normalization
The presence of a surrogate key can result in the database administrator forgetting to establish, or accidentally removing, a secondary unique index on the natural key of the table. Without a unique index on the natural key, duplicate rows can appear and once present can be difficult to identify.
Although applications accessing the database should be free of bugs and never attempt to create rows with duplicate natural keys, nevertheless properly maintained unique keys are a vital "last line of defense" against this type of data corruption. This is particularly the case where data is sometimes mutated by user-entered SQL. Of course, due to the very fact that a well-tested and bedded down application will not attempt to introduce duplicate keys, the absence of database constraints is often not immediately apparent as a problem.
Modern databases, however, can apply database triggers to enforce unique constraints.
Business process modeling
Because surrogate keys are unnatural, flaws can appear when modeling the business requirements. Business requirements, relying on the natural key, then need to be translated to the surrogate key. A strategy is to draw a clear distinction between the logical model (in which surrogate keys do not appear) and the physical implementation of that model, to ensure that the logical model is correct and reasonably well normalised, and to ensure that the physical model is a correct implementation of the logical model.
Inadvertent disclosure
Proprietary information can be leaked if sequential key generators are used. By subtracting a previously generated sequential key from a recently generated sequential key, one could learn the number of rows inserted during that time period. This could expose, for example, the number of transactions or new accounts per period. There are a few ways to overcome this problem:
* Increase the sequential number by a random amount.
Generate a completely random primary key. However, to prevent duplication which would cause an insert rejection, a randomly generated primary key must either be queried (to check that it is not already in use), or the key must contain enough entropy that one can be confident that collisions will not happen.
Inadvertent assumptions
Sequentially generated surrogate keys create the illusion that events with a higher primary key value occurred after events with a lower primary key value. This illusion would appear when an event is missed during the normal data entry process and is, instead, inserted after subsequent events were previously inserted. The solution to the inadvertent assumption problem is to generate a random primary key. However, a randomly generated primary key must be queried before assigned to prevent duplication and cause an insert rejection.
See also
Natural key
Object identifier
Persistent Object Identifier
References
Engles, R.W.: (1972), A Tutorial on Data-Base Organization, Annual Review in Automatic Programming, Vol.7, Part 1, Pergamon Press, Oxford, pp. 1–64.
Langefors, B (1968). Elementary Files and Elementary File Records, Proceedings of File 68, an IFIP/IAG International Seminar on File Organisation, Amsterdam, November, pp. 89–96.
The identification of objects and roles: Object identifiers revisited by R. Wieringa and W. de Jonge (1991).
Relational Database Writings 1994–1997 by C.J. Date (1998), Chapters 11 and 12.
Category:Data modeling