Talk:Surrogate key

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Computing (Rated Start-class)
WikiProject icon This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Start-Class article Start  This article has been rated as Start-Class on the project's quality scale.
 ???  This article has not yet received a rating on the project's importance scale.
 


The assertion that "the addition of a surrogate key will slow down access to the table, particularly if it is indexed" should be removed. It's not substantiated, incomplete, and misleading. It has no place in theoretical performance computations and in practice is almost always incorrect anyway. Finally, it's not a part of the definition.

Agreed[edit]

I agree. An index on a surrogate key will be smaller (and therefore could be faster) than an index on a natural key.

Hrmph![edit]

Some Anonymous Hero's decided that it's better to throw out everybody else's work and dump in some essay they wrote themselves. I'm marking this for wikification, and I'm going to wikify and merge in some of the old content if it's relevant when I've more time. --Kgaughan 19:35, 13 December 2005 (UTC)

Needs Attention[edit]

This article needs attention. The current posting is correct, though quite poorly written. It's certainly relational-centric.

The old article seems to confuse system-assigned keys with surrogate keys, particularly in this paragraph: ""Surrogate key" may also be known as "System-generated key", "Database Sequence number", "Synthetic key", "Technical key" or an "Arbitrary, unique identifier"."

I don't think either version thinks of "surrogate keys" from the point of view of data warehousing, though.

-- 03:01, 31 December 2005 (UTC)

Huh? Surrogate Keys are, AFAIK, a relational database concept. How does the term even apply to other kinds of databases? Jberkus 19:05, 23 September 2006 (UTC)

Similarities[edit]

Surely a surrogate key and a primary key are identical? Is there really any difference?

Sure there is. A primary key can be surrogate, or it can be natural. A surrogate key is always a candidate key, but may or may not be designated primary key (as any other candidate key, really). Of course, in practice surrogate keys are pretty much always primary keys (but not vice versa, though surrogate primary keys seem to be more popular at the moment). -- int19h 08:12, 11 September 2006 (UTC)

Added Advantages/Disadvantages[edit]

I've added a summary of the main advantages and disadvantages of Surrogate Keys. I've also linked my own blog article; I realize that linking one's on blog is in poor taste, so if someone else has an article which extensively catalogs the disadvantages of surrogate keys, then please link it instead. Jberkus 19:07, 23 September 2006 (UTC)

Surrogate Keys in Data Warehouse[edit]

Isn't surrogate keys more applied in the data warehousing for relating fact and dimension tables. Surrogate keys are also one of the solution for the "Slowly Changing Dimension" problem. I strongly feel this has to be included along with the following external link [1]. —The preceding unsigned comment was added by 203.123.182.27 (talk) 05:19, 2 February 2007 (UTC).


Compatibility[edit]

"several database application development systems, drivers, and object-relational mapping systems, such as Ruby on Rails or Hibernate, depend on the use of integer or GUID surrogate keys in order to support database-system-agnostic operation and object-to-row mapping."

Hibernate works fine with non-surrogate key, even with composite keys, see [2]. I know little about Ruby on Rails, but I very much doubt that they require surrogate keys.

—The preceding unsigned comment was added by 85.158.137.195 (talk) 11:16, 20 April 2007 . (sig added by --FvdP (talk) 16:07, 9 May 2008 (UTC))

Random keys[edit]

"However, a randomly generated primary key must be queried before assigned to prevent duplication and cause an insert rejection." Really ? Isn't the point of some random generation schemes (random 128-bit UUIDs) to make the test redundant by making the collision probability definitely negligible ? Also can't the collision, if need be, be checked after rejection by the uniqueness constraint ? --FvdP (talk) 16:12, 9 May 2008 (UTC)

New disadvantage: "Performance"[edit]

The following subsection was added to the Disadvantages section: "For some queries, the existence of a surrogate key may result in the need for more |joins in order to retrieve the needed information. E.g., if a table of telephone numbers uses a surrogate key (some counter) instead of the natural key (the telephone number), referencing tables need to be joined with the telephone number table, in order to retrieve actual telephone numbers." I moved it here because it needs some more thought. This disadvantage is not a surrogate/natural key choice; instead, it is an over-normalization. See Database normalization. Timhowardriley (talk) 22:05, 26 January 2009 (UTC)

The telephone number table doesn't have to have only one column (I assume that this is what you were thinking when you wrote "over-normalization"?) The example could certainly be better. But when using surrogate keys, child tables don't include data from referred parent tables, meaning that a join is needed when you need data from the natural key based on rows on the child table; because of this, there are cases where you save a join when using natural keys. TroelsArvin (talk) 22:49, 26 January 2009 (UTC)
Regarding "The example could certainly be better.": I agree. However, I disagree that performance in this context is significantly effected; it's only marginally effected. Performance would only be marginally effected by joins of natural keys because of the increased size of the natural key over that of a surrogate key. Timhowardriley (talk) 17:20, 27 January 2009 (UTC)
So let's agree on the fact that extra joins may result from using surrogate keys, right? That can - all else being equal - only be a drawback of surrogate keys. Is it significant? - In the hot new column store databases, it may not be so significant. In traditional row-store databases, an extra join can certainly be significant. I assert that it's rather uncontroversial to try to eliminate extra joins.TroelsArvin (talk) 21:47, 27 January 2009 (UTC)

Disadvantage: Query Optimization[edit]

The Query Optimization disadvantage was changed to read:

  • "Relational databases assume a unique index is applied to a table's primary key. The unique index serves two purposes: 1) to enforce entity integrity—primary key data must be unique across rows—and 2) to quickly search for rows queried. Since surrogate keys replace a table's identifying attributes—the natural key—and since the identifying attributes are likely to be those queried, then the query optimizer is forced to perform a full table scan when fulfilling likely queries. The remedy to the full table scan is to apply an additional index on the columns containing the surrogate key. However, the additional index will take up disk space, slow down data-modifying operations, and possibly increase lock contention and decrease cache efficiency."

However, it originally read:

  • "Relational databases assume a unique index is applied to a table's primary key. The unique index serves two purposes: 1) to enforce entity integrity—primary key data must be unique across rows—and 2) to quickly search for rows queried. Since surrogate keys replace a table's identifying attributes—the natural key—and since the identifying attributes are likely to be those queried, then the query optimizer is forced to perform a full table scan when fulfilling likely queries. The remedy to the full table scan is to apply a (non-unique) index on each of the identifying attributes. However, these additional indexes will take up disk space, slow down inserts, and slow down deletes."

I don't see the added clarity. Actually, "slow down insert, and slow down deletes" is more clear than "slow down data-modifying operations." Also, why would you put an additional index on the surrogate key? It already has a unique index on it. Timhowardriley (talk) 22:14, 26 January 2009 (UTC)

The surrogate key is generally of no end-user interest. End-use will normally operate on the natural key data, meaning that the natual key will also need to be indexed, at least (and often also be governed by a unique constraint, supplementing the primary key, for data quality purposes). Thus, you end up with more indexes. And while indexes are certainly a nice invention, it's well known that they slow down inserts, updates and deletes. About "data-modifying" vs. inserts/deletes: Indexes slow down any modification, not only insert and delete (but someone can probably come up with a better general term; mutating operation?) TroelsArvin (talk) 22:31, 26 January 2009 (UTC)
Regarding "but someone can probably come up with a better general term; mutating operation?": this would be original research. See wp:or. Regarding everything else: these statements don't address my objection. My objection is why should the paragraph be changed to something illogical. The paragraph is being changed to say, "The remedy to the full table scan is to apply an additional index on the columns containing the surrogate key." Why would you put an additional index on the surrogate key? It already has a unique index on it. Timhowardriley (talk) 17:12, 27 January 2009 (UTC)
About the choice of words: I'm honestly trying to keep a conversation here, in order to improve the article. There is a difference between "insert or delete" and "insert or update or delete", right. So I'm asking for suggestions to better express this. TroelsArvin
"Database Design Fundamentals", by Naphtali Rishe, says that one restriction for the primary key is that it must be "time-invariant". I've also heard this restriction called the "immutability constraint." What this means is the primary key cannot change. However, it might be the case that relation DBMSs now allow for changing the primary key. So, depending upon the scope you're talking about, either "inserts or deletes" will be slower or "inserts, updates, and deletes" will be slower. However, rather than going into this nuance in the article, I think it's safer to regard the historical perception that primary keys are immutable. Timhowardriley (talk) 21:10, 28 January 2009 (UTC)
I do not argue that an extra index be defined for the surrogate key (I didn't write that; someone else did). But there will have to be a uniqueness constraint on the natural candidate key to protect against redundant data (if there is only a primary key on the surrogate key, the table is open for _any_ insert); at least, this is the standard recommendation from surrogate key advocates (unless they subscribe to the view that integrity should solely be enforced by the applications - a rather radical view, going against many a good advice and many a good textbook on relational databases). This means that you end up with an index on the surrogate key, AND an index on the column(s) which would have made up the natural key.TroelsArvin (talk) 21:47, 27 January 2009 (UTC)
Oh, I see. You're saying that there's two solutions to the full table scan problem: 1) an additional index on each of the identifying attributes or 2) an additional unique index on the natural key. Timhowardriley (talk) 21:10, 28 January 2009 (UTC)
I very rarely see a table with a single index, unless it's a child table way down the ER graph. When we're doing any software development we keep an eye on the queries, feed them through Explain Plan or similar, add in indexes as necessary to remove bottlenecks, then run the whole thing again to make sure the new indexes aren't slowing other operations down. By which I mean to say; it is true that using surrogate keys will usually force you to have at least two indexes per table - one for the surrogate key, and one for the natural key - but most tables will have more than one index anyway, so that they can be used efficiently in many queries. —Preceding unsigned comment added by 115.186.240.40 (talk) 03:20, 10 February 2010 (UTC)

Altered a few of the sections[edit]

I ought to have read the discussion page before making edits, but I didn't. I also out to have logged on first - the edits are under ip address 124.178.225.104 . I have made a few changes where I thought the wording was a trifle clumsy, although I have tries to keep what I think the intended meaning was ("dissassociated" isn't a very good word, but I now what you mean).

I've also added a bit here and there.

The stuff on the caveats relating to randomly generated keys needs to be moved out, as it is duplicated. —Preceding unsigned comment added by Paul Murray (talkcontribs) 11:13, 26 March 2009 (UTC)

Your additions are surely productive. The duplication should remain to keep each paragraph independent. Timhowardriley (talk) 05:27, 27 March 2009 (UTC)


Inadvertent Assumptions demistified[edit]

The inadvertant assumption - that surrogate keys have meaning - can be solved by repeating "surrogate keys have _no_ meaning" in meetings. Also by peer reviewing for code that uses surrogate key values in anything other than joins.

Removed Disadvantages/Normalization paragraph[edit]

I removed this text. "Although applications accessing the database should be free of bugs and never attempt to create rows with duplicate natural keys, nevertheless properly maintained unique keys are a vital "last line of defense" against this type of data corruption. This is particularly the case where data is sometimes mutated by user-entered SQL. Of course, due to the very fact that a well-tested and bedded down application will not attempt to introduce duplicate keys, the absence of database constraints is often not immediately apparent as a problem. Modern databases, however, can apply database triggers to enforce unique constraints." It's meaning should be more clear. I'm not sure if it refutes or supports the previous sentence that says, "Without a unique index on the natural key, duplicate rows can appear and once present can be difficult to identify." Timhowardriley (talk) 19:16, 13 May 2011 (UTC)

Requirements changes[edit]

The example given in the Requirements changes section is not an argument for Oracle sequences or similar approaches, because those sequences would have been independent in the two separate companies as well (unless they used a shared database such as a clouid-hosted seevice). However, it seems like a good example for why to use UUIDs. Any objections to noting this?--greenrd (talk) 09:36, 2 February 2012 (UTC)

References[edit]

Is Definition (1) from P. Hall, J. Owlett, S. Todd, ``Relations and Entities, in ``Modeling in Data Base Management Systems, G.M. Nijssen (ed.), pp 201-220, North-Holland, 1976? — Preceding unsigned comment added by 91.37.175.230 (talk) 08:04, 25 October 2012 (UTC)

Caveat[edit]

"For example, table Staff may contain two rows for "John Smith", one row when he was employed between 1990 and 1999, another row when he was employed between 2001 and 2006.[] The surrogate key is identical (non-unique) in both rows however the primary key will be unique."

Above explanation from the article is the exact opposite of how Kimballl & Ross explain surrogate keys:

"We need two different product surrogate keys for the same SKU or physical barcode. Each of the separate surrogate keys identifies a unique product attribute profile that was true for a span of time."(Kimball & Ross 2002, The Data Warehouse Toolkit, p. 97)

Also how can a single surrogate key exist twice within one table if implemented via IDENTITY, AUTOINCREMENT, etc. as described?