Using Unique Identifiers (UIDs) as primary keys, especially as surrogates in data warehousing, comes with many benefits.
It also carries performance costs that are widely assumed to be detrimental – but how much so? Does it matter? I set out to measure the performance impact and catalogue the design drawbacks of UIDs as primary keys.
Benefits vs. Drawbacks
Some Pros –
- The key value can be obtained by the client before the row is inserted, without a round trip to the database. This is a godsend to .Net programmers working with the DataSet object, as relationships can be established before inserts are performed.
- Key generation can be done at the client instead of by the DB. While a spin-off of the bullet above, this warrants a call-out, as the ramifications cannot be overstated.
- UIDs are globally unique values, so keys are unique throughout the database. This assists with scenarios such as data warehousing and multiple-site aggregation, as primary keys need not be remapped or replaced with surrogates.
- Merge Replication. When merge replication is to be used, each replicated table must have a UID column; a UID primary key serves this purpose.
- Distributed Partitioned Views. UID primary keys support distributed partitioned views, whereas integer identity columns do not.
- A UID is a truly meaningless value to the end user. Occasionally an integer key column becomes depended upon and surfaces in the UI (e.g. a CustomerID); the risk of the same problem occurring is reduced with UID keys.
- Importing and exporting data becomes simpler, as many of the needs to remap lookup values and relationships during these processes are eliminated with UIDs.
- As 16-byte values, UIDs have vastly more possible values than integers (4 bytes) or bigints (8 bytes).
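As a minimal sketch of the first two points (table and constraint names are hypothetical), a uniqueidentifier key can either default to NEWID() on the server or be supplied by the client, which already knows the value before the insert:

```sql
-- Hypothetical table: the application can generate the key itself
-- and supply it in the INSERT, or let the server fill it in.
CREATE TABLE dbo.Customer (
    CustomerID   uniqueidentifier NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY
        CONSTRAINT DF_Customer_ID DEFAULT NEWID(),
    CustomerName varchar(100) NOT NULL
);

-- Client-generated key: the value is known before the round trip,
-- so child rows can reference it immediately.
DECLARE @id uniqueidentifier;
SET @id = NEWID();   -- in .Net this would be Guid.NewGuid()
INSERT dbo.Customer (CustomerID, CustomerName) VALUES (@id, 'Acme');
```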
And Some Cons
- UIDs are random values. Because indexes are stored in order, this randomness increases index fragmentation – and, when the table is clustered on the UID, table fragmentation as well.
- UIDs are wide values. At 4 times the width of a standard integer identity column, they incur roughly 4 times the processing overhead in comparisons and joins. If the primary key is also the clustered index, this additional width propagates into the non-clustered indexes.
- Not every database has a native UID datatype, so porting the schema to another provider might prove difficult.
- UIDs are less developer-friendly than integers, as they aren't easily remembered, and ad-hoc queries against the database necessarily become more script oriented.
- UIDs aren't 100% guaranteed unique. Due to privacy concerns, the MAC address is no longer used as a basis for UID generation, so a vanishingly small chance of duplicates exists. Given the enormous number of possible UID values, a collision is an extremely rare occurrence, and the very paranoid can handle it with exception handling.
Setting aside developer convenience and cross-platform portability, the challenges with UIDs boil down to two root issues: randomized values and data width. We realized in our environment that, in order to answer the driving question above, we needed to understand these challenges in more depth. Primarily we wanted to know whether, and how, we could overcome them with our database configuration.
Randomized Values (Fragmentation)
A simple truth is that randomized values cause fragmentation in the indexes that depend on them, and, when clustering by a randomized value, table fragmentation as well. This is one reason clustered identity columns are highly efficient (see Figure 1): their fixed sequential progression reduces table and index fragmentation to nil. Conversely, random values promote fragmentation because they create inserts in between existing values, which leads to page splits (see Figure 2) and, in turn, page and extent fragmentation. As fragmentation increases, efficiency decreases, until ultimately the index is no longer deemed viable by SQL Server.
Regardless, once scan density has become unacceptable, a DBCC INDEXDEFRAG (non-blocking) can improve it, and a DBCC DBREINDEX (blocking) will restore it to nearly 100%. We run a nightly job that detects scan densities that have fallen below a threshold of 30%, automatically performing an INDEXDEFRAG when appropriate. When INDEXDEFRAG no longer buys back enough scan density, we manually run DBCC DBREINDEX against the affected index.
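The maintenance commands themselves look like this (object names hypothetical; the scheduling and threshold logic around them is ours, not shown):

```sql
-- Non-blocking: DBCC INDEXDEFRAG compacts and reorders the index's
-- leaf pages in place; the first argument (0) means the current
-- database.
DBCC INDEXDEFRAG (0, 'dbo.Customer', 'PK_Customer');

-- Blocking: DBCC DBREINDEX rebuilds the index outright, restoring
-- scan density to near 100%.
DBCC DBREINDEX ('dbo.Customer', 'PK_Customer');
```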
We've also found that the FILLFACTOR specified when creating the indexes (and the table, if clustered by UID) becomes very important, as it directly affects the rate at which the index or table fragments. The lower the fill factor, the more empty space in each page and the less often pages need to split, hence less fragmentation (at the cost of a larger index). Conversely, the higher the fill factor, the less empty space, the more page splitting occurs, and the faster fragmentation kills us.
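As a sketch (index and table names hypothetical), the fill factor is set at index creation time; the 50% figure matches what the benchmarks later settled on for UID keys, though the right value depends on insert volume and how often the index is rebuilt:

```sql
-- A lower fill factor leaves free space in each leaf page so that
-- random GUID inserts land in existing pages instead of forcing
-- page splits, deferring fragmentation at the cost of a larger index.
CREATE NONCLUSTERED INDEX IX_Customer_UID
    ON dbo.Customer (CustomerID)
    WITH FILLFACTOR = 50;
```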
GUID Data Width
The width of a UID is certainly greater than that of an Integer, thus there is additional overhead to comparison operations against the UID and index storage can be affected adversely. The additional width affects the majority of operations against the table when using a UID as the primary key. When clustering by the UID, the size of the non-clustered indexes is also affected by the additional width. We wanted to find a way to reduce the index size and join overhead, while still maintaining the UID as the table’s primary key.
We can determine the size of the index using the DBCC SHOW_STATISTICS command. This command provides metrics about the statistics behind each index. Under index details the metric “Average key length” can be found. The metric shows the average size of the value being indexed. For fixed-length datatypes such as UIDs and integers, this column allows us to easily approximate the size of the index using # Rows * Key Length.
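As a sketch (table and index names hypothetical), the command and the back-of-the-envelope arithmetic look like this:

```sql
-- "Average key length" appears in the first result set of
-- DBCC SHOW_STATISTICS.
DBCC SHOW_STATISTICS ('dbo.Customer', 'PK_Customer');

-- Approximate leaf-level key storage: # Rows * Key Length.
-- For 3,000,000 rows, a 16-byte UID key needs roughly 48 MB of
-- key storage versus roughly 12 MB for a 4-byte integer key.
SELECT 3000000 * 16 AS uid_key_bytes,
       3000000 * 4  AS int_key_bytes;
```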
In our environment, we decided to strike a balance between space consumption and query efficiency. We added an integer-based identity column and clustered the table by it, then created the primary key constraint against the UID column, backed by a non-clustered index.
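Sketched concretely (table, column, and index names hypothetical), the arrangement is: cluster on a narrow, ever-increasing identity, and keep the UID as the primary key via a non-clustered index:

```sql
CREATE TABLE dbo.Orders (
    OrderSeq  int IDENTITY(1,1) NOT NULL,   -- narrow, sequential
    OrderUID  uniqueidentifier NOT NULL
        CONSTRAINT DF_Orders_UID DEFAULT NEWID(),
    OrderDate datetime NOT NULL
);

-- Cluster on the identity: inserts append at the end of the table,
-- so the random UID no longer drives table fragmentation.
CREATE UNIQUE CLUSTERED INDEX CX_Orders_Seq ON dbo.Orders (OrderSeq);

-- The UID remains the primary key, enforced non-clustered.
ALTER TABLE dbo.Orders
    ADD CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderUID);
```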
What this saved us was excessive non-clustered index size and the table fragmentation caused by clustering on a randomized value. We still incur the UID's additional comparison overhead, plus bookmark lookups from the non-clustered UID index to the clustered integer column – both of which we've decided to live with, as measurements show our SQL Server is not currently CPU constrained.
A meaningful metric for fragmentation is Scan Density, available via the DBCC SHOWCONTIG command. Scan Density represents the physical contiguity of the table: 100% is perfect (the table or index is perfectly linear and contiguous) and 0% is rotten. SQL Server will discard the index on cost grounds once scan density nears the 20%-or-less range. Discarding the index results in table scans, and table scans wreak havoc on any system, including those using UIDs as primary keys. This particular problem can be gremlin-like, as query plans won't necessarily reflect that the index was discarded under this scenario.
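A sketch of reading the metric (table name hypothetical):

```sql
-- The TABLERESULTS form returns Scan Density as a column
-- (ScanDensity) rather than as printed text, so a monitoring job
-- can insert the output into a work table and compare it against
-- a threshold; ALL_INDEXES covers the non-clustered indexes too.
DBCC SHOWCONTIG ('dbo.Orders') WITH TABLERESULTS, ALL_INDEXES;
```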
Primary Key Benchmarks
The scripts I used to create the benchmarking table can be downloaded here.
I created a simple UID work table of 3 million rows, testing several different fill factors to find an approximate ideal – settling on a FILLFACTOR of 50% for UID-based primary keys.
I ran simple selects through direct statements, prepared statements, and stored procedures, timed in milliseconds per lookup. Rows were looked up at random, and timings were captured both with flushed buffers and with data cached.
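As a sketch of one timed lookup round (table and column names hypothetical, not taken from the downloadable scripts), a random existing key is chosen first so that the selection scan doesn't pollute the timing, then the point lookup itself is measured:

```sql
-- Pick a random existing key up front.
DECLARE @target uniqueidentifier;
SELECT TOP 1 @target = OrderUID FROM dbo.Orders ORDER BY NEWID();

DBCC DROPCLEANBUFFERS;   -- only for the 'Clean Buffers = True' runs
SET STATISTICS TIME ON;  -- report elapsed milliseconds

SELECT * FROM dbo.Orders WHERE OrderUID = @target;

SET STATISTICS TIME OFF;
```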
Out of curiosity I also benchmarked varchars, as I've come across their use as primary keys more than once.
[Benchmark result tables not recoverable from this copy. Each table reported lookup timings under Clean Buffers = True and Clean Buffers = False; two of the tables covered varchar primary keys, with values appended and prepended respectively.]
The benchmark produced interesting results. UIDs performed faster than integers when data was cached. Although the test was run repeatedly, there were significant deviations, so I hesitate to speculate why.
It's clear the performance impact is negligible, and of no real concern.
In the end, fragmentation is the only objection remaining, and with tools like SQL defrag manager, or with scenarios inherently free from fragmentation (such as daily incremental loading of a warehouse), this risk is easily mitigated.
As a result, I recommend UID primary keys as a substitute for identities when other rationale exists (such as multiple source systems where key collision is likely), and for their routine use as surrogate keys in a data warehouse.