Comparing SELECT Performance of Joining Integers vs GUIDs in SQL Server

Overview

There are plenty of online conversations and articles asking which makes a better primary key: INT or GUID? There are pros and cons to both, but from my perspective the winner is INT by a small margin. As shown in this article, as long as the indexes are maintained to keep fragmentation down the performance is similar, with INT slightly ahead in both index size and the speed of CRUD operations. The key is storing the GUID as a UNIQUEIDENTIFIER datatype and never using VARCHAR. Not only is SQL Server less efficient at comparing strings, there's also a chance to run into collation issues comparing across databases.

In these discussions, one thing I haven't seen is a comparison of SELECT performance when joining different datatypes. After all, it's rare to find queries that don't stretch across multiple tables in a relational database. I used the StackOverflow2013 database (83 GB size) for testing. Download a copy for yourself from Brent Ozar's fantastic website if you'd like to duplicate these tests. The Posts table contains 17,142,169 rows and Votes contains 52,928,720 rows, so running the commands below can take several minutes.

Some additional reading:

Kimbery Tripp on GUID primary keys

Using NEWSEQUENTIALID() can help with GUID fragmentation, but it only increments until SQL Server is restarted/rebooted. At that point it will start with a new GUID, and that could be at a lower point than previously used.

Scenario 1

First, let's join the existing clustered primary key (INT) in Posts with a new nonclustered index (INT) on Votes.PostId.

Setup

CREATE NONCLUSTERED INDEX [IX_Votes_PostId]

ON [dbo].[Votes]
([PostId] ASC)

WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF

, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF
, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON
, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF)

ON [PRIMARY]

GO

Query

select top 1000000 Posts.Id

, Votes.Id

from Votes
inner join Posts

on Votes.PostId = Posts.Id;

Scenario 2

Next, let's create a table copy of Posts with a UNIQUEIDENTIFIER primary key, and a new nonclustered index in Votes to match.

Setup

First, let's add a UNIQUEIDENTIFIER column to Posts.

alter table Posts
add GuidId uniqueidentifier null;

update Posts set GuidId = newid();

Then we'll create a new table from Posts, but using the new GuidId as the primary key, and then populate it from Posts.

CREATE TABLE [dbo].[PostsGuid](
[GuidId] [uniqueidentifier] NOT NULL,
[AcceptedAnswerId] [int] NULL,
[AnswerCount] [int] NULL,
[Body] [nvarchar](max) NOT NULL,
[ClosedDate] [datetime] NULL,
[CommentCount] [int] NULL,
[CommunityOwnedDate] [datetime] NULL,
[CreationDate] [datetime] NOT NULL,
[FavoriteCount] [int] NULL,
[LastActivityDate] [datetime] NOT NULL,
[LastEditDate] [datetime] NULL,
[LastEditorDisplayName] [nvarchar](40) NULL,
[LastEditorUserId] [int] NULL,
[OwnerUserId] [int] NULL,
[ParentId] [int] NULL,
[PostTypeId] [int] NOT NULL,
[Score] [int] NOT NULL,
[Tags] [nvarchar](150) NULL,
[Title] [nvarchar](250) NULL,
[ViewCount] [int] NOT NULL,
CONSTRAINT [PK_PostsGuid_Id] PRIMARY KEY CLUSTERED
([GuidId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF
, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON
, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF)
ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

INSERT INTO [dbo].[PostsGuid]
([GuidId]
,[AcceptedAnswerId]
,[AnswerCount]
,[Body]
,[ClosedDate]
,[CommentCount]
,[CommunityOwnedDate]
,[CreationDate]
,[FavoriteCount]
,[LastActivityDate]
,[LastEditDate]
,[LastEditorDisplayName]
,[LastEditorUserId]
,[OwnerUserId]
,[ParentId]
,[PostTypeId]
,[Score]
,[Tags]
,[Title]
,[ViewCount])
SELECT [GuidId]
,[AcceptedAnswerId]
,[AnswerCount]
,[Body]
,[ClosedDate]
,[CommentCount]
,[CommunityOwnedDate]
,[CreationDate]
,[FavoriteCount]
,[LastActivityDate]
,[LastEditDate]
,[LastEditorDisplayName]
,[LastEditorUserId]
,[OwnerUserId]
,[ParentId]
,[PostTypeId]
,[Score]
,[Tags]
,[Title]
,[ViewCount]
FROM Posts

GO

Finally, we'll create a new column and non-clustered index on Votes for the new table.

alter table Votes
add PostGuidId uniqueidentifier null;

update v
set v.PostGuidId = p.GuidId
from Votes v
inner join Posts p
on v.PostId = p.Id

CREATE NONCLUSTERED INDEX [IX_Votes_PostGuidId] ON [dbo].[Votes]
([PostGuidId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF
, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF
, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON
, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

Query

select top 1000000 PostsGuid.GuidId
, Votes.PostGuidId
from Votes
inner join PostsGuid
on Votes.PostGuidId = PostsGuid.GuidId;

Scenario 3

Finally, for giggles let's do a GUID primary key for Posts, but stored in CHAR(64) to see how that compares against UNIQUEIDENTIFIER.

Setup

First, let's add a CHAR(64) column to Posts and populate it from GuidId.

alter table Posts
add GuidStringId char(64) null;

update Posts set GuidStringId = GuidId;

Then we'll create a new table from Posts, but using the new GuidStringId as the primary key, and then populate it from Posts.

CREATE TABLE [dbo].[PostsGuidString](
[GuidStringId] [char](64) NOT NULL,
[AcceptedAnswerId] [int] NULL,
[AnswerCount] [int] NULL,
[Body] [nvarchar](max) NOT NULL,
[ClosedDate] [datetime] NULL,
[CommentCount] [int] NULL,
[CommunityOwnedDate] [datetime] NULL,
[CreationDate] [datetime] NOT NULL,
[FavoriteCount] [int] NULL,
[LastActivityDate] [datetime] NOT NULL,
[LastEditDate] [datetime] NULL,
[LastEditorDisplayName] [nvarchar](40) NULL,
[LastEditorUserId] [int] NULL,
[OwnerUserId] [int] NULL,
[ParentId] [int] NULL,
[PostTypeId] [int] NOT NULL,
[Score] [int] NOT NULL,
[Tags] [nvarchar](150) NULL,
[Title] [nvarchar](250) NULL,
[ViewCount] [int] NOT NULL,
CONSTRAINT [PK_PostsGuidString_Id] PRIMARY KEY CLUSTERED
([GuidStringId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF
, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON
, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF)
ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

INSERT INTO [dbo].[PostsGuidString]
([GuidStringId]
,[AcceptedAnswerId]
,[AnswerCount]
,[Body]
,[ClosedDate]
,[CommentCount]
,[CommunityOwnedDate]
,[CreationDate]
,[FavoriteCount]
,[LastActivityDate]
,[LastEditDate]
,[LastEditorDisplayName]
,[LastEditorUserId]
,[OwnerUserId]
,[ParentId]
,[PostTypeId]
,[Score]
,[Tags]
,[Title]
,[ViewCount])
SELECT [GuidStringId]
,[AcceptedAnswerId]
,[AnswerCount]
,[Body]
,[ClosedDate]
,[CommentCount]
,[CommunityOwnedDate]
,[CreationDate]
,[FavoriteCount]
,[LastActivityDate]
,[LastEditDate]
,[LastEditorDisplayName]
,[LastEditorUserId]
,[OwnerUserId]
,[ParentId]
,[PostTypeId]
,[Score]
,[Tags]
,[Title]
,[ViewCount]
FROM Posts
GO

Finally, we'll create a new column and non-clustered index on Votes for the new table.

alter table Votes
add PostGuidStringId char(64) null;

update v
set v.PostGuidStringId = p.GuidStringId
from Votes v
inner join Posts p
on v.PostId = p.Id

CREATE NONCLUSTERED INDEX [IX_Votes_PostGuidStringId] ON [dbo].[Votes]
([PostGuidStringId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF
, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF
, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON
, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

Query

select top 1000000 PostsGuidString.GuidStringId
, Votes.PostGuidStringId
from Votes
inner join PostsGuidString
on Votes.PostGuidStringId = PostsGuidString.GuidStringId;

Results

Scenario 1

PK Size: 33.3 GB
IX Size: 0.7 GB
CPU Time: 0.6 seconds
Elapsed Time: 5.5 seconds
Network I/O: 4.9 seconds
Subtree Cost: 75

Scenario 2

PK Size: 34.6 GB
IX Size: 1.7 GB
CPU Time: 2.0 seconds
Elapsed Time: 7.5 seconds
Network I/O: 5.9 seconds
Subtree Cost: 115

Scenario 3

PK Size: 35.5 GB
IX Size: 4.8 GB
CPU Time: 3 seconds
Elapsed Time: 9.6 seconds
Network I/O: 6.7 seconds
Subtree Cost: 98.4

The JOIN results show similar performance characteristics as the tables alone.

INT has the lowest storage requirements and best performance
UNIQUEIDENTIFIER has slightly higher storage with lower performance
CHAR(64) has slightly higher storage with the lowest performance

Recommendation?

So, this is just my opinion, given my experiences, reading the work of others, and the results of this testing, but I think most tables would benefit from 2, possibly 3 ID columns.

Primary Key

INT
Also known as a surrogate key. This is a field that uniquely identifies the records within the system, but does not depend on the data it represents. It is used for foreign key lookups and joins.

Natural Key

UNIQUEIDENTIFIER or VARCHAR/NVARCHAR
If there is a unique natural key that makes sense for the data and won't change over time, by all means use it. If not, a GUID is perfect for uniquely identifying across databases and systems.

Public Key (optional)

If a system will expose data publicly it is more secure to not use any real sequential key to prevent attackers from guessing other keys.
If a random GUID was used for the natural key, that could also be used for a public key. If not, a new GUID column could be created, or any other unique identifier that makes sense for the data lookups.

Your mileage may vary.

Search This Blog

Desiderata Errata