Comparing SELECT Performance of Joining Integers vs GUIDs in SQL Server
Overview
There are plenty of online conversations and articles asking which makes a better primary key: INT or GUID? There are pros and cons to both, but from my perspective the winner is INT by a small margin. As shown in this article, as long as the indexes are maintained to keep fragmentation down the performance is similar, with INT slightly ahead in both index size and the speed of CRUD operations. The key is storing the GUID as a UNIQUEIDENTIFIER datatype and never using VARCHAR. Not only is SQL Server less efficient at comparing strings, there's also a chance to run into collation issues comparing across databases.
In these discussions, one thing I haven't seen is a comparison of SELECT performance when joining different datatypes. After all, it's rare to find queries that don't stretch across multiple tables in a relational database. I used the StackOverflow2013 database (83 GB size) for testing. Download a copy for yourself from Brent Ozar's fantastic website if you'd like to duplicate these tests. The Posts table contains 17,142,169 rows and Votes contains 52,928,720 rows, so running the commands below can take several minutes.
Some additional reading:
Kimbery Tripp on GUID primary keys
Using NEWSEQUENTIALID() can help with GUID fragmentation, but it only increments until SQL Server is restarted/rebooted. At that point it will start with a new GUID, and that could be at a lower point than previously used.
Scenario 1
First, let's join the existing clustered primary key (INT) in Posts with a new nonclustered index (INT) on Votes.PostId.
Setup
CREATE NONCLUSTERED INDEX [IX_Votes_PostId]
ON [dbo].[Votes]([PostId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF
, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF)
ON [PRIMARY]
GO
Query
select top 1000000 Posts.Id
, Votes.Id
from Votesinner join Posts
on Votes.PostId = Posts.Id;
Scenario 2
Next, let's create a table copy of Posts with a UNIQUEIDENTIFIER primary key, and a new nonclustered index in Votes to match.
Setup
First, let's add a UNIQUEIDENTIFIER column to Posts.
alter table Postsadd GuidId uniqueidentifier null;
update Posts set GuidId = newid();
Then we'll create a new table from Posts, but using the new GuidId as the primary key, and then populate it from Posts.
CREATE TABLE [dbo].[PostsGuid]([GuidId] [uniqueidentifier] NOT NULL,[AcceptedAnswerId] [int] NULL,[AnswerCount] [int] NULL,[Body] [nvarchar](max) NOT NULL,[ClosedDate] [datetime] NULL,[CommentCount] [int] NULL,[CommunityOwnedDate] [datetime] NULL,[CreationDate] [datetime] NOT NULL,[FavoriteCount] [int] NULL,[LastActivityDate] [datetime] NOT NULL,[LastEditDate] [datetime] NULL,[LastEditorDisplayName] [nvarchar](40) NULL,[LastEditorUserId] [int] NULL,[OwnerUserId] [int] NULL,[ParentId] [int] NULL,[PostTypeId] [int] NOT NULL,[Score] [int] NOT NULL,[Tags] [nvarchar](150) NULL,[Title] [nvarchar](250) NULL,[ViewCount] [int] NOT NULL,CONSTRAINT [PK_PostsGuid_Id] PRIMARY KEY CLUSTERED([GuidId] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF)ON [PRIMARY]) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
INSERT INTO [dbo].[PostsGuid]([GuidId],[AcceptedAnswerId],[AnswerCount],[Body],[ClosedDate],[CommentCount],[CommunityOwnedDate],[CreationDate],[FavoriteCount],[LastActivityDate],[LastEditDate],[LastEditorDisplayName],[LastEditorUserId],[OwnerUserId],[ParentId],[PostTypeId],[Score],[Tags],[Title],[ViewCount])SELECT [GuidId],[AcceptedAnswerId],[AnswerCount],[Body],[ClosedDate],[CommentCount],[CommunityOwnedDate],[CreationDate],[FavoriteCount],[LastActivityDate],[LastEditDate],[LastEditorDisplayName],[LastEditorUserId],[OwnerUserId],[ParentId],[PostTypeId],[Score],[Tags],[Title],[ViewCount]FROM Posts
GO
Finally, we'll create a new column and non-clustered index on Votes for the new table.
alter table Votesadd PostGuidId uniqueidentifier null;
update vset v.PostGuidId = p.GuidIdfrom Votes vinner join Posts pon v.PostId = p.Id
CREATE NONCLUSTERED INDEX [IX_Votes_PostGuidId] ON [dbo].[Votes]([PostGuidId] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]GO
Query
select top 1000000 PostsGuid.GuidId, Votes.PostGuidIdfrom Votesinner join PostsGuidon Votes.PostGuidId = PostsGuid.GuidId;
Scenario 3
Finally, for giggles let's do a GUID primary key for Posts, but stored in CHAR(64) to see how that compares against UNIQUEIDENTIFIER.
Setup
First, let's add a CHAR(64) column to Posts and populate it from GuidId.
alter table Postsadd GuidStringId char(64) null;update Posts set GuidStringId = GuidId;
Then we'll create a new table from Posts, but using the new GuidStringId as the primary key, and then populate it from Posts.
CREATE TABLE [dbo].[PostsGuidString]([GuidStringId] [char](64) NOT NULL,[AcceptedAnswerId] [int] NULL,[AnswerCount] [int] NULL,[Body] [nvarchar](max) NOT NULL,[ClosedDate] [datetime] NULL,[CommentCount] [int] NULL,[CommunityOwnedDate] [datetime] NULL,[CreationDate] [datetime] NOT NULL,[FavoriteCount] [int] NULL,[LastActivityDate] [datetime] NOT NULL,[LastEditDate] [datetime] NULL,[LastEditorDisplayName] [nvarchar](40) NULL,[LastEditorUserId] [int] NULL,[OwnerUserId] [int] NULL,[ParentId] [int] NULL,[PostTypeId] [int] NOT NULL,[Score] [int] NOT NULL,[Tags] [nvarchar](150) NULL,[Title] [nvarchar](250) NULL,[ViewCount] [int] NOT NULL,CONSTRAINT [PK_PostsGuidString_Id] PRIMARY KEY CLUSTERED([GuidStringId] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF)ON [PRIMARY]) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]GOINSERT INTO [dbo].[PostsGuidString]([GuidStringId],[AcceptedAnswerId],[AnswerCount],[Body],[ClosedDate],[CommentCount],[CommunityOwnedDate],[CreationDate],[FavoriteCount],[LastActivityDate],[LastEditDate],[LastEditorDisplayName],[LastEditorUserId],[OwnerUserId],[ParentId],[PostTypeId],[Score],[Tags],[Title],[ViewCount])SELECT [GuidStringId],[AcceptedAnswerId],[AnswerCount],[Body],[ClosedDate],[CommentCount],[CommunityOwnedDate],[CreationDate],[FavoriteCount],[LastActivityDate],[LastEditDate],[LastEditorDisplayName],[LastEditorUserId],[OwnerUserId],[ParentId],[PostTypeId],[Score],[Tags],[Title],[ViewCount]FROM PostsGO
Finally, we'll create a new column and non-clustered index on Votes for the new table.
alter table Votesadd PostGuidStringId char(64) null;
update vset v.PostGuidStringId = p.GuidStringIdfrom Votes vinner join Posts pon v.PostId = p.Id
CREATE NONCLUSTERED INDEX [IX_Votes_PostGuidStringId] ON [dbo].[Votes]([PostGuidStringId] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]GO
Query
select top 1000000 PostsGuidString.GuidStringId, Votes.PostGuidStringIdfrom Votesinner join PostsGuidStringon Votes.PostGuidStringId = PostsGuidString.GuidStringId;
Results
Scenario 1
- PK Size: 33.3 GB
- IX Size: 0.7 GB
- CPU Time: 0.6 seconds
- Elapsed Time: 5.5 seconds
- Network I/O: 4.9 seconds
- Subtree Cost: 75
Scenario 2
- PK Size: 34.6 GB
- IX Size: 1.7 GB
- CPU Time: 2.0 seconds
- Elapsed Time: 7.5 seconds
- Network I/O: 5.9 seconds
- Subtree Cost: 115
Scenario 3
- PK Size: 35.5 GB
- IX Size: 4.8 GB
- CPU Time: 3 seconds
- Elapsed Time: 9.6 seconds
- Network I/O: 6.7 seconds
- Subtree Cost: 98.4
The JOIN results show similar performance characteristics as the tables alone.
- INT has the lowest storage requirements and best performance
- UNIQUEIDENTIFIER has slightly higher storage with lower performance
- CHAR(64) has slightly higher storage with the lowest performance
Recommendation?
- Primary Key
- INT
- Also known as a surrogate key. This is a field that uniquely identifies the records within the system, but does not depend on the data it represents. It is used for foreign key lookups and joins.
- Natural Key
- UNIQUEIDENTIFIER or VARCHAR/NVARCHAR
- If there is a unique natural key that makes sense for the data and won't change over time, by all means use it. If not, a GUID is perfect for uniquely identifying across databases and systems.
- Public Key (optional)
- If a system will expose data publicly it is more secure to not use any real sequential key to prevent attackers from guessing other keys.
- If a random GUID was used for the natural key, that could also be used for a public key. If not, a new GUID column could be created, or any other unique identifier that makes sense for the data lookups.
Comments
Post a Comment