The Great Guid vs Int PK Debate October, 2010
This has come up so often that I figured its time to do a blog post and just point people to it.
First, like Voltaire said, lets define our terms. We're talking about a primary key which according to Wikipedia is a value that "can uniquely identify each row in a table". When I'm talking about ints I'm implying that it is an integer key that is incremented by the database (IE "identity" in SQL Server). When I'm talking about guids I'm talking about sequential guids.
Reasons for using guids:
- They are managed by the application not the database. I think this is my biggest reason for preferring them. I can have the application (domain or ORM) manage them which allows the application to be in control. Having the database manage key creation (Whether int or guid) buys me absolutely nothing and constrains the application for no good reason. I really think the application is where key management belongs, not the backing store. In relation to NHibernate, Tuna has an interesting post here and Fabio here on this subject. I summarize:
- Unit of work. NHibernate has the concept of unit of work. Objects are tracked and when you are ready to commit the changes it will figure out what needs to be done at that time and then do it. Using an int identity hijacks this ability when doing inserts because NH must immediately perform an insert (On Save()) in order to get the PK of the entity. If your doing a transaction and you roll it back then prior inserts have to be rolled back in the database, instead of simply having been in memory.
- Batching on insert. For example creating the trite order and line items, committing and then having the line items be inserted in one shot instead of individually. If you're using an int identity you will not be able to take advantage of this type of batching because NHibernate needs to return the PK of each line item, which is generated by the database. Note that this only applies to inserts; as other operations do not generate a PK that needs to be returned (Like when doing an update).
- Easy to merge data. This may or may not be a good reason depending on your application. Recently this has not been an issue for me but earlier in my career I worked for a payroll company that moved data around the system like crazy (It was a SAAS app that that had one DB per client). Some of the tables had int PKs and others had guid PK's. For obvious reasons it was much easier to move data from one db to another when you had guid PK's. Int PK's required regeneration so were a nightmare to deal with. And again, int PK's didn't buy us anything at all.
- They are not as guessable as integers. Notice I didn't say that weren't guessable period. Some things to think about:
- Guids are not necessarily guaranteed to be truly random, as the UUID spec notes, they are guaranteed to be unique not random (Some interesting discussion here on this subject). So the randomness, if any, depends on the algorithm used to generate them. If your developing on .NET 4.0 (Haven't checked older versions) your using Guid.NewGuid() which uses CoCreateGuid (You can see this in Reflector). This in turn calls the function UuidCreate (So its creating a UUID). BTW, this means that the algorithm is OS dependent if your using .NET 4.0 (and probably older versions I'd imagine) as the guid generation does not happen in managed code. CoCreateGuid on both Windows Server 2003 SP2 x86 and Windows Server 2008 R2 yields a v4 UUID (See here for the layout). The spec states that this version "is meant for generating UUIDs from truly-random or pseudo-random number". What does this mean exactly? All depends on the algorithm Microsoft is using and AFIAK there's no official information on it. I've seen a lot of talk about it on the web but it seems like its all anecdotal, nothing official. In any event it's probably safe to say that its pseudo random. So the bottom line is its not as guessable as say, incrementing an int, but if someone really wanted to guess prior and subsequent values they could do it. Maybe its like having a 10 foot high wooden fence around your yard as apposed to a 10 foot high, razor wire fence and guys with M4's? The wooden fence will stop the casual intruder but someone could get past it if they really really wanted to. Now what about the guid comb (Because this is what your going to want to use as its partially sequential and wont cause index fragmentation)? The NHibernate implementation (NHibernate.Id.GuidCombGenerator) actually uses Guid.NewId() to get the base guid but then modifies part of it so that its sequential. According to the UUID spec the first 10 bytes are the temporally unique portion, and the last 6 bytes contain the spatially unique portion aka "node". For version 4 of the UUID, the random one, everything is random, including the temporally unique portion except for 6 bits (Which specify the version and a reserved portion). The NHibernate guid comb implementation replaces the last 6 bytes with a temporally unique value (Making it sequential), while leaving the rest (74 bits) alone. So you go from having 122 bits of psuedo random data to having 74 bits of pseudo random data. Now your 10 foot tall wooden fence gets cut down to 6 feet. So that's why I say its not as guessable as an int identity.
- If your app is secure this really shouldn't matter. If being able to predict PK's allows an attacker to compromise your app then you have a problem and relying on security by obscurity is bad. And as we just discussed, guids are not truly random, or at least you shouldn't count on them being so. On the other hand you could say that its not possible to make sure there are absolutely no security holes in your application so using guid PK's raises the bar by making less information available to an attacker or at least making it harder to obtain. So I guess it all comes down to personal philosophy; I can see compelling arguments for both sides. I've sided on trying to make the app as secure as possible while giving attackers as little information as possible and making what information is there harder to obtain.
Bunk reasons for using guids:
- Ints are limited and we want a PK that will not run out of id's. Yeah, ints are limited.....to 2.1 billion!!! I don't buy this argument in the least. How many people doing line of business apps are running into that limit? Not saying it cant happen but we have a 400+ GB database and were not even scratching the surface of the the int upper limit. And you can always go to a big int if you want...
- Required for replication. Ok this may not be a bunk reason, it may actually be a good reason for your application. But I see this reason thrown around a lot and I get the feeling that many people will not do replication down the road and just use this as yet another reason (Totally an anecdotal statement, no hard data on that one).
- We want to be able to change our DBMS if we need to so we want to be completely decoupled from it. I like the decoupling part, but the "change our DBMS part"? Does anyone actually swap out their DBMS on a regular basis, if ever? This is a bunk argument IMO because this does not happen in the real world. Plus I believe that most modern DBMS's support some sort of int identity so in the off chance this happens this will be the least of your worries.
Reasons for not using guids:
- They are slower and 4 times bigger: There are a number of posts out there demonstrating that sequential guids are marginally slower than ints. Also they are 128 bits as opposed to 32 bits. In particular it will increase the size of your indexes (I don't include the size of the tables because 16 bytes for a column is relatively small which probably pales in comparison to the size of other columns in the table). Does that marginal difference in performance and increased index size mean you should definitely use ints? That really depends on your application and your development philosophy. Obviously if there is some concrete reason to go either way then the decision is already made (For example if you will do replication or if you've tested guid PK's and the performance hit is really that critical). If that isn't the case, which is probably where most LOB app developers are, then it may be less clear. Since the performance hit is marginal it may not be critical or even noticeable for your application. Be careful not to "strain the gnat" (Perceived guid vs int PK performance penalty) while "gulping down the camel" (All those real performance issues in your application). If the tables are big then the index size pales in comparison, so is that really an issue? I think it really comes down to philosophy. If your using a data centric approach then guids will seem pointless. So you might as well go the int route as guids aren't going to buy you anything and you will have the slight performance boost and smaller indexes. If your using an ORM and/or a DDD approach then the guid will be an attractive option as it puts the control where you want it; the application. In that case its definitely worth the slight performance hit or increased index size because there is a payoff.
- Guid's are not human readable. This is a personal preference thing and again depends on development philosophy as mentioned above. As someone who prefers a DDD approach I could care less if its human readable because I'd rather be spending my time in the domain, not mucking around in the db where I would be oft encountering guids. If your taking a data centric approach then you will be looking at them all the time and will probably want the simplicity and elegance of an integer.
Bunk reason for not using guids:
- They cause page splits in a clustered index and cause index fragmentation in non clustered indexes. Guids yes, sequential guids no. Guid's are all psuedo random whereas guid combs have a sequential portion which avoids this. Sequential guids are on par with int identities performance-wise, check out Jimmy Nillson's guid comb article for some stats and further explanation. NHibernate also ships with a guid comb identity strategy so you can take advantage of it there.
As mentioned earlier, I see the lines being drawn according to development philosophy. If you're using an ORM (And possibly doing some flavour of DDD) then there will be advantages to using a sequential guid, both concrete and philosophical, and no advantages to using an int. If your approach is data centric then guids will seem distasteful and int will seem like a trusted old friend.