Getting Distinct Results on a Finite Set of Columns or Projections January, 2011
I have a situation where I need to return a result set that is distinct on only a few columns or projections. The DISTINCT keyword does not fit the bill as it applies to the entire projection. Before we explore this lets start with some sample data in a table we'll call Contacts.
id email name
-- ------------------- -------
1 foxydog@yahoo.com Bob
2 foxydog@yahoo.com Robert
3 foxydog@yahoo.com Dick
4 william@hotmail.com Bill
In this example we want to get distinct records by email address, discarding duplicate records (As in this context they are not important). There are a couple of ways to approach this. You could do a GROUP BY on email and an aggregate the other fields (Like MAX) but this could mess things up if the other columns are related somehow (Like an address for example, you could get mismatched street, city, state and zip). You could also do a WHERE id IN on a sub query that does group by on email and a max on id which would solve the problem of mismatched columns:
SELECT * FROM Contacts
WHERE id IN (SELECT MAX(id) FROM Contacts GROUP BY email)
I don't like this however because it requires two table scans (BTW I have indexes on id and email):
Another option is to use the ROW_NUMBER aggregate and PARTITION BY. You can partition by the columns or projections you choose (And I say "projections" because it doesn't strictly have to be a column name, it could be LEN(email) for example). So we could say the following:
SELECT *
FROM (SELECT *,
ROW_NUMBER() OVER(PARTITION BY email ORDER BY email) PartitionId
FROM Contacts) T
This returns the following:
id email name PartitionId
-- ------------------- ------- -----------
1 foxydog@yahoo.com Bob 1
2 foxydog@yahoo.com Robert 2
3 foxydog@yahoo.com Dick 3
4 william@hotmail.com Bill 1
Notice how the row numbers restart (PartitionId) when the email address changes? We can then take this one step further and only return records with a partition id of 1 giving us a distinct result set:
SELECT *
FROM (SELECT *,
ROW_NUMBER() OVER(PARTITION BY email ORDER BY email) PartitionId
FROM Contacts) T
WHERE PartitionId = 1
This returns the following:
id email name PartitionId
-- ------------------- ------- -----------
1 foxydog@yahoo.com Bob 1
3 william@hotmail.com Bill 1
And voila! Distinct on only one column. Even though this solution requires a bit more TSQL, the execution plan looks much nicer as we are only doing one table scan: