We do a lot of code reviews and the use of SELECT DISTINCT always comes up as its picked up by many of the static code analysis tools as a known scalability and performance risk. The common question is “Why should I avoid distinct” and “How can I avoid it”
During a performance review we either Red Flag DISTINCT indicating that it is a severe performance issue, or orange flag it, meaning that its ok to keep it in the code, but consider avoiding this in future developments as its only slightly impacting performance and there may be bigger fish to fry performance wise.
To be honest, my main concern with select distinct is not related to performance it is more to do with the mind set behind distinct. A developer may have written a query seen that he has duplicate rows and rather than revisit the joins to see if he has joined to many tables, or if he can change a join criteria, he simply slaps a distinct on the query. So, a distinct on a high percentage of queries is a sign of “hiding” problems in query design.
Lets look at a few use cases. The terrible (red flag), The not optimal (orange flag) and the passable
1) Terrible. Using DISTINCT with expensive scalar functions or UDF’s
select distinct FirstName , reverse(Firstname) as MyFunction from dbo.DimCustomer
In this case we are just using the replicate function, but often a customer may have far more expensive UDF functions (which by themselves are not so good for performance). The key point is that with DISTINCT, the scalar function has to be evaluated BEFORE the aggregate, so will be over the entire table, not just the few rows selected. On one customer engagement that was 100 million iterations when only seven rows were returned.
An alternative version of the same query is below. We can see that the Scalar function is now performed AFTER the aggregate. What difference does this make ? well it depends on the number of unique rows versus the number of rows in the table, and how expensive the scalar function is. So we cannot say that this is always a key issue, but its worth reviewing.
select FirstName , reverse(Firstname) from ( select distinct FirstName from dbo.DimCustomer ) a
During a performance tuning engagement we will only red flag this as critical if the workload analysis puts the query as one of the top five statements by CPU, otherwise we will mention it, but try and aim for bigger fish.
Note that I mentioned UDF’s are bad. This is well documented elsewhere, especially by Simon Sabin. If there was a holy war against UDF’s Simon would be leading the charge!
2) Bad. Using DISTINCT with One to Many Joins
select distinct EnglishPromotionName from dbo.DimPromotion p inner join dbo.FactInternetSales s on s.PromotionKey=p.PromotionKey
In this case, we need to perform a join and a Aggregate and a sort. A much more efficient query is written below using EXISTS
select p.EnglishPromotionName from dbo.DimPromotion p where exists (select * from dbo.FactInternetSales s where s.PromotionKey=p.PromotionKey)
Notice as because we use EXISTS we do not need the aggregate or the sort in the query plan, the exists will just return the first occurrence, which is much more efficient.
3 Passable. Using DISTINCT on a single table or with 1-1 Joins
select distinct FirstName from dbo.DimCustomer
In this case its not really possible to get a better query plan, is DISTINCT is “ok”. However we would often write this as an aggregate query with GROUP BY
select FirstName from dbo.DimCustomer group by FirstName
Why would we use the group by ? Well, clearly not for performance as the plans are the same, the reason is simply that we know that static code analysis tools will pick up distinct as a potential bad practice and we don’t want to have to continually justify its use. The group by will squeeze pass code reviews without raising an alert. Even though its technically the same plan anyway
4 The Unforgivable. Using DISTINCT to Hide Gratuitous Joins
select distinct Firstname from dbo.DimCustomer c INNER join dbo.FactInternetSales s on c.CustomerKey=s.CustomerKey
Assuming that every customer had at least one sale (eg that’s why they are customers!), the inner join is not needed so we can simply write
select distinct Firstname from dbo.DimCustomer c
5) The looks bad but is actually ok one
select distinct Firstname from dbo.DimCustomer c LEFT join dbo.FactInternetSales s on c.CustomerKey=s.CustomerKey
In this case the left join is worthless as the we are only doing a distinct on FirstName, so you would think this will result in a horrible query plan. However, god bless the Query Processor Team, they remove redundancy joins from the query so the actual plan for this is as below:
Anyone have any more base/ good use cases for distinct I would love to hear about or add them.