Queries with OR conditions are notorious for being hard to make fast.
Avoid them if you can, read on if you cannot.
Let’s backtrack to the problem. Our database is modeling a simple family - person (1:N) situation. A person belongs to a single family (household). To make it a little more real-world, we have another table person_relationship, which models how one person is related to another. (Sidenote, we only store relationships between people from different families (households).)
Now the client would like to choose a family and get a list of persons, who belong to the family or are related to someone from the family.
The straightforward way to get the answer is… actually in the title of the post.
We query the person table and add an outer join on the relationship and inner join on the (other) person. Finally we check if the (root/from) person’s or the joined (other) person’s family is the one we look for.
SELECT p.id, p.name
FROM person p
LEFT JOIN person_relationship r ON r.from_person_id = p.id
LEFT JOIN person other ON other.id = r.to_person_id
WHERE p.familyid = @familyId OR other.familyid = @familyId;
When I run the query, it is not only slow, it also produces too many records. Argh.
You can safely always do this one. Push the relevant where condition(s) into the join on-condition.
We copy the family condition from the where condition onto the join. Then we replace the where condition with an existence check: id of the joined (other) person is not null.
SELECT p.id, p.name
FROM person p
LEFT JOIN person_relationship r ON r.from_person_id = p.id
LEFT JOIN person other ON other.id = r.to_person_id AND other.familyid = @familyId
WHERE p.familyid = @familyId OR other.id IS NOT NULL;
Both joins in our query are left (outer). The second join however could be an inner join relative to the relationship.
But, if we change the second join to inner, this will change the semantics! It effectively makes the first join inner too.
How do we tell sql we want to nest joins? This one is really not obvious, i found the proper syntax on stackoverflow thank you Gajus.
SELECT p.id, p.name
FROM person p
LEFT JOIN (
person_relationship r
INNER JOIN person other ON other.id = r.to_person_id AND other.familyid = @familyId
) ON r.from_person_id = p.id
WHERE p.familyid = @familyId OR other.id IS NOT NULL;
Why too many records? I’m sure you already know this one. The person and person_relationship are 1:N. If Jane is related to Peter and Paul (grandma to grandsons) then we would get both rows in the output.
Ok, ok, yes we can slap a Distinct on it and its done.
SELECT DISTINCT p.id, p.name
FROM person p
LEFT JOIN person_relationship r ON r.from_person_id = p.id
LEFT JOIN person other ON other.id = r.to_person_id AND other.familyid = @familyId
WHERE p.familyid = @familyId OR other.id IS NOT NULL;
Can we do better? Is there another way to tell sql we only care whether at least one relationship exists? The more obscure way is to use APPLY. This allows as to “join” with a subquery and control how many record the “join” produces. It is very useful if we need
For our particular scenario, the next fix is much more obvious.
SELECT p.id, p.name
FROM person p
OUTER APPLY (
SELECT TOP 1 r.from_person_id
FROM person_relationship r
INNER JOIN person other ON other.id = r.to_person_id
WHERE r.from_person_id = p.id AND other.familyid = @familyId
) rel
WHERE p.familyid = @familyId OR rel.from_person_id IS NOT NULL;
Experienced db admins are rolling their eyes by now, so let’s get to (one of the) point(s). If you care whether some row exists, then say so and use Exists (duh, i know).
The join becomes a subquery.
Just keep in mind that
SELECT p.id, p.name
FROM person p
WHERE p.familyid = @familyId
OR EXISTS (
SELECT 1
FROM person_relationship r
INNER JOIN person other ON other.id = r.to_person_id
WHERE r.from_person_id = p.id AND other.familyid = @familyId
);
The “exists subquery" tip made our real-life query fast. From 15 seconds to sub second. Until we added an order-by and top. The top is added automatically by our generic paging code.
Sql server decided it will again use the index for the order-by, because surely we will find the top 50 rows fast.
Nope…
Since you and me live in the real world, we know that the result of the query will be at most tens of people, usually less. The wast majority of people have a couple of siblings, two parents and a few grandparents.
That means to get the 50 results, we will go over the whole table. Which is not optimal to say the least.
There is a trick we can use to nudge sql server towards the better execution plan.
OPTION(USE HINT('DISABLE_OPTIMIZER_ROWGOAL'))).
This tells sql server to disregard TOP
from optimizations.
However, adding a large NVARCHAR
column to the select and we are back to scanning the whole table.
Since you followed us this far the rabbit hole, is there a way to hint more forcefully? Not really, but we can put the large columns on a self join, thus hiding them from the optimizer.
Seeing how brittle this is, we can only advise this “fix” as the very last resort solution.
SELECT p.id, p.name, s.long_text_column_1, s.long_text_column_2
FROM person p
INNER JOIN person s ON (p.id=s.id)
WHERE p.familyid = @familyId
OR EXISTS (
SELECT 1
FROM person_relationship r
INNER JOIN person other ON other.id = r.to_person_id
WHERE r.from_person_id = p.id AND other.familyid = @familyId
);
ORDER BY p.name, p.id
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY
OPTION(USE HINT('DISABLE_OPTIMIZER_ROWGOAL'))
If we go back to the start and examine what the client asks for? Give me a list of
This is not one query, but two…, of course. Sadly, any time you have an OR condition on different and especially joined columns, you have to leave the declarative world behind and imperatively tell sql what to do. And that is to write the query as a union of the first and second case.
-- First: Persons in the family
SELECT p.id, p.name
FROM person p
WHERE p.familyid = @familyId
UNION
-- Second: Persons related to someone in the family (but not in the family themselves)
SELECT other.id, other.name
FROM person_relationship r
INNER JOIN person from_p ON r.from_person_id = from_p.id
INNER JOIN person other ON other.id = r.to_person_id
WHERE from_p.familyid = @familyId -- AND (other.familyid IS NULL OR other.familyid <> @familyId);
Since it is much faster to use union all (does not remove duplicates), we should make the results non-overlapping (disjoined).
UNION
will be the best one.Our experience is with SQL Server, but seems like Postgresql behaves similarly.
UNION
can be your friend. And surely we all have better things to do than arguing with the sql optimizer.Happy hacking!