Various DBMSs offer a wide range of types of JOIN operators for tables. If you encounter a performance problem with CROSS JOIN – for example, a Cartesian product of a table with a million records with itself – welcome, this article lists the simplest ways to get rid of CROSS JOIN.
Of course, you can review and simplify the business logic itself or the calculation methods; this article discusses some basic cases that you should not forget about and it makes sense to check first. I hope they will be relevant or can help you find other SQL optimizations.
The examples in the article are based on CROSS JOIN from ClickHouse. The current version of ClickHouse does not automatically optimize CROSS JOIN. It is also worth noting that since SQL queries are often not written manually, but, for example, assembled piece by piece programmatically, the cases listed below are quite real.
CROSS JOIN with a table all fields of which are not used further
This case is that one of the tables from the CROSS JOIN is no longer used. An example SQL for this case with a table join on itself is as follows:
SELECT T_1.a1, T_1.a1 FROM T AS T_1 CROSS JOIN T AS T_2;
ClickHouse does not optimize this CROSS JOIN, so there may be performance problems out of the blue. All SQL examples are available in playground.
Thus, before considering complex SQL optimizations for CROSS JOIN, it makes sense to do a simple check that the fields of both tables from the CROSS JOIN are selected and further used, otherwise, if the T_2 fields are not used, then you can get rid of T_2 and the CROSS JOIN respectively:
SELECT a1, a1 FROM T;
It is necessary to take into account that CROSS JOIN adds duplicates (this is what distinguishes SQL from relational algebra, i.e. SQL is based on multisets, and not on sets), i.e. The two SQL queries listed above return the same results up to duplicates (which can be removed using GROUP BY).
Everything may seem quite obvious, however, a proof can be given based on relational algebrawhich is the projection of the Cartesian product
PROJECT (T TIMES T`) { A1A2,…, Am }
is equal to T. Cartesian product of T with attributes A1A2…, Am to the same relation T` with the same attributes A`1A`2…, A`m:
T TIMES T`
represents all tuples (a1,a2…, am,a`1,a`2…, a`m), such that
(a1,a2…, am) ∈ T,
(a`1,a`2…, a`m) ∈ T`.
Let’s take the projection:
PROJECT (T TIMES T`) { A1A2…, Am }
represents all tuples (a1,a2…, am), such that
(a1,a2…, am) ∈ T,
(a`1,a`2…, a`m) ∈ T`,
or, getting rid of (a`1,a`2…, a`m) ∈ T`, since tuples (a`1,a`2…, a`m) are not used in the conditions, we get:
all tuples (a1,a2…, am), such that
(a1,a2…, am) ∈ T.
It can be seen that this corresponds to T, which is what needed to be proven.
CROSS JOIN with additional conditions in ON other JOINs
Let’s look at an example of SQL for three table joins on itself, with the last LEFT SEMI JOIN using a condition that essentially joins all three tables:
SELECT T_1.a1, T_2.a1 FROM T AS T_1 CROSS JOIN T AS T_2 LEFT SEMI JOIN T AS T_3 ON T_1.a1 = T_3.a1 AND T_2.a1 = T_3.a1;
Equivalent:
SELECT T_1.a1, T_2.a1 FROM T AS T_1 INNER JOIN T AS T_2 ON T_1.a1 = T_2.a1;
Or
SELECT a1, a1 FROM T;
This case is also quite obvious, however, from the point of view of relational algebra, the following proof can be given. We will conditionally use LEFT SEMI JOIN further in the relational record and assume that AkA`kA“k – key (unique) attributes and T, T` and T“ – the same relation T:
PROJECT (T TIMES T` LEFT SEMI JOIN T“ WHERE ak =a`k =a“k) { A1A`1 },
Let’s write this in tuples as:
(a1,a`1), such that
(a1,a2…, am) ∈ T,
(a`1,a`2…, a`m) ∈ T`,
(a“1,a“2…, a“m) ∈ T“,
ak =a`k =a“k (ON condition on the key unique field).
From the equality of the key fields and the fact that T, T` and T“ are the same relation T, the equality of the corresponding tuples follows:
ak =a`k =a“k => (a1,a2…, am) = (a`1,a`2…, a`m) = (a“1,a“2…, a“m).
In this regard, we simplify the condition for T“ and obtain that the projection PROJECT (T TIMES T` LEFT SEMI JOIN T“ WHERE ak =a`k =a“k) { A1A`1 } represents tuples:
(a1,a`1), such that
(a1,a2…, am) ∈ T,
(a`1,a`2…, a`m ) ∈ T`,
ak =a`k.
Or, simplifying for T`, we get that these will be tuples
(a1,a1), such that
(a1,a2…, am) ∈ T,
As you can see, this is equivalent to PROJECT T { A1A`1 }, or in SQL terms this is SELECT a1, a1 FROM T;
which was what needed to be proven.
CROSS JOIN with additional conditions in WHERE
Similar to the previous case, only additional conditions combining all 3 tables are placed in WHERE:
SELECT T_1.a1, T_2.a1 FROM T AS T_1 CROSS JOIN T AS T_2 LEFT SEMI JOIN T AS T_3 ON T_1.a1 = T_3.a1 WHERE T_1.a1 = T_2.a1;
Simplifying, we get:
SELECT T_1.a1, T_2.a1 FROM T AS T_1 INNER JOIN T AS T_2 ON T_1.a1 = T_2.a1;
and onwards
SELECT a1, a1 FROM T;
I hope that the listed methods for optimizing CROSS JOIN, despite their obviousness, can be useful in themselves, or as food for thought and more complex optimizations. Good luck with SQL!
Acknowledgement and Usage Notice
The editorial team at TechBurst Magazine acknowledges the invaluable contribution of the author of the original article that forms the foundation of our publication. We sincerely appreciate the author’s work. All images in this publication are sourced directly from the original article, where a reference to the author’s profile is provided as well. This publication respects the author’s rights and enhances the visibility of their original work. If there are any concerns or the author wishes to discuss this matter further, we welcome an open dialogue to address potential issues and find an amicable resolution. Feel free to contact us through the ‘Contact Us’ section; the link is available in the website footer.