As a data scientist there are many times that you'll need to combine or merge data from several places in order to perform analysis.
In this article, we'll discuss how to do this using subqueries and joins in SQL. This article is based on notes from Week 3 of this course on SQL for Data Science and is organized as follows:
- Introduction to Subqueries
- Subquery Best Practices
- Introduction to Joining Tables
- Cartesian (Cross) Joins
- Inner Joins
- Left, Right, and Full Outer Joins
This post may contain affiliate links. See our policy page for more information.
Introduction to Subqueries
In our previous articles on SQL we discussed how to retrieve data, filter and sort it, and perform calculations. If we're only using one data table for analysis, however, we're still very limited in our analysis.
In data science, a lot of value can come from relational databases when we combine various facts, dimensions, and data sources together to derive new insights.
In this section, we'll define subqueries, their advantages and disadvantages, and explain how they help us merge data from two or more tables.
What are Subqueries?
Subqueries are queries that are embedded into other queries.
When working with a relational database, data is often stored in multiple tables.
Subqueries merge data from multiple sources together and also help with adding other filtering criteria to the query.
An example of a subquery is if we want to know all the customer that have an order with freight over 100—the freight information is in one table and customer information in another.
We can set this problem up as a subquery as follows:
- Retrieve all customer IDs for orders with freight over 100
- Retrieve customer information
- Combine the two queries into a single table
Below is how we combine this into a subquery:
SELECT DISTINCT CustomerID ,CompanyName ,Region FROM Customers WHERE CustomerID in (SELECT CustomerID) FROM Orders WHERE Freight > 100);
When working with subquery statements, it's important to note that it will always perform the innermost
SELECT portion first.
In this case, the DMBS is performing two operations:
- Getting the order numbers for the product selected
- Adding that to the
WHEREclause and processing the overall
As you can see, this reduces the number of queries we have to write and keeps everything concise in a single query.
Subquery Best Practices
In this section, we'll discuss subquery best practices, including:
- How to write subqueries within subqueries
- Performance limitations with the overuse of subqueries
- How to use subqueries as calculations
There is no limit to the number of subqueries you can have, although performance does decrease when you nest too deeply.
Also keep in mind that subquery selects can also only retrieve a single column.
Subqueries within Subqueries
Below is an example of a subquery within a subquery in which we select:
- The order numbers for toothbrushes
- Customers IDS for the orders
- Customer information for the orders
SELECT Customer_name, Customer_contact FROM Customers WHERE cust_id IN SElECT customer_id FROM Orders WHERE order_number IN (SELECT order_number FROM OrderItems WHERE prod_name = 'Toothbrush');
As you can see, proper indentation is important when writing subqueries within subqueries.
Subqueries as Calculations
Below is an example of a subquery as a calculation in which we want to get the total number of orders placed by several customers:
SELECT COUNT (*) AS orders FROM Orders WHERE customer_id = '123456' SELECT customer_name ,customer_state (SELECT COUNT (*) AS orders FROM Orders WHERE Orders.customer_id = Customer.customer_id) AS orders FROM customers ORDER BY Customer_name
In summary, subqueries are a powerful tool, although they need to be used selectively to not decrease performance of the query.
Introduction to Joining Tables
For the reasons discussed above, subqueries aren't always the best option to combine data.
JOIN method is another popular way to combine data from multiple tables.
There are several benefits of breaking data into tables in the first place, a few of which include:
- Efficient data storage
- Easier manipulation
- Greater scalability
- Tables can logically model a process
- Tables are related through common values (keys)
What are Joins?
Joins associate correct records from each table on the fly.
They allow for data retrieval from multiple tables in a single query.
Joins are not permanent, meaning they only persist for the duration of the query execution.
Cartesian (Cross) Joins
In this section, we'll look at one type of join called a Cartesian join, also referred to as a cross join.
Cartesian joins allow you to take each row from the first table and join it with all the rows from another table.
One way to think of a cross join is that if the first table contain $x$ rows and the second table contains $y$ rows, the end result will be $x * y$.
As you can imagine, this is very computationally expensive, as if you have two tables with just 10 rows each, the end result will be 100 rows.
Since we're often working with much larger tables, cross joins aren't used that frequently, although can be useful in certain circumstances.
Below is an example of the syntax of a cross join:
SELECT product_name ,unit_price ,company_name FROM suppliers CROSS JOIN products;
The output of this example will be the number of joins in the first table multiple by the number of rows in the second table.
Inner joins are one of the most frequently used joins in SQL.
INNER JOIN selects records that have matching values in both tables.
This is why keys are so important in tables—inner joins will use them to look for records that are matching in both tables.
Below is an example of an
SELECT suppliers.CompanyName ,ProductName ,UnitPrice FROM Suppliers INNER JOIN Products ON Suppliers.supplierid = Products.supplierid
Here the connection between the two tables is the supplier ID in the suppliers table and the supplier ID in the products table.
Below is the syntax for
- The join condition is in the
FROMclause and uses the
- Joining more tables together affects overall database performance
- You can join multiple tables, there is no limit
- First list all the tables, then define the join conditions
A few best practices for joins include:
- Ensure you're using pre-qualifying names
- Do not make unnecessary joins as they can be computationally expensive
- Think about the type of join you're making to make sure it's the right data we want in the output
Aliases and Self Joins
Previously we discussed using aliases when aggregating a field. Aliases are also useful in tables and joining tables as they make them much easier to read and write.
To recap, aliases give a table or a column a temporary name to make it more readable. As alias only exists for the duration of the query, for example:
SELECT column_name FROM table_name AS alias_name
Returning to joins, we can also join a table to itself, which is referred to as a self join. In this case, it takes the table and treats it like two separate tables, and then joins the original table to itself:
SELECT column_name(s) FROM table1 T1, table2 T2 WHERE condition;
To summarize, in order to use self join we need to use aliases and specify the conditions for each table alias.
Left, Right, and Full Outer Joins
In this section, we'll discuss several other core joins including left joins, right joins, and full outer joins.
It's important to note that only SQLite only uses left joins—right joins and full outer joins are found in other database management systems.
Left joins return all records from the left table (table 1) and the matching records from the right table (table 2). If there are no matches from the right side, the result is
Right joins return all records from the right table (table 2) and the matching records from the left table (table 1). If there are no matches from the left side, the result is
Full Outer Joins
Full outer joins return all records when there is a match in either the left or right table records.
Below is an example of a left join that selects all customers and any orders they may have:
SELECT C.CustomerName, O.OrderID FROM Customers. C LEFT JOIN Orders O ON C.CustomerID = O.CustomerID ORDER BY C.CustomerName;
Keep in mind that since we list customers first, this means it is the left table, or table 1. This also means that left joins can be turned into right joins by reversing the order of the tables, and you'd still get the same output.
Unions are a less common type of join in SQL, although they can be useful in certain situations.
UNION operator is used to combine the result-set of two or more
SELECT statement within
UNION must have the same number of columns, and columns must have similar data types.
The columns in each
SELECT statement must also be in the same order.
Below is the basic syntax for a
SELECT column_name(s) FROM table1 UNION SELECT column_name(s) FROM table2;
Summary: Subqueries and Joins with SQL
To summarize, joins expand our ability to pull data by however many tables we have.
Although it is easy to get results with joins, you always need to be thinking about what data is being pulled and make sure they are the right results. It's also important to check for duplicates when using joins.
Another valuable step in SQL is to actually map out how you are going to join tables, which can often save time down the line.
Finally, remember that the more tables you join, the slower the database will perform, meaning you need to be strategic with their use.