SQL for Data Science: Subqueries and Joins

In this article on SQL for data science, we discuss how to merge and combine data from multiple sources using subqueries and joins.

6 months ago   •   6 min read

By Peter Foy

As a data scientist there are many times that you'll need to combine or merge data from several places in order to perform analysis.

In this article, we'll discuss how to do this using subqueries and joins in SQL. This article is based on notes from Week 3 of this course on SQL for Data Science and is organized as follows:

  • Introduction to Subqueries
  • Subquery Best Practices
  • Introduction to Joining Tables
  • Cartesian (Cross) Joins
  • Inner Joins
  • Left, Right, and Full Outer Joins
  • Unions

This post may contain affiliate links. See our policy page for more information.

Introduction to Subqueries

In our previous articles on SQL we discussed how to retrieve data, filter and sort it, and perform calculations. If we're only using one data table for analysis, however, we're still very limited in our analysis.

In data science, a lot of value can come from relational databases when we combine various facts, dimensions, and data sources together to derive new insights.

In this section, we'll define subqueries, their advantages and disadvantages, and explain how they help us merge data from two or more tables.

What are Subqueries?

Subqueries are queries that are embedded into other queries.

When working with a relational database, data is often stored in multiple tables.

Subqueries merge data from multiple sources together and also help with adding other filtering criteria to the query.

An example of a subquery is if we want to know all the customer that have an order with freight over 100—the freight information is in one table and customer information in another.

We can set this problem up as a subquery as follows:

  • Retrieve all customer IDs for orders with freight over 100
  • Retrieve customer information
  • Combine the two queries into a single table

Below is how we combine this into a subquery:

SELECT DISTINCT CustomerID
,CompanyName
,Region
FROM Customers
WHERE CustomerID in (SELECT CustomerID)
	FROM Orders
    WHERE Freight > 100);

When working with subquery statements, it's important to note that it will always perform the innermost SELECT portion first.

In this case, the DMBS is performing two operations:

  • Getting the order numbers for the product selected
  • Adding that to the WHERE clause and processing the overall SELECT statement

As you can see, this reduces the number of queries we have to write and keeps everything concise in a single query.

Subquery Best Practices

In this section, we'll discuss subquery best practices, including:

  • How to write subqueries within subqueries
  • Performance limitations with the overuse of subqueries
  • How to use subqueries as calculations

There is no limit to the number of subqueries you can have, although performance does decrease when you nest too deeply.

Also keep in mind that subquery selects can also only retrieve a single column.

Subqueries within Subqueries

Below is an example of a subquery within a subquery in which we select:

  • The order numbers for toothbrushes
  • Customers IDS for the orders
  • Customer information for the orders
SELECT Customer_name, Customer_contact
FROM Customers
WHERE cust_id IN
	SElECT customer_id
    FROM Orders
    WHERE order_number IN (SELECT order_number
    	FROM OrderItems
        	WHERE prod_name = 'Toothbrush');

As you can see, proper indentation is important when writing subqueries within subqueries.

Subqueries as Calculations

Below is an example of a subquery as a calculation in which we want to get the total number of orders placed by several customers:

SELECT COUNT (*) AS orders
FROM Orders
WHERE customer_id = '123456'

SELECT customer_name
	,customer_state
    (SELECT COUNT (*) AS orders
    FROM Orders
    WHERE Orders.customer_id = Customer.customer_id) AS orders
    FROM customers
    ORDER BY Customer_name

In summary, subqueries are a powerful tool, although they need to be used selectively to not decrease performance of the query.

Introduction to Joining Tables

For the reasons discussed above, subqueries aren't always the best option to combine data.

The JOIN method is another popular way to combine data from multiple tables.

There are several benefits of breaking data into tables in the first place, a few of which include:

  • Efficient data storage
  • Easier manipulation
  • Greater scalability
  • Tables can logically model a process
  • Tables are related through common values (keys)

What are Joins?

Joins associate correct records from each table on the fly.

They allow for data retrieval from multiple tables in a single query.

Joins are not permanent, meaning they only persist for the duration of the query execution.

Cartesian (Cross) Joins

In this section, we'll look at one type of join called a Cartesian join, also referred to as a cross join.

Cartesian joins allow you to take each row from the first table and join it with all the rows from another table.

One way to think of a cross join is that if the first table contain $x$ rows and the second table contains $y$ rows, the end result will be $x * y$.

As you can imagine, this is very computationally expensive, as if you have two tables with just 10 rows each, the end result will be 100 rows.

Since we're often working with much larger tables, cross joins aren't used that frequently, although can be useful in certain circumstances.

Below is an example of the syntax of a cross join:

SELECT product_name
 ,unit_price
 ,company_name
FROM suppliers CROSS JOIN products;

The output of this example will be the number of joins in the first table multiple by the number of rows in the second table.

Inner Joins

Inner joins are one of the most frequently used joins in SQL.

An INNER JOIN selects records that have matching values in both tables.

This is why keys are so important in tables—inner joins will use them to look for records that are matching in both tables.

Below is an example of an INNER JOIN:

SELECT suppliers.CompanyName
 ,ProductName
 ,UnitPrice
FROM Suppliers INNER JOIN Products 
ON Suppliers.supplierid = Products.supplierid

Here the connection between the two tables is the supplier ID in the suppliers table and the supplier ID in the products table.

Below is the syntax for INNER JOIN:

  • The join condition is in the FROM clause and uses the ON clause
  • Joining more tables together affects overall database performance
  • You can join multiple tables, there is no limit
  • First list all the tables, then define the join conditions

A few best practices for joins include:

  • Ensure you're using pre-qualifying names
  • Do not make unnecessary joins as they can be computationally expensive
  • Think about the type of join you're making to make sure it's the right data we want in the output

Aliases and Self Joins

Previously we discussed using aliases when aggregating a field. Aliases are also useful in tables and joining tables as they make them much easier to read and write.

To recap, aliases give a table or a column a temporary name to make it more readable. As alias only exists for the duration of the query, for example:

SELECT column_name
FROM table_name AS alias_name

Returning to joins, we can also join a table to itself, which is referred to as a self join. In this case, it takes the table and treats it like two separate tables, and then joins the original table to itself:

SELECT column_name(s)
FROM table1 T1, table2 T2
WHERE condition;

To summarize, in order to use self join we need to use aliases and specify the conditions for each table alias.

Left, Right, and Full Outer Joins

In this section, we'll discuss several other core joins including left joins, right joins, and full outer joins.

It's important to note that only SQLite only uses left joins—right joins and full outer joins are found in other database management systems.

Left Joins

Left joins return all records from the left table (table 1) and the matching records from the right table (table 2). If there are no matches from the right side, the result is NULL.

Right Joins

Right joins return all records from the right table (table 2) and the matching records from the left table (table 1). If there are no matches from the left side, the result is NULL.

Full Outer Joins

Full outer joins return all records when there is a match in either the left or right table records.

Below is an example of a left join that selects all customers and any orders they may have:

SELECT C.CustomerName, O.OrderID
FROM Customers. C
LEFT JOIN Orders O ON C.CustomerID = O.CustomerID
ORDER BY C.CustomerName;

Keep in mind that since we list customers first, this means it is the left table, or table 1. This also means that left joins can be turned into right joins by reversing the order of the tables, and you'd still get the same output.

Unions

Unions are a less common type of join in SQL, although they can be useful in certain situations.

The UNION operator is used to combine the result-set of two or more SELECT statements.

Each SELECT statement within UNION must have the same number of columns, and columns must have similar data types.

The columns in each SELECT statement must also be in the same order.

Below is the basic syntax for a UNION:

SELECT column_name(s) FROM table1
UNION
SELECT column_name(s) FROM table2;

Summary: Subqueries and Joins with SQL

To summarize, joins expand our ability to pull data by however many tables we have.

Although it is easy to get results with joins, you always need to be thinking about what data is being pulled and make sure they are the right results. It's also important to check for duplicates when using joins.

Another valuable step in SQL is to actually map out how you are going to join tables, which can often save time down the line.

Finally, remember that the more tables you join, the slower the database will perform, meaning you need to be strategic with their use.

Resources

Spread the word

Keep reading