In our previous article we discussed how to select and retrieve data from databases using SQL. Since many databases can often have millions of records, however, we need a way to filter and sort the data with SQL.
In this guide, we'll discuss several new clauses and operations for filtering data such as:
This article is based on notes from Week 2 of this course on SQL for Data Science and is organized as follows:
- The Basic of Filtering with SQL
- Advanced Filtering with
- Using Wildcards in SQL
- Sorting Data with
- Math Operators
- Aggregate Functions
- Grouping Data with SQL
This post may contain affiliate links. See our policy page for more information.
The Basic of Filtering with SQL
In this section, we'll discuss how to use filtering in order to narrow down the data we want to retrieve from a database.
Filtering is also used when you only want to perform analysis on a subset of data or use specific data as part of the model.
In particular, we'll discuss how the use the
WHERE clause with common operators, the
BETWEEN clause, and explain the concept of a
The Benefits of Filtering
As mentioned, filtering is used when you need to be specific about the data you want to retrieve.
Filtering is also used to reduce the number of records you want to retrieve.
This can increase query performance and reduce the strain on the client application.
WHERE Clause Operators
To do so, we used the
WHERE clause after the
SELECT column name, column name FROM table_name WHERE column name operator value;
In terms of operator values in the
WHERE clause, we can use the following:
- = Equal operator
- <> Not equal
- > Greater than operator
- < Less than operator
- = Equal operator
- >= Greater than or equal
- <= Less than or equal
- BETWEEN Between an inclusive range
- IS NULL Is a null value
When we're filtering values, we can filter using a string or a single value. For example, we can select products above a certain price as follows:
SELECT ProductName, UnitPrice, SupplierID FROM Products WHERE UnitPrice >= 75;
Advanced Filtering with
In this section we'll look at several more advanced filtering techniques including
IN operator specifies a range of conditions to filter by.
In order to specify the number of conditions, we enclose the
IN operator with parentheses () and a comma delimited list of values:
SELECT ProductId, UnitPrice, SupplierID FROM Products WHERE SupplierID IN (9, 10, 11);
OR operator is another useful filtering technique.
It's important to note that a database management system will not evaluate the second condition in a
WHERE clause if the first condition is met.
This means you need to be specific about the order you place items in the query.
SELECT ProductId, UnitPrice, SupplierID FROM Products WHERE ProductName = 'Tofu' OR 'Konbu';
It's worth noting that
IN works the same as
IN gives you more options in the number of items you can list.
IN also executes faster than
OR and you don't need to worry about the order in which you place items.
Order of Operations
In writing a query you can use both
OR operators, although know that SQL processes
For this reason, it's best practice to not rely on the default order of operations and instead be specific about the order using
() when using
NOT operator is used to exclude certain options.
This is useful if you want to select everything from a database except certain items:
SELECT * FROM Employees WHERE NOT CITY='London' AND NOT City='Seattle';
Using Wildcards in SQL
In this section, we'll discuss the concept of wildcards in SQL and how to the
LIKE operator with wildcards.
What are Wildcards?
Wildcards are a special character used to match certain parts of a value.
Wildcards search for a pattern from literal text, wildcard character, or a combination of the two.
LIKE with wildcards, which is actually a predicate instead of an operator, although they're commonly referred to as operators.
It's important to note that
LIKE can only be used with strings and cannot be used for non-text datatypes.
As such, they are useful for data scientists as you're working with string variables.
Using % Wildcards
In order to use a wildcard you need to add a % sign before, after, or in the middle of a string. For example:
'%Pizza'gets anything ending with the word pizza
'Pizza%'gets anything after the word pizza
'%Pizza%'gets anything before and after the word pizza
'S%E'gets anything that starts with "S" and ends with "E"
Sorting Data with
Up until now we've focused on filtering data, although there is no logical order to the data. In other words, it is still returned in the same order as it was entered in the database.
By sorting data with the
ORDER BY clause, we can return data based on numerical order or based on ascending or descending order according to the alphabet.
ORDER BY allows you to sort data based on particular columns as follows:
SELECT something FROM database ORDER BY characteristic
A few rules for using
ORDER BY include:
- It takes the name of one or more columns
- Add a comma after each additional column name
- It can sort by a column not retrieved
- It must always be the last clause in a
You can also sort by column positions instead of the names by using
ORDER BY 2,3, which sorts by the second and third columns.
Finally, you can sort by directions using:
DESCfor descending order
ASCfor ascending order
This only applies to the column names it directly precedes.
After we've selected, filtered, and sorted data, we can then apply mathematical operations to it.
In this section, we'll discuss several basic math calculations you can perform using data and the order of operations.
Below are several simple math operators in SQL:
For example, below is an example that multiplies the total units on order by the unit cost in order to calculate the total order cost:
SELECT ProductID , UnitsOnOrder , UnitPrice , UnitsOnOrder * UnitPrice AS TotalOrderCost FROM Products
The order of operations for math operators in SQL follows the normal order:
Aggregate functions provide several ways to summarize data with SQL.
A few common use cases of aggregate functions include finding the highest and lowest values in a dataset, the total number of rows, the average value, and so on.
The aggregate functions we can use to analyze data are all quite self-explanatory, including:
For example, below we're selecting the average unit price from all products:
SELECT AVG(UnitPrice) AS avg_price FROM products
One thing to note about aggregate functions is that if the word
DISTINCT is not used, SQL will assume you want to retrieve
ALL the data. If there are duplicate records in the data,
DISTINCT allows you to remove them from the query.
Grouping Data with SQL
In the previous section on aggregate functions, we just looked at how to pull a single data field and aggregate over it.
Often, however, we need to be able to perform additional aggregations using the
GROUP BY and
HAVING clauses to better sort the data.
For example, if we want to know the number of customers by region we can do this as follows:
SELECT Region, COUNT(CustomerID) AS total_customers FROM Customers GROUP BY Region;
It's important to note that
GROUP BY clauses can contain multiple columns. Also, every column in the
SELECT statement must be present in a
GROUP BY clause, except for aggregated calculations.
In terms of grouping data, we cannot use the
WHERE clause because it filters based on rows.
Instead, we can use the
HAVING clause when filtering for groups with an aggregate function. For example, we can group by customers with more than 2 orders as follows:
SELECT CustomerID ,COUNT (*) AS orders FROM Orders GROUP BY CustomerID HAVING COUNT (*) >=2;
Summary: Filtering, Sorting, and Calculating Data with SQL
In this article, we discussed how to filter, sort, aggregate, calculate, and group data with SQL.
In summary, filtering allows you to narrow down your results, as well as to increase the query and application performance.
Filtering and sorting data also allows you to find specific values, blank values, and ranges of values. All of these functions help understanding your data better and with descriptive statistics.