How to Rewrite and Optimize Your SQL Queries to Pandas in 5 Simple Examples

by Byron Dolon, Jun 2023

Querying a whole table

We can dive right into it by looking at the classic SELECT ALL from a table.

Here’s the SQL:

SELECT * FROM df

And here’s the Pandas code:

df
Pandas code output — Image by author

All you need to do is call the DataFrame in Pandas to return the whole table and all its columns.

You may also want to just look at a small subset of your table as a quick check before writing a more complicated query. In SQL, you’d use LIMIT 10 or something similar to get only a select number of rows. In Pandas, similarly, you can call df.head(10) or df.tail(10) to get the first or last 10 rows of the table.
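A quick sketch of both (the CSV file name here is just an illustration of how the df used throughout this article might be loaded):

import pandas as pd

# Load the sales data into a DataFrame (file name is hypothetical)
df = pd.read_csv("sales_data.csv")

# SQL: SELECT * FROM df LIMIT 10
df.head(10)  # first 10 rows

# Last 10 rows (no direct SQL equivalent without an ORDER BY)
df.tail(10)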

Querying a table without null values

To add to our initial select query, in addition to just limiting the number of rows, you can put conditions to filter the table inside a WHERE clause in SQL. For example, if you wanted all rows in the table without any null values in the Order_ID column, the SQL would look like this:

SELECT * FROM df WHERE Order_ID IS NOT NULL

In Pandas, you have two options:

# Option 1
df.dropna(subset="Order_ID")

# Option 2
df.loc[df["Order_ID"].notna()]

Pandas code output — Image by author

Now, the table we get back doesn’t have any null values in the Order_ID column (which you can compare to the first output above). Both options will return a table without the null values, but they work slightly differently.

You can use the native dropna method in Pandas to return the DataFrame without any null rows, specifying in the subset parameter which columns you’d like to drop nulls from.

Alternatively, the loc method lets you pass a boolean mask to filter the DataFrame. Here, we pass df["Order_ID"].notna(), which on its own returns a Series of True and False values mapped to the original DataFrame’s rows, indicating whether each Order_ID is null. When we pass it to the loc method, it returns only the rows of the DataFrame where df["Order_ID"].notna() evaluates to True (so all rows where the Order_ID column isn’t null).
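If you run the mask on its own, the intermediate result is just a boolean Series. A rough sketch of what that might look like (the index positions and values here are purely illustrative):

df["Order_ID"].notna()
# 0     True
# 1    False
# 2     True
# ...
# Name: Order_ID, dtype: bool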

Querying specific columns from a table

Next, instead of selecting all columns from the table, let’s select just a few specific columns. In SQL, you’d write the column names in the SELECT part of the query like this:

SELECT Order_ID, Product, Quantity_Ordered FROM df

In Pandas, we’d write the code like this:

df[["Order_ID", "Product", "Quantity_Ordered"]]
Pandas code output — Image by author

To select a specific subset of columns, you can pass a list of the column names into the DataFrame in Pandas. You can also define the list separately like this for clarity:

target_cols = ["Order_ID", "Product", "Quantity_Ordered"]
df[target_cols]

Assigning the target columns to a list that you then pass into the DataFrame makes the code easier to maintain over time. For example, you could have a function return the columns you need as a list, or append and remove columns from the list depending on what kind of output the user needs, as in the sketch below.
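As a rough sketch of that idea (the function name and the detailed flag are made up for illustration):

def get_target_cols(detailed: bool = False) -> list:
    # Return the columns to select, optionally with an extra detail column
    cols = ["Order_ID", "Product", "Quantity_Ordered"]
    if detailed:
        cols.append("Purchase_Address")
    return cols

df[get_target_cols()]  # basic output
df[get_target_cols(detailed=True)]  # also includes the address column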

The GROUP BY in SQL and Pandas

We can now move on to aggregating data. In SQL, we do this by adding the column we want to group on to both the SELECT and GROUP BY clauses, and then wrapping another column in an aggregate function like COUNT in the SELECT clause. As an example, doing so will let us group all the individual Order_ID rows in the original table for each Product and count how many there are. The query can look like this:

SELECT
    Product,
    COUNT(Order_ID)
FROM df
WHERE Order_ID IS NOT NULL
GROUP BY Product

In Pandas, it would look like this:

df[df["Order_ID"].notna()].groupby(["Product"])["Order_ID"].count()
Pandas code output — Image by author

The output is a Pandas Series where the table is grouped by Product and there’s a count of all the Order_ID values for each product. Compared to our previous Pandas query where we only included a filter, we now do three things:

  1. Add groupby and pass the column (or list of columns) that you want to group the DataFrame on;
  2. Pass the name of the column to aggregate in square brackets on the grouped object;
  3. Call the count (or any other aggregate) method to perform the aggregation for the target column.

For better readability, we can assign the condition to a variable (this will come in handy later) and format the query so it’s easier to read.

condition = df["Order_ID"].notna()
grouped_df = (
    df.loc[condition]
    .groupby("Product")
    ["Order_ID"]  # select the column to count
    .count()
)
grouped_df

Now that we have most of the components of a complete SQL query, let’s take a look at a more complicated one and see what it would look like in Pandas.

SELECT
    Product,
    COUNT(Order_ID)
FROM df
WHERE Order_ID IS NOT NULL
    AND Purchase_Address LIKE '%Los Angeles%'
    AND Quantity_Ordered = 1
GROUP BY Product
ORDER BY COUNT(Order_ID) DESC

Here, we add a little to our previous query by including multiple filter conditions as well as an ORDER BY so that the table returned in our query is sorted by the measure we’re aggregating on. Since there are a few more components to this query, let’s take a look step by step at how we’d implement this in Pandas.

First, instead of passing multiple conditions when we call the loc method, let’s instead define a list of conditions and assign them to a variable FILTER_CONDITIONS.

FILTER_CONDITIONS = [
    df["Order_ID"].notna(),
    df["Purchase_Address"].str.contains("Los Angeles"),
    df["Quantity_Ordered"] == "1",
]

As before, a condition passed into loc should be a Pandas mask that evaluates to either true or false. It’s possible to pass multiple conditions to loc, but the syntax should look like this:

df.loc[condition_1 & condition_2 & condition_3]
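If you needed “or” logic instead, the same pattern works with the | operator. A quick sketch (the second city here is just an example and may not exist in your data):

# Hypothetical "or" filter: orders from Los Angeles OR San Francisco
or_condition = (
    df["Purchase_Address"].str.contains("Los Angeles")
    | df["Purchase_Address"].str.contains("San Francisco")
)
df.loc[or_condition]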

However, just passing a list of conditions like this won’t work:

df.loc[FILTER_CONDITIONS]  
# doesn't work -> you can't just pass a list into loc

You’ll get an error if you try the above, because each condition should be separated by the & operator for “and” conditions (or the | operator for “or” conditions, as sketched above). Instead, we can write some quick code to return the conditions in the correct format. We’ll make use of the functools.reduce function to put the conditions together.

To get a feel for how reduce combines things, try running it on a few strings in a notebook:

from functools import reduce
reduce(lambda x, y: f"{x} & {y}", ["condition_1", "condition_2", "condition_3"])

The output is a single string:

>>> 'condition_1 & condition_2 & condition_3'

Going back to our actual Pandas conditions, we can write this instead (without the string formatting and just using our defined list of conditions in the FILTER_CONDITIONS variable).

reduce(lambda x, y: x & y, FILTER_CONDITIONS)

What reduce does is apply a function cumulatively to the elements of an iterable. In our case, it runs the lambda over the items in our FILTER_CONDITIONS list, combining each of them with the & operator until there are no conditions left. For our three conditions, it effectively returns:

df["Order_ID"].notna() & df["Purchase_Address"].str.contains("Los Angeles") & df["Quantity_Ordered"] == "1"

Finally, let’s add the list of conditions to create a final group by query in Pandas:

final_df = (
    df.loc[reduce(lambda x, y: x & y, FILTER_CONDITIONS)]
    .groupby("Product")
    .size()
    .sort_values(ascending=False)
)

You’ll notice two additional differences from the previous query:

  1. Instead of specifying a specific column to count on, we can simply call the size method, which returns the number of rows in each group (this works here because every Order_ID value is unique and represents one row);
  2. There are a few different ways to do the ORDER BY in Pandas; one way is to simply call sort_values and pass ascending=False to sort in descending order.

If you wanted to use the previous syntax for aggregating the data it would look like this:

final_df = (
    df.loc[reduce(lambda x, y: x & y, FILTER_CONDITIONS)]
    .groupby("Product")
    ["Order_ID"].count()
    .sort_values(ascending=False)
)
Pandas code output — Image by author

The output of both methods will be the same as before: a Series indexed by the column you’re grouping on, with the counts for each product.
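If you want to verify that for yourself, a quick sanity check might look like this (a sketch, assuming df and FILTER_CONDITIONS are defined as above):

from functools import reduce

mask = reduce(lambda x, y: x & y, FILTER_CONDITIONS)
counts_via_size = df.loc[mask].groupby("Product").size()
counts_via_count = df.loc[mask].groupby("Product")["Order_ID"].count()

# True here because the filter already removed null Order_ID values
(counts_via_size == counts_via_count).all()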

If you wanted a DataFrame instead, you can call the reset_index method on the Series to get back proper columns for what you grouped on and what you aggregated (in this case, we grouped on “Product” and are counting “Order_ID”).

final_df.reset_index()
Pandas code output — Image by author

And there we have it! All the components of a full SQL query but finally written in Pandas. Some of the things we can do further to optimize this process for working with data over time include:

  • Putting the different lists of columns to SELECT or GROUP BY into their own variables or functions (so you or a user can modify them over time);
  • Moving the logic that combines the list of filter conditions into its own function, so the end user doesn’t need to puzzle over what the reduce logic is doing (see the sketch after this list);
  • Renaming the output column (or columns, if we’re aggregating on multiple) after calling reset_index for clarity, for example to “Count_Order_ID”.
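As a rough sketch of those last two ideas (assuming df and FILTER_CONDITIONS are defined as earlier; the helper name is just illustrative):

from functools import reduce


def combine_conditions(conditions):
    # Combine a list of boolean masks into a single "and" mask
    return reduce(lambda x, y: x & y, conditions)


final_df = (
    df.loc[combine_conditions(FILTER_CONDITIONS)]
    .groupby("Product")
    ["Order_ID"].count()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={"Order_ID": "Count_Order_ID"})
)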