PYTHON PROGRAMMING
Pandas offers a fantastic framework for operating on dataframes. In data science, we work with small, big, and sometimes very big dataframes. While analyzing small ones can be blazingly fast, even a single operation on a big dataframe can take noticeable time.
In this article, I will show that you can often shorten this time with something that costs practically nothing: the order of operations on a dataframe.
Imagine the following dataframe:
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})
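As a quick sanity check (an illustrative snippet of my own, not from the original code), we can confirm the frame's size:

# 25 columns: the letter string above has 25 characters (it skips "v").
print(df.shape)  # (1000000, 25)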
With a million rows and 25 columns, it's big. Many operations on such a dataframe will take noticeable time on current personal computers.
Imagine we want to filter the rows, keeping those that satisfy the condition a < 50_000 and b > 3000, and to select five columns: take_cols = ['a', 'b', 'g', 'n', 'x']. We can do this in the following way:
# First take the columns, then filter the rows.
subdf = df[take_cols]
subdf = subdf[subdf['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]
In this code, we take the required columns first, and then we filter the rows. We can achieve the same with the operations in the reverse order, first filtering the rows and then selecting the columns:
# First filter the rows, then take the columns.
subdf = df[df['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]
subdf = subdf[take_cols]
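Both orders yield the very same dataframe. If in doubt, we can check this directly (an illustrative snippet with hypothetical variable names, mirroring the two versions above):

cols_first = df[take_cols]
cols_first = cols_first[cols_first['a'] < 50_000]
cols_first = cols_first[cols_first['b'] > 3000]

rows_first = df[df['a'] < 50_000]
rows_first = rows_first[rows_first['b'] > 3000]
rows_first = rows_first[take_cols]

# .equals() checks that values, dtypes, and index all match.
assert cols_first.equals(rows_first)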
We can achieve the very same result by chaining Pandas operations: filter(take_cols) selects the listed columns, while query() evaluates the filtering condition given as a string. The corresponding pipes of commands are as follows:
query = 'a < 50_000 and b > 3000'

# first take columns, then filter rows
df.filter(take_cols).query(query)

# first filter rows, then take columns
df.query(query).filter(take_cols)
Since df is big, the four versions will probably differ in performance. Which will be the fastest, and which the slowest?
Let's benchmark these operations. We will use the timeit module:
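Here is a minimal sketch of how such a comparison might look with timeit, assuming the setup above; the version labels and the repetition count are my own choices, not definitive benchmark code:

import timeit

setup = '''
import pandas as pd
n = 1_000_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})
take_cols = ['a', 'b', 'g', 'n', 'x']
query = 'a < 50_000 and b > 3000'
'''

versions = {
    "columns first, step by step": (
        "subdf = df[take_cols]; "
        "subdf = subdf[subdf['a'] < 50_000]; "
        "subdf = subdf[subdf['b'] > 3000]"
    ),
    "rows first, step by step": (
        "subdf = df[df['a'] < 50_000]; "
        "subdf = subdf[subdf['b'] > 3000]; "
        "subdf = subdf[take_cols]"
    ),
    "columns first, chained": "df.filter(take_cols).query(query)",
    "rows first, chained": "df.query(query).filter(take_cols)",
}

for name, stmt in versions.items():
    # setup runs once per timeit() call; stmt runs `number` times
    t = timeit.timeit(stmt, setup=setup, number=10)
    print(f"{name}: {t / 10:.4f} s per run")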