Polars is taking the world by storm thanks to its speed, memory efficiency, and beautiful API. If you want to know how powerful it is, look no further than the DuckDB Benchmarks. And those results aren’t even using the most recent version of Polars.
For all the amazing things Polars can do, though, it has not traditionally been a better solution than Pandas for every calculation you might want to perform; there remain a few cases where Polars has not come out ahead. With the recent release of the Polars plugin system for Rust, that may no longer be true.
What exactly is a Polars plugin? It is simply a way to create your own Polars expressions in native Rust and expose them through a custom namespace. It lets you take the speed of Rust and apply it to your Polars DataFrame, performing calculations in a way that takes full advantage of the speed and built-in tooling Polars already provides.
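To make the custom-namespace idea concrete, here is a minimal sketch of what the Python side can look like. It uses pl.api.register_expr_namespace to hang extra methods off of every expression; in a real plugin the method body would dispatch to a function compiled from Rust, but a plain Polars expression stands in here so the snippet runs on its own. The namespace name seq and the method double are made up purely for illustration.

import polars as pl


@pl.api.register_expr_namespace("seq")  # hypothetical namespace name
class SequentialOps:
    def __init__(self, expr: pl.Expr) -> None:
        self._expr = expr

    def double(self) -> pl.Expr:
        # A real plugin would call into a compiled Rust function here;
        # a plain expression stands in so the sketch is runnable as-is.
        return self._expr * 2


df = pl.DataFrame({"value": [1, 2, 3]})
print(df.select(pl.col("value").seq.double()))

The point is the ergonomics: once registered, the custom expression composes with select, with_columns, group_by, and the rest of the API just like any built-in expression.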
Let’s take a look at some concrete examples.
Sequential Calculations
One area where Polars has seemed to lack functionality is operations that require knowledge of the previous values in a DataFrame. Calculations that are sequential in nature are not always easy or efficient to write with native Polars expressions. Let’s take a look at one specific example.
We have the following task: calculate the cumulative value of an array of numbers within each run, where a run is defined as a consecutive stretch of numbers that share the same sign. For example:
┌───────┬───────────┐
│ value ┆ run_value │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════════╡
│ 1 ┆ 1 │ # First run starts here
│ 2 ┆ 3 │
│ 3 ┆ 6 │
│ -1 ┆ -1 │ # Run resets here
│ -2 ┆ -3 │
│ 1 ┆ 1 │ # Run resets here
└───────┴───────────┘
So we want to have a cumulative sum of a column which resets every time the sign of the value switches from either positive to negative or negative to positive.
Let’s start with a baseline version written in Pandas.
import pandas as pd


def calculate_runs_pd(s: pd.Series) -> pd.Series:
    out = []
    is_positive = True
    current_value = 0.0
    for value in s:
        if value > 0:
            if is_positive:
                # Same positive run: keep accumulating.
                current_value += value
            else:
                # Sign flipped from negative to positive: start a new run.
                current_value = value
                is_positive = True
        else:
            if is_positive:
                # Sign flipped from positive to negative: start a new run.
                current_value = value
                is_positive = False
            else:
                # Same negative run: keep accumulating.
                current_value += value
        out.append(current_value)
    return pd.Series(out)
We iterate over the series, calculate the current value of the run at each position, and return a new Pandas Series.
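As a quick sanity check, feeding the function the values from the table above reproduces the run_value column (as floats, since current_value is initialized to 0.0):

s = pd.Series([1, 2, 3, -1, -2, 1])
print(calculate_runs_pd(s).tolist())
# [1.0, 3.0, 6.0, -1.0, -3.0, 1.0], matching run_value above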
Benchmarking
Before moving on, we are going to set up a few benchmarks. We are going to measure both execution speed and memory consumption using pytest-benchmark and pytest-memray. We will set up the problem such that we have an entity column, a time column, and a feature column. The goal is to calculate the run values for each entity in the data across time. We will set the number of entities and the number of timestamps each to 1,000, giving us a DataFrame with 1,000,000 rows.
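As a rough sketch of that setup, assuming the calculate_runs_pd function from above, a benchmark test might look something like the following. The column names entity, time, and value, the random data, and the seed are assumptions made for illustration, and the pytest-memray wiring is omitted.

import numpy as np
import pandas as pd

N_ENTITIES = 1_000
N_TIMESTAMPS = 1_000

# Hypothetical benchmark data: 1,000 entities x 1,000 time steps = 1,000,000 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "entity": np.repeat(np.arange(N_ENTITIES), N_TIMESTAMPS),
        "time": np.tile(np.arange(N_TIMESTAMPS), N_ENTITIES),
        "value": rng.normal(size=N_ENTITIES * N_TIMESTAMPS),
    }
)


def test_runs_pandas(benchmark):
    # pytest-benchmark's `benchmark` fixture times the callable; memory is
    # tracked separately when the session is run with pytest-memray enabled.
    benchmark(lambda: df.groupby("entity")["value"].apply(calculate_runs_pd))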
When we run our Pandas implementation against this benchmark using Pandas’ groupby-apply functionality, we get the following results: