PYTHON PROGRAMMING
How to inspect Pandas data frames in chained operations without breaking the chain into separate statements
Debugging lies at the heart of programming. I have written about this in a separate article.
This statement is quite general and language- and framework-independent. When you use Python for data analysis, you need to debug code irrespective of whether you’re conducting complex data analysis, writing an ML software product, or creating a Streamlit or Django app.
This article discusses debugging Pandas code, or rather one specific scenario: debugging Pandas operations that are chained into a pipe. Such debugging can be challenging. When you don’t know how to do it, chained Pandas operations seem far more difficult to debug than regular Pandas code, that is, individual Pandas operations using typical assignment with square brackets.
To debug regular Pandas code using typical assignment with square brackets, it’s enough to add a Python breakpoint and use the pdb interactive debugger. It would look something like this:
>>> import pandas as pd
>>> d = pd.DataFrame(dict(
...     x=[1, 2, 2, 3, 4],
...     y=[.2, .34, 2.3, .11, .101],
...     group=["a", "a", "b", "b", "b"]
... ))
>>> d["xy"] = d.x + d.y
>>> breakpoint()
>>> d = d[d.group == "a"]
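If you run this as a script rather than in an interactive session, the same idea might look like the sketch below. It is only an illustration: the `breakpoint()` call is commented out so the script runs unattended, and the name `filtered` is my own addition, used so that the intermediate frame `d` stays available for comparison.

```python
import pandas as pd

d = pd.DataFrame(dict(
    x=[1, 2, 2, 3, 4],
    y=[.2, .34, 2.3, .11, .101],
    group=["a", "a", "b", "b", "b"],
))

# Step 1: add the new column with regular square-bracket assignment.
d["xy"] = d.x + d.y

# Between two statements, the intermediate frame is an ordinary variable,
# so it can be inspected here. Uncomment the next line to stop in pdb;
# at the (Pdb) prompt, `p d` prints the frame.
# breakpoint()
print(d.shape)  # (5, 4)

# Step 2: keep only the rows from group "a".
# `filtered` is an illustrative name; the original reassigns `d`.
filtered = d[d.group == "a"]
print(filtered.shape)  # (2, 4)
```

The point is simply that every statement boundary is a place where you can stop and look at `d`.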
Unfortunately, you can’t do that when the code consists of chained operations, like here:
>>> d = d.assign(xy=lambda df: df.x + df.y).query("group == 'a'")
or, depending on your preference, here:
>>> d = d.assign(xy=d.x + d.y).query("group == 'a'")
In this case, there is no place to stop and look at the intermediate result; you can only do so before or after the chain. Thus, one of the solutions is to break the main chain into two sub-chains (two pipes) in a…