NumPy and pandas are major components of our data analytics stack. They are fine packages that make data munging simple, and for some problems they also contain efficient implementations of complex computations. We use them day to day without much pain. Still, we do run into their quirks from time to time, and this post is about one of them.
Let’s get started!
import numpy as np #version 1.10.4
import pandas as pd #version 0.17.1
Suppose we have two pandas Series of boolean values:
series_1 = pd.Series([True, False, np.nan])
series_2 = pd.Series([False, False, False])
We want to combine these two Series with an element-wise logical OR operation. We are most curious about the last row: what could the expression “np.nan OR False” possibly evaluate to?
One can perform an element-wise OR of two pandas Series with the | bitwise OR operator.
series_1 | series_2
It returns another Series. We will call this result Solution #1.
Next, let us use the logical_or function from NumPy, which can be applied to arrays of the same size.
np.logical_or(series_1, series_2)
We will call this result Solution #2.
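The NaN handling of np.logical_or can be reproduced without pandas. The sketch below assumes the values are held with object dtype, which is what pandas infers for a mix of booleans and np.nan; the array names left and right are illustrative and do not come from the original code. For object inputs, NumPy applies Python's own `or` to each pair of elements, so a truthy NaN on the left-hand side is handed back unchanged.

```python
import numpy as np

# Illustrative stand-ins for the values held by series_1 and series_2.
left = np.array([True, False, np.nan], dtype=object)
right = np.array([False, False, False], dtype=object)

# On object arrays, logical_or evaluates Python's `a or b` per element,
# so the result keeps the original objects instead of casting to bool.
result = np.logical_or(left, right)

print(result.dtype)   # object
print(list(result))   # the last element is still NaN
```

If this reading is right, it would explain a result in which the NaN survives the OR operation.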
Finally, let us use logical_or again, but this time inside a vectorized Python function. A vectorized function is evaluated element by element over the rows of the input arrays.
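@np.vectorize  # assumed: the wrapper is missing from the listing, but the text calls the function vectorized
def vectorized_or(a, b):
    return np.logical_or(a, b)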
For comparison, we convert the output of vectorized_or (which is a NumPy array) into a pandas Series:
pd.Series(vectorized_or(series_1, series_2))
We will call this result Solution #3.
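A minimal sketch of why the vectorized variant can behave differently, using plain NumPy and illustrative names (scalar_or, left and right are not from the original code). np.vectorize is essentially a Python-level loop that calls the wrapped function on one pair of scalars at a time. When np.logical_or receives a float NaN and a boolean as scalars, it tests each operand for truthiness, and NaN is non-zero.

```python
import numpy as np

# Hypothetical helper mirroring the wrapped function from the post.
scalar_or = np.vectorize(lambda a, b: np.logical_or(a, b))

left = np.array([True, False, np.nan], dtype=object)
right = np.array([False, False, False])

# Each call sees two scalars, for example np.logical_or(np.nan, False).
single = np.logical_or(np.nan, False)
looped = scalar_or(left, right)

print(single)
print(looped)
```

The output array is boolean here because np.vectorize infers its output type from the first call, which returns a NumPy boolean.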
It seems that “NaN or False” can be evaluated in several ways, producing at least three different results. I am not sure why this happens. All I know from this StackOverflow thread is that NaN evaluates to True when cast to boolean, because it does not equal zero. Based on this, I would have expected what happened in Solution #3. In a sense, I can also accept the behaviour of Solution #2: it seems reasonable given how np.nan propagates through numerical operations (e.g., np.nan + 123 results in np.nan). However, I cannot figure out Solution #1. If you have an explanation for it, or a deeper interpretation of the above, please share it with us.
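The truthiness rule mentioned above can be checked in plain Python, without NumPy or pandas. The snippet is only a sketch of two possible readings of “NaN or False”: casting to bool first, versus Python's short-circuiting `or`, which returns its first truthy operand unchanged.

```python
import math

nan = float("nan")

# Reading 1: cast to bool first. NaN is non-zero, so it is truthy.
as_bool = bool(nan) or False

# Reading 2: Python's `or` returns the first truthy operand itself,
# so the NaN object comes back instead of a boolean.
as_object = nan or False

print(as_bool)                 # True
print(math.isnan(as_object))   # True
```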
Anyway, we can conclude that NumPy and pandas have once again shown their unimaginable flexibility, which provides freedom for data scientists and also a bit of fun during debugging! 🙂
UPDATE: you can also check the StackOverflow discussion that I started about this.