UNSUPERVISED

Comparing logical arrays in numpy/pandas

Published on 02 May 2016

NumPy and pandas are major components of our data analytics stack. These are fine packages that make data munging simple and easy. For some problems, they also contain efficient implementations of complex computations. We use them day-to-day without much pain. That is not the whole story, though: we also have funny encounters with their quirks, and this post is about one of them.

Let’s get started!

import numpy as np    # version 1.10.4
import pandas as pd   # version 0.17.1

Suppose we have two pandas Series of boolean values:

series_1 = pd.Series([True, False, np.nan])
series_2 = pd.Series([False, False, False])
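Before comparing them, it may help to check how pandas stores these two Series. The sketch below only inspects the dtype attribute; the names dtype_1 and dtype_2 are introduced here for illustration. Because series_1 mixes booleans with np.nan (a float), pandas falls back to the generic object dtype, while series_2 is a proper bool Series.

```python
import numpy as np
import pandas as pd

series_1 = pd.Series([True, False, np.nan])
series_2 = pd.Series([False, False, False])

# dtype_1 and dtype_2 are illustrative names, not part of the original example.
dtype_1 = series_1.dtype  # mixed bool/float values are stored as Python objects
dtype_2 = series_2.dtype  # only booleans, so a proper boolean dtype

print(dtype_1, dtype_2)  # object bool
```

So the two operands are not the same kind of array, which is worth keeping in mind when reading the results of the three solutions.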

We want to combine these two Series with an element-wise logical OR operation. The last row is the most interesting one: what could the expression “np.nan OR False” possibly evaluate to?

Solution #1

One can perform an element-wise OR of two pandas Series with the | bitwise OR operator.

series_1 | series_2

It will return another Series:

0 True
1 False
2 False
dtype: bool

Solution #2

Let us now use NumPy's logical_or function, which can be applied to arrays of the same size.

np.logical_or(series_1, series_2)

This results in:

0 True
1 False
2 NaN
dtype: object
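One possible explanation, offered as a sketch rather than a definitive account: series_1 holds generic Python objects, and on object arrays np.logical_or appears to behave like Python's own `or`, which returns one of its operands rather than a strict boolean. The snippet assumes that plain NumPy arrays with dtype=object are a fair stand-in for the Series; the names left, right, result and python_or are illustrative.

```python
import math
import numpy as np

# Assumption: object arrays mimic how pandas stores the mixed bool/NaN Series.
left = np.array([True, False, np.nan], dtype=object)
right = np.array([False, False, False], dtype=object)

result = np.logical_or(left, right)

# Python's `or` gives back the first truthy operand, so NaN survives as NaN.
python_or = np.nan or False

print(result, result.dtype)
print(python_or, math.isnan(python_or))
```

If this reading is right, the NaN in the last row is simply the left operand handed back unchanged, not the outcome of a boolean computation.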

Solution #3

Finally, let us use logical_or again, but this time wrapped with np.vectorize. A function vectorized this way is called once per element, walking through the rows of the input arrays one after the other.

@np.vectorize
def vectorized_or(a, b):
    return np.logical_or(a, b)

For comparison, we convert the output of vectorized_or (which is a NumPy array) into a pandas Series:

pd.Series(vectorized_or(series_1, series_2))

This results in:

0 True
1 False
2 True
dtype: bool
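Roughly speaking, np.vectorize is a convenience wrapper around a Python-level loop, so Solution #3 can be sketched as an explicit loop over pairs of elements. This illustrates the idea only and is not how np.vectorize is implemented internally; loop_or is a hypothetical name.

```python
import numpy as np
import pandas as pd

series_1 = pd.Series([True, False, np.nan])
series_2 = pd.Series([False, False, False])

# loop_or is a hypothetical name: each pair of elements is passed to
# np.logical_or as scalars, and the outcome is converted to a plain bool.
loop_or = [bool(np.logical_or(a, b)) for a, b in zip(series_1, series_2)]

print(loop_or)  # [True, False, True]
```

In the last pair, np.logical_or receives a scalar float NaN, which is nonzero and therefore counts as True.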

Short conclusion

It seems that “NaN or False” can be evaluated in several ways, producing at least three different results. I am not sure why each of these happens. All I know from this StackOverflow thread is that NaN evaluates to True when cast to boolean, because it is not equal to zero. Based on this, I would expect something analogous to Solution #3. In a sense I can also accept the behaviour of Solution #2: it seems reasonable given how np.nan propagates through numerical operations (e.g., np.nan + 123 results in np.nan). However, I cannot figure out Solution #1. If you have an idea about it, or a deeper interpretation of the results above, please share it with us.
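The scalar facts referred to above are easy to check in isolation. This is only a sanity check of the individual ingredients and does not by itself explain the Series results; the variable names are illustrative.

```python
import math
import numpy as np

# Illustrative names; each line checks one scalar property of NaN.
nan_as_bool = bool(np.nan)        # NaN is not equal to zero, so it is truthy
nan_not_zero = np.nan != 0        # comparison with zero
nan_plus_number = np.nan + 123    # arithmetic with NaN yields NaN

print(nan_as_bool, nan_not_zero, nan_plus_number)  # True True nan
```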

Anyway, we can conclude that NumPy and pandas have once again shown their unimaginable flexibility, which gives data scientists freedom and also a bit of fun during debugging! 🙂

UPDATE: you can also check the StackOverflow discussion about this that I have started.

by Arpad Fulop

Árpád is a data scientist at Balabit working on Privileged Account Analytics, part of Balabit's PAM solution. He applies machine learning and other analytical methods to computer network data in order to detect anomalies and discover security issues.
