UNSUPERVISED

Magic Trick with Pandas

Published on 08 November 2016

This post is deeply technical: it contains several lines of code that may interest people who use the pandas package in Python.

I would like to share an interesting example of a pandas dataframe transformation that our team came across. For simplicity, I show it step by step, as copied from a Jupyter notebook, so you can repeat the experiment if you wish.

In [1]:
import pandas as pd

Step 0.

Let’s create a dataframe.

In [2]:
df = pd.DataFrame([[1,2], [3,4]], columns=['a', 'b'])
In [3]:
df
Out[3]:
   a  b
0  1  2
1  3  4

Step 1.

Check the element in the column ‘a’ in the row indexed by 0 and then modify it.

In [4]:
df['a'].ix[0]
Out[4]:
1
In [5]:
df['a'].ix[0] = 10
In [6]:
df
Out[6]:
    a  b
0  10  2
1   3  4

Great, everything works as expected.
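For readers on a recent pandas, note that the `.ix` indexer was deprecated in pandas 0.20 and removed in 1.0, so the cells above will not run there. The pattern `df['a'].ix[0] = 10` is also a chained assignment: it first selects a column and then assigns into the result. A minimal sketch of the single-step form, assuming label-based access with `.loc` is an acceptable substitute for what Step 1 intends:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

# One indexing operation on the dataframe itself:
# row label 0, column label 'a'.
df.loc[0, 'a'] = 10

print(df)
```

Because the assignment targets the dataframe directly, there is no intermediate column object that could drift apart from it.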

Step 2.

Let’s modify another value, for example the element in column ‘a’ in the row indexed by 3. (Note that this element does not exist yet!)

In [7]:
df['a'].ix[3] = 100
In [8]:
df
Out[8]:
    a  b
0  10  2
1   3  4

There is no error message and our dataframe hasn’t changed.
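If the intent of Step 2 were really to add a row labelled 3, a single `.loc` assignment on the dataframe would enlarge the dataframe itself. The following is only a sketch of that alternative, not what the notebook above does; the missing value in column ‘b’ is filled with NaN.

```python
import pandas as pd

df = pd.DataFrame([[10, 2], [3, 4]], columns=['a', 'b'])

# Assigning to a row label that does not exist yet adds the row
# to the dataframe ("setting with enlargement").
df.loc[3, 'a'] = 100

print(df)
```

Here the new row is visible when the dataframe is printed, because the dataframe, not a column taken from it, was the target of the assignment.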

Step 3.

Let’s modify a third value, this time an existing one: the element in column ‘a’ in the row indexed by 1.

In [9]:
df['a'].ix[1] = 200
In [10]:
df
Out[10]:
    a  b
0  10  2
1   3  4

The printed dataframe is the same, so the modification appears to have had no effect.

Step 4.

Let’s print the column ‘a’.

In [11]:
df['a']
Out[11]:
0     10
1    200
3    100
Name: a, dtype: int64

Something really weird is happening: the column returned by df['a'] differs from the column ‘a’ shown when we print the dataframe. Can we print the column as it appears in the printed dataframe?

In [12]:
df[['a']]['a']
Out[12]:
0    10
1     3
Name: a, dtype: int64
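The comparison made by hand in Step 4 can be wrapped in a small check. This is a sketch only: `column_matches_frame` is a hypothetical helper name, and it simply compares the two access paths used above.

```python
import pandas as pd


def column_matches_frame(frame, name):
    """Hypothetical helper: compare df[name] with df[[name]][name]."""
    direct = frame[name]            # column accessed directly
    via_subframe = frame[[name]][name]  # column taken from a one-column sub-dataframe
    return direct.equals(via_subframe)


df = pd.DataFrame([[10, 2], [3, 4]], columns=['a', 'b'])
print(column_matches_frame(df, 'a'))
```

On a dataframe in a consistent state the two paths agree; in the state reached after Step 3 they would differ, since one has three entries and the other two.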

Step 5.

The final trick. Let’s create a new column called ‘c’ containing the sum of columns ‘a’ and ‘b’, and then another column called ‘d’ with exactly the same definition.

In [13]:
df['c'] = df['a'] + df['b']
df['d'] = df['a'] + df['b']
In [14]:
df
Out[14]:
    a  b    c   d
0  10  2   12  12
1   3  4  204   7

Columns ‘c’ and ‘d’ are not equal, although their definition was the same, and nothing has changed in between.
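One way to make two columns with the same definition identical by construction is to compute the sum once and reuse it. A minimal sketch, assuming a dataframe in a consistent state and using `DataFrame.assign`, which returns a new dataframe:

```python
import pandas as pd

df = pd.DataFrame([[10, 2], [3, 4]], columns=['a', 'b'])

# Evaluate the expression a single time ...
total = df['a'] + df['b']

# ... and attach the same values under both names.
df = df.assign(c=total, d=total)

print(df)
```

Since both columns come from one evaluated Series, they cannot differ from each other, whatever happens to df['a'] between two separate evaluations.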

Explanation

The result in Step 5 is puzzling. What could have caused two columns with exactly the same definition to be different?

It seems that in Step 2, when we modified an element by referring to a non-existing index, df['a'] started to live an independent life: it became an object that had nothing to do with the dataframe df any more. That is why the assignment in Step 3 changed df['a'] but not the printed dataframe. In Step 5, however, when we created column ‘c’, the original dataframe was replaced by a new one (having three columns), and the independent df['a'] disappeared. The first line of Step 5 therefore still computes ‘c’ from the detached values (200 + 4 = 204), but once it has run, df['a'] is ‘attached back’ to the dataframe and is again equal to column ‘a’ of df, which is what the second line uses for ‘d’ (3 + 4 = 7).
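The idea of a column "living an independent life" can be illustrated with an explicit copy. This is an analogy under stated assumptions, not a reproduction of the internal mechanism behind the steps above: here the detachment is deliberate (`.copy()`), and the variable name `col` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

# An explicit copy is a Series that no longer belongs to df.
col = df['a'].copy()

# Assigning to a new label enlarges the detached Series ...
col.loc[3] = 100
# ... and changing an existing label only touches the copy.
col.loc[1] = 200

print(col)
print(df)
```

The Series ends up with three entries while the dataframe keeps its two rows and its original values, which mirrors the mismatch seen between Step 4 and the printed dataframe.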

by Eszter Windhager-Pokol
