This post is deeply technical and contains several lines of code that may interest people who use the pandas package in Python.

I would like to share an interesting example of a pandas dataframe transformation found by our team. For simplicity, I will show it step by step, copied from a Jupyter notebook, so you can repeat the experiment if you wish.

In [1]:
import pandas as pd


#### Step 0.

Let’s create a dataframe.

In [2]:
df = pd.DataFrame([[1,2], [3,4]], columns=['a', 'b'])

In [3]:
df

Out[3]:
   a  b
0  1  2
1  3  4

#### Step 1.

Check the element in the column ‘a’ in the row indexed by 0 and then modify it.

In [4]:
df['a'].ix[0]

Out[4]:
1

In [5]:
df['a'].ix[0] = 10

In [6]:
df

Out[6]:
    a  b
0  10  2
1   3  4

Great, everything works as expected.
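
A side note for readers on a recent pandas: `.ix` was deprecated in pandas 0.20 and removed in pandas 1.0, so the cells in this post only run on old versions. A minimal sketch of the same read and write on a current version, assuming label-based access is what is intended, uses a single `.loc` call (the names `before` and `after` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

# Read the element in row 0, column 'a' with one label-based lookup.
before = df.loc[0, 'a']

# Write through a single .loc call so that pandas modifies df itself,
# not an intermediate Series returned by df['a'].
df.loc[0, 'a'] = 10
after = df.loc[0, 'a']
```

Addressing the row and the column in one indexing step avoids the chained `df['a'][...]` pattern used in this experiment.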

#### Step 2.

Let’s modify another value, for example the element in column ‘a’ in the row indexed by 3. (Note that this element does not exist yet!)

In [7]:
df['a'].ix[3] = 100

In [8]:
df

Out[8]:
    a  b
0  10  2
1   3  4

There is no error message and our dataframe hasn’t changed.
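
For comparison, here is a sketch of what a single `.loc` assignment to a missing row label does on a recent pandas version: rather than leaving the dataframe untouched, it appends the row (the documentation calls this "setting with enlargement"). The variable names are illustrative, and the starting values mirror the dataframe above.

```python
import pandas as pd

df = pd.DataFrame([[10, 2], [3, 4]], columns=['a', 'b'])

# Row label 3 does not exist, so .loc enlarges the dataframe with a new row.
df.loc[3, 'a'] = 100

row_labels = df.index.tolist()
# Column 'b' has no value for the new row, so pandas fills it with NaN.
new_b_is_missing = pd.isna(df.loc[3, 'b'])
```

Here the change is at least visible when the dataframe is printed, which makes an accidental enlargement easier to spot.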

#### Step 3.

Let’s modify a third value, this time an existing one: the element in column ‘a’ in the row indexed by 1.

In [9]:
df['a'].ix[1] = 200

In [10]:
df

Out[10]:
    a  b
0  10  2
1   3  4

The printed dataframe is unchanged, so the modification appears to have failed.

#### Step 4.

Let’s print the column ‘a’.

In [11]:
df['a']

Out[11]:
0     10
1    200
3    100
Name: a, dtype: int64

Something really weird is happening: column ‘a’ printed on its own differs from column ‘a’ shown in the printed dataframe. Can we get hold of the column as it appears in the printed dataframe?

In [12]:
df[['a']]['a']

Out[12]:
0    10
1     3
Name: a, dtype: int64
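
In the session above the two access paths clearly disagree. A small sanity check along these lines can flag that kind of divergence programmatically. This is only a sketch with illustrative names, and on a healthy dataframe (such as this freshly built one) the two results should match:

```python
import pandas as pd

df = pd.DataFrame([[10, 2], [3, 4]], columns=['a', 'b'])

via_getitem = df['a']          # direct column lookup
via_subframe = df[['a']]['a']  # build a one-column dataframe first, then select

# Series.equals compares index labels, values and dtype.
columns_agree = via_getitem.equals(via_subframe)
```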

#### Step 5.

The final trick: let’s create a new column ‘c’ containing the sum of columns ‘a’ and ‘b’, and then another column ‘d’ with exactly the same definition.

In [13]:
df['c'] = df['a'] + df['b']
df['d'] = df['a'] + df['b']

In [14]:
df

Out[14]:
    a  b    c   d
0  10  2   12  12
1   3  4  204   7

Columns ‘c’ and ‘d’ are not equal, although their definition was the same, and nothing has changed in between.
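
The values in column ‘c’ are consistent with how pandas aligns on index labels. The sketch below assumes the detached column really holds the three values printed in Step 4. The names `detached_a`, `b` and `total` are illustrative and stand in for whatever objects pandas held internally at that point.

```python
import pandas as pd

# Stand-ins for the detached column 'a' (labels 0, 1, 3) and the regular column 'b'.
detached_a = pd.Series([10, 200, 100], index=[0, 1, 3], name='a')
b = pd.Series([2, 4], index=[0, 1], name='b')

# Addition aligns on labels: label 3 has no partner in b, so the result is NaN there.
total = detached_a + b

# Assigning a Series to a column aligns it to the dataframe's index,
# so the extra label 3 is silently dropped.
df = pd.DataFrame({'a': [10, 3], 'b': [2, 4]})
df['c'] = total
```

Under this assumption, row 1 of ‘c’ receives 200 + 4 = 204, while ‘d’, computed from the column stored in the dataframe, receives 3 + 4 = 7.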

#### Explanation

The result in Step 5 is embarrassing. What could have caused two columns with exactly the same definition to be different?

It seems that in Step 2, when we modified an element by referring to a non-existent index, df['a'] started to live an independent life: it became an object that no longer had anything to do with the dataframe df. In Step 5, however, when we created column ‘c’, the original dataframe was replaced by a new one (with three columns), and the independent df['a'] disappeared. From the first line of Step 5 onward, df['a'] is ‘attached back’ to the dataframe and is equal to column ‘a’ of df.
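
One possible way to guard against this kind of surprise is to refuse writes to labels that do not exist and to assign through a single `.loc` call. The helper `set_existing` below is hypothetical (it is not part of pandas), and the sketch assumes that a missing label should be treated as an error rather than as a request to add a row.

```python
import pandas as pd


def set_existing(frame, row, col, value):
    """Hypothetical helper: assign only when both labels already exist."""
    if row not in frame.index:
        raise KeyError(f"row label {row!r} not found")
    if col not in frame.columns:
        raise KeyError(f"column label {col!r} not found")
    # Single-step label-based assignment on the dataframe itself.
    frame.loc[row, col] = value


df = pd.DataFrame([[10, 2], [3, 4]], columns=['a', 'b'])

# An existing element is updated in place.
set_existing(df, 1, 'a', 200)

# A missing row label raises instead of being silently accepted.
try:
    set_existing(df, 3, 'a', 100)
    missing_row_rejected = False
except KeyError:
    missing_row_rejected = True
```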