This post is deeply technical and contains several lines of code that could be of interest to people who use the pandas package in Python.
I would like to share an interesting example of a pandas dataframe transformation found by our team. For the sake of simplicity, I will show it step by step, copied from a Jupyter notebook, so you can repeat my experiment if you wish.
import pandas as pd
Let’s create a dataframe
df = pd.DataFrame([[1,2], [3,4]], columns=['a', 'b'])
Check the element in the column ‘a’ in the row indexed by 0 and then modify it.
df['a'].ix[0] = 10
Great, everything works as expected.
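A note for readers repeating this today: the `.ix` indexer was deprecated in pandas 0.20 and removed in pandas 1.0, so the line above fails on current versions. The sketch below shows the same read-then-modify step written with `.loc` on the dataframe itself. The variable name `before` is only for illustration, and this is one way to write the step, not the code from the original notebook.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

# Read the element in column 'a' at row label 0.
before = df.loc[0, 'a']

# Write through the dataframe itself, giving row and column in one indexer,
# so there is no intermediate df['a'] object involved.
df.loc[0, 'a'] = 10

print(before)
print(df)
```

Because the row and the column are passed together, pandas knows the target is the dataframe and not a temporary Series.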
Let’s modify another value, for example the element in the column ‘a’ in the row indexed by 3. (Note that this element does not exist yet!)
df['a'].ix[3] = 100
There is no error message and our dataframe hasn’t changed.
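For contrast, here is a minimal sketch of what happens when the same assignment to a missing row label goes through the dataframe's own `.loc` indexer. In that case pandas enlarges the dataframe visibly instead of leaving it untouched. This is an illustration on a fresh dataframe, not a reproduction of the behaviour described above, which depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

# Row label 3 does not exist yet. Assigning through df.loc adds the row
# to the dataframe; the untouched column 'b' gets a missing value there.
df.loc[3, 'a'] = 100

print(df)
print(list(df.index))
```

The new row shows up when the dataframe is printed, so the enlargement cannot go unnoticed.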
Let’s modify a third value, this time an existing one: the element in the column ‘a’ in the row indexed by 1.
df['a'].ix[1] = 200
The dataframe is the same, so the modification was not successful.
Let’s print the column ‘a’.
0     10
1    200
3    100
Name: a, dtype: int64
Something really weird is happening: the column ‘a’ shown when we print the dataframe is different from what df['a'] returns. Can we print the column as it appears in the printed dataframe?
0    10
1     3
Name: a, dtype: int64
The final trick. Let’s create a new column called ‘c’ containing the sum of columns ‘a’ and ‘b’. And then create another column called ‘d’ with exactly the same definition.
Columns ‘c’ and ‘d’ are not equal, although their definition was the same, and nothing has changed in between.
The result in Step 5 is baffling. What could have caused two columns with exactly the same definition to differ?
It seems that in Step 2, when we modified an element referring to a non-existing index, df['a'] started to live an independent life: it became an object which had nothing to do with the dataframe df. But at Step 5, when we created column ‘c’, the original dataframe was replaced by a new one (having three columns), and the independent df['a'] disappeared. From the first line of Step 5 onwards, df['a'] is ‘attached back’ to the dataframe and is equal to the column ‘a’ of df.
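The idea of a column living an independent life can be made explicit with a deliberate copy. The sketch below is only an analogy: it detaches the column on purpose with `.copy()`, which is not necessarily what pandas did internally in the experiment above, and the name `detached` is hypothetical. It also shows that when nothing detached is involved, the two identically defined columns agree.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

# An explicit copy of column 'a' is a separate object from the dataframe.
detached = df['a'].copy()

# Enlarging the copy adds row label 3 to the copy only.
detached.loc[3] = 100

print(len(detached), len(df))

# Both columns are computed from the dataframe's own 'a' and 'b',
# so the same definition gives the same result twice.
df['c'] = df['a'] + df['b']
df['d'] = df['a'] + df['b']

print(df['c'].equals(df['d']))
```

Keeping every write on the dataframe (for example through `df.loc[row, column]`) is one way to avoid ending up with two different versions of the same column.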