import pandas as pd
pd.options.mode.copy_on_write = True
pd.options.future.infer_string = True
pd.options.plotting.backend = "plotly"
Functional data management: Example
The imperative way
Note that this code achieves the same result as our functional code and already follows many best practices: good variable names, the right pandas functions for each task, efficient dtypes, and modern pandas features.
Yet, we still think that this is not good code.
df = pd.read_csv("survey.csv")
new_names = {
    "Q001": "coding_genius",
    "Q002": "learned_a_lot",
    "Q003": "favorite_language",
}
df = df.rename(columns=new_names)
# clean the two variables with agreement scale
for var in ["coding_genius", "learned_a_lot"]:
    df[var] = df[var].replace({"-77": pd.NA, "-99": pd.NA})
    categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    df[var] = df[var].astype(dtype)
# clean the favourite language variable
df["favorite_language"] = df["favorite_language"].replace({"-77": pd.NA, "-99": pd.NA})
df["favorite_language"] = df["favorite_language"].str.lower().str.strip()
df["favorite_language"] = df["favorite_language"].replace("ypthon", "python")
df["favorite_language"] = df["favorite_language"].astype(pd.CategoricalDtype())
df
| | coding_genius | learned_a_lot | favorite_language |
|---|---|---|---|
| 0 | strongly disagree | agree | python |
| 1 | strongly agree | strongly agree | python |
| 2 | NaN | disagree | r |
| 3 | agree | NaN | python |
| 4 | NaN | NaN | python |
| 5 | NaN | strongly agree | python |
| 6 | neutral | strongly agree | python |
| 7 | disagree | agree | python |
| 8 | strongly agree | NaN | python |
| 9 | agree | NaN | python |
Problems with the imperative way
- The variables inside `df` change many times but keep their name.
- There are many invalid intermediate states of `df` where variables already have their final names. This is especially dangerous if code is spread across multiple cells.
- The global namespace is cluttered with helper variables like `var`, `categories`, and `dtype`.
- Since the code has no natural structure, we need comments to get some orientation.
- Since we have no other way of re-using code, the two agreement questions have to be cleaned at the same time, whether they are related or not.
- We either had to repeat the name `favorite_language` multiple times or use hard-to-read and hard-to-debug method chaining.
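To make the point about invalid intermediate states concrete, here is a minimal sketch. It uses a tiny hypothetical in-memory DataFrame instead of `survey.csv` (the three values are made up for illustration) and stops the imperative script right after the renaming step.

```python
import pandas as pd

# Hypothetical stand-in for survey.csv with a single question.
df = pd.DataFrame({"Q001": ["agree", "-77", "strongly agree"]})

# First step of the imperative script: the column gets its final name ...
df = df.rename(columns={"Q001": "coding_genius"})

# ... but it still contains raw missing-value codes. If the cleaning cell is
# skipped or run later, anything that uses df at this point works with dirty
# data even though the column name suggests it is already clean.
has_raw_codes = df["coding_genius"].isin(["-77", "-99"]).any()
print(has_raw_codes)
```

Nothing in the name `coding_genius` tells a reader, or a later cell, whether the cleaning has already happened.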
The functional way
def clean_data(raw):
    df = pd.DataFrame(index=raw.index)
    df["coding_genius"] = clean_agreement_scale(raw["Q001"])
    df["learned_a_lot"] = clean_agreement_scale(raw["Q002"])
    df["favorite_language"] = clean_favorite_language(raw["Q003"])
    return df


def clean_agreement_scale(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    return sr.astype(dtype)


def clean_favorite_language(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    sr = sr.str.lower().str.strip()
    sr = sr.replace("ypthon", "python")
    return sr.astype(pd.CategoricalDtype())


raw = pd.read_csv("survey.csv")
df = clean_data(raw)
df
| | coding_genius | learned_a_lot | favorite_language |
|---|---|---|---|
| 0 | strongly disagree | agree | python |
| 1 | strongly agree | strongly agree | python |
| 2 | NaN | disagree | r |
| 3 | agree | NaN | python |
| 4 | NaN | NaN | python |
| 5 | NaN | strongly agree | python |
| 6 | neutral | strongly agree | python |
| 7 | disagree | agree | python |
| 8 | strongly agree | NaN | python |
| 9 | agree | NaN | python |
Advantages of the functional way
- The function names clearly tell us what is happening in the code, so there is no need for comments.
- Inside each function, `sr` is a perfectly fine name, so we save a lot of typing and clutter.
- There is no intermediate version of `df`.
- There is no way of executing this code in the wrong order, even though we can spread the function definitions across many cells.
- We can re-use the code for cleaning agreement variables very easily and wherever we want.
- All of our functions are pure and testable with tiny examples where we know the correct result.
- The top-level function serves as a table of contents for what comes next. This is why it is defined before the functions it calls.
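As a rough illustration of the testability point, the two cleaning functions can be checked with hand-written Series for which we know the correct result. The test function names and the example values below are our own assumptions; the cleaning functions are copied unchanged from above so that the block runs on its own.

```python
import pandas as pd


def clean_agreement_scale(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    return sr.astype(dtype)


def clean_favorite_language(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    sr = sr.str.lower().str.strip()
    sr = sr.replace("ypthon", "python")
    return sr.astype(pd.CategoricalDtype())


def test_clean_agreement_scale():
    raw = pd.Series(["agree", "-77", "strongly disagree", "-99"])
    cleaned = clean_agreement_scale(raw)
    # Missing-value codes become missing; valid answers are kept.
    assert cleaned.isna().tolist() == [False, True, False, True]
    assert cleaned.dropna().tolist() == ["agree", "strongly disagree"]
    assert cleaned.dtype.ordered
    # The function is pure: the input Series is left untouched.
    assert raw.tolist() == ["agree", "-77", "strongly disagree", "-99"]


def test_clean_favorite_language():
    raw = pd.Series([" Python", "ypthon", "R ", "-99"])
    cleaned = clean_favorite_language(raw)
    # Case, whitespace, and the known typo are normalized.
    assert cleaned.dropna().tolist() == ["python", "python", "r"]
    assert cleaned.isna().tolist() == [False, False, False, True]


test_clean_agreement_scale()
test_clean_favorite_language()
```

In a real project these functions would typically live in a test module and be collected by a test runner such as pytest rather than being called by hand.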