import pandas as pd

pd.options.mode.copy_on_write = True
pd.options.future.infer_string = True
pd.options.plotting.backend = "plotly"

Functional data management: Example

The imperative way

Note that this code achieves the same result as our functional code and already follows many best practices: good variable names, the right pandas functions for each task, efficient dtypes, and modern pandas features.

Yet we still think that this is not good code.

df = pd.read_csv("survey.csv")

new_names = {
    "Q001": "coding_genius",
    "Q002": "learned_a_lot",
    "Q003": "favorite_language",
}
df = df.rename(columns=new_names)

# clean the two variables with agreement scale
for var in ["coding_genius", "learned_a_lot"]:
    df[var] = df[var].replace({"-77": pd.NA, "-99": pd.NA})
    categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    df[var] = df[var].astype(dtype)

# clean the favorite language variable
df["favorite_language"] = df["favorite_language"].replace({"-77": pd.NA, "-99": pd.NA})
df["favorite_language"] = df["favorite_language"].str.lower().str.strip()
df["favorite_language"] = df["favorite_language"].replace("ypthon", "python")
df["favorite_language"] = df["favorite_language"].astype(pd.CategoricalDtype())
df
       coding_genius   learned_a_lot  favorite_language
0  strongly disagree           agree             python
1     strongly agree  strongly agree             python
2                NaN        disagree                  r
3              agree             NaN             python
4                NaN             NaN             python
5                NaN  strongly agree             python
6            neutral  strongly agree             python
7           disagree           agree             python
8     strongly agree             NaN             python
9              agree             NaN             python

Problems with the imperative way

  • The variables inside df change many times but keep their name

  • There are many invalid intermediate states of df in which variables already have their final names but not yet their final contents. This is especially dangerous if code is spread across multiple cells.

  • The global namespace is cluttered with helper variables like var, categories, and dtype

  • Since the code has no natural structure, we need comments to get some orientation

  • Since we have no other way of re-using code, the two agreement questions have to be cleaned at the same time, whether they are related or not

  • We either had to repeat the name favorite_language multiple times or use method chaining, which is hard to read and hard to debug
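The point about invalid intermediate states can be made concrete with a small sketch. The two-row frame below is invented for illustration and merely stands in for survey.csv; only the renaming step of the imperative script has been run.

```python
import pandas as pd

# Hypothetical stand-in for survey.csv; the values are invented for illustration.
df = pd.DataFrame({"Q001": ["agree", "-77"], "Q003": [" Ypthon", "R"]})

# First step of the imperative script: only the renaming has run so far.
df = df.rename(columns={"Q001": "coding_genius", "Q003": "favorite_language"})

# The columns already carry their final names but still hold raw values.
has_final_names = list(df.columns) == ["coding_genius", "favorite_language"]
still_has_missing_code = bool((df["coding_genius"] == "-77").any())
still_has_typo = " Ypthon" in df["favorite_language"].tolist()
print(has_final_names, still_has_missing_code, still_has_typo)
```

Any cell that reads df at this point sees a column called coding_genius that still contains the missing-value code "-77", and nothing in the name of the column or of the DataFrame reveals that.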

The functional way

def clean_data(raw):
    df = pd.DataFrame(index=raw.index)
    df["coding_genius"] = clean_agreement_scale(raw["Q001"])
    df["learned_a_lot"] = clean_agreement_scale(raw["Q002"])
    df["favorite_language"] = clean_favorite_language(raw["Q003"])
    return df


def clean_agreement_scale(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    return sr.astype(dtype)


def clean_favorite_language(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    sr = sr.str.lower().str.strip()
    sr = sr.replace("ypthon", "python")
    return sr.astype(pd.CategoricalDtype())


raw = pd.read_csv("survey.csv")
df = clean_data(raw)
df
       coding_genius   learned_a_lot  favorite_language
0  strongly disagree           agree             python
1     strongly agree  strongly agree             python
2                NaN        disagree                  r
3              agree             NaN             python
4                NaN             NaN             python
5                NaN  strongly agree             python
6            neutral  strongly agree             python
7           disagree           agree             python
8     strongly agree             NaN             python
9              agree             NaN             python
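
That both versions produce the same result can also be checked in code rather than by comparing the printed tables. The following is a minimal sketch under some assumptions: the three-row frame is invented and stands in for survey.csv, and clean_imperatively is a hypothetical wrapper that simply bundles the steps of the imperative script so that it can be called on the same input. The cleaning functions are copied from above so that the block runs on its own.

```python
import pandas as pd

# Invented rows standing in for survey.csv (an assumption, not the real data).
raw = pd.DataFrame(
    {
        "Q001": ["strongly disagree", "-77", "agree"],
        "Q002": ["agree", "disagree", "-99"],
        "Q003": ["Python ", "R", "ypthon"],
    }
)


def clean_imperatively(raw):
    # Hypothetical wrapper around the steps of the imperative script.
    new_names = {"Q001": "coding_genius", "Q002": "learned_a_lot", "Q003": "favorite_language"}
    df = raw.rename(columns=new_names)
    for var in ["coding_genius", "learned_a_lot"]:
        df[var] = df[var].replace({"-77": pd.NA, "-99": pd.NA})
        categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
        dtype = pd.CategoricalDtype(categories=categories, ordered=True)
        df[var] = df[var].astype(dtype)
    df["favorite_language"] = df["favorite_language"].replace({"-77": pd.NA, "-99": pd.NA})
    df["favorite_language"] = df["favorite_language"].str.lower().str.strip()
    df["favorite_language"] = df["favorite_language"].replace("ypthon", "python")
    df["favorite_language"] = df["favorite_language"].astype(pd.CategoricalDtype())
    return df


def clean_data(raw):
    df = pd.DataFrame(index=raw.index)
    df["coding_genius"] = clean_agreement_scale(raw["Q001"])
    df["learned_a_lot"] = clean_agreement_scale(raw["Q002"])
    df["favorite_language"] = clean_favorite_language(raw["Q003"])
    return df


def clean_agreement_scale(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    return sr.astype(dtype)


def clean_favorite_language(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    sr = sr.str.lower().str.strip()
    sr = sr.replace("ypthon", "python")
    return sr.astype(pd.CategoricalDtype())


imperative = clean_imperatively(raw)
functional = clean_data(raw)

# Compare column by column: same names, same values, same categorical dtypes.
assert list(imperative.columns) == list(functional.columns)
for name in functional.columns:
    pd.testing.assert_series_equal(imperative[name], functional[name])
```

The comparison runs column by column with pd.testing.assert_series_equal, which raises an AssertionError with a readable diff if values or dtypes differ.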

Advantages of the functional way

  • The function names clearly tell us what is happening in the code, so there is no need for comments

  • Inside each function, sr is a perfectly fine name, so we save a lot of typing and clutter

  • There is no intermediate version of df

  • There is no way of executing this code in the wrong order, even though we can spread the function definitions across many cells

  • We can re-use the code for cleaning agreement variables very easily and wherever we want

  • All of our functions are pure and testable with tiny examples where we know the correct result

  • The top-level function serves as a table of contents for what comes next. This is why it is defined before the functions it calls.
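
The testability point can be illustrated with a few tiny tests. This is a sketch, not the authors' test suite: the test names and the input values are made up, and the two cleaning functions are copied from above so that the block runs on its own. The tests are written in the style that pytest collects, but they are also called directly at the end.

```python
import pandas as pd


def clean_agreement_scale(sr):
    # Copied from the section above.
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    return sr.astype(dtype)


def clean_favorite_language(sr):
    # Copied from the section above.
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    sr = sr.str.lower().str.strip()
    sr = sr.replace("ypthon", "python")
    return sr.astype(pd.CategoricalDtype())


def test_missing_codes_become_missing_values():
    cleaned = clean_agreement_scale(pd.Series(["agree", "-77", "-99"]))
    assert cleaned.isna().tolist() == [False, True, True]


def test_agreement_scale_is_ordered():
    cleaned = clean_agreement_scale(pd.Series(["strongly agree", "disagree"]))
    assert cleaned.min() == "disagree"
    assert cleaned.max() == "strongly agree"


def test_language_case_whitespace_and_typo_are_fixed():
    cleaned = clean_favorite_language(pd.Series([" Python", "ypthon ", "R", "-99"]))
    assert cleaned.tolist()[:3] == ["python", "python", "r"]
    assert cleaned.isna().tolist() == [False, False, False, True]


def test_input_is_not_modified():
    # A pure function must leave its argument untouched.
    raw = pd.Series(["agree", "-77"])
    clean_agreement_scale(raw)
    assert raw.tolist() == ["agree", "-77"]


# pytest would collect the test_* functions; calling them directly also works.
for test in (
    test_missing_codes_become_missing_values,
    test_agreement_scale_is_ordered,
    test_language_case_whitespace_and_typo_are_fixed,
    test_input_is_not_modified,
):
    test()
```

Each test builds its own three- or four-element Series, so the expected result can be worked out by hand and no survey file is needed.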