Functional data management: Example


The imperative way

The following code achieves the same result as our functional code and already follows many best practices: good variable names, the right pandas functions for each task, efficient dtypes, and modern pandas features.

Yet we still think that this is not good code:

import pandas as pd

df = pd.read_csv("survey.csv")

new_names = {
    "Q001": "coding_genius",
    "Q002": "learned_a_lot",
    "Q003": "favorite_language",
}
df = df.rename(columns=new_names)

# clean the two variables with agreement scale
for var in ["coding_genius", "learned_a_lot"]:
    df[var] = df[var].replace({"-77": pd.NA, "-99": pd.NA})
    categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    df[var] = df[var].astype(dtype)

# clean the favorite language variable
df["favorite_language"] = df["favorite_language"].replace({"-77": pd.NA, "-99": pd.NA})
df["favorite_language"] = df["favorite_language"].str.lower().str.strip()
df["favorite_language"] = df["favorite_language"].replace("ypthon", "python")
df["favorite_language"] = df["favorite_language"].astype(pd.CategoricalDtype())
df

	coding_genius	learned_a_lot	favorite_language
0	strongly disagree	agree	python
1	strongly agree	strongly agree	python
2	NaN	disagree	r
3	agree	NaN	python
4	NaN	NaN	python
5	NaN	strongly agree	python
6	neutral	strongly agree	python
7	disagree	agree	python
8	strongly agree	NaN	python
9	agree	NaN	python

Problems with the imperative way

The variables inside df change many times but keep their names.
There are many invalid intermediate states of df in which variables already have their final names but not yet their final values. This is especially dangerous if the code is spread across multiple cells.
The global namespace is cluttered with helper variables like var, categories, and dtype.
Since the code has no natural structure, we need comments to get some orientation.
Since we have no other way of re-using code, the two agreement questions have to be cleaned in the same loop, whether they are related or not.
We either have to repeat the name favorite_language many times or resort to method chaining, which is hard to read and debug.
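
For comparison, the method-chaining alternative mentioned in the last point might look roughly like the sketch below. The small inline Series is a made-up stand-in for the Q003 column of survey.csv, and the names raw_language and favorite_language are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the raw Q003 column (not the real survey data)
raw_language = pd.Series([" Python", "ypthon", "R", "-77"])

# The same cleaning steps as in the imperative code, written as one chain.
# The column name is typed only once, but a failure in the middle of the
# chain is harder to inspect because no intermediate result has a name.
favorite_language = (
    raw_language
    .replace({"-77": pd.NA, "-99": pd.NA})  # missing-value codes
    .str.lower()
    .str.strip()
    .replace("ypthon", "python")  # fix a known typo
    .astype(pd.CategoricalDtype())
)
```

The chain avoids the repeated assignments, but it still lives in the global namespace and cannot be re-used for another column without copying it.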

The functional way

def clean_data(raw):
    df = pd.DataFrame(index=raw.index)
    df["coding_genius"] = clean_agreement_scale(raw["Q001"])
    df["learned_a_lot"] = clean_agreement_scale(raw["Q002"])
    df["favorite_language"] = clean_favorite_language(raw["Q003"])
    return df


def clean_agreement_scale(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    return sr.astype(dtype)


def clean_favorite_language(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    sr = sr.str.lower().str.strip()
    sr = sr.replace("ypthon", "python")
    return sr.astype(pd.CategoricalDtype())


raw = pd.read_csv("survey.csv")
df = clean_data(raw)
df

	coding_genius	learned_a_lot	favorite_language
0	strongly disagree	agree	python
1	strongly agree	strongly agree	python
2	NaN	disagree	r
3	agree	NaN	python
4	NaN	NaN	python
5	NaN	strongly agree	python
6	neutral	strongly agree	python
7	disagree	agree	python
8	strongly agree	NaN	python
9	agree	NaN	python

Advantages of the functional way

The function names clearly tell us what is happening in the code, so there is no need for comments.
Inside each function, sr is a perfectly fine name, so we save a lot of typing and clutter.
There is no invalid intermediate version of df: each column is assigned once, in its final form.
There is no way of executing this code in the wrong order, even though we can spread the function definitions across many cells.
We can re-use the code for cleaning agreement variables very easily and wherever we want.
All of our functions are pure and can be tested with tiny examples where we know the correct result.
The top-level function serves as a table of contents for what comes next. This is why it is defined before the functions it calls.
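
To illustrate the testability point, here is a minimal sketch of such tests. The two cleaning functions are copied from above so the example is self-contained. The test functions and their tiny input Series are hypothetical and not part of the original material.

```python
import pandas as pd


def clean_agreement_scale(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    categories = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    return sr.astype(dtype)


def clean_favorite_language(sr):
    sr = sr.replace({"-77": pd.NA, "-99": pd.NA})
    sr = sr.str.lower().str.strip()
    sr = sr.replace("ypthon", "python")
    return sr.astype(pd.CategoricalDtype())


def test_clean_agreement_scale():
    # Made-up input: two valid answers and both missing-value codes
    raw = pd.Series(["agree", "-77", "strongly disagree", "-99"])
    result = clean_agreement_scale(raw)
    assert result.cat.ordered
    # Codes follow the category order; -1 marks a missing value
    assert result.cat.codes.tolist() == [3, -1, 0, -1]


def test_clean_favorite_language():
    # Made-up input: stray whitespace, mixed case, and the known typo
    raw = pd.Series([" Python", "ypthon", "R ", "-99"])
    result = clean_favorite_language(raw)
    assert result.dropna().tolist() == ["python", "python", "r"]
    assert result.isna().tolist() == [False, False, False, True]


# A test runner such as pytest would collect these automatically;
# calling them directly works as well.
test_clean_agreement_scale()
test_clean_favorite_language()
```

Because each function only depends on its argument, the tests need no CSV file and no shared state.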