Functional data cleaning: The How

We have three rules for cleaning data in a functional way:

  1. Start with an empty DataFrame

  2. Touch every variable just once

  3. Touch with a pure function
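
The three rules can be sketched in pandas roughly as follows. This is a minimal illustration, not the course's reference implementation: the raw column names (`ID`, `AGE_STR`, `SEX`), the gender coding, and the helper functions `clean_age`, `clean_gender`, and `clean_data` are all hypothetical and chosen only for this example.

```python
import pandas as pd


def clean_age(raw_age: pd.Series) -> pd.Series:
    """Pure function: turn age strings into nullable integers.

    All steps for this variable live in one function, so the column
    is still touched only once in `clean_data`.
    """
    stripped = raw_age.str.strip()
    numeric = pd.to_numeric(stripped, errors="coerce")  # unparsable -> missing
    return numeric.astype("Int64")


def clean_gender(raw_gender: pd.Series) -> pd.Series:
    """Pure function: map numeric codes (assumed coding) to a categorical."""
    labels = raw_gender.map({1: "female", 2: "male"})
    return labels.astype(pd.CategoricalDtype(["female", "male"]))


def clean_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Build the cleaned DataFrame without modifying `raw`."""
    # Rule 1: start with an empty DataFrame that only has an index.
    clean = pd.DataFrame(index=raw.index)
    # Rules 2 and 3: one assignment per variable, each via a pure function.
    clean["person_id"] = raw["ID"]  # rename only, so assign directly
    clean["age"] = clean_age(raw["AGE_STR"])
    clean["gender"] = clean_gender(raw["SEX"])
    return clean


# Hypothetical raw data for illustration.
raw = pd.DataFrame(
    {"ID": [101, 102, 103], "AGE_STR": [" 31", "n/a", "45 "], "SEX": [1, 2, 1]}
)
clean = clean_data(raw)
```

Wrapping all assignments in a function such as `clean_data` is the layout one would typically use in a script. In a notebook, the same assignments would usually sit in the global namespace, with one cell per operation or function definition.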

Learning Objectives

After working through this topic, you should be able to:

  • reproduce the three rules of functional data cleaning

  • apply the three rules whenever you are writing data cleaning code

  • discuss different strategies for applying the rules in Python scripts and Jupyter notebooks

Quiz

from epp_topics.quiz_utilities import display_quiz

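# Each question maps to its answer options; True marks a correct option
# (a question may have more than one).
content = {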
    "What does 'start with an empty DataFrame' mean in functional data cleaning?": {
        "Create a new DataFrame with only an index, then add cleaned columns one by"
        " one": True,
        "Create a DataFrame by copying all columns from the raw data": False,
        "Create an empty DataFrame with no index": False,
        "Start with a DataFrame containing all raw columns": False,
    },
    "What does 'touch every variable just once' mean?": {
        "Each variable from the raw data should be processed exactly once when creating"
        " the cleaned DataFrame": True,
        "Each variable should only be accessed once in the entire script": False,
        "Each variable should be modified in place only once": False,
        "Each variable should be saved to disk only once": False,
    },
    "What is a pure function in the context of data cleaning?": {
        "A function that has no side effects and returns the same output for the same"
        " input": True,
        "A function that modifies the input DataFrame in place": False,
        "A function that reads from external files": False,
        "A function that prints debugging information": False,
    },
    "According to the three rules, how should you structure your data cleaning code?": {
        "Create an empty DataFrame, then assign each cleaned variable using a pure"
        " function (which could be the identity function)": True,
        "Modify the raw DataFrame in place by cleaning each column": False,
        "Create multiple DataFrames and merge them at the end": False,
        "Copy the raw DataFrame and modify columns as needed": False,
    },
    "What is the identity function in the context of pure functions?": {
        "A function that returns its input unchanged (doing nothing)": True,
        "A function that checks if two values are equal": False,
        "A function that returns the index of a DataFrame": False,
        "A function that creates a new DataFrame": False,
    },
    "When you just need to rename a variable without any cleaning, what should you do"
    " according to the rules?": {
        "Assign the raw column directly to the column in the cleaned DataFrame": True,
        "Since we always need to call a function, we create a function that just"
        " returns the input column unchanged": False,
        "Copy the column and modify it slightly": False,
        "Skip the variable entirely": False,
    },
    "How do the three rules for cleaning apply differently in Python scripts vs. "
    "Jupyter notebooks?": {
        "In a script, you keep all operations on one line each in the global"
        " namespace": False,
        "Scripts require all three rules, notebooks only need two": False,
        "The rules do not apply to notebooks": False,
        "There is no difference in how the rules apply": False,
        "In a notebook, you typically keep all operations in the global namespace, with"
        " one cell per operation or function definition": True,
        "In a script, you put all cleaning operations in a pure function": True,
    },
    "In the functional approach, what should you do if you need to clean a variable"
    " that requires multiple steps?": {
        "Create a pure function that performs all the cleaning steps and returns the"
        " cleaned result": True,
        "Modify the variable multiple times in the main function": False,
        "Create multiple intermediate DataFrames": False,
        "Skip cleaning that variable": False,
    },
}

display_quiz(content)