Functional data cleaning: The How
We have three rules for cleaning data in a functional way:
1. Start with an empty DataFrame
2. Touch every variable just once
3. Touch with a pure function
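The three rules can be sketched in a few lines of pandas. This is a minimal illustration rather than the course's reference implementation: the raw data, the column names (`ID`, `AGE`, `GENDER`), and the helper functions `clean_age` and `clean_gender` are all hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical raw data; column names and values are made up for illustration.
raw = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "AGE": ["25", "n/a", "41"],
        "GENDER": ["f", "M", "F "],
    }
)


def clean_age(sr: pd.Series) -> pd.Series:
    """Convert age strings to nullable integers without modifying the input."""
    return pd.to_numeric(sr, errors="coerce").astype("Int64")


def clean_gender(sr: pd.Series) -> pd.Series:
    """Do several cleaning steps inside one pure function."""
    normalized = sr.str.strip().str.lower()
    labels = normalized.map({"f": "female", "m": "male"})
    return labels.astype(pd.CategoricalDtype(categories=["female", "male"]))


# Rule 1: start with an empty DataFrame that has only an index.
clean = pd.DataFrame(index=raw.index)

# Rules 2 and 3: assign each variable exactly once, using a pure function.
clean["person_id"] = raw["ID"]  # pure renaming: assign the raw column directly
clean["age"] = clean_age(raw["AGE"])
clean["gender"] = clean_gender(raw["GENDER"])
```

Because every cleaned column is created in exactly one place and `raw` is never modified, the definition of any variable can be found by searching for its single assignment.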
Learning Objectives
After working through this topic, you should be able to:
- reproduce the three rules of functional data cleaning
- apply the three rules whenever you are writing data cleaning code
- discuss different strategies for applying the rules in Python scripts and Jupyter notebooks
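As one possible illustration of the script strategy, the sketch below puts all cleaning operations in a single pure function, which is the convention the quiz at the end of this page describes for scripts. The names `clean_data` and `clean_year` and the raw columns `Country Name` and `YR` are hypothetical.

```python
import pandas as pd


def clean_year(sr: pd.Series) -> pd.Series:
    """Turn strings like 'YR2019' into integers; the input is not modified."""
    return pd.to_numeric(sr.str.replace("YR", "", regex=False))


def clean_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a new cleaned DataFrame and leave `raw` untouched."""
    clean = pd.DataFrame(index=raw.index)
    clean["country"] = raw["Country Name"]  # renaming only
    clean["year"] = clean_year(raw["YR"])
    return clean


# Hypothetical raw data, used only to show how the function is called.
raw = pd.DataFrame({"Country Name": ["Kenya", "Peru"], "YR": ["YR2019", "YR2020"]})
clean = clean_data(raw)
```

In a notebook, the body of `clean_data` would typically sit in the global namespace instead, with one cell per assignment or function definition.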
Materials
Video:
Download the slides.
Quiz
from epp_topics.quiz_utilities import display_quiz

# Maps each question to its answer options; True marks a correct option.
content = {
    "What does 'start with an empty DataFrame' mean in functional data cleaning?": {
        "Create a new DataFrame with only an index, then add cleaned columns one by"
        " one": True,
        "Create a DataFrame by copying all columns from the raw data": False,
        "Create an empty DataFrame with no index": False,
        "Start with a DataFrame containing all raw columns": False,
    },
    "What does 'touch every variable just once' mean?": {
        "Each variable from the raw data should be processed exactly once when creating"
        " the cleaned DataFrame": True,
        "Each variable should only be accessed once in the entire script": False,
        "Each variable should be modified in place only once": False,
        "Each variable should be saved to disk only once": False,
    },
    "What is a pure function in the context of data cleaning?": {
        "A function that has no side effects and returns the same output for the same"
        " input": True,
        "A function that modifies the input DataFrame in place": False,
        "A function that reads from external files": False,
        "A function that prints debugging information": False,
    },
    "According to the three rules, how should you structure your data cleaning code?": {
        "Create an empty DataFrame, then assign each cleaned variable using a pure"
        " function (which could be the identity function)": True,
        "Modify the raw DataFrame in place by cleaning each column": False,
        "Create multiple DataFrames and merge them at the end": False,
        "Copy the raw DataFrame and modify columns as needed": False,
    },
    "What is the identity function in the context of pure functions?": {
        "A function that returns its input unchanged (doing nothing)": True,
        "A function that checks if two values are equal": False,
        "A function that returns the index of a DataFrame": False,
        "A function that creates a new DataFrame": False,
    },
    "When you just need to rename a variable without any cleaning, what should you do"
    " according to the rules?": {
        "Assign the raw column directly to the column in the cleaned DataFrame": True,
        "Since we always need to call a function, we create a function that just"
        " returns the input column unchanged": False,
        "Copy the column and modify it slightly": False,
        "Skip the variable entirely": False,
    },
    "How do the three rules for cleaning apply differently in Python scripts vs. "
    "Jupyter notebooks?": {
        "In a script, you keep all operations on one line each in the global"
        " namespace": False,
        "Scripts require all three rules, notebooks only need two": False,
        "The rules do not apply to notebooks": False,
        "There is no difference in how the rules apply": False,
        "In a notebook, you typically keep all operations in the global namespace, with"
        " one cell per operation or function definition": True,
        "In a script, you put all cleaning operations in a pure function": True,
    },
    "In the functional approach, what should you do if you need to clean a variable"
    " that requires multiple steps?": {
        "Create a pure function that performs all the cleaning steps and returns the"
        " cleaned result": True,
        "Modify the variable multiple times in the main function": False,
        "Create multiple intermediate DataFrames": False,
        "Skip cleaning that variable": False,
    },
}

display_quiz(content)