# Rules for data management

## Learning Objectives

After working through this topic, you should be able to:

- Explain the importance of never modifying source data
- Discuss the importance of separating data management and analysis
- Store tabular data according to the normal forms
- Explain the benefits of using "long" instead of "wide" format




## Materials

Video:

<iframe
  src="https://electure.uni-bonn.de/paella7/ui/watch.html?id=986e7bb6-4bf2-411f-81c7-453bb338559f"
  width="640"
  height="360"
  frameborder="0"
  allowfullscreen
></iframe>

Download the [slides](pandas-rules.pdf).


## Quiz

from jupyterquiz import display_quiz

content = [
    {
        "question": ("What are examples of wide DataFrames?"),
        "type": "multiple_choice",
        "answers": [
            {
                "answer": "A panel dataset where there is one column per variable \
                    and year",
                "correct": True,
                "feedback": "Correct.",
            },
            {
                "answer": "A dataset with more variables than observations",
                "correct": False,
                "feedback": "This does not imply a specific shape of the dataframe.",
            },
            {
                "answer": "A dataset with a MultiIndex",
                "correct": False,
                "feedback": "This does not imply a specific shape of the dataframe.",
            },
            {
                "answer": "A dataset with a column that contains the full \
                    name of individuals",
                "correct": False,
                "feedback": "This does not imply a specific shape of the dataframe.",
            },
        ],
    },
    {
        "question": (
            "What happens when you merge household-level data with \
            individual-level data?"
        ),
        "type": "multiple_choice",
        "answers": [
            {
                "answer": "The resulting DataFrame contains redundant information",
                "correct": True,
                "feedback": "Correct.",
            },
            {
                "answer": "There is structure in variable names",
                "correct": False,
                "feedback": "Incorrect.",
            },
            {
                "answer": "There is structure in values",
                "correct": False,
                "feedback": "Incorrect.",
            },
            {
                "answer": "Data management and analysis are mixed",
                "correct": False,
                "feedback": "Incorrect.",
            },
        ],
    },
    {
        "question": (
            "After performing some data management operations on a DataFrame,\
            the best practice is to save the DataFrame to ..."
        ),
        "type": "multiple_choice",
        "answers": [
            {
                "answer": "a new file with a new name, so that the original data is \
                    not lost",
                "correct": True,
                "feedback": "Correct.",
            },
            {
                "answer": "the same file, so that the original data is overwritten",
                "correct": False,
                "feedback": "Overwriting the source data is never a good idea.",
            },
            {
                "answer": "a new file with the same name, so that the original data \
                    is not lost",
                "correct": False,
                "feedback": "This will possibly overwrite the source data.",
            },
        ],
    },
    {
        "question": (
            "You have a panel of individuals for which you have collected \
            data on their education, job satisfaction, income and sex over some months.\
            How would you store this data?"
        ),
        "type": "multiple_choice",
        "answers": [
            {
                "answer": "A single DataFrame with a column for each variable, and \
                    two index columns for the individual and the month.",
                "correct": False,
                "feedback": "This works well for the time-varying variables, but \
                    time-fixed characteristics such as sex would be repeated in \
                    every month.",
            },
            {
                "answer": "A single DataFrame with a column for each variable in \
                    each month, e.g. education_month_1, education_month_2, etc.",
                "correct": False,
                "feedback": "This is not good practice for data management. This \
                    wide format hides structure in the variable names and is \
                    difficult to work with.",
            },
            {
                "answer": "Two DataFrames, one with the time-fixed characteristics \
                    and one with the time-varying characteristics that uses the \
                    individual and month as indices.",
                "correct": True,
                "feedback": "This is the best way to store the data. It is \
                    efficient and easy to work with, and less error-prone.",
            },
        ],
    },
    {
        "question": (
            "Why is it good practice to separate data management operations \
            from data analysis?"
        ),
        "type": "many_choice",
        "answers": [
            {
                "answer": "It allows for easier tracking and debugging of errors.",
                "correct": True,
                "feedback": "Correct. Having all the data management operations in \
                    one place makes it easier to track and debug errors.",
            },
            {
                "answer": "It makes it easier to test the results of the data \
                management operations.",
                "correct": True,
                "feedback": "Correct. The data management operations usually produce \
                a new DataFrame that you store, and this makes it easier to check \
                that the data management operations have been performed correctly.",
            },
            {
                "answer": "It makes the code faster.",
                "correct": False,
                "feedback": "Incorrect. The code will not necessarily be faster.",
            },
        ],
    },
    {
        "question": (
            "A database is said to be in first normal form if it satisfies \
            the following conditions..."
        ),
        "type": "many_choice",
        "answers": [
            {
                "answer": "Every column of the table contains atomic values.",
                "correct": True,
                "feedback": "Correct. This is one of the conditions for first normal \
                    form, and it implies that columns do not contain values that can \
                    be decomposed into smaller values, such as lists or \
                    dictionaries.",
            },
            {
                "answer": "Every column of the table contains only numeric values.",
                "correct": False,
                "feedback": "Incorrect. This is not a condition for first normal form.",
            },
            {
                "answer": "There are no repeating groups of data.",
                "correct": True,
                "feedback": "Correct. This is one of the conditions for first normal \
                    form, and it implies that a table should not contain repeating \
                    columns.",
            },
            {
                "answer": "Every column of the table contains only unique values.",
                "correct": False,
                "feedback": "Incorrect. This is not a condition for first normal form.",
            },
        ],
    },
]

display_quiz(content, colors="fdsp")