# Creating variables

## Learning Objectives

After working through this topic, you should be able to:

- Assign new columns to a DataFrame
- Explain why looping over columns is harmless while looping over rows should be avoided
- Use vectorized calculations between numeric columns
- Use vectorized string methods
- Combine information from two columns based on conditions
- Replace values in variables
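
The objectives above can be illustrated with a minimal sketch. The DataFrame and all column names here are hypothetical and chosen only for illustration; they are not taken from the video or the slides.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative assumptions.
data = pd.DataFrame(
    {
        "name": ["  alice", "Bob ", "carol"],
        "birth_date": ["03/11/1990", "21/02/1985", "15/07/2001"],
        "saving": [1000.0, 2000.0, 500.0],
        "investment": [250.0, 2000.0, 0.0],
        "status": ["emp", "unemp", "emp"],
    }
)

# Assign a new column from a vectorized calculation between numeric columns.
data["share_not_invested"] = (data["saving"] - data["investment"]) / data["saving"]

# Vectorized string methods via the .str accessor.
data["name"] = data["name"].str.strip().str.title()
data["birth_year"] = data["birth_date"].str[-4:].astype(int)

# Combine two columns based on a condition: where keeps the values for which
# the condition is True and takes the values from `other` everywhere else.
data["invested_or_saved"] = data["investment"].where(
    data["investment"] > 0, other=data["saving"]
)

# Replace values using a dictionary that maps old values to new ones.
data["status"] = data["status"].replace({"emp": "employed", "unemp": "unemployed"})

# Looping over columns is fine: there are few of them and each step is vectorized.
for col in ["saving", "investment"]:
    data[f"{col}_thousands"] = data[col] / 1000
```

Every step operates on whole columns at once; none of them iterates over rows.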

## Materials

Video:

<iframe
  src="https://electure.uni-bonn.de/paella7/ui/watch.html?id=cc0eced6-56fc-4719-ba6c-00d9d486e9ae"
  width="640"
  height="360"
  frameborder="0"
  allowfullscreen
></iframe>

Download the [slides](pandas-creating_variables.pdf).

## Quiz


In [None]:
content = [
    {
        "question": "The code below ...",
        "code": "data['new_col'] = series",
        "type": "many_choice",
        "answers": [
            {
                "answer": (
                    "creates a new column in the data DataFrame with the name \
                        'new_col' and entries from the series Series"
                ),
                "correct": True,
                "feedback": "Correct. This behavior occurs if there is no \
                    column 'new_col' in the data DataFrame.",
            },
            {
                "answer": (
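                    "raises an error if the series Series has a different length \
                        than the data DataFrame"
                ),
                "correct": False,
                "feedback": "Incorrect. A Series is aligned on the index, so rows \
                    without a matching label become NaN. A list or NumPy array \
                    of the wrong length does raise a ValueError.",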
            },
            {
                "answer": (
                    "overwrites the column 'new_col' in the data DataFrame \
                        with the entries from the series Series if the column \
                        'new_col' already exists in the data DataFrame"
                ),
                "correct": True,
                "feedback": "Correct.",
            },
            {
                "answer": (
                    "is used to rename the column 'new_col' in the data \
                        DataFrame to 'series'"
                ),
                "correct": False,
                "feedback": "Incorrect. To rename a column, use the rename \
                    method.",
            },
        ],
    },
    {
        "question": "To replace some values in a DataFrame with others, you \
            should ...",
        "type": "multiple_choice",
        "answers": [
            {
                "answer": (
                    "loop over the rows and replace any occurrence of the values"
                ),
                "correct": False,
                "feedback": "You should avoid looping over rows.",
            },
            {
                "answer": ("use some vectorized if condition"),
                "correct": False,
                "feedback": "There is an easier way!",
            },
            {
                "answer": ("use the replace method"),
                "correct": True,
                "feedback": "Correct. Remember: this method accepts a dictionary \
                    that maps old values to new ones.",
            },
            {
                "answer": (
                    "open the DataFrame in a text editor and replace the values there"
                ),
                "correct": False,
                "feedback": "This is not feasible.",
            },
        ],
    },
    {
        "question": "Suppose that one DataFrame holds information on the savings of \
            some individuals and another contains the investments of the same \
            individuals. To calculate the fraction of savings not invested you \
            should ...",
        "type": "multiple_choice",
        "answers": [
            {
                "answer": (
                    "merge the datasets, loop over the rows and calculate the \
                        fraction for each row"
                ),
                "correct": False,
                "feedback": "You should avoid looping over rows.",
            },
            {
                "answer": ("perform a vectorized operation on the two columns"),
                "correct": True,
                "feedback": "Correct. The code \
                    saving_not_invested = (data_1['saving'] - data_2['investment']) / \
                        data_1['saving'] \
                    produces a Series with the fraction of savings not invested. \
                    Note that the Series doesn't need to be a column of a DataFrame. \
                    Note that the indexes of the two datasets need to be aligned.",
            },
            {
                "answer": (
                    "It is not possible to easily perform operations between two \
                        datasets"
                ),
                "correct": False,
                "feedback": "Incorrect. Columns of a DataFrame are Series objects.",
            },
            {
                "answer": ("merge the datasets and perform a vectorized operation"),
                "correct": False,
                "feedback": "There is no need to merge the dataset. However, \
                    you still need to make sure that the two datasets are aligned.",
            },
        ],
    },
    {
        "question": "Suppose that a DataFrame contains a variable with the \
            date of birth of each observation as a string in the format dd/mm/yyyy. \
            To store the birth year you could ...",
        "type": "many_choice",
        "answers": [
            {
                "answer": (
                    "loop over the rows and keep only the last 4 characters of the \
                        string"
                ),
                "correct": False,
                "feedback": "You should avoid looping over rows.",
            },
            {
                "answer": ("year = data['birth_date'].str[-4:].astype(int)"),
                "correct": True,
                "feedback": "Correct.",
            },
            {
                "answer": (
                    "year = data['birth_date'].str.split('/').str.get(-1).astype(int)"
                ),
                "correct": True,
                "feedback": "Correct. However, this is a bit cumbersome.",
            },
            {
                "answer": ("use the datetime functionality of pandas"),
                "correct": True,
                "feedback": "Correct. pandas parses the strings itself, so there \
                is no need to import the datetime module: \
                year = pd.to_datetime(data['birth_date'], format='%d/%m/%Y').dt.year",
            },
        ],
    },
    {
        "question": "The where method ...",
        "type": "many_choice",
        "answers": [
            {
                "answer": ("works as a vectorized if condition"),
                "correct": True,
                "feedback": "Correct. Use this method whenever you need to \
                    perform operations on a Series based on its values.",
            },
            {
                "answer": (
                    "is not very intuitive and could be avoided using simpler syntax"
                ),
                "correct": False,
                "feedback": "While there are other ways to operate on a Series \
                    based on its values, you should get used to the where method.",
            },
            {
                "answer": ("can only be used to generate a binary series"),
                "correct": False,
                "feedback": "The where method is very versatile: it keeps the values \
                    where cond is True and replaces all others with the values \
                    in the other argument.",
            },
            {
                "answer": (
                    "can be used in a nested way to produce very general if conditions"
                ),
                "correct": True,
                "feedback": "Once you get used to the where method, it is very simple \
                    to produce very general if conditions.",
            },
        ],
    },
    {
        "question": "Looping over rows in a DataFrame is ...",
        "type": "multiple_choice",
        "answers": [
            {
                "answer": ("a slow operation and should be avoided if possible"),
                "correct": True,
                "feedback": "Indeed, this operation is slow and does not scale.",
            },
            {
                "answer": ("a fast operation and should be used whenever possible"),
                "correct": False,
                "feedback": "Incorrect. Looping over rows is slow.",
            },
            {
                "answer": ("similar to looping over columns"),
                "correct": False,
                "feedback": "Incorrect. Looping over columns is fast and readable.",
            },
            {
                "answer": ("similar to the use of vectorized operations"),
                "correct": False,
                "feedback": "Vectorized operations run in optimized compiled code \
                        on whole columns, while a loop over rows executes Python \
                        code for every single row.",
            },
        ],
    },
]

from jupyterquiz import display_quiz

display_quiz(content, colors="fdsp")