# Inspecting and summarizing data

## Learning Objectives

After working through this topic, you should be able to:

- Summarize datasets
- Inspect unique values and their frequencies with `unique` and `value_counts`
- Visualize the distribution of variables
- Calculate summary statistics

## Materials

Video:

<iframe
  src="https://electure.uni-bonn.de/paella7/ui/watch.html?id=c1284f96-a00c-4117-a12a-832040f34dd7"
  width="640"
  height="360"
  frameborder="0"
  allowfullscreen
></iframe>

Download the [slides](pandas-inspecting_and_summarizing.pdf).



### pandas settings for getting "modern" behaviour and the plotly backend for graphs

```python
import pandas as pd

pd.options.mode.copy_on_write = True
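pd.options.future.infer_string = True
# Use plotly for `.plot()` calls, as announced in the heading (requires plotly).
pd.options.plotting.backend = "plotly"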
```
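
To make the objectives above concrete, here is a minimal sketch of `describe` on a small DataFrame. The data and the column names `group` and `score` are made up for illustration and do not come from the course materials.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration.
df = pd.DataFrame(
    {
        "group": ["a", "b", "a", "c", "b", "a"],
        "score": [1.0, 2.5, None, 4.0, 3.5, 2.0],
    }
)

# Summary statistics of the numeric columns, returned as a DataFrame.
summary = df.describe()
print(summary)

# The "count" row holds the number of non-missing values; the median is the
# row labelled "50%".
n_valid = summary.loc["count", "score"]
median = summary.loc["50%", "score"]

# Missing values are not part of the report and have to be counted separately.
n_missing = df["score"].isna().sum()
print(n_valid, median, n_missing)
```

Because the result is itself a DataFrame, individual statistics can be selected with `.loc`, as done for the count and the median here.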



## Quiz

In [None]:
content = [
    {
        "question": "Calling `.describe()` on a DataFrame returns:",
        "type": "multiple_choice",
        "answers": [
            {
                "answer": "A table",
                "correct": False,
                "feedback": "Tables are not Python objects.",
            },
            {
                "answer": "A DataFrame",
                "correct": True,
                "feedback": "Correct.",
            },
            {
                "answer": "A dictionary",
                "correct": False,
                "feedback": "Incorrect.",
            },
            {
                "answer": "An array",
                "correct": False,
                "feedback": "Incorrect.",
            },
        ],
    },
    {
        "question": (
            "The report from calling `.describe()` on a DataFrame contains:"
        ),
        "type": "many_choice",
        "answers": [
            {
                "answer": "The count of missing values",
                "correct": False,
                "feedback": "This information is not explicitly reported.",
            },
            {
                "answer": "The count of non-missing values",
                "correct": True,
                "feedback": "Correct.",
            },
            {
                "answer": "The median",
                "correct": True,
                "feedback": "Correct. This is indexed with 50%.",
            },
            {
                "answer": "The max",
                "correct": True,
                "feedback": "Correct.",
            },
        ],
    },
    {
        "question": "The plot method can be applied to:",
        "type": "multiple_choice",
        "answers": [
            {
                "answer": "Only clean data",
                "correct": False,
                "feedback": (
                    "Plots of clean data provide useful information. "
                    "Graphs that go into papers should be based on clean data. "
                    "However, inspecting raw data can be very helpful."
                ),
            },
            {
                "answer": "Any Series",
                "correct": True,
                "feedback": "Correct.",
            },
            {
                "answer": "Only summary statistics",
                "correct": False,
                "feedback": (
                    "Summary statistics of a single variable are rarely "
                    "plotted. Summary statistics can be plotted to compare them "
                    "across different variables."
                ),
            },
            {
                "answer": "Only monotone series",
                "correct": False,
                "feedback": (
                    'Monotone series might have "nicer" graphs. '
                    "However, the plot method can also be applied to "
                    "non-monotone series."
                ),
            },
        ],
    },
    {
        "question": "Scatterplots are useful for:",
        "type": "multiple_choice",
        "answers": [
            {
                "answer": (
                    "presenting a large number of data points (thousands+)."
                ),
                "correct": False,
                "feedback": (
                    "Scatterplots are often not the best choice for presenting "
                    "so many data points. It will be impossible to "
                    "visually distinguish different observations."
                ),
            },
            {
                "answer": "clearly presenting summary statistics.",
                "correct": False,
                "feedback": (
                    "This might give the wrong impression that (e.g.) means "
                    "constitute data points. Also, it will be hard to "
                    "communicate statistical uncertainty."
                ),
            },
            {
                "answer": "identifying outliers",
                "correct": True,
                "feedback": (
                    "The interactive plot can be used to eyeball "
                    "outliers in small and medium-sized datasets."
                ),
            },
            {
                "answer": "scatterplots are just not very useful",
                "correct": False,
                "feedback": (
                    "Very much to the contrary! Just not useful for everything."
                ),
            },
        ],
    },
    {
        "question": (
            "Suppose you are handling a dataset you are not familiar with "
            "and that does not come with a description of the variables. "
            'The dataset has the variable "group". '
            "The easiest way to discover how many groups there are is to:"
        ),
        "type": "multiple_choice",
        "answers": [
            {
                "answer": (
                    "Open the dataset in a data viewer, "
                    "sort the data by the group variable, and scroll down"
                ),
                "correct": False,
                "feedback": "This procedure is error-prone.",
            },
            {
                "answer": (
                    "Convert the variable into a set and check the "
                    "length of the set"
                ),
                "correct": False,
                "feedback": "This works, but it is a bit cumbersome.",
            },
            {
                "answer": "Call `df['group'].unique()`.",
                "correct": True,
                "feedback": (
                    "Correct. This approach also identifies misspelled groups."
                ),
            },
            {
                "answer": "Call `df['group'].value_counts()`.",
                "correct": False,
                "feedback": (
                    "Almost. This method reports the unique values, "
                    "but these are not quite as easily accessible."
                ),
            },
        ],
    },
]

from jupyterquiz import display_quiz

display_quiz(content, colors="fdsp")
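
The options in the last question can be compared on a small example. This is only a sketch: the labels, including the deliberately misspelled `"contrl"`, are made up, and `nunique` is an addition that the quiz does not mention.

```python
import pandas as pd

# Hypothetical column with a misspelled group label ("contrl").
df = pd.DataFrame(
    {"group": ["control", "treated", "contrl", "treated", "control"]}
)

# Distinct labels in order of first appearance; the typo stands out here.
labels = df["group"].unique()
print(labels)

# Number of distinct labels, directly and via the more roundabout set approach.
n_groups = df["group"].nunique()
n_groups_via_set = len(set(df["group"]))
print(n_groups, n_groups_via_set)

# Frequencies per label; the labels themselves sit in the index of the result.
counts = df["group"].value_counts()
print(counts)
```

Both `unique` and `value_counts` reveal the misspelled label; `value_counts` additionally shows how often each label occurs, at the cost of returning the labels as the index of a Series.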