# Data types

## Learning Objectives

After working through this topic, you should be able to:

- List the most important datatypes in pandas
- Discuss the benefits of modern strings
- Choose memory saving datatypes for your data
<!-- - Explain the benefits of modern nullable datatypes -->

## Materials

Video:

<iframe
  src="https://electure.uni-bonn.de/paella7/ui/watch.html?id=8ad5f658-648c-4d98-a006-eb8447bfb349"
  width="640"
  height="360"
  frameborder="0"
  allowfullscreen
></iframe>

Download the [slides](pandas-datatypes.pdf).

Further reading:

- [Pandas user guide on string/text data](https://pandas.pydata.org/docs/user_guide/text.html)
- [Pandas user guide on categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html)

## Quiz

In [None]:
content = [
    {
        "name": "Intro",
        "front": (
            "Describe each of the following data types (rough amount of storage "
            "space, the broad range of values it can/should take, an example use "
            "case)."
        ),
        "back": (
            "Let's go through them step by step (click on next in the bottom right)."
        ),
    },
    {
        "name": "pd.Int16Dtype",
        "front": "pd.Int16Dtype",
        "back": (
            "An integer, requiring 16 bits of storage space, yielding a range of values"
            " from $[-2^{15}, 2^{15}) = [-32768, 32767]$. We may use it to store "
            "the year."
        ),
    },
    {
        "name": "pd.UInt32Dtype",
        "front": "pd.UInt32Dtype",
        "back": (
            "An unsigned integer, requiring 32 bits of storage space, yielding a range "
            "of values: $[0, 2^{32}) = [0, 4294967295]$. We may use it to store "
            "person identifiers, unless we observe more than half of all humanity."
        ),
    },
    {
        "name": "pd.StringDtype",
        "front": "pd.StringDtype",
        "back": (
            "Strings of arbitrary length. We may use it to store responses to a "
            "free-text question from a survey."
        ),
    },
    {
        "name": "pd.CategoricalDtype (ordered)",
        "front": "pd.CategoricalDtype (ordered)",
        "back": (
            "A categorical variable, with a fixed number of possible values, which are "
            "ordered in a way specified by the user. Very efficient storage behind the "
            "scenes. An example would be responses to a Likert scale question on "
            "subjective health with possible values "
            "excellent-very good-good-fair-poor."
        ),
    },
    {
        "name": "pd.CategoricalDtype (unordered)",
        "front": "pd.CategoricalDtype (unordered)",
        "back": (
            "A categorical variable, with a fixed number of possible values, which are "
            "not ordered. Very efficient storage behind the scenes. An example would "
            "be gender (female, male, other, ...)."
        ),
    },
]


from jupytercards import display_flashcards

display_flashcards(content)
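
As a minimal sketch of the data types on the flashcards above, the following snippet builds one small Series per type. All example values and variable names (`years`, `person_ids`, `answers`, `health`, `gender`) are made up for illustration; the integer ranges are read off NumPy's `iinfo` rather than typed by hand.

```python
import numpy as np
import pandas as pd

# The value ranges follow directly from the bit width.
int16_info = np.iinfo(np.int16)    # signed 16-bit integer
uint32_info = np.iinfo(np.uint32)  # unsigned 32-bit integer
print(int16_info.min, int16_info.max)
print(uint32_info.min, uint32_info.max)

# Nullable integers: missing values are allowed and show up as <NA>.
years = pd.Series([1999, 2024, None], dtype=pd.Int16Dtype())
person_ids = pd.Series([1, 2, 3], dtype=pd.UInt32Dtype())

# Strings of arbitrary length, e.g. free-text survey responses.
answers = pd.Series(["I like it", "No comment", None], dtype=pd.StringDtype())

# Ordered categorical: the order is the one given in `categories`.
health_levels = ["poor", "fair", "good", "very good", "excellent"]
health = pd.Series(
    ["good", "excellent", "fair"],
    dtype=pd.CategoricalDtype(categories=health_levels, ordered=True),
)

# Unordered categorical: a fixed set of values without any ranking.
gender = pd.Series(
    ["female", "male", "other"],
    dtype=pd.CategoricalDtype(categories=["female", "male", "other"]),
)

# Comparisons are only meaningful for the ordered variant.
better_than_fair = health > "fair"
```

Because `health` is ordered, operations such as `health > "fair"` or `health.max()` respect the order of `health_levels`; for the unordered `gender` column, pandas does not define such a ranking.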

In [None]:
content = [
    {
        "question": (
            "Assume you have a small survey dataset with 1500 rows and 50 columns. "
            "Among other things, your dataset has variables for gender, income, "
            "and happiness (on a 3-point Likert scale), all of which are stored as "
            "pd.Int64Dtype. Tick all that apply."
        ),
        "type": "many_choice",
        "answers": [
            {
                "answer": (
                    "We should set each variable to the smallest possible integer type "
                    "to save on memory."
                ),
                "correct": False,
                "feedback": (
                    "Gender as an integer is not helpful at all! You always need to "
                    "remember what the numbers mean. Let's not get started on plotting "
                    "or accidentally using the values 0, 1, 2 directly in a regression. "
                    "Also, with the size of these data, memory is not a concern."
                ),
            },
            {
                "answer": (
                    "We should set income to be of the pd.Float64Dtype variant in "
                    "order to save on memory."
                ),
                "correct": False,
                "feedback": (
                    "Both datatypes take up the same amount of memory. Also, with the "
                    "size of these data, memory is not a concern."
                ),
            },
            {
                "answer": (
                    "We should set income to be of the pd.Float64Dtype variant "
                    "to make clear we are approximating a real number."
                ),
                "correct": True,
                "feedback": (
                    "We would recommend this indeed. Strictly speaking, you could not "
                    "calculate continuous distributions etc. with integers, although "
                    "the necessary type conversion will often happen implicitly."
                ),
            },
            {
                "answer": (
                    "We should set gender to be of the ordered pd.CategoricalDtype "
                    "variant."
                ),
                "correct": False,
                "feedback": "How would you order female/male/other?",
            },
            {
                "answer": (
                    "We should set gender to be of the unordered pd.CategoricalDtype "
                    "variant."
                ),
                "correct": True,
            },
            {
                "answer": (
                    "We should set happiness to be of the ordered pd.CategoricalDtype "
                    "variant."
                ),
                "correct": True,
                "feedback": "This is the correct representation.",
            },
            {
                "answer": (
                    "We should set happiness to be of the unordered pd.CategoricalDtype"
                    " variant."
                ),
                "correct": False,
                "feedback": (
                    "Order is built into Likert scales by definition; the data "
                    "type should reflect this."
                ),
            },
            {
                "answer": "We can just leave happiness to be an integer type.",
                "correct": False,
                "feedback": (
                    "A categorical data type makes much clearer what the variable "
                    "contains. The only reason you may want to leave it as an integer "
                    "would be to include it in a regression where you are comfortable "
                    "interpreting differences in a cardinal way. While there is a "
                    "debate on whether this is defensible for Likert scales with a "
                    "larger outcome space, for three outcomes it definitely is not."
                ),
            },
        ],
    },
]

from jupyterquiz import display_quiz

display_quiz(content, colors="fdsp")
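
The recommendations from the quiz can be sketched on a toy dataset. Everything specific here is an assumption made for illustration: the four rows of data, the codebooks `gender_labels` and `happiness_labels`, and the labels chosen for the 3-point happiness scale. The last lines give a rough estimate of the memory footprint, assuming all 50 columns were stored as `pd.Int64Dtype` (8 bytes of data plus 1 byte for the missing-value mask per cell).

```python
import pandas as pd

# Toy stand-in for the survey: everything is stored as Int64 at first.
raw = pd.DataFrame(
    {
        "gender": [0, 1, 2, 1],
        "income": [32000, 45500, 51000, 28000],
        "happiness": [1, 3, 2, 3],
    }
).astype(pd.Int64Dtype())

# Hypothetical codebooks mapping the integer codes to labels.
gender_labels = {0: "female", 1: "male", 2: "other"}
happiness_labels = {1: "unhappy", 2: "neutral", 3: "happy"}

clean = pd.DataFrame(index=raw.index)

# Gender: unordered categorical, since there is no meaningful ranking.
clean["gender"] = (
    raw["gender"]
    .map(gender_labels)
    .astype(pd.CategoricalDtype(categories=list(gender_labels.values())))
)

# Income: float, to make clear we are approximating a real number.
clean["income"] = raw["income"].astype(pd.Float64Dtype())

# Happiness: ordered categorical, because Likert scales are ordered.
clean["happiness"] = (
    raw["happiness"]
    .map(happiness_labels)
    .astype(
        pd.CategoricalDtype(
            categories=list(happiness_labels.values()), ordered=True
        )
    )
)

# Int64 and Float64 need the same amount of memory per value.
int_bytes = raw["income"].memory_usage(index=False)
float_bytes = clean["income"].memory_usage(index=False)

# Rough size of the full survey if every column were Int64.
approx_bytes = 1500 * 50 * (8 + 1)
print(clean.dtypes)
print(int_bytes, float_bytes, approx_bytes)
```

At well under one megabyte for the whole dataset, the choice of data type here is about making the meaning of each variable explicit, not about saving memory.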