TLDR
The types we choose to represent our data have an impact on the size of its in-memory representation. There is a multitude of data types, each with a different precision and memory footprint. To be as memory-efficient as possible, you must understand your data and choose the most fine-grained types that can still represent it.
Find the code related to this Bits of ... on GitHub.
Concept
Pandas is amazing at processing data on a single machine, on a single core, and when it fits into memory. If you deal with big datasets, you can easily exceed these limits and might have to look into other tools to handle your data. Before moving to distributed computing with technologies like Spark, there are some simple optimizations that will help you scale to bigger datasets.
The main limiting factor is the amount of memory needed to hold the data. If we reduce the size of the data, we can load more of it and process it faster. Consider a simple CSV file with two columns: age and job.
When you load data into memory, you have to define the types for each column of your dataset, and this is often done for you by default. Pandas uses numpy arrays to represent each column in your data frame. These numpy arrays have a dtype, which is the underlying type of all elements in the array. When using pd.read_csv, pandas will guess the data type of each column and assign numpy arrays with default types. However, the default types are coarse-grained, using more space than needed to represent the different columns.
In our case we would get an age column with an int64 type and a job column with an object type. This is sub-optimal: an age has a range of [0, 100], needing only 8 bits instead of 64. Also, a job is not “any string” but is categorical: we most likely have a limited list of possible jobs, and each job is an element of this list.
Use the proper dtype for numerical columns
Pandas does not infer the best type for our numerical columns. In our case we could represent an age with a uint8 type, because an age is unsigned (always positive) and its range is [0, 100]. However, by default, pandas uses an int64, which consumes 8x more memory.
The first step is to enforce optimized types for numerical columns.
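As a minimal sketch of this first step, assuming a hypothetical people.csv file with the age and job columns described above, the dtype argument of pd.read_csv lets us enforce a smaller integer type at load time:

import numpy as np
import pandas as pd

# Hypothetical file with the two columns from the example: age and job
df = pd.read_csv(
    "people.csv",
    dtype={"age": np.uint8},  # 1 byte per value instead of 8 with int64
)
print(df.dtypes)  # age is now uint8, job is still object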
Use the proper dtype for categorical columns
When a column is categorical with a low cardinality (the number of unique values is much lower than the total number of rows, i.e. lots of repetitions), the string representation is the most wasteful. Strings are a costly type: depending on the encoding (utf-8, ascii, …) each character uses between 1 and 4 bytes. Repeated over thousands of rows, this quickly consumes a huge amount of memory.
But there is a better way. It’s more efficient to keep a dictionary mapping every unique value to an index with a small type, for instance 1 byte, and then repeat only that 1-byte index thousands of times instead of the complete, far bigger string. Fortunately, pandas has a dtype that handles this automatically for you: you just have to use the category dtype.
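To get a feel for the gain, here is a small sketch (with a made-up job column of two values repeated over many rows) comparing the memory used by the plain object representation and the category one:

import pandas as pd

# Made-up column: two distinct jobs repeated over many rows
jobs = pd.Series(["Data Scientist", "Student"] * 50_000)

as_object = jobs.memory_usage(deep=True)
as_category = jobs.astype("category").memory_usage(deep=True)
print(f"object:   {as_object / 1024 ** 2:.2f} MB")
print(f"category: {as_category / 1024 ** 2:.2f} MB")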
It usually brings the biggest improvement in memory consumption. Handling categorical data with plain strings is a huge waste of resources: the str type is heavy, whereas a proper categorical type only repeats small indices backed by a mapping.
In the calculation above, we assume that the smallest chunk of memory we can allocate is 1 byte (8 bits). Indeed, we only have two different values in the vocabulary, Data Scientist and Student. To represent two different values we only need 1 bit: either 0 or 1. But since we assumed that we can allocate integers of at least 1 byte, we get 1 byte for each entry.
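We can check this assumption directly: under the hood the category dtype stores one small integer code per row plus the list of unique values, as this quick sketch on the two jobs above shows:

import pandas as pd

jobs = pd.Series(["Data Scientist", "Student"] * 4, dtype="category")
print(jobs.cat.codes.dtype)  # int8 -> 1 byte per entry
print(jobs.cat.categories)   # the two unique strings, stored only once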
Load only relevant columns
It’s likely that you are not using all the columns of your dataset for a specific analysis. With the usecols keyword, you can load only the columns of interest. It is a simple trick, but it helps you deal with huge datasets easily.
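For instance, assuming a hypothetical analysis that only needs the age column of the people.csv file used earlier, the sketch below skips everything else at load time:

import numpy as np
import pandas as pd

# Only the columns listed in usecols are parsed and kept in memory
df = pd.read_csv(
    "people.csv",  # hypothetical file from the earlier example
    usecols=["age"],
    dtype={"age": np.uint8},
)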
Pandas dtypes include all numpy dtypes plus some pandas-specific types. You can find the detailed list of all types in the documentation: for pandas and for numpy.
In Practice
Let's get our hands dirty on a real dataset to demonstrate the importance of data types. We will use a public dataset from Kaggle entitled Resale Flat Prices in Singapore (between 1990 and 1999).
First we import the needed libraries, and build the path to the dataset.
from pathlib import Path

import numpy as np
import pandas as pd

# Data source:
# https://www.kaggle.com/sveneschlbeck/resale-flat-prices-in-singapore
data_path = Path(__file__).parents[0] / "flat-prices.zip"
Then we define two helper functions:
- to_mb: converts a size in bytes (B) to a size in megabytes (MB).
- analyze: takes a data frame as input and prints its dtypes as well as a title containing the size in MB.
We use df.memory_usage(deep=True) to get the memory size per column and sum it to get the total size. The argument deep=True is important to get the real amount of memory consumed.
def to_mb(bytes: float) -> float:
    """Convert a size in bytes to megabytes (MB)"""
    return bytes / 1024 ** 2


def analyze(df: pd.DataFrame, title: str) -> float:
    """Print the memory used by the dataframe and return its size in MB"""
    bytes = df.memory_usage(deep=True).sum()
    mb = to_mb(bytes)
    print(f"{title}: {mb:.2f} MB")
    print(df.dtypes)
    return mb
Then we load the data using pd.read_csv without any optimizations:
def no_optimization() -> pd.DataFrame:
    """Load the data without any optimization"""
    df = pd.read_csv(data_path)
    return df


...


if __name__ == "__main__":
    # First we start with no optimizations
    df_no_opti = no_optimization()
    mb_no_opti = analyze(df_no_opti, "No optimizations")
No optimizations: 131.36 MB
month                   object
town                    object
flat_type               object
block                   object
street_name             object
storey_range            object
floor_area_sqm         float64
flat_model              object
lease_commence_date      int64
resale_price             int64
dtype: object
The data frame takes 131.36 MB in memory. We also take a look at the column types and notice that some numerical types can be improved. Let's take a look at the columns floor_area_sqm and resale_price:
floor_area_sqm  resale_price
31.0            9000
31.0            6000
31.0            8000
31.0            6000
73.0            47200
...             ...
142.0           456000
142.0           408000
146.0           469000
146.0           440000
145.0           484000
The floor_area_sqm column fits within a 16-bit float (max value about 6.55 × 10^4) and can be represented with a float16. resale_price fits into a 32-bit unsigned int (max value about 4.29 × 10^9) and can be represented with a uint32.
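If you want to double-check these bounds yourself, numpy exposes them through np.finfo and np.iinfo:

import numpy as np

print(np.finfo(np.float16).max)  # 65504.0, about 6.55 × 10^4
print(np.iinfo(np.uint32).max)   # 4294967295, about 4.29 × 10^9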
You can inspect the count, min, max and some other statistics for every column in your dataset using df.describe(). It is particularly useful to find the range of numerical columns, and the cardinality (count of unique values) of categorical columns, as sketched below.
- Numerical columns with a small range can be optimized with a smaller type
- The category type for a low cardinality column provides a huge memory gain
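As a quick sketch on the data frame loaded above without optimizations (df_no_opti), describe() gives the numerical ranges and nunique() the cardinality of the candidate categorical columns:

# Ranges of the numerical columns (count, min, max, ...)
print(df_no_opti[["floor_area_sqm", "resale_price"]].describe())

# Cardinality of the candidate categorical columns
print(df_no_opti[["town", "flat_type", "storey_range", "block", "flat_model"]].nunique())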
Let's apply our numerical type optimizations. We use the dtype argument in pd.read_csv to specify a type per column.
def with_numerical_types() -> pd.DataFrame:
    """Load the data with proper numerical types"""
    df = pd.read_csv(
        data_path,
        dtype={
            "floor_area_sqm": np.float16,
            "resale_price": np.uint32,
        },
    )
    return df
The data frame size in memory is now 128.62 MB which is a slight reduction of 2.09%.
Now let's deal with the categorical columns. flat_model, flat_type, storey_range, block and town are good candidates for categorical columns. They have a low cardinality, which indicates a great potential for memory improvement.
Let's apply these optimizations:
def with_numerical_and_categorical_types() -> pd.DataFrame:
    """Load the data with proper numerical and categorical types"""
    df = pd.read_csv(
        data_path,
        dtype={
            "floor_area_sqm": np.float16,
            "resale_price": np.uint32,
            "flat_model": "category",
            "flat_type": "category",
            "storey_range": "category",
            "block": "category",
            "town": "category",
        },
    )
    return df
The data frame size in memory is now 42.54 MB which is a huge reduction of 67.61% compared to the case without optimizations!
Finally, let's also skip the columns we don't use, by passing the same list of columns to usecols:

def with_numerical_and_categorical_types_and_without_unused_columns() -> pd.DataFrame:
    """Load the data with proper numerical and categorical types
    and without unused columns"""
    dtype = {
        "floor_area_sqm": np.float16,
        "resale_price": np.uint32,
        "flat_model": "category",
        "flat_type": "category",
        "storey_range": "category",
        "block": "category",
        "town": "category",
    }
    df = pd.read_csv(data_path, dtype=dtype, usecols=list(dtype.keys()))
    return df
The data frame size in memory is now 3.39 MB which is a huge reduction of 97.42% compared to the case without optimizations!
Handling bigger datasets doesn’t always call for a bigger computer.
Sometimes all you need is to be careful about the types of your data and clear about the columns you really need.