
3 Simple Ways To Handle Bigger Datasets In Pandas


TLDR

The types we choose to represent our data have an impact on the size of its in-memory representation. There is a multitude of data types, each with a different precision and memory consumption. To be as memory-efficient as possible, you must understand your data and choose the most fine-grained types possible to represent it.

Find the code related to this Bits of ... on GitHub.

Concept

Pandas is amazing at processing data on a single machine, on a single core, as long as the data fits into memory. If you deal with big datasets, you can easily exceed these limits and might have to look into other tools to handle your data. Before moving to distributed computing, with technologies like Spark, there are some simple optimizations that will help you scale to bigger datasets.

The main limiting factor is the amount of memory needed to hold the data. If we reduce the size of the data, we can load more of it, process it faster, and so on. Consider a simple CSV file with two columns: age and job.

CSV with two columns

When you load data into memory, a type must be defined for each column of your dataset, and this is often done for you by default. Pandas uses NumPy arrays to represent each column in your data frame. These NumPy arrays have a dtype, which is the underlying type of all elements in the array. When using pd.read_csv, pandas guesses the data type of each column and assigns NumPy arrays with default types. However, the default types are coarse-grained, using more space than needed to represent the different columns.

In-memory representation of the data loaded by pandas without optimizations

In our case we would get an age column with an int64 type and a job column with an object type. This is sub-optimal: an age has a range of [0, 100], needing only 8 bits instead of 64. Also, a job is not “any string” but is categorical: we most likely have a limited list of possible jobs, and each value is an element of this list.
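As a quick illustration (a minimal sketch with made-up values, mirroring the age/job example above), loading such a CSV without any hints yields the coarse defaults:

import io
import pandas as pd

# Hypothetical CSV matching the age/job example
csv = io.StringIO("age,job\n23,Student\n35,Data Scientist\n41,Student\n")
df = pd.read_csv(csv)
print(df.dtypes)  # age -> int64, job -> object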

Use the proper dtype for numerical columns

Pandas does not infer the best type for our numerical columns. In our case we could represent an age with a uint8 type, because an age is unsigned (always positive) and its range is [0, 100]. However, by default, pandas uses int64, which consumes 8x more memory.

The first step is to enforce optimized types for numerical columns.
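For instance, sticking with the hypothetical age column above, we can pass the dtype argument to pd.read_csv (or call astype after loading):

import io
import numpy as np
import pandas as pd

csv = io.StringIO("age,job\n23,Student\n35,Data Scientist\n41,Student\n")
# Enforce an 8-bit unsigned integer instead of the default int64
df = pd.read_csv(csv, dtype={"age": np.uint8})
print(df["age"].dtype)  # uint8
print(np.dtype(np.uint8).itemsize, np.dtype(np.int64).itemsize)  # 1 byte vs 8 bytes per value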

Use the proper dtype for categorical columns

When a column is categorical with a low cardinality (the number of unique values is much lower than the total number of rows, i.e. lots of repetitions), the string representation is the worst choice. Strings are a costly type: depending on the encoding (utf-8, ascii, …), each character takes between 1 and 4 bytes, which quickly consumes a huge amount of memory when repeated over thousands of rows.

But there is a better way. It’s more efficient to keep a dictionary mapping every unique value to an index stored with a small type, for instance 1 byte, and then repeat only that 1-byte index thousands of times instead of the far bigger complete string. Fortunately, pandas has a dtype that handles this automatically for you: the category dtype.
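A minimal sketch on the same hypothetical job column, either at load time or afterwards with astype:

import io
import pandas as pd

csv = io.StringIO("age,job\n23,Student\n35,Data Scientist\n41,Student\n")
df = pd.read_csv(csv, dtype={"job": "category"})
# Equivalent after loading: df["job"] = df["job"].astype("category")
print(df["job"].dtype)  # category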

It usually brings the biggest improvement in memory consumption. Handling categorical data with plain strings is a huge waste of resources: the heavy str values are repeated in every row, whereas a proper categorical type stores the mapping once and repeats only small indices.

In-memory representation of the data with type optimization. For the AGE column we use uint8 instead of int64. For the JOB column Student is mapped to 0 and Data Scientist is mapped to 1. Then we repeat 0 and 1 in the different data rows instead of the heavy string representation

In the calculation above, we assume that the smallest chunk of memory we can allocate is 1 byte (8 bits). Indeed, we only have two different values in the vocabulary, Data Scientist and Student. To represent two different values we would only need 1 bit: either 0 or 1. But since we assumed that we can allocate at least a 1-byte integer, we get 1 byte for each entry.
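You can peek at this internal representation on a toy series (a sketch with made-up values; pandas assigns the codes itself, so the exact mapping may differ from the figure):

import pandas as pd

jobs = pd.Series(["Student", "Data Scientist", "Student", "Student"], dtype="category")
print(jobs.cat.categories)      # the vocabulary, stored once
print(jobs.cat.codes.dtype)     # int8: 1 byte per row
print(jobs.cat.codes.tolist())  # small integer codes repeated instead of the full strings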

Load only relevant columns

It’s likely that you are not using all the columns of your dataset to conduct a specific analysis. Using the usecols keyword, you can load only the columns of interest. It is a simple trick, but it helps you deal with huge datasets easily.
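A minimal sketch on the same hypothetical CSV, keeping only the age column:

import io
import pandas as pd

csv = io.StringIO("age,job\n23,Student\n35,Data Scientist\n41,Student\n")
# Only the age column is parsed and kept in memory
df = pd.read_csv(csv, usecols=["age"])
print(df.columns.tolist())  # ['age']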

In-memory representation of the data after loading only the relevant columns (assuming we need only the AGE column)

Pandas dtypes include all numpy dtypes plus some pandas-specific types. You can find the detailed list of all types in the pandas and numpy documentations.


In Practice

Let's get our hands dirty on a real dataset to demonstrate the importance of data types. We will use a public dataset from Kaggle entitled Resale Flat Prices in Singapore (between 1990 and 1999).

First we import the needed libraries, and build the path to the dataset.

from pathlib import Path
import numpy as np
import pandas as pd
# Data source:
# https://www.kaggle.com/sveneschlbeck/resale-flat-prices-in-singapore
data_path = Path(__file__).parents[0] / "flat-prices.zip"

Then we define two helper functions:

  1. to_mb: converts a size in bytes (B) to a size in megabytes (MB).
  2. analyze: takes a data frame as input and prints its dtypes as well as a title containing its size in MB.

We use df.memory_usage(deep=True) to get the memory size per column and sum it to get the total size. The argument deep=True is important to get the real amount of memory consumed, including the contents of object columns.

def to_mb(bytes: float) -> float:
    """Convert a size in bytes to megabytes (MB)"""
    return bytes / 1024 ** 2


def analyze(df: pd.DataFrame, title: str) -> float:
    """Print the data frame dtypes and its memory usage in MB, and return the size in MB"""
    bytes = df.memory_usage(deep=True).sum()
    mb = to_mb(bytes)
    print(f"{title}: {mb:.2f} MB")
    print(df.dtypes)
    return mb

Then we load the data using pd.read_csv without any optimizations:

def no_optimization() -> pd.DataFrame:
    """Load the data without any optimization"""
    df = pd.read_csv(data_path)
    return df

...

if __name__ == "__main__":
    # First we start with no optimizations
    df_no_opti = no_optimization()
    mb_no_opti = analyze(df_no_opti, "No optimizations")
No optimizations: 131.36 MB
month object
town object
flat_type object
block object
street_name object
storey_range object
floor_area_sqm float64
flat_model object
lease_commence_date int64
resale_price int64
dtype: object

The data frame takes 131.36 MB in memory. We also take a look at column types and notice that some numerical types can be improved. Let's take a look at the columns floor_area_sqm and resale_price:

floor_area_sqm resale_price
31.0 9000
31.0 6000
31.0 8000
31.0 6000
73.0 47200
... ...
142.0 456000
142.0 408000
146.0 469000
146.0 440000
145.0 484000

The floor_area_sqm column fits within a 16-bit float (max value around 6.55e+04) and can be represented with a float16. resale_price fits into a 32-bit unsigned integer (max value around 4.29e+09) and can be represented with a uint32.
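You can verify these bounds directly with numpy (a quick sanity check, not part of the original script):

import numpy as np

print(np.finfo(np.float16).max)  # ~6.55e4
print(np.iinfo(np.uint32).max)   # 4294967295 (~4.29e9)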

You can inspect the count, min, max and some other statistics for every column in your dataset using df.describe() (pass include="all" to also cover non-numeric columns).

It is particularly useful to find the range of numerical columns, and the cardinality (count of unique values) of categorical columns (see the sketch after the list below).

  • Numerical columns with small range can be optimized with a smaller type
  • The category type for a low cardinality column provides a huge memory gain
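For example, on the data frame we loaded without optimizations (a small sketch, not part of the original script):

# Ranges of the numerical columns
print(df_no_opti.describe())
# Cardinality of candidate categorical columns: low counts -> good "category" candidates
print(df_no_opti["town"].nunique())
print(df_no_opti["flat_type"].nunique())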

Let's apply our numerical type optimizations. We use the dtype argument in pd.read_csv to specify a type per column.

def with_numerical_types() -> pd.DataFrame:
    """Load the data with proper numerical types"""
    df = pd.read_csv(
        data_path,
        dtype={
            "floor_area_sqm": np.float16,
            "resale_price": np.uint32,
        },
    )
    return df

The data frame size in memory is now 128.62 MB which is a slight reduction of 2.09%.

Now let's deal with the categorical columns. flat_model, flat_type, storey_range, block and town are good candidates for categorical columns. They have a low cardinality, which indicates a great memory improvement potential.

Let's apply these optimizations:

def with_numerical_and_categorical_types() -> pd.DataFrame:
    """Load the data with proper numerical and categorical types"""
    df = pd.read_csv(
        data_path,
        dtype={
            "floor_area_sqm": np.float16,
            "resale_price": np.uint32,
            "flat_model": "category",
            "flat_type": "category",
            "storey_range": "category",
            "block": "category",
            "town": "category",
        },
    )
    return df

The data frame size in memory is now 42.54 MB which is a huge reduction of 67.61% compared to the case without optimizations!

Finally, let's combine these optimizations and load only the relevant columns with the usecols argument, reusing the keys of the dtype mapping:

def with_numerical_and_categorical_types_and_without_unused_columns() -> pd.DataFrame:
    """Load the data with proper numerical and categorical types
    and without unused columns"""
    dtype = {
        "floor_area_sqm": np.float16,
        "resale_price": np.uint32,
        "flat_model": "category",
        "flat_type": "category",
        "storey_range": "category",
        "block": "category",
        "town": "category",
    }
    df = pd.read_csv(data_path, dtype=dtype, usecols=list(dtype.keys()))
    return df

The data frame size in memory is now 3.39 MB which is a huge reduction of 97.42% compared to the case without optimizations!

Handling bigger datasets doesn’t always call for a bigger computer.

Sometimes all you need is to be careful about the types of your data and clear about the columns you really need.
