Data Analysis
10 January 2024
12 min read

Optimizing Pandas Performance for Large Datasets

Learn advanced techniques to speed up your pandas operations and handle large datasets efficiently.


Working with large datasets in pandas can be challenging. This comprehensive guide covers advanced techniques to optimize performance and memory usage when dealing with big data.


Understanding Memory Usage


Before optimizing, it's crucial to understand how pandas uses memory:


import pandas as pd
import numpy as np

# Check memory usage
df = pd.read_csv('large_dataset.csv')
print(df.info(memory_usage='deep'))
print(f"Total memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Memory usage by column
for col in df.columns:
    print(f"{col}: {df[col].memory_usage(deep=True) / 1024**2:.2f} MB")

1. Data Type Optimization


Choose Appropriate Data Types


# Before optimization
df = pd.read_csv('large_dataset.csv')
print("Before optimization:")
print(df.dtypes)
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Optimize data types
optimizations = {
    'category_col': 'category',
    'small_int_col': 'int8',
    'medium_int_col': 'int16',
    'boolean_col': 'bool',
    'float_col': 'float32'
}

for col, dtype in optimizations.items():
    if col in df.columns:
        df[col] = df[col].astype(dtype)

print("\nAfter optimization:")
print(df.dtypes)
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

Automatic Type Inference


def optimize_dtypes(df):
    """Downcast numeric columns and convert low-cardinality strings to category."""
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()
            
            # Downcast integers to the smallest type that holds the observed range
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
            
            # Downcast floats to float32 when the value range allows it
            elif str(col_type)[:5] == 'float':
                if c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
        
        else:
            # Convert object columns to category when cardinality is low
            num_unique_values = df[col].nunique()
            num_total_values = len(df[col])
            if num_unique_values / num_total_values < 0.5:
                df[col] = df[col].astype('category')
    
    return df
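
A quick usage sketch, comparing memory before and after the automatic conversion:


before_mb = df.memory_usage(deep=True).sum() / 1024**2
df = optimize_dtypes(df)
after_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"Optimized dtypes: {before_mb:.2f} MB -> {after_mb:.2f} MB")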

2. Chunking for Large Files


def process_large_csv(file_path, chunk_size=10000, operation='sum'):
    """Aggregate 'value' by 'category' from a large CSV, processing it in chunks."""
    if operation not in ('sum', 'mean'):
        raise ValueError(f"Unsupported operation: {operation}")
    
    sums = []
    counts = []
    
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        # Optimize chunk dtypes before aggregating
        chunk = optimize_dtypes(chunk)
        
        # Keep per-chunk partial sums and counts
        grouped = chunk.groupby('category')['value']
        sums.append(grouped.sum())
        counts.append(grouped.count())
    
    # Combine the partial results across chunks
    total_sum = pd.concat(sums).groupby(level=0).sum()
    if operation == 'sum':
        return total_sum
    
    # For the mean, divide total sums by total counts; averaging
    # per-chunk means would weight unevenly sized chunks incorrectly
    total_count = pd.concat(counts).groupby(level=0).sum()
    return total_sum / total_count

# Usage
result = process_large_csv('huge_dataset.csv', chunk_size=50000)
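
Chunking aside, pandas can also skip work at read time by loading only the columns you need and declaring compact dtypes up front. A minimal sketch against the same hypothetical file:


# Read only the required columns and declare compact dtypes at read time
df = pd.read_csv(
    'huge_dataset.csv',
    usecols=['category', 'value'],
    dtype={'category': 'category', 'value': 'float32'},
)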

3. Vectorized Operations


Avoid Loops


# Slow: Using loops
result = []
for index, row in df.iterrows():
    result.append(row['value'] * 2 + row['bonus'])

# Fast: Vectorized operation
result = df['value'] * 2 + df['bonus']

# Alternative: df.eval avoids intermediate arrays for complex expressions on large frames
result = df.eval('value * 2 + bonus')

Use Built-in Methods


# Slow
df['is_weekend'] = df['date'].apply(lambda x: x.weekday() >= 5)

# Fast: vectorized datetime accessor (requires a datetime64 'date' column,
# e.g. converted with pd.to_datetime)
df['is_weekend'] = df['date'].dt.weekday >= 5
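
The same idea applies to conditional columns. A sketch using np.where on the 'value' column (the threshold is just illustrative):


# Slow: row-wise apply for a conditional column
df['value_band'] = df['value'].apply(lambda v: 'high' if v > 100 else 'low')

# Fast: vectorized with numpy
df['value_band'] = np.where(df['value'] > 100, 'high', 'low')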

4. Advanced Performance Techniques


Using Categorical Data


# Convert string columns to categorical
df['status'] = df['status'].astype('category')
df['region'] = df['region'].astype('category')

# For low-cardinality string columns this often cuts memory usage by 50-90%
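
To verify the saving on your own data, compare per-column memory with and without the conversion. A minimal sketch reusing the 'status' column:


# Compare memory for the object and categorical representations
as_object = df['status'].astype('object').memory_usage(deep=True)
as_category = df['status'].astype('category').memory_usage(deep=True)
print(f"status: {as_object / 1024**2:.2f} MB -> {as_category / 1024**2:.2f} MB")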

Parallel Processing with Dask


import dask.dataframe as dd

# Read large CSV with Dask
ddf = dd.read_csv('huge_dataset.csv')

# Perform operations
result = ddf.groupby('category')['value'].sum().compute()
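
Dask evaluates lazily: reading and aggregation only happen when .compute() is called. A small sketch (the blocksize value is just an illustrative choice):


# Reading is lazy; blocksize controls how the file is partitioned
ddf = dd.read_csv('huge_dataset.csv', blocksize='64MB')

# Building the aggregation only records a task graph
lazy_result = ddf.groupby('category')['value'].sum()

# compute() triggers the actual reading and aggregation
result = lazy_result.compute()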

5. Memory Management


Monitor Memory Usage


import psutil
import os

def get_memory_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024  # MB

print(f"Memory usage: {get_memory_usage():.2f} MB")

Clean Up Unused DataFrames


# Delete unused dataframes
del df_temp

# Force garbage collection
import gc
gc.collect()

Conclusion


Optimizing pandas performance requires a combination of techniques:


1. **Memory optimization** through appropriate data types

2. **Chunking** for datasets larger than memory

3. **Vectorized operations** instead of loops

4. **Efficient indexing** and filtering (see the short sketch after this list)

5. **Parallel processing** for CPU-intensive tasks

6. **Out-of-core processing** with tools like Dask
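
Indexing was not covered in detail above, so here is a minimal sketch of the idea, assuming a hypothetical 'customer_id' column: repeated boolean filters scan the whole column each time, while lookups on a sorted index are much faster.


# Boolean filtering scans the full column on every lookup
subset = df[df['customer_id'] == 12345]

# A sorted index lets pandas locate rows by label instead of scanning
df_indexed = df.set_index('customer_id').sort_index()
subset = df_indexed.loc[12345]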


Implement these techniques progressively, measuring performance improvements at each step. Remember that the best optimization strategy depends on your specific use case and data characteristics.


Performance Benchmarking


Always measure your optimizations:


import time

start_time = time.time()
# Your pandas operation here
end_time = time.time()

print(f"Operation took {end_time - start_time:.2f} seconds")

With these techniques, particularly chunked processing and out-of-core tools like Dask, you can work efficiently with datasets many times larger than your available RAM.
