Optimizing Pandas Performance for Large Datasets
Learn advanced techniques to speed up your pandas operations and handle large datasets efficiently.
Working with large datasets in pandas can be challenging. This comprehensive guide covers advanced techniques to optimize performance and memory usage when dealing with big data.
Understanding Memory Usage
Before optimizing, it's crucial to understand how pandas uses memory:
import pandas as pd
import numpy as np
# Check memory usage
df = pd.read_csv('large_dataset.csv')
print(df.info(memory_usage='deep'))
print(f"Total memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# Memory usage by column
for col in df.columns:
    print(f"{col}: {df[col].memory_usage(deep=True) / 1024**2:.2f} MB")
1. Data Type Optimization
Choose Appropriate Data Types
# Before optimization
df = pd.read_csv('large_dataset.csv')
print("Before optimization:")
print(df.dtypes)
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# Optimize data types
optimizations = {
'category_col': 'category',
'small_int_col': 'int8',
'medium_int_col': 'int16',
'boolean_col': 'bool',
'float_col': 'float32'
}
for col, dtype in optimizations.items():
    if col in df.columns:
        df[col] = df[col].astype(dtype)
print("\nAfter optimization:")
print(df.dtypes)
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
Automatic Type Inference
def optimize_dtypes(df):
    """Automatically optimize DataFrame dtypes."""
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
            elif str(col_type)[:5] == 'float':
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
        else:
            num_unique_values = len(df[col].unique())
            num_total_values = len(df[col])
            if num_unique_values / num_total_values < 0.5:
                df[col] = df[col].astype('category')
    return df
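A quick way to verify what this helper buys you is to compare the DataFrame's footprint before and after the call (a minimal sketch, reusing the df loaded above):
before_mb = df.memory_usage(deep=True).sum() / 1024**2
df = optimize_dtypes(df)
after_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory: {before_mb:.2f} MB -> {after_mb:.2f} MB ({100 * (1 - after_mb / before_mb):.1f}% saved)")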
2. Chunking for Large Files
def process_large_csv(file_path, chunk_size=10000, operation='sum'):
    """Process large CSV files in chunks."""
    results = []
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        # Optimize chunk dtypes
        chunk = optimize_dtypes(chunk)
        # Aggregate each chunk; for a mean, keep per-chunk sums and counts
        # so combining chunks gives the exact result, not a mean of chunk means
        if operation == 'sum':
            processed_chunk = chunk.groupby('category')['value'].sum()
        elif operation == 'mean':
            processed_chunk = chunk.groupby('category')['value'].agg(['sum', 'count'])
        results.append(processed_chunk)
    # Combine per-chunk results
    combined = pd.concat(results).groupby(level=0).sum()
    if operation == 'mean':
        return combined['sum'] / combined['count']
    return combined
# Usage
result = process_large_csv('huge_dataset.csv', chunk_size=50000)
3. Vectorized Operations
Avoid Loops
# Slow: Using loops
result = []
for index, row in df.iterrows():
    result.append(row['value'] * 2 + row['bonus'])
# Fast: Vectorized operation
result = df['value'] * 2 + df['bonus']
# Even better: Using eval for complex expressions
result = df.eval('value * 2 + bonus')
Use Built-in Methods
# Slow
df['is_weekend'] = df['date'].apply(lambda x: x.weekday() >= 5)
# Fast
df['is_weekend'] = df['date'].dt.weekday >= 5
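Both versions assume the date column already holds datetimes. If it was read from CSV as plain strings, convert it once up front; this is a minimal sketch using the same column name as the example above:
# Parse strings to datetime64 once so the fast .dt accessors apply
df['date'] = pd.to_datetime(df['date'])
df['is_weekend'] = df['date'].dt.weekday >= 5  # Monday=0, so 5 and 6 are the weekend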
4. Advanced Performance Techniques
Using Categorical Data
# Convert string columns to categorical
df['status'] = df['status'].astype('category')
df['region'] = df['region'].astype('category')
# For low-cardinality string columns (few distinct values repeated many times), this often cuts memory use by 50-90%
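If you already know which columns are low-cardinality, you can have pandas build them as categoricals while reading the file, so the full string objects are never materialized. A sketch, assuming the same file and column names used above:
df = pd.read_csv(
    'large_dataset.csv',
    dtype={'status': 'category', 'region': 'category'},
)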
Parallel Processing with Dask
import dask.dataframe as dd
# Read large CSV with Dask
ddf = dd.read_csv('huge_dataset.csv')
# Perform operations
result = ddf.groupby('category')['value'].sum().compute()
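Dask reads the file as a collection of pandas partitions and only materializes a result when you call .compute(). The partition size is tunable; the 64 MB block size below is an arbitrary choice for illustration:
# Smaller blocks mean more, smaller partitions; pick a size that fits comfortably in RAM
ddf = dd.read_csv('huge_dataset.csv', blocksize='64MB')
print(f"Partitions: {ddf.npartitions}")
# Operations build a lazy task graph; .compute() executes it and returns a pandas object
result = ddf.groupby('category')['value'].sum().compute()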
5. Memory Management
Monitor Memory Usage
import psutil
import os
def get_memory_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024  # MB
print(f"Memory usage: {get_memory_usage():.2f} MB")
Clean Up Unused DataFrames
# Delete unused dataframes
del df_temp
# Force garbage collection
import gc
gc.collect()
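The same idea pays off inside chunked loops, where intermediates pile up if references linger. A sketch, with the aggregation and output filenames chosen only for illustration:
import gc
for i, chunk in enumerate(pd.read_csv('huge_dataset.csv', chunksize=50000)):
    partial = chunk.groupby('category')['value'].sum()  # stand-in for real per-chunk work
    partial.to_csv(f'partial_{i}.csv')
    # Drop references as soon as the chunk's result is written out
    del chunk, partial
gc.collect()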
Conclusion
Optimizing pandas performance requires a combination of techniques:
1. **Memory optimization** through appropriate data types
2. **Chunking** for datasets larger than memory
3. **Vectorized operations** instead of loops
4. **Efficient indexing** and filtering (see the sketch after this list)
5. **Parallel processing** for CPU-intensive tasks
6. **Out-of-core processing** with tools like Dask
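Indexing did not get its own section above, so here is a brief sketch of the idea: repeated boolean filters scan the whole column each time, while lookups against a sorted index are much cheaper. It reuses the category column from the earlier examples:
# Boolean filtering scans the full column on every call
subset = df[df['category'] == 'A']
# A sorted index lets .loc select the same rows via fast index lookups
df_indexed = df.set_index('category').sort_index()
subset = df_indexed.loc['A']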
Implement these techniques progressively, measuring performance improvements at each step. Remember that the best optimization strategy depends on your specific use case and data characteristics.
Performance Benchmarking
Always measure your optimizations:
import time
start_time = time.time()
# Your pandas operation here
end_time = time.time()
print(f"Operation took {end_time - start_time:.2f} seconds")
With chunking and out-of-core tools like Dask you can process datasets far larger than your available RAM, while dtype optimization stretches how much of the data fits in memory at any one time.