Pandas Performance Improvement Tips

Below are some tips to improve/optimize the performance of your Pandas operations. These tips are helpful when you have a lot of data, a lot of processing, or both. I have linked to the source articles where you can learn more about each topic.

Downcast

For memory optimization

Source: https://www.dataquest.io/blog/pandas-big-data/

Internally, each DataFrame is stored as blocks of columns grouped by datatype.

Int & float blocks are represented as NumPy ndarrays, which are built around C arrays and stored in contiguous blocks of memory, making them fast to slice.

np.iinfo(np.int32) -> gets us the min & max values for the given integer type (np.finfo does the same for floats)
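As a quick sketch of checking these type ranges with the standard NumPy API:

```python
import numpy as np

# Integer type ranges via np.iinfo
i8 = np.iinfo(np.int8)
print(i8.min, i8.max)   # -128 127

# Float type ranges via np.finfo
f32 = np.finfo(np.float32)
print(f32.max)
```

Knowing these ranges tells you which smaller subtype a column can safely be downcast to.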

We can use the `downcast` parameter of `pd.to_numeric` to automatically convert pandas int & float columns to the smallest possible subtype (int8/16/32 or float32):

# Select int columns & downcast to the lowest required bytes
gl_int = gl.select_dtypes(include=['int'])
converted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')
# Select float columns & downcast to the lowest required bytes
gl_float = gl.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric, downcast='float')
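A minimal self-contained sketch of the same idea (the DataFrame and column names here are synthetic, made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic frame: columns default to int64/float64
df = pd.DataFrame({
    'counts': np.arange(1000),        # values fit easily in uint16
    'ratios': np.random.rand(1000),   # float64 by default
})

before = df.memory_usage(deep=True).sum()

# Downcast each numeric column to the smallest sufficient subtype
df['counts'] = pd.to_numeric(df['counts'], downcast='unsigned')
df['ratios'] = pd.to_numeric(df['ratios'], downcast='float')

after = df.memory_usage(deep=True).sum()
print(df.dtypes)             # counts -> uint16, ratios -> float32
print(before, '->', after)   # memory usage drops
```

The counts column ends up as uint16 (uint8 tops out at 255, too small for values up to 999), and the float column drops from 8 bytes to 4 bytes per value.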

Converting strings to datetimes

5–10x Faster, at times up to 35x Faster

Source: https://realpython.com/fast-flexible-pandas/

When we use the function below to convert a Python string to a datetime, Pandas uses the dateutil package to parse each string individually, because it assumes the datetime format could be different in each row.

pd.to_datetime(df[column_name])

Conversely, if the raw datetime data is already in ISO 8601 format: YYYY-MM-DD HH:MM, Pandas can immediately take a fast route to parsing the dates.

You can also explicitly specify the format to make it faster:

pd.to_datetime(df[column_name], format='%d/%m/%y %H:%M')

Alternatively, pass the infer_datetime_format=True parameter. It is also available in read_csv, along with the related parse_dates and date_parser parameters.
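A small runnable sketch of the explicit-format path (the sample timestamps are made up; dayfirst=True is assumed on the inferred path so both parse day-first):

```python
import pandas as pd

s = pd.Series(['28/03/17 05:00', '28/03/17 06:00'])

# Without a format, pandas has to work out how to parse the strings
dt_inferred = pd.to_datetime(s, dayfirst=True)

# With an explicit format, parsing takes the fast path directly
dt_explicit = pd.to_datetime(s, format='%d/%m/%y %H:%M')

print(dt_explicit.equals(dt_inferred))
```

Both return the same datetimes; the explicit format simply skips the per-row format inference that makes the slow path slow.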

Looping through each row

30x Faster

Source: https://realpython.com/fast-flexible-pandas/

Worst: Looping with an index and fetching each row manually using .iloc is the slowest, and it can also run into the chained-indexing problem, which is a bad coding practice (https://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy)

Bad: (5x faster than looping) Using .iterrows() to get one row at a time and process it. .itertuples() is also an option and is faster than .iterrows()

Good: (2x faster than iterrows) Using .apply() iterates through the rows in Cython and hence is faster. However, the actual processing is still performed in Python.

Better: (27x faster than .apply, 315x faster than the loop) This uses vectorized operations, so entire parts of the DataFrame can be processed at once. Use the isin operator to get a mask if you need to process only a part of the DataFrame:

peak_hours = df.index.hour.isin(range(17, 24))
df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
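A self-contained sketch of the masked vectorized assignment (the hourly index and the 12/28-cent rates are synthetic, for illustration only):

```python
import numpy as np
import pandas as pd

# Synthetic hourly energy data: 48 hours of 1 kWh each
idx = pd.date_range('2024-01-01', periods=48, freq='h')
df = pd.DataFrame({'energy_kwh': np.ones(48)}, index=idx)

# Boolean mask selects the peak hours (17:00-23:00) in one shot
peak_hours = df.index.hour.isin(range(17, 24))

df['cost_cents'] = df['energy_kwh'] * 12                                 # off-peak default
df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28  # peak override

print(df['cost_cents'].sum())
```

The mask is a plain NumPy boolean array, so the assignment touches all matching rows at once instead of row by row.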

Best: (2x+ faster than vectorized) When several sub-parts of the DataFrame need to be processed differently, you can create another column first and then process everything at once instead of handling each part individually:

cents_per_kwh = pd.cut(x=df.index.hour,
                       bins=[0, 7, 17, 24],
                       include_lowest=True,
                       labels=[12, 20, 28]).astype(int)
df['cost_cents'] = cents_per_kwh * df['energy_kwh']
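The same binning idea as a runnable sketch (the data is synthetic; the 12/20/28-cent tariffs are the article's):

```python
import numpy as np
import pandas as pd

# One synthetic day of 1 kWh per hour
idx = pd.date_range('2024-01-01', periods=24, freq='h')
df = pd.DataFrame({'energy_kwh': np.ones(24)}, index=idx)

# Bin every hour into its tariff in a single pass:
# hours [0, 7] -> 12, (7, 17] -> 20, (17, 23] -> 28
cents_per_kwh = pd.cut(x=df.index.hour,
                       bins=[0, 7, 17, 24],
                       include_lowest=True,
                       labels=[12, 20, 28]).astype(int)
df['cost_cents'] = cents_per_kwh * df['energy_kwh']

print(df['cost_cents'].sum())
```

pd.cut builds the per-row rate column in one vectorized step, so the final multiplication covers all three tariff groups at once.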

Parallelization

3–4x Faster (based on number of cores)

Source: https://towardsdatascience.com/heres-how-you-can-get-a-2-6x-speed-up-on-your-data-pre-processing-with-python-847887e63be5

By default, Python programs execute as a single process using a single CPU. Most modern machines made for machine learning have at least 2 CPU cores. That means, for the example of 2 CPU cores, that 50% or more of your computer’s processing power won’t be doing anything by default when you run your pre-processing! The situation gets even worse with 4 cores (modern Intel i5) or 6 cores (modern Intel i7).

import concurrent.futures
import glob

import cv2

def load_and_resize(image_filename):
    img = cv2.imread(image_filename)
    return cv2.resize(img, (600, 600))

# Boots up as many Python processes as you have CPU cores
with concurrent.futures.ProcessPoolExecutor() as executor:
    image_files = glob.glob("*.jpg")  # Get all images
    # Split the work across the process pool to use all CPUs
    executor.map(load_and_resize, image_files)

Profile

Don’t optimize prematurely. Use profiling to find the bottleneck.

Source: https://www.quora.com/How-do-I-speed-up-my-Python-code

It is easy to profile code in Python. The following command gives you a complete analysis of how many times each function gets executed, how much time it consumes, etc. Identify the ‘hot’ parts of your code: functions that take large chunks of time or are executed many times (typically function calls inside loops).

python -m cProfile my_code.py

1. Use builtin functions as much as possible. Builtins like map are implemented in C. Also use C equivalents where they exist; for example, on Python 2 use cPickle instead of pickle (in Python 3, pickle uses the C implementation automatically).
2. Avoid global variables as much as possible. Python is very slow at accessing globals. If necessary, make a local copy of a global variable that is used repeatedly inside a loop.
3. Python is also slow at resolving function addresses, so if you are going to call a method often (inside a loop), consider binding it to a local variable first:

myfunc = myObj.func
for i in range(n):
    myfunc(i)  # faster than myObj.func(i)

4. PyPy is a JIT (just-in-time) compiled implementation of Python. It often runs code in half the time of CPython.

5. If your program spends a lot of time waiting for database reads/writes, network operations, etc., then consider using asynchronous libraries.
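The profiling step above can also be driven from inside Python using the standard cProfile/pstats API (the hot_function here is a made-up stand-in for a hot spot in your own code):

```python
import cProfile
import io
import pstats

def hot_function():
    # Stand-in for a 'hot' part of your code
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
hot_function()
profiler.disable()

# Print the 5 most time-consuming entries, sorted by cumulative time
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats('cumulative').print_stats(5)
print(out.getvalue())
```

This lets you profile just one suspect region instead of the whole script, which is handy once `python -m cProfile` has pointed you at the general area.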

Lead Data Scientist @ Facebook