How can I speed up reading multiple files and putting the data into a dataframe?

I’ve used this many times, as it’s a particularly easy implementation of multiprocessing.

import pandas as pd
from multiprocessing import Pool

def reader(filename):
    return pd.read_excel(filename)

def main():
    file_list = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx', ...]  # replace with your actual file names
    with Pool(4) as pool:  # 4 = number of worker processes you want to use
        df_list = pool.map(reader, file_list)  # loads the files in parallel, one DataFrame each
    df = pd.concat(df_list)  # concatenates all the DataFrames into a single DataFrame
    return df

if __name__ == '__main__':
    df = main()
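
In the snippet above, the file names are written out by hand. If your files all live in one folder, you can build file_list automatically instead; here is a minimal sketch using the standard-library glob module, where the 'data/*.xlsx' folder and pattern are assumptions to adjust to your own setup:

import glob

file_list = sorted(glob.glob('data/*.xlsx'))  # hypothetical folder/pattern; sorted() gives a stable file order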

Using this, you should be able to substantially increase the speed of your program without much work at all. If you don’t know how many processors you have, you can check by opening a command prompt and typing (on Windows):

echo %NUMBER_OF_PROCESSORS%
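
If you’d rather check from Python itself (which works the same on any operating system), the standard library can report the count directly; a minimal sketch:

import multiprocessing

print(multiprocessing.cpu_count())  # number of logical CPUs the OS reports

You can pass this value to Pool() instead of hard-coding 4.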

EDIT: To make this run even faster, consider converting your files to CSV and using the pandas function pandas.read_csv, which is typically much faster than pandas.read_excel.
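
With that change, only the reader function needs to be swapped out; a minimal sketch, assuming the converted files keep the same base names with a .csv extension:

import pandas as pd

def reader(filename):
    return pd.read_csv(filename)  # typically much faster than pd.read_excel

The Pool, map, and concat steps stay exactly the same.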
