When to use pandas Series, NumPy ndarrays, or simply Python dictionaries?

The rule of thumb that I usually apply: use the simplest data structure that still satisfies your needs. Ranked from simplest to most complex, the data structures usually end up like this:

  1. Dictionaries / lists
  2. Numpy arrays
  3. Pandas series / dataframes

So first consider dictionaries / lists. If these support all the data operations you need, stick with them. If not, start considering numpy arrays. Some typical reasons for moving to numpy arrays are:

  • Your data is 2-dimensional (or higher). Although nested dictionaries/lists can be used to represent multi-dimensional data, in most situations numpy arrays will be more efficient.
  • You have to perform a lot of numerical calculations. As already pointed out by zhqiat, numpy will give a significant speed-up in this case. Furthermore, numpy arrays come bundled with a large number of mathematical functions.
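
As a rough illustration of both points, here is a minimal sketch that computes column means on a small 2-dimensional data set, once with nested lists and once with a numpy array. The data and variable names are made up for the example.

```python
import numpy as np

# Hypothetical 2-D data: each inner list is one row of measurements.
rows = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]

# Plain Python: transpose with zip, then average each column by hand.
col_means_list = [sum(col) / len(col) for col in zip(*rows)]

# NumPy: the same computation is a single call along axis 0 (down the rows).
arr = np.array(rows)
col_means_np = arr.mean(axis=0)

# Element-wise math applies to the whole array without an explicit loop.
scaled = np.sqrt(arr) * 2
```

For two rows the list version is perfectly fine; the numpy version starts to pay off as the data grows or the calculations pile up.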

There are also some typical reasons for going beyond numpy arrays to the more complex but also more powerful pandas series/dataframes:

  • You have to merge multiple data sets with each other, or do reshaping/reordering of your data. This diagram gives a nice overview of all the ‘data wrangling’ operations that pandas allows you to do.
  • You have to import data from or export data to a specific file format like Excel, HDF5 or SQL. Pandas comes with convenient import/export functions for this.
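
A small sketch of what these two reasons can look like in practice, assuming two made-up data sets that share a "store" column. It uses an in-memory SQLite database for the export step so that nothing beyond pandas and the standard library is needed; the table and column names are hypothetical.

```python
import sqlite3

import pandas as pd

# Two hypothetical data sets sharing a "store" key.
sales = pd.DataFrame({"store": ["A", "B", "C"],
                      "revenue": [300, 100, 200]})
staff = pd.DataFrame({"store": ["A", "B"],
                      "employees": [12.0, 5.0]})

# Merge on the shared key; how="left" keeps every row of `sales`,
# leaving NaN where `staff` has no matching store.
merged = sales.merge(staff, on="store", how="left")

# Reorder the rows by a column and renumber the index.
merged = merged.sort_values("revenue").reset_index(drop=True)

# Export to SQL and read the table back (in-memory SQLite database).
conn = sqlite3.connect(":memory:")
merged.to_sql("stores", conn, index=False)
round_trip = pd.read_sql("SELECT * FROM stores", conn)
conn.close()
```

Doing the same merge with dictionaries or numpy arrays is possible, but you would have to write the key matching and missing-value handling yourself.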
