The rule of thumb I usually apply: use the simplest data structure that still satisfies your needs. Ranked from simplest to most complex, the options usually look like this:
- Dictionaries / lists
- Numpy arrays
- Pandas series / dataframes
So first consider dictionaries / lists. If these support all the data operations you need, you are done. If not, start considering numpy arrays. Some typical reasons for moving to numpy arrays are:
- Your data is 2-dimensional (or higher). Although nested dictionaries/lists can represent multi-dimensional data, numpy arrays are usually more efficient in both memory and speed, and they let you slice along any axis.
- You have to perform a lot of numerical calculations. As zhqiat already pointed out, numpy gives a significant speed-up in this case, because operations run over whole arrays instead of looping in Python. Furthermore, numpy arrays come bundled with a large number of mathematical functions.
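To make the two points above concrete, here is a minimal sketch comparing nested lists with a numpy array. The data and the variable names (`nested`, `arr`, `scaled`) are made up for illustration.

```python
import numpy as np

# Hypothetical 2-D data: 3 rows (samples) x 2 columns (features).
nested = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# With nested lists, a per-column mean needs an explicit comprehension.
col_means_list = [sum(row[j] for row in nested) / len(nested) for j in range(2)]

# With a numpy array, the same operation is a single call along an axis.
arr = np.array(nested)
col_means_np = arr.mean(axis=0)

# Element-wise maths applies to the whole array at once, no Python loop.
scaled = np.sqrt(arr) * 2

print(col_means_list)
print(col_means_np)
print(scaled.shape)
```

For three rows the difference hardly matters; the array version tends to pay off once the data gets large or the calculations pile up.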
Then there are also some typical reasons for going beyond numpy arrays to the more complex, but also more powerful, pandas series/dataframes:
- You have to merge multiple data sets, or reshape/reorder your data. This diagram gives a nice overview of all the ‘data wrangling’ operations that pandas allows you to do.
- You have to import data from, or export data to, a specific file format or data store such as Excel, HDF5 or a SQL database. Pandas comes with convenient import/export functions for this.
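As a rough illustration of both points, the sketch below merges two small tables, reorders and reshapes the result, and round-trips it through SQL. The tables, column names and the table name `orders_with_names` are invented for the example, and an in-memory SQLite database stands in for a real one.

```python
import sqlite3

import pandas as pd

# Hypothetical data sets that share a "customer_id" key.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [10.0, 5.0, 7.5]})

# Merge: join the two tables on the shared key (inner join by default,
# so customers without orders are dropped).
merged = customers.merge(orders, on="customer_id")

# Reorder: largest orders first.
by_amount = merged.sort_values("amount", ascending=False)

# Reshape: one row per customer name, with the summed order amount.
totals = merged.pivot_table(index="name", values="amount", aggfunc="sum")

# Export to SQL and import it again via an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
merged.to_sql("orders_with_names", conn, index=False)
round_trip = pd.read_sql("SELECT * FROM orders_with_names", conn)
conn.close()

print(totals)
print(round_trip)
```

Excel and HDF5 work along the same lines through `to_excel`/`read_excel` and `to_hdf`/`read_hdf`, although those need optional dependencies installed (for example openpyxl or PyTables).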