PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance

I’m aware that some will find this reply highly controversial, but I’m still posting my opinion here…

Proposed answer: Ignore the warning. If the user thinks/observes that the code suffers from poor performance, it’s the user’s responsibility to fix it, not the module’s responsibility to propose code refactoring steps.
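For those who agree, the warning can at least be silenced explicitly rather than left to clutter the log. A minimal sketch, using the warning class pandas actually raises (pandas.errors.PerformanceWarning):

```python
import warnings

from pandas.errors import PerformanceWarning

# Silence only this warning class; all other warnings still surface.
warnings.filterwarnings("ignore", category=PerformanceWarning)
```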

Rationale for this harsh reply:
Since migrating to pandas v2.0.0, I am seeing this warning in many different places. The reason is that, at multiple points in the script, I remove records from and add records to DataFrames, using many calls to .loc[] and pd.concat().

Now, we are pretty savvy with vectorization and perform these operations with performance in mind (e.g., never inside a for loop, but rather ripping out an entire block of records at once, such as overwriting some “inner 20%” of the DataFrame after multiple pd.merge() operations – think of it as ETL operations on a database implemented in pandas instead of SQL). The application runs remarkably fast, even though some DataFrames contain ~4.5 million records. More specifically: for one script, I get more than 50 of these warnings logged in under 0.3 seconds, which I, subjectively, do not perceive as particularly “poor performance” (and that is a serial application running in PyCharm’s debugging mode – hardly a setup from which you would expect peak performance in the first place).
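A minimal sketch of the kind of block-wise update described above (column names and sizes are made up for illustration):

```python
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({"key": np.arange(n), "value": np.random.rand(n)})

# Overwrite an "inner 20%" block of rows in one vectorized assignment;
# .loc slicing on integer labels is inclusive on both ends.
lo, hi = int(0.4 * n), int(0.6 * n)
df.loc[lo:hi, "value"] = 0.0

# Drop and re-append records with a boolean mask and a single concat,
# rather than row by row inside a loop.
new_rows = pd.DataFrame({"key": np.arange(n, n + 100), "value": np.zeros(100)})
df = pd.concat([df[df["value"] > 0.5], new_rows], ignore_index=True)
```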

So, I conclude:

  • The code ran under pandas <2.0.0 and never raised a warning
  • The performance is excellent
  • We have multiple colleagues with a PhD in high-performance computing working on the code, and they believe it’s fine
  • Module warning messages should not be abused for ‘tutorials’ or ‘educational purposes’ (however well-intentioned). This is different from, for example, SettingWithCopyWarning (“setting on a copy of a DataFrame”), where the chances are very high that the module’s functional behavior leads to incorrect output – see the sketch after this list. Here, it is a 100% educational message that deserves, if anything, the logger level ‘info’ (if not ‘debug’), certainly not ‘warning’
  • We get an incredibly dirty stdout log, for no reason
  • The warning itself is highly misleading – we don’t have a single call to .insert() anywhere in the entire codebase. The fragmentation that we do have in our DataFrames comes from many iterative, but fast, updates – so thanks for sending us down the wrong path
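To illustrate the contrast mentioned in the list: the copy-on-slice case can actually corrupt results, which is why a warning is justified there. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained assignment may write to a temporary copy instead of df,
# silently producing wrong output; that genuinely deserves a warning.
df[df["a"] > 1]["b"] = 0       # SettingWithCopyWarning; df is likely unchanged

# The correct single-step form:
df.loc[df["a"] > 1, "b"] = 0
```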

We will certainly not refactor code that shows excellent performance and has been tested and validated over and over again, just because someone on the pandas team wants to educate us about things we already know :/ If the performance were actually poor, I would welcome this message as a suggestion for improvement (even then: as an ‘info’, not a warning) – but given the indiscriminate way it currently pops up: for once, it’s actually the module that’s the problem, not the user.
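For completeness: if fragmentation ever did show up as a measured bottleneck, the remedies the full warning text itself points to (joining columns via pd.concat(axis=1), or consolidating via copy()) look roughly like this. Here, pieces is a hypothetical list of column frames:

```python
import pandas as pd

# Hypothetical: single-column frames assembled elsewhere in the pipeline
pieces = [pd.DataFrame({f"col{i}": range(3)}) for i in range(5)]

# Adding columns one at a time fragments the frame's internal blocks;
# a single concat along axis=1 builds it consolidated from the start.
df = pd.concat(pieces, axis=1)

# A one-off copy() consolidates an already-fragmented frame.
df = df.copy()
```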

Edit: This is 100% the same issue as the warning PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance – which, despite warning me about “performance”, pops up 28 times (!) in less than 3 seconds, again in PyCharm’s debugging mode. I’m pretty sure removing the warning alone would improve performance by 20% (or 20 ms per operation ;)). It, too, only started appearing as of pandas v2.0.0 and should be removed from the module altogether.
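Should anyone prefer to address that warning rather than suppress it, lexsorting the index once up front is usually enough. A minimal sketch with a toy index:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["b", "a"], [2, 1]], names=["outer", "inner"])
df = pd.DataFrame({"x": range(4)}, index=idx)

# The index above is not lexsorted, so label-based drops trigger the warning.
# Sorting once keeps subsequent drops on the fast path.
df = df.sort_index()
df = df.drop(index=("a", 1))
```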
