Like, dude, why are you still using pandas/black/poetry/(another non-rust python util)?
Pandas is synonymous with data science. I used to work more in the data space: I was Head of Data Science at Waybridge and did a ton of data science in my own startup, Wattson Blue. More recently, though, I have been doing more software engineering, so I have not been actively looking at the most performant way of playing with data (since everyone just uses pandas!).
Being on holiday this week, I decided to finally investigate the different libraries and options for myself. In particular, I decided to give Polars a try, and OMG, it's not even close.
For me it's a bit like the first time I tried Ruff (as a replacement for black and isort), or the first time my intern Danish showed me uv (in place of Poetry).
Like, dude, why are you still using pandas/black/poetry/(another non-rust python util)?
For example, compare loading the IMDb titles and ratings TSV files on my M1 Pro MacBook with 16GB of RAM:
(feb2025) ➜ Feb2025 git:(master) ✗ uv run python imdb.py
'load'('title.basics.tsv', 'pandas'), {}) took: 13.4196s - shape: 11mm, 9
'load'('title.ratings.tsv', 'pandas'), {}) took: 0.4610s - shape: 1mm, 3
'load'('title.basics.tsv', 'polars'), {}) took: 0.5464s - shape: 11mm, 9
'load'('title.ratings.tsv', 'polars'), {}) took: 0.0222s - shape: 1mm, 3
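For context, the load calls being timed boil down to something like this (a simplified sketch - the real script wraps them in a timing decorator and passes a few more options):

import pandas as pd
import polars as pl

# IMDb dumps are tab-separated; same file, two readers.
pandas_basics_df = pd.read_csv("title.basics.tsv", sep="\t")
polars_basics_df = pl.read_csv("title.basics.tsv", separator="\t")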
Even if I go small and load tiny CSVs with, say, 1,000 rows, Polars is still substantially quicker - so the "but my data is not that big" excuse does not apply, I am afraid.
And the helpful errors and warnings - side note: why do I not write errors and warnings this good?
PolarsInefficientMapWarning: Expr.map_elements is significantly slower than the native expressions API. Only use if you absolutely CANNOT implement your logic otherwise. Replace this expression...
- pl.col("numVotes").map_elements(lambda x: ...)
with this one instead:
+ pl.col("numVotes") / 1000000.0
Thanks Polars.
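To make that concrete, here is a minimal sketch of the two versions (the column name comes from the warning above; the DataFrame is just a toy stand-in):

import polars as pl

ratings = pl.DataFrame({"numVotes": [12, 34_567, 1_000_000]})

# Slow path: calls back into Python for every row, and triggers the warning above
ratings.with_columns(pl.col("numVotes").map_elements(lambda x: x / 1_000_000.0, return_dtype=pl.Float64))

# Fast path: stays inside Polars' native expression engine
ratings.with_columns(pl.col("numVotes") / 1_000_000.0)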
That's faster too! Here is a simple example, looking at a breakdown of movie ratings by year - the code could surely be made more efficient. First, in pandas:
pandas_full = pd.merge(pandas_basics_df, pandas_ratings_df, on='tconst', how='outer')
pandas_full['roundedRating'] = ((pandas_full['averageRating'] * 2).round(0) / 2)
pandas_year_and_rating_df = pandas_full.groupby(['startYear', 'roundedRating'])['numVotes'].count()
vs 1s in polars
full = polars_basics_df.join(polars_ratings_df, on='tconst', how='full')
full = full.with_columns(
((pl.col('averageRating') * 2).round(0) / 2).alias('roundedRating'))
year_and_rating_df = full.group_by(['startYear', 'roundedRating']).len()
So again, massive gains to be had.
(And this is not even the most efficient way of writing these: you can use lazy evaluation and chain all the commands together to get an extra 20% or so in gains, although you can get similar gains on the pandas side too, so it's best to keep it simple.)
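For the curious, a lazy, chained version of the same breakdown might look roughly like this (assuming the same column names; the exact read options are stand-ins):

import polars as pl

year_and_rating_df = (
    pl.scan_csv("title.basics.tsv", separator="\t")
    .join(pl.scan_csv("title.ratings.tsv", separator="\t"), on="tconst", how="full")
    .with_columns(((pl.col("averageRating") * 2).round(0) / 2).alias("roundedRating"))
    .group_by(["startYear", "roundedRating"])
    .agg(pl.len())
    .collect()  # nothing is read or computed until here
)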
You might have heard this rule of thumb from the creator of pandas (see here)
pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset
Although pandas has tried to address this issue with the new PyArrow backend, you don't have such issues in Polars:
Zero-copy operations: Polars uses Apache Arrow under the hood, which means many operations don't need to copy data in memory. In pandas, simple operations often create new copies of your data.
Lazy evaluation: Polars can optimise your query plan before execution. Write something like:
df.filter(pl.col("value") > 0).group_by("category").agg(pl.col("value").sum())
And Polars will figure out the most efficient way to run it.
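To actually hand Polars the full plan to optimise, wrap the query in .lazy() and only .collect() at the end - you can even ask to see the optimised plan first. A minimal sketch with a toy dataframe:

import polars as pl

df = pl.DataFrame({"category": ["a", "a", "b"], "value": [1, -2, 3]})  # toy stand-in

lazy_query = (
    df.lazy()
    .filter(pl.col("value") > 0)
    .group_by("category")
    .agg(pl.col("value").sum())
)
print(lazy_query.explain())  # prints the optimised query plan before anything runs
result = lazy_query.collect()  # only now does any work happen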
Polars just seems smarter about memory management - see more in this Stack Overflow post.
As described in the Polars documentation here, scikit-learn and XGBoost work out of the box with Polars dataframes. Moreover, thanks to LLMs, migrating old pandas code to Polars should be a doddle.
So stop it with the excuses, learn the Polars API, switch to polars, and let's reduce the memory requirements on all those docker containers.
Another quick performance reminder ('of course I knew this already!' I hear you say): numpy arrays that contain native Python floats can be substantially slower than arrays with numpy-specific data types (e.g. numpy.float64).
Just compare the following ways of doing a matrix multiplication (a 10k x 1k matrix multiplied by its transpose) - a rough sketch of the benchmark follows the timings below:
1. A numpy 2D-array, used directly
2. A PyArrow-backed pandas dataframe converted to numpy using df.to_numpy(dtype=np.float64)
3. A PyArrow-backed pandas dataframe converted to numpy using df.to_numpy()
4. A polars dataframe converted to numpy using df.to_numpy()
5. A standard pandas dataframe converted to numpy using df.to_numpy()
Feb2025 git:(master) ✗ uv run python example_polars_vs_pands2.py
direct mat mul time: 0.0037 seconds
pyarrow np.float64 mat mul time: 0.0086 seconds
pyarrow float mat mul time: 24.7052 seconds
polars mat mul time: 0.0078 seconds
pandas mat mul time: 0.0039 seconds
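Here is a rough sketch of how the comparison can be reproduced (the random data and the timing helper are simplifications of my script):

import time
import numpy as np
import pandas as pd
import polars as pl

rng = np.random.default_rng(0)
a = rng.random((10_000, 1_000))  # plain numpy float64 matrix

pandas_df = pd.DataFrame(a)
pyarrow_df = pandas_df.convert_dtypes(dtype_backend="pyarrow")
polars_df = pl.from_numpy(a)

def time_mat_mul(label, arr):
    # multiply by the transpose so the shapes line up, and report the wall time
    start = time.perf_counter()
    _ = arr @ arr.T
    print(f"{label} mat mul time: {time.perf_counter() - start:.4f} seconds")

time_mat_mul("direct", a)
time_mat_mul("pyarrow np.float64", pyarrow_df.to_numpy(dtype=np.float64))
time_mat_mul("pyarrow float", pyarrow_df.to_numpy())  # ends up with Python floats - the slow one
time_mat_mul("polars", polars_df.to_numpy())
time_mat_mul("pandas", pandas_df.to_numpy())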
So keep that in mind: when converting dataframes to numpy arrays, you might be better off specifying the data type explicitly, e.g.
array = df.to_numpy(dtype=np.float64)
That dtype argument could really save your bacon down the road.
I do acknowledge that Polars is slower than pandas here; however, when you go up to a 10k x 10k matrix, the performance is much closer:
direct mat mul time: 2.5243 seconds
np.float64 mat mul time: 2.6598 seconds
[skipping the pyarrow float method as it will take too long]
polars mat mul time: 2.6275 seconds
pandas mat mul time: 2.5722 seconds
Well, I tried to be clever and use pandas' PyArrow backend - however, this ends up giving Python float rather than numpy.float64 by default when converting dataframes to numpy arrays. So watch out!
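A quick way to check what you are actually getting back (convert_dtypes with the pyarrow backend here is just a stand-in for however your Arrow-backed frame was created):

import numpy as np
import pandas as pd

arrow_df = pd.DataFrame({"x": [1.0, 2.0]}).convert_dtypes(dtype_backend="pyarrow")

print(arrow_df.to_numpy().dtype)                  # if this says 'object', you're in slow Python-float territory
print(arrow_df.to_numpy(dtype=np.float64).dtype)  # float64 - the fast path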
It's all too easy to save data to CSV when we want to "cache" some results to come back to later. However, there is very little reason to do that, unless you want to explore the CSV manually.
Parquet is faster on all fronts, especially if you're stuck with pandas - plus you get proper data types and compression for free.
Loading from a CSV:
'load'('title.basics.tsv', 'pandas/pyarrow'), {}) took: 12.9046s - shape: 11.442166mm, 9
'load'('title.basics.tsv', 'pandas'), {}) took: 13.0133s - shape: 11.442166mm, 9
'load'('title.basics.tsv', 'polars'), {}) took: 0.3630s - shape: 11.442166mm, 9
Same data loaded from a parquet file:
'load_parquet'('title.basics.parquet', 'pandas/pyarrow'), {}) took: 0.3369s - shape: 11.442166mm, 9
'load_parquet'('title.basics.parquet', 'pandas'), {}) took: 7.4443s - shape: 11.442166mm, 9
'load_parquet'('title.basics.parquet', 'polars'), {}) took: 0.3119s - shape: 11.442166mm, 9
(and the file size is a lot smaller by default - in our case 254MB vs 987MB, unzipped!)
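Caching to Parquet is a one-liner in either library - a sketch (file names as before):

import pandas as pd
import polars as pl

# Write the cache once (re-reading the basics file for completeness)...
polars_basics_df = pl.read_csv("title.basics.tsv", separator="\t")
polars_basics_df.write_parquet("title.basics.parquet")

# ...and read it back in either library
pl_df = pl.read_parquet("title.basics.parquet")
pd_df = pd.read_parquet("title.basics.parquet")  # uses pyarrow under the hood by default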
So apparently Narwhals is a library that provides a set of typed APIs/tools that allow you to write functions that support both pandas and polars.
Think of it as training wheels for your Polars journey - you can gradually migrate your codebase while keeping everything working. You write your functions once with Narwhals' typed interfaces, and they'll work with both pandas and polars DataFrames. Pretty neat when you're dealing with a large existing codebase!
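A sketch of what that looks like (the function and the vote threshold are made up for illustration; the column names are the IMDb ones used above):

import narwhals as nw

@nw.narwhalify
def top_rated(df, min_votes: int = 100_000):
    # df can be a pandas or a Polars DataFrame - same code either way
    return df.filter(nw.col("numVotes") >= min_votes).sort("averageRating", descending=True)

# top_rated(pandas_full) gives you back a pandas DataFrame,
# top_rated(full) (the Polars one) gives you back a Polars DataFrame.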