When starting a data project, there is always the question of which data processing technology you should use. Most of the time, the answer is Pandas: it is the most popular Python library for the job. So when I discovered this super shiny, fast and elegant library called Polars, I started digging in order to convince my tech lead to choose Polars in the Polars vs Pandas battle. Let me share what I learned on this journey!
The underdog of Data Processing: Polars
What is Polars?
Polars is a DataFrame library written in Rust. It aims to be the best DataFrame library in terms of:
- Memory efficiency: Polars removes unnecessary operations and makes good use of the CPU cache (the fastest memory to access)
- Parallelism: Polars uses the multiple cores of your CPU for most computations.
To achieve this, it relies on:
- Apache Arrow: a columnar in-memory data format designed for efficient analytics. For example, it lets you hand your Polars DataFrame over to any library implementing Arrow (and there are lots: pandas, Spark...), often without copying the data.
- Rust: a programming language whose ownership model catches data races at compile time, which makes concurrent code much safer to write.
We don’t know yet if Polars will be a long-term contender in the field, but right now it is booming.
Its adoption is following the same path as the big names among DataFrame libraries.
Thanks to the Rust bindings for Python (PyO3) and great work from the open source community, it is available to all Python lovers with a simple pip command:
pip install polars
What does the syntax look like?
For the PySpark lovers, it is kind of the same! Here’s a simple aggregation script:
There are some excellent introductions to Polars syntax in the User Guide, especially if you are more familiar with Pandas than PySpark.
Polars vs Pandas in data processing?
You should use it because IT IS FAST!!!!!!
As you can see in the database-like ops benchmark from H2O.ai, Polars is the fastest man alive (remember to check out the benchmark details before taking it for granted).
In every task of the benchmark, Polars outruns the other libraries.
Speed in Polars vs Pandas battle: The winner is Polars!
To temper this benchmark: if you are using a very small amount of data, Polars is going to be slower. Why? Well, as I mentioned earlier, nearly all operations in Polars run in a multi-threaded manner. The setup for that makes it slower when there are too few rows to process.
My advice for writing faster Pandas pipelines: profiling.
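To make the profiling advice concrete, here is a small sketch using the standard library's `cProfile` and `pstats`. The pipeline, its function name `slow_pipeline` and the data are hypothetical; the row-wise `apply` is only there to give the profiler an obvious hotspot to find.

```python
import cProfile
import io
import pstats

import pandas as pd


def slow_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline with a deliberately slow row-wise apply."""
    out = df.copy()
    # Row-wise apply calls a Python function once per row: a classic hotspot.
    out["total"] = out.apply(lambda row: row["price"] * row["qty"], axis=1)
    return out.groupby("city", as_index=False)["total"].sum()


df = pd.DataFrame(
    {
        "city": ["Paris", "Lyon", "Paris", "Lyon"] * 250,
        "price": [10.0, 20.0, 30.0, 40.0] * 250,
        "qty": [1, 2, 3, 4] * 250,
    }
)

# Record every function call made while the pipeline runs.
profiler = cProfile.Profile()
profiler.enable()
result = slow_pipeline(df)
profiler.disable()

# Print the five entries with the highest cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Once the report points at the `apply`, replacing it with the vectorized `out["price"] * out["qty"]` is the kind of fix profiling tends to suggest.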
Polars vs Pandas: What volume can I use?
From my perspective, the data volume you can load is the most important criterion. As a Data Engineer, I need to keep in mind how my data volume will scale, and I can’t allow myself to have Out Of Memory errors in my data pipelines.
And in Pandas, the answer is quite simple: I can load roughly as much data as fits in my RAM. This explains why there are some OOM errors in the H2O benchmark for Pandas when the data volume is too big. So if you want to bypass this limitation, I would strongly recommend switching technology (the official documentation agrees with me).
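A rough way to see where you stand is to measure what a sample of your data costs in memory and extrapolate. The sketch below does that with `DataFrame.memory_usage(deep=True)`; the 16 GB of RAM and the expected 500 million rows are assumptions picked for the example, and the estimate ignores the extra copies Pandas often makes during processing.

```python
import pandas as pd

# Hypothetical sample of the dataset you plan to load.
sample = pd.DataFrame({"id": range(1_000), "label": ["x"] * 1_000})

# deep=True also counts the Python strings held in object columns.
sample_bytes = int(sample.memory_usage(deep=True).sum())
bytes_per_row = sample_bytes / len(sample)

# Assumed figures for the example: expected volume and machine RAM.
expected_rows = 500_000_000
available_ram_bytes = 16 * 1024**3

estimated_bytes = bytes_per_row * expected_rows
fits_in_ram = estimated_bytes < available_ram_bytes

print(f"{bytes_per_row:.0f} bytes per row, about {estimated_bytes / 1024**3:.1f} GiB in total")
print("Fits in RAM" if fits_in_ram else "Will not fit in RAM")
```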
So how does Polars overcome this big limitation of Pandas?
- Apache Arrow in-memory format:
Polars DataFrames are based on the Arrow columnar memory format, in which a column can be stored as a chunked array. A chunked array is basically an array stored in several separate parts of the RAM, each part being called a chunk, which leaves more room for memory optimization.
Pandas DataFrame columns, by contrast, are basically NumPy arrays, each stored in one contiguous block. But you should know that a PyArrow backend is now available in Pandas (yeah, Pandas is still improving, here’s a great article about that!).
Of course, there is more to Apache Arrow. I can recommend this talk if you want to learn more about Arrow and PyArrow!
- Lazy evaluations:
Lazy evaluation is about not computing anything until you really need it. This allows the engine of a lazy technology to optimize the whole query based on the instructions you have given.
In a library like Polars, it could mean not loading a particular column when it is not required by the query. That way, Polars avoids wasting memory!
- Streaming fashion processing:
This is one of the newest features of Polars. It allows processing your data in batches, meaning that data is not processed all at once but piece by piece. This makes it possible to process datasets that do not fit into memory: you can check out this example of 80 GB of data processed with 16 GB of RAM.
Be aware that the final result should fit in memory, unless you use a sink_ function like sink_parquet to write the results of your pipeline to disk (hello fellow data engineers 👋).
Streaming processing has constraints and not all operations are covered by it. You can get an idea of what is possible from this documentation.
Polars vs Pandas - Setup:
It is fair to say both libraries are lightweight and don’t require any heavy setup. A simple
pip install polars
pip install pandas
would do the job (my condolences to PySpark users).
Polars vs Pandas - Learning Curve:
For this section, I can only talk about my personal experience.
In Pandas, you are basically free to do anything. There are always two or three different syntaxes to do the same thing. So achieving your goal is not that hard when you are learning it (except when it comes to indexes, it took me some time to master that part!). The downside comes when you want to write optimized code: you really need to master what you’re doing (a recommended resource to get to this level).
When it comes to Polars, the syntax is different: doing computations with pl.col() is a different way of thinking and is more SQL oriented than Pandas. But as the library is quite young, there is less to learn before you can use it to its fullest! On the other hand, understanding lazy evaluation or streaming can take some time, even if they are powerful features.
To be fair, Pandas wins this by a small margin 🤭.
If you already know PySpark AND Pandas, you may consider yourself a Polars native speaker (the community wrote amazing documentation for you: Coming from Pandas & Coming from Spark).
When to choose Polars vs Pandas?
Polars is a great solution when you:
- want speed
- need/want low setup
- are manipulating larger-than-memory datasets
- find that using Spark is overkill and painful
Any Data Engineer in the room? Yes, I think Polars is perfect for a lot of Data Engineering use cases.
When is Pandas the obvious solution? I guess, when you are doing exploratory data analysis (EDA) and you don’t know Polars that well. But to be honest, I don’t practice intense EDA a lot.
When it comes to ML pipelines, especially with Scikit-learn, Pandas has become a standard, but you just have to call .to_pandas() to convert any Polars DataFrame to a Pandas DataFrame. The conversion goes through Arrow, and it can avoid re-allocating the data when you keep Arrow-backed columns on the Pandas side. So I can’t judge yet, but I can guess that Polars will be better!
Thanks a lot for reading this article. As you can see, I’m really excited about this new library. You can also read my previous articles on very different themes: Business Metrics in ML and Cloud Carbon Footprint.