An update on Polars v0.15.0/0.8.17

August 2, 2021 by Ritchie Vink

I am deeply in debt when it comes to quality documentation, regular project updates, a project website, and so on. I mostly prioritize development work because I want things to get better, faster. But here goes: one long-overdue project update on Polars.

Quick reminder

Polars is an in-memory DataFrame library written in Rust that exposes an API in both Rust and Python. It has tight interoperability with Apache Arrow and uses Arrow memory as its columnar store. It has two public APIs:

  • Eager: an imperative API that bears some resemblance to Pandas.
  • Lazy: an API that builds a logical plan and transpiles it to a query plan, which is executed on a vectorized query engine built on top of Eager. I’ve heard that some people think it is built on top of DataFusion. This is not the case, but interoperability with DataFusion is possible.

Performance

Performance has seen steady improvements, with lots of small tweaks adding up and a few big bumps. Going through some of them would be an interesting post of its own.

At the time of writing this post, Polars is the fastest performing solution in the db-benchmark! 🎉🎉🎉. The image below shows the summary reports of the largest dataset that could be run in memory in the following order:

  • join question
  • groupby advanced questions
  • groupby basic questions

Besides that, I have also received very positive feedback from users who brought their ETL time down from hours to minutes.

db-benchmark results

Lazy engine + Python API

The lazy query engine is maturing. In the Python API, I am moving more and more onto this engine, and most of the imperative code in Python is actually syntactic sugar that wraps Polars lazy. The expression syntax has turned out to be very flexible and performant. By moving more onto the lazy engine, users are able to use the expression syntax even when they don’t want to commit to a fully lazy API.

An additional benefit is that I can focus on a single entry point in the API. The lazy API is by far the preferable one, as it gives me a lot more context on the whole query and makes more optimizations possible.
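
To illustrate why seeing the whole query helps, here is a toy sketch of one classic optimization, predicate pushdown, on a hand-rolled logical plan. This is not how Polars is implemented; the `Scan`, `Filter`, `optimize` and `execute` names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Scan:
    """Leaf node: reads rows, optionally filtering while it reads."""
    rows: list
    predicate: Optional[Callable[[dict], bool]] = None


@dataclass
class Filter:
    """Keeps only the rows of `child` for which `predicate` is true."""
    child: object
    predicate: Callable[[dict], bool]


def optimize(plan):
    """Predicate pushdown: fold a Filter sitting directly on a Scan into the Scan."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Scan) and child.predicate is None:
            return Scan(child.rows, plan.predicate)
        return Filter(child, plan.predicate)
    return plan


def execute(plan):
    """Run a plan and return the resulting rows."""
    if isinstance(plan, Scan):
        if plan.predicate is None:
            return list(plan.rows)
        return [row for row in plan.rows if plan.predicate(row)]
    if isinstance(plan, Filter):
        return [row for row in execute(plan.child) if plan.predicate(row)]
    raise TypeError(f"unknown plan node: {plan!r}")


rows = [{"id": i, "value": i * 10} for i in range(5)]
plan = Filter(Scan(rows), lambda row: row["value"] >= 30)
optimized = optimize(plan)

# Same answer either way, but the optimized plan never materializes
# the rows that the filter would throw away afterwards.
print(execute(plan) == execute(optimized))
```

An eager API executes each call as it arrives, so it never gets the chance to rewrite the plan like this; a lazy API sees the filter before the scan has run.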

New features

TLDR: A lot. See the changelogs (where the Python one is best maintained):

Most important is that Polars has by now implemented most of what you need from a DataFrame library. If you do miss anything, please let me know!

Roadmap

I feel the project has been stabilizing a lot in recent months. And I must say, it feels good to be on this side of the work that has to be done (with most of it behind you).

I do plan to explore out-of-core functionality, but before we start on that large endeavour, I first want to spend some more time cleaning up, polishing the API, and gathering user suggestions.

Contributors

I want to thank the contributors who are so kind as to help me on this project, whether by developing code or by being a user and giving me feedback and suggestions. The latter is just as valuable, as I have learned that it’s hard to predict from your ivory tower how people want to use your library.

Getting started?

Want to try Polars and get a blazingly fast DataFrame experience? Take a look at the following documents:

(c) 2020 Ritchie Vink.