Decathlon has replaced Spark with Polars for part of its data platform. However, the company did not abandon Spark entirely; instead, it focused on using the right tool for each workload. For smaller datasets, Polars proved faster and cheaper.
Why Decathlon replaces Spark with Polars
Decathlon operates a large-scale data platform built on Apache Spark. Traditionally, Spark performs best when processing terabytes of data. However, the data team noticed a growing gap between Spark’s strengths and real workloads.
Many production pipelines processed only gigabytes or even megabytes. As a result, Spark introduced unnecessary overhead. Cluster startup times remained long. Resource usage stayed high. Consequently, cloud costs increased without adding real value.
Spark still worked as designed. Nevertheless, it was no longer the most efficient option for smaller jobs.
Why Polars is replacing Spark for smaller workloads
To address this mismatch, the team evaluated Polars. Polars is an open-source DataFrame library written in Rust, with bindings for Python and other languages. Importantly, it uses Apache Arrow's columnar format as its in-memory representation.
Polars stood out for several reasons. First, its API feels familiar to engineers coming from Spark. Second, it runs efficiently on a single node. Finally, it avoids the operational overhead of distributed clusters.
Because of this, Polars emerged as a strong candidate for lightweight and mid-sized workloads.
Measurable performance improvements
To validate the approach, Decathlon migrated one Spark job to Polars. The job processed a Parquet table of roughly 50 GB. Instead of running on a Spark cluster, it executed inside a single Kubernetes pod.
The results were immediate. Specifically, startup time dropped from about eight minutes to two. Developers received feedback much faster. As a result, iteration speed improved.
Moreover, after enabling Polars’ streaming engine, performance improved further. Streaming processes data in batches, so even datasets larger than available memory can be handled efficiently. In many cases, jobs finished before a Spark cluster could even cold-start.
When Spark is replaced by Polars — and when it isn’t
Decathlon did not attempt a full replacement. Rather, the team defined clear boundaries. Polars is used when datasets stay below 50 GB. Additionally, the data size must remain stable over time. Pipelines should also avoid heavy joins or complex aggregations.
Spark, by contrast, remains critical for large-scale distributed workloads. It continues to power pipelines that benefit from massive parallelism.
There are tradeoffs, however. Running Polars on Kubernetes introduces new operational complexity. For example, teams must manage containers and security policies. Therefore, adoption requires coordination with platform and data operations teams.
A broader lesson for data teams
Decathlon’s experience highlights a broader trend in data engineering. Increasingly, teams select tools based on workload characteristics, not tradition.
Decathlon replaces Spark with Polars selectively. In doing so, it reduces infrastructure costs, shortens feedback loops, and improves developer productivity. Ultimately, the goal is efficiency, not disruption.
For other organizations, the takeaway is clear. Distributed systems are powerful. However, they are not always necessary. In many cases, lighter tools deliver faster results at a lower cost.