Row 61674
Content Data
This page contains data entry 61674 from the Axioma AXP content repository. The structured data below represents the complete record for this entry.
Last week I published a blog post on running Spark, Dask, DuckDB, and Polars on the TPC-H benchmark across a variety of scales (10 GiB, 100 GiB, 1 TiB, 10 TiB), both locally on a MacBook Pro and in the cloud. It’s a broad set of configurations. The results are interesting.
No project wins uniformly. They all perform differently at different scales:
* DuckDB and Polars are crazy fast on local machines
* Dask and DuckDB seem to win on cloud and at scale
* Dask ends up being most robust, especially at scale
* DuckDB does shockingly well on large datasets on a single large machine
* Spark performs oddly poorly, despite being the standard choice 😢
Tons of charts in this post to try to make sense of the data. If folks are curious, here’s the post:
[https://docs.coiled.io/blog/tpch.html](https://docs.coiled.io/blog/tpch.html)
Performance isn’t everything, of course. Each project has its die-hard fans and critics for loads of different reasons. Anyone want to attack/defend their dataframe library of choice?
| Field | Value |
|---|---|
| label | r/datascience |
| dataType | post |
| communityName | r/datascience |
| datetime | 2024-05-23 |
| username_encoded | Z0FBQUFBQm5Lak1aUm9nNncyaVBCYi1tMEU2OXZUZUZ2OWliTWN5YjlodlJhdnFPVGEyVWRyd25GVHRlaXZMc2ZycVUtV1U1N0ZnZnBWR29xTlFKYko1amsyT05Pa3J5ZVE9PQ== |
| url_encoded | Z0FBQUFBQm5Lak9weE5WS0xDV2ZXUy1DQlF1TllTejdGM0JFTTZ5MldoeHNuMEZpbWg5WGxNNW1wMzJjWV9GYVpuLThQVkRpRVoyOGdUUzg1anRmUnZsRmJyMU1LdDFXYmdYdG1ma0Y5MXlBX0J5QW85cFpjNG5jd0lxeUxmWm9WQUp2ek1NSmo2a0ZEWk9JWmJBcUpYR0VJUUplbzhaZFc4dGtjWWNYcGJYYWVVVEVaMmFycGhLZ204OEdScGhCUVpacmNsWGl1MFZiM0lkWHBsejRPeEloQnRHTzJ2Rzg1Zz09 |
Entry Information
- Entry ID: 61674
- Repository: Axioma AXP
- Dataset: arrmlet/reddit_dataset_36
- Total Entries: 100,000