Row 61674

Row ID: 61674 | Dataset Entry | Axioma AXP Content Repository

I hit publish last week on a blog post about running Spark, Dask, DuckDB, and Polars on the TPC-H benchmark at several scales (10 GiB, 100 GiB, 1 TiB, 10 TiB), both locally on a MacBook Pro and in the cloud. It’s a broad set of configurations, and the results are interesting.

No project wins uniformly. They all perform differently at different scales:

* DuckDB and Polars are crazy fast on local machines
* Dask and DuckDB seem to win on cloud and at scale
* Dask ends up being most robust, especially at scale
* DuckDB does shockingly well on large datasets on a single large machine
* Spark performs oddly poorly, despite being the standard choice 😢

Tons of charts in this post to try to make sense of the data. If folks are curious, here’s the post:

[https://docs.coiled.io/blog/tpch.html](https://docs.coiled.io/blog/tpch.html)
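
If anyone wants a feel for what "time a TPC-H-style query at different scales" means before reading the charts, here is a minimal sketch. It is not the harness from the post and does not use any of the four engines: it assumes SQLite from the Python standard library, a toy `lineitem` table filled with made-up rows, and a simplified aggregation loosely modeled on TPC-H Q1. The names `build_lineitem`, `time_query`, and `run_benchmark` are hypothetical helpers for illustration only.

```python
import random
import sqlite3
import time


def build_lineitem(conn, n_rows, seed=0):
    """Fill a toy lineitem table with synthetic rows (not real TPC-H data)."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE lineitem ("
        "l_returnflag TEXT, l_linestatus TEXT, "
        "l_quantity REAL, l_extendedprice REAL, l_discount REAL)"
    )
    rows = [
        (
            rng.choice("ANR"),
            rng.choice("FO"),
            rng.randint(1, 50),
            rng.uniform(900.0, 100000.0),
            rng.randint(0, 10) / 100,
        )
        for _ in range(n_rows)
    ]
    conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?, ?)", rows)


# Simplified aggregation in the spirit of TPC-H Q1 (assumption: no date
# filter, and fewer output columns than the real query).
Q1_LIKE = """
SELECT l_returnflag, l_linestatus,
       SUM(l_quantity) AS sum_qty,
       SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       COUNT(*) AS count_order
FROM lineitem
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
"""


def time_query(conn, sql, repeats=3):
    """Return (best wall-clock seconds, result rows) over several runs."""
    best = float("inf")
    result = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return best, result


def run_benchmark(n_rows=10_000):
    """Build the toy table at the given size and time the query once built."""
    conn = sqlite3.connect(":memory:")
    build_lineitem(conn, n_rows)
    seconds, rows = time_query(conn, Q1_LIKE)
    conn.close()
    return seconds, rows


if __name__ == "__main__":
    # Vary the row count to see how runtime changes with data size.
    for n in (1_000, 10_000, 100_000):
        seconds, rows = run_benchmark(n)
        print(f"{n:>7} rows: {seconds:.4f}s, {len(rows)} groups")
```

Taking the best of several runs is one common way to reduce noise from caching and warm-up; a real comparison would also need the actual TPC-H data generator and each engine's own API.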

Performance isn’t everything, of course. Each project has its die-hard fans and critics for loads of different reasons. Anyone want to attack or defend their dataframe library of choice?

Field	Value
label	r/datascience
dataType	post
communityName	r/datascience
datetime	2024-05-23
username_encoded	Z0FBQUFBQm5Lak1aUm9nNncyaVBCYi1tMEU2OXZUZUZ2OWliTWN5YjlodlJhdnFPVGEyVWRyd25GVHRlaXZMc2ZycVUtV1U1N0ZnZnBWR29xTlFKYko1amsyT05Pa3J5ZVE9PQ==
url_encoded	Z0FBQUFBQm5Lak9weE5WS0xDV2ZXUy1DQlF1TllTejdGM0JFTTZ5MldoeHNuMEZpbWg5WGxNNW1wMzJjWV9GYVpuLThQVkRpRVoyOGdUUzg1anRmUnZsRmJyMU1LdDFXYmdYdG1ma0Y5MXlBX0J5QW85cFpjNG5jd0lxeUxmWm9WQUp2ek1NSmo2a0ZEWk9JWmJBcUpYR0VJUUplbzhaZFc4dGtjWWNYcGJYYWVVVEVaMmFycGhLZ204OEdScGhCUVpacmNsWGl1MFZiM0lkWHBsejRPeEloQnRHTzJ2Rzg1Zz09

Entry Information

Entry ID: 61674
Repository: Axioma AXP
Dataset: arrmlet/reddit_dataset_36
Total Entries: 100,000