How do you write performant code for large data sources?
Lately, I have been trying to get into some AI workflows. It appears that conventional Python code is usually less performant than Cython/.pyx implementations.
Some learnings from my explorations:
When working with DataFrames in Python, there are different methods to perform operations on data. Usually you would use iterrows, apply, or vectorized operations. Each method has different performance characteristics, especially as the size of the DataFrame grows.
To investigate this, I went to GPT and asked it for some synthetic simulation code to test out this thesis.
Generate DataFrames of sizes 10, 100, 1000, 10000, and 100000.
Then profile the following functions:
1. iterrows: Iterate through each row, adding a constant to a column's value.
2. apply: Use the apply function with a lambda to add a constant to each item in a column.
3. Vectorized: Directly add a constant to the column using vectorized operations.
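The benchmark described above can be sketched roughly like this (the column name `a` and the constant are my own placeholders, not from the original prompt):

```python
import time

import numpy as np
import pandas as pd

CONST = 5  # the constant added to the column in every method


def use_iterrows(df):
    # Row-by-row loop: each row is materialized as a Series, which is slow.
    out = []
    for _, row in df.iterrows():
        out.append(row["a"] + CONST)
    return pd.Series(out)


def use_apply(df):
    # apply with a lambda: still a Python-level loop under the hood.
    return df["a"].apply(lambda x: x + CONST)


def use_vectorized(df):
    # A single NumPy-backed operation over the whole column.
    return df["a"] + CONST


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000, 100_000):
        df = pd.DataFrame({"a": np.random.rand(n)})
        for fn in (use_iterrows, use_apply, use_vectorized):
            start = time.perf_counter()
            fn(df)
            elapsed = time.perf_counter() - start
            print(f"n={n:>7}  {fn.__name__:<15} {elapsed:.6f}s")
```

On my understanding of typical results, the gap between `iterrows` and the vectorized version widens dramatically as `n` grows, often by two to three orders of magnitude at 100,000 rows.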
So for large DataFrames, avoid iterrows due to its poor performance. Use vectorized operations for the best efficiency, and resort to apply only if vectorized operations are not feasible. Most of your use cases should ideally be solved via vectorized ops, tbh.
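Even logic that looks like it needs apply can often be vectorized. A hypothetical example (the condition and column name are mine, just for illustration) using `np.where`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [-2.0, -1.0, 0.0, 1.0, 2.0]})

# Conditional logic that seems to call for apply...
via_apply = df["a"].apply(lambda x: x * 10 if x > 0 else 0.0)

# ...can usually be expressed as a vectorized np.where instead.
via_where = pd.Series(np.where(df["a"] > 0, df["a"] * 10, 0.0))

# Both produce the same values; np.where avoids the Python-level loop.
assert via_apply.tolist() == via_where.tolist()
```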
You can get further gains with NVIDIA cuDF, Dask, or Polars, depending on where the real bottleneck is for your use case.