
How do you write performant code for large data sources?

Lately, I have been trying to get into some AI workflows. It appears that conventional pure-Python code is usually less performant than Cython (.pyx) implementations.

Some learnings from my explorations:

When working with pandas DataFrames in Python, there are different ways to perform operations on data. Usually you would use iterrows, apply, or vectorized operations. Each method has different performance characteristics, especially as the size of the DataFrame grows.
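For anyone new to this, here is roughly what the three approaches look like on a toy DataFrame; the column name `a` and the constant 10 are just made up for illustration:

```python
import pandas as pd

# Toy DataFrame; column "a" and the constant 10 are illustrative choices
df = pd.DataFrame({"a": range(5)})

# 1. iterrows: loop row by row (each row is materialised as a Series, so it's slow)
iterrows_out = []
for _, row in df.iterrows():
    iterrows_out.append(row["a"] + 10)

# 2. apply: a Python lambda per element, still interpreter overhead on every call
apply_out = df["a"].apply(lambda x: x + 10)

# 3. Vectorized: one NumPy-backed operation over the whole column
vectorized_out = df["a"] + 10
```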

To investigate this, I went to GPT and asked it for some synthetic simulation code to test out this thesis.

Generate DataFrames of sizes 10, 100, 1000, 10000, and 100000.

Then profile the following funcs:
1. iterrows: Iterate through each row, adding a constant to a column's value.
2. apply: Use the apply function with a lambda to add a constant to each item in a column.
3. Vectorized: Directly add a constant to the column using vectorized operations.
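The exact code GPT generated isn't pasted here, but a minimal sketch of that benchmark could look like this; the column name `value`, the constant, and the repeat count are my own assumptions:

```python
import time
import numpy as np
import pandas as pd

CONSTANT = 10  # arbitrary constant to add

def with_iterrows(df):
    out = []
    for _, row in df.iterrows():
        out.append(row["value"] + CONSTANT)
    return pd.Series(out)

def with_apply(df):
    return df["value"].apply(lambda x: x + CONSTANT)

def with_vectorized(df):
    return df["value"] + CONSTANT

def profile(func, df, repeats=3):
    # Best wall-clock time over a few repeats
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(df)
        best = min(best, time.perf_counter() - start)
    return best

for size in (10, 100, 1_000, 10_000, 100_000):
    df = pd.DataFrame({"value": np.random.rand(size)})
    times = {f.__name__: profile(f, df) for f in (with_iterrows, with_apply, with_vectorized)}
    print(size, {name: f"{t:.6f}s" for name, t in times.items()})
```

Running something like this makes the gap between the three approaches obvious as the row counts grow.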

So for large DataFrames, avoid iterrows due to its poor performance. Use vectorized operations for the best efficiency, and resort to apply only when vectorized operations aren't feasible. Most of your use cases should ideally be solved via vectorized ops tbh.