Efficiently iterating over rows in a Pandas DataFrame

Never use iterrows and itertuples again!

data science
Author

Maxime Labonne

Published

March 21, 2022

When I started machine learning, I followed the guidelines and created my own features by combining multiple columns in my dataset. It’s all well and good, but the way I did it was horribly inefficient. I had to wait several minutes to do the most basic operations.

My problem was simple: I didn’t know the fastest way to iterate over rows in Pandas.

I often see people online using the same techniques I used to rely on. It's not elegant, but it's okay if you don't have much data. However, once you process more than 10k rows, it quickly becomes an obvious performance issue.

In this article, I’m gonna give you the best way to iterate over rows in a Pandas DataFrame, with no extra code required. It’s not just about performance: it’s also about understanding what’s going on under the hood to become a better data scientist.

Let’s import a dataset in Pandas. In this case, I chose the one I worked on when I started: it’s time to fix my past mistakes! 🩹

You can run the code with the following Google Colab notebook.

import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/mlabonne/how-to-data-science/main/data/nslkdd_test.txt')
df
duration protocol_type service flag src_bytes dst_bytes land wrong_fragment urgent hot ... dst_host_same_srv_rate dst_host_diff_srv_rate dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate attack_type other
0 0 tcp private REJ 0 0 0 0 0 0 ... 0.04 0.06 0.00 0.00 0.00 0.0 1.00 1.00 neptune 21
1 0 tcp private REJ 0 0 0 0 0 0 ... 0.00 0.06 0.00 0.00 0.00 0.0 1.00 1.00 neptune 21
2 2 tcp ftp_data SF 12983 0 0 0 0 0 ... 0.61 0.04 0.61 0.02 0.00 0.0 0.00 0.00 normal 21
3 0 icmp eco_i SF 20 0 0 0 0 0 ... 1.00 0.00 1.00 0.28 0.00 0.0 0.00 0.00 saint 15
4 1 tcp telnet RSTO 0 15 0 0 0 0 ... 0.31 0.17 0.03 0.02 0.00 0.0 0.83 0.71 mscan 11
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22539 0 tcp smtp SF 794 333 0 0 0 0 ... 0.72 0.06 0.01 0.01 0.01 0.0 0.00 0.00 normal 21
22540 0 tcp http SF 317 938 0 0 0 0 ... 1.00 0.00 0.01 0.01 0.01 0.0 0.00 0.00 normal 21
22541 0 tcp http SF 54540 8314 0 0 0 2 ... 1.00 0.00 0.00 0.00 0.00 0.0 0.07 0.07 back 15
22542 0 udp domain_u SF 42 42 0 0 0 0 ... 0.99 0.01 0.00 0.00 0.00 0.0 0.00 0.00 normal 21
22543 0 tcp sunrpc REJ 0 0 0 0 0 0 ... 0.08 0.03 0.00 0.00 0.00 0.0 0.44 1.00 mscan 14

22544 rows × 43 columns

This dataset has 22k rows and 43 columns with a combination of categorical and numerical values. Each row describes a connection between two computers.
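
As a quick sanity check (a small snippet added for illustration, not from the original notebook), we can count the columns of each dtype:

# object columns hold the categorical features,
# int64/float64 columns hold the numerical ones
print(df.dtypes.value_counts())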

Let’s say we want to create a new feature: the total number of bytes in the connection. We just have to sum up two existing features: src_bytes and dst_bytes. Let’s see different methods to calculate this new feature.

❌❌ 1. Iterrows

According to the official documentation, iterrows() iterates “over the rows of a Pandas DataFrame as (index, Series) pairs”. It converts each row into a Series object, which causes two problems:

  1. It can change the type of your data (dtypes);
  2. The conversion greatly degrades performance.

For these reasons, the ill-named iterrows() is the WORST possible method to actually iterate over rows.

%%timeit -n 10
# Iterrows
total = []
for index, row in df.iterrows():
    total.append(row['src_bytes'] + row['dst_bytes'])
10 loops, best of 5: 1.07 s per loop
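
To see the first problem in action, here is a minimal sketch (using a hypothetical two-column DataFrame, not our dataset) that shows iterrows() silently upcasting an integer to a float:

# Each row becomes a Series with a single dtype, so the int64 value
# is upcast to float64 to match the float column
df_mixed = pd.DataFrame({'int_col': [1], 'float_col': [1.5]})
print(df_mixed['int_col'].dtype)   # int64
_, row = next(df_mixed.iterrows())
print(row['int_col'], row.dtype)   # 1.0 float64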

Now let’s see slightly better techniques…

❌ 2. For loop with .loc or .iloc (3× faster)

This is what I used to do when I started: a basic for loop to select rows by index (with .loc or .iloc).

Why is it bad? Because DataFrames are not designed for this purpose. Every single lookup has to go through the Pandas indexing machinery, which adds a lot of overhead compared to a plain memory access, and selecting a full row still builds a Pandas Series object.

Interestingly enough, .iloc is faster than .loc. It makes sense: Python doesn't have to check user-defined labels and can look directly at the position where the row is stored in memory.

%%timeit -n 10
# For loop with .loc
total = []
for index in range(len(df)):
    total.append(df['src_bytes'].loc[index] + df['dst_bytes'].loc[index])
10 loops, best of 5: 600 ms per loop
%%timeit -n 10
# For loop with .iloc
total = []
for index in range(len(df)):
    total.append(df['src_bytes'].iloc[index] + df['dst_bytes'].iloc[index])
10 loops, best of 5: 377 ms per loop

Even this basic for loop with .iloc is 3 times faster than the first method!

❌ 3. Apply (4× faster)

The apply() method is another popular choice to iterate over rows. It creates code that is easy to understand but at a cost: performance is nearly as bad as the previous for loop.

This is why I would strongly advise you to avoid this function for this specific purpose (it’s fine for other applications).

Note that I convert the resulting Series into a list with the to_list() method so that every technique returns the same output.

%%timeit -n 10
# Apply
df.apply(lambda row: row['src_bytes'] + row['dst_bytes'], axis=1).to_list()
10 loops, best of 5: 282 ms per loop

The apply() method is a for loop in disguise, which is why the performance doesn’t improve that much: it’s only 4 times faster than the first technique.

❌ 4. Itertuples (10× faster)

If you know about iterrows(), you probably know about itertuples(). According to the official documentation, it iterates “over the rows of a DataFrame as namedtuples of the values”. In practice, it means that rows are converted into tuples, which are much lighter objects than Pandas Series.

This is why itertuples() is a better version of iterrows(). This time, we access the values as attributes (or by positional index). If you only have the column name as a string, you can use the getattr() function instead, as shown in the sketch below.
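
Here is what the getattr() variant would look like (a sketch equivalent to the attribute version benchmarked below):

# Same loop, fetching each value by column name with getattr()
total = []
for row in df.itertuples():
    total.append(getattr(row, 'src_bytes') + getattr(row, 'dst_bytes'))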

%%timeit -n 10
# Itertuples
total = []
for row in df.itertuples():
    total.append(row.src_bytes + row.dst_bytes)
10 loops, best of 5: 99.3 ms per loop

This is starting to look better: it is now 10 times faster than iterrows().

❌ 5. List comprehensions (200× faster)

List comprehensions are a concise way to build a list from an iterable in a single line.

For instance, [print(i) for i in range(10)] prints numbers from 0 to 9 without any explicit for loop. I say “explicit” because Python actually processes it as a for loop if we look at the bytecode.
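
If you want to verify this claim, a quick sketch with the dis module (not from the original notebook) disassembles the comprehension and reveals a FOR_ITER instruction, i.e., a loop:

import dis

# The bytecode of the comprehension contains FOR_ITER:
# Python still runs a loop under the hood
dis.dis(compile('[i for i in range(10)]', '<demo>', 'eval'))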

So why is it faster? Because we iterate directly over the two columns with zip(), without building a Series or namedtuple for each row, and we skip the repeated .append() calls.

%%timeit -n 100
# List comprehension
[src + dst for src, dst in zip(df['src_bytes'], df['dst_bytes'])]
100 loops, best of 5: 5.54 ms per loop

Indeed, this technique is 200 times faster than the first one! But we can still do better.

✅ 6. Pandas vectorization (1500× faster)

Until now, every technique has added values one pair at a time. Instead of adding single values, why not group them into vectors and sum those? For a CPU, working through two whole arrays in one tight loop is far cheaper than dispatching one Python-level addition per row, and that is where the speedup comes from.

On top of that, the actual computation happens outside the Python interpreter: Pandas delegates it to optimized C routines that operate on contiguous memory blocks and can exploit low-level CPU features such as SIMD instructions.

The syntax is also the simplest imaginable: this solution is extremely intuitive. Under the hood, Pandas takes care of vectorizing our data for us.

%%timeit -n 1000
# Vectorization
(df['src_bytes'] + df['dst_bytes']).to_list()
1000 loops, best of 5: 734 µs per loop

This code is 1500 times faster than iterrows() and it is even simpler to write.

✅✅ 7. NumPy vectorization (1900× faster)

NumPy is designed for scientific computing. It has even less overhead than the Pandas methods, since we operate directly on np.ndarray objects and skip the indexing and alignment machinery that Series carry. Otherwise, it relies on the same optimizations as Pandas vectorization.

There are two ways of converting a Series into a np.array: using .values or .to_numpy(). The former has been discouraged in the documentation for years in favor of the latter, which is why we're gonna use .to_numpy() in this example.
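
Both accessors return the same kind of object; here is a quick illustrative check:

# Both return a plain NumPy array, but .to_numpy() is the recommended API
print(type(df['src_bytes'].values))      # <class 'numpy.ndarray'>
print(type(df['src_bytes'].to_numpy()))  # <class 'numpy.ndarray'>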

%%timeit -n 1000
# Numpy vectorization
(df['src_bytes'].to_numpy() + df['dst_bytes'].to_numpy()).tolist()
1000 loops, best of 5: 575 µs per loop

We found our winner with a technique that is 1900 times faster than our first competitor! Let’s wrap things up.

Conclusion

Don’t be like me: if you need to iterate over rows in a DataFrame, vectorization is the way to go! You can find the code to reproduce these experiments in the Google Colab notebook linked above. Vectorization is not harder to read, it doesn’t take longer to write, and the performance gain is incredible.

It’s not just about performance: understanding how each method works under the hood helped me to write better code. Performance gains are always based on the same techniques: transforming data into vectors and matrices to take advantage of parallel processing. Alas, this is often at the expense of readability. But it doesn’t have to be.

Iterating over rows is just an example, but it shows that, sometimes, you can have your cake and eat it too. 🎂

If you liked this article, follow me on Twitter @maximelabonne for more tips about data science and machine learning!

📈 Bonus

We can measure the performance of each method depending on the size of the DataFrame. I reimplemented all of them in this dummy example using perfplot to show that the leaderboard might be different under 300 rows. Anyway, such a dataset would be so small that we wouldn’t need much optimization.

!pip install -q perfplot

import perfplot 
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 22})

# Techniques
def forloop(df):
    total = []
    for index in range(len(df)):
        total.append(df['col1'].iloc[index] 
                   + df['col2'].iloc[index])
    return total

def itertuples(df):
    total = []
    for row in df.itertuples():
        total.append(row[1] + row[2])
    return total

def iterrows(df):
    total = []
    for index, row in df.iterrows():
        total.append(row['col1']
                   + row['col2'])
    return total

def apply(df):
    return df.apply(lambda row: row['col1']
                              + row['col2'], axis=1).to_list()

def comprehension(df):
    return [src + dst for src, dst in zip(df['col1'], df['col2'])]

def pd_vectorize(df):
    return (df['col1'] + df['col2']).to_list()

def np_vectorize(df):
    return (df['col1'].to_numpy() + df['col2'].to_numpy()).tolist()

# Perfplot
functions = [iterrows, forloop, apply, itertuples,
             comprehension, pd_vectorize, np_vectorize]
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

out = perfplot.bench(
      setup=lambda n: pd.concat([df]*n, ignore_index=True),
      kernels=functions,
      labels=[str(f.__name__) for f in functions],
      n_range=[2**n for n in range(20)],
      xlabel='Number of rows',
)

plt.figure(figsize=(20,12))
out.show()


(Plot: runtime of each method against the number of rows.)