When I started machine learning, I followed the standard advice and created my own features by combining multiple columns in my dataset. That's all well and good, but the way I did it was horribly inefficient: I had to wait several minutes for the most basic operations.
My problem was simple: I didn’t know the fastest way to iterate over rows in Pandas.
I often see people online using the same techniques I used to apply. They're not elegant, but they're okay if you don't have much data. However, as soon as you process more than 10k rows, they quickly become an obvious performance issue.
In this article, I’m gonna give you the best way to iterate over rows in a Pandas DataFrame, with no extra code required. It’s not just about performance: it’s also about understanding what’s going on under the hood to become a better data scientist.
Let’s import a dataset in Pandas. In this case, I chose the one I worked on when I started: it’s time to fix my past mistakes! 🩹
import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/mlabonne/how-to-data-science/main/data/nslkdd_test.txt')
df
       duration protocol_type   service  flag  src_bytes  dst_bytes  ...  dst_host_rerror_rate  dst_host_srv_rerror_rate  attack_type  other
0             0           tcp   private   REJ          0          0  ...                  1.00                      1.00      neptune     21
1             0           tcp   private   REJ          0          0  ...                  1.00                      1.00      neptune     21
2             2           tcp  ftp_data    SF      12983          0  ...                  0.00                      0.00       normal     21
3             0          icmp     eco_i    SF         20          0  ...                  0.00                      0.00        saint     15
4             1           tcp    telnet  RSTO          0         15  ...                  0.83                      0.71        mscan     11
...         ...           ...       ...   ...        ...        ...  ...                   ...                       ...          ...    ...
22539         0           tcp      smtp    SF        794        333  ...                  0.00                      0.00       normal     21
22540         0           tcp      http    SF        317        938  ...                  0.00                      0.00       normal     21
22541         0           tcp      http    SF      54540       8314  ...                  0.07                      0.07         back     15
22542         0           udp  domain_u    SF         42         42  ...                  0.00                      0.00       normal     21
22543         0           tcp    sunrpc   REJ          0          0  ...                  0.44                      1.00        mscan     14

22544 rows × 43 columns
This dataset has 22k rows and 43 columns with a combination of categorical and numerical values. Each row describes a connection between two computers.
Let’s say we want to create a new feature: the total number of bytes in the connection. We just have to sum up two existing features: src_bytes and dst_bytes. Let’s see different methods to calculate this new feature.
❌❌ 1. Iterrows
According to the official documentation, iterrows() iterates “over the rows of a Pandas DataFrame as (index, Series) pairs”. It converts each row into a Series object, which causes two problems:
It can change the type of your data (dtypes);
The conversion greatly degrades performance.
For these reasons, the ill-named iterrows() is the WORST possible method to actually iterate over rows.
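Here is a minimal sketch of what the iterrows() version looks like with our df:

%%timeit
# Iterrows: every row is converted into an (index, Series) pair
total = []
for index, row in df.iterrows():
    total.append(row['src_bytes'] + row['dst_bytes'])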
❌❌ 2. For loop with .loc or .iloc (3× faster)
This is what I used to do when I started: a basic for loop that selects rows by index (with .loc or .iloc).
Why is it bad? Because DataFrames are not designed for this purpose. As with the previous method, rows are converted into Pandas Series objects, which degrades performance.
Interestingly enough, .iloc is faster than .loc. It makes sense, since Python doesn't have to check user-defined labels and can look directly at where the row is stored in memory.
%%timeit -n 10
# For loop with .loc
total = []
for index in range(len(df)):
    total.append(df['src_bytes'].loc[index] + df['dst_bytes'].loc[index])
10 loops, best of 5: 600 ms per loop
%%timeit -n 10
# For loop with .iloc
total = []
for index in range(len(df)):
    total.append(df['src_bytes'].iloc[index] + df['dst_bytes'].iloc[index])
10 loops, best of 5: 377 ms per loop
Even this basic for loop with .iloc is 3 times faster than the first method!
❌ 3. Apply (4× faster)
The apply() method is another popular choice to iterate over rows. It creates code that is easy to understand but at a cost: performance is nearly as bad as the previous for loop.
This is why I would strongly advise you to avoid this function for this specific purpose (it’s fine for other applications).
Note that I convert the result into a list using the to_list() method, so that every technique returns identical output.
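A minimal sketch of the apply() version, based on the description above:

%%timeit
# Apply: pass a lambda that sums the two columns, row by row
df.apply(lambda row: row['src_bytes'] + row['dst_bytes'], axis=1).to_list()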
The apply() method is a for loop in disguise, which is why the performance doesn’t improve that much: it’s only 4 times faster than the first technique.
❌ 4. Itertuples (10× faster)
If you know about iterrows(), you probably know about itertuples(). According to the official documentation, it iterates “over the rows of a DataFrame as namedtuples of the values”. In practice, it means that rows are converted into tuples, which are much lighter objects than Pandas Series.
This is why itertuples() is a better version of iterrows(). This time, we need to access the values with an attribute (or a positional index). If you want to access them with a string (e.g., if there's a space in the column name), you can use the getattr() function instead.
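A minimal sketch of the itertuples() version, using attribute access:

%%timeit
# Itertuples: every row is a lightweight namedtuple
total = []
for row in df.itertuples():
    total.append(row.src_bytes + row.dst_bytes)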
This is starting to look better: it is now 10 times faster than iterrows().
❌ 5. List comprehensions (200× faster)
List comprehensions are a fancy way to iterate over a list as a one-liner.
For instance, [print(i) for i in range(10)] prints numbers from 0 to 9 without any explicit for loop. I say “explicit” because Python actually processes it as a for loop if we look at the bytecode.
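You can check this yourself with the dis module from the standard library; a quick sketch:

import dis

# The disassembly contains a FOR_ITER instruction:
# the comprehension is still a for loop at the bytecode level
dis.dis('[print(i) for i in range(10)]')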
So why is it faster? Quite simply because we don’t call the .append() method in this version.
%%timeit -n 100
# List comprehension
[src + dst for src, dst in zip(df['src_bytes'], df['dst_bytes'])]
100 loops, best of 5: 5.54 ms per loop
Indeed, this technique is 200 times faster than the first one! But we can still do better.
✅ 6. Pandas vectorization (1500× faster)
Until now, all the techniques we've seen add values up one pair at a time. Instead of adding single values, why not group them into vectors and sum those? For a CPU, adding two vectors is not much more expensive than adding two numbers, which should speed things up.
On top of that, these operations on whole Series run outside the Python interpreter, in optimized routines that can process many values at once.
The syntax is also the simplest imaginable: this solution is extremely intuitive. Under the hood, Pandas takes care of vectorizing our data with optimized C code using contiguous memory blocks.
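A minimal sketch of the vectorized version; the to_list() call is only there to produce the same output as the previous techniques:

%%timeit
# Pandas vectorization: add the two Series directly
(df['src_bytes'] + df['dst_bytes']).to_list()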
This code is 1500 times faster than iterrows() and it is even simpler to write.
✅✅ 7. NumPy vectorization (1900× faster)
NumPy is designed for scientific computing. It has even less overhead than the Pandas method, since the Series are converted into np.arrays, and it relies on the same optimizations as Pandas vectorization.
There are two ways of converting a Series into a np.array: .values and .to_numpy(). The former has been discouraged in favor of the latter for years, which is why we're gonna use .to_numpy() in this example.
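A minimal sketch of the NumPy version, with tolist() to keep the output identical:

%%timeit
# NumPy vectorization: convert both Series to np.arrays before adding them
(df['src_bytes'].to_numpy() + df['dst_bytes'].to_numpy()).tolist()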
We found our winner with a technique that is 1900 times faster than our first competitor! Let’s wrap things up.
Conclusion
Don’t be like me: if you need to iterate over rows in a DataFrame, vectorization is the way to go! You can find the code to reproduce the experiments at this address. Vectorization is not harder to read, it doesn’t take longer to write, and the performance gain is incredible.
It’s not just about performance: understanding how each method works under the hood helped me to write better code. Performance gains are always based on the same techniques: transforming data into vectors and matrices to take advantage of parallel processing. Alas, this is often at the expense of readability. But it doesn’t have to be.
Iterating over rows is just an example but it shows that, sometimes, you can have your cake and eat it too. 🎂
If you liked this article, follow me on Twitter @maximelabonne for more tips about data science and machine learning!
📈 Bonus
We can measure the performance of each method depending on the size of the DataFrame. I reimplemented all of them in this dummy example using perfplot to show that the leaderboard might be different under 300 rows. Anyway, such a dataset would be so small that we wouldn’t need much optimization.
!pip install -q perfplot

import perfplot
import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 22})

# Techniques
def forloop(df):
    total = []
    for index in range(len(df)):
        total.append(df['col1'].iloc[index] + df['col2'].iloc[index])
    return total

def itertuples(df):
    total = []
    for row in df.itertuples():
        total.append(row[1] + row[2])
    return total

def iterrows(df):
    total = []
    for index, row in df.iterrows():
        total.append(row['col1'] + row['col2'])
    return total

def apply(df):
    return df.apply(lambda row: row['col1'] + row['col2'], axis=1).to_list()

def comprehension(df):
    return [src + dst for src, dst in zip(df['col1'], df['col2'])]

def pd_vectorize(df):
    return (df['col1'] + df['col2']).to_list()

def np_vectorize(df):
    return (df['col1'].to_numpy() + df['col2'].to_numpy()).tolist()

# Perfplot
functions = [iterrows, forloop, apply, itertuples, comprehension, pd_vectorize, np_vectorize]

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

out = perfplot.bench(
    setup=lambda n: pd.concat([df]*n, ignore_index=True),
    kernels=functions,
    labels=[str(f.__name__) for f in functions],
    n_range=[2**n for n in range(20)],
    xlabel='Number of rows',
)

plt.figure(figsize=(20, 12))
out.show()