
Linux Software Router: Data Plane Optimization 

and Performance Evaluation 

 

Raffaele Bolla and Roberto Bruschi 

DIST - Department of Communications, Computer and Systems Science 

University of Genoa 

Via Opera Pia 13, 16145 Genoa, Italy  

Email: {raffaele.bolla, roberto.bruschi}@unige.it 

 
 

Abstract - Recent technological advances provide an excellent opportunity to achieve truly effective results in the field of open Internet devices, also known as Open Routers or ORs. Even though some initiatives have been undertaken over the last few years to investigate ORs and related topics, extensive areas still require additional investigation. In this contribution we report the results of the in-depth optimization and testing carried out on a PC Open Router architecture based on Linux software and COTS hardware. The main focus of this paper is the forwarding performance evaluation of different Linux-based OR software architectures. This analysis was performed with both external (throughput and latencies) and internal (profiling) measurements. In particular, for the external measurements, a set of RFC 2544 compliant tests is also proposed and analyzed.

Index Terms - Linux Router; Open Router; RFC 2544; IP forwarding.

I. INTRODUCTION

Internet technology has been developed in an open 

environment and all Internet-related protocols, 
architectures and structures are publicly created and 
described. For this reason, in principle, everyone can “easily” develop an Internet device (e.g., a router). On the contrary, and to a certain extent quite surprisingly, most professional devices are developed in an extremely “closed” manner: in fact, it is very difficult to acquire details about their internal operations and to perform anything more complex than a parametric configuration.

From a general viewpoint, this is not very strange since 

it can be considered a clear attempt to protect the 
industrial investment. However, sometimes the 
“experimental” nature of the Internet and its diffusion in 
many contexts might suggest a different approach. Such a 
need is even more evident within the scientific 
community, which often runs into various problems when 
carrying out experiments, testbeds and trials to evaluate 
new functionalities and protocols.  

Today, recent technological advances provide an 

opportunity to do something truly effective in the field of 
open Internet devices, sometimes called Open Routers 
(ORs). Such an opportunity arises from the use of Open 
Source Operative Systems (OSs) and COTS/PC 
components. The attractiveness of the OR solution can be 
summarized as: multi-vendor availability, low-cost and 
continuous updating/evolution of the basic parts. As far 
as performance is concerned, the PC architecture is 

general purpose, which means that, in principle, it cannot attain the same performance level as custom, high-end network devices, which often use dedicated HW elements to handle and parallelize the most critical operations. Nevertheless, the performance gap might not be so large and, in any case, it is more than justified by the cost difference. Our activities, carried out within the framework of the BORA-BORA project [1], are geared to facilitate this investigation by reporting the results of an extensive optimization and testing operation carried out on an OR architecture based on Linux software. We focused our attention mainly on packet forwarding functionalities. Our main objective was the performance evaluation of an optimized OR, by means of both external (throughput and latency) and internal (profiling) measurements. To this end, we identified a high-end reference PC-based hardware architecture and the Linux 2.6 kernel for the software data plane. Subsequently, we optimized this OR structure, defined a test environment and finally carried out a complete series of tests with an accurate evaluation of the role of the software modules in defining the performance limits.

With regard to the state-of-the-art of OR devices, some 

initiatives have been undertaken over the last few years to 
develop and investigate the ORs and related topics. In the 
software area, one of the most important initiatives is the 
Click Modular Router Project [2], which proposes an 
effective data plane solution. In the control plane area 
two important projects can be cited: Zebra [3] and Xorp 
[4].  

Beyond custom developments, some standard Open Source OSs can also provide very effective support for an OR project. The most relevant OSs in this sense are Linux [5][6] and FreeBSD [7]. Other activities focus on
hardware: [8] and [9] propose a router architecture based 
on a PC cluster, while [10] reports some performance 
results (in packet transmission and reception) obtained 
with a PC Linux-based testbed. Some evaluations have 
also been carried out on network boards (see, for 
example, [11]).  

Other fascinating projects involving Linux-based ORs 

can be found in [12] and [13], where Bianco et al. report 
some interesting performance results. In [14] a 
performance analysis of an OR architecture enhanced 
with FPGA line cards, which allows direct NIC-to-NIC 
packet forwarding, is introduced. [15] describes the Intel 


I/OAT, a technology that enables DMA engines to 
improve network reception and transmission by 
offloading the CPU of some low-level operations. 

In [16] the virtualization of a multiservice OR 

architecture is discussed: the authors propose multiple 
Click forwarding chains virtualized with Xen.  

Finally, in [17], we proposed an in-depth study of the 

IP lookup mechanism included in the Linux kernel. 

The paper is organized as follows. The software and hardware details of the proposed OR architecture are reported in Sections II and III, respectively, while Section IV contains a description of the performance tuning and optimization techniques. The benchmarking scenario and the performance results are reported in Sections V and VI, respectively. Conclusions are presented in Section VII.

II. LINUX OR SOFTWARE ARCHITECTURE

The OR architecture has to provide many different 

types of functionalities: from those directly involved in 
the packet forwarding process to the ones needed for 
control functionalities, dynamic configuration and 
monitoring.  

As outlined in [5], in [18] and in [19], all the 

forwarding functions are developed inside the Linux 
kernel, while most of the control and monitoring 
operations (the signaling protocols such as routing 
protocols, control protocols, etc.) are daemons / 
applications running in the user mode.  

As in the older kernel versions, the Linux networking architecture is fundamentally based on an interrupt mechanism: network boards signal the kernel upon packet reception or transmission through HW interrupts. Each HW interrupt is served as soon as possible by a handling routine, which suspends the operations currently being processed by the CPU; until it completes, the handling routine cannot be interrupted by anything, not even by other interrupt handlers. Thus, with the clear purpose of keeping the system reactive, interrupt handlers are designed to be very short, while all the time-consuming tasks are deferred and performed afterwards by the so-called “Software Interrupts” (SoftIRQs). This is the well-known “top half-bottom half” IRQ routine division implemented in the Linux kernel [18].

SoftIRQs are not real interrupts, but rather a form of kernel activity that can be scheduled for later execution. They differ from HW IRQs mainly in that a SoftIRQ is scheduled for execution by another kernel activity, such as an HW IRQ routine, and has to wait until it is called by the scheduler; SoftIRQs can be preempted only by HW IRQ routines.

The NET_TX_SOFTIRQ and NET_RX_SOFTIRQ are two of the most important SoftIRQs in the Linux kernel and the backbone of the entire networking architecture, since they are designed to manage the packet transmission and reception operations, respectively. In detail, the forwarding process is triggered by an HW IRQ generated by a network device, which signals the reception or the transmission of packets. The corresponding routine then performs some fast checks and schedules the appropriate SoftIRQ, which is activated by the kernel scheduler as soon as possible. When the SoftIRQ is finally executed, it performs all the packet forwarding operations.
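The “top half / bottom half” split described above can be illustrated with a reduced C sketch of the pattern used by the 2.6 networking code. The registration call open_softirq() and the helper netif_rx_schedule() are real 2.6.16-era kernel interfaces, while disable_board_irqs() and the handler body are simplified, hypothetical placeholders:

    /* Reduced sketch of the IRQ/SoftIRQ split in the 2.6 networking code
     * (mirroring net/core/dev.c and a generic NIC driver); illustrative only. */
    #include <linux/interrupt.h>
    #include <linux/netdevice.h>

    /* At boot, the two networking SoftIRQs are bound to their actions
     * (this is what net_dev_init() does in the real kernel). */
    static int __init net_softirq_setup_sketch(void)
    {
            open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
            open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
            return 0;
    }

    /* Top half: the driver's HW IRQ handler only acknowledges the board
     * and defers all the real work to the reception SoftIRQ. */
    static irqreturn_t nic_hw_irq_sketch(int irq, void *dev_id, struct pt_regs *regs)
    {
            struct net_device *dev = dev_id;

            disable_board_irqs(dev);   /* hypothetical driver-specific helper */
            netif_rx_schedule(dev);    /* add dev to the CPU poll list and
                                        * raise NET_RX_SOFTIRQ */
            return IRQ_HANDLED;
    }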

As shown in Figure 1, which reports a scheme of the Linux source code involved in the forwarding process, the operations computed during SoftIRQs can be organized in a chain of three different modules: a “reception API” that handles packet reception (NAPI¹), a module that carries out the IP layer elaboration and, finally, a “transmission API” that manages the forwarding operations towards the egress network interfaces.

In particular, the reception and the transmission APIs 

are the lowest level modules, and are activated by both 
HW IRQ routines and scheduled SoftIRQs. They handle 
the network interfaces and perform some layer 2 
functionalities.  

NAPI [20] was introduced in the 2.4.27 kernel version, and was explicitly created to increase the scalability of the reception process. It handles network interface requests with an interrupt moderation mechanism, through which it is possible to adaptively switch from a classical interrupt-driven management of the network interfaces to a polling one.

In greater detail, this is accomplished by inserting the identifier of the board generating the IRQ in a special list, called the “poll list”, during the HW IRQ routine, scheduling a reception SoftIRQ, and disabling the HW IRQs for that device. When the SoftIRQ is activated, the kernel polls all the devices whose identifiers are included in the poll list, and a maximum of “quota” packets is served per device. If the board buffer (Rx Ring) is emptied, then the identifier is removed from the poll list and its HW IRQs are re-enabled; otherwise, its HW IRQs are left disabled, the identifier remains in the poll list and another SoftIRQ is scheduled. While this mechanism behaves like a pure interrupt mechanism in the presence of a low ingress rate (i.e., there is more or less one HW IRQ per packet), when traffic increases, the probability of emptying the Rx Ring, and thus of re-enabling the HW IRQs, decreases more and more, and NAPI starts working like a polling mechanism.
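The quota-based polling logic can be sketched as a 2.6.16-style poll callback. The real e1000 code is considerably more involved; rx_ring_pop(), deliver_to_stack() and enable_board_irqs() are hypothetical helpers standing in for driver internals:

    /* Hedged sketch of a NAPI poll callback with the 2.6.16 driver interface
     * (dev->poll): it serves at most "quota" packets and decides whether to
     * stay in polling mode or return to interrupt mode. */
    static int nic_poll_sketch(struct net_device *dev, int *budget)
    {
            int quota = min(*budget, dev->quota);
            int done = 0;

            while (done < quota) {
                    struct sk_buff *skb = rx_ring_pop(dev);  /* next packet in the Rx Ring */
                    if (!skb)
                            break;                           /* Rx Ring emptied */
                    deliver_to_stack(skb);                   /* netif_receive_skb() path */
                    done++;
            }
            *budget -= done;
            dev->quota -= done;

            if (done < quota) {
                    /* Ring emptied before the quota: leave the poll list and
                     * re-enable the board HW IRQs (back to interrupt mode). */
                    netif_rx_complete(dev);
                    enable_board_irqs(dev);                  /* hypothetical helper */
                    return 0;
            }
            /* Quota exhausted with packets still pending: stay on the poll
             * list; another NET_RX_SOFTIRQ round will call us again. */
            return 1;
    }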

For each packet received during NAPI processing, a descriptor, called skbuff [21], is immediately allocated. In particular, as shown in Figure 1, to avoid unnecessary and expensive memory copy operations, the packets are left in the memory locations used by the DMA engines of the ingress network interfaces, and each subsequent operation is performed through the skbuffs. These descriptors essentially consist of pointers to the key fields of the headers contained in the associated packets, and are used for all the layer 2 and 3 operations.
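As an illustration of this zero-copy handling, the following minimal sketch shows a forwarding-style operation performed entirely through the descriptor. In the 2.6.16 kernel the header pointers live in the skb->nh/skb->mac unions and ip_decrease_ttl() is a real helper, while the surrounding logic is deliberately simplified:

    #include <linux/skbuff.h>
    #include <net/ip.h>

    /* Minimal sketch: the IP header is reached through the skbuff pointers,
     * while the packet payload stays untouched in the buffer filled by the
     * NIC DMA engine. */
    static int touch_header_only_sketch(struct sk_buff *skb)
    {
            struct iphdr *iph = skb->nh.iph;   /* pointer into the DMA buffer */

            if (iph->ttl <= 1)
                    return -1;                 /* the real code would drop the packet
                                                * and send an ICMP Time Exceeded */

            ip_decrease_ttl(iph);              /* decrements TTL and fixes the checksum */
            return 0;
    }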

A packet is elaborated in the same NET_RX SoftIRQ until it is enqueued in an egress device buffer, called Qdisc. Each time a NET_TX SoftIRQ is activated or a new packet is enqueued, the Qdisc buffer is served.
                                                           

¹ In greater detail, the NAPI architecture includes a part of the interrupt handler.


When a packet is dequeued from the Qdisc buffer, it is placed on the Tx Ring of the egress device. After the board successfully transmits one or more packets, it generates an HW IRQ, whose routine schedules a NET_TX SoftIRQ. The Tx Ring is periodically cleaned of all the descriptors of transmitted packets, which are de-allocated and replaced by the packets coming from the Qdisc buffer.
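A condensed sketch of this egress path is given below; dev->hard_start_xmit was the 2.6.16 driver entry point (e.g., e1000_xmit_frame), while qdisc_dequeue_sketch(), tx_ring_full() and requeue_sketch() are hypothetical stand-ins for the Qdisc and ring handling:

    /* Hedged sketch of the transmission side: skbs dequeued from the Qdisc
     * are pushed onto the egress Tx Ring through the driver entry point. */
    static void serve_qdisc_sketch(struct net_device *dev)
    {
            struct sk_buff *skb;

            while ((skb = qdisc_dequeue_sketch(dev)) != NULL) {
                    if (tx_ring_full(dev)) {
                            /* No free Tx descriptors: put the packet back and
                             * retry at the next NET_TX_SOFTIRQ. */
                            requeue_sketch(dev, skb);
                            break;
                    }
                    /* e.g. e1000_xmit_frame(): map the buffer and append a
                     * descriptor to the Tx Ring. */
                    dev->hard_start_xmit(skb, dev);
            }
    }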

Another interesting characteristic of the 2.6 kernels (introduced to reduce the performance deterioration due to CPU concurrency) is the Symmetric Multi-Processor (SMP) support, which may assign the management of each network interface to a single CPU for both the transmission and reception functionalities.

III. HARDWARE ARCHITECTURE

The Linux OS supports many different hardware 

architectures, but only a small portion of them can be 
effectively used to obtain high OR performance. 

In particular, we must take into account that, during 

networking operations, the PC internal data path has to 
use a centralized I/O structure consisting of the I/O bus, 
the memory channel (used by DMA to transfer data from 
network interfaces to RAM and vice versa) and the Front 
Side Bus (FSB) (used by the CPU with the memory 
channel to access the RAM during packet elaboration). 
The selection criteria for the hardware elements have therefore been very fast internal busses, RAM with very low access times, and CPUs with high integer computational power (packet processing does not generally require any floating point operations).

In order to understand how hardware architecture 

affects overall system performance, we selected two different architectures that represent the current state-of-the-art of server architectures and the state-of-the-art from three years ago, respectively.

In this regard, as the old HW architecture, we chose a system based on the Supermicro X5DL8-GG mainboard: it can support a dual-Xeon system with a dual memory channel and a PCI-X bus at 133 MHz with 64 parallel bits. The Xeon processors (32-bit and single-core) we utilized have a 2.4 GHz clock and a 512 KB cache. For the new OR architecture we used a Supermicro X7DBE mainboard, equipped with both PCI Express and PCI-X busses, and with an Intel Xeon 5050 (a 64-bit dual-core processor).

Network interfaces are another critical element, since they can heavily affect PC Router performance. As reported in [11], the network adapters on the market offer different performance levels and configurability. With this in mind, we selected two types of adapters with different features and speeds: a high-performance and configurable Gigabit Ethernet interface, namely the Intel PRO 1000, which is equipped with a PCI-X controller (XT version) or a PCI-Express one (PT version) [22], and a D-Link DFE-580TX [23], a network card equipped with four Fast Ethernet interfaces and a PCI 2.1 controller.

IV. SOFTWARE PERFORMANCE TUNING

The entire networking architecture of the Linux kernel is quite complex and has numerous aspects and parameters that can be tuned for system optimization. In particular, since the OS has been developed to act as a network host (i.e., workstation, server, etc.), it is natively tuned for “general purpose” network end-node usage, in which packets are not fully processed inside kernel space, but are usually delivered from the network interfaces to applications in user space, and vice versa. When the Linux kernel is used in an OR architecture, it generally works in a different manner, and should be specifically tuned and customized to obtain the maximum packet forwarding performance.

Figure 1. Detailed scheme of the forwarding code in 2.6 Linux kernel versions: the HW interrupt handler and per-CPU poll list, the NAPI reception routines (e1000_clean_rx_irq, e1000_alloc_rx_buffers, alloc_skb, netif_receive_skb), the IP processing chain (ip_rcv, ip_route_input, ip_forward, ip_output, including the Netfilter hooks), and the TX-API (dev_queue_xmit, the root Qdisc, qdisc_restart, the Tx Ring, e1000_xmit_frame, net_tx_action and the completion queue).

As reported in [19] and [25], where a more detailed 

description of the adopted tuning actions can be found, 
this optimization is very important for obtaining 
maximum performance. Some of the optimal parameter 
values can be identified by logical considerations, but 
most of them have to be empirically determined, since 
their optimal value cannot be easily derived from the 
software structure and because they also depend on the 
hardware components. So we carried out our tuning first 
by identifying the critical elements on which to operate, 
and, then, by finding the most convenient values with 
both logical considerations and experimental measures. 

As far as the adopted tuning settings are concerned, we used version 6.3.9 of the e1000 driver [24], with both the Rx and Tx ring buffers set to 256 descriptors and with no limitation on the Rx interrupt generation rate. The Qdisc size for all the adapters was dimensioned to 20,000 descriptors, while the scheduler clock frequency was fixed at 100 Hz.

Moreover, the 2.6.16.13 kernel images used to obtain the numerical results in Section VI include two structural patches that we created to test and/or optimize kernel functionalities. These patches are described in the following subsections.

A. Skbuff Recycling patch 

We studied and developed a new version of the skbuff 

Recycling patch, originally proposed by R. Olsson [26] 
for the “e1000” driver. In particular, the new version is 
stabilized for the 2.6.16.13 kernel version and extended to 
the “sundance” driver. 

This patch intercepts the skbuff descriptors of transmitted packets before they are de-allocated, and re-uses them for new incoming packets. As shown in [19], this architectural change significantly reduces the computational weight of the memory management operations, thus attaining a very high performance level (i.e., about 150-175% of the maximum throughput of standard kernels).
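The core idea of the patch can be sketched as a per-device free list fed at Tx-clean time and consulted by the Rx refill path. The structure, list and reset details below are illustrative of the mechanism, not the code of Olsson's patch:

    struct nic_priv_sketch {                   /* illustrative driver-private state;
                                                * recycle_list is initialized with
                                                * skb_queue_head_init() at probe time */
            struct sk_buff_head recycle_list;
    };

    /* Called while cleaning the Tx Ring, instead of freeing the descriptor. */
    static void skb_recycle_put_sketch(struct net_device *dev, struct sk_buff *skb)
    {
            struct nic_priv_sketch *priv = netdev_priv(dev);

            if (skb_queue_len(&priv->recycle_list) < 256 && !skb_cloned(skb)) {
                    skb_queue_head(&priv->recycle_list, skb);
                    return;
            }
            dev_kfree_skb_any(skb);            /* fall back to normal freeing */
    }

    /* Called by the Rx refill path instead of a fresh dev_alloc_skb(). */
    static struct sk_buff *skb_recycle_get_sketch(struct net_device *dev,
                                                  unsigned int len)
    {
            struct nic_priv_sketch *priv = netdev_priv(dev);
            struct sk_buff *skb = skb_dequeue(&priv->recycle_list);

            if (!skb)
                    return dev_alloc_skb(len); /* list empty: allocate normally */

            /* Reset the descriptor so it looks freshly allocated (simplified;
             * the real patch has to restore several more fields). */
            skb->data = skb->head;
            skb->tail = skb->data;
            skb->len  = 0;
            return skb;
    }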

B. Performance Counter patch 

To further analyze the OR's internal behavior, we decided to introduce a set of counters in the kernel source code, in order to understand how many times a certain procedure is called, or how many packets are handled at each activation. Specifically, we introduced the following counters:

•  IRQ: number of interrupt handlers generated by a network card;

•  Tx/Rx IRQ: number of Tx/Rx IRQ routines per device;

•  Tx/Rx SoftIRQ: number of Tx/Rx software IRQ routines;

•  Qdiscrun and Qdiscpkt: number of times the output buffer (Qdisc) is served, and number of packets served each time;

•  Pollrun and Pollpkt: number of times the Rx Ring of a device is served, and number of packets served each time;

•  Tx/Rx clean: number of times the Tx/Rx clean procedures of the driver are activated.

The values of all these counters have been mapped into the Linux “proc” file system.
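A minimal example of how such counters can be exported is sketched below, using the create_proc_read_entry() interface available in 2.6.16-era kernels; the counter names and the hot-path hook shown are ours and purely illustrative, not the actual patch code:

    #include <linux/module.h>
    #include <linux/proc_fs.h>

    /* Illustrative counters bumped in the hot path (e.g., once per poll round). */
    static unsigned long cnt_pollrun, cnt_pollpkt, cnt_rx_softirq;

    static inline void count_poll_sketch(int served_pkts)
    {
            cnt_pollrun++;
            cnt_pollpkt += served_pkts;
    }

    /* /proc read handler dumping the counters in text form. */
    static int forw_counters_read_sketch(char *page, char **start, off_t off,
                                         int count, int *eof, void *data)
    {
            int len = sprintf(page, "pollrun %lu\npollpkt %lu\nrx_softirq %lu\n",
                              cnt_pollrun, cnt_pollpkt, cnt_rx_softirq);
            *eof = 1;
            return len;
    }

    static int __init forw_counters_init_sketch(void)
    {
            create_proc_read_entry("forw_counters", 0444, NULL,
                                   forw_counters_read_sketch, NULL);
            return 0;
    }

Reading the resulting /proc entry then gives a snapshot that can be correlated with the offered load of each test.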

V. BENCHMARKING SCENARIO

To benchmark the OR forwarding performance, we used a professional device, the Agilent N2X Router Tester [27], which can provide throughput and latency measurements with a high level of reliability and accuracy (i.e., the minimum guaranteed timestamp resolution is 10 ns). Moreover, with two dual Gigabit Ethernet cards and one 16-port Fast Ethernet card, we can analyze the OR behavior with a large number of Fast and Gigabit Ethernet interfaces.

To better support the performance analysis and to identify the OR bottlenecks, we also performed some internal measurements using specific software tools (called profilers) placed inside the OR, which trace the percentage of CPU utilization for each software module running on the node. The problem is that, with many of these profilers, the required computational effort perturbs system performance, thus producing results that are not very meaningful. Through many different tests, we verified that one of the best tools is Oprofile [28], an open source profiler that continuously monitors system dynamics with frequent and quite regular sampling of the CPU hardware registers. Oprofile effectively evaluates the CPU utilization of each software application, and of each single kernel function running in the system, with very low computational overhead.

With regard to the benchmarking scenario, we decided 

to start by defining a reasonable set of test setups (with 
increasing level of complexity) and for each selected 
setup to apply some of the tests defined in the RFC 2544 
[29]. In particular, we chose to perform these activities by 
using both a core and an edge router configuration: the 
former consists of a few high-speed (Gigabit Ethernet) 
network interfaces, while the latter utilizes a high-speed 
gateway interface and a large number of Fast Ethernet 
cards which collect traffic from the access networks. 
More specifically, we performed our tests by using the 
following setups (see Figure 2):  
1) Setup A: a single mono directional flow crosses the OR 

from one Gigabit port to another one;  

2) Setup B: two full duplex flows cross the OR, each one 

using a different pair of Gigabit ports;  

3) Setup C: a full-meshed (and full-duplex) traffic matrix 

applied on 4 Gigabit Ethernet ports;  

4) Setup D: a full-meshed (and full-duplex) traffic matrix 

applied on 1 Gigabit Ethernet port and 12 Fast Ethernet 
interfaces.  

In greater detail, each OR forwarding benchmarking session essentially consists of three test sets, namely:


a) Throughput and latency: this test set is performed by 

using constant bit rate traffic flows, consisting of fixed 
size datagrams, to obtain: a) the maximum effective 
throughput (in Kpackets/s and as a percentage with 
respect to the theoretical value) versus different IP 
datagram sizes; b) the average, maximum and minimum 
latencies versus different IP datagram sizes; 

b) Back-to-back: these tests are carried out by using burst 

traffic flows and by changing both the burst dimension 
(i.e., the number of the packets comprising the burst) 
and the datagram size. The main results for this kind of 
test are: a) zero loss burst length versus different IP 
datagram sizes; b) average, maximum and minimum 
latencies versus different sizes of IP datagram 
comprising the burst (“zero loss burst length” is the 
maximum number of packets transmitted with 
minimum inter-frame gaps that the System Under Test 
(SUT) can handle without any loss). 

c) Loss Rate: this kind of test is carried out by using CBR 

traffic flows with different offered loads and IP 
datagram sizes; the obtainable results can be 
summarized in throughput versus both offered load and 
IP datagram sizes. 

Note that all these tests have been performed by using 

different IP datagram sizes (i.e., 40, 64, 128, 256, 512, 
1024 and 1500 bytes) and both CBR and burst traffic 
flows. 
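The “theoretical value” used as a reference in these tests is the Ethernet line rate for a given IP datagram size. The small user-space sketch below computes it under the usual framing assumptions (14-byte Ethernet header, 4-byte FCS, 8-byte preamble, 12-byte inter-frame gap, 64-byte minimum frame); these framing figures are our assumptions, not values taken from the paper:

    #include <stdio.h>

    /* Theoretical maximum packet rate on a link of rate_bps for a given IP
     * datagram size, assuming plain Ethernet II framing. */
    static double max_pps(double rate_bps, unsigned int ip_bytes)
    {
            unsigned int frame = ip_bytes + 14 + 4;         /* header + FCS       */
            if (frame < 64)
                    frame = 64;                             /* minimum frame size */
            return rate_bps / (8.0 * (frame + 8 + 12));     /* + preamble + IFG   */
    }

    int main(void)
    {
            unsigned int sizes[] = { 40, 64, 128, 256, 512, 1024, 1500 };
            unsigned int i;

            for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                    printf("IP %4u B -> %8.0f packets/s at 1 Gbps\n",
                           sizes[i], max_pps(1e9, sizes[i]));
            return 0;
    }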

 

 

Figure 2. Benchmarking setups (A, B, C and D).

VI. NUMERICAL RESULTS

A selection of the experimental results is reported in 

this section. In particular, the results of the benchmarking 
setups shown in Figure 2 are reported in Subsections A, 
B, C and D. In all such cases, the tests were performed 
with the “old” hardware architecture described in Section 
III (i.e., 32-bit Xeon and PCI-X bus). 

With regard to the software architecture, we decided to compare different 2.6.16 Linux kernel configurations and the Click Modular Router. In particular, we used the following versions of the 2.6.16 Linux kernel:

•  single-processor 2.6.16 optimized kernel (a 

version based on the standard one with single 
processor support that includes the descriptor 
recycling patch).  

•  dual-processor 2.6.16 standard kernel (a standard 

NAPI kernel version similar to the previous one 
but with SMP support);  

Note that we decided not to take into account the SMP versions of either the optimized Linux kernel or the Click Modular Router, since they lack a minimum acceptable level of stability.

Subsection E summarizes the results obtained in the 

previous tests by showing the maximum performance for 
each benchmarking setup. Finally, the performance of the 

two hardware architectures described in Section III are 
reported in Subsection F, in order to evaluate how HW 
evolution affects forwarding performance. 

A.  Setup A numerical results 

In the first benchmarking session, we performed the 

RFC 2544 tests by using setup A (see Figure 2) with both 
the single-processor 2.6.16 optimized kernel and Click. 
As we can observe in Figs. 3, 4 and 5, which report the 
numerical results of the throughput and latency tests, both 
software architectures cannot achieve the maximum 
theoretical throughput in the presence of small datagram 
sizes. 

As demonstrated by the profiling measurements reported in Fig. 6, obtained with the single processor optimized 2.6.16 kernel and 64-Byte sized datagrams, this effect is clearly caused by the computational CPU capacity, which limits the maximum forwarding rate of the Linux kernel to about 700 Kpackets/s (40% of the full Gigabit speed). In fact, even if the CPU idle time goes to zero at 40% of the full load, the CPU occupancies of all the most important function sets appear to adapt their contributions up to 700 Kpackets/s; after this point, their percentage contributions to CPU utilization remain almost constant.
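As a rough back-of-the-envelope check (our estimate, not a measurement from the tests): with the 2.4 GHz Xeon of the old architecture, a saturation rate of about 700 Kpackets/s corresponds to a per-packet budget of roughly

\[ \frac{2.4\times10^{9}\ \mathrm{cycles/s}}{7\times10^{5}\ \mathrm{packets/s}} \approx 3.4\times10^{3}\ \mathrm{cycles\ per\ packet}, \]

which has to cover IRQ handling, NAPI, Ethernet and IP processing, and memory management for every forwarded datagram.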

 

Figure 3. Throughput and latencies test, testbed A: effective throughput results (throughput [%] vs. IP packet size [Bytes]) for the single-processor 2.6.16 optimized kernel and Click.

Figure 4. Throughput and latencies test, testbed A: minimum and maximum latencies [us] vs. IP packet size [Bytes] for both the single-processor 2.6.16 optimized kernel and Click.

 
More precisely, Fig. 6 shows that the computational weight of the memory management operations (such as sk_buff allocations and de-allocations) is substantially limited, thanks to the descriptor recycling patch, to less than 25%. In other works of ours, such as [19], we have shown that this patch can save a CPU time share of about 20%.


Figure 5. Throughput and latencies test, testbed A: average latencies [us] vs. IP packet size [Bytes] for both the single-processor 2.6.16 optimized kernel and Click.

Figure 6. Profiling results of the optimized Linux kernel obtained with testbed setup A: CPU utilization [%] vs. offered load [%] for the idle, scheduler, memory, IP processing, NAPI, Tx API, IRQ, Ethernet processing and oprofile classes.

Figure 7. Number of IRQ routines, polls and Rx SoftIRQs (second y-axis) per received packet for the Rx board with the skbuff recycling patched kernel, in the presence of an incoming traffic flow with only one IP source address.

Figure 8. Number of IRQ routines for the Tx board, and of Tx Ring cleanings performed by the Tx SoftIRQ (“func”) and by the Rx SoftIRQ (“wake”), for the skbuff recycling patched kernel, in the presence of an incoming traffic flow with only one IP source address. The second y-axis refers to “wake”.

 
The behavior of the IRQ management operations may appear rather strange: in fact, their CPU utilization level decreases as the input rate increases. There are mainly two reasons for this behavior, both related to the packet grouping effect in the Tx and Rx APIs: when the ingress packet rate rises, NAPI tends to moderate the IRQ rate by operating more like a polling than an interrupt mechanism (which gives the first reduction in the number of interrupts), while the TxAPI, under the same conditions, can better exploit packet grouping by sending more packets at a time (so the number of interrupts confirming successful transmissions decreases). When the IRQ weight becomes zero, the OR reaches its saturation point and operates like a pure polling mechanism.

With regard to all the other operation sets (i.e., IP and Ethernet processing, NAPI and TxAPI), their behavior is clearly bound to the number of forwarded packets: the weight of almost all the classes increases linearly up to the saturation point, and subsequently remains more or less constant.

This analysis is also confirmed by the performance counters reported in Figs. 7 and 8: both the Tx and Rx boards reduce their IRQ generation rates, while the kernel passes from polling the Rx Ring twice per received packet to about 0.22 times. The number of Rx SoftIRQs per received packet also decreases as the offered traffic load rises. As far as the transmission dynamics are concerned, Fig. 8 shows very low function occurrences: the Tx IRQ routines decrease their occurrences up to saturation, while the “wake” function, which represents the number of times the Tx Ring is cleaned and the Qdisc buffer is served during an Rx SoftIRQ, exhibits a mirror-like behavior. This occurs because, when the OR reaches saturation, all the Tx functionalities are activated when the Rx SoftIRQ starts.

Figure 9. Back-to-back test, testbed A: maximum zero-loss burst lengths [packets] vs. IP packet size [Bytes] for the 2.6.16 optimized kernel and Click.

TABLE I. BACK-TO-BACK TEST, TESTBED A: LATENCY VALUES FOR BOTH THE SINGLE-PROCESSOR OPTIMIZED KERNEL AND CLICK

Pkt Length | optimized 2.6.16 kernel latency [us] |      Click latency [us]
   [Byte]  |   Min      Average      Max          |   Min      Average      Max
       40  |  16.16      960.08    1621.47        |  23.47     1165.53    1693.64
       64  |  14.95      929.27    1463.02        |  23.52     1007.42    1580.74
      128  |  16.04      469.9      925.93        |  19.34       54.88      53.45
      256  |  16.01       51.65      58.84        |  22.49       52.62      47.67
      512  |  18.95       54.96      61.51        |  20.72       62.92      59.95
      768  |  23.35      100.76     164.56        |  22.85      116.61     155.59
     1024  |  25.31      123.68     164.21        |  32.02      128.85     154.72
     1280  |  28.6       143.43     166.46        |  24.77      151.81     178.45
     1500  |  30.38      142.22     163.63        |  32.01      154.79     181.43

 
Similar considerations can also be made for the Click modular router: the performance limitations in the presence of short-sized datagrams continue to be caused by a computational bottleneck, but the simple Click packet reception API, based on a polling mechanism, improves throughput performance by lowering the weight of the IRQ management and RxAPI functions. For the same reason, as shown in Figs. 4 and 5, the receive mechanism included in Click introduces higher packet latencies. In line with the previous results, the back-to-back tests reported in Fig. 9 and Table I also demonstrate that the optimized 2.6.16 Linux kernel and Click continue to be affected by small-sized datagrams. In fact, while with 256-Byte or larger datagrams the measured zero-loss burst length is quite


close to the maximum burst length used in the tests, it appears to be heavily limited in the presence of 40-, 64- and, for the Linux kernel only, 128-Byte sized packets. The 128-Byte case is the only one in which the computational bottleneck starts to affect NAPI while the Click forwarding rate remains very close to the theoretical one. On the other hand, the Linux kernel provides better support for burst traffic than Click: the zero-loss burst lengths are longer and the associated latency times are smaller. The loss rate test results are reported in Fig. 10.

Figure 10. Loss rate test, testbed A: maximum throughput [%] vs. offered load [%] for the 2.6.16 optimized kernel (40, 64 and 128 B) and Click (40 and 64 B).

B.  Setup B numerical results 

In the second benchmarking session we analyzed the 

performance achieved by the optimized single processor 
Linux kernel, the SMP standard Linux kernel and the 
Click modular router with testbed setup B (see Fig. 2). 
Fig. 11 reports the maximum effective throughput in 
terms of forwarded packets per second for a single router 
interface. From this figure it is clear that, in the presence 
of short-sized packets, the performance level of all three 
software architectures is not close to the theoretical one. 
More specifically, while the best throughput values are 
achieved by Click, the SMP kernel seems to provide 
better forwarding rates with respect to the optimized 
kernel. In fact, as outlined in [25], if no explicit CPU-
interface bounds are present, the SMP kernel processes 
the received packets (using, if possible, the same CPU for 
the entire packet elaboration) and attempts to dynamically 
distribute the computational load among the CPUs. 

Figure 11. Throughput and latencies test, testbed setup B: effective throughput [%] vs. IP packet size [Bytes] for the 2.6.16 optimized kernel, Click and the SMP kernel.

 

 
Thus, in this particular setup, the computational load sharing tends to manage the two interfaces to which each traffic pair is applied with a single, fixed CPU, fully processing each received packet with only one CPU and thus avoiding any memory concurrency problems. Figs. 12 and 13 report the minimum, the average and the maximum latency values obtained for all three software architectures with different datagram sizes. In particular, we note that both Linux kernels, which in this case provide very similar results, ensure lower minimum latencies than Click. Click, instead, provides better average and maximum latency values for short-sized datagrams.

Figure 12. Throughput and latencies test, testbed B: minimum and maximum latencies [us] vs. IP packet size [Bytes] for the three software architectures.

 

Figure 13. Throughput and latencies test, testbed B: average latencies [us] vs. IP packet size [Bytes] for the three software architectures.
 

Figure 14. Back-to-back test, testbed B: maximum zero-loss burst lengths [packets] vs. IP packet size [Bytes].

 
The back-to-back results, reported in Fig. 14 and Table II, show that the performance levels of all the analyzed architectures are nearly comparable in terms of zero-loss burst length, while, as far as latencies are concerned, the Linux kernels provide better values. By analyzing Fig. 15, which reports the loss rate results, we note how the performance values obtained with Click and the SMP kernel are better, especially for small-sized datagrams, than those obtained with the optimized single processor kernel. Moreover, Fig. 15 also shows that none of the three OR software architectures achieves full Gigabit speed even with large datagrams, the maximum forwarding rate being about 650 Mbps per interface. To improve the readability of these results, in Fig. 15 and in all the following loss rate tests we report only the OR behavior with the minimum and maximum datagram sizes, since they represent, respectively, the lower and upper performance bounds.


TABLE II. BACK-TO-BACK TEST, TESTBED B: LATENCY VALUES FOR ALL THREE SOFTWARE ARCHITECTURES

Pkt Length | optimized 2.6.16 kernel latency [us] |     Click latency [us]      | 2.6.16 SMP kernel latency [us]
   [Byte]  |   Min     Average     Max            |   Min     Average    Max    |   Min     Average     Max
       40  |  122.3     2394.7    5029.4          |   27.9     5444.6    13268  |  70.48      1505      3483
       64  |  124.8     1717.1    3320.4          |   30.1     2854.7    16349  |  89.24      1577      3474
      128  |  212.1     1313.1    2874.2          |   46.9     2223.5   7390.0  |  67.02      1202      3047
      256  |  139.6      998.8    2496.9          |   45.4     1314.9   5698.6  |  37.86      1005      2971
      512  |   70.8      688.2    2088.7          |   21.2      574.5   2006.4  |  31.77       728      2085
      768  |   55.7      585.3    2122.7          |   28.2      480.0   1736.5  |  35.01       587      1979
     1024  |   71.0      373.8    1264.5          |   33.8      458.3   1603.3  |  37.19       361      1250
     1280  |   58.6      427.7    1526.6          |   45.0      426.7   1475.5  |  38.58       482      1868
     1500  |   66.4      485.6    1707.8          |   38.3      462.3   1524.3  |  36.68       478      1617

Figure 15. Loss rate test, testbed B: maximum throughput versus both offered load and IP datagram sizes (40 and 1500 B curves for the 2.6.16 kernel, Click and the SMP kernel).

C.  Setup C numerical results 

In this benchmarking session, the three software 

architectures were tested in the presence of four Gigabit 
Ethernet interfaces with a full-meshed traffic matrix (Fig. 
2). By analyzing the maximum effective throughput 
values in Fig. 16, we note that Click appears to achieve a 
better performance level with respect to the Linux kernels 
while, unlike the previous case, the single processor 
kernel provides maximum forwarding rates larger than 
the SMP version with small packets.  

In fact, the SMP kernel tries to share the computational 

load of the incoming traffic among the CPUs, resulting in 
an almost static assignment of each CPU to two specific 
network interfaces. Since, in the presence of a full-
meshed traffic matrix, about half of the forwarded 
packets cross the OR between two interfaces managed by 
different CPUs, this decreases performance due to 
memory concurrency problems [19]. Figs. 17 and 18 
show the minimum, the maximum and the average 
latency values obtained during this test set. In observing 
the last results, we note how the SMP kernel, in the 
presence of short-sized datagrams, continues to undergo 
memory concurrency problems which lowers OR 
performance while considerably increasing both the 
average and the maximum latency values.  

By analyzing Fig. 19 and Table III, which report the 

back-to-back test results, we note that all three OR 
architectures achieve a similar zero-loss burst length, 
while Click reaches very high average and maximum 
latencies with respect to the single-processor and SMP 
kernels when small packets are used.  

The loss-rate results in Fig. 20 highlight the 

performance decay of the SMP kernel, while a fairly 
similar behavior is achieved by the other two 
architectures. Moreover, as in the previous benchmarking 
session, the maximum forwarding rate for each Gigabit 
network interface is limited to about 600/650 Mbps. 

Figure 16. Throughput and latencies test, setup C: effective throughput results [%] vs. IP packet size [Bytes].
 

Figure 17. Throughput and latencies test, testbed C: minimum and maximum latencies [us] vs. IP packet size [Bytes].
 

Figure 18. Throughput and latencies test, testbed C: average latencies [us] vs. IP packet size [Bytes].
 

Figure 19. Back-to-back test, testbed C: maximum zero-loss burst lengths [packets] vs. IP packet size [Bytes].
 

Figure 20. Loss rate test, testbed C: maximum throughput versus both offered load and IP datagram sizes (40 and 1500 B curves).


TABLE III. BACK-TO-BACK TEST, TESTBED C: LATENCY VALUES FOR THE SINGLE-PROCESSOR 2.6.16 OPTIMIZED KERNEL, THE CLICK MODULAR ROUTER AND THE SMP 2.6.16 KERNEL

Pkt Length | optimized 2.6.16 kernel latency [us] |      Click latency [us]        | 2.6.16 SMP kernel latency [us]
   [Byte]  |   Min     Average     Max            |   Min     Average     Max      |   Min     Average     Max
       40  |   92.1     2424.3    5040.3          |   73.5     6827.0    15804.9   |   74.8     3156.8    6164.2
       64  |  131.3     1691.7    3285.1          |  176.7     6437.6    16651.1   |   66.5     2567.5    5140.6
      128  |   98.1     1281.0    2865.9          |   60.2     3333.8     9482.1   |   77.1     1675.1    3161.6
      256  |   19.6      915.9    2494.6          |   16.7     1388.9     3972.2   |   44.1      790.8    1702.8
      512  |   23.9      666.9    2138.9          |   15.9      649.2     2119.3   |   23.2      815.8    2189.0
      768  |   22.3      571.3    2079.7          |   22.5      543.7     2002.6   |   23.6      737.3    2193.0
     1024  |   22.0      353.7    1232.2          |   36.3      382.2     1312.7   |   30.0      411.7    1276.8
     1280  |   25.9      436.4    1525.4          |   34.6      443.0     1460     |   29.8      447.7    1469.5
     1500  |   27.4      469.5    1696.7          |   36.7      457.5     1525.7   |   30.0      482.6    1719.6

D.  Setup D numerical results 

In the last benchmarking session, we applied setup D, 

which provides a full-meshed traffic matrix between one 
Gigabit Ethernet and 12 Fast Ethernet interfaces, to the 
single-processor Linux kernel and to the SMP version. 
We did not use Click in this last test since, at the moment 
and for this software architecture, there are no drivers 
with polling support for the D-Link interfaces.  

By analyzing the throughput and latency results in Figs. 21, 22 and 23, we note how, in the presence of a high number of interfaces and a full-meshed traffic matrix, the performance of the SMP kernel version drops significantly: the maximum measured value for the effective throughput is limited to about 2400 packets/s, and the corresponding latencies appear to be much higher than those obtained with the single processor kernel. However, the single processor kernel does not sustain the maximum theoretical rate either: it achieves about 10% of full speed in the presence of short-sized datagrams and about 75% for large datagram sizes.

Figure 21. Throughput and latencies test, setup D: effective throughput results [%] vs. IP packet size [Bytes] for both Linux kernels.

Figure 22. Throughput and latencies test, testbed D: minimum and maximum latencies [us] vs. IP packet size [Bytes] for both Linux kernels.

To better understand why the OR does not attain full speed with such a high number of interfaces, we decided to perform several profiling tests. In particular, these tests were carried out using two simple traffic matrices: the first (Fig. 24) consists of 12 CBR flows that cross the OR from the Fast Ethernet interfaces to the Gigabit one, while the second (Fig. 25) consists of 12 CBR flows that cross the OR in the opposite direction (i.e., from the Gigabit to the Fast Ethernet interfaces). These simple traffic matrices allow us to analyze the reception and transmission operations separately.

 

Figure 23. Throughput and latencies test, testbed D: average latencies [us] vs. IP packet size [Bytes] for both Linux kernels.
 

Figure 24. Profiling results (CPU percentage [%] vs. offered load [Kpackets/s]) obtained by using 12 CBR flows that cross the OR from the Fast Ethernet interfaces to the Gigabit one.
 

Figure 25. Profiling results (CPU percentage [%] vs. offered load [Kpackets/s]) obtained by using 12 CBR flows that cross the OR from a Gigabit interface to the 12 Fast Ethernet ones.

 
Thus, Figs. 24 and 25 report the profiling results 

corresponding to these two traffic matrices. The internal measurements shown in Fig. 24 highlight the fact that the CPUs are overloaded by the very high computational load of the IRQ and Tx API management operations. This is due to the fact that, during the transmission process, each interface must signal the state of both the transmitted packets and the transmission ring to the associated driver instance through interrupts. More specifically, and again referring to Fig. 24, we note that the IRQ CPU occupancy decreases up to about 30% of the offered load and afterwards, when the OR reaches saturation, it settles at about 50% of the computational resources. The initial decreasing behavior is due to the fact that, by increasing the offered traffic load, the OR can better exploit the packet grouping effects, while the subsequent constant behavior is due to the fact that the OR then handles the same quantity of packets. Referring to Fig.


25, we note how the presence of traffic incoming from many interfaces increases the computational weights of both the IRQ and the memory management operations. The decreasing behavior of the IRQ management computational weight is not due, as in the previous case, to the packet grouping effect, but to the typical NAPI structure, which passes from an IRQ-based mechanism to a polling one. The high memory management values can be explained quite simply by the fact that the recycling patch does not operate with the Fast Ethernet driver.

Figure 26. Back-to-back test, testbed D: maximum zero-loss burst lengths [packets] vs. IP packet size [Bytes].
 

TABLE IV. BACK-TO-BACK TEST, TESTBED D: LATENCY VALUES FOR THE SINGLE-PROCESSOR 2.6.16 OPTIMIZED KERNEL AND THE SMP 2.6.16 KERNEL

Pkt Length | optimized 2.6.16 kernel latency [us] | 2.6.16 SMP kernel latency [us]
   [Byte]  |    Min      Average      Max         |    Min      Average      Max
       40  |  285.05     1750.89    2921.23       |   37.04     1382.96    2347.89
       64  |  215.5      1821.81    2892.13       |   38.07     1204.43    1963.7
      128  |  216.15     1847.76    3032.22       |   34.52     1244.87    1984.4
      256  |   61.83     1445.15    2353.12       |   30.61     2082.76    3586.6
      512  |   57.73     2244.68    4333.44       |   32.97      908.19    1661.18
      768  |  101.78     1981.5     3497.81       |   50.64     1007.27    1750.45
     1024  |  108.17     1386.19    2394.4        |   52.14      819.98    1642.13
     1280  |   73.15     1662.13    3029.54       |   58.11      981.92    1953.62
     1500  |  109.24     1149.76    2250.78       |   70.92      869.36    1698.68

Figure 27. Loss rate test, testbed D: maximum throughput versus both offered load and IP datagram sizes (40 and 1500 B curves for both Linux kernels).

 
The back-to-back results, reported in Fig. 26 and Table IV, show a very particular behavior: even if the single processor kernel can achieve longer zero-loss burst lengths than the SMP kernel, the latter appears to ensure lower minimum, average and maximum latency values. Finally, Fig. 27 reports the loss rate test results which, consistently with the previous results, show that the single processor kernel can sustain a higher forwarding throughput than the SMP version.

 

E.  Maximum Performance 

In order to effectively synthesize and improve the evaluation of the proposed performance results, we report in Figs. 28 and 29 the aggregated² maximum values obtained in each testbed of, respectively, the effective throughput and the maximum throughput (obtained in the loss rate test). By analyzing Fig. 28, we note that, in the presence of more network interfaces, the OR achieves values higher than 1 Gbps and, in particular, that it reaches a maximum of 1.6 Gbps with testbed D. We can also point out that the maximum effective throughputs of setups B and C are almost the same: in fact, these very similar testbeds differ in only one aspect (i.e., the traffic matrix), which affects the performance level of the SMP kernel but has practically no effect on the behavior of the single processor kernel and Click.

 

Figure 28. Maximum effective throughput values [Mbps] vs. IP packet size [Bytes] obtained in the implemented testbeds (setups A, B, C and D).
 

Figure 29. Maximum throughput values [Mbps] vs. IP packet size [Bytes] obtained in the implemented testbeds (setups A, B, C and D).

 
The aggregated maximum throughput values reported in Fig. 29 are obviously higher than the ones in Fig. 28, and highlight the fact that the maximum forwarding rates sustainable by the OR, about 2.5 Gbps, are achieved in setups B and C. Moreover, while in setup A the maximum theoretical rate is achieved for packet sizes larger than 128 Bytes, in all the other setups the maximum throughput values are not much higher than half the theoretical ones.

F. Hardware Architecture Impact 

In the final benchmarking session, we decided to compare the performance of the two hardware architectures introduced in Section III, which represent the current state-of-the-art of server architectures and the state-of-the-art of a few years ago. The benchmarking scenario is the one used in testbed A (with reference to Fig. 2), while the selected software architecture is the single processor optimized kernel.

                                                           

² In this case, “aggregated” refers to the sum of the forwarding rates of all the OR network interfaces.


It is clear that the purpose of these tests was to understand how the continuous evolution of COTS hardware affects the overall OR performance. Therefore, Figs. 30, 31 and 32 report the results of the effective throughput tests for the “old” architecture (i.e., 32-bit Xeon) and the “new” one (i.e., 64-bit Xeon), the latter equipped with both the PCI-X and the PCI-Express busses. The loss rate results are shown in Fig. 33.

Figure 30. Throughput and latencies test, setup A with the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express busses: effective throughput results [%] vs. IP packet size [Bytes] for the single processor optimized kernel (logarithmic x-axis).
  

Figure 31. Throughput and latencies test, testbed A with the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express busses: minimum and maximum latencies [us] for the single processor optimized kernel.
 

Figure 32. Throughput and latencies test, testbed A with the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express busses: average latencies [us] for the single processor optimized kernel.

 
By observing the comparisons in Figs. 30 and 31, it is clear that the new architecture generally provides better performance values than the old one. More specifically, while using the new architecture with the PCI-X bus only slightly improves performance, when the PCI-Express bus is used the OR effective throughput reaches an impressive 88% with 40-Byte sized packets, and achieves the maximum theoretical rate for all the other packet sizes. All this is clearly due to the high efficiency of the PCI Express bus: with this I/O bus, DMA transfers occur with a very low control overhead (since it behaves like a leased line), which probably leads to lighter accesses to the RAM and, consequently, to benefits in terms of memory accesses by the CPU. In other words, this large performance enhancement is caused by a more effective memory access by the CPU, thanks to the features of the PCI Express DMA.

Figure 33. Loss rate test, testbed A for the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express busses: maximum throughput versus both offered load and IP datagram sizes.

VII. CONCLUSIONS

In this contribution we have reported the results of the in-depth optimization and testing carried out on a PC Open Router architecture based on Linux software and, more specifically, on the Linux kernel. We have presented a performance evaluation, in some common working environments, of three different data plane architectures, namely the optimized Linux 2.6 kernel, the Click Modular Router and the SMP Linux 2.6 kernel, with both external (throughput and latencies) and internal (profiling) measurements. The external measurements were performed in an RFC 2544 [29] compliant manner by using professional devices [27]. Two hardware architectures were also tested and compared for the purpose of understanding how the evolution of COTS hardware may affect forwarding performance.

The experimental results show that the optimized version of the Linux kernel, with a suitable hardware architecture, can achieve performance levels high enough to effectively support several Gigabit interfaces: the OR attains very interesting performance levels, with aggregated forwarding rates of about 2.5 Gbps and relatively low latencies.

REFERENCES

[1] Building Open Router Architectures Based On Router Aggregation project (BORA-BORA), homepage at http://www.tlc.polito.it/borabora.
[2] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek, "The Click modular router", ACM Transactions on Computer Systems 18(3), Aug. 2000, pp. 263-297.
[3] Zebra, http://www.zebra.org/.
[4] M. Handley, O. Hodson, E. Kohler, "XORP: an open platform for network research", ACM SIGCOMM Computer Communication Review, Vol. 33, Issue 1, Jan. 2003, pp. 53-57.
[5] S. Radhakrishnan, "Linux - Advanced networking overview", http://qos.ittc.ku.edu/howto.pdf.
[6] M. Rio et al., "A map of the networking code in Linux kernel 2.4.20", Technical Report DataTAG-2004-1, FP5/IST DataTAG Project, Mar. 2004.
[7] FreeBSD, http://www.freebsd.org.
[8] B. Chen and R. Morris, "Flexible Control of Parallelism in a Multiprocessor PC Router", Proc. of the 2001 USENIX Annual Technical Conference (USENIX '01), Boston, USA, June 2001.
[9] C. Duret, F. Rischette, J. Lattmann, V. Laspreses, P. Van Heuven, S. Van den Berghe, P. Demeester, "High Router Flexibility and Performance by Combining Dedicated Lookup Hardware (IFT), off the Shelf Switches and Linux", Proc. of the 2nd International IFIP-TC6 Networking Conference, Pisa, Italy, May 2002, LNCS 2345, Ed. E. Gregori et al., Springer-Verlag 2002, pp. 1117-1122.
[10] A. Barczyk, A. Carbone, J.P. Dufey, D. Galli, B. Jost, U. Marconi, N. Neufeld, G. Peco, V. Vagnoni, "Reliability of datagram transmission on Gigabit Ethernet at full link load", LHCb technical note, LHCB 2004-030 DAQ, Mar. 2004.
[11] P. Gray, A. Betz, "Performance Evaluation of Copper-Based Gigabit Ethernet Interfaces", Proc. of the 27th Annual IEEE Conference on Local Computer Networks (LCN'02), Tampa, Florida, November 2002, pp. 679-690.
[12] A. Bianco, R. Birke, D. Bolognesi, J. M. Finochietto, G. Galante, M. Mellia, M.L.N.P.P. Prashant, F. Neri, "Click vs. Linux: Two Efficient Open-Source IP Network Stacks for Software Routers", Proc. of the 2005 IEEE Workshop on High Performance Switching and Routing (HPSR 2005), Hong Kong, May 2005, pp. 18-23.
[13] A. Bianco, J. M. Finochietto, G. Galante, M. Mellia, F. Neri, "Open-Source PC-Based Software Routers: a Viable Approach to High-Performance Packet Switching", Proc. of the 3rd International Workshop on QoS in Multiservice IP Networks (QOS-IP 2005), Catania, Italy, Feb. 2005, pp. 353-366.
[14] A. Bianco, R. Birke, G. Botto, M. Chiaberge, J. Finochietto, G. Galante, M. Mellia, F. Neri, M. Petracca, "Boosting the Performance of PC-based Software Routers with FPGA-enhanced Network Interface Cards", Proc. of the 2006 IEEE Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 2006, pp. 121-126.
[15] A. Grover, C. Leech, "Accelerating Network Receive Processing: Intel I/O Acceleration Technology", Proc. of the 2005 Linux Symposium, Ottawa, Ontario, Canada, Jul. 2005, vol. 1, pp. 281-288.
[16] R. McIlroy, J. Sventek, "Resource Virtualization of Network Routers", Proc. of the 2006 IEEE Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 2006, pp. 15-20.
[17] R. Bolla, R. Bruschi, "The IP Lookup Mechanism in a Linux Software Router: Performance Evaluation and Optimizations", Proc. of the 2007 IEEE Workshop on High Performance Switching and Routing (HPSR 2007), New York, USA.
[18] K. Wehrle, F. Pählke, H. Ritter, D. Müller, M. Bechler, "The Linux Networking Architecture: Design and Implementation of Network Protocols in the Linux Kernel", Pearson Prentice Hall, Upper Saddle River, NJ, USA, 2004.
[19] R. Bolla, R. Bruschi, "A high-end Linux based Open Router for IP QoS networks: tuning and performance analysis with internal (profiling) and external measurement tools of the packet forwarding capabilities", Proc. of the 3rd International Workshop on Internet Performance, Simulation, Monitoring and Measurements (IPS MoMe 2005), Warsaw, Poland, Mar. 2005.
[20] J. H. Salim, R. Olsson, A. Kuznetsov, "Beyond Softnet", Proc. of the 5th Annual Linux Showcase & Conference, Oakland, California, USA, Nov. 2001.
[21] A. Cox, "Network Buffers and Memory Management", Linux Journal, Oct. 1996, http://www2.linuxjournal.com/lj-issues/issue30/1312.html.
[22] The Intel PRO 1000 XT Server Adapter, http://www.intel.com/network/connectivity/products/pro1000xt.htm.
[23] The D-Link DFE-580TX quad network adapter, http://support.dlink.com/products/view.asp?productid=DFE%2D580TX#.
[24] J. A. Ronciak, J. Brandeburg, G. Venkatesan, M. Williams, "Networking Driver Performance and Measurement - e1000 A Case Study", Proc. of the 2005 Linux Symposium, Ottawa, Ontario, Canada, July 2005, vol. 2, pp. 133-140.
[25] R. Bolla, R. Bruschi, "IP forwarding Performance Analysis in presence of Control Plane Functionalities in a PC-based Open Router", Proc. of the 2005 Tyrrhenian International Workshop on Digital Communications (TIWDC 2005), Sorrento, Italy, June 2005, and in F. Davoli, S. Palazzo, S. Zappatore, Eds., "Distributed Cooperative Laboratories: Networking, Instrumentation, and Measurements", Springer, Norwell, MA, 2006, pp. 143-158.
[26] The descriptor recycling patch, ftp://robur.slu.se/pub/Linux/net-development/skb_recycling/.
[27] The Agilent N2X Router Tester, http://advanced.comms.agilent.com/n2x/products/.
[28] Oprofile, http://oprofile.sourceforge.net/news/.
[29] Request for Comments 2544 (RFC 2544), http://www.faqs.org/rfcs/rfc2544.html.

 
 

Raffaele Bolla was born in Savona (Italy) in 1963. He 

received his Master of Science degree in Electronic Engineering 
from the University of Genoa in 1989 and his Ph.D. degree in 
Telecommunications at the Department of Communications, 
Computer and Systems Science (DIST) in 1994 from the same 
university. From 1996 to 2004 he worked as a researcher at 
DIST where, since 2004, he has been an Associate Professor, 
and teaches a course in Telecommunication Networks and 
Telematics. His current research interests focus on resource 
allocation, Call Admission Control and routing in Multi-service 
IP networks, Multiple Access Control, resource allocation and 
routing in both cellular and ad hoc wireless networks. He has 
authored or coauthored over 100 scientific publications in 
international journals and conference proceedings. He has been 
the Principal Investigator in many projects in the 
Telecommunication Networks field. 
 
 

Roberto Bruschi was born in Genoa (Italy) in 1977. He 

received his Master of Science degree in Telecommunication 
Engineering in 2002 from the University of Genoa and his 
Ph.D. in Electronic Engineering in 2006 from the same 
university. He is presently working with the Telematics and 
Telecommunication Networks Lab (TNT) in the Department of 
Communication, Computer and System Sciences (DIST) at the 
University of Genoa. He is also a member of CNIT, the Italian 
inter-university Consortium for Telecommunications. Roberto is 
an active member of various Italian research projects in the 
networking area, such as BORA-BORA, FAMOUS, TANGO 
and EURO. He has co-authored over 10 papers in international 
conferences and journals. His main interests include Linux 
Software Router, Network processors, TCP and network 
modeling, VPN design, P2P modeling, bandwidth allocation, 
admission control and routing in multiservice QoS IP/MPLS 
networks. 
 
