Linux Software Router: Data Plane Optimization
and Performance Evaluation
Raffaele Bolla and Roberto Bruschi
DIST - Department of Communications, Computer and Systems Science
University of Genoa
Via Opera Pia 13, 16145 Genoa, Italy
Email: {raffaele.bolla, roberto.bruschi}@unige.it
Abstract - Recent technological advances provide an
excellent opportunity to achieve truly effective results in the
field of open Internet devices, also known as Open Routers
or ORs. Even though some initiatives have been undertaken
over the last few years to investigate ORs and related topics,
other extensive areas still require additional investigation.
In this contribution we report the results of the in-depth
optimization and testing carried out on a PC Open Router
architecture based on Linux software and COTS hardware.
The main focus of this paper is the forwarding
performance evaluation of different Linux-based OR
software architectures. This analysis was performed with
both external (throughput and latency) and internal
(profiling) measurements. In particular, for the external
measurements, a set of RFC 2544-compliant tests was
proposed and analyzed.
Index Terms - Linux Router; Open Router; RFC 2544; IP
forwarding.
I. INTRODUCTION
Internet technology has been developed in an open
environment and all Internet-related protocols,
architectures and structures are publicly created and
described. For this reason, in principle, everyone can
“easily” develop an Internet device (e.g., a router). On the
contrary, and to a certain extent quite surprisingly, most of
the professional devices are developed in an extremely
“closed” manner. In fact, it is very difficult to acquire
details about internal operations and to perform anything
more complex than a parametrical configuration.
From a general viewpoint, this is not very strange since
it can be considered a clear attempt to protect the
industrial investment. However, sometimes the
“experimental” nature of the Internet and its diffusion in
many contexts might suggest a different approach. The
need for openness is even more evident within the scientific
community, which often runs into various problems when
carrying out experiments, testbeds and trials to evaluate
new functionalities and protocols.
Today, recent technological advances provide an
opportunity to do something truly effective in the field of
open Internet devices, sometimes called Open Routers
(ORs). Such an opportunity arises from the use of Open
Source Operating Systems (OSs) and COTS/PC
components. The attractiveness of the OR solution can be
summarized as: multi-vendor availability, low-cost and
continuous updating/evolution of the basic parts. As far
as performance is concerned, the PC architecture is
general-purpose which means that, in principle, it cannot
attain the same performance level as custom, high-end
network devices, which often use dedicated HW elements
to handle and to parallelize the most critical operations.
However, the performance gap might not be so large
and, in any case, is more than justified by the cost
difference. Our activities, carried out within the
framework of the BORA-BORA project [1], are geared to
facilitate the investigation by reporting the results of an
extensive optimization and testing operation carried out
on an OR architecture based on Linux software. We focused
our attention mainly on packet forwarding functionalities.
Our main objective was the performance evaluation of
an optimized OR by means of both external (throughput
and latency) and internal (profiling) measurements. To
this end, we identified a high-end reference PC-based
hardware architecture and the Linux 2.6 kernel for the
software data plane. Subsequently, we optimized this OR
structure, defined a test environment and finally
developed a complete series of tests with an accurate
evaluation of the software module’s role in defining
performance limits.
With regard to the state-of-the-art of OR devices, some
initiatives have been undertaken over the last few years to
develop and investigate the ORs and related topics. In the
software area, one of the most important initiatives is the
Click Modular Router Project [2], which proposes an
effective data plane solution. In the control plane area
two important projects can be cited: Zebra [3] and Xorp
[4].
Besides custom developments, some standard Open
Source OSs can also provide very effective support for an
OR project. The most relevant OSs in this sense are
Linux [5][6] and FreeBSD [7]. Other activities focus on
hardware: [8] and [9] propose a router architecture based
on a PC cluster, while [10] reports some performance
results (in packet transmission and reception) obtained
with a PC Linux-based testbed. Some evaluations have
also been carried out on network boards (see, for
example, [11]).
Other fascinating projects involving Linux-based ORs
can be found in [12] and [13], where Bianco et al. report
some interesting performance results. In [14] a
performance analysis of an OR architecture enhanced
with FPGA line cards, which allows direct NIC-to-NIC
packet forwarding, is introduced. [15] describes the Intel
I/OAT, a technology that uses DMA engines to
improve network reception and transmission by
offloading some low-level operations from the CPU.
In [16] the virtualization of a multiservice OR
architecture is discussed: the authors propose multiple
Click forwarding chains virtualized with Xen.
Finally, in [17], we proposed an in-depth study of the
IP lookup mechanism included in the Linux kernel.
The paper is organized as follows. The software and
hardware details of the proposed OR architecture are
reported in Sections II and III, respectively, while Section
IV contains a description of the performance tuning and
optimization techniques. The benchmarking scenario and
the performance results are reported in Sections V and
VI, respectively. Conclusions are presented in Section
VII.
II. LINUX OR SOFTWARE ARCHITECTURE
The OR architecture has to provide many different
types of functionalities: from those directly involved in
the packet forwarding process to the ones needed for
control functionalities, dynamic configuration and
monitoring.
As outlined in [5], in [18] and in [19], all the
forwarding functions are developed inside the Linux
kernel, while most of the control and monitoring
operations (signaling protocols such as routing
protocols, control protocols, etc.) are implemented by
daemons/applications running in user mode.
As in the older kernel versions, the Linux networking
architecture is based on an interrupt mechanism:
network boards signal the kernel upon packet reception or
transmission through HW interrupts. Each HW interrupt
is served as soon as possible by a handling routine, which
suspends the operations currently being processed by the
CPU. Until it completes, this handling routine cannot be
interrupted by anything, not even by other interrupt handlers.
Thus, with the clear purpose of making the system
reactive, the interrupt handlers are designed to be very
short, while all the time-consuming tasks are performed
by the so-called “Software Interrupts” (SoftIRQs)
afterwards. This is the well-known “top half–bottom
half” IRQ routine division implemented in the Linux
kernel [18].
SoftIRQs are not real interrupts, but rather a form of
kernel activity that can be scheduled for later execution.
They differ from HW IRQs mainly in that a
SoftIRQ is scheduled for execution by a kernel activity,
such as an HW IRQ routine, and has to wait until it is
called by the scheduler. SoftIRQs can be interrupted only
by HW IRQ routines.
The “NET_TX_SOFTIRQ” and the “NET_RX_
SOFTIRQ” are two of the most important SoftIRQs in the
Linux kernel and the backbone of the entire networking
architecture, since they are designed to manage the packet
transmission and reception operations, respectively. In
detail, the forwarding process is triggered by an HW IRQ
generated from a network device, which signals the
reception or the transmission of packets. Then the
corresponding routine performs some fast checks, and
schedules the correct SoftIRQ, which is activated by the
kernel scheduler as soon as possible. When the SoftIRQ
is finally executed, it performs all the packet forwarding
operations.
As shown in Figure 1, which reports a scheme of the
Linux source code involved in the forwarding process,
the operations performed during the SoftIRQs can be
organized in a chain of three different modules: a
“reception API” that handles packet reception (the NAPI,
which in fact also includes part of the HW interrupt
handler), a module that carries out the IP layer elaboration
and, finally, a “transmission API” that manages the
forwarding operations to the egress network interfaces.
In particular, the reception and the transmission APIs
are the lowest level modules, and are activated by both
HW IRQ routines and scheduled SoftIRQs. They handle
the network interfaces and perform some layer 2
functionalities.
The NAPI [20] was introduced in the 2.4.27 kernel
version, and has been explicitly created to increase
reception process scalability. It handles network interface
requests with an interrupt moderation mechanism,
through which it is possible to adaptively switch from a
classical interrupt management of the network interfaces
to a polling one.
In greater detail, this is accomplished by inserting the
identifier of the board generating the IRQ on a special
list, called the “poll list”, during the HW IRQ routine,
scheduling a reception SoftIRQ, and disabling the HW
IRQs for that device. When the SoftIRQ is activated, the
kernel polls all the devices whose identifiers are included
on the poll list, and a maximum of “quota” packets are
served per device. If the board buffer (Rx Ring) is
emptied, then the identifier is removed from the poll list
and its HW IRQs are re-enabled. Otherwise, its HW IRQs
are left disabled, the identifier remains on the poll list and
another SoftIRQ is scheduled. While this mechanism
behaves like a pure interrupt mechanism in the presence
of a low ingress rate (i.e., more or less one HW IRQ per
packet), when traffic increases, the probability of
emptying the Rx Ring, and thus of re-enabling HW IRQs,
decreases, and the NAPI starts working like a polling
mechanism.
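For readers less familiar with the kernel sources, the following minimal sketch illustrates the scheme just described. It is written in C against the current NAPI interface (napi_struct, napi_schedule() and napi_complete()), whose names differ from the 2.6.16 ones (netif_rx_schedule() and the per-device quota) although the logic is the same; all my_* identifiers and the register-access comments are illustrative placeholders, not the actual e1000 code.

    #include <linux/netdevice.h>
    #include <linux/interrupt.h>

    struct my_adapter {
        struct net_device *netdev;
        struct napi_struct napi;
    };

    /* Placeholders for NIC register accesses and Rx Ring processing */
    static void my_disable_rx_irq(struct my_adapter *ad) { /* write NIC register */ }
    static void my_enable_rx_irq(struct my_adapter *ad)  { /* write NIC register */ }
    static int my_clean_rx_ring(struct my_adapter *ad, int budget)
    {
        /* pull up to 'budget' packets from the Rx Ring, build their skbuffs
         * and hand them to netif_receive_skb(); return the number served */
        return 0;
    }

    /* Top half: runs at HW IRQ time and is kept as short as possible */
    static irqreturn_t my_intr(int irq, void *dev_id)
    {
        struct my_adapter *ad = dev_id;

        my_disable_rx_irq(ad);      /* stop further IRQs from this board */
        napi_schedule(&ad->napi);   /* add the device to the poll list and
                                       raise NET_RX_SOFTIRQ */
        return IRQ_HANDLED;
    }

    /* Bottom half: called by net_rx_action() from the NET_RX SoftIRQ */
    static int my_poll(struct napi_struct *napi, int budget)
    {
        struct my_adapter *ad = container_of(napi, struct my_adapter, napi);
        int done = my_clean_rx_ring(ad, budget);

        if (done < budget) {
            /* Rx Ring emptied: leave the poll list and re-enable HW IRQs */
            napi_complete(napi);
            my_enable_rx_irq(ad);
        }
        /* otherwise the device stays on the poll list and will be polled again */
        return done;
    }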
For each packet received during the NAPI processing, a
descriptor, called an skbuff [21], is immediately allocated. In
particular, as shown in Figure 1, to avoid unnecessary and
costly memory copy operations, the packets are left in
the memory locations used by the DMA engines of the
ingress network interfaces, and each subsequent operation
is performed through the skbuffs.
These descriptors do in fact consist of pointers to the
different key fields of the headers contained in the
associated packets, and are used for all the layer 2 and 3
operations.
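As a concrete illustration of the zero-copy reception described above, the short sketch below shows the delivery step a driver performs once the DMA engine has filled a receive buffer: skb_put(), eth_type_trans() and netif_receive_skb() are the real kernel helpers, while my_deliver_frame() and its calling context are only illustrative.

    #include <linux/skbuff.h>
    #include <linux/etherdevice.h>
    #include <linux/netdevice.h>

    static void my_deliver_frame(struct net_device *dev,
                                 struct sk_buff *skb, unsigned int len)
    {
        /* account for the bytes the DMA engine wrote into skb->data;
         * the payload is not copied anywhere else */
        skb_put(skb, len);

        /* parse the Ethernet header: this sets skb->protocol and moves the
         * data pointer past the MAC header, so that the IP code directly
         * sees the IP header through the descriptor */
        skb->protocol = eth_type_trans(skb, dev);

        /* enter the L3 code (ip_rcv() and the forwarding path) */
        netif_receive_skb(skb);
    }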
A packet is elaborated in the same NET_RX SoftIRQ
until it is enqueued in an egress device buffer, called the
Qdisc. Each time a NET_TX SoftIRQ is activated or a
new packet is enqueued, the Qdisc buffer is served. When
a packet is dequeued from the Qdisc buffer, it is placed
on the Tx Ring of the egress device. After the board
successfully transmits one or more packets, it generates
an HW IRQ, whose routine schedules a NET_TX
SoftIRQ.
The Tx Ring is periodically cleaned of the descriptors
of transmitted packets, which are de-allocated; the freed
positions are then refilled with packets coming from the
Qdisc buffer.
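The hand-off between the Qdisc and the Tx Ring can be sketched as follows. The skeleton uses the current transmit-hook conventions (netdev_tx_t, NETDEV_TX_OK/NETDEV_TX_BUSY), whereas 2.6.16 drivers exposed hard_start_xmit() returning an int; struct my_tx_priv and the my_* helpers stand in for driver internals and are not the real e1000 code.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    struct my_tx_priv {
        struct net_device *netdev;
        /* Tx Ring descriptors and indexes would live here */
    };

    static bool my_tx_ring_full(struct my_tx_priv *priv)      { return false; }
    static void my_post_tx_descriptor(struct my_tx_priv *priv,
                                      struct sk_buff *skb)     { /* DMA map + ring write */ }

    /* Invoked when the Qdisc dequeues a packet towards this device */
    static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        struct my_tx_priv *priv = netdev_priv(dev);

        if (my_tx_ring_full(priv)) {
            /* no room on the Tx Ring: stop the queue until the Tx-clean
             * routine frees some descriptors; the packet stays queued */
            netif_stop_queue(dev);
            return NETDEV_TX_BUSY;
        }

        /* place a descriptor pointing to the skbuff data on the Tx Ring */
        my_post_tx_descriptor(priv, skb);
        return NETDEV_TX_OK;
    }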
Another interesting characteristic of the 2.6 kernels
(introduced to reduce the performance deterioration due to
CPU concurrency) is the Symmetric Multi-Processor
(SMP) support, which may assign the management of each
network interface to a single CPU for both the
transmission and reception functionalities.
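When such a CPU-interface binding has to be forced explicitly (the "CPU-interface bounds" mentioned with the SMP results in Section VI), the standard /proc/irq/<n>/smp_affinity interface can be used. The small userspace sketch below pins a purely hypothetical IRQ number to CPU0; the real IRQ number of an interface has to be looked up in /proc/interrupts.

    #include <stdio.h>

    /* Write a CPU bitmask into /proc/irq/<irq>/smp_affinity ("1" = CPU0 only). */
    static int pin_irq_to_cpu0(int irq)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fputs("1\n", f);
        return fclose(f);
    }

    int main(void)
    {
        /* 24 is just an example IRQ number */
        return pin_irq_to_cpu0(24) ? 1 : 0;
    }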
III. HARDWARE ARCHITECTURE
The Linux OS supports many different hardware
architectures, but only a small portion of them can be
effectively used to obtain high OR performance.
In particular, we must take into account that, during
networking operations, the PC internal data path has to
use a centralized I/O structure consisting of the I/O bus,
the memory channel (used by DMA to transfer data from
network interfaces to RAM and vice versa) and the Front
Side Bus (FSB) (used by the CPU with the memory
channel to access the RAM during packet elaboration).
The selection criteria for the hardware elements have
therefore been very fast internal busses, RAM with very low
access times, and CPUs with high integer computational
power (since packet processing does not generally require
any floating point operations).
In order to understand how hardware architecture
affects overall system performance, we selected two
different architectures that represent the current state-of-the-art
of server architectures and the state-of-the-art from
3 years ago, respectively.
In this regard, as the old HW architecture, we chose a
system based on the Supermicro X5DL8-GG mainboard:
it can support a dual-Xeon system with a dual memory
channel and a PCI-X bus at 133MHz with 64 parallel bits.
The Xeon processors (32-bit and single-core) we utilized
have a 2.4 GHz clock and a 512 KB cache. For the new
OR architecture we used a Supermicro X7DBE
mainboard, equipped with both the PCI Express and PCI-
X busses, and with a 5050 Intel Xeon (dual core 64-bit
processor).
Network interfaces are another critical element, since
they can heavily affect PC Router performance. As
reported in [11], the network adapters on the market offer
different performance levels and configurability. With
this in mind, we selected two different types of adapters
with different features and speed: a high performance and
configurable Gigabit Ethernet interface, namely Intel
PRO 1000, which is equipped with a PCI-X controller
(XT version) or a PCI-Express (PT version) [22]; a D-
Link DFE-580TX [23] that is a network card equipped
with four Fast Ethernet interfaces and a PCI 2.1
controller.
Figure 1. Detailed scheme of the forwarding code in 2.6 Linux kernel versions: the HW interrupt handler and NAPI reception (e1000_clean_rx_irq, e1000_alloc_rx_buffers, alloc_skb, eth_type_trans, netif_receive_skb), the IP processing chain (ip_rcv, ip_rcv_finish, ip_route_input, ip_forward, ip_output, ip_finish_output), and the Tx API (dev_queue_xmit, hard_start_xmit, e1000_xmit_frame, qdisc_restart, net_tx_action, e1000_clean_tx_irq), together with the per-device Rx/Tx rings, the Root Qdisc, the completion queue, the Netfilter hook and the DMA engines.

IV. SOFTWARE PERFORMANCE TUNING

The entire networking Linux kernel architecture is
quite complex and has numerous aspects and parameters
that can be tuned for system optimization. In particular, in
this environment, since the OS has been developed to act
as a network host (i.e., workstation, server, etc.), it is
natively tuned for “general purpose” network end-node
usage. In this last case, packets are not fully processed
inside kernel-space, but are usually delivered from
network interfaces to applications in user-space, and vice
versa. When the Linux kernel is used in an OR
architecture, it generally works in a different manner, and
should be specifically tuned and customized to obtain the
maximum packet forwarding performance.
As reported in [19] and [25], where a more detailed
description of the adopted tuning actions can be found,
this optimization is very important for obtaining
maximum performance. Some of the optimal parameter
values can be identified by logical considerations, but
most of them have to be empirically determined, since
their optimal value cannot be easily derived from the
software structure and because they also depend on the
hardware components. So we carried out our tuning first
by identifying the critical elements on which to operate,
and, then, by finding the most convenient values with
both logical considerations and experimental measures.
As far as the adopted tuning settings are concerned, we
used the 6.3.9 e1000 driver [24], configured with both the
Rx and Tx ring buffers set to 256 descriptors, while the Rx
interrupt generation rate was not limited. The Qdisc size for
all the adapters was set to 20,000 descriptors, while the
scheduler clock frequency was fixed at 100 Hz.
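For reference, the same values can also be applied programmatically, as sketched below, through the standard ethtool ioctl (ETHTOOL_GRINGPARAM/ETHTOOL_SRINGPARAM) for the ring sizes and SIOCSIFTXQLEN for the interface transmit queue backing the default Qdisc; in practice the same effect is normally obtained with driver module parameters or the usual ethtool/ifconfig tools, and the interface name used here ("eth1") is only an example.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct ifreq ifr;
        struct ethtool_ringparam ring;

        if (fd < 0)
            return 1;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth1", IFNAMSIZ - 1);

        /* read the current ring limits, then ask for 256 Rx/Tx descriptors */
        memset(&ring, 0, sizeof(ring));
        ring.cmd = ETHTOOL_GRINGPARAM;
        ifr.ifr_data = (char *)&ring;
        if (ioctl(fd, SIOCETHTOOL, &ifr) == 0) {
            ring.cmd = ETHTOOL_SRINGPARAM;
            ring.rx_pending = 256;
            ring.tx_pending = 256;
            ioctl(fd, SIOCETHTOOL, &ifr);
        }

        /* enlarge the egress queue (txqueuelen) to 20,000 packets */
        ifr.ifr_qlen = 20000;
        ioctl(fd, SIOCSIFTXQLEN, &ifr);

        close(fd);
        return 0;
    }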
Moreover, the 2.6.16.13 kernel images used to obtain
the numerical results in Section VI include two structural
patches that we created to test and/or optimize kernel
functionalities; these patches are described in the
following subsections.
A. Skbuff Recycling patch
We studied and developed a new version of the skbuff
Recycling patch, originally proposed by R. Olsson [26]
for the “e1000” driver. In particular, the new version is
stabilized for the 2.6.16.13 kernel version and extended to
the “sundance” driver.
This patch intercepts the skbuff descriptors of
transmitted packets before they are de-allocated, and re-
uses them for new incoming packets. As shown in [19],
this architectural change significantly reduces the
computational weight of the memory management
operations, thus attaining a very high performance level
(i.e., about 150-175% of the maximum throughput of
standard kernels).
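The following is a conceptual sketch of the recycling idea, not the actual patch code: the Tx-clean routine parks freed skbuffs on a per-device list instead of returning them to the allocator, and the Rx refill routine draws from that list before falling back to a fresh allocation. The my_* names, the 256-entry bound and the omitted descriptor-reset details are illustrative assumptions.

    #include <linux/skbuff.h>
    #include <linux/netdevice.h>

    struct my_recycle_pool {
        struct net_device *netdev;
        struct sk_buff_head list;   /* spare skbuffs, protected by its own lock */
        unsigned int rx_buf_len;
    };

    static void my_recycle_init(struct my_recycle_pool *pool)
    {
        skb_queue_head_init(&pool->list);
    }

    /* Called by the Tx-clean routine instead of dev_kfree_skb() */
    static void my_recycle_tx_skb(struct my_recycle_pool *pool, struct sk_buff *skb)
    {
        if (skb_queue_len(&pool->list) < 256 &&
            !skb_shared(skb) && !skb_cloned(skb)) {
            skb_queue_head(&pool->list, skb);
            return;
        }
        dev_kfree_skb_any(skb);
    }

    /* Called by the Rx refill routine instead of netdev_alloc_skb() */
    static struct sk_buff *my_get_rx_skb(struct my_recycle_pool *pool)
    {
        struct sk_buff *skb = skb_dequeue(&pool->list);

        if (skb) {
            /* the real patch also rewinds the data/tail pointers and clears
             * the per-packet state left by the previous use (omitted here) */
            return skb;
        }
        return netdev_alloc_skb(pool->netdev, pool->rx_buf_len);
    }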
B. Performance Counter patch
To further analyze the OR’s internal behavior, we
decided to introduce a set of counters in the kernel source
code in order to understand how many times a certain
procedure is called, or how many packets are handled at a
time. Specifically, we introduced the following counters:
• IRQ: number of interrupt handlers generated by a
network card;
• Tx/Rx IRQ: number of tx/rx IRQ routines per
device;
• Tx/Rx SoftIRQ: number of tx/rx software IRQ
routines;
• Qdiscrun and Qdiscpkt: number of times the
output buffer (Qdisc) is served, and number of
packets served each time;
• Pollrun and Pollpkt: number of times the Rx ring
of a device is served, and number of packets
served each time;
• tx/rx clean: number of times the tx/rx procedures
of the driver are activated.
The values of all these parameters have been mapped
in the Linux “proc” file system.
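A minimal sketch of how such counters can be kept and exported through the proc file system is reported below. It is written against the seq_file/proc_create() interface used by many later 2.6/3.x/4.x kernels (the original patch used the 2.6.16 procfs calls, and kernels after 5.5 use struct proc_ops instead); the counter set and the or_counters entry name are illustrative, not the actual patch code.

    #include <linux/module.h>
    #include <linux/proc_fs.h>
    #include <linux/seq_file.h>
    #include <linux/fs.h>
    #include <linux/atomic.h>
    #include <linux/errno.h>

    /* Illustrative counters in the spirit of those listed above */
    static atomic_long_t rx_softirq_runs;
    static atomic_long_t poll_runs;
    static atomic_long_t poll_pkts;

    /* To be called from the instrumented kernel path, e.g. the device poll loop */
    static inline void count_poll(int served_pkts)
    {
        atomic_long_inc(&poll_runs);
        atomic_long_add(served_pkts, &poll_pkts);
    }

    static int or_counters_show(struct seq_file *m, void *v)
    {
        seq_printf(m, "rxsoftirq %ld\npollrun %ld\npollpkt %ld\n",
                   atomic_long_read(&rx_softirq_runs),
                   atomic_long_read(&poll_runs),
                   atomic_long_read(&poll_pkts));
        return 0;
    }

    static int or_counters_open(struct inode *inode, struct file *file)
    {
        return single_open(file, or_counters_show, NULL);
    }

    static const struct file_operations or_counters_fops = {
        .owner   = THIS_MODULE,
        .open    = or_counters_open,
        .read    = seq_read,
        .llseek  = seq_lseek,
        .release = single_release,
    };

    static int __init or_counters_init(void)
    {
        /* creates a read-only /proc/or_counters entry */
        return proc_create("or_counters", 0444, NULL, &or_counters_fops) ? 0 : -ENOMEM;
    }

    static void __exit or_counters_exit(void)
    {
        remove_proc_entry("or_counters", NULL);
    }

    module_init(or_counters_init);
    module_exit(or_counters_exit);
    MODULE_LICENSE("GPL");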
V. BENCHMARKING SCENARIO
To benchmark the OR forwarding performance, we
used a professional device, the Agilent N2X Router
Tester [27], which can be used to obtain throughput and
latency measurements with high reliability and accuracy
levels (i.e., the minimum guaranteed timestamp
resolution is 10 ns). Moreover, with two dual Gigabit
Ethernet cards and one 16-port Fast Ethernet card, we can
analyze the OR behavior with a large number of Fast and
Gigabit Ethernet interfaces.
To better support the performance analysis and to
identify the OR bottlenecks, we also performed some
internal measurements using specific software tools
(called profilers) placed inside the OR which trace the
percentage of CPU utilization for each software module
running on the node. The problem is that, with many of
these profilers, the considerable computational effort
required perturbs system performance, thus producing
results that are not very meaningful. We verified with many
different tests that one of the best is Oprofile [28], an
open source tool that continuously monitors system
dynamics with frequent and quite regular sampling of the
CPU hardware registers. Oprofile effectively evaluates the
CPU utilization of each software application and of each
single kernel function running in the system with very
low computational overhead.
With regard to the benchmarking scenario, we decided
to start by defining a reasonable set of test setups (with
increasing levels of complexity) and by applying to each
selected setup some of the tests defined in RFC 2544
[29]. In particular, we chose to perform these activities by
using both a core and an edge router configuration: the
former consists of a few high-speed (Gigabit Ethernet)
network interfaces, while the latter utilizes a high-speed
gateway interface and a large number of Fast Ethernet
cards which collect traffic from the access networks.
More specifically, we performed our tests by using the
following setups (see Figure 2):
1) Setup A: a single unidirectional flow crosses the OR
from one Gigabit port to another one;
2) Setup B: two full duplex flows cross the OR, each one
using a different pair of Gigabit ports;
3) Setup C: a full-meshed (and full-duplex) traffic matrix
applied on 4 Gigabit Ethernet ports;
4) Setup D: a full-meshed (and full-duplex) traffic matrix
applied on 1 Gigabit Ethernet port and 12 Fast Ethernet
interfaces.
In greater detail, each OR forwarding benchmarking
session essentially consists of three test sets, namely:
a) Throughput and latency: this test set is performed by
using constant bit rate traffic flows, consisting of fixed
size datagrams, to obtain: a) the maximum effective
throughput (in Kpackets/s and as a percentage with
respect to the theoretical value) versus different IP
datagram sizes; b) the average, maximum and minimum
latencies versus different IP datagram sizes;
b) Back-to-back: these tests are carried out by using burst
traffic flows and by changing both the burst dimension
(i.e., the number of packets comprising the burst)
and the datagram size. The main results for this kind of
test are: a) zero loss burst length versus different IP
datagram sizes; b) average, maximum and minimum
latencies versus different sizes of IP datagram
comprising the burst (“zero loss burst length” is the
maximum number of packets transmitted with
minimum inter-frame gaps that the System Under Test
(SUT) can handle without any loss).
c) Loss Rate: this kind of test is carried out by using CBR
traffic flows with different offered loads and IP
datagram sizes; the obtainable results can be
summarized in throughput versus both offered load and
IP datagram sizes.
Note that all these tests have been performed by using
different IP datagram sizes (i.e., 40, 64, 128, 256, 512,
1024 and 1500 bytes) and both CBR and burst traffic
flows.
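For reference, the theoretical values against which the throughput percentages are computed follow directly from the Ethernet framing overhead: 14 bytes of MAC header plus 4 of FCS (with a 64-byte minimum frame), and 8 bytes of preamble plus a 12-byte inter-frame gap on the wire. The short sketch below shows one way to compute this reference packet rate for Gigabit Ethernet; it is an illustration of the arithmetic, not code from the testbed.

    #include <stdio.h>

    /* Maximum packet rate on a 1 Gbit/s link for a given IP datagram size. */
    static double gige_max_pps(unsigned int ip_bytes)
    {
        unsigned int frame = ip_bytes + 18;      /* MAC header + FCS            */
        if (frame < 64)
            frame = 64;                          /* minimum Ethernet frame size */
        return 1e9 / ((frame + 20) * 8.0);       /* + preamble and IFG, in bits */
    }

    int main(void)
    {
        unsigned int sizes[] = { 40, 64, 128, 256, 512, 1024, 1500 };

        for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
            printf("%4u-byte IP datagram: %.0f pkt/s\n",
                   sizes[i], gige_max_pps(sizes[i]));
        return 0;
    }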
Figure 2. Benchmarking setups (A, B, C and D).
VI. NUMERICAL RESULTS
A selection of the experimental results is reported in
this section. In particular, the results of the benchmarking
setups shown in Figure 2 are reported in Subsections A,
B, C and D. In all such cases, the tests were performed
with the “old” hardware architecture described in Section
III (i.e., 32-bit Xeon and PCI-X bus).
With regard to the software architecture, we decided to
compare different 2.6.16 Linux kernel configurations and
a Click Modular Router. In particular, we used the
following versions of the 2.6.16 Linux kernel:
• single-processor 2.6.16 optimized kernel (a
version based on the standard one with single
processor support that includes the descriptor
recycling patch).
• dual-processor 2.6.16 standard kernel (a standard
NAPI kernel version similar to the previous one
but with SMP support);
Note that we decided not to take into account the SMP
versions of both the optimized Linux kernel and the Click
Modular Router, since they lack a minimum acceptable
level of stability.
Subsection E summarizes the results obtained in the
previous tests by showing the maximum performance for
each benchmarking setup. Finally, the performance of the
two hardware architectures described in Section III is
reported in Subsection F, in order to evaluate how HW
evolution affects forwarding performance.
A. Setup A numerical results
In the first benchmarking session, we performed the
RFC 2544 tests by using setup A (see Figure 2) with both
the single-processor 2.6.16 optimized kernel and Click.
As we can observe in Figs. 3, 4 and 5, which report the
numerical results of the throughput and latency tests,
neither software architecture can achieve the maximum
theoretical throughput in the presence of small datagram
sizes.
As demonstrated by the profiling measurements
reported in Fig. 6, obtained with the single processor
optimized 2.6.16 kernel and with 64-Byte datagrams,
this effect is clearly caused by the computational CPU
capacity, which limits the maximum forwarding rate of
the Linux kernel to about 700 Kpackets/s (40% of the
full Gigabit speed). In fact, even if the CPU idle time
goes to zero at 40% of full load, the CPU occupancies
of all the most important function sets appear to adapt
their contributions up to 700 Kpackets/s; after this point
their percentage contributions to CPU utilization remain
almost constant.
Figure 3. Throughput and latencies test, testbed A: effective throughput results for the single-processor 2.6.16 optimized kernel and Click.
Figure 4. Throughput and latencies test, testbed A: minimum and maximum latencies for both the single-processor 2.6.16 optimized kernel and Click.
More specifically, Fig. 6 shows that the computational
weight of memory management operations (like sk_buff
allocations and de-allocations) is substantially limited,
thanks to the descriptor recycling patch, to less than 25%.
In other works of ours, such as [19], we have shown that
this patch can save a CPU time share equal to about 20%.
Figure 5. Throughput and latencies test, testbed A: average latencies for both the single-processor 2.6.16 optimized kernel and Click.
Figure 6. Profiling results of the optimized Linux kernel obtained with testbed setup A (CPU utilization of the idle, scheduler, memory, IP processing, NAPI, Tx API, IRQ, Ethernet processing and Oprofile classes versus offered load).
Figure 7. Number of IRQ routines, polls and Rx SoftIRQs (second y-axis) for the Rx board with the skbuff recycling patched kernel, in the presence of an incoming traffic flow with only 1 IP source address.
Figure 8. Number of IRQ routines for the Tx board, and of Tx Ring cleanings performed by the Tx SoftIRQ (“func”) and by the Rx SoftIRQ (“wake”), for the skbuff recycling patched kernel, in the presence of an incoming traffic flow with only 1 IP source address. The second y-axis refers to “wake”.
The behavior of the IRQ management operations may
appear rather strange: in fact, their CPU utilization level
decreases as the input rate increases. There are mainly two
reasons for this behavior, both related to the packet
grouping effect in the Tx and Rx APIs: in particular, when
the ingress packet rate rises, NAPI tends to moderate the
IRQ rate by operating more like a polling than an
interrupt mechanism (and thus we have a first reduction in
the number of interrupts), while the Tx API, under the
same conditions, can better exploit the packet grouping
mechanism by sending more packets at a time (so that the
number of interrupts confirming successful transmissions
decreases). When the IRQ weight becomes zero, the OR
reaches the saturation point and operates like a polling
mechanism.
With regard to all the other operation sets (i.e., IP and
Ethernet processing, NAPI and the Tx API), their behavior is
clearly bound to the number of forwarded packets: the
weight of almost all the classes increases linearly up to
the saturation point, and subsequently remains more or
less constant.
This analysis is confirmed also by the performance
counters reported in Figs. 7 and 8, in which both the Tx
and Rx boards reduce their IRQ generation rates, while
the kernel passes from polling the Rx Ring twice per
received packet, to about 0.22 times. The number of Rx
SoftIRQ per received packet also decreases as offered
traffic load rises. As far as transmission dynamics are
concerned, Fig. 8 shows very low function occurrences: in
fact, the Tx IRQ routines decrease their occurrences up to
saturation, while the “wake” function, which represents
the number of times that the Tx Ring is cleaned and the
Qdisc buffer is served during an Rx SoftIRQ, exhibits a
mirror-like behavior; this occurs because, when the OR
reaches saturation, all the Tx functionalities are
activated when the Rx SoftIRQ starts.
Figure 9. Back-to-back test, testbed A: maximum zero loss burst lengths.
TABLE I. BACK-TO-BACK TEST, TESTBED A: LATENCY VALUES FOR BOTH THE SINGLE-PROCESSOR OPTIMIZED KERNEL AND CLICK.

Pkt Length   Optimized 2.6.16 kernel latency [us]     Click latency [us]
[Byte]       Min       Average    Max                 Min       Average    Max
40           16.16     960.08     1621.47             23.47     1165.53    1693.64
64           14.95     929.27     1463.02             23.52     1007.42    1580.74
128          16.04     469.9      925.93              19.34     54.88      53.45
256          16.01     51.65      58.84               22.49     52.62      47.67
512          18.95     54.96      61.51               20.72     62.92      59.95
768          23.35     100.76     164.56              22.85     116.61     155.59
1024         25.31     123.68     164.21              32.02     128.85     154.72
1280         28.6      143.43     166.46              24.77     151.81     178.45
1500         30.38     142.22     163.63              32.01     154.79     181.43
Similar considerations can also be made for the Click
modular router: the performance limitations in the
presence of short-sized datagrams continue to be caused
by a computational bottleneck, but the simple Click
packet receive API based on the polling mechanism
improves throughput performance by lowering the weight
of IRQ management and RxAPI functions. For the same
reasons, as shown in Figs. 4 and 5, the receive
mechanism included in Click introduces higher packet
latencies. According to the previous results, the back-to-
back tests, as reported in Fig. 9 and Table I, also
demonstrate that the optimized 2.6.16 Linux kernel and
Click continue to be affected by small-sized datagrams.
In fact, while for 256-Byte or larger datagrams the
measured zero-loss burst length is quite close to the
maximum burst length used in the tests carried out, it
appears to be heavily limited in the presence of 40-, 64-
and, for the Linux kernel only, 128-Byte packets. The
only exception is the 128-Byte case, in which the
computational bottleneck starts to affect NAPI while the
Click forwarding rate remains very close to the
theoretical one. The Linux kernel provides a better
support for burst traffic than Click. As a result, zero-loss
burst lengths are longer and associated latency times are
smaller. The loss rate test results are reported in Fig. 10.
Figure 10. Loss Rate test, testbed A: maximum throughput.
B. Setup B numerical results
In the second benchmarking session we analyzed the
performance achieved by the optimized single processor
Linux kernel, the SMP standard Linux kernel and the
Click modular router with testbed setup B (see Fig. 2).
Fig. 11 reports the maximum effective throughput in
terms of forwarded packets per second for a single router
interface. From this figure it is clear that, in the presence
of short-sized packets, the performance level of all three
software architectures is not close to the theoretical one.
More specifically, while the best throughput values are
achieved by Click, the SMP kernel seems to provide
better forwarding rates with respect to the optimized
kernel. In fact, as outlined in [25], if no explicit CPU-
interface bounds are present, the SMP kernel processes
the received packets (using, if possible, the same CPU for
the entire packet elaboration) and attempts to dynamically
distribute the computational load among the CPUs.
Figure 11. Throughput and latencies test, testbed setup B: effective throughput.
Thus, in this particular setup, the computational load
sharing tends to manage the two interfaces to which a
traffic pair is applied with a single fixed CPU, fully
processing each received packet on only one CPU and
thus avoiding any memory concurrency problems. Figs. 12
and 13 report the minimum, the average and the
maximum latency values according to different datagram
sizes obtained for all three software architectures. In
particular, we note that both Linux kernels, which in this
case provide very similar results, ensure minimum
latencies lower than Click. Instead, Click provides better
average and maximum latency values for short-sized
datagrams.
Figure 12. Throughput and latencies test, testbed B: minimum and maximum latencies.
Figure 13. Throughput and latencies test, testbed B: average latencies.
Figure 14. Back-to-back test, testbed B: maximum zero-loss burst lengths.
The back-to-back results, reported in Fig. 14 and Table
II, show that the performance level of all analyzed
architectures is nearly comparable in terms of zero-loss
burst length, while as far as latencies are concerned, the
Linux kernels provide better values. By analyzing Fig.
15, which reports the loss rate results, we note that the
performance values obtained with Click and the SMP
kernel are better, especially for small datagrams, than
those obtained with the optimized single processor
kernel. Moreover, Fig. 15 also shows that none of the
three OR software architectures achieves the full Gigabit/s
speed, even with large datagrams, with a maximum
forwarding rate of about 650 Mbps per interface. To
improve readability, in Fig. 15 and in all the following
loss rate tests we report only the OR behavior with the
minimum and maximum datagram sizes, since they
represent, respectively, the lower and upper performance
bounds.
TABLE II. BACK-TO-BACK TEST, TESTBED B: LATENCY VALUES FOR ALL THREE SOFTWARE ARCHITECTURES.

Pkt Length   Optimized 2.6.16 kernel [us]    Click [us]                      2.6.16 SMP kernel [us]
[Byte]       Min      Average   Max          Min      Average   Max          Min      Average   Max
40           122.3    2394.7    5029.4       27.9     5444.6    13268        70.48    1505      3483
64           124.8    1717.1    3320.4       30.1     2854.7    16349        89.24    1577      3474
128          212.1    1313.1    2874.2       46.9     2223.5    7390.0       67.02    1202      3047
256          139.6    998.8     2496.9       45.4     1314.9    5698.6       37.86    1005      2971
512          70.8     688.2     2088.7       21.2     574.5     2006.4       31.77    728       2085
768          55.7     585.3     2122.7       28.2     480.0     1736.5       35.01    587       1979
1024         71.0     373.8     1264.5       33.8     458.3     1603.3       37.19    361       1250
1280         58.6     427.7     1526.6       45.0     426.7     1475.5       38.58    482       1868
1500         66.4     485.6     1707.8       38.3     462.3     1524.3       36.68    478       1617
Figure 15. Loss Rate test, testbed B: maximum throughput versus both offered load and IP datagram sizes.
C. Setup C numerical results
In this benchmarking session, the three software
architectures were tested in the presence of four Gigabit
Ethernet interfaces with a full-meshed traffic matrix (Fig.
2). By analyzing the maximum effective throughput
values in Fig. 16, we note that Click appears to achieve a
better performance level with respect to the Linux kernels
while, unlike the previous case, the single processor
kernel provides maximum forwarding rates larger than
the SMP version with small packets.
In fact, the SMP kernel tries to share the computational
load of the incoming traffic among the CPUs, resulting in
an almost static assignment of each CPU to two specific
network interfaces. Since, in the presence of a full-
meshed traffic matrix, about half of the forwarded
packets cross the OR between two interfaces managed by
different CPUs, this decreases performance due to
memory concurrency problems [19]. Figs. 17 and 18
show the minimum, the maximum and the average
latency values obtained during this test set. Observing
these last results, we note how the SMP kernel, in the
presence of short-sized datagrams, continues to suffer from
memory concurrency problems, which lower OR
performance while considerably increasing both the
average and the maximum latency values.
By analyzing Fig. 19 and Table III, which report the
back-to-back test results, we note that all three OR
architectures achieve a similar zero-loss burst length,
while Click reaches very high average and maximum
latencies with respect to the single-processor and SMP
kernels when small packets are used.
The loss-rate results in Fig. 20 highlight the
performance decay of the SMP kernel, while a fairly
similar behavior is achieved by the other two
architectures. Moreover, as in the previous benchmarking
session, the maximum forwarding rate for each Gigabit
network interface is limited to about 600/650 Mbps.
Figure 16. Throughput and latencies test, setup C: effective throughput results.
Figure 17. Throughput and latencies test, testbed C: minimum and maximum latencies.
Figure 18. Throughput and latencies test, results for testbed C: average latencies.
Figure 19. Back-to-back test, testbed C: maximum zero loss burst lengths.
Figure 20. Loss Rate test, testbed C: maximum throughput versus both offered load and IP datagram sizes.
TABLE III. BACK-TO-BACK TEST, TESTBED C: LATENCY VALUES FOR THE SINGLE-PROCESSOR 2.6.16 OPTIMIZED KERNEL, THE CLICK MODULAR ROUTER AND THE SMP 2.6.16 KERNEL.

Pkt Length   Optimized 2.6.16 kernel [us]    Click [us]                       2.6.16 SMP kernel [us]
[Byte]       Min      Average   Max          Min      Average   Max           Min     Average   Max
40           92.1     2424.3    5040.3       73.5     6827.0    15804.9       74.8    3156.8    6164.2
64           131.3    1691.7    3285.1       176.7    6437.6    16651.1       66.5    2567.5    5140.6
128          98.1     1281.0    2865.9       60.2     3333.8    9482.1        77.1    1675.1    3161.6
256          19.6     915.9     2494.6       16.7     1388.9    3972.2        44.1    790.8     1702.8
512          23.9     666.9     2138.9       15.9     649.2     2119.3        23.2    815.8     2189.0
768          22.3     571.3     2079.7       22.5     543.7     2002.6        23.6    737.3     2193.0
1024         22.0     353.7     1232.2       36.3     382.2     1312.7        30.0    411.7     1276.8
1280         25.9     436.4     1525.4       34.6     443.0     1460          29.8    447.7     1469.5
1500         27.4     469.5     1696.7       36.7     457.5     1525.7        30.0    482.6     1719.6
D. Setup D numerical results
In the last benchmarking session, we applied setup D,
which provides a full-meshed traffic matrix between one
Gigabit Ethernet and 12 Fast Ethernet interfaces, to the
single-processor Linux kernel and to the SMP version.
We did not use Click in this last test since, at the moment
and for this software architecture, there are no drivers
with polling support for the D-Link interfaces.
By analyzing the throughput and latency results in
Figs. 21, 22 and 23, we note how, in the presence of a
high number of interfaces and a full-meshed traffic
matrix, the performance of the SMP kernel version drops
significantly: the maximum measured value for the
effective throughput is limited to about 2400 packets/s
and the corresponding latencies would appear to be much
higher with respect to those obtained with the single
processor kernel. However, the single processor kernel
also does not sustain the maximum theoretical rate: it
achieves 10% of full speed in the presence of short-sized
datagrams and about 75% for large datagram sizes.
Figure 21. Throughput and latencies test, setup D: effective throughput results for both Linux kernels.
Figure 22. Throughput and latencies test, results for testbed D: minimum and maximum latencies for both Linux kernels.
To better understand why the OR does not attain full-
speed with such a high number of interfaces, we decided
to perform several profiling tests. In particular, these tests
were carried out using two simple traffic matrices: the
first (Fig. 24) consists of 12 CBR flows that cross the OR
from the Fast Ethernet interfaces to the Gigabit one,
while the second (Fig. 25) still consists of 12 CBR flows
that cross the OR in the opposite direction (i.e., from the
Gigabit to the Fast Ethernet interfaces). These simple
traffic matrices allow us to separately analyze the
reception and transmission operations.
Figure 23. Throughput and latencies test, testbed D: average latencies for both the Linux kernels.
Figure 24. Profiling results obtained by using 12 CBR flows that cross the OR from the Fast Ethernet interfaces to the Gigabit one.
Figure 25. Profiling results obtained by using 12 CBR flows that cross the OR from a Gigabit interface to the 12 Fast Ethernet ones.
Thus, Figs. 24 and 25 report the profiling results
corresponding to the two traffic matrices. The internal
measurements shown in Fig. 24 highlight the fact that the
CPU is overloaded by the very high computational load
of the IRQ and Tx API management operations. This is
due to the fact that, during the transmission process, each
interface must signal the state of both the transmitted
packets and the transmission ring to the associated driver
instance through interrupts. More specifically, and again
referring to Fig. 24, we note that the IRQ CPU occupancy
decreases up to about 30% of the offered load and
afterwards, as the OR reaches saturation, remains constant
at about 50% of the computational resources. The initial
decreasing behavior is due to the fact that, as the offered
load increases, the OR can better exploit packet grouping
effects; the constant behavior is due to the fact that,
beyond saturation, the OR handles the same quantity of
packets. Referring to Fig. 25, we note how the presence
of traffic incoming from
many interfaces increases the computational weights of
both the IRQ and the memory management operations.
The decreasing behavior of the IRQ management
computational weight is not due, as in the previous case,
to the packet grouping effect, but to the typical NAPI
structure that passes from an IRQ-based mechanism to a
polling one. The high memory management values can be
explained quite simply by the fact that the recycling patch
is not operating with the Fast Ethernet driver.
Figure 26. Back-to-back test, testbed D: maximum zero loss burst lengths.
TABLE IV. BACK-TO-BACK TEST, TESTBED D: LATENCY VALUES FOR THE SINGLE-PROCESSOR 2.6.16 OPTIMIZED KERNEL AND THE SMP 2.6.16 KERNEL.

Pkt Length   Optimized 2.6.16 kernel [us]       2.6.16 SMP kernel [us]
[Byte]       Min       Average    Max           Min      Average    Max
40           285.05    1750.89    2921.23       37.04    1382.96    2347.89
64           215.5     1821.81    2892.13       38.07    1204.43    1963.7
128          216.15    1847.76    3032.22       34.52    1244.87    1984.4
256          61.83     1445.15    2353.12       30.61    2082.76    3586.6
512          57.73     2244.68    4333.44       32.97    908.19     1661.18
768          101.78    1981.5     3497.81       50.64    1007.27    1750.45
1024         108.17    1386.19    2394.4        52.14    819.98     1642.13
1280         73.15     1662.13    3029.54       58.11    981.92     1953.62
1500         109.24    1149.76    2250.78       70.92    869.36     1698.68
Figure 27. Loss Rate test, testbed D: maximum throughput versus both offered load and IP datagram sizes.
The back-to-back results, reported in Fig. 26 and Table
IV, show a very particular behavior: in fact, even if the
single processor kernel can achieve longer zero-loss burst
lengths than the SMP kernel, the latter appears to ensure
lower minimum, average and maximum latency values.
Finally, Fig. 27 reports the loss rate test results, which,
consistently with the previous results, show that the single
processor kernel can sustain a higher forwarding
throughput than the SMP version.
E. Maximum Performance
In order to effectively summarize the proposed
performance results, we report in Figs. 28 and 29 the
aggregated maximum values for each testbed of,
respectively, the effective throughput and the maximum
throughput (obtained in the loss rate test); here
“aggregated” refers to the sum of the forwarding rates of
all the OR network interfaces.
By analyzing Fig. 28, we note that, in the presence of
more network interfaces, the OR achieves aggregate values
higher than 1 Gbps and, in particular, that it reaches maximum
values equal to 1.6 Gbps with testbed D. We can also
point out that the maximum effective throughput of
setups B and C are almost the same: in fact, these very
similar testbeds have only one difference (i.e., the traffic
matrix), which has an effect only on the performance
level of the SMP kernel, but practically no effect on the
behaviors of the single processor kernel and Click.
Figure 28. Maximum effective throughput values obtained in the implemented testbeds.
Figure 29. Maximum throughput values obtained in the implemented testbeds.
The aggregated maximum throughput values, as
reported in Fig. 29, are obviously higher than those in
Fig. 28; they highlight the fact that the maximum
forwarding rate sustainable by the OR, about 2.5 Gbps, is
achieved in setups B and C. Moreover, while in setup
A the maximum theoretical rate is achieved for packet
sizes larger than 128 Bytes, in all the other setups the
maximum throughput values are not much higher than
half the theoretical ones.
F. Hardware Architecture Impact
In the final benchmarking session, we decided to
compare the performance of the two hardware
architectures introduced in Section III, which represent
the current state-of-the-art of server architectures and the
state-of-the-art of four years ago. The benchmarking scenario is the one
used in testbed A (with reference to Fig. 2), while the
selected software architecture is the single processor
optimized kernel.
The purpose of these tests was clearly to understand
how the continuous evolution of COTS hardware affects
overall OR performance. Therefore,
Figs. 30, 31 and 32 report the results of effective
throughput tests for the “old” architecture (i.e., 32-bit
Xeon) and the “new” one (i.e., 64-bit Xeon) equipped
with both PCI-X and PCI-Express busses. The loss rate
results are shown in Fig. 33.
Figure 30. Throughput and latencies test, setup A with the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express busses: effective throughput results for the single processor optimized kernel. Note that the x-axis is in logarithmic scale.
Figure 31. Throughput and latencies test, results for testbed A with the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express busses: minimum and maximum latencies for the single processor optimized kernel.
Figure 32. Throughput and latencies test, results for testbed A with the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express busses: average latencies for the single processor optimized kernel.
By observing the comparisons in Figs. 30 and 31, it is
clear that the new architecture generally provides better
performance values than the old one: more specifically,
while using the new architecture with the PCI-X bus
slightly improves performance, when the PCI-Express is
used the OR effective throughput is an impressive 88%
with 40 Byte-sized packets, achieving the maximum
theoretical rate for all other packet sizes. All this is
clearly due to the high efficiency of the PCI Express bus.
In fact, with this I/O bus, DMA transfers occur with a
very low control overhead (since it behaves like a
dedicated point-to-point link), which probably leads to
lighter accesses to the RAM and, consequently, to
benefits in terms of memory accesses by the CPU. In
other words, this high performance enhancement is
caused by a more effective memory access of the CPU,
thanks to the features of the PCI Express DMA.
Figure 33. Loss Rate test, testbed A for the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express busses: maximum throughput versus both offered load and IP datagram sizes.
VII. CONCLUSIONS
In this contribution we report the results of the in-depth
optimization and testing carried out on a PC Open Router
architecture based on Linux software and, more
specifically, on the Linux kernel. We have
presented a performance evaluation in some common
working environments of three different data plane
architectures, including the optimized Linux 2.6 kernel,
the Click Modular Router and the SMP Linux 2.6 kernel,
with external (throughput and latencies) and internal
(profiling) measurements. External measurements were
performed in an RFC2544 [29] compliant manner by
using professional devices [27]. Two hardware
architectures were tested and compared for the purpose of
understanding how the evolution in COTS hardware may
affect performance.
The experimental results show that the optimized
version of the Linux kernel, with suitable hardware
architectures, can achieve performance levels high enough
to effectively support several Gigabit interfaces, attaining
aggregated forwarding rates of about 2.5 Gbps with
relatively low latencies.
REFERENCES
[1] Building Open Router Architectures Based On Router
Aggregation project (BORA-BORA), homepage at
http://www.tlc.polito.it/borabora.
[2] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F.
Kaashoek, "The Click modular router", ACM Transactions
on Computer Systems 18(3), Aug. 2000, pp. 263-297.
[3] Zebra, http://www.zebra.org/.
[4] M. Handley, O. Hodson, E. Kohler, “XORP: an open
platform for network research”, ACM SIGCOMM
Computer Communication Review, Vol. 33 Issue 1, Jan.
2003, pp. 53-57.
[5] S. Radhakrishnan, “Linux - Advanced networking
overview”, http://qos.ittc.ku.edu/howto.pdf.
[6] M. Rio et al., “A map of the networking code in Linux
kernel 2.4.20”, Technical Report DataTAG-2004-1,
FP5/IST DataTAG Project, Mar. 2004.
[7] FreeBSD, http://www.freebsd.org.
[8] B. Chen and R. Morris, "Flexible Control of Parallelism in
a Multiprocessor PC Router", Proc. of the 2001 USENIX
Annual Technical Conference (USENIX '01), Boston,
USA, June 2001.
[9] C. Duret, F. Rischette, J. Lattmann, V. Laspreses, P. Van
Heuven, S. Van den Berghe, P. Demeester, “High Router
Flexibility and Performance by Combining Dedicated
Lookup Hardware (IFT), off the Shelf Switches and Linux”,
Proc. of the 2nd International IFIP-TC6 Networking
Conference, Pisa, Italy, May 2002, LNCS 2345, Ed E.
Gregori et al, Springer-Verlag 2002, pp. 1117-1122.
[10] A. Barczyk, A. Carbone, J.P. Dufey, D. Galli, B. Jost, U.
Marconi, N. Neufeld, G. Peco, V. Vagnoni, “Reliability of
datagram transmission on Gigabit Ethernet at full link
load”, LHCb technical note, LHCB 2004-030 DAQ, Mar.
2004.
[11] P. Gray, A. Betz, “Performance Evaluation of Copper-
Based Gigabit Ethernet Interfaces”, Proc. of the 27th Annual
IEEE Conference on Local Computer Networks (LCN'02),
Tampa, Florida, November 2002, pp. 679-690.
[12] A. Bianco, R. Birke, D. Bolognesi, J. M. Finochietto, G.
Galante, M. Mellia, M.L.N.P.P. Prashant, Fabio Neri,
“Click vs. Linux: Two Efficient Open-Source IP Network
Stacks for Software Routers”, Proc. of the 2005 IEEE
Workshop on High Performance Switching and Routing
(HPSR 2005), Hong Kong, May 2005, pp. 18-23.
[13] A. Bianco, J. M. Finochietto, G. Galante, M. Mellia, F.
Neri, “Open-Source PC-Based Software Routers: a Viable
Approach to High-Performance Packet Switching”, Proc.
of the 3rd International Workshop on QoS in Multiservice
IP Networks (QOS-IP 2005), Catania, Italy, Feb. 2005, pp.
353-366
[14] A. Bianco, R. Birke, G. Botto, M. Chiaberge, J.
Finochietto, G. Galante, M. Mellia, F. Neri, M. Petracca,
“Boosting the Performance of PC-based Software Routers
with FPGA-enhanced Network Interface Cards”, Proc. of
the 2006 IEEE Workshop on High Performance Switching
and Routing (HPSR 2006), Poznan, Poland, June 2006, pp.
121-126.
[15] A. Grover, C. Leech, “Accelerating Network Receive
Processing: Intel I/O Acceleration Technology”, Proc. of
the 2005 Linux Symposium, Ottawa, Ontario, Canada, Jul.
2005, vol. 1, pp. 281-288.
[16] R. McIlroy, J. Sventek, “Resource Virtualization of
Network Routers”, Proc. of the 2006 IEEE Workshop on
High Performance Switching and Routing (HPSR 2006),
Poznan, Poland, June 2006, pp. 15-20.
[17] R. Bolla, R. Bruschi, “The IP Lookup Mechanism in a
Linux Software Router: Performance Evaluation and
Optimizations”, Proc. of the 2007 IEEE Workshop on High
Performance Switching and Routing (HPSR 2007), New
York, USA.
[18] K. Wehrle, F. Pählke, H. Ritter, D. Müller, M. Bechler,
“The Linux Networking Architecture: Design and
Implementation of Network Protocols in the Linux Kernel”,
Pearson Prentice Hall, Upper Saddle River, NJ, USA,
2004.
[19] R. Bolla, R. Bruschi, “A high-end Linux based Open
Router for IP QoS networks: tuning and performance
analysis with internal (profiling) and external
measurement tools of the packet forwarding capabilities”,
Proc. of the 3rd International Workshop on Internet
Performance, Simulation, Monitoring and Measurements
(IPS MoMe 2005), Warsaw, Poland, Mar. 2005.
[20] J. H. Salim, R. Olsson, A. Kuznetsov, “Beyond Softnet”,
Proc. of the 5th Annual Linux Showcase & Conference,
Nov. 2001, Oakland, California, USA.
[21] A. Cox, “Network Buffers and Memory Management”, Linux Journal, Oct. 1996, http://www2.linuxjournal.com/lj-issues/issue30/1312.html.
[22] The Intel PRO 1000 XT Server Adapter, http://www.intel.com/network/connectivity/products/pro1000xt.htm.
[23] The D-Link DFE-580TX quad network adapter, http://support.dlink.com/products/view.asp?productid=DFE%2D580TX#.
[24] J. A. Ronciak, J. Brandeburg, G. Venkatesan, M. Williams,
“Networking Driver Performance and Measurement –
e1000 A Case Study”, Proc. of the 2005 Linux Symposium,
Ottawa, Ontario, Canada, July 2005, vol. 2, pp. 133-140.
[25] R. Bolla, R. Bruschi, “IP forwarding Performance
Analysis in presence of Control Plane Functionalities in a
PC-based Open Router”, Proc. of the 2005 Tyrrhenian
International Workshop on Digital Communications
(TIWDC 2005), Sorrento, Italy, June 2005, and in F.
Davoli, S. Palazzo, S. Zappatore, Eds., “Distributed
Cooperative Laboratories: Networking, Instrumentation,
and Measurements”, Springer, Norwell, MA, 2006, pp.
143-158.
[26] The descriptor recycling patch, ftp://robur.slu.se/pub/Linux/net-development/skb_recycling/.
[27] The Agilent N2X Router Tester, http://advanced.comms.agilent.com/n2x/products/.
[28] Oprofile, http://oprofile.sourceforge.net/news/.
[29] Request for Comments 2544 (RFC 2544), http://www.faqs.org/rfcs/rfc2544.html.
Raffaele Bolla was born in Savona (Italy) in 1963. He
received his Master of Science degree in Electronic Engineering
from the University of Genoa in 1989 and his Ph.D. degree in
Telecommunications at the Department of Communications,
Computer and Systems Science (DIST) in 1994 from the same
university. From 1996 to 2004 he worked as a researcher at
DIST where, since 2004, he has been an Associate Professor,
and teaches a course in Telecommunication Networks and
Telematics. His current research interests focus on resource
allocation, Call Admission Control and routing in Multi-service
IP networks, Multiple Access Control, resource allocation and
routing in both cellular and ad hoc wireless networks. He has
authored or coauthored over 100 scientific publications in
international journals and conference proceedings. He has been
the Principal Investigator in many projects in the
Telecommunication Networks field.
Roberto Bruschi was born in Genoa (Italy) in 1977. He
received his Master of Science degree in Telecommunication
Engineering in 2002 from the University of Genoa and his
Ph.D. in Electronic Engineering in 2006 from the same
university. He is presently working with the Telematics and
Telecommunication Networks Lab (TNT) in the Department of
Communication, Computer and System Sciences (DIST) at the
University of Genoa. He is also a member of CNIT, the Italian
inter-university Consortium for Telecommunications. Roberto is
an active member of various Italian research projects in the
networking area, such as BORA-BORA, FAMOUS, TANGO
and EURO. He has co-authored over 10 papers in international
conferences and journals. His main interests include Linux
Software Router, Network processors, TCP and network
modeling, VPN design, P2P modeling, bandwidth allocation,
admission control and routing in multiservice QoS IP/MPLS
networks.