Diploma Thesis
Netflow analysis of
SAP R/3 traffic
in an enterprise environment
Jan Bankstahl
Prepared under the supervision of PD Dr. Christoph Weidenbach at the
Max-Planck-Institut für Informatik in Saarbrücken
Merzig, den xx.xx.xxxx
I declare in lieu of oath that I have written this thesis independently and have
used no sources other than those listed in the bibliography.
Jan Bankstahl
Merzig, den xx.xx.xxxx
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Structure of the Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 A word about data privacy . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 3
2.1 SAP R/3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Application Architecture . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Infrastructure Architecture . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 SAP GUI Protocol Characteristics . . . . . . . . . . . . . . . . . . 8
2.2 Traffic Measurement Techniques . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Packet Level Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 SNMP and RMON . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 NetFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 Application Logfiles . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.5 Assessment of Traffic Measurement Techniques . . . . . . . . . . . 13
2.3 NetFlow in detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Flow definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 Flow Expiration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 NetFlow and TCP Sessions . . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 NetFlow on Routers . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.5 NetFlow on Switches . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.6 Aggregation and Sampling . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.7 NetFlow Versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Data Collection 21
3.1 Network Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 SAP R/3 Server Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 NetFlow Data Export . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 NetFlow Collector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4 Data Preparation 27
4.1 Data Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.1 Prefiltering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.2 Server Identification . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.3 User Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Reconstruction of TCP Sessions . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.1 Flow Defragmentation . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.2 TCP Session Reassembly . . . . . . . . . . . . . . . . . . . . . . . 37
5 Data Analysis 38
5.1 SAP Message Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1.1 Session duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.2 Data volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1.3 Packet size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 SAPGUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.1 Session duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2.2 Time of day profile . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.3 Day of week profile . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.4 Data volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.5 Traffic symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2.6 Packet size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.7 Bandwidth usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.8 Load distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.9 User ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 Printer/LPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.1 Session duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.2 Data volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3.3 Packet size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6 Conclusions, Open Problems, and Outlook 56
A Patch for perl module ’Cflow’ 58
B Source code for ’create-filter’ 60
C Source code for ’defragment’ 63
D Source code for ’reassemble’ 64
Bibliography 66
Acknowledgments 69
List of Figures
2.1 SAP R/3 modules (source SAP AG) . . . . . . . . . . . . . . . . . . . . . 4
2.2 SAP Three-Tier Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Data flow through SAP processes . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Connection process of the SAP GUI to an SAP system (source SAP AG) 9
2.5 SPAN port operation (source Cisco Systems) . . . . . . . . . . . . . . . . 10
2.6 NetFlow enabled network device . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 SAP transaction ST03N (source SAP AG) . . . . . . . . . . . . . . . . . . 13
2.8 The packet train model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Flow timeout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Network Schematic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 SAP R/3 Server infrastructure . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Serverfarm switch configuration . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Verifying the NDE operation on switches . . . . . . . . . . . . . . . . . . 24
3.5 Raw data collected with flow-capture . . . . . . . . . . . . . . . . . . . . . 26
4.1 High-level process of data filtering . . . . . . . . . . . . . . . . . . . . . . 28
4.2 filter-definition tag for prefiltering . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Process of server identification . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Excerpt from negated filter . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Filter for interactive user traffic . . . . . . . . . . . . . . . . . . . . . . . . 33
4.6 Filtered data volume (#flows, #packets, bytes) . . . . . . . . . . . . . . . 34
4.7 Histogram of flow duration and CCDF of distance between fragments . . 36
4.8 Histogram of flow duration (defragmented with t=215 & t=3600s) . . . 37
5.1 CCDF and density plot of flow duration Message Server (log scales) . . . 39
5.2 Volume of Message Server traffic . . . . . . . . . . . . . . . . . . . . . . . 40
5.3 Average packet sizes of Message Server sessions . . . . . . . . . . . . . . . 41
5.4 Density plots of Message Server session length (pkts) . . . . . . . . . . . . 41
5.5 CCDF and density plot of flow duration SAPGUI (log scales) . . . . . . . 43
5.6 Time of day distribution of SAPGUI sessions . . . . . . . . . . . . . . . . 44
5.7 Histograms of distribution over week-days . . . . . . . . . . . . . . . . . . 45
5.8 Time of day distribution of SAPGUI sessions on different weekdays . . . . 45
5.9 CCDF plot of volume distribution . . . . . . . . . . . . . . . . . . . . . . 46
5.10 Ratio of client-to-server and server-to-client traffic . . . . . . . . . . . . . 47
5.11 Density plot of average packet sizes . . . . . . . . . . . . . . . . . . . . . . 48
5.12 Bandwidth usage of SAPGUI sessions . . . . . . . . . . . . . . . . . . . . 49
5.13 Load distribution over all servers . . . . . . . . . . . . . . . . . . . . . . . 50
5.14 Subnet ranking by session frequency (log scaled) . . . . . . . . . . . . . . 51
5.15 CCDF and density plot of flow duration printers (log scales) . . . . . . . . 53
5.16 Volume of printer traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.17 Average packet sizes of print sessions . . . . . . . . . . . . . . . . . . . . . 55
List of Tables
2.1 SAP TCP Ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Comparison between SAP GUI and HTTP characteristics . . . . . . . . . 8
2.3 RMON Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Overview of Traffic Measurement Techniques . . . . . . . . . . . . . . . . 13
2.5 NetFlow v1 data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 OSU flow-tools collection . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Results from Prefiltering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Unidentified traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Filtered traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Distribution of session volume . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Distribution of session duration . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 Distribution of in-out ratio . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 Distribution of average packet sizes . . . . . . . . . . . . . . . . . . . . . . 48
5.5 Distribution of data volume on printer sessions (in bytes) . . . . . . . . . 54
Chapter 1
Introduction
1.1 Motivation
Communication and IP networks have become a key part of today's IT infrastructures.
The Internet Protocol (IP) is used not only in the Internet but also in enterprise
network environments. The availability of large private networks has enabled new
applications.
On the Internet, the applications web, email, and peer-to-peer currently dominate.
Each of these three applications has special characteristics that lead to different
workload models.
Enterprise environments differ significantly from the public Internet. Besides web and
email1, there are central file servers and business applications like SAP R/3 and
PeopleSoft. Furthermore, there are still several custom mainframe programs that are
accessed via the TN3270 protocol.
Workload models of each individual application are essential for performance analysis
and troubleshooting. They are also indispensable input for network capacity planning
and network simulation.
SAP R/3 is the leading enterprise resource planning (ERP) software, with a market
share of 43%.
1in contrast to the Internet, email systems in companies are commonly based on Lotus Notes or
Microsoft Exchange
1.2 Related Work
Although several studies about traffic models in the Internet have been published over
the past years ([UB01a], [WPT98], or [CB96]), enterprise networks are rarely analyzed.
The only work found that characterizes the network behavior of SAP R/3 is [Jan05],
but that paper takes a synthetic approach to describing the network load of SAP R/3.
1.3 Structure of the Work
This thesis is structured into six chapters. After this Introduction, the Background
chapter introduces SAP R/3 and its architecture, different traffic measurement methods
(in particular NetFlow), and the statistical approaches that have been used.
The chapter Data Collection describes the environment in which the data was
collected, as well as details of the setup. Data Preparation explains the filtering
of the collected data and the preparation for the statistical analysis, which is described
in the chapter Data Analysis. The last chapter, Conclusions, summarizes the results,
points out open questions, and gives an outlook on further work.
1.4 A word about data privacy
Traffic data analysis inherently requires that data privacy be kept in mind.
There are two major aspects of data privacy:
• individual-related data: in Germany, all data that relates to individuals is
subject to special regulation; there are data privacy laws and works council
agreements.
• confidential data: in an enterprise environment, most information must not be
made public. The information could be used by competitors to gain an advantage,
or by criminals who exploit specific details about the infrastructure for intrusion
or denial-of-service attempts.
To avoid conflicts with these requirements, this work does not relate any collected data
to individuals, but evaluates it only on a per-subnet basis2. Furthermore, all IP
addresses, system names, and the like are replaced, in order not to release information
about the enterprise infrastructure.
2 in particular, the traffic data is not correlated with DHCP logs or usernames
Chapter 2
Background
2.1 SAP R/3
During the last decades, virtually all mid-size and large companies have introduced
so-called Enterprise Resource Planning (ERP) applications. These ERP applications
support back-office operations in areas such as financial accounting, controlling,
logistics, and human resources.
The German software company SAP is one of the market leaders for integrated business
administration systems. Its product SAP R/3 is one of the most popular ERP
applications. The SAP R/3 ERP solution consists of multiple core modules [MH00]:
• Financials: Financial Accounting (FI), Controlling (CO), Asset Management (AM),
Project System (PS)
• Logistics: Plant Maintenance (PM), Material Management (MM), Quality Man-
agement (QM), Production Planning (PP), Sales and Distribution (SD)
• Human Resources (HR)
• Industry Solutions (IS)
• Workflow (WF)
SAP R/3 integrates business processes and enables the exchange of data between various
business units or divisions of a company. Its standardized approach requires customizing
R/3 to meet specific business needs; therefore, two R/3 implementations at two different
companies will never look the same. However, a customized R/3 installation should be
able to follow R/3 updates and new releases that address new tax or legal regulations
or new functionality, without additional development effort.
Figure 2.1: SAP R/3 modules (source SAP AG)
Around these traditional ERP modules, the mySAP Business Suite emerged, which is based
on SAP's NetWeaver technology. mySAP extends the software portfolio to also cover
business aspects that do not belong to R/3. The mySAP suite includes the following
components:
• Enterprise Resource Planning (ERP)
• Customer Relationship Management (CRM)
• Supply Chain Management (SCM)
• Supplier Relationship Management (SRM)
• Product Lifecycle Management (PLM)
Although these modules can run stand-alone, they still integrate tightly with SAP
R/3, which is one of their primary advantages.
With all these modules, SAP aims to provide a set of applications for mid-size companies
up to large multinational groups to control all areas of their operation. Today, SAP has
a market share of 43% with R/3 in the ERP software market [Bed05], and R/3 is used by
many Fortune 500 and other large companies such as American Airlines, BASF, Chevron,
IBM, DaimlerChrysler, Microsoft, and EDS.
2.1.1 Application Architecture
SAP R/3 is built as an online transaction processing (OLTP) system. Such an application
modifies data in groups of logical operations that succeed or fail together, so-called
transactions. A transaction must fulfill the ACID criteria [Wik05a]: Atomicity, Consis-
tency, Isolation, Durability. OLTP applications interact with the user in real time (in
contrast to reporting applications).
The architecture of SAP R/3 is client-server based and organized in a 3-tier model:
• Presentation Layer: The SAP GUI is a thin client that serves as the user interface.
There are multiple versions of SAP GUI: a Windows and a Java executable, and an
HTTP version using the ITS (Internet Transaction Server)1. ITS is not considered
in this work. The Windows and Java SAP GUIs are executed on a desktop PC.
• Application Layer: It holds the business-specific logic and has its own proprietary
programming language, ABAP (Advanced Business Application Programming).
ABAP programs are executed by the SAP R/3 kernel, which is written in C and
decouples the business logic from platform-specific implementations and from the
Database Layer. It provides OpenSQL as the language to interact with the database
and presents the data in a so-called data dictionary, which is mapped to the tables
of the relational database. The Application Layer may run on Windows or Unix.
• Database Layer: SAP R/3 uses a relational database. The database itself is not part
of R/3. SAP supports different commercial databases, such as Oracle, Microsoft SQL
Server, IBM DB2, or MaxDB (which is licensed under the GPL).
Figure 2.2: SAP Three-Tier Architecture
1 in newer versions, ITS may also run as a kernel function rather than on a dedicated server
One main advantage of this 3-tier architecture is scalability. As explained in more detail
below, the application layer can be distributed over multiple servers.
The SAP kernel itself is structured into several processes (see Figure 2.3). After the
user starts a request or update, the SAP GUI transmits the request to the dispatcher
process (1). There is one dispatcher process on every R/3 server; it coordinates all
processes and interprocess communication. It holds state for all established SAP GUI
sessions, queues incoming requests (2), and then selects, depending on load, the dialog
work process2 that shall fulfill the request (3). The dialog work process holds state
during dialog masks (in the Dynpro processor) and processes the business logic (in the
ABAP processor). It interacts with the database (4) and with other processes via the
dispatcher to lock datasets (using the Enqueue process3), update the database
(synchronous updates are done by the dialog work process itself; for asynchronous
updates, the Update work process is used), or print reports (Spool work process).
External systems may interact with the Remote Function Call (RFC) interface of the
Gateway process.
It is important to note that, in contrast to web servers, there is no 1:1 relationship
between user sessions and Unix processes. In SAP R/3, a single dispatcher process serves
multiple users simultaneously. Furthermore, the activity of one user may be scattered
over multiple dialog work processes (although this rarely happens).
Figure 2.3: Data flow through SAP processes
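The queueing and work-process selection described above can be sketched in a few lines. This is an illustrative model only, not SAP's implementation; the least-loaded selection rule and the class layout are assumptions:

```python
from collections import deque

class Dispatcher:
    """Toy model of the R/3 dispatcher: queue incoming SAP GUI
    requests and hand each one to the least-loaded dialog work
    process (illustrative only, not SAP's implementation)."""

    def __init__(self, num_work_processes):
        self.queue = deque()                  # step (2): incoming request queue
        self.load = [0] * num_work_processes  # open requests per dialog work process

    def submit(self, request):
        """Step (1): a request arrives from the SAP GUI."""
        self.queue.append(request)

    def assign(self):
        """Step (3): dequeue and pick the least-loaded dialog work process."""
        request = self.queue.popleft()
        wp = min(range(len(self.load)), key=self.load.__getitem__)
        self.load[wp] += 1
        return wp, request

d = Dispatcher(num_work_processes=3)
d.submit("display order 4711")
wp, req = d.assign()          # first request goes to work process 0
```

The key point the sketch captures is the n:m relationship: many SAP GUI sessions share one dispatcher, which fans requests out over a small, fixed pool of dialog work processes.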
Beyond online user activity there are also batch processes that perform background jobs.
An SAP R/3 installation may comprise multiple application servers. One distinguished
server, called the Central Instance (CI), runs the Message Server process. The message
server is used by the SAP GUI to assign a session to an application server for load-
balancing purposes. Furthermore, the message server handles interprocess communication
if the processes run on different servers4; this is used extensively by the enqueue
process to ensure dataset locking across all application servers.
2 there are multiple instances of the dialog work process running at the same time
3 SAP has its own engine to ensure transaction integrity for Logical Units of Work (LUW), which
resides on a higher level than the database engine's transactions
Although it is possible to run SAP R/3 in a geographically distributed manner with
local application servers, most installations have all servers centralized (including the
central database server), running in the same server farm LAN environment.
2.1.2 Infrastructure Architecture
As just mentioned, SAP R/3 installations are usually implemented within a server farm
LAN. As the application servers interact heavily with the database server, the short
distance, and the resulting low network latency, enhances overall system performance.
SAP recommends implementing a dedicated server LAN for all backend communication,
in addition to the access LAN where users log on to the system. This internal LAN is
used to access the database and for server-to-server interaction.
Within the application architecture there are some single points of failure: the entire
R/3 system becomes unavailable if the CI server or the central database server goes
down. As the ERP system is business-critical in most companies, they require high
availability (HA) for their R/3 systems. To avoid extended downtimes in the event of a
failure, HA cluster products such as HP ServiceGuard, Sun/Veritas Cluster, or Microsoft
Windows Cluster are used. They group a number of servers (usually two) into a cluster.
Through monitoring services, the cluster nodes check each other's 'health'. If a
cluster node fails, another one stands in.
ERP systems must constantly adapt to new business requirements. It is therefore best
practice to have separate environments for development, testing, and quality
assurance. Large companies may also have training systems, as well as additional
instances for development and testing (multiple instances may run on the same hardware
as distinct logical instances).
The typical SAP R/3 installation includes, besides the production system, at least a
development system and a test system [MH00]. For the production system, it is highly
recommended to run the database on a dedicated server. Often the database and the
CI run together in a cluster: during normal operation, the database runs on one
server and the CI on the other. If one node fails, the other carries on with both.
In order not to interfere with production, development and testing instances commonly
do not run on the production server hardware, but still within the same server farm
LAN. In addition, testing and development instances usually do not have HA
4 the message server is only used for internal communication within a single R/3 installation
distributed over multiple servers; external systems may use Remote Function Calls (RFC), Electronic
Data Interchange (EDI), the Business API (BAPI), or other interfaces
requirements and thus are not installed on clusters for cost reasons.
To distinguish between the different instances, an instance number is used. This instance
number also determines the TCP port numbers assigned to the individual processes (see
Table 2.1). Thus the dispatcher process of instance #10 would listen on TCP port 3210.
TCP Port                   Process/Service
3200 + instance number     Dispatcher process
3300 + instance number     Gateway process
3600 + instance number     Message Server
Table 2.1: SAP TCP Ports
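The port scheme of Table 2.1 can be written down directly as a small helper; the dictionary keys below are informal names, not official SAP service identifiers:

```python
def sap_ports(instance_number):
    """Derive the well-known SAP R/3 TCP ports from the two-digit
    instance number, following the scheme in Table 2.1."""
    if not 0 <= instance_number <= 99:
        raise ValueError("instance number must be in the range 00..99")
    return {
        "dispatcher":     3200 + instance_number,   # SAP GUI connects here
        "gateway":        3300 + instance_number,   # RFC traffic
        "message_server": 3600 + instance_number,   # logon load balancing
    }

# Instance #10: dispatcher on 3210, gateway on 3310, message server on 3610
ports = sap_ports(10)
```

This mapping is also what the later filtering steps rely on when classifying flows by destination port.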
2.1.3 SAP GUI Protocol Characteristics
For the presentation layer, the SAP GUI executable runs on the user's desktop PC5. The
SAP GUI protocol is dialog based, transaction centric, and requires extensive state on
the application server. This differs greatly from the most popular 'user interface'
on the Internet, the web browser and its HTTP protocol (see Table 2.2).
                 SAP GUI                         HTTP
State            stateful and dialog based;      most commonly, state is held on
                 state is held on the            the client; stateless web
                 application server              server6
IP connection    single TCP connection           multiple TCP connections to
                                                 transmit all components of a
                                                 web page independently
TCP connection   from logon until logoff         mostly short connections just
duration                                         to transfer data7
Table 2.2: Comparison between SAP GUI and HTTP characteristics
If a message server is used for load balancing, the SAP GUI starts by making a TCP
connection to the message server to determine the most suitable application server.
Subsequently, the SAP GUI establishes a TCP connection to the dispatcher process on
that application server (see also Figure 2.4) [SAP02]. All further activities during the
session use the same TCP connection.
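The load-balancing step can be illustrated as follows; the actual quality metric the message server computes is not documented here, so lowest-reported-load selection and the server names are assumptions:

```python
def pick_application_server(server_loads):
    """Mimic the message server's logon load balancing: return the
    application server with the lowest reported load.  The load
    metric is a stand-in for SAP's real quality value, which is
    not documented in this work."""
    return min(server_loads, key=server_loads.get)

# Hypothetical load reports from three application servers:
servers = {"appsrv1": 0.72, "appsrv2": 0.31, "appsrv3": 0.55}
best = pick_application_server(servers)   # -> "appsrv2"
# The SAP GUI would now open its single long-lived TCP connection
# to the dispatcher on the chosen server.
```

From a traffic-measurement point of view, this two-step logon is why a short message-server flow typically precedes each long-lived SAP GUI session.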
When the user's SAP GUI session ends, the SAP system needs to release all resources
that were used by this session. It is important that the system releases data records
that have been processed by the session and are possibly locked. If there are incomplete
5 no differentiation is made between the Java and Windows versions of SAP GUI, since both
show the same behavior regarding network protocols and interaction with the application server
6 some web applications may also hold state on the web server or in the backend
7 HTTP/1.1 introduced persistent TCP connections for efficiency improvements
Figure 2.4: Connection process of the SAP GUI to an SAP system (source SAP AG)
transactions in progress when the session ends, they will be rolled back in order to ensure
a consistent state of the database. If the SAP GUI process is terminated on the front-end,
the TCP connection is closed down and the system is informed automatically.
The situation is different if the front-end computer is turned off, fails, or the network
connection is broken (for instance by a WAN failure). Since the SAP system normally
waits for queries from the SAP GUI and does not send data to the front-end on its own,
it would not notice that the session is dead. The session would remain in the system
until the auto-logout is triggered or it is manually terminated. To avoid this situation,
the application server sends a keepalive packet to the SAP GUI if the SAP GUI has not
sent any data for a certain period.
In addition to the dialog masks on the screen, users want to print datasets, tables or
reports. SAP R/3 provides different ways of printing:
• Remote printing: The R/3 spool process can access any printer in the network that
can handle the lpd8 protocol. The spool process may access the printer directly
or via a print server. This method is preferable for time consuming reports, as
they can be generated within the SAP system asynchronously and printed without
requiring the user’s desktop PC.
• Front-end Printing: The output data is transmitted directly over the SAP GUI
TCP connection to the front-end PC. The SAP GUI starts the SAPlpd process on
the PC and transmits the print data to it. SAPlpd can send the output to any
printer defined on the PC. Using this way of printing, the user does not rely on the
set of printers defined in the SAP system.
8 see also RFC 1179 (Line Printer Daemon protocol)
2.2 Traffic Measurement Techniques
There are numerous ways to measure traffic volumes. They differ in the volume of data
that is recorded, the infrastructure and effort necessary to deploy them, and the data
granularity they provide. Because the different methods capture very different aspects
of the measured traffic, the choice has a significant impact on the further analysis.
This section gives a general review of the different techniques and explains why NetFlow
was eventually chosen for the analysis. Although multiple methods could be combined
to achieve additional results, this would clearly be beyond the intention of this work.
2.2.1 Packet Level Trace
The most straightforward method is to record all network traffic. This can be done
with standard PC hardware9. Packet level traces can be captured using the libpcap
interface [JLM]. These traces contain all information that is exchanged between client
and server: the packet headers with IP addresses, IP protocol, and port numbers, as
well as the transported payload.
In order to capture network traffic, the so-called sniffer needs to be either on the path
of the traffic or needs to get a copy. As measurement should not impact normal operation,
the latter option is commonly used. The copy is made either by a wire-tap or via the
SPAN functionality of a switch on the path of the traffic (see Figure 2.5).
Figure 2.5: SPAN port operation (source Cisco Systems)
In contrast to many Internet protocols, like HTTP, there is no documentation of the
proprietary SAP GUI protocol, and thus the payload cannot be decoded. This limits the
value of a packet level trace, as the analysis does not allow conclusions to be drawn
about intra-session activity (such as the transactions used, dialog masks, etc.), as
would be possible for the HTTP protocol [Fel00]. However, even without the capability
to decode the payload, it could be possible to identify a profile of the intra-session
behavior in terms of idle periods, burstiness, or packet size distribution.
9 though 100 Mbit/s can be recorded with low-cost hardware, a high-end PC would be required
for gigabit wire-speed connections
One issue with packet level traces is the volume of data generated. The intended
environment generates more than 250 GB in roughly 600 million packets per day. This
clearly challenges the capture infrastructure and further analysis. Another problem is
data privacy, as all content that is not encrypted could be reconstructed. Last but not
least, packet level capturing requires dedicated hardware (possibly multiple sniffers to
cover a redundant or load-balanced infrastructure).
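A back-of-the-envelope calculation from the rounded figures above gives an idea of the rates involved:

```python
# Figures taken from the text (rounded):
BYTES_PER_DAY = 250e9      # ~250 GB per day
PACKETS_PER_DAY = 600e6    # ~600 million packets per day

avg_packet_size = BYTES_PER_DAY / PACKETS_PER_DAY      # ~417 bytes/packet
avg_rate_mbit = BYTES_PER_DAY * 8 / 86400 / 1e6        # ~23 Mbit/s average
```

Note that the 23 Mbit/s figure is a 24-hour average; peak rates during business hours are considerably higher, which is what makes full packet capture demanding.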
2.2.2 SNMP and RMON
Another traffic measurement option is the Simple Network Management Protocol (SNMP),
specified by the IETF in RFC 1157. SNMP provides access to internal counters of network
devices in a standardized manner. The values most commonly considered are the interface
'ifInOctets'/'ifOutOctets' counters that represent link utilization. SNMP data provides
a good impression of link utilization; however, it is impossible to find out which kind
of network traffic is filling up the lines.
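For illustration, link utilization can be derived from two successive ifInOctets samples; the sketch below also handles a single 32-bit counter wrap between polls (MIB-II interface counters are 32-bit; newer MIBs offer 64-bit variants):

```python
def link_utilization(octets_t0, octets_t1, interval_s, if_speed_bps,
                     counter_bits=32):
    """Percent link utilization from two SNMP ifInOctets/ifOutOctets
    samples taken interval_s seconds apart, allowing for one
    counter wrap between the polls."""
    delta = octets_t1 - octets_t0
    if delta < 0:                       # counter wrapped between polls
        delta += 2 ** counter_bits
    return delta * 8 / interval_s / if_speed_bps * 100

# 45 MB transferred in a 5-minute poll interval on a 100 Mbit/s link:
link_utilization(0, 45_000_000, 300, 100_000_000)   # -> 1.2 (percent)
```

Exactly this kind of per-link average is all SNMP interface counters can deliver, which is why the next paragraphs turn to per-host and per-protocol breakdowns.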
To identify traffic patterns or build models for workload prediction, it is very important
to take a more detailed look at the links and break down the traffic by IP addresses and
protocols/ports.
1  ethernet statistics  statistics measured for each monitored ethernet interface
2  history              periodic statistical samples from the ethernet network
3  alarm                periodic statistical samples are compared to thresholds, and
                        an event is generated when a threshold is crossed
4  host                 statistics for each discovered host
5  hostTopN             the top N hosts, as ordered by one of their statistics
6  matrix               statistics about conversations (for example, traffic volume)
                        between pairs of hosts
7  filter               packet patterns which can be matched against traffic, either
                        for packet capture or event generation
8  packet capture       the actual packets captured per the filter group
9  event                defines events and what to do when an event occurs
Table 2.3: RMON Groups
More granular data is available with RMON. RMON defines a number of groups (as
shown in Table 2.3) that hold historical data about the network device and can be queried
by a central server. As such, RMON provides network administrators with comprehensive
network-fault diagnosis, planning, and performance-tuning information. RMON became
a draft standard in 1995 as RFC 1757, which has been obsoleted by RFC 2819.10
10 the description here does not address the additional functionality of the RMON2 MIB as defined
in RFC 2021, as this would not provide additional insights
However, current network devices do not implement all RMON groups11. These so-called
MiniRMON implementations therefore do not provide a detailed look into traffic patterns.
To use the full RMON functionality, dedicated RMON probes need to be deployed. RMON
probes are used in the same way as sniffers, but provide a standardized set of analysis
and filtering functions. Deploying an RMON infrastructure requires significant investment
in the probes and in the centralized management station, which collects the data from
the probes and performs the further analysis.
2.2.3 NetFlow
Another alternative is NetFlow [Cis05]. NetFlow is a feature of network devices that
provides statistics on the packets flowing through them. When NetFlow is activated on
a network device, it creates an internal record for every flow that is forwarded. A flow
is a unidirectional sequence of packets that share common attributes, such as source and
destination IP address and source and destination port (see also Figure 2.6). Several
counters of this record are updated whenever a packet belonging to the particular flow
is observed. These records can be exported to a server, which is referred to as the
NetFlow collector.
Figure 2.6: NetFlow enabled network device
2.2.4 Application Logfiles
Besides methods on the network layer, most applications provide internal log files as well.
These logs are commonly not designed with network characteristics in mind, but may still
provide valuable data. In addition, the application logfiles provide insights about user
activity in the application.12 SAP R/3 in particular provides a comprehensive set of
logfiles. These can be retrieved within R/3 by system administrators with appropriate
privileges via the transactions 'STAD' and 'ST03N' [Jan05] (see also Figure 2.7).
11Cisco Catalyst switches support only the 'ethernet statistics', 'history', 'alarm' and 'event' groups
12because this information could also be (mis-)used to create a profile about specific users, access is subject to works council oversight and legal regulation
Figure 2.7: SAP transaction ST03N (source SAP AG)
2.2.5 Assessment of Traffic Measurement Techniques
Table 2.4 summarizes the different methods that have been mentioned in this chapter.
                    Packet level   SNMP          RMON          NetFlow       App.
                    trace                                                    Logfiles

Data volume         high           low           moderate      moderate      moderate

Data granularity    high           low           medium        medium        medium

Data admission      SPAN port      via network   via RMON      via network   only for
                    or wire tap    devices       probe         devices       application
                                                                             sysadmin

Intra-Session       possible       not           not           not           usually not
behavior analysis                  possible      possible      possible      available

Per TCP session     possible13     not           not           available     limited
data                               available     available

Application         only for       not           not           not           possible
specific analysis   decodable      possible      possible      possible
                    protocols

Required            dedicated      can run on    dedicated     can run on    just disk
resources           hardware       a shared      server and    a shared      space
                                   server        probes        server

Table 2.4: Overview of Traffic Measurement Techniques
Considering the cost-benefit ratio, NetFlow is most suitable to create a profile of the
traffic characteristics. Raw packet level traces require dedicated hardware, plenty of disk
space, and CPU time to store and process the captured data. [SF02] has shown that
NetFlow provides quite similar TCP connection summaries.
13requires considerable effort to reconstruct TCP sessions
For the purpose of application-specific network load analysis, SNMP is inadequate. SNMP
can only be used to obtain a general load profile, which does not distinguish between the
different applications used over the network. Per-session data is not available via SNMP
either.
RMON does not provide any data which could not be retrieved from packet level traces as
well. Deploying it to analyze an application load profile does not provide any benefits
compared to sniffers. The main advantage of RMON is that it works in a scalable and
standardized way. Its capabilities can be used as part of proactive network management
or capacity planning.
The only sensible alternative to NetFlow would be the application logfiles. However, this
data was not accessible for this analysis. The application logfiles would provide much
more insight into the user behavior with respect to the transactions used. In this sense
the data is also much more sensitive, as already mentioned above.
All in all, NetFlow allows quite granular and accurate traffic measurements as well as
high-level aggregated traffic collection. NetFlow provides extensive data per TCP session.
It can be implemented without additional investment in new servers or probes, and the
volume of data remains in a manageable order of magnitude, even for longer periods of
observation.
2.3 NetFlow in detail
2.3.1 Flow definitions
One of the first models of traffic flows was the packet train model by Jain [JR86] (see
also Figure 2.8). He defines a packet train as a burst of packets arriving from the same
source and heading to the same destination. If the spacing between two packets exceeds
some inter-train gap, they are said to belong to different trains.
Figure 2.8: The packet train model
Jain considers the inter-car gap as a system parameter that depends on network hardware
and software. The packet train model reflects the fact that network sessions consist of
a sequence of bursts and idle periods. Subsequently, Claffy et al. introduced a more
generalized, time-out based flow model on the IP layer [CBP95], which was inspired by
the packet train model. This model describes a flow as "a sequence of packets matching
certain criteria, exchanged between two entities on a network".
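The gap-based splitting that these flow models share can be sketched as follows (a minimal illustration; the timestamps and gap threshold are made-up values, not taken from [JR86] or [CBP95]):

```python
def split_into_trains(arrival_times, max_gap):
    """Group sorted packet arrival times (seconds) into packet trains.

    Two consecutive packets belong to the same train if their spacing does
    not exceed the inter-train gap; otherwise a new train starts.
    """
    trains = []
    for t in arrival_times:
        if trains and t - trains[-1][-1] <= max_gap:
            trains[-1].append(t)  # gap small enough: same train
        else:
            trains.append([t])    # gap exceeded: start a new train
    return trains

# Example: with a 0.5 s inter-train gap, these six packets form two trains.
packets = [0.00, 0.10, 0.15, 2.00, 2.05, 2.40]
print(split_into_trains(packets, max_gap=0.5))
# → [[0.0, 0.1, 0.15], [2.0, 2.05, 2.4]]
```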
The Cisco NetFlow implementation matches that flow definition quite well. Cisco defines
a flow as a sequence of packets that are identical in seven key attributes [Cis05]:
• Source IP address
• Destination IP address
• Source port
• Destination port
• Layer 3 protocol
• TOS byte (DSCP14)
• Input interface
In contrast to Claffy's definition, the Cisco implementation does not work solely
timeout-based, but expires flows depending on resource usage as well.
14DiffServ Code Point
2.3.2 Flow Expiration
The primary job of network devices is the forwarding of payload. Thus the collection of
statistics like NetFlow should not become a burden. In order to free up resources in a
timely manner, Cisco has defined several rules that determine when flows expire:
• Flows which have been idle for a specified time are expired and removed from the
cache. This is called the inactive time-out.
• Long-lived flows are expired and removed from the cache. Flows are not allowed
to live longer than this active time-out; the underlying packet conversation remains
undisturbed.
• As the cache becomes full, a number of heuristics are applied to aggressively age
groups of flows simultaneously. Although it makes sense that the router preserves
its resources to remain fully functional, it is unfortunate that these heuristics are
rarely documented by Cisco.
• TCP connections which have reached the end of byte stream (FIN) or which have
been reset (RST) will be expired.
Figure 2.9 illustrates the active and inactive time-out. In this example three flow records
are exported: the first because of an active timeout, the second because of an inactive
timeout, and the final one because of the session end (FIN set).
Figure 2.9: Flow timeout
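The interplay of key attributes, per-packet counters, and the two time-outs can be sketched in a few lines (a simplified illustration, not Cisco's implementation; the key tuple uses only five of the seven attributes, and the cache-pressure heuristics and FIN/RST expiration are omitted):

```python
class FlowCache:
    """Simplified sketch of a NetFlow-style cache (not Cisco's implementation).

    A flow is keyed by a tuple of packet attributes; per-packet counters are
    updated, and flows are expired by an inactive or active time-out.
    """

    def __init__(self, active_timeout=1800.0, inactive_timeout=15.0):
        self.active_timeout = active_timeout      # default 30 min on routers
        self.inactive_timeout = inactive_timeout  # default 15 s on routers
        self.cache = {}     # key -> [first_seen, last_seen, packets, bytes]
        self.exported = []  # expired flow records: (key, packets, bytes)

    def packet(self, key, size, now):
        self._expire(now)
        if key in self.cache:
            record = self.cache[key]
            record[1] = now    # refresh last-seen timestamp
            record[2] += 1     # packet counter
            record[3] += size  # byte counter
        else:
            self.cache[key] = [now, now, 1, size]

    def _expire(self, now):
        for key, (first, last, pkts, octets) in list(self.cache.items()):
            if now - last > self.inactive_timeout or now - first > self.active_timeout:
                self.exported.append((key, pkts, octets))
                del self.cache[key]

# Source/destination address and port plus protocol identify the flow here.
key = ("10.0.0.1", "10.0.0.2", 1042, 3200, 6)
fc = FlowCache()
fc.packet(key, 100, now=0.0)
fc.packet(key, 100, now=1.0)
fc.packet(key, 100, now=60.0)  # idle for 59 s: the old record is exported
print(fc.exported)             # one record with 2 packets and 200 bytes
```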
2.3.3 NetFlow and TCP Sessions
Following Cisco's definition, a flow is unidirectional. This implies that a TCP session
results in a minimum of two flow records, one for each direction. As explained in the
previous section, time-outs may also fragment one direction's TCP channel into multiple
records.
As Sommer et al. [SF02] have shown, these fragments can be reassembled into TCP
connection summaries.
2.3.4 NetFlow on Routers
Routers use an extra memory region, the NetFlow cache, to keep a record of each active
flow if NetFlow is enabled. In the past, the NetFlow cache was also used to accelerate
the packet forwarding process. With the new switching mode CEF15, the NetFlow cache
is used for traffic statistics only. For every packet, the usage counters (number of packets
& bytes) are updated in the NetFlow cache upon forwarding.
Beside the key attributes and the usage counters, additional information about the flow
is stored in the cache as well. Table 2.5 shows all data which is stored in the NetFlow
cache for v1 flows16.
Source IP address
Destination IP address
Next hop’s router IP address
Ingress interface SNMP ifIndex
Egress interface SNMP ifIndex
Packets in the flow
Bytes in the flow
SysUptime at the start of the flow
SysUptime at the time the last packet of the flow was received
Layer 4 source port number or equivalent
Layer 4 destination port number or equivalent
Layer 4 protocol (6=TCP, 17=UDP)
IP type-of-service byte
Cumulative OR of TCP flags
Table 2.5: NetFlow v1 data
When flows expire, multiple flow records are packed together into a UDP packet17, which
is then sent to the NetFlow collector station. This is called NetFlow Data Export (NDE).
As there is no well-known UDP port for NDE, the port can be chosen arbitrarily by the
network administrator.
The default setting for the active timeout on routers is 30 minutes, for the inactive
timeout 15 seconds.
15CEF (Cisco Express Forwarding) works with a prebuilt Forwarding Information Base (FIB) and no longer relies on caching.
16the differences between the several NetFlow versions are shown in the section NetFlow Versions
17for v1, each flow record requires 44 bytes; a UDP packet may contain up to 24 records
2.3.5 NetFlow on Switches
Switches work differently than routers. The obvious difference is that switches work at
layer 2, whereas routers work on layer 3. This difference has been blurring since layer 3
switches were introduced. The real distinction behind the scenes is that switches
primarily forward packets in hardware, whereas routers forward in software.18
In order to provide NetFlow statistics, a switch needs to be at least aware of layer 3, i.e.,
the switch needs to look into the layer 3 and layer 4 header (for UDP/TCP port numbers)
to extract the keys as well as additional information, which might be exported. For this
reason only a few Cisco switch models with special daughter cards support NetFlow:
• Cisco Catalyst 5000/5500 (end-of-sales since 2003) supports NetFlow if a NFFC
(NetFlow Feature Card) or NetFlow Service Card is installed
• Cisco Catalyst 4500 with Supervisor Engine IV or Supervisor Engine V
• Cisco Catalyst 6000/6500 if the Supervisor Engine has a PFC (Policy Feature Card)
Initially, NetFlow was only supported for traffic that was routed by the layer 3 switch.
To accelerate the packet forwarding, only the first packet was sent to an extra router19.
The forwarding decision was stored in the MLS (MultiLayer Switching) cache. The same
cache was also used to keep account of NetFlow data. One limitation of NetFlow on
switches is that not all attributes that are exported by routers are available on switches
as well (e.g., switches do not export the TCP flags).
In newer versions of the operating system CatOS, Catalyst 6000/6500 switches also
support exporting NetFlow data about layer 2 switched traffic. However, enabling this
feature can result in significant load caused by NDE.
2.3.6 Aggregation and Sampling
In a high-bandwidth, high-volume environment, NetFlow can consume significant
resources in terms of CPU, memory, network bandwidth, or disk space. In order to
reduce the resources required to enable NetFlow in such areas, two modifications have
emerged:
• Aggregation: people are rarely interested in the raw flow records themselves, but
rather in statistics about traffic characteristics. The raw flow data is classified by
different keys and afterwards summarized. Alternatively, some summarization can
already be done on the network device.
18However, with the most recent routers, the distinction between routers and switches is more and more vanishing.
19that might reside on a special daughter card like the MSFC (Multilayer Switching Feature Card) for Catalyst 6000/6500
• Sampling: instead of considering every single flow, only a statistical sample is taken.
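Both reductions can be sketched in a few lines (an illustrative example; the record fields are assumptions, and deterministic 1-in-n sampling is shown for simplicity, although sampling schemes can also be randomized):

```python
from collections import defaultdict

def aggregate(flows, key_fn):
    """Summarize raw flow records by a coarser key (e.g., destination port)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [packets, bytes]
    for f in flows:
        totals[key_fn(f)][0] += f["packets"]
        totals[key_fn(f)][1] += f["bytes"]
    return dict(totals)

def sample(flows, n):
    """Deterministic 1-in-n sampling: keep every n-th flow record."""
    return flows[::n]

flows = [
    {"dport": 3200, "packets": 10, "bytes": 1000},
    {"dport": 3200, "packets": 5,  "bytes": 400},
    {"dport": 22,   "packets": 2,  "bytes": 200},
]
print(aggregate(flows, key_fn=lambda f: f["dport"]))
# → {3200: [15, 1400], 22: [2, 200]}
print(sample(flows, 2))   # keeps the 1st and 3rd record
```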
2.3.7 NetFlow Versions
• Version 1
This is the initially released export format as described above. In its core, it is still
incorporated into all subsequent versions.
• Version 5
Additional fields for BGP autonomous system (AS) information have been added,
as well as sequence numbers.
• Version 7
This version is only supported by switches, not by routers. The data fields are
similar to version 5.
• Version 8
Version 8 allows export datagrams to contain a subset of the usual Version 5 export
data in an aggregated format.
• Version 9
This format accommodates new NetFlow-supported technologies such as multicast,
Multiprotocol Label Switching (MPLS), and Border Gateway Protocol (BGP) next
hop. The distinguishing feature of the NetFlow Version 9 format is that it is
template based. Templates provide a means of extending the record format, a feature
that should allow future enhancements to NetFlow services without requiring
concurrent changes to the basic flow-record format. IP Flow Information Export
(IPFIX) was based on the Version 9 export format.
In this work NetFlow version 1 is used. The main reason for this was the support by the
intended network devices and by the collector software; version 1 was the least common
denominator in the environment. However, today at least version 5 would be preferable.
The most significant improvement in version 5 was the introduction of the sequence
number for NDE datagrams. Using this sequence number, the data quality can be
analyzed as well.
2.4 Statistics
The design of robust and reliable networks and network services has become an
increasingly challenging task in today's world. To achieve this goal, understanding the
characteristics of Internet traffic plays a more and more critical role. Empirical studies
of measured traffic traces have led to the wide recognition of self-similarity in network
traffic [ZYD].
In earlier times, the Poisson process with a memoryless waiting-time distribution was
used to model traditional telephony networks. This model says that the intervals T
between call arrivals and departures are intervals between independent, identically
distributed random events. It can be shown that these intervals have a negative
exponential distribution, i.e.: P[T ≥ t] = e^(−t/h), where h is the mean holding time
(MHT) [Wik06].
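This exponential tail can be checked numerically (a quick sketch; the mean holding time of 3 s and the test point are arbitrary example values):

```python
import math
import random

random.seed(1)
h = 3.0        # mean holding time (arbitrary example value)
n = 200_000
# expovariate(1/h) draws exponentially distributed intervals with mean h
samples = [random.expovariate(1.0 / h) for _ in range(n)]

t = 2.0
empirical = sum(1 for s in samples if s >= t) / n  # estimate of P[T >= t]
theoretical = math.exp(-t / h)                     # e^(-t/h)
assert abs(empirical - theoretical) < 0.01         # matches the formula
```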
Heavy-tailed distributions have properties that are qualitatively different from commonly
used distributions such as the Poisson distribution. A distribution is said to have a heavy
tail if: P[X > x] ∼ x^(−α), as x → ∞, 0 < α < 2. Regardless of the distribution for small
values of the random variable, if the asymptotic shape of the distribution is hyperbolic,
it is heavy-tailed. The simplest heavy-tailed distribution is the Pareto distribution, which
is hyperbolic over its entire range. A characteristic of heavy-tailed distributions is that
the log-log plot of the tail of such a distribution is approximately linear over many orders
of magnitude. The tail of an exponential distribution, in contrast, is linear only when the
logarithm of the probability is plotted against the value itself (a semi-log plot).
The Pareto distribution, named after the Italian economist Vilfredo Pareto, is a power-law
probability distribution found in a large number of real-world situations. Pareto
originally used this distribution to describe the allocation of wealth among individuals,
since it seemed to show rather well the way that a larger portion of the wealth of any
society is owned by a smaller percentage of the people in that society. This idea is
sometimes expressed more simply as the Pareto principle or the '80-20 rule', which says
that 20% of the population owns 80% of the wealth. In the context of ranking, a special
form of the Pareto distribution is used, which is named Zipf's law. Zipf's law states that
the size of the r-th largest occurrence is inversely proportional to its rank.
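The straight-line log-log tail of the Pareto distribution can be verified directly from its complementary CDF (a brief sketch; the tail index α = 1.2 and the sample points are arbitrary choices):

```python
import math

alpha = 1.2  # tail index (arbitrary example, 0 < alpha < 2)
x_m = 1.0    # scale parameter of the Pareto distribution

def tail(x):
    """Complementary CDF of a Pareto distribution: P[X > x] = (x_m / x)^alpha."""
    return (x_m / x) ** alpha

# On a log-log plot the tail is an exact straight line with slope -alpha:
xs = [10.0, 100.0, 1000.0]
slopes = [
    (math.log(tail(b)) - math.log(tail(a))) / (math.log(b) - math.log(a))
    for a, b in zip(xs, xs[1:])
]
print(slopes)   # each slope equals -alpha = -1.2 (up to floating-point error)
```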
Since (unlike traditional telephony traffic) packetised traffic exhibits self-similar or fractal
characteristics, conventional traffic models do not apply to networks which carry self-
similar traffic. With the convergence of voice and data, the future multi-service network
will be based on packetised traffic, and models which accurately reflect the nature of
self-similar traffic will be required to develop, design and dimension future multi-service
networks.
Chapter 3
Data Collection
The data for this analysis has been collected in a live enterprise environment over a
period of three months. The observed SAP R/3 installation serves roughly 40,000 users
located in the EMEA region (Europe, Middle East and Africa). The R/3 system is used
for all kinds of financial accounting, ordering, billing, and payment processes as well as
human resources administration. Besides staff members in finance or HR administration,
who use R/3 quite intensively, it is used by almost all employees at least occasionally
(e.g., for vacation planning or expense reports). The data has been captured over three
months between the beginning of September 2005 and the end of November 2005.
This chapter describes the network and the SAP R/3 installation where the NetFlow
data has been gathered, the configuration modifications on the network devices, and how
the collection has been set up.
3.1 Network Setup
As explained earlier, SAP R/3 is a vital application for most companies. Thus the
infrastructure is architected in a redundant manner. For maximum redundancy, the
servers are located in two separate data centers several kilometers apart. The network
spans both data centers using a DWDM (Dense Wavelength Division Multiplex) system.
Figure 3.1 shows a high-level overview of the network topology. The network is structured
in several layers. The servers are attached to a couple of layer 2 switches. These switches
are connected to a pair of layer 3 switches - the serverfarm switches. These switches
concentrate all internal servers (including non-SAP servers) in the data centers. Another
pair of layer 3 switches - the core switches - connects the office LAN environment, the
server farm and the wide area backbone. The WAN connects the locations within the
EMEA region as well as the US and Asia Pacific region.
Figure 3.1: Network Schematic
The NetFlow Data Export (NDE) has been set up on the serverfarm switches. The
serverfarm switches have been chosen as all external traffic1 has to pass this pair of
switches. The layer 2 switches within the serverfarm do not support NDE. An alternative
would be to run NDE on the core switches. However, all traffic related to the SAP R/3
servers that passes the core switches also has to pass the serverfarm switches, while the
core switches transport significantly more payload; this would result in a higher data
volume without providing more insights.
3.2 SAP R/3 Server Infrastructure
The analyzed R/3 implementation runs on a serverfarm of four SunFire 15k machines
[Mic]. Two systems are located in the primary data center, the other two in the backup
data center. Each SF15k is partitioned into several domains. All domains are separated
from each other, and each domain runs its own instance of the operating system Solaris.
The four SF15k machines are partitioned into 38 domains (plus 8 spare domains) as
shown in Figure 3.2.
1i.e., all traffic which is not internal to the serverfarm, like the connection to the database server

Figure 3.2: SAP R/3 Server infrastructure

The production system spans over six domains. All six production domains are running
in the primary data center. The database server and the R/3 Central Instance (CI) are
running on a cluster, i.e., if there is a problem on either of the two domains, the other
domain takes over the workload of both. This is essential, as the database server and the
R/3 CI server are single points of failure in the SAP R/3 architecture. The failure of
either of them would bring down the whole R/3 installation. To avoid both domains
being affected by the same defect simultaneously, the domains are hosted on different
SF15k machines.
Similarly, each two of the four R/3 dialog instances are running on different machines.
Using the message server, the SAPGUI on the user's PC chooses one of these servers to
connect to. The dialog instances interact with the SAPGUI, generate dynamic screen
masks, and execute the ABAP code.
Although it is generally possible to use the R/3 CI server in the same way as the dialog
instances, normal end users are not allowed to log on to the CI server, since the load of
the CI server affects the performance of the entire R/3 system. Only special users like
system administrators are allowed to log on to the CI server.
For vital applications such as SAP R/3 in large enterprises, the high availability achieved
by clustering is not enough. These companies want to protect such applications also
against natural disasters, fire, or terrorist attacks. In order to meet these concerns, a
backup data center is used as disaster recovery (DR) site. With a second pair of SF15k
machines, the DR site provides enough capacity to handle the full workload in case of a
disaster.
Beside the production instance of R/3, there is a number of supplementary instances,
mainly used in the development cycle for testing, consolidation, or quality assurance.
There are also instances for training and monitoring.
3.3 NetFlow Data Export
As explained above the server farm switches have been chosen for NDE. These switches
are Cisco Catalyst 6500 switches with the Supervisor Engine 1A and MSFC. The switches
run the CatOS 6.3 operating system.
The configuration on the switches is quite simple. Figure 3.3 shows the commands
required to activate NetFlow on a switch. The command set mls flow full sets the flow
mask, i.e., which attributes are held in the MLS cache for NetFlow. set mls nde <ip
address> <port> defines the NetFlow collector where the NDE datagrams are sent to.
#mls
set mls flow full
set mls nde version 1
set mls nde <ip address> <port>
set mls nde enable
Figure 3.3: Serverfarm switch configuration
As shown in Figure 3.4, the command show mls nde is used to verify that NDE runs
properly. The number of exported flows should increase if the command is executed
multiple times. The switch is now sending UDP packets with NetFlow records to the
collector station.
Switch (enable) sh mls nde
Netflow Data Export version: 1
Netflow Data Export enabled
Netflow Data Export configured for port <port> on host <ip address>
Total packets exported = 420
Switch (enable)
Figure 3.4: Verifying the NDE operation on switches
In CatOS 6.3, the Catalyst 6500 supports NetFlow versions 1, 7, and 8. Version 1 has
been chosen, as it provides all data fields of interest that are available2. In addition,
version 1 is supported by switches as well as by routers, which makes the work
transferable to different environments. A drawback of this decision was identified during
the data analysis: as NetFlow version 1 does not support sequence numbers, it is not
possible to verify whether NDE datagrams are lost on their way to the collector.
2in addition the TCP flags would be interesting, but they are not supported in any NetFlow version
on this switch
3.4 NetFlow Collector
The UDP NetFlow packets are received by the NetFlow collector. The collector is a
server that runs special software listening on the configured UDP port for NetFlow
datagrams. In recent years, several NetFlow collector software packages have emerged.
The OSU flow-tools [RFL00] is an open-source implementation of a NetFlow collector.
Unlike most commercial collectors or CAIDA's cflowd, the flow-tools package does not
aggregate or process the flow records it receives; it rather dumps all records as
compressed files onto disk. This provides maximum flexibility for the further evaluation.
flow-capture   Collect, compress, store, and manage disk space for exported
               flows from a router.
flow-cat       Concatenate flow files. Typically flow files will contain a
               small window of 5 or 15 minutes of exports. Flow-cat can be
               used to append files for generating reports that span longer
               time periods.
flow-fanout    Replicate NetFlow datagrams to unicast or multicast
               destinations. Flow-fanout is used to facilitate multiple
               collectors attached to a single router.
flow-report    Generate reports for NetFlow data sets. Reports include
               source/destination IP pairs, source/destination AS, and top
               talkers.
flow-tag       Tag flows based on IP address or AS #. Flow-tag is used to
               group flows by customer network. The tags can later be used
               with flow-fanout or flow-report to generate customer based
               traffic reports.
flow-filter    Filter flows based on any of the export fields. Flow-filter is
               used in-line with other programs to generate reports based on
               flows matching filter expressions.
flow-import    Import data from ASCII or cflowd format.
flow-export    Export data to ASCII or cflowd format.
flow-send      Send data over the network using the NetFlow protocol.
flow-receive   Receive exports using the NetFlow protocol without storing to
               disk like flow-capture.
flow-gen       Generate test data.
flow-dscan     Simple tool for detecting some types of network scanning and
               Denial of Service attacks.
flow-merge     Merge flow files in chronological order.
flow-xlate     Perform translations on some flow fields.
flow-expire    Expire flows using the same policy of flow-capture.
flow-header    Display meta information in flow file.
flow-split     Split flow files into smaller files based on size, time, or
               tags.

Table 3.1: OSU flow-tools collection
The OSU flow-tools provide a library and a collection of programs used to collect, send,
process, and generate reports from NetFlow data. The tools can be used together on a
single server or distributed to multiple servers for large deployments. Table 3.1 shows
the programs included in the flow-tools.
To capture the NetFlow data for this analysis, the flow-capture program is running on
a Solaris-based server with sufficient disk space. Disk space is essential, as every day
roughly 120-170 MB (compressed) are written to disk - during the weekends significantly
less. The flow-capture program is quite simple: it receives the NDE datagrams on a
configurable UDP port and writes them in a structured manner3 onto disk, as shown in
Figure 3.5. Every 15 minutes flow-capture starts a new file. These chunks are used for
the further processing, as their size is much easier to handle than a single file per day
would be.
Due to normal maintenance, the NetFlow collector was rebooted on September 12th.
After the reboot, the collector process was not restarted automatically, as it runs
completely in user space without any root privileges. This caused a loss of NetFlow
records over a period of 42 hours until it was discovered. To avoid similar incidents, a
cron job now regularly checks whether the flow-capture process is running and restarts
it if necessary.
$ du -sk DATA/raw/*/*/*
131297 2005/2005-09/2005-09-01
127281 2005/2005-09/2005-09-02
70065 2005/2005-09/2005-09-03
73905 2005/2005-09/2005-09-04
138265 2005/2005-09/2005-09-05
137655 2005/2005-09/2005-09-06
136103 2005/2005-09/2005-09-07
136009 2005/2005-09/2005-09-08
130817 2005/2005-09/2005-09-09
69321 2005/2005-09/2005-09-10
67969 2005/2005-09/2005-09-11
120919 2005/2005-09/2005-09-12
81779 2005/2005-09/2005-09-14
139077 2005/2005-09/2005-09-15
129701 2005/2005-09/2005-09-16
72085 2005/2005-09/2005-09-17
[...]
Figure 3.5: Raw data collected with flow-capture
3there are individual subdirectories per year, month, and day
Chapter 4
Data Preparation
This chapter describes how the raw data is prepared for the analysis. The preparation
falls into two major parts: the filtering and the reconstruction of TCP session data.
The raw data has been captured over three months in 2005, starting September 1st,
0:00 and ending November 30th, 23:59. Overall, during this period 668 million flows
have been recorded, consisting of 54 billion packets and a payload of 23.5 TB.
The collector was - as mentioned earlier - down for 42 hours starting September 12th,
as the flow-capture daemon was not restarted after a reboot. As this is less than 2% of
the total flow collection time, it should not impact the overall significance of the results.
In order to access statistical data from the raw NetFlow files, the Perl module 'CFlow'
is used. It provides a simple way to access files written by flow-tools. CFlow's core is
written in C and calls the flow-tools library functions. The CFlow module provides the
Perl function 'CFlow::find', which requires references to two functions as arguments:
• perfile - a function that is invoked for every raw flow file. It allows data structures
to be initialized before processing each new file.
• wanted - a function that is called to process each flow. It provides access to the
flow attributes.
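The calling convention of these two callbacks can be sketched in Python (a schematic analogue of the Perl interface, not the actual CFlow code; the file names and record fields are made up):

```python
def cflow_find(perfile, wanted, files):
    """Schematic analogue of the CFlow::find callback interface."""
    for path, flows in files:   # 'files' pairs a file name with its records
        perfile(path)           # invoked once for every raw flow file
        for flow in flows:
            wanted(flow)        # invoked once for every flow record

# Usage: sum up bytes per raw flow file.
stats = {}
state = {"path": None}

def perfile(path):
    state["path"] = path        # remember which file is being processed
    stats[path] = 0

def wanted(flow):
    stats[state["path"]] += flow["bytes"]

cflow_find(perfile, wanted, [
    ("flows.0000", [{"bytes": 100}, {"bytes": 50}]),
    ("flows.0015", [{"bytes": 30}]),
])
print(stats)   # → {'flows.0000': 150, 'flows.0015': 30}
```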
In order to allow access to the counters in the header of the raw flow files (such as the
total number of flows, or the timestamps of the first and last recorded flow), the
modifications shown in appendix A have been made.
4.1 Data Filtering
As the NetFlow data is captured on a pair of central serverfarm switches, it includes
much more traffic than just the SAP R/3 traffic. Furthermore, it also includes traffic
to the test and development instances of R/3, which are out of scope for this analysis.
Therefore the data has to be filtered before the analysis can proceed.
The filtering is a multi-step process (see Figure 4.1). The filters are derived primarily
from the captured raw data; servers and services are identified from the traffic observed
in the NetFlow data. One advantage of this methodology is that it is not led by
expectations or assumptions.
Figure 4.1: High-level process of data filtering
During the 'prefiltering', all external traffic (i.e., all traffic that obviously does not
belong to the SAP R/3 system) is removed. This eliminates a significant amount of
traffic not relevant for this analysis. Afterwards, the traffic is examined to identify
servers and services. The identified servers and services are matched against our
knowledge about the infrastructure. All identified services are rated as to whether they
are related to end-user traffic, which is the subject of this analysis. Traffic from services
not related to end users is finally filtered out.
4.1.1 Prefiltering
The aim of 'prefiltering' is to remove all traffic which is obviously not interesting for this
analysis. Such traffic falls into one of the following three categories:
• all non-TCP traffic
• all traffic that neither targets nor originates from the SAP server subnet
• administrative traffic (such as ssh, smtp, etc.)
The flow-tools suite provides several tools, which have been listed in Section 3.4. The
flow-nfilter executable uses an external filter definition file to select NetFlow records.
Each filter has a filter-definition tag which consists of a DNF (disjunctive normal form)
of filter-primitives.
Each filter-primitive defines an actual filter criterion. A primitive specifies a single
attribute that is compared to the processed NetFlow record. The primitive is simply
processed as a list: as soon as the first permit or deny statement matches the processed
record, the corresponding match statement in the filter-definition is evaluated as true or
false. Figure 4.2 shows the filter used for prefiltering.
filter-definition just-sap-data
match ip-source-address sap-subnet
match ip-protocol tcp
match ip-source-port no-admin-ports
match ip-destination-port no-admin-ports
or
match ip-destination-address sap-subnet
match ip-protocol tcp
match ip-source-port no-admin-ports
match ip-destination-port no-admin-ports
filter-primitive sap-subnet
type ip-address-prefix
permit 192.168.151/24
filter-primitive tcp
type ip-protocol
permit tcp
filter-primitive no-admin-ports
type ip-port
deny 22
deny 25
# [...]
default permit
Figure 4.2: filter-definition tag for prefiltering
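The permit/deny semantics of such a filter can be modeled in a few lines (a simplified sketch of the evaluation logic, not flow-nfilter's actual code; the attribute names are illustrative):

```python
def eval_primitive(rules, value, default):
    """A filter-primitive is processed as a list: the first matching
    permit/deny rule wins; otherwise the default action applies."""
    for action, v in rules:
        if v == value:
            return action == "permit"
    return default == "permit"

def eval_filter(terms, flow):
    """A filter-definition in disjunctive normal form: OR over the terms,
    AND over the match clauses inside each term."""
    return any(
        all(eval_primitive(rules, flow[attr], default)
            for attr, rules, default in term)
        for term in terms
    )

# Hypothetical miniature of the 'just-sap-data' idea: TCP traffic from the
# SAP subnet, excluding administrative destination ports.
no_admin_ports = [("deny", 22), ("deny", 25)]
term = [
    ("src_net", [("permit", "sap-subnet")], "deny"),
    ("proto",   [("permit", 6)], "deny"),        # 6 = TCP
    ("dport",   no_admin_ports, "permit"),
]
flow = {"src_net": "sap-subnet", "proto": 6, "dport": 3200}
print(eval_filter([term], flow))   # → True
```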
As Table 4.1 shows, the prefiltering significantly reduces the data volume that has to be
processed by any subsequent task, even though the filtering criteria are quite simple.
                Raw data               Prefiltered data     Percentage

payload bytes   25.301.246.554.606     585.503.729.230      2.31%
#packets        54.296.169.003         1.016.542.915        1.87%
#flows          667.937.669            15.403.357           2.31%

Table 4.1: Results from Prefiltering
4.1.2 Server Identification
The server identification is an iterative process in itself (see Figure 4.3). First the
top-talkers are extracted from the examined dataset. The number of connections has
been chosen to identify the top-talkers. Alternatively, the number of packets or the
volume of payload could be used, but the number of connections is sufficient in this case.
Figure 4.3: Process of server identification
The top talkers can be easily extracted by a few piped shell commands such as ‘| sort |
uniq -c | sort -rn | head’. Whenever an IP address/TCP port tuple occurs with a
high frequency in the collected NetFlow records, it may be a service. Since NetFlow
records do not reveal who initiated a connection, it may also be a client that connects to
a service. Commonly, clients use random TCP source ports when connecting to a server.
However, sometimes clients also use fixed ports, or accumulation points occur at
random. For this reason, every identified service needs to be verified. This has been done
by crosschecking against knowledge about the analysed infrastructure or by connecting
to the service - if the SYN is answered by a SYN/ACK, the server listens on this TCP
port. Otherwise a RST is sent or the SYN packet simply times out.
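This kind of probe can be sketched as follows; this is an illustrative Python fragment (the actual checks were done manually), relying on the fact that a completed connect() implies the SYN was answered by a SYN/ACK:

```python
import socket

def is_listening(host, port, timeout=2.0):
    """Probe a TCP port: a completed handshake (our SYN answered by a
    SYN/ACK) means a server listens there; a RST or a timeout means
    it does not. connect_ex() returns 0 only on success."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0
```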
filter-primitive ip-192.168.151.10
  type ip-address
  permit 192.168.151.10

filter-primitive port-3210
  type ip-port
  permit 3210

filter-definition sapp10-0-a
  invert
  match ip-source-address ip-192.168.151.10
  or
  match ip-source-port port-3210

filter-definition sapp10-0-b
  invert
  match ip-destination-address ip-192.168.151.10
  or
  match ip-destination-port port-3210

[...]

Shell script 'proc.negative-filter':

[...]
| $FLOW_NFILTER -f$FILTER -Fsapp10-0-a \
| $FLOW_NFILTER -f$FILTER -Fsapp10-0-b \
| $FLOW_NFILTER -f$FILTER -Fsapp10-1-a \
| $FLOW_NFILTER -f$FILTER -Fsapp10-1-b \
[...]
Figure 4.4: Excerpt from negated filter
Whenever a service is detected, it is removed from the data, which then represents the
unidentified portion of the traffic. In order to simplify the filtering process, the Perl
script 'create-filter' listed in Appendix B generates the filters for the flow-nfilter program.
The create-filter program uses a condensed syntax for filter expressions. Each filter
line has the format 'source-ip:source-port:destination-ip:destination-port'.
If an attribute is omitted, it matches any IP address or TCP port. Once a service is
identified, it is appended as an additional line to the create-filter input file. create-filter
then generates the following files:
• filter.out - specifies the filter definitions for flow-nfilter
• proc.positive-filter - a shell script which applies the filter definitions of filter.out
to extract the selected traffic. It is used in a piped context, i.e., it expects the raw
NetFlow data from STDIN and writes the filtered data to STDOUT
• proc.negative-filter - a shell script similar to proc.positive-filter, but it provides
the complementary data: only those NetFlow records that do not match any of
the filter definitions are written to STDOUT.
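The condensed syntax can be parsed as sketched below. The actual create-filter script is a Perl program listed in Appendix B, so this Python fragment and its field names are only illustrative.

```python
def parse_filter_line(line):
    """Parse the condensed 'src-ip:src-port:dst-ip:dst-port' syntax.
    An omitted field matches any IP address or TCP port; trailing empty
    fields may be dropped entirely (as in '192.168.151.10:3210')."""
    fields = (line.strip().split(":") + [""] * 4)[:4]
    keys = ("src_ip", "src_port", "dst_ip", "dst_port")
    return {k: (v or None) for k, v in zip(keys, fields)}

print(parse_filter_line("192.168.151.10:3210"))
# src and src-port set, dst-ip and dst-port match anything
print(parse_filter_line("192.168.151.35:::515"))
# only source IP and destination port 515 are constrained
```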
One issue with the way flow-nfilter defines filters is that it supports only DNF.
For the complementary filtering, however, it is beneficial to use a CNF as the negation
of the DNF of the filter, since the transformation into the corresponding DNF might
result in an exponential growth of the expression. Therefore a trick has been used to
support CNF as well: each clause of the CNF is processed in a separate run of
flow-nfilter, and the runs are tied together by shell pipes (see Figure 4.4). The drawback
of this approach is that shell pipes and the creation of the context for numerous
instances of flow-nfilter cause significant overhead.
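The underlying identity is De Morgan's law: the negation of a DNF is a CNF over the negated literals, and each CNF clause is itself a trivial one-clause expression that a single (inverted) flow-nfilter pass can evaluate. A minimal sketch of the transformation, with an invented literal representation:

```python
def negate_dnf(dnf):
    """Negate a DNF given as a list of AND-clauses of literals, where a
    literal is (name, positive). By De Morgan the result is a CNF: a list
    of OR-clauses over the negated literals - one flow-nfilter run per
    clause, tied together by pipes."""
    return [[(name, not positive) for name, positive in clause]
            for clause in dnf]

# not((A and B) or C) == (not A or not B) and (not C)
cnf = negate_dnf([[("A", True), ("B", True)], [("C", True)]])
print(cnf)  # [[('A', False), ('B', False)], [('C', False)]]
```

Converting the same negation back into a single DNF would instead multiply out the clauses, which is what the pipeline approach avoids.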
A total of 82 services have been identified. As Table 4.2 shows, the filters classify more
than 99.995% of the collected NetFlow data with respect to all measures (number of
flows, packets, and bytes). However, in order to review the accuracy, it is more
interesting to relate the unidentified NetFlow records to the filtered traffic (excluding
the traffic which has been considered not relevant to this analysis). Regarding this
measure, the filters still cover more than 99.5%.
Unidentified traffic Percentage of raw data Percentage of filtered traffic
payload bytes 328.539.438 0.0013% 0.09%
#packets 1.571.397 0.0029% 0.25%
#flows 23.984 0.0036% 0.39%
Table 4.2: Unidentified traffic
4.1.3 User Traffic
The last step restricts the traffic to the end user traffic that accesses the production
systems. Server-to-server communication (such as SAP R/3 RFC calls to the gateway
server) is not included in this analysis, and neither are users logging on to
non-production systems such as training or test systems. This traffic might have very
different characteristics and is therefore filtered out.
The SAP R/3 installation runs on four production domains on the SF15k (see Section
3.2). One serves as the central instance (CI), the other three are installed as application
servers. The CI server in this installation listens on two TCP ports that are of interest
for end users:
• 3610: the message server, which is used by the SAPGUI on the end user's PC to
decide which application server to use.
• 3210: the dispatcher service; a single TCP session is established between the
SAPGUI on the user's PC and this port for as long as the user is logged on.
The majority of the users use only the message server service of the CI server. In order
to log on to the SAP R/3 system, they connect to one of the four application servers
on port 3211. In addition to the CI server and the four application servers, two print
servers are also counted as interactive traffic, as their activity is triggered by the end
user. The print servers establish outbound connections to the printers (or to
intermediate print servers that serve as hubs). These connections use the LPD protocol
[McL90] and target TCP port 515. Figure 4.5 shows the filter that is used by
create-filter to extract interactive user traffic.
host sapc03
  filter 192.168.151.8:3211
host sapd03
  filter 192.168.151.9:3211
host sapp10
  filter 192.168.151.10:3210
  filter 192.168.151.10:3610
host sapc02
  filter 192.168.151.11:3211
host sapd02
  filter 192.168.151.12:3211
host print
  filter 192.168.151.35:::515
  filter 192.168.151.81:::515
Figure 4.5: Filter for interactive user traffic
Table 4.3 and Figure 4.6 summarize the results from filtering.
Filtered traffic Percentage of raw data
payload bytes 367.416.452.139 1.45%
#packets 636.655.006 1.17%
#flows 6.085.131 0.91%
Table 4.3: Filtered traffic
Figure 4.6: Filtered data volume (#flows, #packets, bytes)
4.2 Reconstruction of TCP Sessions
For each user there is a single TCP session that starts when the user connects to the
SAP R/3 system and ends when the user logs off - or is logged off by the system itself
when the inactivity time-out expires. As the TCP sessions are congruent with the
sessions on the application layer, the TCP session characteristics are particularly
suitable for analyzing the network behavior of SAP R/3.
Yet NetFlow records do not directly correspond to TCP sessions. There are two major
reasons for this:
• time-outs - in order to save resources on network devices that create NetFlow
statistics, the data stream of a particular TCP session may be broken into multiple
flow records by the active or inactive timeout (see Section 2.3.2).
• uni-directional flow definition - the definition of flows in the context of NetFlow
only covers a unidirectional stream of packets, but a TCP session is bidirectional.
Thus every TCP session results in separate NetFlow records for each direction.
Section 4.2.1, Flow Defragmentation, describes how NetFlow records that belong to the
same data stream are merged into a single record, while Section 4.2.2, TCP Session
Reassembly, shows how the two uni-directional data streams are recomposed.
4.2.1 Flow Defragmentation
In order to save resources on network devices that create NetFlow statistics, a data
stream may be broken into multiple flow records. This happens upon time-outs or when
the internal NetFlow cache is almost full. Cisco devices have two different time-outs:
the inactivity time-out occurs when no packet belonging to the particular flow has been
seen for a while; the active time-out occurs when the flow has been in the NetFlow
cache for a certain time, regardless of whether packets belonging to this flow are still
being seen or not.
The script ’defragment’, which is listed in Appendix C, merges flow records that belong
to the same data stream into a single record. The approach taken is similar to what is
described in [SF02]. For each NetFlow record, a connection identifier is computed. The
connection identifier is a 4-tuple: (source-IP, destination-IP, source-port, destination-
port). All flows with an identical connection identifier potentially belong to the same
TCP session. However, two different TCP sessions can share the same connection
identifier when the TCP endpoints choose to reuse the same socket. Usually the socket
on the server is bound to the service and does not change, whereas the TCP client must
not reuse a socket during the TIME_WAIT interval after a connection tear-down (see
[Pos81]). Commonly the client chooses a different port for the next connection.
Nevertheless, it happens that clients reuse port numbers, for example after a reboot or
because there is only a finite number of ports and the port number wrapped around.
Unlike on routers, NetFlow on switches does not provide TCP flags. Thus there
are no indicators for session start or session end. The heuristic for the defragmentation
works on timers: if two NetFlow records share the same connection identifier and the
distance between the end time of the first flow and the start time of the second flow is
less than some timer t, both records are combined and their counters are added up.
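The timer heuristic can be sketched as follows. The field names are illustrative; the actual 'defragment' script in Appendix C is a Perl program.

```python
def defragment(flows, t):
    """Merge flow records that share a connection identifier whenever the
    gap between one record's end and the next record's start is below t
    seconds; counters of merged records are added up."""
    latest = {}   # connection identifier -> most recent merged record
    merged = []
    for f in sorted(flows, key=lambda f: f["start"]):
        prev = latest.get(f["conn"])
        if prev is not None and f["start"] - prev["end"] < t:
            prev["end"] = max(prev["end"], f["end"])   # extend in time
            prev["packets"] += f["packets"]
            prev["bytes"] += f["bytes"]
        else:                                          # new data stream
            prev = dict(f)
            latest[f["conn"]] = prev
            merged.append(prev)
    return merged
```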
The left plot of Figure 4.7 shows a histogram of the duration of the NetFlow records.
The plot illustrates the occurrence of both time-outs: the spikes in the histogram at an
interval of roughly 200 seconds are mainly caused by the inactivity time-out; the peak
between 1800 and 1920 seconds, as well as the cut at 1920 seconds, result from the
active time-out. The right plot shows a CCDF of the distance between NetFlow records
that share the same connection identifier.
Figure 4.7: Histogram of flow duration and CCDF of distance between fragments
As indicated in the right plot, many flows have a distance of roughly 200 seconds. For
the SAPGUI sessions and the message server sessions, t = 215¹ has proven to work
effectively. The SAPGUI sessions in particular do not show longer periods without any
packets being sent, as the application continuously sends keepalives.
Figure 4.8 shows the same histogram as Figure 4.7 after defragmentation with different
parameters. The left plot shows that the spikes become significantly smaller, but do not
disappear completely. A high number of sessions still does not exceed 1800 seconds, but
this has other reasons, as described in Chapter 5. In the plot on the right the spikes
almost disappear. Nevertheless, the parameter t is set to 3600 seconds only for the
print server connections, as some of these connections seem to have longer idle periods
in which no packets are sent. For the SAPGUI sessions, 3600 seconds is considered too
high, as this would also recompose two different TCP sessions with a reboot of the PC
in between, which would reinitialize the port number.

¹ The same value has also been used by Sommer et al.

Figure 4.8: Histogram of flow duration (defragmented with t=215s & t=3600s)

The defragmentation merges the 6.1 million collected NetFlow records into a total of
5.1 million flows.
4.2.2 TCP Session Reassembly
The next step is to recompose the bi-directional TCP sessions from the uni-directional
flows provided by the defragmentation. This is done by the script ’reassemble’ shown in
Appendix D.
For the reassembly, the connection end-points of the flows are first classified into client
and server. This is done either by specifying a prefix that identifies the server subnet,
or by considering the end-point with the lower port number to be the server. A session
identifier is computed as the following 4-tuple: (client-IP, server-IP, client-port,
server-port). All flows with the same session identifier that overlap regarding flow start
and end time are considered to belong to the same session. It has proven useful to
extend this to flows which do not overlap but have a gap of less than ten seconds.
The reassembly reconstructs from 5.1 million flows provided by defragmentation a total
of 2.2 million TCP sessions.
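The reassembly step can be sketched as follows, using the server-subnet prefix variant of the end-point classification. Field names are illustrative; the actual 'reassemble' script in Appendix D is a Perl program.

```python
def reassemble(flows, server_prefix):
    """Group uni-directional flows into bi-directional TCP sessions. The
    end-point whose address matches server_prefix is taken as the server;
    flows sharing the session identifier (client-IP, server-IP, client-port,
    server-port) are joined when they overlap in time or leave a gap of
    less than ten seconds."""
    current = {}   # session identifier -> session currently being built
    sessions = []
    for f in sorted(flows, key=lambda f: f["start"]):
        if f["src_ip"].startswith(server_prefix):      # server -> client
            sid = (f["dst_ip"], f["src_ip"], f["dst_port"], f["src_port"])
        else:                                          # client -> server
            sid = (f["src_ip"], f["dst_ip"], f["src_port"], f["dst_port"])
        s = current.get(sid)
        if s is not None and f["start"] - s["end"] < 10:
            s["end"] = max(s["end"], f["end"])         # same session
            s["flows"] += 1
        else:                                          # a new TCP session
            s = {"sid": sid, "start": f["start"], "end": f["end"], "flows": 1}
            current[sid] = s
            sessions.append(s)
    return sessions
```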
Chapter 5
Data Analysis
This chapter describes the statistical analysis. The data is examined for characteristic
distributions and remarkable phenomena. The analysis concentrates only on traffic
that is directly related to end user interaction and is split into the following classes:
• SAP Message Server
• SAPGUI
• Printer
The SAP Message Server is usually contacted first when a user wants to log on to the
SAP R/3 system. It advises the SAPGUI client software which application server to
connect to. Then the SAPGUI connects to the dispatcher process of the application
server, which is in the following simply called a SAPGUI connection. When users print
a report, this may either be done through the PC's operating system (which requires the
user to stay online until the print job is completed), or asynchronously through the SAP
R/3 printing subsystem, which sends the report directly to the printer. In the latter
case, the LPD protocol is used and the traffic is analyzed in the Printer section of this
chapter. If the user chooses to print synchronously, the print data is tunneled through
the SAPGUI session but cannot be distinguished from other activities at the TCP
session level. Thus it is included in the SAPGUI traffic analysis.
5.1 SAP Message Server
The Message Server is responsible for distributing the load over multiple application
servers. The client software first contacts the Message Server when the user wants to
connect to SAP R/3. The Message Server then advises the client which application
server to use.
For SAP Message Server traffic, the following aspects are examined:
• Session duration
• Data volume
• Packet size
The analyzed dataset consists of 998607 TCP sessions, which have been reassembled as
described in the previous chapter. These TCP sessions have a volume of 14.2 million
packets or 2.6 gigabytes.
5.1.1 Session duration
Figure 5.1: CCDF and density plot of flow duration, Message Server (log scales)
Both plots in Figure 5.1 show the distribution of Message Server session durations. The
left plot is a CCDF plot: the x-axis shows the flow duration, log scaled in order to
represent short and long sessions well at the same time; the y-axis shows the
complementary cumulative distribution and is log scaled as well. The right plot shows
the density function of the same dataset.
The bars in the CCDF plot for very short session durations (5 seconds and below) are
caused by discretization, as NetFlow does not provide sub-second resolution for
timestamps. Both plots show that the vast majority of sessions is extremely short: 94%
of all sessions do not take longer than one second, and 99.2% finish within five seconds.
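The empirical CCDF used throughout this chapter can be computed as sketched below; this is a minimal pure-Python version for illustration, not the tooling actually used for the plots.

```python
def ccdf(samples):
    """Empirical CCDF: pairs (x, P[X > x]) over the sorted sample values,
    suitable for a step plot on log-log axes."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, 1.0 - (i + 1) / n) for i, x in enumerate(xs)]

# For durations [1, 1, 2, 4]: P[X > 1] = 0.5, P[X > 2] = 0.25, P[X > 4] = 0.
print(ccdf([1, 1, 2, 4]))
```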
5.1.2 Data volume
Figure 5.2: Volume of Message Server traffic
The plots in Figure 5.2 show the distribution of data volume caused by Message Server
sessions. The left plot is a CCDF plot: the x-axis shows the flow volume in bytes, log
scaled in order to represent both high and low volume sessions well; the y-axis shows
the complementary cumulative distribution and is log scaled as well. The right plot
shows the density function of the same dataset.
The distribution of data volume shows a clear profile for Message Server sessions: in
81.2% of all sessions, 880 bytes are transmitted by the client to the Message Server. For
the opposite direction the profile is not as distinct, but still in 52.8% of all sessions the
server sends 1976 bytes to the client. As Table 5.1 shows, the standard deviation of
the volume sent by the server significantly exceeds the standard deviation of the volume
sent to the server.
to server from server
minimum 40 40
1st quartile 880 1976
median 880 1976
3rd quartile 880 1976
maximum 9314 11950
average 875.3 1867
standard deviation 80.5 351.0
Table 5.1: Distribution of session volume
5.1.3 Packet size
Figure 5.3: Average packet sizes of Message Server sessions
Figure 5.3 shows density plots of the packet size distribution. As NetFlow does not
provide information at packet level granularity, the average packet size per session is
shown on the x-axis, with the density on the y-axis.
Similar to the data volume, the average packet size also shows a clear profile: 99.3% of
all sessions have an average packet size between 100 and 150 bytes for packets sent to
the Message Server. At the same time, 91.7% of the sessions have an average packet
size between 240 and 300 bytes for the packets sent by the Message Server.
As Figure 5.4 shows, most sessions consist of seven packets in each direction, roughly
half of which are caused by the TCP protocol alone¹. Obviously this impacts the
average packet size significantly, as these packets do not contain any payload.

¹ TCP handshake, TCP teardown, and pure acks
Figure 5.4: Density plots of Message Server session length (packets)
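How payload-free packets dilute the per-session average can be illustrated with hypothetical numbers; the 500-byte payload size is invented for the example, only the 40-byte minimum and the seven-packet count come from the data above.

```python
def avg_packet_size(total_bytes, total_packets):
    """Per-session average packet size as derived from NetFlow counters."""
    return total_bytes / total_packets

# A hypothetical seven-packet session: three 500-byte payload packets plus
# four pure TCP packets of 40 bytes (handshake, teardown, acks).
print(avg_packet_size(3 * 500 + 4 * 40, 7))   # ~237 bytes, not 500
```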
5.2 SAPGUI
In this section, the SAPGUI traffic is analyzed for the following aspects:
• Load distribution
• Session duration
• Time of day profile
• Day of week profile
• Data volume
• Traffic symmetry
• Packet size
• Bandwidth usage
• User ranking
During the analysis, two classes of traffic are analyzed separately: the SAPGUI sessions
on the four dedicated application servers and the traffic on the R/3 central instance (CI)
server. The majority of users first connects to the Message Server service and
afterwards connects to one of the four application servers as advised by the Message
Server. If a user tries to connect to the CI server, the R/3 system first verifies whether
the user-id is flagged as a privileged user who is authorized to work on the central
instance. Most users are not authorized and will be logged off automatically. The CI
server is mainly used by SAP administrators and a small group of power users.
The analyzed dataset consists of 1.2 million TCP sessions, which have been reassembled
as described in the previous chapter. These TCP sessions have a volume of 577.6 million
packets or 315.7 gigabytes.
5.2.1 Session duration
Figure 5.5: CCDF and density plot of flow duration, SAPGUI (log scales)
Both plots in Figure 5.5 show the distribution of SAPGUI session durations. The left
plot is a CCDF plot: the x-axis shows the flow duration, log scaled in order to represent
short and long sessions well at the same time; the y-axis shows the complementary
cumulative distribution and is log scaled as well. The right plot shows the density
function of the same dataset.
Both user types - regular users who log on to any of the four R/3 application servers
and privileged users who use the central instance (CI) server - finish their session within
the first 30 minutes in roughly 75% of all cases. On the four application servers, users
are logged off automatically after 30 minutes of inactivity; on the CI server, no idle
timeout is activated. The idle timeout seems to work effectively, as the bend for the
regular users at 30 minutes in the CCDF plot and the peak in the density plot show.
Only 0.015% of all regular user sessions exceed 8 hours, whereas 1.0% of the privileged
user sessions last longer than 8 hours. A very small number of privileged user sessions -
215 in total, or 0.6% - exceed 10 hours, and 64 sessions (0.2%) have been observed to
persist longer than 24 hours.
regular users privileged users
minimum 0s 0s
1st quartile 1.9m 2.6m
median 7.1m 9.9m
3rd quartile 31.5m 29.1m
maximum 10.5h 75.6h
Table 5.2: Distribution of session duration
5.2.2 Time of day profile
Figure 5.6: Time of day distribution of SAPGUI sessions
The two plots on the left side of Figure 5.6 are for the regular users, the plots on the
right side for the privileged user sessions. The upper plots are 2D histograms: for each
dot, the start time of a session is shown on the y-axis and the end time on the x-axis.
The shade of gray represents the number of sessions that fall into the corresponding
bin. The plots below are density plots of the start time (solid line) and end time
(dotted line); these plots are projections of the 2D histograms onto the y-axis and
x-axis.
All plots show a clear time of day profile: roughly 90% of all sessions are established
during normal business hours (8-18 o'clock). The window is tighter for the privileged
users than for the regular users. This could be a result of less statistical spread because
of the smaller number of users. Furthermore, the lunch break is noticeable. There
seems to be an asymmetry between morning and afternoon for the regular users which
is not observed for the privileged users (a possible explanation is given in Section 5.2.3).
Another interesting effect can be found in the 2D histogram for the privileged users:
around 17:30 there is an accumulation of session ends, independently of the start time
(a slight vertical line can be identified in the plot). An explanation could be that many
users leave the office simultaneously in the late afternoon.
5.2.3 Day of week profile
Figure 5.7: Histograms of distribution over weekdays
Figure 5.7 shows histograms of the number of sessions per weekday. The left diagram
shows the sessions of regular users, the right diagram those of privileged users.
Both plots show the expected distribution: most users use the R/3 system from
Monday until Friday; only a very small number of users connects on the weekend. On
Mondays, however, there is a peak in the number of sessions of regular users.
The plots in Figure 5.8 show the density of the session start time broken down per
weekday. The left plot clearly shows that more regular users connect on Monday
morning. All other weekdays are balanced between morning and afternoon. This
explains the peak in Figure 5.7 and is so dominant that it was still noticeable in
Figure 5.6. A possible reason could be that occasional users prefer to do their weekly
tasks in SAP R/3 on Monday morning. The diagram for the privileged users, however,
does not show the same effect and is also less homogeneous across the weekdays.
Figure 5.8: Time of day distribution of SAPGUI sessions on different weekdays
5.2.4 Data volume
Figure 5.9: CCDF plot of volume distribution
The plot in Figure 5.9 shows the distribution of data volume. It is a CCDF plot: the
abscissa shows the flow volume in bytes, log scaled in order to represent both high and
low volume sessions well; the ordinate shows the complementary cumulative
distribution and is log scaled as well.
All curves show a similar behavior. The pair of outer lines represents both directions
of the privileged users, the inner curves represent the regular users. For both regular
and privileged users, more traffic seems to be sent by the server than to the server.
However, the factor seems to differ; this is examined in more detail in Section 5.2.5.
All four curves seem to be consistent with Pareto's principle and appear to follow a
power-law distribution: for all four curves, the largest 25% of the sessions account for
75% (±2%) of the traffic volume.
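A heavy-tail figure of this kind can be computed with a small helper like the one below; this is an illustrative sketch, and the session volumes in the example are invented.

```python
def top_share(volumes, fraction=0.25):
    """Fraction of the total volume carried by the largest `fraction`
    of the sessions."""
    v = sorted(volumes, reverse=True)
    k = max(1, int(len(v) * fraction))
    return sum(v[:k]) / sum(v)

print(top_share([97, 1, 1, 1]))   # 0.97: one session of four carries 97%
```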
5.2.5 Traffic symmetry
Figure 5.10: Ratio of client-to-server and server-to-client traffic
Figure 5.10 shows the symmetry of the data volume in SAPGUI sessions. The x-axis
represents the quotient of the volume sent by the server and the volume it received in a
particular session; the y-axis shows the density.
Like most client-server applications, SAP R/3 sends a higher volume of data to the
users than it receives. Commonly, small requests result in responses that are
significantly larger. The factor seems to depend on the user behavior. As already
indicated in the previous section, there seems to be a significant difference between
regular and privileged users: for regular users, the volume sent by the server is 3.7-4.8
times higher than the volume sent to the server in 50% of all collected sessions; for the
privileged users, this factor is between 9.3 and 18.5 for 50% of all sessions.
regular users privileged users
minimum 1/20069 1/2079
1st quartile 3.7 9.3
median 4.1 13.5
3rd quartile 4.8 18.5
maximum 378569 90069
Table 5.3: Distribution of in-out ratio
5.2.6 Packet size
Figure 5.11: Density plot of average packet sizes
Figure 5.11 shows density plots of the packet size distribution. As NetFlow does not
provide information on packet level granularity, the average packet size per session is
shown. The average packet size is shown on the x-axis and the density on the y-axis.
The left plot shows the average packet sizes for the packets sent to the server, while the
right plot shows them for the packets sent by the server.
Not surprisingly, the average packet size for packets sent to the server is smaller than for
packets sent by the server. The factor between both directions is in the same order of
magnitude as for the total data volume. Furthermore, the average size of packets sent
from the server to privileged users is larger than that of packets sent to regular users. It
seems that privileged users tend to perform more bulk data transfers.
An interesting effect is that the average size of packets sent by privileged users is smaller
than the size of packets sent by regular users. This does not necessarily mean that
the requests have different sizes, but may be caused by the TCP protocol itself. During
bulk data transfers, the TCP/IP stack confirms proper reception by sending pure TCP
acknowledgment packets, which do not transport any payload.
              regular users           privileged users
              to server  from server  to server  from server
minimum       40         40           40         40
1st quartile  173        675          102        958
median        203        775          126        1113
3rd quartile  248        879          156        1212
maximum       1384       1427         351        1454
Table 5.4: Distribution of average packet sizes
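Because NetFlow only exports per-flow byte and packet counters, the per-session averages above are derived as bytes divided by packets, per direction. A hedged sketch of that derivation (Python, with an invented session record layout):

```python
def avg_packet_sizes(session):
    """Average packet size (bytes) per direction of a reassembled session.
    NetFlow byte counters include the 40-byte IP+TCP headers, which is why
    a stream of pure ACKs shows up as a 40-byte average."""
    to_server = session["bytes1"] / session["pkts1"]
    from_server = session["bytes2"] / session["pkts2"]
    return to_server, from_server

# Toy counters chosen to reproduce the regular-user medians (203 / 775).
sess = {"bytes1": 4060, "pkts1": 20, "bytes2": 15500, "pkts2": 20}
print(avg_packet_sizes(sess))   # prints (203.0, 775.0)
```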
5.2.7 Bandwidth usage
[Figure: four log-log 2D histograms of session duration (seconds) against data volume
(bytes) - volume sent to the server and volume sent by the server, each for regular and
privileged user sessions; diagonal guide lines mark constant average bandwidths from
0.1 bit/s to 10 Mbit/s.]
Figure 5.12: Bandwidth usage of SAPGUI sessions
The plots in Figure 5.12 are 2D histograms that show the distribution of session duration
in relation to the data volume. For each dot, the session volume is shown on the x-axis
and the corresponding duration on the y-axis. The shade of gray represents the number
of sessions that fall into the corresponding bin. Both axes are log scaled in order to
represent short and long sessions well at the same time. As the average bandwidth used
by a session is the quotient of data volume and session duration, all sessions with the
same average bandwidth lie on the same line with a slope of 1, as indicated in the plots.
All plots show that the bandwidth usage is quite homogeneous, although some users
have a high-speed LAN connection to the servers. However, there is a difference in the
used bandwidth depending on the direction of the connection: 50% of all users (regular
as well as privileged) did not exceed 0.5 kbit/s for data sent to the server (90% are
below 2.1 kbit/s), and 50% of all users did not receive data from the server at a rate
higher than 3.0 kbit/s (90% of the regular users are below 8.3 kbit/s, while 90% of the
privileged users are below 19.8 kbit/s).
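The diagonal guide lines in Figure 5.12 follow from bandwidth = 8 · volume / duration, so constant bandwidth is a line of slope 1 on log-log axes. A small sketch of the two quantities being binned (Python, toy values; bin granularity is an assumption, not the plotting code actually used):

```python
import math

def avg_bandwidth_bits(volume_bytes, duration_secs):
    """Average session bandwidth in bit/s."""
    return 8 * volume_bytes / duration_secs

def loglog_bin(volume_bytes, duration_secs, bins_per_decade=1):
    """2D histogram cell of a session on log-scaled axes
    (one bin per decade by default)."""
    bx = math.floor(math.log10(volume_bytes) * bins_per_decade)
    by = math.floor(math.log10(duration_secs) * bins_per_decade)
    return bx, by

# A 1800 s session that received 675 kB averages 3 kbit/s:
print(avg_bandwidth_bits(675_000, 1800))   # prints 3000.0
print(loglog_bin(675_000, 1800))           # prints (5, 3)
```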
5.2.8 Load distribution
[Figure: bar plots of the average number of sessions per day (left) and the average data
volume in MB per day (right) for the servers CI, App1, App2, App3, and App4.]
Figure 5.13: Load distribution over all servers
The plots in Figure 5.13 show the load on the different servers; the ordinate shows the
number of sessions in the left plot and the data volume (sum of both directions) in the
right plot. Each bar represents the load of a particular server: the first bar is the central
instance (CI), which may only be used by privileged users, and the following bars
App1-App4 are the separate application servers, which are used by the regular users.
As both plots show, the load balancing mechanism of SAP R/3 distributes the load
evenly over all servers with regard to both the number of sessions and the transferred
data volume.
5.2.9 User ranking
[Figure: log-log plot of the number of sessions, normalized by subnet size, against the
subnet rank.]
Figure 5.14: Subnet ranking by session frequency (log scaled)
Figure 5.14 shows a ranking of subnets with regard to the number of sessions that are
initiated from each subnet. The ranking was not done on an IP address level, because
of instability in the assignments caused by DHCP. Instead, the IP addresses have been
assigned to their subnets according to the IP routing table of the layer 3 switches. As the
subnets are not equally sized, the number of sessions has been normalized by the subnet
size. The rank is shown on the abscissa, while the normalized number of sessions is
shown on the ordinate. Both axes are log scaled.
As indicated in the plot, all points from the subnet with the highest frequency down to
the subnet ranked 100th are close to a straight line. This observation is consistent with
Zipf's law. Beyond the 100th-ranked subnet, the curve plunges down. The existence of
such a boundary is still in line with other observations of Zipf's law.
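Zipf's law predicts frequency proportional to 1/rank^s, i.e. a straight line of slope -s on log-log axes. The normalization and the slope check described above can be sketched as follows (Python for illustration; the subnet data is invented, and a least-squares fit is one reasonable way to estimate the slope, not necessarily the one used for the figure):

```python
import math

def rank_frequency(counts_by_subnet, hosts_by_subnet):
    """Session counts normalized by subnet size, sorted into a ranking."""
    norm = {s: counts_by_subnet[s] / hosts_by_subnet[s] for s in counts_by_subnet}
    return sorted(norm.values(), reverse=True)

def zipf_slope(freqs):
    """Least-squares slope of log(freq) vs. log(rank); about -1 for
    classic Zipf behavior."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

freqs = [100 / r for r in range(1, 51)]   # ideal Zipf data
print(round(zipf_slope(freqs), 3))        # prints -1.0
```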
5.3 Printer/LPD
In this section, the traffic between the two print servers and the printers is analyzed for
the following aspects:
• Session duration
• Data volume
• Load distribution
The analyzed dataset consists of 26485 TCP sessions, which have been reassembled as
described in the previous chapter. These TCP sessions have a volume of 5.0 million
packets or 2.9 gigabytes.
5.3.1 Session duration
[Figure: CCDF plot (left, log-log) and density plot (right, log x-axis) of printer TCP
connection durations in seconds.]
Figure 5.15: CCDF and density plot of flow duration printers (log scales)
Both plots in Figure 5.15 show the distribution of print server session durations. The
left plot is a CCDF plot: the x-axis shows the flow duration and is log scaled in order
to represent short and long sessions well at the same time; the y-axis shows the
complementary cumulative distribution and is log scaled as well. The right plot shows
the density function of the same dataset.
The print server sessions seem to fall into two classes, very short sessions and sessions
with a duration of several minutes, as the density plot indicates. A significant portion
of the sessions is extremely short: 60% of all sessions have a duration of less than 10
seconds. Furthermore, 25% of the sessions have a duration between 5 and 60 minutes,
while less than 4% are in the range between 10 seconds and 5 minutes.
A possible explanation is that some printers are directly connected to the print server
(which results in the short sessions that represent a single print job), while others are
daisy-chained behind remote print servers that act as a hub. Sessions to these remote
print servers may sit idle for some time and be used sporadically for different print jobs
without establishing new connections. However, this hypothesis has not been verified, as
the data does not reveal whether a session is used only for a single printer/print job or
whether the session is used multiple times.
5.3.2 Data volume
[Figure: CCDF plot (left) and density plot (right) of session volume in bytes, with
separate curves for the directions print queue server -> printer and printer -> print
queue server; log-scaled volume axes.]
Figure 5.16: Volume of printer traffic
The plots in Figure 5.16 show the distribution of the data volume caused by print server
sessions. The left plot is a CCDF plot: the x-axis shows the flow volume in bytes and is
log scaled in order to represent high-volume as well as low-volume sessions well; the
y-axis shows the complementary cumulative distribution and is log scaled as well. The
right plot shows the density function of the same dataset.
The plots and the summary in Table 5.5 show that the data volume sent by the print
server is significantly higher than the volume it receives. This is not surprising, as a
print server does not receive much feedback from the printers. In general, printers just
send some status messages (like 'ready', 'printer offline', or 'out of paper'), or solely send
TCP acknowledgment packets to confirm that the data has been received properly.
The CCDF plots of both directions approximately follow a straight line from a certain
point on. This observation is consistent with a Pareto distribution. One of the implications
of Pareto's law is that low-frequency events can cumulatively outweigh high-frequency
events.
              to printer   from printer
minimum       40           40
1st quartile  8k           445
median        16k          605
3rd quartile  37k          1250
maximum       140M         3.3M
Table 5.5: Distribution of data volume on printer sessions (in bytes)
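A straight CCDF tail on log-log axes corresponds to P[X > x] ~ (x/x_min)^(-alpha). The empirical CCDF behind such plots is straightforward to compute; an illustrative Python sketch (not the thesis tooling, and ignoring tied values for simplicity):

```python
def ccdf(samples):
    """Empirical complementary CDF: (x, P[X > x]) for each sorted sample."""
    v = sorted(samples)
    n = len(v)
    return [(x, (n - i - 1) / n) for i, x in enumerate(v)]

points = ccdf([100, 200, 400, 800])
print(points)   # prints [(100, 0.75), (200, 0.5), (400, 0.25), (800, 0.0)]
```

Plotting these points with both axes log scaled reproduces the kind of curve shown in Figure 5.16; a Pareto-like tail appears as a straight segment.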
5.3.3 Packet size
[Figure: density plots of average packet sizes sent to the server (left, 0-80 bytes) and
sent by the server (right, 0-1500 bytes).]
Figure 5.17: Average packet sizes of print sessions
Figure 5.17 shows density plots of the packet size distribution. As NetFlow does not
provide information on packet level granularity, the average packet size per session is
shown. The average packet size is shown on the x-axis and the density on the y-axis.
The left plot shows the average packet size of packets received by the server, while the
right plot shows the average packet size of packets sent by the server.
As mentioned in the previous section, printers, as pure output devices, rarely send data
to the server. Most of the time, they just respond with TCP acknowledgment packets
without any payload. This is reflected by an average packet size of 40 bytes in the left
plot.
The right plot shows that the average packet size of packets sent by the server is widely
spread, with two peaks around 420 bytes/packet and 775 bytes/packet. The reasons for
these two accumulation points are unclear. Although a correlation to the session length
suggests itself, this could not be confirmed by the data.
Chapter 6
Conclusions, Open Problems,
and Outlook
Enterprise environments have some characteristics that differ from observations in the
public Internet. The IT infrastructure tends to be more centralized in enterprises and
runs under a higher degree of control (which is often a result of corporate policies). Several
business applications use their own proprietary protocols, which are rarely seen in the
Internet. Although these applications may be vital for the operation of the company, the
network traffic characteristics of these applications are rarely analyzed.
This work provides some insights into the characteristics of SAP R/3 as one of the
most popular ERP systems. Even though the volume of collected raw data was tremendous,
it was possible to extract the targeted portion with a small but effective set of filters. The
reassembly revealed that there is no single time-out that fits all traffic. Especially if
supporting information such as the TCP flags is not available, an adjustment of the
time-out per protocol may be necessary.
The analysis of the Message Server sessions did not yield many surprises. The sessions
have a short duration and low volume. If all data fit into one packet, UDP would
be an alternative to TCP. This would save some protocol overhead and improve the
performance by at least one round-trip time for the end user. But the Message Server
is contacted just once when a user tries to establish a session to SAP R/3, so the
performance benefit would hardly be noticed.
More interesting results have been discovered during the analysis of the SAPGUI sessions.
In general, the SAPGUI sessions show a strong time-of-day and day-of-week profile. Typical
events, such as peak hours, lunch breaks, or weekends, can be clearly identified.
The higher session volume during Monday morning is interesting. One explanation may
be that occasional users prefer to perform their weekly tasks (such as time recording) on
Monday morning.
Only a very small number of sessions lasts longer than 8 hours, and virtually no sessions
last longer than 10 hours, except for privileged users. The time-out seems to be very
effective: a significant number of sessions of regular users seems to end after 30 minutes.
It seems that SAP R/3 uses the network bandwidth very economically. The utilized
bandwidth was quite low, and although there are users that have a high-speed connection
to the R/3 system, it seems that this capacity is - at least in the long term - not utilized.
For SAPGUI sessions, as well as for print sessions, the volume of transferred data seems
to be consistent with Pareto's law. This indicates long-range dependencies as described
in [UB01b]. A similar behavior has been described in several studies about web traffic in
the Internet, or file sizes on FTP servers. This has a significant impact on load modeling:
events with a low frequency may dominate the overall behavior. Furthermore, the ranking
of users seems to follow Zipf's law, which is also a power-law distribution that is tightly
related to Pareto's law.
The framework provided in this work can be reused to preprocess collected raw data in
similar studies. The filtering, flow defragmentation, and reassembly are very generic and
have been shown to work effectively.
Although some interesting profiles and phenomena could be shown, additional data and
further analysis would be required to provide evidence. For future work, it would also be
of substantial interest to compare the observed patterns with other protocols or with SAP
R/3 systems in different environments. Another aspect is the intra-session behavior, which
can be examined on a packet level or using the application log files.
Appendix A
Patch for perl module ’Cflow’
*** orig/Cflow.pm   Thu Jan 31 02:08:21 2002
--- Cflow.pm        Tue Dec 13 12:49:58 2005
***************
*** 344,349 ****
--- 344,356 ----
      $ICMPCode
      $duration_secs
      $duration_msecs
+     $cap_start
+     $cap_end
+     $flows_count
+     $flows_lost
+     $flows_misordered
+     $pkts_corrupt
+     $seq_reset
  ) ],
  tcpflags => [ qw(
*** orig/Cflow.xs   Thu Jan 31 07:07:19 2002
--- Cflow.xs        Tue Dec 13 15:58:22 2005
***************
*** 178,184 ****
  #define END_MSECS       27
  #define DURATION_SECS   28
  #define DURATION_MSECS  29
! #define RESERVED        30  /* placeholder for last var */
  static SV *vars[RESERVED],
            *wanted;
--- 178,191 ----
  #define END_MSECS         27
  #define DURATION_SECS     28
  #define DURATION_MSECS    29
! #define CAP_START         30
! #define CAP_END           31
! #define FLOWS_COUNT       32
! #define FLOWS_LOST        33
! #define FLOWS_MISORDERED  34
! #define PKTS_CORRUPT      35
! #define SEQ_RESET         36
! #define RESERVED          37  /* placeholder for last var */
  static SV *vars[RESERVED],
            *wanted;
***************
*** 593,599 ****
  if (2 > items ||
      !SvROK(wanted) ||
      SVt_PVCV != SvTYPE(SvRV(wanted))) {
!     croak("Usage: find(CODEREF, [CODEREF], FILE [...])");
  }
  if (SvROK(ST(arg)) &&
--- 600,608 ----
  if (2 > items ||
      !SvROK(wanted) ||
      SVt_PVCV != SvTYPE(SvRV(wanted))) {
!     if (wanted != &PL_sv_undef) {  /* allow calls without 'wanted' */
!         croak("Usage: find(CODEREF, [CODEREF], FILE [...])");
!     }
  }
  if (SvROK(ST(arg)) &&
***************
*** 635,640 ****
--- 644,657 ----
  vars[ICMPTYPE] = perl_get_sv("Cflow::ICMPType", TRUE);
  vars[ICMPCODE] = perl_get_sv("Cflow::ICMPCode", TRUE);
+ vars[CAP_START]        = perl_get_sv("Cflow::cap_start", TRUE);
+ vars[CAP_END]          = perl_get_sv("Cflow::cap_end", TRUE);
+ vars[FLOWS_COUNT]      = perl_get_sv("Cflow::flows_count", TRUE);
+ vars[FLOWS_LOST]       = perl_get_sv("Cflow::flows_lost", TRUE);
+ vars[FLOWS_MISORDERED] = perl_get_sv("Cflow::flows_misordered", TRUE);
+ vars[PKTS_CORRUPT]     = perl_get_sv("Cflow::pkts_corrupt", TRUE);
+ vars[SEQ_RESET]        = perl_get_sv("Cflow::seq_reset", TRUE);
+
  for (; arg < items; arg++) {
      size_t len;
      char *namep;
***************
*** 681,686 ****
--- 698,711 ----
      sv_setuv(vars[ENGINE_TYPE], 0);
      sv_setuv(vars[ENGINE_ID], 0);
      /* } */
+     /* make flow-tools header data available */
+     sv_setuv(vars[CAP_START],        htonl(fs.fth.cap_start));
+     sv_setuv(vars[CAP_END],          htonl(fs.fth.cap_end));
+     sv_setuv(vars[FLOWS_COUNT],      htonl(fs.fth.flows_count));
+     sv_setuv(vars[FLOWS_LOST],       htonl(fs.fth.flows_lost));
+     sv_setuv(vars[FLOWS_MISORDERED], htonl(fs.fth.flows_misordered));
+     sv_setuv(vars[PKTS_CORRUPT],     htonl(fs.fth.pkts_corrupt));
+     sv_setuv(vars[SEQ_RESET],        htonl(fs.fth.seq_reset));
  }
  } else {  /* FIXME - ifdef OSU, reading cflowd from stdin? */
      /* assume the file is in cflowd's v5 format: */
***************
*** 903,908 ****
--- 928,935 ----
  #ifdef OSU /* [ */
  }
  #endif /* ] */
+ if (wanted == &PL_sv_undef) { break; }
+
  total++;
  ENTER;
Appendix B
Source code for ’create-filter’
#!/opt/perl/bin/perl
use FindBin;
use Getopt::Std;

if ( !getopts('o:hv') || !$opt_o || $opt_h || !(-d $opt_o) ) {
    print STDERR <<EOF
usage: $FindBin::Script [-h] [-v] -o outputdir
    -h        - shows this usage information (mnemonic: 'h'elp)
    -v        - verbose - show warnings (mnemonic: 'v'erbose)
    -o outdir - output directory; the following files will be
                created here
                    o filter.out
                    o proc.positive-filter
                    o proc.negative-filter
EOF
;
    exit( $opt_h ? 0 : -1 );
}

while (<>) {
    if ( /^host\s+(\S+)\s+$/ )
        { $hostname = $1; }
    elsif ( /^\s+filter\s+(\d+\.\d+\.\d+\.\d+(:(\d+)?(:(\d+\.\d+\.\d+\.\d+)?(:\d+)?)?)?)\s+$/ )
    {
        my @filter = split /:/, $1;
        $filter{$hostname}[ $filteridx{$hostname} ]{localip}    = $filter[0];
        $filter{$hostname}[ $filteridx{$hostname} ]{localport}  = $filter[1];
        $filter{$hostname}[ $filteridx{$hostname} ]{remoteip}   = $filter[2];
        $filter{$hostname}[ $filteridx{$hostname} ]{remoteport} = $filter[3];
        $filteridx{$hostname}++;
    }
    elsif ( /^\s*(#.*)?$/ )
        { }
    else
        { print STDERR "Could not interpret: $_" if $opt_v; exit(-1); }
}

open FILTER_OUT,    ">$opt_o/filter.out";
open PROC_POSITIVE, ">$opt_o/proc.positive-filter";
open PROC_NEGATIVE, ">$opt_o/proc.negative-filter";

$first = 1;   # first statement in PROC_NEGATIVE
foreach $hostname ( keys %filter ) {
    foreach $i ( 0 .. scalar @{ $filter{$hostname} } - 1 ) {
        if ( !exists $ipfilter{ $filter{$hostname}[$i]{localip} }
             && $filter{$hostname}[$i]{localip} ) {
            print FILTER_OUT "filter-primitive ip-$filter{$hostname}[$i]{localip}\n"
                . "  type ip-address\n  permit $filter{$hostname}[$i]{localip}\n\n";
            $ipfilter{ $filter{$hostname}[$i]{localip} }++;
        }
        if ( !exists $ipfilter{ $filter{$hostname}[$i]{remoteip} }
             && $filter{$hostname}[$i]{remoteip} ) {
            print FILTER_OUT "filter-primitive ip-$filter{$hostname}[$i]{remoteip}\n"
                . "  type ip-address\n  permit $filter{$hostname}[$i]{remoteip}\n\n";
            $ipfilter{ $filter{$hostname}[$i]{remoteip} }++;
        }
        if ( !exists $portfilter{ $filter{$hostname}[$i]{localport} }
             && $filter{$hostname}[$i]{localport} ) {
            print FILTER_OUT "filter-primitive port-$filter{$hostname}[$i]{localport}\n"
                . "  type ip-port\n  permit $filter{$hostname}[$i]{localport}\n\n";
            $portfilter{ $filter{$hostname}[$i]{localport} }++;
        }
        if ( !exists $portfilter{ $filter{$hostname}[$i]{remoteport} }
             && $filter{$hostname}[$i]{remoteport} ) {
            print FILTER_OUT "filter-primitive port-$filter{$hostname}[$i]{remoteport}\n"
                . "  type ip-port\n  permit $filter{$hostname}[$i]{remoteport}\n\n";
            $portfilter{ $filter{$hostname}[$i]{remoteport} }++;
        }
        $filterdefinition{$hostname}[$i] .=
            "  match ip-source-address ip-$filter{$hostname}[$i]{localip}\n"
            if $filter{$hostname}[$i]{localip};
        $filterdefinition{$hostname}[$i] .=
            "  match ip-source-port port-$filter{$hostname}[$i]{localport}\n"
            if $filter{$hostname}[$i]{localport};
        $filterdefinition{$hostname}[$i] .=
            "  match ip-destination-address ip-$filter{$hostname}[$i]{remoteip}\n"
            if $filter{$hostname}[$i]{remoteip};
        $filterdefinition{$hostname}[$i] .=
            "  match ip-destination-port port-$filter{$hostname}[$i]{remoteport}\n"
            if $filter{$hostname}[$i]{remoteport};
        $filterdefinition{$hostname}[$i] .= "  or\n";
        $filterdefinition{$hostname}[$i] .=
            "  match ip-destination-address ip-$filter{$hostname}[$i]{localip}\n"
            if $filter{$hostname}[$i]{localip};
        $filterdefinition{$hostname}[$i] .=
            "  match ip-destination-port port-$filter{$hostname}[$i]{localport}\n"
            if $filter{$hostname}[$i]{localport};
        $filterdefinition{$hostname}[$i] .=
            "  match ip-source-address ip-$filter{$hostname}[$i]{remoteip}\n"
            if $filter{$hostname}[$i]{remoteip};
        $filterdefinition{$hostname}[$i] .=
            "  match ip-source-port port-$filter{$hostname}[$i]{remoteport}\n"
            if $filter{$hostname}[$i]{remoteport};
        print FILTER_OUT "filter-definition $hostname-$i\n$filterdefinition{$hostname}[$i]\n";
        print FILTER_OUT "filter-definition $hostname-$i-a\n  invert\n";
        print FILTER_OUT "  match ip-source-address ip-$filter{$hostname}[$i]{localip}\n"
            if $filter{$hostname}[$i]{localip};
        print FILTER_OUT "  or\n"
            if ( $filter{$hostname}[$i]{localip} && $filter{$hostname}[$i]{localport} );
        print FILTER_OUT "  match ip-source-port port-$filter{$hostname}[$i]{localport}\n"
            if $filter{$hostname}[$i]{localport};
        print FILTER_OUT "  or\n"
            if ( ( $filter{$hostname}[$i]{localip} || $filter{$hostname}[$i]{localport} )
                 && $filter{$hostname}[$i]{remoteip} );
        print FILTER_OUT "  match ip-destination-address ip-$filter{$hostname}[$i]{remoteip}\n"
            if $filter{$hostname}[$i]{remoteip};
        print FILTER_OUT "  or\n"
            if ( ( $filter{$hostname}[$i]{localip} || $filter{$hostname}[$i]{localport}
                   || $filter{$hostname}[$i]{remoteip} ) && $filter{$hostname}[$i]{remoteport} );
        print FILTER_OUT "  match ip-destination-port port-$filter{$hostname}[$i]{remoteport}\n"
            if $filter{$hostname}[$i]{remoteport};
        print FILTER_OUT "\nfilter-definition $hostname-$i-b\n  invert\n";
        print FILTER_OUT "  match ip-destination-address ip-$filter{$hostname}[$i]{localip}\n"
            if $filter{$hostname}[$i]{localip};
        print FILTER_OUT "  or\n"
            if ( $filter{$hostname}[$i]{localip} && $filter{$hostname}[$i]{localport} );
        print FILTER_OUT "  match ip-destination-port port-$filter{$hostname}[$i]{localport}\n"
            if $filter{$hostname}[$i]{localport};
        print FILTER_OUT "  or\n"
            if ( ( $filter{$hostname}[$i]{localip} || $filter{$hostname}[$i]{localport} )
                 && $filter{$hostname}[$i]{remoteip} );
        print FILTER_OUT "  match ip-source-address ip-$filter{$hostname}[$i]{remoteip}\n"
            if $filter{$hostname}[$i]{remoteip};
        print FILTER_OUT "  or\n"
            if ( ( $filter{$hostname}[$i]{localip} || $filter{$hostname}[$i]{localport}
                   || $filter{$hostname}[$i]{remoteip} ) && $filter{$hostname}[$i]{remoteport} );
        print FILTER_OUT "  match ip-source-port port-$filter{$hostname}[$i]{remoteport}\n"
            if $filter{$hostname}[$i]{remoteport};
        print FILTER_OUT "\n";
        if ( $first ) {
            print PROC_NEGATIVE "#!/bin/sh\n\n";
            print PROC_NEGATIVE "if [ ! -x \"\$FLOW_NFILTER\" ]; "
                . "then echo 'variable \$FLOW_NFILTER is not set properly!\\n' >&2 ;"
                . " exit; fi\n";
            print PROC_NEGATIVE "if [ ! -f \"\$FILTER\" ]; "
                . "then echo 'variable \$FILTER is not set properly!\\n' >&2 ;"
                . " exit; fi\n";
            print PROC_NEGATIVE "\$FLOW_NFILTER -f \$FILTER -F$hostname-$i-a \\\n |"
                . " \$FLOW_NFILTER -f \$FILTER -F$hostname-$i-b ";
            $first = 0;
        } else {
            print PROC_NEGATIVE " \\\n | \$FLOW_NFILTER -f \$FILTER -F$hostname-$i-a \\\n |"
                . " \$FLOW_NFILTER -f \$FILTER -F$hostname-$i-b ";
        }
    }
}

$first = 1;   # first statement in positive filter
print FILTER_OUT "filter-definition all\n";
foreach $hostname ( keys %filterdefinition ) {
    foreach $i ( 0 .. scalar @{ $filterdefinition{$hostname} } - 1 ) {
        if ( $first ) {
            $first = 0;
            print PROC_POSITIVE "#!/bin/sh\n\n";
            print PROC_POSITIVE "if [ ! -x \"\$FLOW_NFILTER\" ]; "
                . "then echo 'variable \$FLOW_NFILTER is not set properly!\\n' >&2 ;"
                . " exit; fi\n";
            print PROC_POSITIVE "if [ ! -f \"\$FILTER\" ]; "
                . "then echo 'variable \$FILTER is not set properly!\\n' >&2 ;"
                . " exit; fi\n";
            print PROC_POSITIVE "\$FLOW_NFILTER -f \$FILTER -Fall\n";
        } else {
            print FILTER_OUT " or\n";
        }
        print FILTER_OUT $filterdefinition{$hostname}[$i];
    }
}
Appendix C
Source code for ’defragment’
#!/opt/perl/bin/perl
use FindBin;
use Cflow qw(:flowvars :tcpflags :icmptypes :icmpcodes 1.041);
use Getopt::Std;
use IO::File;
use File::Basename;

getopts('t:p:v');
$inacttout = $opt_t ? $opt_t : 600;
$filerotationperiod = $opt_p ? $opt_p : 15*60;
Cflow::verbose($opt_v);

print "srcIP\tdstIP\tprotocol\tsrcport\tdstport" .
      "\tbytes\tpkts\tflows\tstart_time\tidle_time\tend_time\n";
Cflow::find( \&wanted, \&perfile, (-1 != $#ARGV) ? @ARGV : '-' );
foreach $i ( keys %flows ) { expire($i); }
exit 0;

sub wanted {
    $key = "$srcip\t$dstip\t$protocol\t$srcport\t$dstport";
    $idle = ( exists( $flows{$key}{end} ) ? $startime - $flows{$key}{end} : 0 );
    $idle = ( $idle > 0 ? $idle : 0 );
    if ( $idle > $inacttout ) {
        expire($key);
        $idle = 0;
    }
    $flows{$key}{bytes} += $bytes;
    $flows{$key}{pkts}  += $pkts;
    $flows{$key}{flows} ++;
    $flows{$key}{start}  = $startime if ( !exists( $flows{$key}{start} ) );
    $flows{$key}{idle}  += $idle;
    $flows{$key}{end}    = $endtime;
    $flows{$key}{fileid} = $fileid;
    print STDERR "$idle\t" . ($endtime - $startime) . "\n";
}

sub perfile {
    foreach $i ( keys %flows ) {
        expire($i) if ( ($fileid - $flows{$i}{fileid} - 2) * $filerotationperiod > $inacttout );
    }
    $fileid++;
}

sub expire {
    $key = shift @_;
    print "$key\t$flows{$key}{bytes}\t$flows{$key}{pkts}\t$flows{$key}{flows}\t" .
          "$flows{$key}{start}\t$flows{$key}{idle}\t$flows{$key}{end}\n";
    delete $flows{$key};
}
Appendix D
Source code for ’reassemble’
#!/opt/ pe r l / bin / pe r l
use FindBin ;
use Getopt : : Std ;
use IO : : F i l e ;
use F i l e : : Basename ;
getopts ( ’ t : s : v ’ ) ;
$ i na c t t ou t = $opt t ? $opt t : 24∗60∗60;
$opt s =˜ s /\ . /\\ . / g i f ( $opt s ) ;
p r in t ”IP1\ tIP2\ tp r o t o co l ” .
”\ tport1\ tport2\ tbytes1\ tbytes2 ” .
”\ tpkts1\ tpkts2\ t s t a r t t ime \ tend t ime\n ” ;
# read header l i n e
my $ l i n e = <>;
i f ( ! $ l i n e ) {
die ”Empty input f i l e !\n ” ;
}
chomp $ l i n e ;
my @head = s p l i t /\ s∗\ t\ s ∗/ , $ l i n e ;
# Safety t e s t ! ! ! !
i f ( s c a l a r (@head) < 8) {
die ” I l l e g a l input f i l e !\n ” ;
}
# proce s s data l i n e by l i n e
whi le ( $ l i n e=<> )
{
$n++;
chomp $ l i n e ;
next i f $ l i n e =˜ /ˆ\ s∗$ / ; # sk ip empty l i n e s
my @info = s p l i t /\ s∗\ t\ s ∗/ , $ l ine ,−1;
my %in f o ;
f o r (my $n=0;$n<=$#head ; $n++) { $ in f o {$head [ $n ]} = $in f o [ $n ] ; }
i f ( $n % 10000 == 0 ) {
f o r each $ i ( keys %f lows ) {
i f ( $ i n f o { s t a r t t ime } > $ f lows{ $ i }{ end time} + $ inac t t ou t )
{ e x p i r e h a l f ( $i , ” o ld ” ) ; }
}
}
i f ( ( $opt s && ( $ in f o {dst IP} =˜ $opt s ) )
| | ( $opt s && ! ( ( $ i n f o { s r c IP } =˜ $opt s ) | | ( $ in f o { s r c p o r t } > $ in f o { ds t po r t } ) ) )
| | ( ! $opt s && ( $ in f o { s r c p o r t } < $ in f o { ds t po r t } ) ) )
{
$key = ” $ in f o { s r c IP }\ t $ i n f o {dst IP }\ t $ i n f o {pro toco l }\ t $ i n f o { s r c p o r t }\ t $ i n f o { ds t po r t }”;
i f ( e x i s t s ( $ f lows{$key}{bytes1} ) )
64
APPENDIX D. SOURCE CODE FOR ’REASSEMBLE’ 65
    {
      expire_half($key, "dup");
    }
    if (exists($flows{$key}{start_time})
        && ($flows{$key}{start_time} > $info{end_time}+10
            || $info{start_time} > $flows{$key}{end_time}+10))
    {
      expire_half($key, "tim");
    }
    $flows{$key}{bytes1} = $info{bytes};
    $flows{$key}{pkts1}  = $info{pkts};
    if ($flows{$key}{start_time}) {
      $flows{$key}{start_time} = $info{start_time}
        if ($info{start_time} < $flows{$key}{start_time});
      $flows{$key}{end_time} = $info{end_time}
        if ($info{end_time} > $flows{$key}{end_time});
    } else {
      $flows{$key}{start_time} = $info{start_time};
      $flows{$key}{end_time}   = $info{end_time};
    }
    if ($flows{$key}{bytes2}) { expire($key); }
  }
  else
  {
    $key = "$info{dst_IP}\t$info{src_IP}\t$info{protocol}\t$info{dst_port}\t$info{src_port}";
    if (exists($flows{$key}{bytes2}))
    {
      expire_half($key, "dup");
    }
    if (exists($flows{$key}{start_time})
        && ($flows{$key}{start_time} > $info{end_time}+10
            || $info{start_time} > $flows{$key}{end_time}+10))
    {
      expire_half($key, "tim");
    }
    $flows{$key}{bytes2} = $info{bytes};
    $flows{$key}{pkts2}  = $info{pkts};
    if ($flows{$key}{start_time}) {
      $flows{$key}{start_time} = $info{start_time}
        if ($info{start_time} < $flows{$key}{start_time});
      $flows{$key}{end_time} = $info{end_time}
        if ($info{end_time} > $flows{$key}{end_time});
    } else {
      $flows{$key}{start_time} = $info{start_time};
      $flows{$key}{end_time}   = $info{end_time};
    }
    if ($flows{$key}{bytes1}) { expire($key); }
  }
}

foreach $i (keys %flows) { expire_half($i, "lef"); }
exit 0;

sub expire {
  $key = shift @_;
  print "$key\t"
      . "$flows{$key}{bytes1}\t$flows{$key}{bytes2}\t"
      . "$flows{$key}{pkts1}\t$flows{$key}{pkts2}\t"
      . "$flows{$key}{start_time}\t$flows{$key}{end_time}\n";
  delete $flows{$key};
}

sub expire_half {
  $key = shift @_; $reason = shift @_;
  $flows{$key}{bytes1} += 0; $flows{$key}{bytes2} += 0;
  $flows{$key}{pkts1}  += 0; $flows{$key}{pkts2}  += 0;
  print STDERR "$key\t"
      . "$flows{$key}{bytes1}\t$flows{$key}{bytes2}\t"
      . "$flows{$key}{pkts1}\t$flows{$key}{pkts2}\t"
      . "$flows{$key}{start_time}\t$flows{$key}{end_time}\t$reason\n";
  delete $flows{$key};
}
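The script above pairs unidirectional flow records into bidirectional sessions by storing each record under a canonical ordering of its 5-tuple; when the reverse direction arrives, the two halves are merged and printed. The core pairing idea can be sketched in Python (a minimal illustration, not the thesis code; field names and the lower-port orientation heuristic mirror the script's default behavior when no -s server pattern is given):

```python
# Sketch of the pairing idea behind 'reassemble': each unidirectional
# flow is stored under a canonical 5-tuple key; when the reverse
# direction arrives, the two halves form one bidirectional session.

def canonical_key(flow):
    """Order the 5-tuple so both directions map to the same key.
    The lower port decides the orientation (side 1 = lower source
    port), mirroring the script's default without the -s option."""
    if flow["src_port"] < flow["dst_port"]:
        return (flow["src_IP"], flow["dst_IP"], flow["protocol"],
                flow["src_port"], flow["dst_port"]), 1
    return (flow["dst_IP"], flow["src_IP"], flow["protocol"],
            flow["dst_port"], flow["src_port"]), 2

def add_flow(sessions, flow):
    """Merge one unidirectional record into `sessions`; return the
    completed session once both directions are seen, else None."""
    key, side = canonical_key(flow)
    rec = sessions.setdefault(key, {})
    rec["bytes%d" % side] = flow["bytes"]
    rec["pkts%d" % side] = flow["pkts"]
    rec["start_time"] = min(rec.get("start_time", flow["start_time"]),
                            flow["start_time"])
    rec["end_time"] = max(rec.get("end_time", flow["end_time"]),
                          flow["end_time"])
    if "bytes1" in rec and "bytes2" in rec:
        return sessions.pop(key)
    return None
```

The real script additionally expires stale, duplicate, or leftover half-flows to standard error (tagged "old", "dup", "tim", "lef") and can classify the direction by a server IP pattern (-s) instead of the port heuristic.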
Bibliography
[Ada00]   Lada A. Adamic. Zipf, power-laws, and Pareto - a ranking tutorial, 2000. [Online; accessed 29-November-2005].

[Bed05]   Ann Bednarz. SAP digs in as Oracle revs up, 2005. [Online; accessed 10-October-2005].

[BP]      Paul Barford and Dave Plonka. Inferring client experience from flow-based measurements.

[BTI+03]  Chadi Barakat, Patrick Thiran, Gianluca Iannaccone, Christophe Diot, and Philippe Owezarski. Modeling Internet backbone traffic at the flow level. IEEE Transactions on Signal Processing, 51(8), August 2003.

[BTID]    Chadi Barakat, Patrick Thiran, Gianluca Iannaccone, and Christophe Diot. On Internet backbone traffic modeling.

[CB96]    Mark Crovella and Azer Bestavros. Self-similarity in World Wide Web traffic: Evidence and possible causes. In Proceedings of SIGMETRICS '96: The ACM International Conference on Measurement and Modeling of Computer Systems, Philadelphia, Pennsylvania, May 1996. Also in Performance Evaluation Review, May 1996, 24(1):160-169.

[CBP95]   Kimberly C. Claffy, Hans-Werner Braun, and George C. Polyzos. A parameterizable methodology for Internet traffic flow profiling. IEEE Journal on Selected Areas in Communications, 13(8):1481-1494, 1995.

[Cis05]   Cisco Systems. NetFlow overview, 2005.

[Com88]   Douglas E. Comer. Internetworking with TCP/IP: Principles, Protocols, and Architecture. Prentice Hall, 1988.

[EM]      Paul Embrechts and Makoto Maejima. An introduction to the theory of selfsimilar stochastic processes.

[ERVW02]  Ashok Erramilli, Matthew Roughan, Darryl Veitch, and Walter Willinger. Self-similar traffic and network dynamics. Proceedings of the IEEE, 90(5), 2002.
[Fel00]   Anja Feldmann. BLT: Bi-layer tracing of HTTP and TCP/IP. WWW9 / Computer Networks, 33(1-6):321-335, 2000.

[FGL+00]  Anja Feldmann, Albert G. Greenberg, Carsten Lund, Nick Reingold, Jennifer Rexford, and Fred True. Deriving traffic demands for operational IP networks: Methodology and experience. In SIGCOMM, pages 257-270, 2000.

[Jan05]   Susanne Janssen. Sizing guide: Front-end network requirements for mySAP business solutions. Technical report, SAP AG, 2005.

[JLM]     V. Jacobson, C. Leres, and S. McCanne. pcap - the packet capture library.

[JR86]    R. Jain and S. A. Routhier. Packet trains - measurements and a new model for computer network traffic. IEEE Journal on Selected Areas in Communications, 4(6):986-995, 1986.

[LCD04]   A. Lakhina, M. Crovella, and C. Diot. Characterization of network-wide anomalies in traffic flows, 2004.

[Lei]     S. Leinen. Flow-based traffic analysis at SWITCH.

[Loe97]   Siegfried Loeffler. Verwendung von Flows zur Analyse und Messung von Internet-Verkehr (Using flows to analyze and measure Internet traffic), 1997.

[LTWW93]  Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. On the self-similar nature of Ethernet traffic. In Deepinder P. Sidhu, editor, ACM SIGCOMM, pages 183-193, San Francisco, California, 1993.

[McL90]   Leo J. McLaughlin. RFC 1179: Line Printer Daemon Protocol, 1990.

[MH00]    Michael Missbach and Uwe M. Hoffmann. SAP Hardware Solutions: Server, Storage, and Networks for mySAP.com. Prentice Hall PTR, 2000.

[Mic]     Sun Microsystems. Sun Fire 15K overview.

[MS99]    M. Maejima and K. Sato. Semi-selfsimilar processes. J. Theoret. Probab., 12:347-383, 1999.

[MZ98]    Florian Matthes and Stephan Ziemer. Understanding SAP R/3: A tutorial for computer scientists. Technical report, Technical University Hamburg-Harburg, Germany, 1998.

[PF95]    Vern Paxson and Sally Floyd. Wide area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking, 3(3):226-244, 1995.

[Pos81]   John Postel. RFC 793: Transmission Control Protocol, 1981.

[RFL00]   Steve Romig, Mark Fullmer, and Ron Luman. The OSU Flow-tools package and Cisco NetFlow logs. In Proceedings of the 14th Systems Administration Conference (LISA 2000), 2000.
[SAP02]   SAP AG. SAP GUI technical infrastructure, 2002.

[SF02]    Robin Sommer and Anja Feldmann. NetFlow: Information loss or win? In Proceedings of the ACM SIGCOMM Internet Measurement Workshop (IMW) 2002. ACM Press, 2002.

[TWS97]   Murad S. Taqqu, Walter Willinger, and Robert Sherman. Proof of a fundamental result in self-similar traffic modeling. ACM Computer Communication Review, 27, 1997.

[UB01a]   S. Uhlig and O. Bonaventure. The macroscopic behavior of Internet traffic: A comparative study, 2001.

[UB01b]   S. Uhlig and O. Bonaventure. Understanding the long-term self-similarity of Internet traffic. Lecture Notes in Computer Science, 2156:286+, 2001.

[Wik05a]  Wikipedia. ACID — Wikipedia, The Free Encyclopedia, 2005. [Online; accessed 29-December-2005].

[Wik05b]  Wikipedia. Pareto distribution — Wikipedia, The Free Encyclopedia, 2005. [Online; accessed 29-November-2005].

[Wik06]   Wikipedia. Long-range dependency — Wikipedia, The Free Encyclopedia, 2006. [Online; accessed 01-March-2006].

[WPT98]   W. Willinger, V. Paxson, and M. S. Taqqu. Self-similarity and heavy tails: Structural modeling of network traffic. In A Practical Guide to Heavy Tails: Statistical Techniques and Applications, 1998.

[ZYD]     Xiaoyun Zhu, Jie Yu, and John Doyle. Heavy-tailed distributions, generalized source coding and optimal web layout design.
Acknowledgments