Index Group
Illinois Network Design and Experimentation

Intelligent Network Diagnostics

  1. Overview
  2. Research Agenda
  3. Participants
  4. Major Publications
  5. Funding Source
  6. Related Links

Overview

The rapid proliferation and integration of computer and network systems have connected infrastructures to one another in a complex web of interdependence. Even the network itself, e.g., the civilian Internet, is undergoing dramatic changes in its underlying technologies and services in order to keep up with surging demand from new users and applications. For example, when the World Wide Web went into operation, it consisted of two types of interacting machines: web servers and client machines (besides the DNS servers). When a user typed in a URL, the browser on the client machine fetched the web page and its associated objects directly from the web server. Over the past decade, however, several other tiers have been layered in between the servers and the clients and connected via the Internet. Today, web browsers have local caches, institutions maintain caching proxies, content distribution networks such as Akamai serve data directly to clients, and web services are hosted by server farms. A recently added layer of complexity is peer-to-peer web caching, which enables a client to fetch a cached copy of a web page from another client on a nearby network.

The upshot of these multiple levels of complexity is that when such a system is deployed, even incrementally, the end user (e.g., the browsing user) may not receive the best possible performance. Under certain circumstances, interdependence and pathological interactions among these levels could result in service outages. These outcomes run counter to the original motivation behind each of these tiers: to improve the end-user experience. Worse still, when the delivered performance is not acceptable, it is usually very difficult, if not impossible, for end users to infer which layer(s) are at fault.

We believe that end-user networking software that instruments itself based on lightweight network measurement and diagnosis can address many of the above concerns. Such intelligent networking software has the potential to give the end user the "best of all available worlds" experience at all times without unduly loading the network infrastructure with probes. In this project, we propose to develop the following software as a means to validate and explore this philosophy:

(i) end-to-end network measurement and diagnostics techniques to diagnose the causes for certain sub-optimal or even abnormal systems behaviors, allowing corrective actions to be taken in a timely fashion, and

(ii) on-line systems control mechanisms that leverage measurement and diagnostics results to configure/tune systems parameters and determine policies used in the communication subsystems.

Our intention in this project is to develop software solutions for a variety of networked applications, without the need to change any part of the established infrastructure.

Research Agenda (details under "Detailed Description of Research Agenda" below)

In this project, we aim to address key technical challenges of measurement/diagnostics-based systems control. Specifically we will carry out several innovative research tasks along the following two synergistic R&D thrusts:

(1) Network measurement and diagnostics.

(2) On-line systems control based on network measurement/diagnostics.

Participants

  • Dr. Indy Gupta (Affiliated faculty)
  • Srikanth Kandula (Ph.D. student, MIT)
  • Dr. Jong-Kwon Lee (Postdoc)

Major Publications

  • Jennifer Hou and Indy Gupta, "Design, implementation, and application of intelligent network diagnostics software," white paper.

Funding Source


Related Links

  • User-mode Linux (UML) kernel:

    Instrumenting the networking stack of a UML kernel and running the application agents within that kernel is an alternative mechanism for network diagnosis.

  • tcpdump:

    A popular tool that prints packet information, up to and including payload content, to standard output or a file. The packets are captured from the local network interfaces after they have been placed in promiscuous mode.


Detailed Description of Research Agenda

(1) Network measurement and diagnostics:

To develop an application-level measurement tool that allows direct access to fine-grained information across the protocol stack, we will make the bulk of the kernel networking stack available at the user level in the form of a library, with instrumentation to deliver notifications about events of interest (Figure 1). The network library will export an API through which either systems controllers or other applications can express their interest in events from a predefined set and provide the corresponding callback functions. Migrating the networking stack to the user space avoids reliance on a particular kernel configuration, or the presence of specific kernel support. It also allows applications to (a) retrieve protocol stack information (such as the round trip time and its variation, packet loss ratio and patterns, various timeout values) at any desirable fine granularity, and to (b) correlate lower-level network events to application behaviors that triggered them. Although the notion of a user-space protocol stack is not new, its use in comprehensive network measurement and diagnostics has not been explored.

Figure 1: Architecture diagram of the user-level protocol stack for network diagnostics.
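
As a rough illustration of the callback-based API described above, the following Python sketch shows how a controller might register interest in events from a predefined set. The class name DiagnosticsStack, the event names, and the fields passed to the callbacks are all hypothetical; the text does not specify the actual interface, and this toy only models the subscribe/notify pattern, not a protocol stack.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical predefined event set (the real event names are not given in the text).
EVENTS = {"rtt_sample", "packet_loss", "retransmit_timeout"}

Callback = Callable[[dict], None]


class DiagnosticsStack:
    """Toy stand-in for a user-level stack that notifies subscribers of events."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, event: str, callback: Callback) -> None:
        """Register a callback for one event from the predefined set."""
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        self._callbacks[event].append(callback)

    def emit(self, event: str, **details) -> int:
        """Invoked by the instrumented stack; returns how many callbacks ran."""
        callbacks = self._callbacks.get(event, [])
        for callback in callbacks:
            callback({"event": event, **details})
        return len(callbacks)


# Usage: an application collects RTT samples as the stack reports them.
stack = DiagnosticsStack()
rtt_samples_ms: List[float] = []
stack.subscribe("rtt_sample", lambda e: rtt_samples_ms.append(e["rtt_ms"]))
stack.emit("rtt_sample", rtt_ms=42.0)
```

Because each notification carries its details to the callback, an application could attach its own context (for example, which request was in flight) and so correlate lower-level events with the behavior that triggered them.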

In conjunction with the development of the network diagnostics library, we will also devise non-intrusive, light-weight measurement methods to track several network attributes, such as the available bandwidth, the amount of cross traffic on an end-to-end path, and packet loss statistics. These methods will be incorporated, along with existing traceroute, ping, tcpdump, and pathrate tools, into the user-space network diagnostics library, in order to provide another dimension of network information.
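
To make the idea of a lightweight end-to-end measurement concrete, here is a minimal Python sketch of packet-pair dispersion, a well-known technique for estimating the capacity of the bottleneck link on a path. It is offered only as an example of the kind of method in question, not as the method this project will devise; note that it estimates capacity rather than available bandwidth. The function name and the input format (receiver-side arrival timestamps in seconds) are assumptions made for illustration.

```python
from statistics import median
from typing import Sequence, Tuple


def packet_pair_capacity_bps(
    pairs: Sequence[Tuple[float, float]], packet_bytes: int
) -> float:
    """Estimate bottleneck capacity from back-to-back packet pairs.

    pairs: (arrival time of first packet, arrival time of second packet),
           in seconds, as observed at the receiver.
    packet_bytes: size of the second packet of each pair.
    """
    # The bottleneck spaces the two packets apart by the time it needs to
    # transmit one packet, so capacity ~ packet size / dispersion.
    gaps = [t2 - t1 for t1, t2 in pairs if t2 > t1]
    if not gaps:
        raise ValueError("need at least one pair with positive dispersion")
    # The median damps the effect of pairs disturbed by cross traffic.
    return packet_bytes * 8 / median(gaps)


# Usage: three pairs of 1500-byte packets; the third was delayed by cross traffic.
samples = [(0.0, 0.0012), (1.0, 1.0012), (2.0, 2.0030)]
estimate_bps = packet_pair_capacity_bps(samples, packet_bytes=1500)
```

With a dispersion of about 1.2 ms for 1500-byte packets, the estimate comes out near 10 Mbit/s; a few probe pairs suffice, which keeps the probing overhead small.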

With the abundant information available from the network library and network attributes measured by the end-to-end measurement methods, we will develop statistics-based strategies for comprehensive network diagnostics, and will describe the components of an open-source software system for networked systems diagnostics.


(2) On-line systems control based on network measurement/diagnostics:

Systems control encompasses a wide spectrum of operations, ranging from parameter tuning, to selection of alternative algorithms/mechanisms, to replacement/reconfiguration of network/systems components/modules. These operations may be performed in different layers across the protocol stack. As a proof of concept, we will demonstrate empirically the use of measurement/diagnostics-based control in three settings: computing a scheduling delay in IEEE 802.11 to improve system capacity; devising a smart browser that improves user response time based on network measurement and diagnostics results; and, more generally, facilitating decentralized distributed system design that is robust, adaptive, and responsive. In particular, we will leverage the network measurement and diagnostics system to build on the membership protocol SWIM and the distributed resource location and discovery system Kelips. Protocols for membership maintenance and distributed resource location are required by many distributed applications (e.g., cooperative web caching), and decentralized solutions that are robust and responsive can continually ensure that the user gets the "best of all worlds" performance. We will carefully consider mechanisms to "lock" the networked systems into a desired equilibrium state in spite of abrupt changes in systems workload. We will carry out experiments on clustered testbeds, e.g., PlanetLab (http://www.planet-lab.org). Representative deliverables will be (i) cooperative (peer-to-peer) web caching software that clients can use to fetch web objects from each other, as well as to provide feedback to the other tiers (e.g., content distribution networks); and (ii) a smart browser that takes performance-related actions (e.g., deploying different versions of HTTP) based on client-perceived Web response time.
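
The smart-browser deliverable can be illustrated with a small Python sketch of measurement-driven selection among alternatives. It keeps an exponentially weighted moving average (EWMA) of the client-perceived response time for each option and picks the option with the lowest estimate. The class name, the option labels, and the smoothing factor are assumptions made for illustration; the text does not say how the browser would actually decide.

```python
from typing import Dict, Iterable, Optional


class ResponseTimeSelector:
    """Pick the option with the lowest smoothed response time."""

    def __init__(self, options: Iterable[str], alpha: float = 0.25) -> None:
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha  # weight given to the newest measurement
        self.estimates: Dict[str, Optional[float]] = {o: None for o in options}

    def record(self, option: str, seconds: float) -> None:
        """Fold one measured response time into the option's EWMA."""
        old = self.estimates[option]
        if old is None:
            self.estimates[option] = seconds
        else:
            self.estimates[option] = (1 - self.alpha) * old + self.alpha * seconds

    def choose(self) -> str:
        """Return an unmeasured option if any remains, else the fastest one."""
        for option, estimate in self.estimates.items():
            if estimate is None:
                return option
        return min(self.estimates, key=lambda o: self.estimates[o])


# Usage: the option labels below are hypothetical.
selector = ResponseTimeSelector(["HTTP/1.0", "HTTP/1.1"])
selector.record("HTTP/1.0", 0.8)
selector.record("HTTP/1.1", 0.4)
preferred = selector.choose()
```

The smoothing factor trades responsiveness against stability: a larger value reacts faster to workload changes, while a smaller value resists switching on a single slow response.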