Programme
| All day |
Affiliated Workshops
| | 19:30 - 21:30 | Reception
& Registration at the Holiday Inn Hotel
|
Wednesday,
March 21st | 9:00 - 9:15 |
Welcome & Intro
Marc Shapiro Paulo Ferreira Thomas Gross |
| 9:15 - 10:30 |
Profiling, Prediction & Instrumentation - Session Chair: Jim Larus, Microsoft Research
- JIT INSTRUMENTATION - A NOVEL APPROACH TO
DYNAMICALLY INSTRUMENT
OPERATING SYSTEMS. (abstract)
Marek Olszewski, Adam Czajkowski, Keir
Mierle, Angela Demke Brown (University of Toronto)
As modern operating
systems become more complex, understanding their inner workings is
increasingly difficult. Dynamic kernel instrumentation is a well
established method of obtaining insight into the workings of an OS,
with applications including debugging, profiling and monitoring, and
security auditing. To date, all dynamic instrumentation systems for
operating systems follow the probe-based instrumentation paradigm.
While efficient on fixed-length instruction set architectures, probes
are extremely expensive on variable-length ISAs such as the popular
Intel x86 and AMD x86-64. We propose using just-in-time (JIT)
instrumentation to overcome this problem. While common in user space,
JIT instrumentation has not until now been attempted in kernel space.
In this work, we show the feasibility and desirability of kernel-based
JIT instrumentation for operating systems with our novel prototype,
implemented as a Linux kernel module. The protoype is fully SMP
capable. We evaluate our prototype against the popular Kprobes Linux
instrumentation tool. Our prototype outperforms Kprobes, at both micro
and macro levels, by orders of magnitude when applying medium- and
fine-grained instrumentation.
- WHODUNIT: TRANSACTIONAL
PROFILING FOR
MULTI-TIER APPLICATIONS. (abstract)
Anupam Chanda, Alan Cox (Rice University)
Willy Zwaenepoel (EPFL)
This paper is
concerned with performance debugging of multi-tier applications, such
as commonly found in servers and dynamic-content web sites. Existing
tools and techniques for profiling such applications are not general
enough to track and profile transactions in a generic multi-tier
application. We propose transactional profiling that provides a general
solution to this problem. We provide novel algorithms and techniques to
track and profile transactions that flow through shared memory, events,
stages or via interprocess communication using messages. We also
measure interference among concurrent transactions.
We
describe the design and implementation of Whodunit, our prototype
transactional profiler. Using Apache and MySQL we validate our proposed
algorithm for tracking transaction flow through shared memory. We
demonstrate the use of Whodunit in obtaining the transactional profile
of web servers, a web proxy cache and a bookstore application.
Whodunit-inspired optimizations increased the peak throughput of the
bookstore by almost 3x.
- EXPLOITING
NONSTATIONARITY FOR PERFORMANCE PREDICTION. (abstract)
Christopher Stewart (HP Labs / University of Rochester)
Terence Kelly, Alex Zhang (HP Labs)Real production
applications ranging from enterprise applications to large e-commerce
sites share a crucial but seldom-noted characteristic: The relative
frequencies of transaction types in their workloads are
\emph{nonstationary}. Accurately predicting application-level
performance in business-critical production applications is an
increasingly important problem. However, transaction mix
nonstationarity casts doubt on the practical usefulness of prediction
methods that ignore this phenomenon.
This paper
demonstrates that nonstationarity \emph{enables} a new approach to
predicting application-level performance as a function of transaction
mix. We exploit nonstationarity to eliminate the need for invasive
instrumentation and controlled benchmarking during model calibration;
our approach relies solely on lightweight passive measurements that are
routinely collected in today's production environments. We evaluate
predictive accuracy on two real business-critical production
applications. The accuracy of our response time predictions ranges from
10\% to 16\% on these applications, and our models generalize well to
workloads very different from those used for calibration.
We
apply our technique to the challenging problem of predicting the impact
of application consolidation on transaction response times. We
calibrate models of two testbed applications running on dedicated
machines, then use the models to predict their performance when they
run together on a shared machine and serve very different workload. Our
predictions are accurate to within 4\% to 14\%. Existing approaches to
consolidation decision support predict post-consolidation resource
utilizations. Our method allows application-level performance to guide
consolidation decisions.
| | 10:30
- 11:00 | Coffee Break
| | 11:00
- 12:15 | Multi-Core/Multi-Processor
Issues - Session Chair: Steven Hand, University of Cambridge -
THREAD CLUSTERING: SHARING-AWARE SCHEDULING ON SMP-CMP-SMT
MULTIPROCESSORS. (abstract)
David Tam, Reza Azimi, Michael Stumm (University
of Toronto)
The major
chip manufacturers have all introduced chip multiprocessing (CMP) and
simultaneous multithreading (SMT) technology into their processing
units. As a result, even low end computing systems and game consoles
have become shared memory multiprocessors with L1 and L2 cache sharing
within a chip. Mid- and large-scale systems will have multiple
processing chips and hence consist of an SMP-CMP-SMT configuration with
non-uniform data sharing overheads. Current operating system schedulers
are not aware of these new cache organizations, and as a result,
distribute threads across processors in a way that causes many
unnecessary, long-latency cross-chip cache accesses.
In
this paper we describe the design and implementation of a scheme to
schedule threads based on sharing patterns detected online using
features of standard performance monitoring units (PMUs) available in
today's processing units. The primary advantage of using the PMU
infrastructure is that it is fine-grained (down to the cache line) and
has relatively low overhead. We have implemented our scheme in Linux
running on an 8-way Power5 SMP-CMP-SMT multiprocessor. For commercial
multi-threaded server workloads (VolanoMark, SPECjbb, and RUBiS), we
are able to demonstrate reductions in cross-chip cache accesses of up
to 70%. These reductions lead to application-reported performance
improvements of up to 7%.
- DRYAD:
DISTRIBUTED DATA-PARALLEL
PROGRAMS
FROM SEQUENTIAL BUILDING
BLOCKS. (abstract)
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell,
Dennis Fetterly (Microsoft Research, Silicon Valley)Dryad is a
general-purpose distributed execution engine for coarse-grained
data-parallel applications. A Dryad application combines computational
“vertices” with communication
“channels” to form a data-flow graph. Dryad runs
the application by executing the vertices of this graph on a set of
available computers, communicating as appropriate through files, TCP
pipes, and shared-memory FIFOs
The vertices provided
by the application developer are quite simple (for example an in-memory
sort or a selective merge of the data items from two input channels),
and are usually written as sequential programs with no thread creation
or locking. Concurrency arises from Dryad scheduling vertices to run
simultaneously on multiple computers, or on multiple CPU cores within a
computer. An application developer might write entirely new vertices
for the application, or might re-use existing vertices in a new
data-flow graph; most applications do some of each.
For
advanced applications the developer can control the choice of channel
type (file, FIFO or pipe) to optimize the application’s
execution. The developer can also discover the structure and placement
of data at run-time, and make a graph self-modifying to use the
available resources more efficiently.
Dryad is
designed to scale from powerful multi-core single computers, through
small clusters of computers, to data centers with thousands of
computers. The Dryad execution engine handles all the difficult
problems of creating a large distributed, concurrent application:
scheduling the use of computers and their CPUs, recovering from
communication or computer failures, and transporting data between
vertices.
We present a description of Dryad,
supported by sample applications (some very simple, others with many
thousands of vertices and handling terabytes of data); we present our
experiences so far in using this system, including its performance and
scaling behavior; and finally, we compare Dryad with some of the
pre-existing related work.
- ENABLING SCALABILITY
AND PERFORMANCE IN A LARGE
SCALE CMP ENVIRONMENT. (abstract)
Bratin Saha, Ali-Reza Adl-Tabatabai, Anwar Ghuloum, Mohan Rajagopalan,
Richard L. Hudson
, Leaf Petersen, Vijay Menon, Brian Murphy, Tatiana Sphiesman, Eric Sprangle,
Anwar Rohillah, Doug Carmean and Jesse Fang (Programming
Systems Lab, Intel Corp)Hardware
trends suggest that large-scale CMP architectures, with tens to
hundreds of processing cores on a single piece of silicon, are imminent
within the next decade. While existing CMP machines have traditionally
been handled in the same way as SMPs, this magnitude of parallelism
introduces several fundamental architectural changes which, in turn,
translate to novel challenges for the design of the software stack on
these platforms. This paper presents the “Many Core Run
Time” (McRT), a software prototype of an integrated language
runtime that was designed to explore configurations of the software
stack for enabling performance and scalability on large scale CMP
platforms. This paper presents the architecture of McRT and discusses
our experiences with the system, including experimental evaluation that
lead to several interesting and non-intuitive findings, providing key
insights about the structure of futuristic system stacks. One key
contribution of this paper is that it demonstrates how McRT can enable
near linear improvements in performance and scalability for desktop
workloads such as the popular XviD encoder and a set of RMS
(recognition, mining, and synthesis) applications. Another significant
contribution is its use of McRT to explore non-traditional
configurations such as a light-weight executive in which McRT runs on
“bare metal” that complements traditional operating
systems and potentially replaces them on a fraction of the CMP cores.
| | 12:15 - 13:45 |
Lunch at the Holiday Inn Hotel | | 13:45 - 15:00 | Networking
Issues - Session Chair: Herbert Bos, Vrije Universiteit Amsterdam -
LATENCY- AND BANDWIDTH-MINIMIZING FAILURE DETECTORS.
(abstract)
Kelvin C. W. So, Emin Gun Sirer (Cornell University)Failure detectors
are fundamental building blocks in distributed systems. {\em Multi-node
failure detectors}, where the detector is tasked with monitoring $N$
other nodes, play a critical role in overlay networks and peer-to-peer
systems. In such networks, failures need to be detected quickly and
with low overhead. Achieving these properties simultaneously poses a
difficult tradeoff between detection latency and resource consumption.
In
this paper, we examine this central tradeoff, formalize it as an
optimization problem and analytically derive the optimal closed form
formulas for multi-node failure detectors. We provide two variants of
the optimal solution for optimality metrics appropriate for two
different deployment scenarios. sqrt{s}-LM is
a latency-minimizing optimal failure detector that achieves
the lowest average failure detection latency given a fixed bandwidth
constraint for system maintenance. sqrt{s}-BM is
a bandwidth-minimizing optimal failure detector that meets a
desired detection latency target with the least amount of bandwidth
consumed. We evaluate our optimal results with node lifetimes chosen
from bimodal and Pareto distributions, as well as real-world trace data
from PlanetLab hosts, web sites and Microsoft PCs. Compared to standard
failure detectors in wide use, sqrt{s} failure detectors
reduce failure detection latencies by 40\% on average for the same
bandwidth consumption, or conversely, reduce the amount of bandwidth
consumed by 30\% for the same failure detection latency.
- MELANGE: CREATING A
"FUNCTIONAL" INTERNET. (abstract)
Anil
Madhavapeddy (XenSource
Inc.) Alex Ho (University of Cambridge)
Tim Deegan (XenSource Inc.)
(University of Cambridge) David Scott and Ripduman Sohan(XenSource
Inc.)Most
implementations of critical Internet protocols are written in
type-unsafe languages such as C or C++ and are regularly vulnerable to
serious security and reliability problems. Type-safe languages
eliminate many errors but are not used to due to the perceived
performance overheads.
We combine two techniques to
eliminate
this performance penalty in a practical fashion: strong static typing
and generative meta-programming. Static typing eliminates run-time type
information by checking safety at compile-time and minimises dynamic
checks. Meta-programming uses a single specification to abstract the
low-level code required to transmit and receive packets.
Our
domain-specific language, MPL, describes Internet packet protocols and
compiles into fast, zero-copy code for both parsing and creating these
packets. MPL is designed for implementing quirky Internet protocols
ranging from the low-level: Ethernet, IPv4, ICMP and TCP; to the
complex application-level: SSH, DNS and BGP; and even file-system
protocols such as 9P.
We report on fully-featured
SSH and DNS
servers constructed using MPL and our OCaml framework ``Melange'', and
measure greater throughput, lower latency, better flexibility and more
succinct source code than their C equivalents OpenSSH and BIND. Our
quantitative analysis shows that the benefits of MPL-generated code
overcomes the additional overheads of automatic garbage collection and
dynamic bounds checking. Qualitatively, the flexibility of our approach
shows that dramatic optimisations are easily possible.
- SWEEPER: A LIGHTWEIGHT
END-TO-END SYSTEM FOR DEFENDING AGAINST FAST
WORMS. (abstract)
Joseph Tucek (U. of Illinois at Urbana-Champaign)
James Newsome (Carnegie Mellon University)
Shan Lu, Chengdu Huang, Spiros Xanthos (U. of Illinois at
Urbana-Champaign) David Brumley (Carnegie
Mellon University) Yuanyuan Zhou (U. of
Illinois at Urbana-Champaign) Dawn Song (Carnegie
Mellon University)The vulnerabilities
which plague computers cause endless grief to users. Slammer
compromised millions of hosts in minutes; a hit-list worm would take
under a second. Recently proposed techniques respond better than manual
approaches, but require expensive instrumentation, which greatly limits
deployment. Although spreading ``antibodies'' (e.g. signatures)
ameliorates this limitation, hosts depending on antibodies are
defenseless until inoculation; to the fastest hit-list worms this delay
is crucial. Additionally, most recently proposed techniques cannot
provide recovery to provide continuous service after an attack.
We
propose a novel solution called Sweeper that provides both fast and
accurate post-attack analysis and efficient recovery with low normal
execution overhead. Sweeper innovatively combines several techniques:
(1) Sweeper uses lightweight monitoring techniques to detect a wide
array of suspicious requests, providing a first level of defense. (2)
By cleverly leveraging lightweight checkpointing, Sweeper postpones
heavyweight monitoring until absolutely necessary --- after an attack
is detected. Sweeper rolls back and re-executes multiple times to
dynamically apply heavy-weight analysis techniques via dynamic binary
instrumentation. Since only the execution involved in the attack is
analyzed, the analysis is efficient, yet thorough. (3) Based on the
analysis results, Sweeper automatically generates low-overhead
antibodies to prevent future attacks of the same vulnerability. (4)
Finally, Sweeper again re-executes to perform fast recovery for
continuous service.
We implement Sweeper in a real
system. Our experimental results with three real-world servers and four
real security vulnerabilities show that Sweeper detects an attack and
generates antibodies in under 60 milliseconds. Our results also show
that Sweeper imposes under 1% overhead during normal execution, clearly
suitable for widespread production deployment (especially since Sweeper
also allows partial deployment). Finally, we analytically show that,
for a fast hit-list worm otherwise capable of infecting all vulnerable
hosts in under a second, Sweeper contains the extent of infection to
under 5%.
| | 15:00 - 15:30 | Coffee
Break |
| 15:30 - 16:45 | Sensor networks - Session Chair: Dejan Kostic, EPFL
- REMOVING THE MEMORY LIMITATIONS OF SENSOR
NETWORKS WITH FLASH-BASED
VIRTUAL MEMORY. (abstract)
Andreas Lachenmann, Pedro José Marrón, Matthias
Gauger, Daniel Minder, Olga
Saukh, Kurt Rothermel (Universität Stuttgart)Virtual memory has
been successfully used in different domains to extend the amount of
memory available to applications. We have adapted this mechanism to
sensor networks, where RAM traditionally is a severely constrained
resource. In this paper we show that the overhead of virtual memory can
be significantly reduced with compile-time optimizations to make it
usable in practice, even with the resource limitations present in
sensor networks.
Our approach, ViMem, creates an
efficient memory layout based on variable access traces obtained from
simulation tools. This layout is optimized to the memory access
patterns of the application and to the specific properties of the
sensor network hardware.
Our implementation is based
on TinyOS. It includes a pre-compiler for nesC code that translates
virtual memory accesses into calls of ViMem's runtime component. ViMem
uses flash memory as secondary storage. In order to evaluate our system
we have modified nontrivial existing applications to make use of
virtual memory. We show that its runtime overhead is small even for
large data sizes.
- A VIRTUAL MACHINE FOR
SENSOR NETWORKS. (abstract)
Rene Mueller, Gustavo Alonso, Donald Kossmann (ETH Zurich)Sensor networks
are increasingly being deployed for a wide variety of tasks. One open
problem in these networks is the development, deployment, and
maintenance of applications. Today, these tasks are performed largely
ad-hoc because of the severe constraints imposed by the limitations in
processing power, communication bandwidth, and memory capacity of the
sensor nodes. Existing platforms help somewhat but also introduce
implicit trade-offs. On the one hand, low-level programming platforms
and languages make programming cumbersome and error-prone. On the other
hand, declarative approaches greatly facilitate programming but
severely restrict what can be done. In both cases, additional
limitations include lack of support for concurrency, difficulties in
changing applications, and insufficient abstractions from low-level
details. This paper presents SwissQM, a virtual machine designed to
address all these limitations. SwissQM offers a platform-independent
programming abstraction that is geared towards data acquisition and
in-network data processing. Compared with low-level platforms and
languages, SwissQM offers a better abstraction for programming sensor
networks and interacting with high-level applications. Compared with
declarative query-based approaches, SwissQM offers more flexibility and
support for customisation.
- MACROPROGRAMMING
HETEROGENEOUS SENSOR NETWORKS
USING COSMOS. (abstract)
Asad Awan, Suresh Jagannathan and Ananth Grama (Department of
Computer Science, Purdue University)In this paper, we
present COSMOS, a novel architecture for macroprogramming heterogeneous
sensor network systems. Macroprogramming entails aggregate system
behavior specification, as opposed to device-specific applications that
indirectly express distributed behavior through explicit messaging
between nodes. COSMOS is comprised of a macroprogramming language, {\em
mPL}, and an operating system, {\em mOS}. mPL macroprograms specify
distributed system behavior using statically verifiable compositions of
reusable user-provided, or system supported functional components. mOS
provides component management and a lean execution environment for mPL
in heterogeneous resource-constrained sensor networks. COSMOS
facilitates composition of complex real-world applications that are
robust, scalable and adaptive in dynamic data-driven sensor network
environments. The mOS architecture allows runtime application
instantiation, with over-the-air reprogramming of the network. An
important and novel aspect of COSMOS is the ability to easily extend
its component basis library to add rich macroprogramming abstractions
to mPL, tailored to domain and resource constraints, without
modifications to the OS. A fully functional version of COSMOS is
currently in use at the Bowen Labs for Structural Engineering, at
Purdue University, for high-fidelity structural dynamics measurements.
We present comprehensive experimental evaluation using macro- and
micro- benchmarks to demonstrate performance characteristics of COSMOS.
| | 17:00
- 18:30 | Eurosys General Member Meeting - Room 02.1 |
| 9:00 - 10:15 |
File Systems - Session Chair: Maurice Herlihy, Brown University -
hFS: A HYBRID FILE SYSTEM PROTOTYPE FOR IMPROVING SMALL FILE AND METADATA
PERFORMANCE. (abstract)
Zhihui Zhang, Kanad Ghose (State Univ.
of New York, Binghamton)Two oft-cited file
systems, the Fast File System (FFS) and the Log-Structured File System
(LFS), adopt two sharply different update strategies-update-in-place
and update-out-of-place. The choice of an update strategy largely
determines file system data layout, which in turn has a crucial impact
on file system performance. The respective advantages as well as
disadvantages of FFS and LFS are convincing examples of this
correlation.
This paper introduces the design and
implementation of a hybrid file system called hFS, which combines the
strengths of FFS and LFS while avoiding their weaknesses. This is
accomplished by distributing file system data into two partitions based
on their size and type. In hFS, data blocks of large regular files are
stored in a data partition arranged in a FFS-like fashion, while
metadata and small files are stored in a separate log partition
organized in the spirit of LFS but without incurring any cleaning
overhead. This segregation makes it possible to use more appropriate
layouts for different data than would otherwise be possible. In
particular, hFS has the ability to perform clustered I/O on all kinds
of data-including small files, metadata, and large files. We have
implemented a prototype of hFS on FreeBSD and have compared its
performance against three file systems, including FFS with Soft
Updates, a port of NetBSD's LFS, and our lightweight journaling file
system called yFS. Results on a number of benchmarks show that hFS has
excellent small file and metadata performance. For example, hFS beats
FFS with Soft Updates in the range from 53% to 63% in the PostMark
benchmark. - COMPETITIVE
PREFETCHING FOR CONCURRENT SEQUENTIAL I/O. (abstract)
Chuanpeng Li, Kai Shen (University of Rochester)
Athanasios Papathanasiou (Intel Massachusetts)During concurrent
I/O workloads, sequential access to one I/O stream can be interrupted
by accesses to other streams in the system. Frequent switching between
multiple sequential I/O streams may severely affect I/O efficiency due
to long disk seek and rotational delays of disk-based storage devices.
Aggressive prefetching can improve the granularity of sequential data
access in these situations, but comes with a higher risk of retrieving
unneeded data. This paper proposes a competitive prefetching strategy
that controls the prefetching depth so that the overhead of disk I/O
switching and unnecessary prefetching are balanced. The proposed
strategy does not require a-priori information on the data access
pattern, and achieves at least half the performance (in terms of I/O
throughput) of the optimal offline policy. We also provide analysis on
the optimality of our competitiveness result and extend the
competitiveness result to capture prefetching in the case of
random-access workloads.
We have implemented the
proposed competitive prefetching policy in Linux 2.6.10 and evaluated
its performance on both standalone disks and a disk array using a
variety of workloads (including two common file utilities, Linux kernel
compilation, the TPC-H benchmark, the Apache web server, and index
searching). Compared to the original Linux kernel, our competitive
prefetching system improves performance by up to 53%. At the same time,
it trails the performance of an oracle prefetching strategy by no more
than 42%.
- SECURE FILE SYSTEM
VERSIONING AT THE BLOCK LEVEL. (abstract)
Jake Wires, Mike Feeley (University of British Columbia)Versioning file
systems present an appealing storage model that protects data from
being unintentionally deleted or overwritten by transparently retaining
old versions. Achieving this added reliability by adding versioning to
a file system, however, is problematic in two important ways. First,
the complexity of file systems and the operating systems in which they
reside leaves data vulnerable to bugs and viruses. Second, the
mission-critical nature of file systems makes users and OS vendors
justifiably hesitant to adopt new file system features such as
versioning.
This paper presents BLOG, a secure,
block-level versioning system that adds file-grain versioning to a
standard file system without changing the file system. BLOG consists of
a set of untrusted user-mode tools and a trusted, secure kernel that is
implemented within an isolated Xen virtual machine domain. The secure
kernel is designed to be simple and thus trustworthy. It consists of
three parts: a device-driver wrapper that logs every write to the file
system's disk, a cleaner that securely removes unwanted versions, and a
read-only interface that exports the log to the host operating system.
BLOG's
two-level approach to cleaning is based on a version-retention policy
assigned to each BLOGed disk partition when it is created. An
untrusted, user-mode cleaner periodically determines the portions of
the log it wishes to reclaim and then submits log-cleaning requests to
the secure kernel along with a proof that these requests satisfy the
retention policy. The secure kernel verifies the proof and reclaims the
specified log blocks if applicable.
We have
implemented BLOG --- which currently supports ext2 and ext3 --- in Xen
with a Linux guest operating system. Experimental results show that
logging reduces file system throughput by between 10% and 50% and that
secure cleaning limits log growth to roughly 1.5 times that of the file
system.
| | 10:15 - 10:45 | Coffee
Break |
| 10:45 - 12:25 |
Experience Papers - Session Chair: Liuba Shrira, Brandeis University -
AUTOMATIC CONFIGURATION OF INTERNET SERVICES. (abstract)
Wei Zheng,
Ricardo Bianchini, Thu Nguyen (Rutgers University)Recent research has found that operators frequently
misconfigure Internet services, causing various availability and
performance problems. In this paper, we propose a software
infrastructure that eliminates several types of misconfiguration by
automating the generation of configuration files in Internet services,
as the services evolve. The infrastructure comprises a custom scripting
language, configuration file templates, communicating runtime monitors,
and heuristic algorithms to detect dependencies between configuration
parameters and select ideal configurations. To demonstrate our
infrastructure experimentally, we apply it to a realistic online
auction service. Our results show that the infrastructure can simplify
operation significantly while eliminating 58% of the misconfigurations
found in a previous study of the same service. Furthermore, our results
show that the infrastructure can efficiently determine the
configuration parameters that lead to the highest performance as the
service evolves through a hardware upgrade and the scheduled
maintenance of a few nodes.
- COMPARING THE
PERFORMANCE OF WEB SERVER ARCHITECTURES. (abstract)
David
Pariag, Tim Brecht, Ashif Harji, Peter Buhr, Amol Shukla (University
of Waterloo)n this paper we
extensively tune and then compare the performance of web servers based
on three different architectures. We begin by describing modifications
made to the Capriccio thread library and the thread-per-connection Knot
server to allow them to use Linux's zero-copy sendfile interface. We
then explain relatively minor modifications to a single process
event-driven (SPED) server (the userver) to allow it to continue
processing requests in the presence of blocking due to file-system
cache misses. Additionally, we describe an efficient implementation of
a server that is similar in spirit to a staged event-driven
architecture (SEDA) server. Because our server is implemented in C++
and for simplicity excludes the dynamic controls over event queues and
thread pools used in SEDA, we call it a pipeline-based server (WatPipe).
When
comparing the performance of these three server architectures, we
arrive at different conclusions than previous studies. We find that in
spite of recent improvements to threading libraries and our further
improvements to Knot, both the event-based userver and our
pipeline-based WatPipe server provide better throughput (by about 18%).
Interestingly, our results also show that for some architectures that
use blocking system calls when writing to sockets, performance is as
good and in some cases noticeably better than when using non-blocking
system calls.
- CONCIERGE: A SERVICE
PLATFORM FOR RESOURCE-CONSTRAINED DEVICES. (abstract)
Jan S. Rellermeyer, Gustavo Alonso (ETH Zurich, Department of
Computer Science)As mobile and
embedded devices become widespread, the management and configuration of
the software in the devices is increasingly turning into a critical
issue. OSGi is a business standard for the life cycle management of
Java software components. It is based on a service oriented
architecture where functional units are decoupled and components can be
managed independently of each other. However, the focus continuously
shifts from the originally intended area of small and embedded devices
towards large-scaled enterprise systems. As a result, implementations
of the OSGi framework are increasingly becoming more heavyweight and
less suitable for smaller computing devices. In this paper, we describe
the design of Concierge, a complete implementation of the OSGi
specifications for resource-constrained devices focused on performance.
We also discuss and evaluate important optimization techniques for the
Java VMs that are relevant in mobile and embedded systems.
Comprehensive benchmarks show that Concierge performs better than
existing implementations and consumes less resources.
- FINE GRAINED KERNEL
LOGGING WITH KLOGGER: EXPERIENCE AND INSIGHTS. (abstract)
Yoav Etsion (Hebrew University) Dan
Tsafrir (Hebrew University / IBM T.J. Watson Research Center)
Scott Kirkpatrick, Dror Feitelson (Hebrew University)Understanding the
detailed behavior of an operating system is crucial for making informed
design decisions. But such an understanding is very hard to achieve,
due to the increasing complexity of such systems and the fact that they
are implemented and maintained by large and diverse groups of
developers. Tools like Klogger --- presented in this paper --- can help
by enabling fine-grained logging of system events and the sharing of a
logging infrastructure between multiple developers and researchers,
facilitating a methodology where design evaluation can be an integral
part of kernel development. We demonstrate the need for such
methodology by a host of case studies, using Klogger to better
understand various subsystems in the Linux kernel, and pinpointing
overheads and problems therein.
| | 12:30 - 14:00 | Lunch at the Holiday Inn Hotel |
| 14:00 - 14:50 |
Virtualization - Session Chair: Frank Bellosa, Karlsruhe University -
CONTAINER-BASED OPERATING SYSTEM VIRTUALIZATION: A SCALABLE,
HIGH-PERFORMANCE ALTERNATIVE TO HYPERVISORS. (abstract)
Stephen Soltesz (Princeton
University) Herbert Potzl (Thirteenth
Floor Consulting) Marc Fiuczynski, Andy Bavier,
Larry Peterson (Princeton University)Hypervisors,
popularized by Xen and VMware, are quickly becoming commodity. They are
appropriate for many usage scenarios, but there are scenarios that
require system virtualization with high degrees of both isolation and
efficiency. Examples include HPC clusters, the Grid, hosting centers,
and PlanetLab. We present an alternative to hypervisors that is better
suited to such scenarios. The approach is a synthesis of prior work on
resource containers and security containers applied to general-purpose,
time-shared operating systems. Examples of such container-based systems
include Solaris 10, Virtuozzo for Linux, and Linux-VServer. As a
representative instance of container-based systems, this paper
describes the design and implementation of Linux-VServer. In addition,
it contrasts the architecture of Linux-VServer with current generations
of Xen, and shows how Linux-VServer provides comparable support for
isolation and superior system efficiency.
- ADAPTIVE CONTROL OF
VIRTUALIZED RESOURCES IN UTILITY COMPUTING
ENVIRONMENTS. (abstract)
Pradeep Padala (University of Michigan)
Xiaoyun Zhu, Mustafa Uysal, Zhikui Wang, Sharad Singhal, Arif Merchant (Hewlett
Packard Laboratories)
Kenneth Salem (University of Waterloo)
Kang Shin (University of Michigan)Data centers are
often under-utilized due to over-provisioning as well as time-varying
resource demands of typical enter- prise applications. One approach to
increase resource uti- lization is to consolidate applications in a
shared infrastruc- ture using virtualization. When multiple
applications share a common server infrastructure, meeting
application-level quality of service (QoS) goals becomes a challenge as
each application consumes a dierent amount of server resources.
Furthermore, for multi-tier applications, the amount of re- sources
needed to achieve their QoS goals might be dierent at each tier and
may depend on availability of resources in other tiers. In this paper,
we develop an adaptive resource control system that dynamically adjusts
the resource shares to individual virtual machines in order to meet
application- level QoS goals while achieving high resource utilization
in the data center. Our control system is developed using clas- sical
control theory, and we used a black-box system model- ing approach to
overcome the absence of rst principle mod- els for complex enterprise
applications and systems. To eval- uate our controllers, we built a
testbed simulating a virtual data center using Xen virtual machines. We
experimented with two multi-tier applications in this virtual data
center: a two-tier implementation of RUBis, an online auction site, and
a two-tier Java implementation of TPC-W. Our results indicate that the
proposed control system is able to main- tain high resource utilization
and meet QoS goals in spite of varying resource demands from the
applications.
| | 14:50
- 15:20 | Coffee Break |
| 15:20 - 16:10 |
New Ideas & New Benchmarks - Session Chair: Tim Harris, Microsoft Research
- DISCRETE CONTROL FOR SAFE EXECUTION OF IT AUTOMATION
WORKFLOWS. (abstract)
Yin Wang (University of Michigan) Terence
Kelly (Hewlett-Packard Laboratories)
Stiphane Lafortune (University of Michigan)As information
technology (IT) administration becomes increasingly complex, workflow
technologies are increasingly popular for IT automation. Writing
correct workflow programs is notoriously difficult. Although static
analysis tools are available, fixing defects remains manual and
error-prone. This paper applies discrete control theory to IT
automation workflows. Discrete control detects flaws in workflows just
as static analysis does, and more importantly it also allows safe
execution of flawed workflows by dynamically avoiding run-time
failures. Our approach can guarantee compliance with certain
dependability requirements and can partially decouple requirements from
software, reducing the need to modify the latter if the former change.
We have implemented a discrete control module for a real IT automation
system. Experiments with workflows from a real production system and
with randomly generated workflows show that our approach scales to
workflows of practical size.
- STMBENCH7: A BENCHMARK
FOR SOFTWARE TRANSACTIONAL MEMORY. (abstract)
Rachid Guerraoui, Michal Kapalka (EPFL)
Jan Vitek (Purdue University)Software
transactional memory (STM) is a promising technique for controlling
concurrency in modern multi-processor architectures. STM aims to be
more scalable than coarse-grained locking and easier to use than
fine-grained locks. However, STM implementations have yet to
demonstrate that their runtime overheads are acceptable. To date,
empiric evaluations of these implementations have suffered from the
lack of realistic benchmarks. Measuring performance of an STM in an
overly simplified setting can be at best uninformative and at worst
misleading as it may steer researchers to try to optimize irrelevant
aspects of their implementations.
This paper
presents STMBench7: a benchmark for evaluating STM implementations. The
underlying data structure consists of a set of graphs and indexes
intended to be suggestive of many complex applications, e.g., CAD/CAM.
A collection of operations is supported to model a wide range of
workloads and concurrency patterns. Companion locking strategies serve
as a baseline for STM performance comparisons.
STMBench7
strives for simplicity. Users may choose a workload, number of threads,
benchmark length, as well as the possibility of structure modification
and the nature of traversals of shared data structures. We illustrate
the use of STMBench7 with an evaluation of a well-known software
transactional memory implementation.
| | 16:10
- 18:15 | Poster session - Hall 02 |
| 19:30 - 22:30 |
Banquet (Busses leave IST at 18h30!!) |
| 9:00 - 10:15 |
Operating Systems - Session Chair: Gilles Muller, École des Mines de Nantes -
DYNAMIC AND ADAPTIVE UPDATES OF NON-QUIESCENT SUBSYSTEMS IN COMMODITY
OPERATING SYSTEM KERNELS. (abstract)
Kristis Makris (Arizona
State University) Kyung Ryu (IBM T.J.
Watson Research)Continuously
running systems require kernel software updates applied to them without
downtime. Facilitating fast reboots, or delaying an update may not be a
suitable solution in many environments, especially in pay-per-use
high-performance computing clusters and mission critical systems. Such
systems will not reap the benefits of new kernel features, and will
continue to operate with kernel security holes unpatched, at least
until the next scheduled maintenance downtime. To address these
problems we developed an on-the-fly kernel updating system that enables
commodity operating systems to gain adaptive and mutative capabilities
without kernel recompilation or reboot. Our system, DynAMOS, employs a
novel and efficient dynamic code instrumentation technique termed
adaptive function cloning. Execution can be switched adaptively among
multiple editions of functions, possibly concurrently running. This
approach becomes the foundation for dynamic replacement of
non-quiescent kernel subsystems when the timeliness of an update
depends on synchronization of multiple kernel paths. We illustrate our
experience by dynamically updating core subsystems of the Linux kernel.
- SEALING OS PROCESSES
TO IMPROVE DEPENDABILITY AND SAFETY. (abstract)
Galen Hunt, Mark Aiken, Manuel Fähndrich, Chris Hawblitzel, Orion Hodson, James Larus,
Bjarne Steensgaard, David Tarditi and Ted Wobber (Microsoft Research)On most modern
operating systems, a process is a hardware-protected abstraction for
isolating code and data. This protection, however, is selective. A
number of very common mechanisms, including dynamic code loading,
run-time code generation, shared memory, and intrusive system APIs,
make process isolation into a very permeable barrier. This paper argues
that this traditional open process architecture exacerbates
dependability and security weaknesses of modern systems. Moreover, this
architecture compounds these flaws by impairing the operation of
program-analysis tools and reducing their ability to improve
performance or dependability.
By contrast,
Singularity’s sealed process architecture prohibits dynamic
code loading, self-modifying code, shared memory, and limits the scope
of the process API. This paper describes the implementation of the
sealed process architecture in the Singularity operating system,
discusses its merits, and evaluates its effectiveness. Among the
benefits of a sealed process architecture are: improved program
analysis with tools, stronger security and safety guarantees,
elimination of OS redundancies with language runtimes, and improved
software engineering.
- AUTHORIZING
APPLICATIONS IN SINGULARITY. (abstract)
Ted Wobber (Microsoft Research -
Silicon Valley)
Aydan Yumerefendi (Duke University) Martin Abadi, Andrew Birrell (Microsoft Research -
Silicon Valley)
Daniel Simon (Microsoft Research - Redmond)
We describe a new
design for authorization in operating systems in which applications are
first-class entities. In this design, principals embody a flexible
notion of naming. They are compound principals that reflect the
identities of the programs that have executed, even those of login
programs. Our access control lists are patterns that recognize
principals. We present a security model that embodies this design in an
experimental operating system, and we describe the implementation of
our design and its performance in the context of this operating system.
| | 10:15
- 10:45 | Coffee Break |
| 10:45 - 12:00 |
Distributed Systems - Session Chair: Timothy Roscoe, ETH Zürich -
ANTIQUITY: EXPLOITING A SECURE LOG FOR WIDE-AREA DISTRIBUTED STORAGE.
(abstract)
Hakim Weatherspoon (Cornell University)
Patrick Eaton, Byung-Gon Chun, John Kubiatowicz (University of
California, Berkeley)Antiquity is a
wide-area distributed storage system designed to provide a simple
storage service and interface for applications like file systems and
back-up. The design assumes that all servers eventually fail and
attempts to maintain data despite those failures. Antiquity uses a
secure log to maintain data integrity, replicates each log on multiple
servers for increased durability, and uses quorum protocols to ensure
consistency among replicas. We present Antiquity's design and an
extensive experimental evaluation with global and local testbeds.
Antiquity has been running on 400+ PlanetLab servers storing nearly
20,000 logs totalling more than 84 GB of data. Despite constant server
churn, all logs remain durable.
- SPRINT: A MIDDLEWARE FOR
HIGH-PERFORMANCE TRANSACTION PROCESSING. (abstract)
Lasaro
Camargos (USI / Unicamp) Fernando Pedone,
Marcin Wieloch (USI)This paper
describes Sprint, a middleware infrastructure targeting high
performance and high availability data management. Sprint extends the
functionality of a standalone in-memory database (IMDB) to a cluster of
commodity shared-nothing servers. Applications accessing a standalone
IMDB are typically limited by the memory capacity of the server running
the IMDB. Sprint partitions and replicates the database into segments
and stores them in several data servers. Applications are then limited
by the aggregated memory capacity of the servers in the cluster.
Sprint's
execution model is very simple: Transaction synchronization and
commitment rely on a very efficient total-order multicast. Moreover,
differently from several previous approaches, Sprint does not require
accurate failure detection to ensure strong consistency (i.e.,
serializability), leading to very fast reaction to failures.
Experiments conducted using TPC-C and a micro-benchmark on a cluster
with up to 32 data servers show that despite being a middleware
solution, Sprint can provide very good performance and
scalability.
- TASHKENT+:
MEMORY-AWARE LOAD BALANCING AND UPDATE FILTERING IN
REPLICATED DATABASES. (abstract)
Sameh Elnikety, Steven Dropsho, Willy Zwaenepoel (EPFL)We present a
memory-aware load balancing technique, called MALB, to dispatch
transactions to replicas in a replicated database. MALB exploits
knowledge of the working-sets of transactions to fit them in the
replicas’ main memory and reduce memory contention. We
introduce a method to estimate the working-set size and contents of
database transactions. In contrast, other load balancing techniques
– such as the well-known round robin, least connections, and
locality-aware request distribution (LARD) – do not use
explicit information on how requests use memory. In particular, LARD
demonstrates good performance for static content Web workloads. LARD
gives, however, unsatisfactory performance for memory-intensive
database workloads because it lacks information needed to fit
transactions in memory. We also introduce a complementary optimization
called update filtering that reduces the overhead of update propagation
in the replicated system and further boosts performance. In a prototype
system, called Samarkand, we demonstrate that MALB doubles the
throughput over least connections load balancing and improves 52% over
LARD on the ordering mix of the TPC-W benchmark in a 16 replica
cluster. Adding update filtering further improves throughput to triple
that of least connections and more than double that of LARD. Our
techniques exhibit super-linear speedup; the throughput of the 16
replica cluster is 25 times the peak throughput of a standalone
database due to better use of the cluster’s memory.
| | 12:00 - 12:15 | Closing
Remarks
Paulo Ferreira Thomas Gross Joe Sventek
Steven Hand | | 12:15
- 13:45 | Lunch at the Holiday Inn Hotel |
|