Browsing by Author "Zwaenepoel, Willy"
Now showing 1 - 20 of 35
Item: A Characterization of Compound Documents on the Web (1999-11-29)
Lara, Eyal de; Wallach, Dan S.; Zwaenepoel, Willy
Recent developments in office productivity suites make it easier for users to publish rich "compound documents" on the Web. Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web's content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service. In order to guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents produced by the three most popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935 different Web sites. Our main conclusions are: compound documents are in general much larger than current HTML documents; for large documents, embedded objects and images make up a large part of the documents' size; for small documents, the XML format produces much larger documents than OLE, while for large documents there is little difference; and compression considerably reduces the size of documents in both formats.
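As an illustration of the measurement methodology such a study implies, here is a minimal sketch that walks a local corpus of already-crawled Office documents and reports per-application sizes and compression savings. The corpus directory, the extension-to-application map, and the use of zlib as the compressor are all assumptions for illustration, not the paper's tooling.

```python
import os
import zlib

# Hypothetical extension-to-application map; real OLE/XML parsing omitted.
EXTENSIONS = {".doc": "Word", ".xls": "Excel", ".ppt": "PowerPoint"}

def characterize(corpus_dir):
    """Report per-application document sizes and compression savings."""
    stats = {}
    for name in os.listdir(corpus_dir):
        app = EXTENSIONS.get(os.path.splitext(name)[1].lower())
        if app is None:
            continue
        with open(os.path.join(corpus_dir, name), "rb") as f:
            raw = f.read()
        if not raw:
            continue
        saved = 1 - len(zlib.compress(raw)) / len(raw)
        stats.setdefault(app, []).append((len(raw), saved))
    for app, rows in sorted(stats.items()):
        sizes = [s for s, _ in rows]
        savings = [v for _, v in rows]
        print(f"{app}: n={len(rows)}, mean size={sum(sizes)/len(sizes):,.0f} B, "
              f"mean compression savings={sum(savings)/len(savings):.0%}")

characterize("corpus")   # "corpus" is a placeholder directory name
```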
Item: A Comparison of Software Architectures for E-business Applications (2002-02-20)
Cecchet, Emmanuel; Chanda, Anupam; Elnikety, Sameh; Marguerite, Julie; Zwaenepoel, Willy
As dynamic content has become more prevalent on the Web, a number of standard mechanisms have evolved to generate such dynamic content. We study three specific mechanisms in common use: PHP, Java servlets, and Enterprise Java Beans (EJB). PHP and Java servlets require a direct encoding of the database queries in the application logic. EJB provides a level of indirection, allowing the application logic to call bean methods that then perform database queries. Unlike PHP, which typically executes on the same machine as the Web server, Java servlets and EJB allow the application logic to execute on different machines, including the machine on which the database executes or a completely separate (set of) machine(s). We present a comparison of the performance of these three systems in different configurations for two application benchmarks: an auction site and an online bookstore. We choose these two applications because they impose vastly different loads on the sub-systems: the auction site stresses the Web server front-end, while the online bookstore stresses the database. We use open-source software in common use in all of our experiments (the Apache Web server, the Tomcat servlet server, the Jonas EJB server, and the MySQL relational database). The computational demands of Java servlets are modestly higher than those of PHP. The ability to locate the servlets on a machine different from the Web server, however, results in better performance for Java servlets than for PHP when the application imposes a significant load on the front-end Web server. The computational demands of EJB are much higher than those of PHP and Java servlets. As with Java servlets, we can alleviate EJB's performance problems by putting the beans on a separate machine, but the resulting overall performance remains inferior to that of the other two systems.

Item: An Efficient Threading Model to Boost Server Performance (2004-09-13)
Chanda, Anupam; Cox, Alan L.; Elmeleegy, Khaled; Gil, Romer; Mittal, Sumit; Zwaenepoel, Willy
We investigate high-performance threading architectures for I/O-intensive multi-threaded servers. We study thread architectures from two angles: (1) the number of user threads per kernel thread, and (2) the use of synchronous vs. asynchronous I/O. We underline the shortcomings, with respect to server performance, of 1-to-1 threads with synchronous I/O, N-to-1 threads with asynchronous I/O, and N-to-M threads with synchronous I/O. We propose N-to-M threads with asynchronous I/O, a novel and previously unexplored thread model, for such servers. We explain the architectural benefits of this thread model over the above-mentioned architectures. We have designed and implemented ServLib, a thread library based on this model. We optimize ServLib to reduce context switches for large I/O transfers. ServLib exports the standard POSIX threads (pthreads) API and can be slipped transparently beneath any multi-threaded application using that API. We have evaluated ServLib with two applications: the Apache multi-threaded web server and the MySQL multi-threaded database server. Our results show that for synthetic and real workloads, Apache with ServLib registers a performance improvement of 10-25% over Apache with 1-to-1 and N-to-1 thread libraries. For the TPC-W workload, ServLib improves the performance of MySQL by 10-17% over 1-to-1 and N-to-1 thread libraries.

Item: An efficient threading model to boost server performance (2003)
Chanda, Anupam; Zwaenepoel, Willy
Multi-threading is a popular choice for server architecture. Widely used servers, like the Apache web server and the MySQL database server, are written in a multi-threaded fashion. We investigate the effects of thread architecture on server performance from two angles: (1) the number of user threads per kernel thread, and (2) the use of blocking vs. non-blocking I/O. We propose N-to-M threads with non-blocking I/O, a novel threading model, to provide higher performance for servers, and explain its advantages over other existing thread architectures, viz., 1-to-1 threads with blocking I/O, N-to-1 threads with non-blocking I/O, and N-to-M threads with blocking I/O. We demonstrate the efficacy of this threading model by showing performance improvements for Apache and MySQL. Results show that our threading model provides a performance improvement of 10-22% for Apache (for synthetic and real workloads) and 10-17% for MySQL (for the TPC-W workload) over existing thread models.
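The core idea of the proposed N-to-M model, many user threads multiplexed over kernel threads with the library turning would-be blocking calls into non-blocking I/O plus a user-level context switch, can be sketched with generator-based user threads and a selector loop. This is a single-kernel-thread toy illustration of the scheduling discipline, not ServLib's POSIX-compatible implementation.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def echo_thread(conn):
    """A user thread. Each yield is a 'blocking' I/O point: the library
    parks the thread and runs another instead of blocking the kernel thread."""
    data = yield (selectors.EVENT_READ, conn)    # wait until readable
    if data:
        yield (selectors.EVENT_WRITE, conn)      # wait until writable
        conn.sendall(data)                       # echo the request back
    conn.close()

def park(thread, value=None):
    """Advance a user thread to its next I/O point and register interest."""
    try:
        mask, sock = thread.send(value)
    except StopIteration:
        return                                   # user thread finished
    sock.setblocking(False)
    sel.register(sock, mask, thread)

def event_loop(listener):
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ, None)
    while True:
        for key, mask in sel.select():
            if key.data is None:                 # new connection arrived
                conn, _ = key.fileobj.accept()
                park(echo_thread(conn))
            else:                                # resume a parked user thread
                sel.unregister(key.fileobj)
                value = (key.fileobj.recv(4096)
                         if mask & selectors.EVENT_READ else None)
                park(key.data, value)

if __name__ == "__main__":
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 8080))                # illustrative address
    srv.listen()
    event_loop(srv)
```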
Item: An Integrated Compile-Time/Run-Time Software Distributed Shared Memory System (1997-11-17)
Cox, Alan; Dwarkadas, Sandhya; Zwaenepoel, Willy
High Performance Fortran (HPF), as well as its predecessor Fortran D, has attracted considerable attention as a promising language for writing portable parallel programs for a wide variety of distributed-memory architectures. Programmers express data parallelism using Fortran90 array operations and use data layout directives to direct the partitioning of the data and computation among the processors of a parallel machine. For HPF to gain acceptance as a vehicle for parallel scientific programming, it must achieve high performance on problems for which it is well suited. To achieve high performance with an HPF program on a distributed-memory parallel machine, an HPF compiler must do a superb job of translating Fortran90 data-parallel array constructs into an efficient sequence of operations that minimizes the overhead associated with data movement and maximizes data locality. This dissertation presents and analyzes a set of advanced optimizations designed to improve the execution performance of HPF programs on distributed-memory architectures. Presented is a methodology for performing deep analysis of Fortran90 programs, eliminating the reliance upon pattern matching to drive the optimizations, as is done in many Fortran90 compilers. The optimizations address the overhead of data movement, both interprocessor and intraprocessor, that results from the translation of Fortran90 array constructs. Additional optimizations address the issues of scalarizing array assignment statements, loop fusion, and data locality. The combination of these optimizations results in a compiler that is capable of optimizing dense matrix stencil computations more completely than all previous efforts in this area. This work is distinguished by advanced compile-time analysis and optimizations performed at the whole-array level, as opposed to analysis and optimization performed at the loop or array-element level.

Item: Automatic data aggregation for software distributed shared memory systems (1997)
Rajamani, Karthick; Zwaenepoel, Willy
Software Distributed Shared Memory (DSM) provides a shared-memory abstraction on distributed-memory hardware, making a parallel programmer's task easier. Unfortunately, software DSM is less efficient than the direct use of the underlying message-passing hardware. The chief reason for this is that hand-coded and compiler-generated message-passing programs typically achieve better data aggregation in their messages than programs using software DSM. Software DSM has poorer data aggregation because the system lacks the knowledge of the application's behavior that a programmer or compiler analysis can provide. We propose four new techniques to perform automatic data aggregation in software DSM. Our techniques use run-time analysis of past data-fetch accesses made by a processor to aggregate data movement for future accesses. They do not need any additional compiler support. We implemented our techniques in the TreadMarks software DSM system. We used a test suite of four applications: 3D-FFT, Barnes-Hut, Ilink, and Shallow. For these applications we obtained 40% to 66% reductions in message counts, which resulted in 6% to 19% improvements in execution times.
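One plausible instance of such run-time analysis is a stride detector: watch the pages a processor has recently faulted on, and when they show a constant stride, fetch the next few pages in a single message. The history depth and prefetch width below are invented knobs, not TreadMarks parameters.

```python
# Toy stride-based aggregation in the spirit of run-time data aggregation.
HISTORY = 3    # recent faults to look back over (illustrative)
WIDTH = 4      # pages to aggregate into one fetch (illustrative)

class Aggregator:
    def __init__(self, fetch):
        self.fetch = fetch        # fetch(pages) moves pages in one message
        self.history = []

    def on_page_fault(self, page):
        self.history = (self.history + [page])[-HISTORY:]
        pages = [page]
        if len(self.history) == HISTORY:
            strides = {b - a for a, b in zip(self.history, self.history[1:])}
            if len(strides) == 1:             # constant stride detected
                (stride,) = strides
                pages += [page + stride * i for i in range(1, WIDTH)]
        self.fetch(pages)

# Sequential faults trigger an aggregated fetch on the third access.
agg = Aggregator(fetch=lambda pages: print("fetch", pages))
for p in [10, 11, 12]:
    agg.on_page_fault(p)
```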
Item: Bottleneck Characterization of Dynamic Web Site Benchmarks (2002-02)
Amza, Cristiana; Cecchet, Emmanuel; Chanda, Anupam; Cox, Alan; Elnikety, Sameh; Gil, Romer; Marguerite, Julie; Rajamani, Karthick; Zwaenepoel, Willy
The absence of benchmarks for Web sites with dynamic content has been a major impediment to research in this area. We describe three benchmarks for evaluating the performance of Web sites with dynamic content. The benchmarks model three common types of dynamic-content Web sites with widely varying application characteristics: an online bookstore, an auction site, and a bulletin board. For each benchmark we describe the design of the database, the interactions provided by the Web server, and the workloads used in analyzing the performance of the system. We have implemented these three benchmarks with commonly used open-source software. In particular, we used the Apache Web server, the PHP scripting language, and the MySQL relational database. Our implementation is therefore representative of the many dynamic-content Web sites built using these tools. Our implementations are freely available from our Web site for other researchers to use. We present a performance evaluation of our implementations of these three benchmarks on contemporary commodity hardware. Our performance evaluation focused on finding and explaining the bottleneck resources in each benchmark. For the online bookstore, the CPU on the database was the bottleneck, while for the auction site and the bulletin board the CPU on the front-end Web server was the bottleneck. In none of the benchmarks was the network between the front-end and the back-end a bottleneck. With amounts of memory common by today's standards, neither main memory nor the disk proved to be a limiting factor for any of the benchmarks.

Item: Cache management in scalable network servers (2000)
Pai, Vivek Sadananda; Zwaenepoel, Willy
For many users, the perceived speed of computing is increasingly dependent on the performance of network server systems, underscoring the need for high-performance servers. Cost-effective scalable network servers can be built on clusters of commodity components (PCs and LANs) instead of expensive multiprocessor systems. However, network servers cache files to reduce disk access, and the cluster's physically disjoint memories complicate sharing cached file data. Additionally, the physically disjoint CPUs complicate the problem of load balancing. This work examines the issue of cache management in scalable network servers at two levels: per-node (local) and cluster-wide (global). Per-node cache management is addressed by the IO-Lite unified buffering and caching system. Applications and various parts of the operating system currently use incompatible buffering schemes, resulting in unnecessary data copying. For network servers, overall throughput drops for two reasons: copying wastes CPU cycles, and multiple copies of data compete with the filesystem cache for memory. IO-Lite allows applications, the operating system, file system, and network code to safely and securely share a single copy of data. The cluster-wide solution uses a technique called Locality-Aware Request Distribution (LARD), which examines the content of incoming requests to determine which node in a cluster should handle each request. LARD uses the request content to dynamically partition the incoming request stream. This partitioning increases the file cache hit rates on the individual nodes, and it maintains load balance in the cluster.
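The locality-aware distribution policy can be sketched as follows: route each target to the node already serving it, unless that node is overloaded while another is lightly loaded. The threshold values below are illustrative placeholders, not tuned numbers from the thesis.

```python
# A sketch of locality-aware request distribution over cluster nodes.
T_LOW, T_HIGH = 25, 65    # active-request thresholds (illustrative)

class LardDispatcher:
    def __init__(self, nodes):
        self.load = {n: 0 for n in nodes}   # active requests per node
        self.server = {}                    # target (e.g. URL) -> node

    def dispatch(self, target):
        node = self.server.get(target)
        least = min(self.load, key=self.load.get)
        if node is None:
            node = least                    # first access: least-loaded node
        elif (self.load[node] > T_HIGH and self.load[least] < T_LOW) \
                or self.load[node] >= 2 * T_HIGH:
            node = least                    # relieve an overloaded node
        self.server[target] = node
        self.load[node] += 1
        return node

    def complete(self, target):
        self.load[self.server[target]] -= 1

d = LardDispatcher(["node0", "node1"])
# Repeated requests for the same file go to the same node's cache.
print(d.dispatch("/a.html"), d.dispatch("/a.html"), d.dispatch("/b.html"))
```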
Item: Component-based adaptation for mobile computing (2002)
de Lara, Eyal; Zwaenepoel, Willy
Component-based adaptation is a novel approach to adapting applications to the limited availability of resources, such as bandwidth and power, in mobile environments. Component-based adaptation works by calling the run-time APIs that modern component-based applications export. Because source code modification is not necessary, even proprietary applications such as the productivity tools in Microsoft's Office suite can be adapted. Moreover, new adaptive behavior can be added to applications long after they have been deployed. Even if source code is available, the development time for implementing adaptation is much reduced. In addition, the ease with which adaptations can be implemented in this framework has enabled me to explore new avenues in adaptation. First, I have developed the first adaptive system to support document editing and collaboration over bandwidth-limited links. The key insight gathered from this work is that support for adaptation is orthogonal to concurrency and consistency mechanisms, and therefore can be integrated easily into existing systems. Second, I have developed a hierarchical adaptive transmission scheduler to support coordinated multi-application adaptation. I have demonstrated the effectiveness of component-based adaptation by implementing a system called Puppeteer, which has allowed me to adapt widely deployed applications, such as the productivity tools in Microsoft's Office suite and Sun Microsystems' OpenOffice suite. Although the APIs of these applications impose some limitations, I have been able to implement a wide range of adaptation policies for reading, editing, and collaboration, with modest implementation effort and good performance results.

Item: Component-based adaptation system and method (2004-08-10)
De Lara, Eyal; Wallach, Daniel S.; Zwaenepoel, Willy; Rice University; United States Patent and Trademark Office
A component-based adaptation system is provided in which the operation of an application, or the data being used by the application, is adapted according to an application-specific or a user-specific policy. Following a request for a document by an application, the requested document is retrieved and converted into an application-independent format. The data of the document is then supplied to the application according to a user-specific or application-specific policy. The application of the policy may result in a lower-fidelity version or a subset of the data of the requested document being supplied to the application. The policy may also govern the updating of the data supplied to the application. The data supplied to the application may be updated following the occurrence of a tracked event in the application, or according to a background policy governing the supply of updated data without reference to the user's operation of the application. All of the adaptations are implemented without modifying the source code of the application and without modifying the document as it is permanently stored on a data server.

Item: Component-based adaptation system and method (2009-05-12)
De Lara, Eyal; Wallach, Daniel S.; Zwaenepoel, Willy; Rice University; United States Patent and Trademark Office
A component-based adaptation system is provided in which the operation of an application, or the data being used by the application, is adapted according to an application-specific or a user-specific policy. Following a request for a document by an application, the requested document is retrieved and converted into an application-independent format. The data of the document is then supplied to the application according to a user-specific or application-specific policy. The application of the policy may result in a lower-fidelity version or a subset of the data of the requested document being supplied to the application. The policy may also govern the updating of the data supplied to the application. The data supplied to the application may be updated following the occurrence of a tracked event in the application, or according to a background policy governing the supply of updated data without reference to the user's operation of the application. All of the adaptations are implemented without modifying the source code of the application and without modifying the document as it is permanently stored on a data server.
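A sketch of the policy step these abstracts describe: a document already parsed into typed components is filtered down to a lower-fidelity subset before being handed to the application. The component types, bandwidth cutoffs, and field names are invented for illustration.

```python
# Toy fidelity policy: decide which components of a document to supply
# to the application at the current bandwidth. Numbers are illustrative.
def adapt(components, bandwidth_kbps):
    """Return the subset of components the policy allows at this bandwidth."""
    if bandwidth_kbps >= 1000:
        return components                          # full fidelity
    keep = []
    for c in components:
        if c["type"] == "text":
            keep.append(c)                         # text always ships first
        elif c["type"] == "image" and bandwidth_kbps >= 56:
            keep.append({**c, "quality": "low"})   # degraded images
        # embedded objects are left out, to be fetched later on demand
    return keep

doc = [{"type": "text", "data": "..."},
       {"type": "image", "data": "...", "quality": "high"},
       {"type": "embedded", "data": "..."}]
print(adapt(doc, bandwidth_kbps=56))
```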
Item: Conflict-aware replication for dynamic content Web sites (2003)
Amza, Cristiana; Zwaenepoel, Willy
Conflict-aware replication is a novel lazy replication technique for scaling the back-end database of a dynamic content Web server using a cluster of commodity computers. This technique provides both throughput scaling and 1-copy serializability. It has generally been believed that this combination is hard to achieve through replication because of the growth in the number of conflicts. Conflict-aware replication interposes a (possibly replicated) scheduler between the database and application server tiers. The conflict-aware scheduler directs incoming queries in such a way that the overall execution is serializable and the number of conflicts is reduced. The technique requires that incoming transactions specify, at the beginning of the transaction, the tables that they access. Using this information, conflict-aware replication provides both scaling and 1-copy serializability, while avoiding any changes to the application server or database. We have implemented a prototype of the conflict-aware scheduler in a cluster-based dynamic content site. We have also implemented various other scheduler algorithms in this prototype for comparison purposes, including conflict-aware and conflict-oblivious schedulers, with 1-copy serializability and with various looser consistency models. We have evaluated this method using the industry-standard TPC-W e-commerce benchmark, an auction site benchmark modeled after eBay.com, and a bulletin board benchmark modeled after slashdot.org. For these applications, we have found that pre-specifying which tables are accessed involves very little work on behalf of the programmer and could easily be automated. For clusters with a small number of database machines (up to 8), we have measured an implementation of the algorithms. We use simulation to extend our measurement results to larger clusters, faster database engines, and lower conflict rates. This dissertation shows that conflict-awareness brings considerable benefits in terms of both overall throughput scaling and latency reduction, compared to both eager and conflict-oblivious lazy replication, for a large range of cluster configurations and conflict rates. Furthermore, for all our applications except those with very high conflict rates, the performance of conflict-aware replication equals or approaches that of looser consistency models. The dissertation also shows that the cost of conflict-aware replication is minimal in terms of data availability and fault tolerance.

Item: Database admission control and request scheduling for dynamic content Web servers (2003)
Elnikety, Sameh Mohamed; Zwaenepoel, Willy
This thesis presents a method for admission control and request scheduling for database-bound dynamic content Web servers. Our method is both transparent, requiring no modification to the software components, and external, permitting an implementation in a separate proxy. Admission control prevents overloading the database server. We implement admission control by estimating the amount of work that each request imposes on the system. A request is admitted only when it does not drive the system into overload. Request scheduling improves average response time. We exploit the variability in the workload by using shortest-job-first scheduling, which reorders the pending requests to reduce the average response time. We evaluate these techniques experimentally using the TPC-W benchmark. We show consistent performance during overload. Moreover, the average response time improves by up to a factor of 14, and peak throughput increases by up to 10 percent.
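The admission-control and scheduling mechanisms of the second item can be sketched as an external proxy that caps the estimated outstanding work on the database and dequeues the cheapest pending request first. The per-request-type cost estimates and the capacity figure are illustrative numbers, not measurements from the thesis.

```python
import heapq

CAPACITY = 100.0                                    # illustrative work cap
COST = {"search": 1.0, "order": 4.0, "best_sellers": 20.0}  # illustrative

class Proxy:
    def __init__(self):
        self.pending = []          # min-heap keyed by estimated cost
        self.outstanding = 0.0     # admitted but unfinished work
        self.seq = 0               # tie-breaker for equal costs

    def submit(self, kind):
        cost = COST[kind]
        if self.outstanding + cost > CAPACITY:
            return False           # admitting this would overload the DB
        self.outstanding += cost
        self.seq += 1
        heapq.heappush(self.pending, (cost, self.seq, kind))
        return True

    def next_request(self):
        cost, _, kind = heapq.heappop(self.pending)  # shortest job first
        return kind

    def finished(self, kind):
        self.outstanding -= COST[kind]

p = Proxy()
for r in ["best_sellers", "search", "order"]:
    p.submit(r)
print(p.next_request())            # 'search' runs first despite arriving later
```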
Item: Distributed system fault tolerance using message logging and checkpointing (1990)
Johnson, David Bruce; Zwaenepoel, Willy
Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required by these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state. The first method presented uses a new pessimistic message logging protocol called sender-based message logging. Each message is logged in the local volatile memory of the machine from which it was sent, and the order in which the message was received is returned to the sender as a receive sequence number. Message logging overlaps execution of the receiver until the receiver attempts to send a new message. Implemented in the V-System, the maximum measured failure-free overhead on distributed application programs was under 16 percent, and the average overhead measured 2 percent or less, depending on problem size and communication intensity. Optimistic message logging can outperform pessimistic logging, since message logging occurs asynchronously. A new optimistic message logging system is presented that guarantees to find the maximum possible recoverable system state, which is not ensured by previous optimistic methods. All logged messages and checkpoints are utilized, and thus some messages received by a process before it was checkpointed may not need to be logged. Although failure recovery using optimistic message logging is more difficult, failure-free application overhead using this method ranged from a maximum of under 4 percent to much less than 1 percent.
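The sender-based logging handshake can be sketched as follows: the sender logs each message in its own volatile memory, the receiver hands back a receive sequence number (RSN) recording delivery order, and the sender attaches the RSN to the log entry so the receiver's messages can be replayed in order after a failure. The class and method names are illustrative, not the V-System implementation.

```python
class Sender:
    def __init__(self):
        self.log = {}              # msg_id -> [payload, rsn]
        self.next_id = 0

    def send(self, receiver, payload):
        msg_id, self.next_id = self.next_id, self.next_id + 1
        self.log[msg_id] = [payload, None]     # logged before delivery
        receiver.deliver(self, msg_id, payload)

    def record_rsn(self, msg_id, rsn):
        self.log[msg_id][1] = rsn              # entry is now replayable

    def replay(self, redeliver):
        """Resend fully logged messages in the order they were received."""
        for rsn, payload in sorted((r, p) for p, r in self.log.values()
                                   if r is not None):
            redeliver(rsn, payload)

class Receiver:
    def __init__(self):
        self.next_rsn = 0

    def deliver(self, sender, msg_id, payload):
        rsn, self.next_rsn = self.next_rsn, self.next_rsn + 1
        sender.record_rsn(msg_id, rsn)         # RSN returned to the sender
        # ... process payload ...

s, r = Sender(), Receiver()
s.send(r, "m0"); s.send(r, "m1")
s.replay(lambda rsn, p: print("replay", rsn, p))
```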
Item: Efficient distributed shared memory based on multi-protocol release consistency (1994)
Carter, John Bruce; Zwaenepoel, Willy
A distributed shared memory (DSM) system allows shared-memory parallel programs to be executed on distributed-memory multiprocessors. The challenge in building a DSM system is to achieve good performance over a wide range of shared-memory programs without requiring extensive modifications to the programs. The performance challenge translates into reducing the amount of communication performed by the DSM system to that performed by an equivalent message-passing program. This thesis describes four novel techniques for reducing the communication overhead of DSM: (i) the use of software release consistency, (ii) support for multiple consistency protocols, (iii) a multiple-writer protocol, and (iv) an update timeout mechanism. Release consistency allows modifications of shared data to be handled via a delayed update queue, which masks network latencies. Providing multiple consistency protocols allows each shared variable to be kept consistent using a protocol well suited to the way it is accessed. A multiple-writer protocol addresses the serious problem of false sharing by reducing the amount of unnecessary communication performed to keep falsely shared data consistent. The update timeout mechanism reduces the impact of updates to stale data. These techniques have been implemented in the Munin DSM system. The impact of these features is evaluated by comparing the performance of a collection of shared-memory programs running under Munin with equivalent message-passing and conventional DSM programs. Over half of the shared-memory programs achieved at least 95% of the speedup of their message-passing equivalents. For the other programs, the performance bottlenecks were removed via minor program modifications. Furthermore, Munin programs achieved from 25% to over 100% higher speedups than equivalent conventional DSM programs when there was a high degree of sharing. The results indicate that DSM can be a viable alternative to message passing if the amount of unnecessary communication is minimized.
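Multiple-writer protocols of this kind are commonly realized by twinning and diffing: before its first write a processor copies the page, and at release it propagates only the bytes that differ from the copy, so concurrent writers to a falsely shared page do not ship whole pages back and forth. The sketch below uses byte granularity as a stand-in for whatever granularity the real system uses.

```python
def diff(page, twin_copy):
    """Runs of changed bytes, as (offset, data) pairs."""
    changes, run = [], None
    for i, (new, old) in enumerate(zip(page, twin_copy)):
        if new != old:
            if run is None:
                run = (i, bytearray())
            run[1].append(new)
        elif run is not None:
            changes.append((run[0], bytes(run[1])))
            run = None
    if run is not None:
        changes.append((run[0], bytes(run[1])))
    return changes

def apply_diff(page, changes):
    for offset, data in changes:
        page[offset:offset + len(data)] = data

# Two writers modify disjoint parts of one page; merging their diffs
# yields the combined result without shipping whole pages.
base = bytearray(16)                    # the twin, copied on first write
a, b = bytearray(base), bytearray(base)
a[0:2], b[8:10] = b"AA", b"BB"
merged = bytearray(base)
apply_diff(merged, diff(a, base))
apply_diff(merged, diff(b, base))
print(merged)
```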
Item: Extended queuing network modeling (1985)
Doshi, Kshitij Arun; Sinclair, James B.; Briggs, Faye A.; Zwaenepoel, Willy
Evaluating the performance of a system is of central concern in making engineering decisions. When direct measurement of performance is not possible or feasible, evaluation consists of two phases: specification of an appropriate performance model, and evaluation of the model to obtain the performance measures. Broadly, a performance model can be evaluated by exact or approximate analysis, or by simulation. A class of models popular for the evaluation of a number of systems, computer systems in particular, is that of Extended Queuing Network (EQN) models. Software tools are typically used for building EQN models for evaluation through analysis or simulation. This thesis describes an effort in experimenting with an approach to the design and implementation of a tool for performance evaluation of EQN models via simulation. The objective of this effort is to design a tool that is easy and intuitive to use, yet versatile and powerful in its modeling capabilities. The tool we have implemented is called the Graphical Input Simulation Tool (GIST). GIST meets its design objectives by (1) providing a pair of user interfaces that are capable of accepting the abstract EQN model specification directly, are easy and intuitive to learn and use, and are helpful in quick model specification with reduced likelihood of semantic and syntactic specification errors, and (2) incorporating into the set of EQN objects it provides the capabilities perceived necessary for realistic modeling of activities that characterize the systems of interest.

Item: A Flexible and Efficient Application Programming Interface (API) for a Customizable Proxy Cache (2003-03-20)
Pai, Vivek S.; Cox, Alan; Pai, Vijay S.; Zwaenepoel, Willy
This paper describes the design, implementation, and performance of a simple yet powerful Application Programming Interface (API) for providing extended services in a proxy cache. This API facilitates the development of customized content adaptation, content management, and specialized administration features. We have developed several modules that exploit this API to perform various tasks within the proxy, including a module to support the Internet Content Adaptation Protocol (ICAP) without any changes to the proxy core. The API design parallels those of high-performance servers, enabling its implementation to have minimal overhead on a high-performance cache. At the same time, it provides the infrastructure required to process HTTP requests and responses at a high level, shielding developers from low-level HTTP and socket details and enabling modules that perform interesting tasks without significant amounts of code. We have implemented this API in the portable and high-performance iMimic DataReactor™ proxy cache. We show that implementing the API imposes negligible performance overhead and that realistic content-adaptation services achieve high performance levels without substantially hindering a background benchmark load running at a high throughput level.

Item: Improving TLB Miss Handling with Page Table Pointer Caches (1997-12-16)
Wu, Michael; Zwaenepoel, Willy
Page table pointer caches (PTPCs) are a hardware supplement to TLBs that cache pointers to pages of page table entries rather than the page table entries themselves. A PTPC traps and handles most TLB misses in hardware with low overhead (usually a single memory access). PTPC misses are filled in software, allowing for an easy hardware implementation, similar in structure to a TLB. Since each PTPC entry refers to an entire page of page table entries, even a small PTPC maps a large amount of address space and achieves a very high hit rate. The primary goal of a PTPC is to lower TLB miss handling penalties. The combination of a TLB with a small PTPC provides good performance even in situations where standard TLBs alone perform badly (large workloads or multimedia applications). The advantage of this design is that we can continue to use small fixed-size pages with standard TLBs. Since PTPCs use traditional page table structures and page sizes, they are very simple to implement in hardware and require minimal operating system modifications. Our simulations show that the addition of a PTPC to a system with a TLB can reduce miss handling costs by nearly an order of magnitude. Small PTPCs are extremely effective, and the combination of small to medium-sized TLBs coupled with small PTPCs is an efficient alternative to large TLBs.
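The PTPC lookup path can be sketched in a few lines: on a TLB miss, probe the pointer cache with the upper bits of the virtual page number; a hit yields the page of page table entries, so fetching the PTE costs a single memory access, and only a PTPC miss traps to software. The page and table sizes below are illustrative.

```python
PAGE_SHIFT = 12          # 4 KB pages (illustrative)
PTES_PER_PAGE = 1024     # PTEs per page-table page (illustrative)

tlb = {}                 # vpn -> pte
ptpc = {}                # vpn // PTES_PER_PAGE -> page of PTEs

def walk_page_table(group):
    """Software miss handler: locate the page of PTEs (stubbed here)."""
    return {vpn: f"pte[{vpn}]"
            for vpn in range(group * PTES_PER_PAGE,
                             (group + 1) * PTES_PER_PAGE)}

def translate(vaddr):
    vpn = vaddr >> PAGE_SHIFT
    if vpn in tlb:
        return tlb[vpn]                    # TLB hit
    group = vpn // PTES_PER_PAGE
    if group not in ptpc:                  # only this case traps to software
        ptpc[group] = walk_page_table(group)
    pte = ptpc[group][vpn]                 # one memory access on a PTPC hit
    tlb[vpn] = pte
    return pte

# The second miss lands in the same PTE page, so it hits the PTPC.
print(translate(0x400000), translate(0x401000))
```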
Item: IO-lite: A copy-free UNIX I/O system (1997)
Pai, Vivek Sadananda; Zwaenepoel, Willy
Memory copy speed is known to be a significant barrier to high-speed communication. We perform an analysis of the requirements for a copy-free buffer system, develop an implementation-independent application programming interface (API) based on those requirements, and then implement a system that conforms to the API. In addition, we design and implement a fully copy-free filesystem cache. Performance tests indicate that our system dramatically outperforms traditional systems on communications-oriented tasks, by a factor of 2 to 10. Application programs that have been modified to utilize our copy-free system have also shown reductions in run time, ranging from 10% to nearly 50%.

Item: IO-Lite: A unified I/O buffering and caching system (1997-10-27)
Druschel, Peter; Pai, Vivek; Zwaenepoel, Willy
This paper presents the design, implementation, and evaluation of IO-Lite, a unified I/O buffering and caching system. IO-Lite unifies all buffering and caching in the system, to the extent permitted by the hardware. In particular, it allows applications, interprocess communication, the file system, the file cache, and the network subsystem to share a single physical copy of the data safely and concurrently. Protection and security are maintained through a combination of access control and read-only sharing. The various subsystems use (mutable) buffer aggregates to access the data according to their needs. IO-Lite eliminates all copying and multiple buffering of I/O data, and enables various cross-subsystem optimizations. Performance measurements show significant performance improvements on Web servers and other I/O-intensive applications.
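The buffer-aggregate abstraction can be sketched as immutable buffers referenced by mutable lists of (buffer, offset, length) tuples, so that concatenating or slicing data never copies payload bytes. The class and method names below are invented for illustration, not IO-Lite's C API.

```python
class BufAgg:
    def __init__(self, slices=None):
        self.slices = list(slices or [])   # (buffer, offset, length) tuples

    @classmethod
    def from_bytes(cls, data):
        return cls([(bytes(data), 0, len(data))])   # immutable backing buffer

    def append(self, other):
        self.slices += other.slices        # copy-free concatenation

    def prefix(self, n):
        """First n bytes as a new aggregate, still without copying."""
        out = []
        for buf, off, length in self.slices:
            take = min(n, length)
            out.append((buf, off, take))
            n -= take
            if n == 0:
                break
        return BufAgg(out)

    def materialize(self):
        """Copy only when the data finally leaves the system."""
        return b"".join(buf[off:off + n] for buf, off, n in self.slices)

# Build an HTTP-style response from shared pieces without copying them.
hdr = BufAgg.from_bytes(b"HTTP/1.0 200 OK\r\n\r\n")
body = BufAgg.from_bytes(b"hello world")
resp = BufAgg()
resp.append(hdr)
resp.append(body)
print(resp.prefix(17).materialize())
```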