A Characterization of Compound Documents on the Web
Date
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Recent developments in office productivity suites make it easier for users to publish rich {\em compound documents/} on the Web. Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web's content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service. In order to guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents originated by the three most popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935different Web sites. Our main conclusions are: Compound documents are in general much larger than current HTML documents. For large documents, embedded objects and images make up a large part of the documents' size. For small documents, XML format produces much larger documents than OLE. For large documents, there is little difference. Compression considerably reduces the size of documents in both formats.
Description
Advisor
Degree
Type
Keywords
Citation
Lara, Eyal de, Wallach, Dan S. and Zwaenepoel, Willy. "A Characterization of Compound Documents on the Web." (1999) https://hdl.handle.net/1911/96514.