A Characterization of Compound Documents on the Web

dc.contributor.author: Lara, Eyal de (en_US)
dc.contributor.author: Wallach, Dan S. (en_US)
dc.contributor.author: Zwaenepoel, Willy (en_US)
dc.date.accessioned: 2017-08-02T22:03:48Z (en_US)
dc.date.available: 2017-08-02T22:03:48Z (en_US)
dc.date.issued: 1999-11-29 (en_US)
dc.date.note: November 29, 1999 (en_US)
dc.description.abstract: Recent developments in office productivity suites make it easier for users to publish rich compound documents on the Web. Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web's content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service. In order to guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents originated by the three most popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935 different Web sites. Our main conclusions are: Compound documents are in general much larger than current HTML documents. For large documents, embedded objects and images make up a large part of the documents' size. For small documents, XML format produces much larger documents than OLE. For large documents, there is little difference. Compression considerably reduces the size of documents in both formats. (en_US)
dc.format.extent: 14 pp (en_US)
dc.identifier.citation: Lara, Eyal de, Wallach, Dan S. and Zwaenepoel, Willy. "A Characterization of Compound Documents on the Web." (1999) https://hdl.handle.net/1911/96514. (en_US)
dc.identifier.digital: TR99-351 (en_US)
dc.identifier.uri: https://hdl.handle.net/1911/96514 (en_US)
dc.language.iso: eng (en_US)
dc.rights: You are granted permission for the noncommercial reproduction, distribution, display, and performance of this technical report in any format, but this permission is only for a period of forty-five (45) days from the most recent time that you verified that this technical report is still available from the Computer Science Department of Rice University under terms that include this permission. All other rights are reserved by the author(s). (en_US)
dc.title: A Characterization of Compound Documents on the Web (en_US)
dc.type: Technical report (en_US)
dc.type.dcmi: Text (en_US)
Files
Original bundle
Name: TR99-351.pdf
Size: 224.4 KB
Format: Adobe Portable Document Format