Reliable parallel computing on clusters of multiprocessors

dc.contributor.advisor: Bennett, John K. [en_US]
dc.creator: Abdel-Shafi, Hazim M. [en_US]
dc.date.accessioned: 2009-06-04T08:22:28Z [en_US]
dc.date.available: 2009-06-04T08:22:28Z [en_US]
dc.date.issued: 2000 [en_US]
dc.description.abstract: This dissertation describes the design, implementation, and performance of two mechanisms that address reliability and system management problems associated with parallel computing clusters: thread migration and checkpoint/recovery. A unique aspect of this work is the integration of these two mechanisms. Although there has been considerable prior work on each of these mechanisms in isolation, their integration offers synergistic benefits to both functionality and performance. Used in conjunction, these mechanisms facilitate failure recovery as well as node addition and removal, with minimal disruption of executing applications. Our implementation differs from previous work in the following ways. First, by using thread migration instead of process migration, the overhead of moving computation among nodes is reduced. Second, because our implementation of checkpoint/recovery separates computation and data, it is possible to distribute data and threads among other nodes during recovery. This is possible because the underlying support for thread migration in the system allows the recovery of a thread from any checkpoint on any node. Third, our implementation does not require repartitioning of a running parallel application when resources are added or removed. Finally, the checkpoint/recovery and thread migration mechanisms are both implemented at user level. The benefits of a user-level implementation include ease of development, since operating system source code is not required, adaptability to other platforms, and simple upgrades to new versions of the underlying operating system and hardware. The prototype implementation described in this thesis was developed as an extension to the Brazos software distributed shared memory system. Brazos allows multithreaded parallel applications to execute on networks of multiprocessor servers running the Windows NT/2000 operating system. [en_US]
dc.format.extent: 132 p. [en_US]
dc.format.mimetype: application/pdf [en_US]
dc.identifier.callno: THESIS E.E. 2000 ABDEL-SHAFI [en_US]
dc.identifier.citation: Abdel-Shafi, Hazim M. "Reliable parallel computing on clusters of multiprocessors." (2000) Diss., Rice University. https://hdl.handle.net/1911/19462. [en_US]
dc.identifier.uri: https://hdl.handle.net/1911/19462 [en_US]
dc.language.iso: eng [en_US]
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. [en_US]
dc.subject: Electronics [en_US]
dc.subject: Electrical engineering [en_US]
dc.subject: Computer science [en_US]
dc.title: Reliable parallel computing on clusters of multiprocessors [en_US]
dc.type: Thesis [en_US]
dc.type.material: Text [en_US]
thesis.degree.department: Electrical Engineering [en_US]
thesis.degree.discipline: Engineering [en_US]
thesis.degree.grantor: Rice University [en_US]
thesis.degree.level: Doctoral [en_US]
thesis.degree.name: Doctor of Philosophy [en_US]
Files
Original bundle
Name: 9969222.PDF
Size: 4.38 MB
Format: Adobe Portable Document Format