Communication Optimizations for Distributed-Memory X10 Programs

Barik, Rajkishore; Budimlić, Zoran; Grove, David; Peshansky, Igor; Sarkar, Vivek; Zhao, Jisheng

dc.contributor.author	Barik, Rajkishore	en_US
dc.contributor.author	Budimlić, Zoran	en_US
dc.contributor.author	Grove, David	en_US
dc.contributor.author	Peshansky, Igor	en_US
dc.contributor.author	Sarkar, Vivek	en_US
dc.contributor.author	Zhao, Jisheng	en_US
dc.date.accessioned	2017-08-02T22:03:08Z	en_US
dc.date.available	2017-08-02T22:03:08Z	en_US
dc.date.issued	2010-04-10	en_US
dc.date.note	April 10, 2010	en_US
dc.description.abstract	X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and communication structures with higher productivity than other distributed-memory programming models. However, this productivity often comes at the cost of high performance overhead when the language is used in its full generality. This paper introduces high-level compiler optimizations and transformations to reduce communication and synchronization overheads in distributed-memory implementations of X10 programs. Specifically, we focus on locality optimizations such as scalar replacement and task localization, combined with supporting transformations such as loop distribution, scalar expansion, loop tiling, and loop splitting. We have completed a prototype implementation of these high-level optimizations, and performed a performance evaluation that shows significant improvements in performance, scalability, communication volume and number of tasks. We evaluated the communication optimizations on three platforms: a 128-node BlueGene/P cluster, a 32-node Nehalem cluster, and a 16-node Power7 cluster. On the BlueGene/P cluster, we observed a maximum performance improvement of 31.46× relative to the unoptimized case (for the MolDyn benchmark). On the Nehalem cluster, we observed a maximum performance improvement of 3.01× (for the NQueens benchmark) and on the Power7 cluster, we observed a maximum performance improvement of 2.73× (for the MolDyn benchmark). In addition, there was no case in which the optimized code was slower than the unoptimized case. 
We also believe that the optimizations presented in this paper will be necessary for any high-productivity PGAS language based on modern object-oriented principles that is designed for execution on future Extreme Scale systems that place a high premium on locality improvement for performance and energy efficiency.	en_US
dc.format.extent	13 pp	en_US
dc.identifier.citation	Barik, Rajkishore, Budimlić, Zoran, Grove, David, et al. "Communication Optimizations for Distributed-Memory X10 Programs." (2010) https://hdl.handle.net/1911/96389.	en_US
dc.identifier.digital	TR10-09	en_US
dc.identifier.uri	https://hdl.handle.net/1911/96389	en_US
dc.language.iso	eng	en_US
dc.rights	You are granted permission for the noncommercial reproduction, distribution, display, and performance of this technical report in any format, but this permission is only for a period of forty-five (45) days from the most recent time that you verified that this technical report is still available from the Computer Science Department of Rice University under terms that include this permission. All other rights are reserved by the author(s).	en_US
dc.title	Communication Optimizations for Distributed-Memory X10 Programs	en_US
dc.type	Technical report	en_US
dc.type.dcmi	Text	en_US

Files

Original bundle

Name: TR10-09.pdf
Size: 818.5 KB
Format: Adobe Portable Document Format

Collections

Computer Science Technical Reports
Center for Research Computing