Browsing by Author "Warburton, Timothy"
Now showing 1 - 14 of 14
Results Per Page
Sort Options
Item Accelerated Plane-wave Discontinuous Galerkin for Heterogeneous Scattering Problems(2015-04-23) Atcheson, Thomas Reid; Warburton, Timothy; Symes, William; Sorensen, Dan; Sarkar, VivekThis thesis considers algorithmic and computational acceleration of numerical wave modelling at high frequencies. Numerical propagation of linear waves at high frequencies poses a significant challenge to modern simulation techniques. Despite the fact that potential practical benefits led a great deal of attention to this problem, current research has yet to provide a general and performant method to solve it. I consider finite element as a possible solution because it can handle geometric complex- ity of heterogeneous domains, but unfortunately it suffers from the “pollution” effect which imposes a prohobitively large memory requirement to handle high frequencies. One recent step towards enabling the finite element method to solve high frequency wave propagation in the frequency domain involves using a plane-wave basis rather than the standard polynomial basis. This allows highly compressed representations of scattering waves but otherwise appeared to limit users to nearly-homogeneous prob- lems. This thesis explores the use of plane-waves in a discontinuous Galerkin method (PWDG) for highly heterogeneous problems possibly containing a point source. The low-memory nature of PWDG and the fact that its expressions can be computed in an entirely symbolic manner without quadratures furthermore permits an efficient graphics processing unit (GPU) implementation such that problems with very high frequencies can be solved on a single workstation. This thesis includes computational results demonstrating results for frequencies in excess of 100 hertz on the Marmousi model, solved using only a single GPU.Item Explicit Discontinuous Galerkin Methods for Linear Hyperbolic Problems(2013-11-14) Atcheson, Thomas; Warburton, Timothy; Symes, William W.; Embree, MarkDiscontinuous Galerkin methods have many features which make them a natural candidate for the solution of hyperbolic problems. One feature is flexibility with the order of approximation; a user with knowledge of the solution's regularity can increase the spatial order of approximation by increasing the polynomial order of the discontinuous Galerkin method. A marked increase in time-stepping difficulty, known as stiffness, often accompanies this increase in spatial order however. This thesis analyzes two techniques for reducing the impact of this stiffness on total time of simulation. The first, operator modification, directly modifies the high order method in a way that retains the same formal order of accuracy, but reduces the stiffness. The second, optimal Runge-Kutta methods, adds additional stages to Runge-Kutta methods and modifies them to customize their stability region to the problem. Three operator modification methods are analyzed analytically and numerically, the mapping technique of Kosloff/Tal-Ezer the covolume filtering technique of Warburton/Hagstrom , and the flux filtering technique of Chalmers, et al. . The covolume filtering and flux filtering techniques outperform mapping in that they negligibly impact accuracy but yield a reasonable improvement in efficiency. For optimal Runge-Kutta methods this thesis considers five top performing methods from the literature on hyperbolic problems and applies them to an unmodified method, a flux filtered method, and a covolume filtered method. Gains of up to 80\% are seen for covolume filtered solutions applied with optimal Runge-Kutta methods, showing the potential for efficient high order solutions of unsteady systems.Item gNek: A GPU Accelerated Incompressible Navier Stokes Solver(2013-09-16) Stilwell, Nichole; Warburton, Timothy; Riviere, Beatrice M.; Embree, MarkThis thesis presents a GPU accelerated implementation of a high order splitting scheme with a spectral element discretization for the incompressible Navier Stokes (INS) equations. While others have implemented this scheme on clusters of processors using the Nek5000 code, to my knowledge this thesis is the first to explore its performance on the GPU. This work implements several of the Nek5000 algorithms using OpenCL kernels that efficiently utilize the GPU memory architecture, and achieve massively parallel on chip computations. These rapid computations have the potential to significantly enhance computational fluid dynamics (CFD) simulations that arise in areas such as weather modeling or aircraft design procedures. I present convergence results for several test cases including channel, shear, Kovasznay, and lid-driven cavity flow problems, which achieve the proven convergence results.Item GPU-accelerated discontinuous Galerkin methods on hybrid meshes: applications in seismic imaging(2017-04-20) Wang, Zheng; Warburton, Timothy; Gillman, AdriannaSeismic imaging is a geophysical technique assisting in the understanding of subsurface structure on a regional and global scale. With the development of computer technology, computationally intensive seismic algorithms have begun to gain attention in both academia and industry. These algorithms typically produce high-quality subsurface images or models, but require intensive computations for solving wave equations. Achieving high-fidelity wave simulations is challenging: first, numerical wave solutions may suffer from dispersion and dissipation errors in long-distance propagations; second, the efficiency of wave simulators is crucial for many seismic applications. High-order methods have advantages of decreasing numerical errors efficiently and hence are ideal for wave modelings in seismic problems. Various high order wave solvers have been studied for seismic imaging. One of the most popular solvers is the finite difference time domain (FDTD) methods. The strengths of finite difference methods are the computational efficiency and ease of implementation, but the drawback of FDTD is the lack of geometric flexibility. It has been shown that standard finite difference methods suffer from first order numerical errors at sharp media interfaces. In contrast to finite difference methods, discontinuous Galerkin (DG) methods, a class of high-order numerical methods built on unstructured meshes, enjoy geometric flexibility and smaller interface errors. Additionally, DG methods are highly parallelizable and have explicit semi-discrete form, which makes DG suitable for large-scale wave simulations. In this dissertation, the discontinuous Galerkin methods on hybrid meshes are developed and applied to two seismic algorithms---reverse time migration (RTM) and full waveform inversion (FWI). This thesis describes in depth the steps taken to develop a forward DG solver for the framework that efficiently exploits the element specific structure of hexahedral, tetrahedral, prismatic and pyramidal elements. In particular, we describe how to exploit the tensor-product property of hexahedral elements, and propose the use of hex-dominant meshes to speed up the computation. The computational efficiency is further realized through a combination of graphics processing unit (GPU) acceleration and multi-rate time stepping. As DG methods are highly parallelizable, we build the DG solver on multiple GPUs with element-specific kernels. Implementation details of memory loading, workload assignment and latency hiding are discussed in the thesis. In addition, we employ a multi-rate time stepping scheme which allows different elements to take different time steps. This thesis applies DG schemes to RTM and FWI to highlight the strengths of the DG methods. For DG-RTM, we adopt the boundary value saving strategy to avoid data movement on GPUs and utilize the memory load in the temporal updating procedure to produce images of higher qualities without a significant extra cost. For DG-FWI, a derivation of the DG-specific adjoint-state method is presented for the fully discretized DG system. Finally, sharp media interfaces are inverted by specifying perturbations of element faces, edges and vertices.Item Hermite Methods for the Simulation of Wave Propagation(2017-04-20) Vargas, Arturo; Warburton, Timothy; Gillman, AdriannaSimulations of wave propagation play a crucial role in science and engineering. In applications of geophysics, they are the engine of many seismic imaging algorithms. For electrical engineers, they can be a useful tool for the design of radars and antennas. In these applications achieving high fidelity, simulations are challenging due to the inherent issues in modeling highly oscillatory waves and the associated high computational cost of high-resolution simulations. Thus the ideal numerical method should be able to capture high-frequency waves and be suitable for parallel computing. In both seismic applications and computational electromagnetics the Yee scheme, a finite difference time domain (FDTD) method, is the method of choice for structured grids. The scheme has the benefit of being easy to implement but performs poorly in the presence of high-frequency waves. High order accurate FDTD methods may be derived but ultimately rely on neighboring grid points when approximating derivative. In contrast to FDTD methods, the Hermite methods of Goodrich and co-authors (2006) use Hermite interpolation and a staggered (dual) grid to construct high order accurate numerical methods for first order hyperbolic equations. These methods achieve high order approximations in both time and space by reconstructing local polynomials within cells of the computational domain and employing Hermite-Taylor time stepping. The resulting schemes are able to evolve the solution locally within a cell making them ideal for parallel computing. Building on the original Hermite methods this thesis focuses on two goals: (1) the development of new Hermite methods and (2) their implementation on modern computing architectures. To accomplish the first objective, this thesis presents two variations of Hermite methods which are designed to simplify the scheme while preserving the favorable features. The first variation is a family of Hermite methods which do not require a dual grid. These methods eliminate the need for storing dual coefficients while maintaining optimal convergence rates. The second type of variation are Hermite methods which use leapfrog time-stepping. These schemes propagate the solution with less computation than the original scheme and may be used for either first or second order equations. To address the second objective, this thesis presents algorithms which take advantage of the many-core architecture of graphics processing units (GPU). As three-dimensional simulations can easily exceed the memory of a single GPU, techniques for partitioning the data across multiple GPUs are presented. Finally, this thesis presents numerical results and performance studies which confirm the accuracy and efficiency of the proposed Hermite methods for linear and nonlinear wave equations.Item High order discontinuous Galerkin methods for simulating miscible displacement process in porous media with a focus on minimal regularity(2015-04-20) Li, Jizhou; Riviere, Beatrice M.; Symes, William; Hirasaki, George; Warburton, Timothy; Heinkenschloss, MatthiasIn my thesis, I formulate, analyze and implement high order discontinuous Galerkin methods for simulating miscible displacement in porous media. The analysis concerning the stability and convergence under the minimal regularity assumption is established to provide theoretical foundations for using discontinuous Galerkin discretization to solve miscible displacement problems. The numerical experiments demonstrate the robustness and accuracy of the proposed methods. The performance study for large scale simulations with highly heterogeneous porous media suggests strong scalability which indicates the efficiency of the numerical algorithm. The simulations performed using the algorithms for physically unstable flow show that higher order methods proposed in thesis are more suitable for simulating such phenomenon than the commonly used cell-center finite volume method.Item High performance high-order numerical methods: applications in ocean modeling(2015-08-27) Gandham, Rajesh; Warburton, Timothy; Symes, William; Bradshaw, Stephen; Beatrice, RiviereThis thesis presents high-order numerical methods for time-dependent simulations of oceanic wave propagation on modern many-core hardware architecture. Simulation of the waves such as tsunami, is challenging because of the varying fluid depths, propagation in many regions, requirement of high resolution near the shore, complex nonlinear wave phenomenon, and necessity of faster than real-time predictions. This thesis addresses issues related to stability, accuracy, and efficiency of the numerical simulation of these waves. For the simulation of tsunami waves, a two-dimensional nonlinear shallow water PDE model is considered. Discontinuous Galerkin (DG) methods on unstructured triangular meshes are used for the numerical solution of the model. These methods are not stable for nonlinear problems. To address the stability of these methods, a total variational bounded slope limiter in conjunction with a positive preserving scheme is developed, in particular for unstructured triangular meshes. Accuracy and stability of the methods are verified with test cases found in literature. These methods are also validated using 2004 Indian Ocean tsunami data to demonstrate faster than real-time simulation capability for practical problems using a commodity workstation. For accurate modeling of tsunami and ocean waves, in general, a three-dimensional hydrostatic incompressible Navier-Stokes model along with free surface conditions is considered. DG discretizations on unstructured prismatic elements are used for the numerical solutions. These prismatic elements are obtained by extruding the triangular meshes from ocean free surface to the ocean bottom. The governing equations are represented in a fixed sigma coordinate reference system. The limiting procedure, time-stepping method, accelerated implementations are adopted from two-dimensional formulations. The runtime performance of this three-dimensional method is compared with the performance of the two-dimensional shallow water model, to give an estimate of computational overhead in moving forward to three-dimensional models in practical ocean modeling applications. A GPU accelerated unsmooth aggregation algebraic method is developed. Algebraic multi-grid method is used as a linear solver in many engineering applications such as computational fluid dynamics. The developed method involves a setup stage and a solution stage. This method is parallelized for both stages unlike most of the methods that are parallelized only for the solution stage. Efficiency of the setup is crucial in these applications since the setup has to be performed many times. The efficiency of the method is demonstrated using a sequence of downsized problems. The computational kernels are expressed in an extensive multi-threading library OCCA. Using OCCA, the developed implementations achieve portability across various hardware architectures such as GPUs, CPUs, and multi-threading programming models OpenCL, CUDA, and OpenMP. The optimal performance of these kernels across various thread models and hardware architecture is compared.Item Locality Transformations of Computation and Data for Portable Performance(2014-08-21) Sharma, Kamal Gopal; Sarkar, Vivek; Mellor-Crummey, John; Warburton, TimothyRecently, multi-cores chips have become omnipresent in computer systems ranging from high-end servers to mobile phones. A variety of multi-core architectures have been developed which vary in the number of cores on a single die, cache hierarchies within a chip and interconnect across chips. This diversity of architectures is likely to continue for the foreseeable future. With these architectural variations, performance tuning of an application becomes a major challenge for the programmer. Code optimizations developed for one architecture may have a different impact, even negative, across other architectures due to the differences in tuning parameters. This problem is compounded when scaling from a single node to multiple nodes in a cluster. Our thesis is that significant performance benefits can be obtained from locality transformations of computation and data that require minimal programmer effort. We establish this thesis by addressing three specific problems — tile size selection, data layout optimization and automatic selection of distribution functions. Tile size selection and data layout optimization improve intra-node performance, whereas automatic selection of distribution functions also enhances inter-node performance. Loop tiling is a well-known locality optimization technique that can increase cache reuse at different levels of a memory hierarchy. Choosing tile sizes for different loop nests is a non-trivial task for a programmer due to the large parameter search space and different architectural parameters for different platforms. In this dissertation, we present analytical bounds for tile size selection. Our approach uses different architectural parameters such as cache size and TLB size to limit the tile size search space, and also includes an automated algorithm to select optimized tile sizes. Another important locality optimization is data layout transformation, which improves cache performance by restructuring program data. An important challenge for a programmer is to select an efficient data layout for a given application and target platform. In the past, selecting a data layout has been limited by the cost of rewriting applications for different layouts and by the effort required to find an optimized layout for a given platform. In this work, we present an automated tool for implementing different layouts for a given program. Using our approach, a programmer can experiment with different layouts across varying platforms for efficient performance. We also provide an automatic algorithm to select an optimized data layout for a given program and its representative executions. When mapping an application onto multiple nodes, one challenge for scalable performance is how to distribute computation and data across the nodes. In this work, we introduce an algorithm to automatically select a task/data distribution function for a set of nodes. We build this solution on Intel CnC’s distributed runtime system. Overall, our approaches for the three problems provide automated methods for locality optimization that enable portable performance while requiring minimal programmer effort.Item Modeling Laser-Induced Thermotherapy in Biological Tissue(2015-05-01) Wang, Zheng; Warburton, Timothy; Riviere, Beatrice M.; Symes, William W.This thesis studies simulations of laser-induced thermotherapy (LITT), a minimally-invasive procedure which ablates cancerous tissue using laser heating. In order to predict this procedure, mathematical models are used to assist in treatment. By simulating the laser heating of tissue, surgeons may estimate regions of tissue exceeding a thermal damage threshold. One important component of the LITT model is laser simulation, which is typically characterized by the radiative transfer equation (RTE). The RTE is a time-dependent integro-differential equation with variables in both angular and physical spaces. In this thesis, we conduct numerical experiments using both discrete ordinate and Galerkin methods. The former discretizes a finite number of directions using finite difference methods, while the latter employs continuous functions for both angular and spatial discretizations. Numerical results indicate that the numerical errors in both methods are dominated by the error in the less restricted space. In addition, the discrete ordinate method suffers from the ray effect, in which isotropic scattering is violated, whereas in the Galerkin method, the ray effect is not observed.Item OCCA: A Unified Approach to Multi-Threading Languages(2014-10-23) Medina, David; Warburton, Timothy; Riviere, Beatrice; Symes, William WWith the current trend of using co-processors for accelerating computations, we are presented with architectures and corresponding programming languages. The inability to predict lasting languages and architectures has led to the development of distinct languages and standards. This thesis details my work on occa, a unified threading language presented as a portable solution to hardware-accelerated coding that combines aspects of OpenMP, OpenCL, and CUDA. With the similarities between OpenMP, OpenCL and CUDA, I present a macro-based approach on a unified kernel language that currently encompasses OpenMP, OpenCL and CUDA. Along with kernel generation, occa includes an API (application programming interface) which serves as a wrapper on the three multi-threading languages. The back-end on occa dynamically compiles and loads function objects for a flexible run-time environment to use different hardware architectures. Computational results using a spectrum of methods, namely finite difference, spectral element and discontinuous Galerkin methods, utilizing occa are shown to deliver portable high performance on different architectures and platforms. The finite difference method chapter reverse engineers optimized code written in CUDA and used in industry, discusses distinct features available in CUDA and compares occa implementations using different optimization techniques. The spectral element method and discontinuous Galerkin methods are derived from two projects I worked on during my studies: gNek, a distributed high-order spectral element method (SEM) implementation for the incompressible Navier-Stokes equations, and RiDG, equipped with discontinuous Galerkin (DG) to simulate acoustic wave equations under different assumptions in the material anisotropies. The parallel algorithms used to achieve high parallelization for GPU acceleration are discussed in both, gNek and RiDG, together with performance results.Item OKL: A Unified Language for Parallel Architectures(2015-11-25) Medina, David; Warburton, Timothy; Riviere, Beatrice; Sorensen, Danny C; Cooper, Keith DRapid evolution of computer processor architectures has spawned multiple programming languages and standards. This thesis strives to address the challenges caused by fast and cyclical changes in programming models. The novel contribution of this thesis is the introduction of an abstract unified framework which addresses portability and performance for programming many-core devices. To test this concept, I developed a specific implementation of this framework called OCCA. OCCA provides evidence that it is possible to achieve high performance across multiple platforms. The programming model investigated in this thesis abstracts a hierarchical representation of modern many-core devices. The model at its lowest level adopts native programming languages for these many-core devices, including serial code, OpenMP, OpenCL, NVIDIA's CUDA, and Intel's COI. At its highest level, the ultimate goal is a high level language that is agnostic about the underlying architecture. I developed a multiply layered approach to bridge the gap between expert "close to the metal" low-level programming and novice-level programming. Each layer requires varying degrees of programmer intervention to access low-level features in device architectures. I begin by introducing an approach for encapsulating programming language features, delivering a single intermediate representation (OCCA IR). Built above the OCCA IR are two kernel languages extending the prominent programming languages C and Fortran, the OCCA kernel language (OKL) and the OCCA Fortran language (OFL). Additionally, I contribute two automated approaches for facilitating data movement and automating translations from serial code to OKL kernels. To validate OCCA as a unified framework implementation, I compare performance results across a variety of applications and benchmarks. A spectrum of applications have been ported to utilize OCCA, showing no performance loss compared to their native programming language counterparts. In addition, a majority of the discussed applications show comparable results with a single OCCA kernel.Item Residual-Based Adaptivity and PWDG Methods for the Helmholtz Equation(Society for Industrial and Applied Mathematics, 2015) Kapita, Shelvean; Monk, Peter; Warburton, TimothyWe present a study of two residual a posteriori error indicators for the plane wave discontinuous Galerkin (PWDG) method for the Helmholtz equation. In particular, we study the $h$-version of PWDG in which the number of plane wave directions per element is kept fixed. First, we use a slight modification of the appropriate a priori analysis to determine a residual indicator. Numerical tests show that this is reliable but pessimistic in that the ratio between the true error and the indicator increases as the mesh is refined. We therefore introduce a new analysis based on the observation that sufficiently many plane waves can approximate piecewise linear functions as the mesh is refined. Numerical results demonstrate an improvement in the efficiency of the indicators.Item Simulation of CO2 sequestration in saline aquifers using discontinuous Galerkin method(2014-08-01) Yang, Xin; Riviere, Beatrice M.; Symes, William W; Warburton, Timothy; Verduzco, RafaelCarbon dioxide disposal into deep aquifer has been an important venue to trap excess gas emission which causes global warming. In the CO2 sequestration process, CO2 is captured from the point source and injected into the saline aquifer deep enough where CO2 reaches the supercritical state and it has a very high density compared to gaseous state. This process is described by the two-phase two-component model, which involves two nonlinear time dependent advection-diffusion equations. The difficulty lies in the injection phase when the advection terms highly dominate over the diffusion terms. Discontinuous Galerkin (DG) methods, which are famous for the properties of high order accuracy, locality and locally mass conservation, have proved to be promising for advection dominated transport equations. I develop a new fully implicit fully coupled DG method called “partial upwind” method, to discretize the equations. For time discretization, it uses the backward Euler method to allow large time steps. For space discretization, it uses the usual interior penalty DG discretization for the elliptic terms and the upwind for part of the advection terms. The other part of the advection terms are handled specially for stabilization purpose. Numerical simulations show that the new DG method works well for the CO2 sequestration problems in homogenous porous media and has shown great potential in heterogenous porous media. I also compare the new method with the primal interior penalty DG method and show that the new method is superior to the usual DG method for some subsurface fluid flow problems. Though DG methods perform well for the CO2 sequestration problem, they are indeed more costly than traditional numerical methods. The first order finite volume (FV) method, on the other hand, is very efficient. A new coupled finite volume and discontinuous Galerkin method, which uses the accuracy of DG methods on some parts of the domain and the efficiency of FV methods everywhere else to reduce the computational cost, is also studied for the time-dependent advection-diffusion equations. Theoretical and numerical results show that the new coupled method converges and can be both accurate and efficient at the same time for some typical examples. We want to apply the new coupled method to the CO2 sequestration problems in the future.Item Wave Equation Based Stencil Optimizations on a Multi-core CPU(2014-11-04) Zhou, Muhong; Symes, William W.; Riviere, Beatrice; Warburton, TimothyWave propagation stencil kernels are engines of seismic imaging algo- rithms. These kernels are both compute- and memory-intensive. This work targets improving the performance of wave equation based stencil code parallelized by OpenMP on a multi-core CPU. To achieve this goal, we explored two techniques: improving vectorization by using hardware SIMD technology, and reducing memory traffic to mitigate the bottle- neck caused by limited memory bandwidth. We show that with loop interchange, memory alignment, and compiler hints, both icc and gcc compilers can provide fully-vectorized stencil code of any order with per- formance comparable to that of SIMD intrinsic code. To reduce cache misses, we present three methods in the context of OpenMP paralleliza- tion: rearranging loop structure, blocking thread accesses, and temporal loop blocking. Our results demonstrate that fully-vectorized high-order stencil code will be about 2X faster if implemented with either of the first two methods, and fully-vectorized low-order stencil code will be about 1.2X faster if implemented with the combination of the last two methods. Our final best-performing code achieves 20%∼30% of peak GFLOPs/sec, depending on stencil order and compiler.