Robust Iterative Matrix Solvers Scalable on Massively Parallel Computers

In the face of the ever-shrinking feature sizes of VLSI technology, a major thrust of Stanford TCAD efforts is toward 3D simulation. One of the major issues is how to solve the matrices with the huge numbers of variables routinely encountered there. It is generally accepted that iterative methods are the most efficient alternative in these situations. Even with iterative methods, however, the CPU and memory requirements usually cannot be met by typical serial computers. With massively parallel computer technology rapidly maturing, going parallel is a logical choice. Given that ill-conditioned matrices must often be solved in TCAD applications, our parallelization efforts have concentrated on developing robust iterative matrix solution methods whose performance scales on massively parallel computers.

One important aspect of an iterative solver is the iterative algorithm itself. We have taken advantage of the latest advances in Krylov-subspace-based algorithms by implementing a set of algorithms that have proven to be the most effective. For cases where matrix assembly consumes a significant portion of the total CPU time, as is commonly encountered in TCAD applications, the entire grid domain is partitioned into nearly equal-sized subdomains to parallelize both the assembly and the solution of the matrices. If the preconditioner to be used is no more complex than the diagonal preconditioner, the only parallelization step is vector globalization, which, for those entries shared by neighboring processors, sums up the contributions from the sharing processors (a minimal sketch of this step is given after the next paragraph). This domain-partitioning approach constitutes our basis for developing more advanced parallel preconditioners.

Experience from various groups has indicated that ILU and its variants are the most effective preconditioners for the matrices encountered in TCAD, especially in device simulation. However, these preconditioners have generally been regarded as very difficult to parallelize effectively. Our efforts in developing efficient parallel ILU preconditioners started with matrices formed by applying the finite volume method to a regular grid. The key step here is to group the grid points according to their processor-sharing characteristics and reorder them by degree of sharing. Since the interior nodes of one subdomain are not connected to those of any other, the ILU operations involving these grid points can be carried out in parallel. More importantly, the operations on nodes shared with neighboring processors can also be parallelized if the tasks for the remaining grid-point groups are distributed in a suitable way and the corresponding matrix entries are combined appropriately [1].
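
The sketch below illustrates the vector-globalization step in C with MPI. It is only a minimal sketch: the original implementations predate MPI and used each machine's native message-passing library, and the Neighbor layout, buffer handling, and function names here are hypothetical.

    #include <stdlib.h>
    #include <mpi.h>

    /* Hypothetical record of one subdomain's coupling to a neighboring
     * processor: the local indices of the vector entries shared with it.
     * An entry shared by more than two processors appears in the list
     * for each sharing neighbor.                                        */
    typedef struct {
        int     rank;     /* MPI rank of the neighboring processor   */
        int     n;        /* number of vector entries shared with it */
        int    *idx;      /* local indices of those shared entries   */
        double *sendbuf;  /* scratch buffers, each of length n       */
        double *recvbuf;
    } Neighbor;

    /* Vector globalization: after local assembly, each shared entry of x
     * holds only this processor's partial sum.  Exchange the partial
     * values with every sharing neighbor and add them in, so that all
     * copies of a shared entry end up with the same, fully summed value. */
    void globalize(double *x, Neighbor *nbr, int n_nbr, MPI_Comm comm)
    {
        MPI_Request *req = malloc(2 * n_nbr * sizeof(MPI_Request));
        int p, k;

        for (p = 0; p < n_nbr; p++) {
            for (k = 0; k < nbr[p].n; k++)           /* pack partial sums */
                nbr[p].sendbuf[k] = x[nbr[p].idx[k]];
            MPI_Isend(nbr[p].sendbuf, nbr[p].n, MPI_DOUBLE,
                      nbr[p].rank, 0, comm, &req[2*p]);
            MPI_Irecv(nbr[p].recvbuf, nbr[p].n, MPI_DOUBLE,
                      nbr[p].rank, 0, comm, &req[2*p+1]);
        }
        MPI_Waitall(2 * n_nbr, req, MPI_STATUSES_IGNORE);

        for (p = 0; p < n_nbr; p++)          /* accumulate contributions */
            for (k = 0; k < nbr[p].n; k++)
                x[nbr[p].idx[k]] += nbr[p].recvbuf[k];

        free(req);
    }

With a diagonal preconditioner, an exchange of this form is the only interprocessor communication the preconditioning step itself requires.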

In short, the ILU operations are parallelized by dividing them into several stages involving grid points of nondecreasing degree of sharing, with appropriate data communication between the stages. Very good parallel performance has been achieved over several generations of parallel computers [1][2]. Although ILU preconditioners work well most of the time, some matrices arising from TCAD applications fail to converge with an ILU preconditioner. We have therefore developed an efficient parallel implementation of the ILUV preconditioner, which is able to finish the jobs that ILU cannot. Our proudest moment came when we successfully solved a five-million-grid-point bipolar transistor simulation problem on the Caltech Delta machine with 512 Intel i860 computing nodes. We are currently extending our approach to matrices formed by applying both finite volume and finite element methods on general grid domains. We are also working to modularize our programs so that they can be more easily adopted into other programs.
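
To make the staged structure concrete, the following sketch shows the forward-substitution half of applying such a preconditioner, reusing the hypothetical Neighbor type from the sketch above. It assumes the local rows have been permuted so that each stage occupies a contiguous index range, in nondecreasing degree of sharing; exchange_stage is a hypothetical helper, built like globalize() but restricted to one stage's shared points.

    #include <mpi.h>   /* reuses the Neighbor type from the sketch above */

    /* Strictly lower triangle of the ILU factor in CSR form, with rows
     * permuted so that stage s owns the contiguous index range
     * [start[s], start[s+1]).  Stage 0 holds the subdomain's interior
     * points; later stages hold points of increasing degree of sharing. */
    typedef struct {
        int     n_stages;
        int    *start;    /* n_stages + 1 stage boundaries */
        int    *rowptr;   /* CSR row pointers              */
        int    *col;      /* CSR column indices            */
        double *val;      /* CSR values (unit diagonal)    */
    } StagedL;

    /* Hypothetical helper: exchange one stage's shared entries of y
     * with the sharing neighbors, as in globalize() but restricted to
     * the points of stage s.                                           */
    void exchange_stage(double *y, int s, Neighbor *nbr, int n_nbr,
                        MPI_Comm comm);

    /* Forward substitution y = L^{-1} r, stage by stage.  Within a
     * stage there are no cross-processor dependencies, so each
     * processor sweeps its own rows without waiting on anyone; between
     * stages the freshly computed shared entries of y are communicated
     * to the neighbors that need them in later stages.                 */
    void forward_solve(const StagedL *L, const double *r, double *y,
                       Neighbor *nbr, int n_nbr, MPI_Comm comm)
    {
        int s, i, k;
        for (s = 0; s < L->n_stages; s++) {
            for (i = L->start[s]; i < L->start[s+1]; i++) {
                double sum = r[i];
                for (k = L->rowptr[i]; k < L->rowptr[i+1]; k++)
                    sum -= L->val[k] * y[L->col[k]];
                y[i] = sum;                 /* unit diagonal of L */
            }
            exchange_stage(y, s, nbr, n_nbr, comm);
        }
    }

In this sketch, the backward sweep through the upper factor follows the same pattern with the stages visited in reverse order.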

References:
[1] R. F. Lucas et al., "A Parallel 3-D Poisson Solver on Hypercube Processors," Proc. ICCAD, 1987, p. 442.

[2] K.-C. Wu, "A STRIDE Towards Practical 3-D Device Simulation--Numerical and Visualization Considerations," IEEE Trans. on CAD, 1991, p. 516.

Ke-Chih Wu (wuk@gloworm.Stanford.EDU)
AEL 201
Integrated Circuits Laboratory
Stanford University
Stanford, CA 94305-4055