Life without Death Parallel Simulation
This project consists of a parallel and hyper-optimized implementation of Life without Death, a variant of John Conway’s famous Game of Life cellular automaton. The main objective was to maximize computational throughput and reduce execution times for simulations on large grids by exploiting multi-core architectures via OpenMP.
Parallelization Strategies (OpenMP)
To achieve maximum speedup, the core of the simulation (evolve_step_parallel) was parallelized and refined through several iterations:
- Dynamic Scheduling & Loop Collapse: The two outermost
forloops were parallelized using thecollapse(2)clause to distribute iterations uniformly across all cells among the threads. A dynamic scheduling approach (which showed similar results to guided scheduling) was chosen to dynamically balance the load, since the automaton’s evolution generates non-uniform computational zones. - Reduction Variables: To safely and efficiently track the automaton’s state changes without causing data races or incurring lock-based bottlenecks, specific
reductionvariables were introduced.
Architectural & Algorithmic Optimizations
Beyond inserting OpenMP directives, the source code was deeply reorganized to “assist” the underlying hardware:
- Cache Locality Optimization: The iteration order of the
forloops within the evolution function was inverted to ensure contiguous memory access. This drastically reduced cache misses and maximized CPU cache utilization. - Branch Pruning & Fast-Paths: The computational logic was streamlined to avoid unnecessary operations:
- Introduced a preemptive check to see if a cell is “dead” before computing neighbors and hue, saving precious clock cycles.
- Added a quick check on neighbors’ hues to bypass redundant recomputations.
- Removed conditional checks on “survived cells” that proved superfluous for the specific rules of this automaton.
Trade-off Analysis
A fundamental part of HPC optimization is analyzing what does not work. The experiments demonstrated that:
- Static Assignment of tasks produced less speedup due to the highly conditional and unbalanced behavior of the automaton’s loops.
- Nested Parallelism in the inner loops degraded overall performance due to the low number of available threads and the excessive overhead of creating and managing thread teams.
Repository: View on GitHub (private for now)weight: 999