DescriptionLarge-scale parallel numerical simulations are essential for a wide range of engineering problems that involve complex, coupled physical processes interacting across a broad range of spatial and temporal scales. The data structures involved in such simulations (meshes, sparse matrices, etc.) are frequently represented as graphs, and these graphs must be optimally partitioned across the available computational resources in order for the underlying calculations to scale efficiently. Partitions which minimize the number of graph edges that are cut (edge-cuts) while simultaneously maintaining a balance in the amount of work (i.e. graph nodes) assigned to each processor core are desirable, and the performance of most existing partitioning software begins to degrade in this metric for partitions with more than than $O(10^3)$ processor cores. In this work, we consider a general-purpose hierarchical partitioner which takes into account the existence of multiple processor cores and shared memory in a compute node while partitioning a graph into an arbitrary number of subgraphs. We demonstrate that our algorithms significantly improve the preconditioning efficiency and overall performance of realistic numerical simulations running on up to 32,768 processor cores with nearly $10^9$ unknowns.