The distributed-multigrid-preconditioned-solver program

The distributed-multigrid-preconditioned-solver program#

Reference API: The distributed-multigrid-preconditioned-solver program
Reference API
The distributed-multigrid-preconditioned-solver program

The distributed multigrid preconditioned solver with customized components example..

This example depends on distributed-solver.

Table of contents
  1. Introduction
  2. The commented program
  1. Results
  2. The plain program

Introduction

This distributed multigrid preconditioned solver example should help you understand customizing Ginkgo multigrid in a distributed setting. The example will solve a simple 1D Laplace equation where the system can be distributed row-wise to multiple processes. Note. Because the stencil for the discretized Laplacian is configured with equal weight, the coarsening method does not perform well on this kind of problem. To run the solver with multiple processes, use mpirun -n NUM_PROCS ./distributed-solver [executor] [num_grid_points] [num_iterations].

If you are using GPU devices, please make sure that you run this example with at most as many processes as you have GPU devices available.

The commented program

As vector type we use the following, which implements a subset of gko::matrix::Dense.

As matrix type we simply use the following type, which can read distributed data and be applied to a distributed vector.

using dist_mtx =
gko::experimental::distributed::Matrix<ValueType, LocalIndexType,
GlobalIndexType>;

We still need a localized vector type to be used as scalars in the advanced apply operations.

The partition type describes how the rows of the matrices are distributed.

using part_type =
GlobalIndexType>;

We can use here the same solver type as you would use in a non-distributed program. Please note that not all solvers support distributed systems at the moment.

Definition cg.hpp:49

We use the Schwarz preconditioner to extend non-distributed preconditioners, like our Jacobi, to the distributed case. The Schwarz preconditioner wraps another preconditioner, and applies it only to the local part of a distributed matrix. This will be used as our distributed multigrid smoother.

ValueType, LocalIndexType, GlobalIndexType>;
Definition jacobi.hpp:189

Multigrid and Pgm can accept the distributed matrix, so we still use the same type as the non-distributed case.

Definition pgm.hpp:52
Definition multigrid.hpp:108

Create an MPI communicator get the rank of the calling process.

const auto comm = gko::experimental::mpi::communicator(MPI_COMM_WORLD);
const auto rank = comm.rank();

User Input Handling

User input settings:

  • The executor, defaults to reference.
  • The number of grid points, defaults to 100.
  • The number of iterations, defaults to 1000.
if (argc == 2 && (std::string(argv[1]) == "--help")) {
if (rank == 0) {
std::cerr << "Usage: " << argv[0]
<< " [executor] [num_grid_points] [num_iterations] "
<< std::endl;
}
std::exit(-1);
}
const auto executor_string = argc >= 2 ? argv[1] : "reference";
const auto grid_dim =
static_cast<gko::size_type>(argc >= 3 ? std::atoi(argv[2]) : 100);
const auto num_iters =
static_cast<gko::size_type>(argc >= 4 ? std::atoi(argv[3]) : 1000);
const std::map<std::string,
std::function<std::shared_ptr<gko::Executor>(MPI_Comm)>>
executor_factory_mpi{
{"reference",
[](MPI_Comm) { return gko::ReferenceExecutor::create(); }},
{"omp", [](MPI_Comm) { return gko::OmpExecutor::create(); }},
{"cuda",
[](MPI_Comm comm) {
device_id, gko::ReferenceExecutor::create());
}},
{"hip",
[](MPI_Comm comm) {
device_id, gko::ReferenceExecutor::create());
}},
{"dpcpp", [](MPI_Comm comm) {
int device_id = 0;
} else {
throw std::runtime_error("No suitable DPC++ devices");
}
device_id, gko::ReferenceExecutor::create());
}}};
auto exec = executor_factory_mpi.at(executor_string)(MPI_COMM_WORLD);
static std::shared_ptr< CudaExecutor > create(int device_id, std::shared_ptr< Executor > master, bool device_reset, allocation_mode alloc_mode=default_cuda_alloc_mode, CUstream_st *stream=nullptr)
static int get_num_devices()
static int get_num_devices(std::string device_type)
static std::shared_ptr< DpcppExecutor > create(int device_id, std::shared_ptr< Executor > master, std::string device_type="all", dpcpp_queue_property property=dpcpp_queue_property::in_order)
static int get_num_devices()
static std::shared_ptr< HipExecutor > create(int device_id, std::shared_ptr< Executor > master, bool device_reset, allocation_mode alloc_mode=default_hip_alloc_mode, CUstream_st *stream=nullptr)
static std::shared_ptr< OmpExecutor > create(std::shared_ptr< CpuAllocatorBase > alloc=std::make_shared< CpuAllocator >())
Definition executor.hpp:1396
int map_rank_to_device_id(MPI_Comm comm, int num_devices)
double get_walltime()
Definition mpi.hpp:1583
std::size_t size_type
Definition types.hpp:89

Creating the Distributed Matrix and Vectors

As a first step, we create a partition of the rows. The partition consists of ranges of consecutive rows which are assigned a part-id. These part-ids will be used for the distributed data structures to determine which rows will be stored locally. In this example each rank has (nearly) the same number of rows, so we can use the following specialized constructor. See gko::distributed::Partition for other modes of creating a partition.

const auto num_rows = grid_dim;
auto partition = gko::share(part_type::build_from_global_size_uniform(
exec->get_master(), comm.size(),
static_cast<GlobalIndexType>(num_rows)));
detail::shared_type< OwningPointer > share(OwningPointer &&p)
Definition utils_helper.hpp:224

Assemble the matrix using a 3-pt stencil and fill the right-hand-side with a sine value. The distributed matrix supports only constructing an empty matrix of zero size and filling in the values with gko::experimental::distributed::Matrix::read_distributed. Only the data that belongs to the rows by this rank will be assembled.

A_data.size = {num_rows, num_rows};
b_data.size = {num_rows, 1};
x_data.size = {num_rows, 1};
const auto range_start = partition->get_range_bounds()[rank];
const auto range_end = partition->get_range_bounds()[rank + 1];
for (int i = range_start; i < range_end; i++) {
if (i > 0) {
A_data.nonzeros.emplace_back(i, i - 1, -1);
}
A_data.nonzeros.emplace_back(i, i, 2);
if (i < grid_dim - 1) {
A_data.nonzeros.emplace_back(i, i + 1, -1);
}
b_data.nonzeros.emplace_back(i, 0, std::sin(i * 0.01));
x_data.nonzeros.emplace_back(i, 0, gko::zero<ValueType>());
}
Definition matrix_data.hpp:126
dim< 2 > size
Definition matrix_data.hpp:444
std::vector< nonzero_type > nonzeros
Definition matrix_data.hpp:453

Take timings.

comm.synchronize();
ValueType t_init_end = gko::experimental::mpi::get_walltime();

Read the matrix data, currently this is only supported on CPU executors. This will also set up the communication pattern needed for the distributed matrix-vector multiplication.

auto A_host = gko::share(dist_mtx::create(exec->get_master(), comm));
auto x_host = dist_vec::create(exec->get_master(), comm);
auto b_host = dist_vec::create(exec->get_master(), comm);
A_host->read_distributed(A_data, partition);
b_host->read_distributed(b_data, partition);
x_host->read_distributed(x_data, partition);

After reading, the matrix and vector can be moved to the chosen executor, since the distributed matrix supports SpMV also on devices.

auto A = gko::share(dist_mtx::create(exec, comm));
auto x = dist_vec::create(exec, comm);
auto b = dist_vec::create(exec, comm);
A->copy_from(A_host);
b->copy_from(b_host);
x->copy_from(x_host);

Take timings.

comm.synchronize();
ValueType t_read_setup_end = gko::experimental::mpi::get_walltime();

Solve the Distributed System

Generate the solver

Setup the multigrid factory with customized smoother and coarse solver Because BlockJacobi does not support distributed matrix, we need wrap it in Schwarz.

auto schwarz_bj_factory =
gko::share(schwarz::build().with_local_solver(bj::build()).on(exec));
auto smoother_factory = gko::share(gko::solver::build_smoother(
schwarz_bj_factory, 2u, static_cast<ValueType>(0.9)));
auto build_smoother(std::shared_ptr< const LinOpFactory > factory, size_type iteration=1, ValueType relaxation_factor=0.9)
Definition ir.hpp:301

Cg supports distributed matrix, so we can use it as we did in non-distributed case

auto coarsest_factory = gko::share(
solver::build()
.with_criteria(gko::stop::Iteration::build().with_max_iters(4u))
.on(exec));

The multigrid preconditioner uses the Schwarz Jacobi as smoother and Cg as coarse solver

auto mg_factory = gko::share(
mg::build()
.with_mg_level(pgm::build().with_deterministic(true))
.with_pre_smoother(smoother_factory)
.with_coarsest_solver(coarsest_factory)
.with_criteria(gko::stop::Iteration::build().with_max_iters(1u))
.on(exec));

Setup the stopping criterion and logger

const gko::remove_complex<ValueType> reduction_factor{1e-8};
std::shared_ptr<const gko::log::Convergence<ValueType>> logger =
auto Ainv = solver::build()
.with_preconditioner(mg_factory)
.with_criteria(
gko::stop::Iteration::build().with_max_iters(num_iters),
.with_reduction_factor(reduction_factor))
.on(exec)
->generate(A);
static std::unique_ptr< Convergence > create(std::shared_ptr< const Executor >, const mask_type &enabled_events=Logger::criterion_events_mask|Logger::iteration_complete_mask)
Definition convergence.hpp:73
Definition residual_norm.hpp:113
typename detail::remove_complex_s< T >::type remove_complex
Definition math.hpp:260

Add logger to the generated solver to log the iteration count and residual norm

Ainv->add_logger(logger);

Take timings.

comm.synchronize();
ValueType t_solver_generate_end = gko::experimental::mpi::get_walltime();

Apply the distributed solver, this is the same as in the non-distributed case.

Ainv->apply(b, x);

Take timings.

comm.synchronize();

Get the residual.

auto res_norm = gko::clone(exec->get_master(),
gko::as<vec>(logger->get_residual_norm()));
detail::cloned_type< Pointer > clone(const Pointer &p)
Definition utils_helper.hpp:173

Printing Results

Print the achieved residual norm and timings on rank 0.

if (comm.rank() == 0) {

clang-format off

std::cout << "\nNum rows in matrix: " << num_rows
<< "\nNum ranks: " << comm.size()
<< "\nFinal Res norm: " << res_norm->at(0, 0)
<< "\nIteration count: " << logger->get_num_iterations()
<< "\nInit time: " << t_init_end - t_init
<< "\nRead time: " << t_read_setup_end - t_init
<< "\nSolver generate time: " << t_solver_generate_end - t_read_setup_end
<< "\nSolver apply time: " << t_end - t_solver_generate_end
<< "\nTotal time: " << t_end - t_init
<< std::endl;

clang-format on

}
}

Results

This is the expected output for mpirun -n 4 ./distributed-multigrid-preconditioned-solver:

Num rows in matrix: 100
Num ranks: 4
Final Res norm: 1.61045e-08
Iteration count: 18
Init time: 0.000117699
Read time: 0.000522518
Solver generate time: 0.000430548
Solver apply time: 0.00183804
Total time: 0.00279111

The timings may vary depending on the machine.

The plain program

#include <ginkgo/ginkgo.hpp>
#include <iostream>
#include <map>
#include <string>
int main(int argc, char* argv[])
{
const gko::experimental::mpi::environment env(argc, argv);
using GlobalIndexType = gko::int64;
using LocalIndexType = gko::int32;
using ValueType = double;
using dist_mtx =
gko::experimental::distributed::Matrix<ValueType, LocalIndexType,
GlobalIndexType>;
using part_type =
GlobalIndexType>;
ValueType, LocalIndexType, GlobalIndexType>;
const auto comm = gko::experimental::mpi::communicator(MPI_COMM_WORLD);
const auto rank = comm.rank();
if (argc == 2 && (std::string(argv[1]) == "--help")) {
if (rank == 0) {
std::cerr << "Usage: " << argv[0]
<< " [executor] [num_grid_points] [num_iterations] "
<< std::endl;
}
std::exit(-1);
}
const auto executor_string = argc >= 2 ? argv[1] : "reference";
const auto grid_dim =
static_cast<gko::size_type>(argc >= 3 ? std::atoi(argv[2]) : 100);
const auto num_iters =
static_cast<gko::size_type>(argc >= 4 ? std::atoi(argv[3]) : 1000);
const std::map<std::string,
std::function<std::shared_ptr<gko::Executor>(MPI_Comm)>>
executor_factory_mpi{
{"reference",
[](MPI_Comm) { return gko::ReferenceExecutor::create(); }},
{"omp", [](MPI_Comm) { return gko::OmpExecutor::create(); }},
{"cuda",
[](MPI_Comm comm) {
device_id, gko::ReferenceExecutor::create());
}},
{"hip",
[](MPI_Comm comm) {
device_id, gko::ReferenceExecutor::create());
}},
{"dpcpp", [](MPI_Comm comm) {
int device_id = 0;
} else {
throw std::runtime_error("No suitable DPC++ devices");
}
device_id, gko::ReferenceExecutor::create());
}}};
auto exec = executor_factory_mpi.at(executor_string)(MPI_COMM_WORLD);
const auto num_rows = grid_dim;
auto partition = gko::share(part_type::build_from_global_size_uniform(
exec->get_master(), comm.size(),
static_cast<GlobalIndexType>(num_rows)));
A_data.size = {num_rows, num_rows};
b_data.size = {num_rows, 1};
x_data.size = {num_rows, 1};
const auto range_start = partition->get_range_bounds()[rank];
const auto range_end = partition->get_range_bounds()[rank + 1];
for (int i = range_start; i < range_end; i++) {
if (i > 0) {
A_data.nonzeros.emplace_back(i, i - 1, -1);
}
A_data.nonzeros.emplace_back(i, i, 2);
if (i < grid_dim - 1) {
A_data.nonzeros.emplace_back(i, i + 1, -1);
}
b_data.nonzeros.emplace_back(i, 0, std::sin(i * 0.01));
x_data.nonzeros.emplace_back(i, 0, gko::zero<ValueType>());
}
comm.synchronize();
ValueType t_init_end = gko::experimental::mpi::get_walltime();
auto A_host = gko::share(dist_mtx::create(exec->get_master(), comm));
auto x_host = dist_vec::create(exec->get_master(), comm);
auto b_host = dist_vec::create(exec->get_master(), comm);
A_host->read_distributed(A_data, partition);
b_host->read_distributed(b_data, partition);
x_host->read_distributed(x_data, partition);
auto A = gko::share(dist_mtx::create(exec, comm));
auto x = dist_vec::create(exec, comm);
auto b = dist_vec::create(exec, comm);
A->copy_from(A_host);
b->copy_from(b_host);
x->copy_from(x_host);
comm.synchronize();
ValueType t_read_setup_end = gko::experimental::mpi::get_walltime();
auto schwarz_bj_factory =
gko::share(schwarz::build().with_local_solver(bj::build()).on(exec));
auto smoother_factory = gko::share(gko::solver::build_smoother(
schwarz_bj_factory, 2u, static_cast<ValueType>(0.9)));
auto coarsest_factory = gko::share(
solver::build()
.with_criteria(gko::stop::Iteration::build().with_max_iters(4u))
.on(exec));
auto mg_factory = gko::share(
mg::build()
.with_mg_level(pgm::build().with_deterministic(true))
.with_pre_smoother(smoother_factory)
.with_coarsest_solver(coarsest_factory)
.with_criteria(gko::stop::Iteration::build().with_max_iters(1u))
.on(exec));
const gko::remove_complex<ValueType> reduction_factor{1e-8};
std::shared_ptr<const gko::log::Convergence<ValueType>> logger =
auto Ainv = solver::build()
.with_preconditioner(mg_factory)
.with_criteria(
gko::stop::Iteration::build().with_max_iters(num_iters),
.with_reduction_factor(reduction_factor))
.on(exec)
->generate(A);
Ainv->add_logger(logger);
comm.synchronize();
ValueType t_solver_generate_end = gko::experimental::mpi::get_walltime();
Ainv->apply(b, x);
comm.synchronize();
auto res_norm = gko::clone(exec->get_master(),
gko::as<vec>(logger->get_residual_norm()));
if (comm.rank() == 0) {
std::cout << "\nNum rows in matrix: " << num_rows
<< "\nNum ranks: " << comm.size()
<< "\nFinal Res norm: " << res_norm->at(0, 0)
<< "\nIteration count: " << logger->get_num_iterations()
<< "\nInit time: " << t_init_end - t_init
<< "\nRead time: " << t_read_setup_end - t_init
<< "\nSolver generate time: " << t_solver_generate_end - t_read_setup_end
<< "\nSolver apply time: " << t_end - t_solver_generate_end
<< "\nTotal time: " << t_end - t_init
<< std::endl;
}
}
std::int32_t int32
Definition types.hpp:106
std::int64_t int64
Definition types.hpp:112