The custom matrix format example..
This example depends on simple-solver, poisson-solver, three-pt-stencil-solver, .
Introduction
This example solves a 1D Poisson equation:
\(
u : [0, 1] \rightarrow R\\
u'' = f\\
u(0) = u0\\
u(1) = u1
\)
using a finite difference method on an equidistant grid with K discretization points (K can be controlled with a command line parameter). The discretization is done via the second order Taylor polynomial:
\(
u(x + h) = u(x) - u'(x)h + 1/2 u''(x)h^2 + O(h^3)\\
u(x - h) = u(x) + u'(x)h + 1/2 u''(x)h^2 + O(h^3) / +\\
---------------------- \\
-u(x - h) + 2u(x) + -u(x + h) = -f(x)h^2 + O(h^3)
\)
For an equidistant grid with K "inner" discretization points \(x1, ..., xk, \)and step size \( h = 1 / (K + 1)\), the formula produces a system of linear equations
\(
2u_1 - u_2 = -f_1 h^2 + u0\\
-u_(k-1) + 2u_k - u_(k+1) = -f_k h^2, k = 2, ..., K - 1\\
-u_(K-1) + 2u_K = -f_K h^2 + u1\\
\)
which is then solved using Ginkgo's implementation of the CG method preconditioned with block-Jacobi. It is also possible to specify on which executor Ginkgo will solve the system via the command line. The function \(`f` \)is set to \(`f(x) = 6x`\) (making the solution \(`u(x) = x^3`\)), but that can be changed in the main function.
The intention of this example is to show how a custom linear operator can be created and integrated into Ginkgo to achieve performance benefits.
About the example
The commented program
template <typename ValueType>
public:
Definition custom-matrix-format.cpp:18
Definition polymorphic_object.hpp:767
Definition lin_op.hpp:878
This constructor will be called by the create method. Here we initialize the coefficients of the stencil.
ValueType center = 2.0, ValueType right = -1.0)
coefficients(exec, {left, center, right})
{}
protected:
The Ginkgo namespace.
Definition abstract_factory.hpp:20
std::size_t size_type
Definition types.hpp:89
Here we implement the application of the linear operator, x = A * b. apply_impl will be called by the apply method, after the arguments have been moved to the correct executor and the operators checked for conforming sizes.
For simplicity, we assume that there is always only one right hand side and the stride of consecutive elements in the vectors is 1 (both of these are always true in this example).
{
Definition lin_op.hpp:117
we only implement the operator for dense RHS. gko::as will throw an exception if its argument is not Dense.
auto dense_b = gko::as<vec>(b);
auto dense_x = gko::as<vec>(x);
we need separate implementations depending on the executor, so we create an operation which maps the call to the correct implementation
stencil_operation(
const coef_type& coefficients,
const vec* b,
: coefficients{coefficients}, b{b}, x{x}
{}
Definition executor.hpp:258
OpenMP implementation
void run(std::shared_ptr<const gko::OmpExecutor>) const override
{
auto b_values = b->get_const_values();
auto x_values = x->get_values();
#pragma omp parallel for
for (std::size_t i = 0; i < x->
get_size()[0]; ++i) {
auto coefs = coefficients.get_const_data();
auto result = coefs[1] * b_values[i];
if (i > 0) {
result += coefs[0] * b_values[i - 1];
}
if (i < x->get_size()[0] - 1) {
result += coefs[2] * b_values[i + 1];
}
x_values[i] = result;
}
}
const dim< 2 > & get_size() const noexcept
Definition lin_op.hpp:210
CUDA implementation
void run(std::shared_ptr<const gko::CudaExecutor>) const override
{
stencil_kernel(x->
get_size()[0], coefficients.get_const_data(),
b->get_const_values(), x->get_values());
}
We do not provide an implementation for reference executor. If not provided, Ginkgo will use the implementation for the OpenMP executor when calling it in the reference executor.
const coef_type& coefficients;
};
this->get_executor()->run(
stencil_operation(coefficients, dense_b, dense_x));
}
There is also a version of the apply function which does the operation x = alpha * A * b + beta * x. This function is commonly used and can often be better optimized than implementing it using x = A * b. However, for simplicity, we will implement it exactly like that in this example.
{
auto dense_b = gko::as<vec>(b);
auto dense_x = gko::as<vec>(x);
auto tmp_x = dense_x->clone();
this->apply_impl(b, tmp_x.get());
dense_x->scale(beta);
dense_x->add_scaled(alpha, tmp_x);
}
private:
coef_type coefficients;
};
Creates a stencil matrix in CSR format for the given number of discretization points.
template <typename ValueType, typename IndexType>
{
const auto discretization_points = matrix->get_size()[0];
IndexType pos = 0;
const ValueType coefs[] = {-1, 2, -1};
row_ptrs[0] = pos;
for (int i = 0; i < discretization_points; ++i) {
for (auto ofs : {-1, 0, 1}) {
if (0 <= i + ofs && i + ofs < discretization_points) {
values[pos] = coefs[ofs + 1];
col_idxs[pos] = i + ofs;
++pos;
}
}
row_ptrs[i + 1] = pos;
}
}
index_type * get_row_ptrs() noexcept
Definition csr.hpp:1005
value_type * get_values() noexcept
Definition csr.hpp:955
index_type * get_col_idxs() noexcept
Definition csr.hpp:986
Generates the RHS vector given f and the boundary conditions.
template <typename Closure, typename ValueType>
void generate_rhs(Closure f, ValueType u0, ValueType u1,
{
const auto discretization_points = rhs->get_size()[0];
auto values = rhs->get_values();
const ValueType h = 1.0 / (discretization_points + 1);
for (int i = 0; i < discretization_points; ++i) {
const ValueType xi = ValueType(i + 1) * h;
values[i] = -f(xi) * h * h;
}
values[0] += u0;
values[discretization_points - 1] += u1;
}
Prints the solution u.
template <typename ValueType>
void print_solution(ValueType u0, ValueType u1,
{
std::cout << u0 << '\n';
for (int i = 0; i < u->get_size()[0]; ++i) {
}
std::cout << u1 << std::endl;
}
const value_type * get_const_values() const noexcept
Definition dense.hpp:859
Computes the 1-norm of the error given the computed u and the correct solution function correct_u.
template <typename Closure, typename ValueType>
double calculate_error(int discretization_points,
Closure correct_u)
{
const auto h = 1.0 / (discretization_points + 1);
auto error = 0.0;
for (int i = 0; i < discretization_points; ++i) {
using std::abs;
const auto xi = (i + 1) * h;
error +=
}
return error;
}
int main(int argc, char* argv[])
{
Some shortcuts
using ValueType = double;
using IndexType = int;
typename detail::remove_complex_s< T >::type remove_complex
Definition math.hpp:260
Figure out where to run the code
if (argc == 2 && (std::string(argv[1]) == "--help")) {
std::cerr << "Usage: " << argv[0] << " [executor]" << std::endl;
std::exit(-1);
}
const auto executor_string = argc >= 2 ? argv[1] : "reference";
const unsigned int discretization_points =
argc >= 3 ? std::atoi(argv[2]) : 100u;
std::map<std::string, std::function<std::shared_ptr<gko::Executor>()>>
exec_map{
{"cuda",
[] {
}},
{"hip",
[] {
}},
{"dpcpp",
[] {
}},
{"reference", [] { return gko::ReferenceExecutor::create(); }}};
static std::shared_ptr< CudaExecutor > create(int device_id, std::shared_ptr< Executor > master, bool device_reset, allocation_mode alloc_mode=default_cuda_alloc_mode, CUstream_st *stream=nullptr)
static std::shared_ptr< DpcppExecutor > create(int device_id, std::shared_ptr< Executor > master, std::string device_type="all", dpcpp_queue_property property=dpcpp_queue_property::in_order)
static std::shared_ptr< HipExecutor > create(int device_id, std::shared_ptr< Executor > master, bool device_reset, allocation_mode alloc_mode=default_hip_alloc_mode, CUstream_st *stream=nullptr)
static std::shared_ptr< OmpExecutor > create(std::shared_ptr< CpuAllocatorBase > alloc=std::make_shared< CpuAllocator >())
Definition executor.hpp:1396
executor where Ginkgo will perform the computation
const auto exec = exec_map.at(executor_string)();
executor used by the application
const auto app_exec = exec->get_master();
problem:
auto correct_u = [](ValueType x) { return x * x * x; };
auto f = [](ValueType x) { return ValueType{6} * x; };
auto u0 = correct_u(0);
auto u1 = correct_u(1);
initialize vectors
generate_rhs(f, u0, u1, rhs.get());
for (int i = 0; i < u->get_size()[0]; ++i) {
}
const RealValueType reduction_factor{1e-7};
static std::unique_ptr< Dense > create(std::shared_ptr< const Executor > exec, const dim< 2 > &size={}, size_type stride=0)
value_type * get_values() noexcept
Definition dense.hpp:850
Generate solver and solve the system
cg::build()
.with_criteria(
gko::stop::Iteration::build().with_max_iters(discretization_points),
reduction_factor))
.on(exec)
Definition residual_norm.hpp:113
notice how our custom StencilMatrix can be used in the same way as any built-in type
-1, 2, -1))
->apply(rhs, u);
std::cout << "\nSolve complete."
<< "\nThe average relative error is "
<< calculate_error(discretization_points, u.get(), correct_u) /
discretization_points
<< std::endl;
}
Results
Comments about programming and debugging
The plain program
#include <iostream>
#include <map>
#include <string>
#include <omp.h>
#include <ginkgo/ginkgo.hpp>
template <typename ValueType>
void stencil_kernel(std::size_t size, const ValueType* coefs,
const ValueType* b, ValueType* x);
template <typename ValueType>
public:
ValueType center = 2.0, ValueType right = -1.0)
coefficients(exec, {left, center, right})
{}
protected:
{
auto dense_b = gko::as<vec>(b);
auto dense_x = gko::as<vec>(x);
stencil_operation(
const coef_type& coefficients,
const vec* b,
: coefficients{coefficients}, b{b}, x{x}
{}
void run(std::shared_ptr<const gko::OmpExecutor>) const override
{
auto b_values = b->get_const_values();
auto x_values = x->get_values();
#pragma omp parallel for
for (std::size_t i = 0; i < x->
get_size()[0]; ++i) {
auto coefs = coefficients.get_const_data();
auto result = coefs[1] * b_values[i];
if (i > 0) {
result += coefs[0] * b_values[i - 1];
}
if (i < x->get_size()[0] - 1) {
result += coefs[2] * b_values[i + 1];
}
x_values[i] = result;
}
}
void run(std::shared_ptr<const gko::CudaExecutor>) const override
{
stencil_kernel(x->
get_size()[0], coefficients.get_const_data(),
b->get_const_values(), x->get_values());
}
const coef_type& coefficients;
};
stencil_operation(coefficients, dense_b, dense_x));
}
{
auto dense_b = gko::as<vec>(b);
auto dense_x = gko::as<vec>(x);
auto tmp_x = dense_x->clone();
this->apply_impl(b, tmp_x.get());
dense_x->scale(beta);
dense_x->add_scaled(alpha, tmp_x);
}
private:
coef_type coefficients;
};
template <typename ValueType, typename IndexType>
{
const auto discretization_points = matrix->get_size()[0];
IndexType pos = 0;
const ValueType coefs[] = {-1, 2, -1};
row_ptrs[0] = pos;
for (int i = 0; i < discretization_points; ++i) {
for (auto ofs : {-1, 0, 1}) {
if (0 <= i + ofs && i + ofs < discretization_points) {
values[pos] = coefs[ofs + 1];
col_idxs[pos] = i + ofs;
++pos;
}
}
row_ptrs[i + 1] = pos;
}
}
template <typename Closure, typename ValueType>
void generate_rhs(Closure f, ValueType u0, ValueType u1,
{
const auto discretization_points =
rhs->get_size()[0];
auto values =
rhs->get_values();
const ValueType h = 1.0 / (discretization_points + 1);
for (int i = 0; i < discretization_points; ++i) {
const ValueType xi = ValueType(i + 1) * h;
values[i] = -f(xi) * h * h;
}
values[0] += u0;
values[discretization_points - 1] += u1;
}
template <typename ValueType>
void print_solution(ValueType u0, ValueType u1,
{
std::cout << u0 << '\n';
for (int i = 0; i < u->get_size()[0]; ++i) {
}
std::cout << u1 << std::endl;
}
template <typename Closure, typename ValueType>
double calculate_error(int discretization_points,
Closure correct_u)
{
const auto h = 1.0 / (discretization_points + 1);
auto error = 0.0;
for (int i = 0; i < discretization_points; ++i) {
using std::abs;
const auto xi = (i + 1) * h;
error +=
}
return error;
}
int main(int argc, char* argv[])
{
using ValueType = double;
using IndexType = int;
if (argc == 2 && (std::string(argv[1]) == "--help")) {
std::cerr << "Usage: " << argv[0] << " [executor]" << std::endl;
std::exit(-1);
}
const auto executor_string = argc >= 2 ? argv[1] : "reference";
const unsigned int discretization_points =
argc >= 3 ? std::atoi(argv[2]) : 100u;
std::map<std::string, std::function<std::shared_ptr<gko::Executor>()>>
exec_map{
{"cuda",
[] {
}},
{"hip",
[] {
}},
{"dpcpp",
[] {
}},
{"reference", [] { return gko::ReferenceExecutor::create(); }}};
const auto exec = exec_map.at(executor_string)();
const auto app_exec = exec->get_master();
auto correct_u = [](ValueType x) { return x * x * x; };
auto f = [](ValueType x) { return ValueType{6} * x; };
auto u0 = correct_u(0);
auto u1 = correct_u(1);
generate_rhs(f, u0, u1,
rhs.get());
for (int i = 0; i < u->get_size()[0]; ++i) {
}
const RealValueType reduction_factor{1e-7};
cg::build()
.with_criteria(
gko::stop::Iteration::build().with_max_iters(discretization_points),
reduction_factor))
.on(exec)
-1, 2, -1))
->apply(rhs, u);
std::cout << "\nSolve complete."
<< "\nThe average relative error is "
<< calculate_error(discretization_points, u.get(), correct_u) /
discretization_points
<< std::endl;
}
std::shared_ptr< const Executor > get_executor() const noexcept
Definition polymorphic_object.hpp:243
constexpr std::enable_if_t<!is_complex_s< T >::value, T > abs(const T &x)
Definition math.hpp:931