Tpetra: Improve performance of CrsMatrix::copyAndPermute #13587

Open
skennon10 opened this issue Nov 11, 2024 · 1 comment
Labels: pkg: Tpetra, type: enhancement

skennon10 commented Nov 11, 2024

Enhancement - Improve performance of CrsMatrix::copyAndPermute

When evaluating the Jacobian in a Panzer mini-EM test, a significant fraction of the time is spent in CrsMatrix::copyAndPermute. Timing calipers show that replaceGlobalValues (actually combineGlobalValues called with REPLACE) and getGlobalRowCopy are two culprits. Additional time is spent in the hash table lookups behind getLocalElement and getGlobalElement.

Serial speedups have been obtained by writing a batch/array version of getGlobalElement (which amortizes the 'if(contiguous)' check and other if-statements over many indices) and by rewriting replaceGlobalValuesImpl and getGlobalRowCopy. Timing results will follow in an update to this issue.
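To illustrate the amortization idea, here is a minimal sketch of a scalar lookup next to a batch lookup. `ToyMap`, its fields, `kInvalid`, and `getGlobalElements` are hypothetical names invented for this example; this is not the Tpetra::Map implementation or the code from this change, only a picture of how a per-call branch can be hoisted out of the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel for "no such index" (an assumption of this sketch).
constexpr std::int64_t kInvalid = -1;

// Hypothetical stand-in for a local-to-global index map.
// This is NOT Tpetra::Map; names and layout are illustrative only.
struct ToyMap {
  bool contiguous;                  // true: global = minGlobal + local
  std::int64_t minGlobal;           // first global index (contiguous case)
  std::vector<std::int64_t> table;  // explicit local->global table otherwise

  // Scalar lookup: every call re-tests the contiguous flag.
  std::int64_t getGlobalElement(std::size_t lid) const {
    if (contiguous) return minGlobal + static_cast<std::int64_t>(lid);
    return lid < table.size() ? table[lid] : kInvalid;
  }

  // Batch lookup: the contiguous test is hoisted out of the loop,
  // so it is paid once per batch instead of once per index.
  void getGlobalElements(const std::vector<std::size_t>& lids,
                         std::vector<std::int64_t>& gids) const {
    gids.resize(lids.size());
    if (contiguous) {
      for (std::size_t i = 0; i < lids.size(); ++i)
        gids[i] = minGlobal + static_cast<std::int64_t>(lids[i]);
    } else {
      for (std::size_t i = 0; i < lids.size(); ++i)
        gids[i] = lids[i] < table.size() ? table[lids[i]] : kInvalid;
    }
  }
};
```

A caller that needs the global column indices of a whole row would make one `getGlobalElements` call with the row's local indices instead of one `getGlobalElement` call per entry.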

This bottom-up approach prompted a top-down investigation, which led to using Kokkos to parallelize the main copyAndPermute loop over the rows of the source matrix.
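The row loop lends itself to parallelization because, with a static target graph, each iteration only replaces values in its own target row. The sketch below shows that structure in plain C++ with a serial driver standing in for the parallel dispatch; it does not use Kokkos and is not the Tpetra code. `Csr`, `copyRow`, and `copyAndPermuteSketch` are hypothetical names, and the linear column search is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR storage; a hypothetical stand-in, not Tpetra::CrsMatrix.
struct Csr {
  std::vector<std::size_t> rowPtr;  // size numRows + 1
  std::vector<int> colInd;          // column index of each entry
  std::vector<double> vals;         // value of each entry
};

// Body of one loop iteration: replace values in target row tgtRow with
// those of source row srcRow, matching entries by column index. The target
// graph is static, so source columns missing from the target row are skipped.
inline void copyRow(const Csr& src, std::size_t srcRow,
                    Csr& tgt, std::size_t tgtRow) {
  for (std::size_t k = src.rowPtr[srcRow]; k < src.rowPtr[srcRow + 1]; ++k) {
    for (std::size_t j = tgt.rowPtr[tgtRow]; j < tgt.rowPtr[tgtRow + 1]; ++j) {
      if (tgt.colInd[j] == src.colInd[k]) {
        tgt.vals[j] = src.vals[k];
        break;
      }
    }
  }
}

// Serial driver standing in for a parallel dispatch. Rows [0, numSame) map
// to themselves; the remaining rows follow the permutation lists. Each
// iteration writes only to its own target row, so iterations are
// independent as long as no target row appears twice.
inline void copyAndPermuteSketch(const Csr& src, Csr& tgt, std::size_t numSame,
                                 const std::vector<std::size_t>& permuteFrom,
                                 const std::vector<std::size_t>& permuteTo) {
  const std::size_t total = numSame + permuteFrom.size();
  for (std::size_t i = 0; i < total; ++i) {  // candidate for a parallel loop
    const bool same = i < numSame;
    const std::size_t s = same ? i : permuteFrom[i - numSame];
    const std::size_t t = same ? i : permuteTo[i - numSame];
    copyRow(src, s, tgt, t);
  }
}
```

In a Kokkos version, the body of the `for` loop would become the functor or lambda passed to a parallel loop over `total` rows, with the matrix data held in views rather than `std::vector`.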

skennon10 added the "type: enhancement" label Nov 11, 2024
skennon10 self-assigned this Nov 11, 2024
skennon10 (author) commented

Timings

2024-11-11
----------

4 cores, mesh 100x100x100, ascicgpu038 (4 devices)

old:

|   |   panzer::ModelEvaluator::evalModel(J): 4.90168 - 3.69278% [1] {min=4.90166, max=4.90169, std dev=9.54096e-06} <1, 0, 0, 0, 1, 0, 1, 0, 0, 1>
|   |   |   panzer::AssemblyEngine::evaluate_scatter(panzer::Traits::Jacobian): 3.83767 - 78.2931% [1] {min=3.83512, max=3.84019, std dev=0.00257403} <1, 1, 0, 0, 0, 0, 0, 0, 1, 1>
|   |   |   |   panzer::AssemblyEngine::lof->ghostToGlobalContainer(panzer::Traits::Jacobian): 3.8376 - 99.9982% [1] {min=3.83505, max=3.84012, std dev=0.00257341} <1, 1, 0, 0, 0, 0, 0, 0, 1, 1>
|   |   |   |   |   Tpetra::MultiVector::putScalar: 2.22505e-05 - 0.000579802% [2] {min=1.6897e-05, max=2.73e-05, std dev=4.47445e-06} <1, 0, 0, 1, 0, 0, 1, 0, 0, 1>
|   |   |   |   |   Tpetra::DistObject::beginTransfer[Host]: 3.58066 - 93.3045% [6] {min=3.57394, max=3.58427, std dev=0.00469991} <1, 0, 0, 0, 0, 0, 1, 0, 0, 2>
|   |   |   |   |   |   Tpetra::DistObject::doTransferNew::checkSizes: 4.193e-06 - 0.000117101% [6] {min=3.818e-06, max=5.192e-06, std dev=6.67502e-07} <3, 0, 0, 0, 0, 0, 0, 0, 0, 1>
|   |   |   |   |   |   Tpetra::DistObject::doTransferNew::copyAndPermute: 3.50848 - 97.9843% [6] {min=3.43601, max=3.56608, std dev=0.0670518} <1, 0, 1, 0, 0, 0, 0, 0, 0, 2>
|   |   |   |   |   |   |   Tpetra::MultiVector::copyAndPermute[Device]: 0.0030337 - 0.0864676% [2] {min=0.00295513, max=0.00314429, std dev=8.69089e-05} <1, 1, 0, 0, 0, 1, 0, 0, 0, 1>
|   |   |   |   |   |   |   Tpetra::CrsMatrix::copyAndPermute: 3.50544 - 99.9131% [4] {min=3.43304, max=3.56309, std dev=0.0670119} <1, 0, 1, 0, 0, 0, 0, 0, 0, 2>

|   |   |   |   |   |   |   |   Tpetra::CrsMatrix::copyAndPermuteStaticGraph: 3.50542 - 99.9996% [4] {min=3.43302, max=3.56308, std dev=0.0670113} <1, 0, 1, 0, 0, 0, 0, 0, 0, 2>

|   |   |   |   |   |   |   |   Remainder: 1.53587e-05 - 0.000438141%
|   |   |   |   |   |   |   Remainder: 1.52455e-05 - 0.000434532%

new (cuda):
|   |   panzer::ModelEvaluator::evalModel(J): 1.21941 - 1.00007% [1] {min=1.21941, max=1.21942, std dev=7.11369e-06} <1, 0, 0, 1, 0, 0, 0, 1, 0, 1>
|   |   |   panzer::AssemblyEngine::evaluate_scatter(panzer::Traits::Jacobian): 0.139482 - 11.4385% [1] {min=0.136319, max=0.142396, std dev=0.00248484} <1, 0, 0, 0, 0, 2, 0, 0, 0, 1>
|   |   |   |   panzer::AssemblyEngine::lof->ghostToGlobalContainer(panzer::Traits::Jacobian): 0.139416 - 99.9525% [1] {min=0.136253, max=0.14233, std dev=0.00248504} <1, 0, 0, 0, 0, 2, 0, 0, 0, 1>
|   |   |   |   |   Tpetra::MultiVector::putScalar: 2.23683e-05 - 0.0160442% [2] {min=1.6478e-05, max=2.628e-05, std dev=4.68044e-06} <1, 0, 0, 0, 1, 0, 0, 0, 0, 2>
|   |   |   |   |   Tpetra::DistObject::beginTransfer[Host]: 0.127398 - 91.3795% [6] {min=0.123563, max=0.131997, std dev=0.00389488} <1, 1, 0, 0, 0, 0, 1, 0, 0, 1>
|   |   |   |   |   |   Tpetra::DistObject::doTransferNew::checkSizes: 3.8715e-06 - 0.00303891% [6] {min=3.018e-06, max=4.391e-06, std dev=6.5438e-07} <1, 0, 0, 0, 1, 0, 0, 0, 0, 2>
|   |   |   |   |   |   Tpetra::DistObject::doTransferNew::copyAndPermute: 0.0853825 - 67.0204% [6] {min=0.0847438, max=0.086271, std dev=0.000639976} <1, 0, 0, 2, 0, 0, 0, 0, 0, 1>
|   |   |   |   |   |   |   Tpetra::MultiVector::copyAndPermute[Device]: 0.00311388 - 3.64697% [2] {min=0.00286818, max=0.00370714, std dev=0.000397135} <2, 1, 0, 0, 0, 0, 0, 0, 0, 1>
|   |   |   |   |   |   |   Tpetra::CrsMatrix::copyAndPermute: 0.082258 - 96.3406% [4] {min=0.0818093, max=0.0825522, std dev=0.000318425} <1, 0, 0, 0, 0, 0, 1, 1, 0, 1>

|   |   |   |   |   |   |   |   Tpetra::CrsMatrix::copyAndPermuteStaticGraph: 0.0822469 - 99.9864% [4] {min=0.0818001, max=0.0825393, std dev=0.000317065} <1, 0, 0, 0, 0, 0, 1, 1, 0, 1>

|   |   |   |   |   |   |   |   Remainder: 1.11495e-05 - 0.0135543%
|   |   |   |   |   |   |   Remainder: 1.0596e-05 - 0.01241%

4 cores, CPU-only build

old:
|   |   panzer::ModelEvaluator::evalModel(J): 18.3043 - 4.61639% [1] {min=18.3043, max=18.3043, std dev=1.06624e-06} <1, 0, 1, 0, 0, 0, 0, 0, 0, 2>
|   |   |   panzer::AssemblyEngine::evaluate_scatter(panzer::Traits::Jacobian): 2.83329 - 15.4788% [1] {min=2.83225, max=2.83384, std dev=0.000738868} <1, 0, 0, 0, 0, 0, 1, 0, 0, 2>
|   |   |   |   panzer::AssemblyEngine::lof->ghostToGlobalContainer(panzer::Traits::Jacobian): 2.83319 - 99.9967% [1] {min=2.83215, max=2.83374, std dev=0.000737956} <1, 0, 0, 0, 0, 0, 1, 0, 0, 2>
|   |   |   |   |   Tpetra::MultiVector::putScalar: 0.00185293 - 0.0654009% [2] {min=0.00162416, max=0.00250493, std dev=0.000434902} <3, 0, 0, 0, 0, 0, 0, 0, 0, 1>
|   |   |   |   |   Tpetra::DistObject::beginTransfer[Host]: 2.73086 - 96.3881% [6] {min=2.72678, max=2.73467, std dev=0.00358903} <1, 0, 1, 0, 0, 0, 0, 1, 0, 1>
|   |   |   |   |   |   Tpetra::DistObject::doTransferNew::checkSizes: 9.698e-06 - 0.000355126% [6] {min=6.641e-06, max=1.2518e-05, std dev=2.6499e-06} <1, 0, 0, 1, 0, 0, 0, 1, 0, 1>
|   |   |   |   |   |   Tpetra::DistObject::doTransferNew::copyAndPermute: 2.66067 - 97.4296% [6] {min=2.59913, max=2.7225, std dev=0.0648184} <2, 0, 0, 0, 0, 0, 0, 0, 0, 2>
|   |   |   |   |   |   |   Tpetra::MultiVector::copyAndPermute[Device]: 0.00153164 - 0.0575659% [2] {min=0.00145472, max=0.00162206, std dev=7.51679e-05} <1, 1, 0, 0, 0, 0, 1, 0, 0, 1>
|   |   |   |   |   |   |   Tpetra::CrsMatrix::copyAndPermute: 2.65912 - 99.9418% [4] {min=2.59755, max=2.72103, std dev=0.0648882} <2, 0, 0, 0, 0, 0, 0, 0, 0, 2>

|   |   |   |   |   |   |   |   Tpetra::CrsMatrix::copyAndPermuteStaticGraph: 2.65909 - 99.9991% [4] {min=2.59753, max=2.72101, std dev=0.0648877} <2, 0, 0, 0, 0, 0, 0, 0, 0, 2>

|   |   |   |   |   |   |   |   Remainder: 2.40172e-05 - 0.000903203%
|   |   |   |   |   |   |   Remainder: 1.7488e-05 - 0.000657279%

new:
|   |   panzer::ModelEvaluator::evalModel(J): 16.0974 - 4.13672% [1] {min=16.0974, max=16.0974, std dev=1.90735e-06} <1, 1, 1, 0, 0, 0, 0, 0, 0, 1>
|   |   |   panzer::AssemblyEngine::evaluate_scatter(panzer::Traits::Jacobian): 0.496743 - 3.08586% [1] {min=0.49626, max=0.497459, std dev=0.000576046} <2, 0, 0, 0, 0, 1, 0, 0, 0, 1>
|   |   |   |   panzer::AssemblyEngine::lof->ghostToGlobalContainer(panzer::Traits::Jacobian): 0.49666 - 99.9835% [1] {min=0.496179, max=0.497375, std dev=0.000574821} <2, 0, 0, 0, 0, 1, 0, 0, 0, 1>
|   |   |   |   |   Tpetra::MultiVector::putScalar: 0.00165392 - 0.333009% [2] {min=0.00163596, max=0.00166712, std dev=1.37209e-05} <1, 0, 0, 0, 1, 0, 0, 0, 1, 1>
|   |   |   |   |   Tpetra::DistObject::beginTransfer[Host]: 0.394186 - 79.3672% [6] {min=0.390494, max=0.397387, std dev=0.00290458} <1, 0, 0, 0, 1, 0, 1, 0, 0, 1>
|   |   |   |   |   |   Tpetra::DistObject::doTransferNew::checkSizes: 8.00475e-06 - 0.00203071% [6] {min=5.305e-06, max=1.0206e-05, std dev=2.52345e-06} <1, 0, 1, 0, 0, 0, 0, 0, 0, 2>
|   |   |   |   |   |   Tpetra::DistObject::doTransferNew::copyAndPermute: 0.375744 - 95.3215% [6] {min=0.367203, max=0.386364, std dev=0.00934129} <2, 0, 0, 0, 0, 0, 0, 1, 0, 1>
|   |   |   |   |   |   |   Tpetra::MultiVector::copyAndPermute[Device]: 0.00158284 - 0.421256% [2] {min=0.00154433, max=0.00162551, std dev=3.97939e-05} <1, 1, 0, 0, 0, 0, 0, 1, 0, 1>
|   |   |   |   |   |   |   Tpetra::CrsMatrix::copyAndPermute: 0.374143 - 99.5739% [4] {min=0.365578, max=0.384793, std dev=0.00937656} <2, 0, 0, 0, 0, 0, 0, 1, 0, 1>

|   |   |   |   |   |   |   |   Tpetra::CrsMatrix::copyAndPermuteStaticGraph: 0.374117 - 99.9931% [4] {min=0.365552, max=0.384765, std dev=0.00937545} <2, 0, 0, 0, 0, 0, 0, 1, 0, 1>

|   |   |   |   |   |   |   |   Remainder: 2.57275e-05 - 0.00687639%
|   |   |   |   |   |   |   Remainder: 1.82233e-05 - 0.00484992%
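For a quick read of the numbers above, the old/new ratios can be computed directly from the quoted timer values. A small sketch, assuming the timers report seconds; `speedup` and `printSpeedups` are helper names made up for this example. The ratios come out to roughly 43x (CUDA build) and 7x (CPU-only build) for Tpetra::CrsMatrix::copyAndPermute, and roughly 4.0x and 1.14x for panzer::ModelEvaluator::evalModel(J) as a whole.

```cpp
#include <cassert>
#include <cstdio>

// Speedup of a new timing relative to an old one (same units for both).
inline double speedup(double oldTime, double newTime) {
  return oldTime / newTime;
}

// Print old/new ratios for the timer values quoted in the logs above.
inline void printSpeedups() {
  std::printf("CrsMatrix::copyAndPermute, CUDA build: %.1fx\n",
              speedup(3.50544, 0.082258));
  std::printf("CrsMatrix::copyAndPermute, CPU build:  %.1fx\n",
              speedup(2.65912, 0.374143));
  std::printf("evalModel(J), CUDA build:              %.2fx\n",
              speedup(4.90168, 1.21941));
  std::printf("evalModel(J), CPU build:               %.2fx\n",
              speedup(18.3043, 16.0974));
}
```

The gap between the copyAndPermute ratio and the evalModel(J) ratio reflects how much of the Jacobian evaluation lies outside this routine, especially in the CPU-only build.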

jhux2 changed the title from "PackageName: General Summary of the Enhancement" to "Tpetra: Improve performance of CrsMatrix::copyAndPermute" Nov 11, 2024