This code contains a global memory, shared-memory tiled, and joint shared-memory and register-tiled matrix matrix multiplications.
Examples for using Nsight Compute to compare kernel performance.
1-1-pinned-basic
: (1_1_pinned_basic.cu
)1-2-pinned-tiled
: (1_1_pinned_tiled.cu
)1-3-pinned-joint
: (1_1_pinned_joint.cu
)
Examples for using Nsight Systems to compare data transfer, and relationship between data transfer and end-to-end time.
2-1-pageable-basic
: (2_1_pageable_basic.cu
)2-2-pinned-basic
: (2_2_pinned_basic.cu
)2-3-pinned-tiled
: (2_3_pinned_tiled.cu
)2-4-pinned-tiled-overlap
: (2_4_pinned_tiled_overlap.cu
)2-5-pinned-joint
: (2_5_pinned_joint.cu
)2-6-pinned-joint-overlap
: (2_6_pinned_joint_overlap.cu
)
All programs share the same basic options:
- Three optional positional arguments to set M, N, and K.
--iters <int>
the number of measured iterations (default5
)--warmup <int>
the number of warmup iterations (default5
)--check
: check correctness (defaultfalse
). Only use for small multiplications