Constructing Tensors
====================

Basic construction of tensors in MatX is intended to be very simple, with minimal parameters. This allows users of other languages
to pick up the syntax quickly without understanding the underlying architecture. While the simple API provides good performance,
it lacks flexibility and can prevent your code from running at the highest possible performance. This document walks through the
different ways to construct tensors, and when you should use certain methods over others.

A Quick Primer On MatX Types
----------------------------
The basic tensor type used in most examples and tests is the ``tensor_t`` object. ``tensor_t`` is the highest-level tensor class, and
provides all of the abstractions for viewing and modifying data, holding storage, and any other metadata needed by a tensor. Because of
their relatively large size, ``tensor_t`` objects are not meant to be passed to GPU devices. In fact, doing so will lead to a compiler error,
since ``tensor_t`` uses types that are not available on the device at this time.

Within a ``tensor_t`` there is an abstract object called ``Storage`` (more on that later), and an inherited class called ``tensor_impl_t``.
``tensor_impl_t`` is a lightweight class containing only the minimum set of member variables needed to access the data from a GPU kernel. Currently the
member variables are a tensor descriptor and a data pointer. Tensor descriptors are covered later in this document.

``tensor_impl_t`` also includes member functions for accessing and modifying the tensor. Examples are the ``operator()`` overloads
(both const and non-const), helper functions for the shape (``Size()`` and ``Stride()``), and utilities for printing on the host. ``tensor_impl_t``
is the type that is passed into GPU kernels, and it contains only types that are compatible with CUDA. Furthermore, the total size of a ``tensor_impl_t``
object is kept as small as possible, since these objects can be replicated many times within a single complex expression. Reducing the size of
``tensor_impl_t`` allows for faster memory accesses, smaller copies before a kernel launch, and easier extension of the code.

To convert between a ``tensor_t`` and a ``tensor_impl_t``, a type trait called ``base_type`` is available and is used as follows:

.. code-block:: cpp

    typename base_type<I1>::type in1_ = in;

where ``in`` is the ``tensor_t`` object and ``in1_`` will be a ``tensor_impl_t``.

MatX Storage
------------
Within the ``tensor_t`` class is an abstract template parameter called ``Storage``. ``Storage`` objects are always created from a ``basic_storage``
class, which provides all accessor functions common to the underlying storage. ``basic_storage`` can wrap raw pointers using the ``raw_pointer_buffer``
class, smart pointers using the ``smart_pointer_buffer`` class, or any RAII object that provides the required interface. If no user-defined storage
is passed in, MatX defaults to allocating a raw CUDA managed memory pointer and backs it with a ``shared_ptr`` for reference counting.

When not using implicitly-allocated memory, the user is free to define the storage container type, allocator, and ownership semantics. The container
type requires const and non-const iterators, an allocate function (when applicable), a ``data()`` function to get the raw pointer, and a way to get
the size. Currently both ``std::array`` and ``std::vector`` from the STL satisfy these requirements, as do both the raw and smart pointer MatX containers.

The allocator type is used when the user passes in a shape without a pointer to existing data. By default, the allocator is ``matx_allocator``,
which is a PMR-compatible allocator with stream semantics. The allocator is used for both allocation and deallocation when no user-provided pointer
is passed in and ownership semantics are requested. If a pointer is provided, only the deallocator is used when ownership semantics have been requested.

In general, creating a tensor lets you choose ownership semantics at creation time. With the ``owning`` type, MatX takes ownership of the pointer
and deallocates the memory when the last tensor using it goes out of scope. With the ``non_owning`` type, MatX uses the pointer, but does not
perform any reference counting or deallocation when the tensor goes out of scope.

Tensor Descriptors
------------------
Tensor descriptors are a template type inside ``tensor_impl_t`` that provides information about the sizes and strides of the tensor. While descriptors
are a simple concept, the implementation can have a large impact on performance if not tuned properly. Both the sizes and strides of the tensor are
stored in a template class supporting iterators to access the metadata directly, plus utility functions for accessing and computing other values from the metadata.
Descriptors are commonly stored as ``std::array`` types given that type's compile-time features, but any class meeting the accessor requirements can be used.

Dynamic Descriptors
###################
Dynamic descriptors use storage in memory to describe the shapes and strides of a tensor. They can have lower performance than static descriptors,
since more memory accesses and offset calculations are needed when accessing tensors, but they offer higher flexibility when the sizes are only known at runtime.

Dynamic descriptors should be used when either the sizes are not known at compile time, or when interoperating with existing code. As mentioned in the
introduction, the descriptor size is very important for both kernel performance and launch time. For this reason, the data types used to store the
shape and strides can vary depending on the size of the tensor parameters. While the shape and stride storage types must match in length, the underlying types
used to store them can differ. This is useful in scenarios where the shape can be expressed as a smaller type than the strides.

Static Descriptors
##################
If the shapes and strides are known at compile time, static descriptors should be used. Static descriptors compute and store the shape and strides in
``constexpr`` variables, and provide ``constexpr`` functions to access both values. When used in a GPU kernel, calling either ``Size()`` or ``Stride()`` emits
an immediate rvalue that the compiler can use for address calculations. This removes all loads and complex pointer arithmetic that could affect the
runtime of a kernel.

Creating Tensors
----------------
With the tensor terminology out of the way, it's time to discuss how to create tensors. If there's one thing to take from this article, it's that you
should use ``make_tensor`` or ``make_static_tensor`` wherever possible.

.. note::
   Prefer ``make_tensor`` or ``make_static_tensor`` over constructing tensors directly.

Using these helper functions has many benefits:

- They remove the need to specify the rank of the tensor in the template parameters
- They abstract away many of the complex template types needed when creating a tensor directly
- They hide potentially irrelevant types from the user

All ``make_``-style functions return a ``tensor_t`` object with the template parameters deduced or created from the input arguments. ``tensor_t``
has only two required template parameters (type and rank). For simple cases where only implicitly-allocated memory is needed, the default constructor
will suffice. Some situations prevent using the ``make_`` functions, such as when a tensor is a class member variable. In that case the type of
the member variable must be spelled out in the member list. In these scenarios it's expected that the user knows what they are doing and can handle
writing out the types themselves. For examples of this, see the simple_pipeline files.

All make functions take the data type as the first template parameter.

Make variants
#############
There are currently four different variants of the ``make_`` helper functions:

- ``make_`` for creating a tensor with a dynamic descriptor and returning by value
- ``make_static_`` for creating a tensor with a static descriptor and returning by value
- ``make_X_p`` for creating a tensor with a dynamic descriptor and returning a pointer
- ``make_static_X_p`` for creating a tensor with a static descriptor and returning a pointer

The ``_p`` variants return pointers allocated with ``new``, and the caller is expected to delete them when finished. Returning smart pointers would
have made this easier, but some users have their own smart pointer wrappers and would not want to unpack the standard library versions.

Within each of these types, there are usually versions both with and without a user-defined pointer. The pointer versions are used when an existing device pointer
is passed to MatX rather than having the allocation done when the tensor is created.

Each of these four variants can be used with all of the construction types below when applicable.

Creating From a C Array or a Brace-Enclosed List
################################################
Tensors can be created using a C-style shape array as an lvalue, or a brace-enclosed list as an rvalue. The following result in the same ``make_`` call:

.. code-block:: cpp

    int array[3] = {10, 20, 30};
    auto t = make_tensor<float>(array);

and

.. code-block:: cpp

    auto t = make_tensor<float>({10, 20, 30});

In the former case the array is an lvalue that can be modified in memory before the call, whereas the latter case uses rvalues. When the sizes are known
at compile time, the static version of ``make_`` should be used:

.. code-block:: cpp

    auto t = make_static_tensor<float, 10, 20, 30>();

Notice the sizes are now template parameters instead of function parameters. Both forms can be used interchangeably in MatX code, but the static version
can lead to higher performance.

Similarly, all variants can be called with a user-defined pointer:

.. code-block:: cpp

    auto t = make_tensor<float>(ptr, {10, 20, 30}); // ptr is a valid device pointer

All cases shown above use the default stride parameters. If the strides are not linear in memory, they can be passed in as well:

.. code-block:: cpp

    int shape[3] = {10, 20, 30};
    int strides[3] = {1200, 60, 2};
    auto t = make_tensor<float>(shape, strides);

Creating From A Conforming Shape
################################
As mentioned in the descriptor section, any type that conforms to the shape semantics can be used inside of a descriptor, and can also be passed into the
``make_`` functions:

.. code-block:: cpp

    std::array<int, 3> array = {10, 20, 30};
    auto t = make_tensor<float>(array);

Creating From A Descriptor
##########################
Descriptors (both shapes and strides) can be used to construct tensors. This is useful when taking an existing tensor's descriptor and creating a new tensor from it:

.. code-block:: cpp

    auto d = existingTensor.Descriptor();
    auto t = make_tensor<float>(d);

``t`` is now a tensor with the same shape and strides as ``existingTensor``.

0-D Tensors
###########
0-D tensors differ from higher-rank tensors since they have no meaningful shape or strides, and therefore don't need those parameters. Empty versions of the
``make_`` helpers exist to create these:

.. code-block:: cpp

    auto t0 = make_tensor<float>();
    auto t01 = make_tensor<float>(ptr);

Custom Storage, Descriptors, and Allocators
###########################################
Within most of the ``make_`` functions, there are template parameter choices for custom storage, descriptor, and allocator types.

Storage
-------
Storage types can be created by wrapping a container object in the ``basic_storage`` class. MatX has container types built in for both raw pointers and smart
pointers, but this can be extended to any conforming container type. The ``basic_storage`` class does not know about any underlying data structures or ownership;
that is encapsulated inside of the template type ``C``. For example, to create a custom storage object wrapping a raw pointer:

.. code-block:: cpp

    raw_pointer_buffer<T, owning, matx_allocator<T>> rp{ptr, static_cast<size_t>(desc.TotalSize()*sizeof(T))};
    basic_storage<decltype(rp)> s{std::move(rp)};

The code above creates a new ``raw_pointer_buffer`` object with ownership semantics and the ``matx_allocator`` allocator. The constructor taking a pointer and a
size does not allocate any new data, but tracks the pointer internally using a smart pointer. If instead ``non_owning`` had been passed as a template parameter, the
pointer would not be tracked or freed. With the container created, the next line passes it into a ``basic_storage`` object for use inside ``tensor_t``.

Descriptors
-----------
A descriptor can be created from any conforming descriptor type (see the descriptor explanation above). Within MatX, ``std::array`` is used by default
when creating dynamic descriptors. Because the shape and stride types can vary in size, MatX provides helper types for creating descriptors of common types:

- ``tensor_desc_cr_disi_dist<RANK>`` is a dynamic descriptor with ``index_t`` strides and shapes. This is the default descriptor and can also be created using the type
  ``DefaultDescriptor``. ``index_t`` is defined at compile time and defaults to 64-bit
- ``tensor_desc_cr_ds_t<ShapeType, StrideType, RANK>`` is a ``std::array``-based descriptor with user-provided types
- ``tensor_desc_cr_ds_32_32_t<RANK>`` is a descriptor with 32-bit sizes and strides
- ``tensor_desc_cr_ds_64_64_t<RANK>`` is a descriptor with 64-bit sizes and strides
- ``tensor_desc_cr_ds_32_64_t<RANK>`` is a descriptor with 32-bit sizes and 64-bit strides
- ``static_tensor_desc_t<size_t I, size_t Is...>`` is a static-sized descriptor with the shape and strides created at compile time

To create a descriptor:

.. code-block:: cpp

    const index_t arr[3] = {10, 20, 30};
    DefaultDescriptor<RANK> desc{arr};

In this case we create a default descriptor (based on ``index_t`` sizes) using a C-style array.