-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathDeviceRTLInterface.tex
56 lines (34 loc) · 6.5 KB
/
DeviceRTLInterface.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
\section{Target RTL synchronous interface}\label{sc:TargetRTLInterface}
As it can be derived from the previous Sections, a target RTL must provide the following capabilities:
\begin{itemize}
\item \code{int32_t __tgt_rtl_number_of_devices()} – return the number of available devices of the type supported by the target RTL.
\item \code{int32_t __tgt_init_device(int32_t device_id)} – initialize the specified device. In case of success return 0; otherwise return an error code.
\item \code{tgt_target_table* __tgt_rtl_load_binary(int32_t device_id, __tgt_device_image *image)} – pass an executable image section described by image to the specified device and prepare an address table of target entities. In case of error, return \code{NULL}. Otherwise, return a pointer to the built address table. Individual entries in the table may also be \code{NULL}, when the corresponding offload region is not supported on the target device (see previous Section).
\item \code{int32_t __tgt_rtl_is_valid_binary(__tgt_device_image *image)} - return an integer different from zero if the provided device image can be supported by the runtime. The functionality is similar to comparing the result of \code{__tgt_rtl_load_binary} to null. However, this is meant to be a lightweight query to determine if the RTL is suitable for an image without having to load the library, which can be expensive.
\item \code{void* __tgt_rtl_data_alloc(int32_t device_id, int64_t size)} – allocate data on the particular target device, of the specified size. Return address of the allocated data on the target that will be used by \libomptarget{}to initialize the target data mapping structures. These addresses are used to generate a table of target variables to pass to \code{__tgt_rtl_run_region()}. The \code{__tgt_rtl_data_alloc()}returns \code{NULL} in case an error occurred on the target device.
\item \code{int32_t __tgt_rtl_data_submit(int32_t device_id, void *target_ptr, void *host_ptr, int64_t size)} – pass the data content to the target device using the target address. In case of success, return zero. Otherwise, return an error code.
\item \code{int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *host_ptr, void *target_ptr, int64_t size)} – retrieve the data content from the target device using its address. In case of success, return zero. Otherwise, return an error code.
\item \code{int32_t __tgt_rtl_data_delete(int32_t device_id, void *target_ptr)} – de-allocate the data referenced by \code{target_ptr} on the device. In case of success, return zero. Otherwise, return an error code.
\item \code{int32_t __tgt_rtl_run_target_region(int32_t device_id, void *target_entry_ptr, void **target_vars_ptr, int32_t arg_num)} – transfer control to the offloaded entry on the target device; \code{target_vars_ptr} is a table to store the target addresses of all variables used in the target entry code. Entries in \code{target_vars_ptr} match the order of the variables passed in the \code{arg_host_ptr} argument passed to the target region. In case of success, return zero. Otherwise, return an error code.
\item \code{int32_t __tgt_rtl_run_target_team_region(int32_t device_id, void *target_entry_ptr, void **target_vars_ptr, int32_t arg_num, int32_t num_teams, int32_t thread_limit)} – transfer control to the offloaded entry on the target device; \code{target_vars_ptr} is a table to store the target addresses of all variables used in the target team entry code. Entries in \code{target_vars_ptr} match the order of the variables passed in the \code{arg_host_ptr} argument passed to the target team region. In case of success, return zero. Otherwise, return an error code.
\end{itemize}
For each platform there can be a dedicated set of error numbers defined \libomptarget{} assumes that zero is returned in case of success, independently of the target device. If a given system supports shared memory, the target RTL implementation of the \code{__tgt_rtl_data_*()} can optionally not take any action, as host and target device can operate on the same data.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Target RTL asynchronous interface} \label{sc:TargetRTLInterfaceAsync}
The asynchronous RTL interface augment the calls seen in Section~\ref{sc:LibomptargetAsync} and Section~\ref{sc:TargetRTLInterface}. First, it implements the RTL-specific calls related to offloading dependences.
\begin{itemize}
\item \code{void __tgt_rtl_init_dep(tgt_callback_fct completion_callback)}
\item \code{int32_t __tgt_rtl_enforce_dep(int32_t from_device_id, int32_t to_device_id, tgt_event_t completion_event)}
\item \code{void __tgt_rtl_release_dep(tgt_sync_event_t completion_event)}
\item \code{int32_t __tgt_rtl_process_dep()}
\end{itemize}
Second, the asynchronous RTL interface provides the following asynchronous data transfer and offloading calls.
\begin{itemize}
\item \code{int32_t __tgt_rtl_async_dep(int32_t device_id, int32_t num_dep_events, tgt_event_t *dep_events, tgt_event_t *completion_event)}
\item \code{int32_t __tgt_rtl_async_data_submit(int32_t device_id, void *target_ptr, void *host_ptr, int64_t size, int32_t num_dep_events, tgt_event_t *dep_events, tgt_event_t *completion_event)}
\item \code{int32_t __tgt_rtl_async_data_retrieve(int32_t device_id, void *host_ptr, void *target_ptr, int64_t size, int32_t num_dep_events, tgt_event_t *dep_events, tgt_event_t *completion_event)}
\item \code{int32_t __tgt_rtl_async_run_target_region(int32_t device_id, void *target_entry_ptr, void **target_vars_ptr, int32_t arg_num, int32_t num_dep_events, tgt_event_t *dep_events, tgt_event_t *completion_event)}
\item \code{int32_t __tgt_rtl_async_run_target_team_region(int32_t device_id, void *target_entry_ptr, void **target_vars_ptr, int32_t arg_num, int32_t num_teams, int32_t thread_limit, int32_t num_dep_events, tgt_event_t *dep_events, tgt_event_t *completion_event)}
\end{itemize}
In the above list, the first call requires no computation or communication, it just enables \libomptarget{} to summarize all of the synchronization events (listed in \code{dep_event}) into a single event, \code{completion_event}, which must occur after all listed events are completed. Note that the \code{dep_events} synchronization event may now also include dependences on the completion of data transfers initiated by \libomptarget{}. For devices that support the concept of streams, it is expected that the new operation will be queue up in the stream associated with the first synchronization event in \code{dep_event}.