The Ocean Tensor Package is an open-source package for matrix and tensor operations on CPU and GPU. The package aims to serve as a foundational layer for applications that require dense tensor operations on a variety of device types. All operations are available through a unified interface that is carefully designed to be powerful, extensible, and at the same time easy to use. The package has a modular implementation in C and provides a light-weight Python interface. Modularity of the package facilitates the addition of new operations as well as new device types.

Over the last decade or so, general-purpose GPUs have been successfully used in fields such as medical imaging [

The Ocean Tensor Package [

The user interface to the Ocean Tensor Package exposes several object types, illustrated in Figure

Connection between the main object types in Ocean.

Device objects enable the specification of the device to be used when instantiating a Tensor or Storage object. In addition, they provide generic information about the given device, such as the support for byte-swapped data or a list of all currently loaded modules. Depending on the device type, additional information may be available. For instance, on GPU devices it is possible to query numerous properties, including the multiprocessor count or the currently available amount of free memory. Advanced functions include the instantiation of a new stream, and the specification of the number of intermediate tensor buffers for the device and their maximum size. Ocean maintains a list of available devices, including the

Storage objects encapsulate a contiguous piece of memory that is either allocated dynamically or provided by an external source. The data type associated with the storage has two main purposes: first, as the default data type when instantiating a tensor from the storage without providing a type; and second, for formatting the storage elements for display. The data type of storage objects can be changed freely without affecting the tensors that use it. It is possible to superimpose tensors of different data types on the same storage. One typical example where this happens is when querying the imaginary part of a complex double-precision tensor, which results in an additional tensor view on the storage of type double. There are no restrictions on the number of different tensor types that can share the same storage. Tensor operations use the storage stream for synchronization to avoid race conditions and inconsistencies in the data. Storage data can be marked as read-only, which prevents any updates to the data, either directly or through tensor operations (marking storage as read-only is reflected in all derived tensors).
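The superposition of differently typed tensors on one storage can be illustrated with NumPy, used here purely as an analogy; the sketch does not use or show Ocean's own interface. The imaginary part of a complex double-precision array is obtained as a double-precision view on the same memory, so a write through the view is visible in the complex array.

```python
import numpy as np

# A complex double-precision array: each element occupies 16 bytes,
# 8 bytes for the real part followed by 8 bytes for the imaginary part.
z = np.array([1 + 2j, 3 + 4j, 5 + 6j], dtype=np.complex128)

# The imaginary part is a float64 view on the same memory, not a copy.
im = z.imag
shares = np.shares_memory(z, im)
view_dtype = im.dtype
view_strides = im.strides  # 8-byte elements spaced 16 bytes apart

# Writing through the view updates the complex array.
im[0] = 20.0
first = z[0]
```

The view has a different element type and stride than the array it was derived from, which is the situation the storage object has to accommodate.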

Ocean tensors are easy to instantiate on any of the available devices. For example, creation of a 3 × 4 tensor with single-precision floating-point format on device

Tensor operations in Ocean are provided through modules. The Core module forms the basis of Ocean, and includes the definition of the basic object classes as well as the device instances. As the most elementary operation, the Core module supports tensor creation: from storage, from data in the form of nested lists, sequences, and other tensor types, or without initialization. Aside from tensor creation, the Core module provides an extensive set of elementary functions, including functions for shape and axis manipulation, indexing, copy functions, type and device casting, basic arithmetic operations, trigonometric operations (supported on all real and complex floating-point types), as well as tensor reductions along one or more axes. (A complete list of functions can be found on the Ocean Tensor Package repository [

The type of a tensor can be seen as the combination of the data type and the device associated with the tensor. Tensors in Ocean have an associated type, and type casting may therefore be necessary at times. Explicit type casting can be done using the

Implicit type casting is used in Ocean to ensure that the input arguments to tensor operations have appropriate types and byte order. Consider for instance the tensor addition:

For implicit casting of the data types we follow Numpy and use the smallest available data type that can represent both data types. For instance, addition of signed and unsigned 8-bit integers would give a 16-bit signed integer. Since no standard data type is available for quadruple-precision floats, an exception is made for 64-bit integers and floating-point numbers, which result in double-precision floats. Automatic type casting in Ocean is switched on by default, but can be disabled by the user in case strict type checks are needed. When switched off, an exception is raised whenever a type mismatch is encountered.
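Since the promotion rules follow Numpy, they can be checked with Numpy's own `np.result_type`. The sketch below is not Ocean code: the helper `promote` and its `auto_cast` flag are hypothetical names that only mimic the strict mode described above.

```python
import numpy as np

def promote(a, b, auto_cast=True):
    """Hypothetical helper: common data type, or an error in strict mode."""
    a, b = np.dtype(a), np.dtype(b)
    if a != b and not auto_cast:
        raise TypeError("type mismatch: %s and %s" % (a, b))
    return np.result_type(a, b)

# Signed and unsigned 8-bit integers need a 16-bit signed integer.
small = promote(np.int8, np.uint8)

# 64-bit integers combined with floating-point numbers give double precision.
mixed = promote(np.int64, np.float32)
```

With `auto_cast=False` the same call raises a `TypeError` for mismatched types instead of promoting them.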

Type casting based on the contents of the tensor is desirable, for example, when taking the square root of negative real numbers or the arccosine of scalars with magnitude larger than one. Should such operations result in a not-a-number (NaN) value, return a complex-valued result, or generate an error? The approach taken in Ocean is to add parameters to such operators that indicate the compute mode. In the standard mode no checks on the tensor elements are done and NaN values are generated whenever needed. Checks are made in the warning and error modes, giving respectively a warning or an error when elements with values outside of the operator domain are encountered. Finally, in the complex mode, checks are made to determine whether the resulting data type should be real or complex. If needed, explicit type casting can always be used.
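The four compute modes can be sketched in plain Python for the square root. This is an illustration under assumptions, not Ocean's implementation: the function name `sqrt_with_mode`, the mode strings, and the use of lists in place of tensors are all hypothetical.

```python
import cmath
import math
import warnings

def sqrt_with_mode(values, mode="standard"):
    """Hypothetical sketch of a square root with a compute-mode parameter."""
    if mode not in ("standard", "warning", "error", "complex"):
        raise ValueError("unknown mode: %r" % (mode,))
    # The domain check is skipped entirely in the standard mode.
    out_of_domain = mode != "standard" and any(v < 0 for v in values)
    if out_of_domain and mode == "error":
        raise ValueError("square root of negative value")
    if out_of_domain and mode == "warning":
        warnings.warn("square root of negative value", RuntimeWarning)
    if out_of_domain and mode == "complex":
        # Only switch to a complex result when it is actually needed.
        return [cmath.sqrt(v) for v in values]
    # The per-element test stands in for the NaN that floating-point
    # hardware would produce; math.sqrt raises on negative input instead.
    return [math.sqrt(v) if v >= 0 else math.nan for v in values]

real_result = sqrt_with_mode([4.0, 9.0], mode="complex")
complex_result = sqrt_with_mode([-4.0], mode="complex")
```

In the complex mode the result stays real when all elements lie in the domain, which is the point of checking the contents rather than always casting to a complex type.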

Ocean supports a variety of indexing modes along one or more dimensions that can be combined to index a tensor. The basic single-dimension indexing modes are (1) scalars, to index a single element along an axis; (2) ranges, to index a regularly spaced set of elements; and (3) the colon ‘:’ operator to indicate the entire dimension. In addition to the basic modes it is possible to use one- or two-dimensional index arrays to select particular elements, by specifying the indices along a single dimension, or tuples of indices along several dimensions. As is customary in Python, negative indices can be used to indicate the index relative to the end of the dimension. Finally, boolean tensors can be used as masks for indexing, with the non-zero elements indicating the elements to be selected. Dimensions that are omitted in the indexing are implicitly indexed with the colon operator, and the ellipsis object ‘…’ can appear once to indicate application of zero or more colon operators in that location to complete the index. When indexing a tensor using only basic indexing modes (either explicitly or implicitly), a view of that tensor is returned in the form of a new tensor that shares the original storage. In all other cases a new tensor is created by copying the indexed elements.
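The distinction between views and copies can be tried out with Numpy, whose indexing behaves analogously for the cases shown here. The example is an analogy only and does not use Ocean's interface.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

basic = a[1:, ::2]      # ranges only: a view on the data of a
last_col = a[..., -1]   # ellipsis with a negative scalar index: a view
fancy = a[[0, 2]]       # index array along the first dimension: a copy
masked = a[a > 5]       # boolean mask: a copy of the selected elements

basic_is_view = np.shares_memory(a, basic)
last_col_is_view = np.shares_memory(a, last_col)
fancy_is_view = np.shares_memory(a, fancy)
```

Here `basic_is_view` and `last_col_is_view` are true, whereas `fancy` and `masked` own their data.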

Special preprocessing is needed for index arrays and boolean masks: for index arrays the indices need to be checked for validity, whereas for boolean masks it is necessary to count the number of selected elements, in order to determine the size of the output tensor, and to convert the selected indices into relative offsets into the data buffer of the tensor being indexed. When such indices are used repeatedly, computational effort is wasted in applying the same preprocessing steps for each use. To avoid this situation, Ocean introduces index objects, which are constructed by indexing the

The Python interface of Ocean provides for plug-in modules to define external object types for tensors and scalars, and the conversion between these and the corresponding Ocean types. All extended object types provided by the plug-ins are compared against when parsing the tensor operation parameters. This allows them to be used in essentially the same way as Ocean tensors and scalars. As an example, we can declare the Numpy tensor and scalar types by importing

Automatic garbage collection in Python can delay the deletion of tensor objects and cause devices to run out of free memory despite careful management by the user. In order to force tensor deletion, it is possible to call the

We now compare some of the features in Ocean with those in other packages. Since most of the packages are in active development, we only discuss the features available at the time of writing.

Numpy [

A package that supports multiple device types and is written as a general library with separate language bindings is ArrayFire [

In Table

Comparison between different packages providing tensor functionality.

Feature | Numpy | CuPy | Caffe | PyTorch | TensorFlow | MXNet | ArrayFire | Ocean |
---|---|---|---|---|---|---|---|---|
Multiple device types | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Automatic type casting | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
Unified tensor type | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
Complex data types | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ |
Flexible tensor strides | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ |
Tensor overlap detection | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ |

The layout of tensors in memory is given by the strides, or distance between successive elements along each of the dimensions. Flexibility in the tensor strides enables features such as broadcasting along dimensions, easy manipulation of axes, and creation of views on regularly indexed subtensors. In addition, it ensures compatibility with a wide range of existing data types for tensors and matrices. Most of the deep-learning packages, as well as ArrayFire, adhere to a contiguous row-major data order, with implicit strides that can be inferred based on the tensor dimensions and the element size of the given data type. PyTorch also uses this data order by default, but allows users to override the standard layout by specifying tensor strides as nonnegative multiples of the element size. Numpy and CuPy support arbitrary strides. Each of these packages, along with Ocean, implements sorting of axes and merging of consecutive axes, when possible, to increase memory locality and reduce the overhead of iterating over dimensions, both of which help increase the computational efficiency of tensor operations on strided data. Packages that use a contiguous tensor layout can flatten tensors to a single dimension for many operations, such as unary and binary elementwise operations; other operations may require optimizations similar to those mentioned above. ArrayFire limits the number of tensor dimensions to four and often uses explicit nested for-loops with index computation at the innermost loop to traverse the data.
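The sorting and merging of axes mentioned above can be sketched in a few lines of Python. The function `simplify_layout` is a hypothetical illustration, not code from any of the packages discussed; it assumes nonnegative strides given in bytes and an operation, such as an elementwise unary function, whose result does not depend on the traversal order.

```python
def simplify_layout(shape, strides):
    """Sort axes by decreasing stride and merge axes that are contiguous."""
    # Axes of size one contribute nothing to the traversal and are dropped.
    axes = sorted(
        ((n, s) for n, s in zip(shape, strides) if n != 1),
        key=lambda axis: axis[1],
        reverse=True,
    )
    merged = []
    for n, s in axes:
        if merged and merged[-1][1] == s * n:
            # One step along the outer axis equals a full pass over the
            # inner axis, so both can be traversed as a single axis.
            merged[-1] = (merged[-1][0] * n, s)
        else:
            merged.append((n, s))
    return [n for n, _ in merged], [s for _, s in merged]

# A 3 x 4 single-precision tensor in row-major order.
contiguous = simplify_layout((3, 4), (16, 4))
# Its transpose, which has the same elements in memory.
transposed = simplify_layout((4, 3), (4, 16))
# Every second column: still regularly spaced, so the axes merge.
every_other = simplify_layout((3, 2), (16, 8))
# The first three columns: the gap after each row prevents merging.
first_three = simplify_layout((3, 3), (16, 4))
```

The first three layouts collapse to a single axis, so an elementwise operation can run as one flat loop, while the last one keeps two axes.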

One of the difficulties that comes with the provision for arbitrary strides is that tensors may self-overlap in memory. Hence, overlap detection between pairs of tensors becomes non-trivial. For consistent results in computations, such as

Aside from Ocean, none of the packages considered in this section defines a clear separation between tensor types and the low-level implementation of the tensor operations. As a result, none of the tensor operations other than those already provided by existing libraries, such as BLAS and cuBLAS, are easily transferable for use in other packages.

We now illustrate some of the features of the Ocean Tensor Package based on an example QR factorization (see for instance [

The functions shown in Figures ^{T}

In-place QR factorization functions with

Figure ^{T}Q – I_{F}_{F}

The Ocean Tensor Package is designed to serve as a foundational layer for applications that require dense tensor operations on one or more device types and instances. Given the wide range of potential applications and domains, it is important that the tensor operations are grouped in coherent modules, rather than being provided through a huge monolithic package. This way, functionality can be installed by the user when needed, which helps reduce the effective number of dependencies. Another advantage is that interfaces and compatibility with external libraries are localized to independent modules, thus making the package easier to manage. Another design principle used in Ocean, and discussed later in this section, is the use of well-defined layers.

Modules in Ocean consist of an interface along with independent implementations for each of the supported device types. The module interface takes care of device-independent parameter checks including validity of the tensors and compatibility of tensor dimensions. It then determines the data type and device to use, and queries a function look-up table for the module associated with the device type. When available, the function is called after performing all necessary type conversions, broadcasting of dimensions, and allocation of result and intermediate tensors (for instance when tensors overlap in memory). In case the function is not available, or the module implementation for the device type has not been loaded, an error is raised. Functions at the device level typically only need to check for support of tensors of the given data type and either implement the tensor operation or, more typically, call a lower-level library function that provides the desired operation. If needed, functions can access the module-specific context information associated with each device instance.
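The division of work between a module interface and its device implementations can be sketched as follows. The class `Module`, its methods, and the use of Python lists as stand-ins for tensors are hypothetical; Ocean itself implements this mechanism in C.

```python
class Module:
    """Hypothetical module interface with one function table per device type."""

    def __init__(self, name):
        self.name = name
        self.tables = {}  # device type -> {function name: callable}

    def register(self, device_type, table):
        """Load the implementation of this module for one device type."""
        self.tables[device_type] = dict(table)

    def call(self, func_name, device_type, a, b):
        # Device-independent parameter checks are done once, in the interface.
        if len(a) != len(b):
            raise ValueError("incompatible tensor dimensions")
        table = self.tables.get(device_type)
        if table is None:
            raise RuntimeError(
                "module %s is not loaded for device type %s"
                % (self.name, device_type)
            )
        func = table.get(func_name)
        if func is None:
            raise NotImplementedError(
                "%s is not available on %s" % (func_name, device_type)
            )
        return func(a, b)

core = Module("core")
core.register("CPU", {"add": lambda a, b: [x + y for x, y in zip(a, b)]})
result = core.call("add", "CPU", [1, 2], [3, 4])
```

A call for a device type without a registered table fails in the interface, before any device-level code is reached.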

Module interfaces and device implementations can be loaded separately, except for the core module interface, which includes the CPU implementation. The separation between the interface and the implementation makes it possible to replace the module implementation with alternatives, such as a highly tuned or specialized proprietary version. The use of function tables also makes it possible to replace individual functions for performance comparisons or debugging, or to insert functions that record runtime or accumulate call statistics. The separation between module interfaces and device implementation also makes it easy to extend Ocean with new device types. In particular, modules and functions within each module can be added and tested one at a time, thus avoiding a huge development effort to get started.
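Replacing a single entry in a function table to gather call statistics might look like the sketch below. The table is modeled as a plain dictionary, and the names `instrument`, `cpu_table`, and `negate` are hypothetical.

```python
import time

def instrument(table, name, stats):
    """Replace table[name] by a wrapper that counts calls and sums run time."""
    original = table[name]

    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            entry = stats.setdefault(name, {"calls": 0, "seconds": 0.0})
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start

    table[name] = wrapper
    return original  # returned so that the caller can restore the entry

cpu_table = {"negate": lambda values: [-v for v in values]}
stats = {}

original = instrument(cpu_table, "negate", stats)
negated = cpu_table["negate"]([1, -2, 3])
calls = stats["negate"]["calls"]

# Restoring the original entry removes the instrumentation again.
cpu_table["negate"] = original
```

Because callers look up the function through the table, no call site has to change when the entry is swapped.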

The Core module forms the basis of the Ocean Tensor Package. It provides all elementary tensor operations and instantiates and exposes available device instances. Many of the standard functions, such as printing and tensor copies between different device types, require tensor support on the CPU, and the Core module interface is therefore combined with the CPU implementation (

In the implementation of the Ocean Tensor Package care is taken to maintain a clean separation between the different abstraction levels, as illustrated in Figure

Layered design of the Ocean Tensor Package. The current version implements the core modules for CPU and GPU, based on BLAS, cuBLAS, and the Solid foundation library, along with the Python language binding.

The package comes with an extensive collection of unit tests in the

The system has been tested on Red Hat Enterprise Linux version 7.4 running on Power8 and Intel Xeon, as well as on MacOS High Sierra, version 10.13.4, running on a MacBook Pro with Intel Core i7.

The Ocean Tensor Package is written in C based on the C99 standard, with GPU functionality implemented using CUDA. The Python-C API is also written in C, along with several Python scripts. The package has been tested with Python 2.7.5 and 3.5.2.

When available, the Ocean Tensor Package supports CUDA-enabled GPUs. The total disk space required after compilation is approximately 300 MB.

The CPU part of the code has optional dependencies on BLAS or CBLAS. The user is encouraged to provide these, otherwise compilation is done using a non-optimized implementation provided with the package. The implementation of multi-threaded tensor operations is done using OpenMP, when available, otherwise all operations are single-threaded. Compilation on Linux was done using GCC 4.8.5; on MacOS compilation was tested using Clang versions 7.0.0 through 9.1.0.

The GPU part of the code uses CUDA as well as the cuBLAS library. The package has been tested with CUDA versions 7.5, 8.0 (GA1, GA2), 9.0, 9.1, and 9.2. Compilation on MacOS using NVCC requires the appropriate combination of Clang and CUDA versions. A table of compatible versions is provided in the

The Python interface of the package provides an optional interface to Numpy, when available on the system.

Ewout van den Berg designed and implemented the package, and currently maintains the GitHub repository.

English

The Ocean Tensor Package provides support for tensor operations on CPU and GPU. Given the common usage of these operations in various fields, there is a large reuse potential of the package. The package can be used directly at the user level, or can serve as the foundation for other packages. Tensor functionality is organized in modules to enable addition of separate modules with operations specific to other fields. Additional modules can be provided as third-party extensions, or incorporated in the package itself, depending on the level of specialization of the functions. The package was designed to allow support for devices other than CPU and GPU, although no such extensions are currently planned. Please contact the author if you are interested in extending the package, or if you have questions or feedback regarding the installation and usage of the package.

Available at

Ocean supports booleans; 8-, 16-, 32-, and 64-bit signed and unsigned integers; as well as real and complex half-, single-, and double-precision floating-point data types.

The author has no competing interests to declare.