Memory management
-----------------

This section explains how pocl supports multiple address spaces and
host-side memory management of device memory.

Multiple logical address spaces
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, Clang (at least version 3.3 and older) converts the OpenCL C address space
qualifiers to *target specific* address space identifiers. For example, for the common CPU
targets with a single uniform address space, all of the OpenCL address spaces are mapped to
address space identifier 0 (the default C address space). Multiple address space
LLVM backends such as AMD GPUs produce different ids for the OpenCL C address spaces,
but those differ from the ones of the TCE backend, and so on. Thus, after Clang has processed
the kernel source, the information about the original OpenCL C address spaces is either lost
or target specific, which prevents or complicates the special treatment of pointers pointing
to (logically) different address spaces (e.g. OpenCL disjoint address space alias analysis,
see :ref:`opencl-optimizations`).

pocl's kernel compiler needs to know the original logical address spaces of the kernel during
some of its processing steps. In order to unify these parts of the kernel compiler across
targets, pocl uses the "fake address space map" mechanism of Clang to force *separate*,
pocl-known ids to be produced for each of the OpenCL C logical address spaces in the frontend.

Before code generation, the forced OpenCL C logical address space ids must be mapped to
ones the backend understands. This is done in the kernel compiler pass ``TargetAddressSpaces``,
which goes through all the memory references in the bitcode and maps their address space ids to
the target specific ones. If the targeted backend is known to understand the logical
address space ids (or simply maps everything to 0 as well), this processing is
skipped (and left to the backend).
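
To illustrate the idea, the following sketch models what ``TargetAddressSpaces``
conceptually does per memory reference. The id numbering below is purely hypothetical
(it is neither pocl's actual fake address space map nor any real backend's numbering);
the point is only that distinct logical ids survive the frontend and are rewritten late:

.. code-block:: c

   #include <assert.h>

   /* Hypothetical logical ids, standing in for the separate ids the
    * fake address space map forces the frontend to produce. */
   enum logical_as { AS_PRIVATE = 0, AS_GLOBAL = 1, AS_LOCAL = 2, AS_CONSTANT = 3 };

   /* Conceptual per-reference rewrite: map a logical id to the id the
    * targeted backend understands. */
   static int map_to_target(int logical, int uniform_memory)
   {
       if (uniform_memory)     /* e.g. common CPU targets: */
           return 0;           /* everything flattens to the default space */
       switch (logical) {      /* a made-up multi-AS target numbering: */
       case AS_GLOBAL:   return 1;
       case AS_CONSTANT: return 2;
       case AS_LOCAL:    return 3;
       default:          return 0; /* private */
       }
   }

   int main(void)
   {
       assert(map_to_target(AS_GLOBAL, 1) == 0); /* CPU: flattened */
       assert(map_to_target(AS_LOCAL, 0) == 3);  /* multi-AS target */
       return 0;
   }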

Managing the device memories
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When a buffer is allocated on the device, the device layer implementation is responsible for
making sure the device has enough free space in the memory the given address space is mapped to,
and for returning a handle that can later be used to refer to that memory.

When all the memories are mapped to a single address space shared with the host memory (the case
with CPU host+device setups), one can simply use ``malloc()`` for this. However, in
heterogeneous setups where the device has separate memories, the host's ``malloc()`` cannot
be used to manage the device's memory spaces. For this, pocl implements a simple
memory allocator called ``bufalloc``. With bufalloc it is possible to manage chunks of memory
allocated from a region of addresses. The allocator is optimized for speed and for minimal
fragmentation, assuming largish chunks of memory (the input/output buffers) are allocated and
freed as a whole.
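
The core idea can be sketched as follows. This is a deliberately simplified model,
not bufalloc's actual API or algorithm: chunks are carved from one contiguous range
of (device) addresses, and the region's base address never needs to be valid host
memory, since only integer addresses are handed out. (Splitting and coalescing of
chunks are omitted for brevity.)

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdint.h>

   #define MAX_CHUNKS 64

   /* One contiguous range of device addresses to allocate from. */
   struct region {
       uintptr_t base;
       size_t size;
       struct { uintptr_t start; size_t size; int in_use; } chunk[MAX_CHUNKS];
       int nchunks;
   };

   /* First-fit over previously carved chunks, then carve from the unused
    * tail of the region. Returns the device address, or 0 on failure. */
   static uintptr_t region_alloc(struct region *r, size_t size)
   {
       uintptr_t tail = r->base;
       for (int i = 0; i < r->nchunks; ++i) {
           if (!r->chunk[i].in_use && r->chunk[i].size >= size) {
               r->chunk[i].in_use = 1;
               return r->chunk[i].start;
           }
           if (r->chunk[i].start + r->chunk[i].size > tail)
               tail = r->chunk[i].start + r->chunk[i].size;
       }
       if (r->nchunks == MAX_CHUNKS || tail + size > r->base + r->size)
           return 0;
       r->chunk[r->nchunks].start = tail;
       r->chunk[r->nchunks].size = size;
       r->chunk[r->nchunks].in_use = 1;
       ++r->nchunks;
       return tail;
   }

   static void region_free(struct region *r, uintptr_t addr)
   {
       for (int i = 0; i < r->nchunks; ++i)
           if (r->chunk[i].start == addr)
               r->chunk[i].in_use = 0;
   }

   int main(void)
   {
       /* The base need not point to host memory at all. */
       struct region r = { .base = 0x10000, .size = 4096 };
       uintptr_t a = region_alloc(&r, 512);
       uintptr_t b = region_alloc(&r, 512);
       assert(a == 0x10000 && b == 0x10200);
       region_free(&r, a);
       assert(region_alloc(&r, 512) == a); /* the freed chunk is reused */
       return 0;
   }

With whole buffers allocated and freed as units, a handful of large chunks keeps
both bookkeeping and fragmentation low, which is the trade-off bufalloc targets.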

Bufalloc can be used for host-side management of contiguous ranges of memory on the
device side. Bufalloc is also used to manage memory in the ``pthread/basic``
CPU device implementations for testing and optimization purposes. For an example of
its use for managing memory in a heterogeneous separate-memory setup, take
a look at the TCE device layer code (``lib/CL/devices/tce/tce_common.cc``). For TCE devices
it is assumed that there are actual separate physical address spaces for both the *local* and
*global* address spaces. The device layer implementation manages allocations from both of these
spaces using two instances of bufalloc memory regions.

When passing buffer pointers to the kernel/work-group launchers, the memory addresses are
passed as integer values. The values passed from the host are cast to the actual
address-space qualified LLVM IR pointers by the work-group function, so the kernels are
called with correctly typed arguments (see :ref:`wg-functions`).
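
The integer-to-pointer hand-off can be illustrated with plain C. This is only a
model: in the real work-group functions the cast is emitted as LLVM IR and the
pointer type also carries the address space qualifier.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   /* Stand-in for a kernel; in reality this would be an address-space
    * qualified pointer parameter in the generated work-group function. */
   static void kernel(int *buf, int n)
   {
       for (int i = 0; i < n; ++i)
           buf[i] *= 2;
   }

   int main(void)
   {
       int device_global_mem[4] = {1, 2, 3, 4};
       /* The device layer passes the buffer address around as an integer... */
       uintptr_t arg = (uintptr_t)device_global_mem;
       /* ...and the work-group launcher casts it back to a typed pointer. */
       kernel((int *)arg, 4);
       assert(device_global_mem[0] == 2 && device_global_mem[3] == 8);
       return 0;
   }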

Custom memory management for pthread device
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This feature is enabled with the CMake option ``USE_POCL_MEMMANAGER``. It is only useful
for certain uncommon setups where pocl is expected to allocate a huge number of queue or
event objects. For most available OpenCL programs / tests / benchmarks, there
is no measurable difference in speed.
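
For example, a build configured with the allocator enabled could look like this
(a typical CMake invocation; all other options are left out):

.. code-block:: shell

   cmake -DUSE_POCL_MEMMANAGER=ON <path-to-pocl-source>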

Advantages:

* allocation of queues/events/command objects can be a lot faster

Disadvantages:

* memory allocated for these objects is never ``free()``-d; it is only returned to an allocation pool
* debugging tools will not detect use-after-free bugs on said objects
