To improve the performance of copying data between the host and the GPU, pinned (host-allocated) memory can be used. Copying a managed array incurs a penalty, so copying from pinned memory can be around 65% faster. On the machine I have here now I get 3500MB/sec upload from managed and 5800MB/sec upload from pinned (5800/3500 is about 1.66).
You'd think it would be crazy not to use it. However, there are some restrictions and disadvantages (no free lunch):
- Pinned memory is in short supply compared to non-pinned.
- No nice managed array indexing.
The first is something we can't do much about, other than to advise caution and to remember to free the memory when done. A good practice is to allocate some of this memory and use it as a staging post for transfers. That means copying into the staging post first, which of course also takes time.
As for the second: the pinned memory is addressed via an IntPtr. CUDAfy provides some extensions to IntPtr that allow setting of individual values, but this is not so useful or efficient for whole arrays. To copy whole arrays we can use the GPGPU static method CopyOnHost, which does a fairly efficient copy of a managed array to or from pinned memory.
GPGPU.CopyOnHost(srcData, srcOffset, ptr, dstOffset, cnt);
In CUDAfy 1.2 (June 2011), additional IntPtr extensions provide Write and Read methods that encapsulate CopyOnHost.
ptr.Write(managed_array);
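To make the offset arguments of such a copy concrete, here is a minimal sketch in plain .NET. It is not CUDAfy's implementation: the class StagingCopySketch and its methods are hypothetical names, and the buffer comes from Marshal.AllocHGlobal, which is ordinary unmanaged memory rather than CUDA pinned memory. It only illustrates how a managed array can be copied to and from an IntPtr at element offsets.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in helpers; these are NOT part of CUDAfy.
public static class StagingCopySketch
{
    // Copies 'count' floats from src[srcOffset..] into the unmanaged buffer,
    // starting at element index dstOffset.
    public static void CopyToStaging(float[] src, int srcOffset, IntPtr dst, int dstOffset, int count)
    {
        // IntPtr arithmetic is in bytes, so scale the element offset.
        IntPtr target = IntPtr.Add(dst, dstOffset * sizeof(float));
        Marshal.Copy(src, srcOffset, target, count);
    }

    // Copies 'count' floats from the unmanaged buffer, starting at element
    // index srcOffset, into dst[dstOffset..].
    public static void CopyFromStaging(IntPtr src, int srcOffset, float[] dst, int dstOffset, int count)
    {
        IntPtr source = IntPtr.Add(src, srcOffset * sizeof(float));
        Marshal.Copy(source, dst, dstOffset, count);
    }

    // Whole-array round trip: write everything to a buffer, then read it back.
    public static float[] RoundTrip(float[] data)
    {
        IntPtr staging = Marshal.AllocHGlobal(data.Length * sizeof(float));
        try
        {
            CopyToStaging(data, 0, staging, 0, data.Length);
            var result = new float[data.Length];
            CopyFromStaging(staging, 0, result, 0, data.Length);
            return result;
        }
        finally
        {
            // Unmanaged memory is not garbage collected; always free it.
            Marshal.FreeHGlobal(staging);
        }
    }
}
```

For example, CopyToStaging(data, 2, staging, 1, 2) places data[2] and data[3] at element positions 1 and 2 of the staging buffer.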
We're left with one issue though: the performance, since the extra copy into the staging post costs time. The way to get around this is to take advantage of one of the other features of pinned memory: asynchronous transfers.
for(...) {
    ...
    ptr_a.Write(managed_array_a);
    _gpu.CopyToDeviceAsync(ptr_a, 0, dev_a, 0, size, 1);
    ptr_b.Write(managed_array_b);
    _gpu.CopyToDeviceAsync(ptr_b, 0, dev_a, 0, size, 1);
    ...
}
Counting only the four statements in the loop, lines 2 and 3 happen together (the asynchronous copy from ptr_a runs while ptr_b is being filled), as do line 4 and line 1 of the next iteration. The last argument of the copy commands is the stream id. If the ids are the same then the commands are in the same queue, so the second copy will never run until the first is done. We have two managed arrays, two staging posts (pinned) and one array on the device (dev_a).
With all this in place we get back to the same performance as we get when copying from pinned memory.
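The overlap can be tried out without a GPU. The following is only a simulation under stated assumptions: .NET Tasks stand in for a CUDA stream, Marshal.AllocHGlobal memory stands in for pinned staging posts, and DoubleBufferSketch and UploadAll are hypothetical names, not CUDAfy API. It is also a slightly more cautious variant than the loop above, because it explicitly waits until the previous copy from a staging post has finished before overwriting that post.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Hypothetical simulation: Tasks stand in for a CUDA stream; no GPU or CUDAfy involved.
public static class DoubleBufferSketch
{
    // "Uploads" each batch through two alternating staging buffers and returns
    // the sum the simulated device saw for every batch, in upload order.
    public static List<float> UploadAll(IList<float[]> batches, int size)
    {
        IntPtr[] staging =
        {
            Marshal.AllocHGlobal(size * sizeof(float)),
            Marshal.AllocHGlobal(size * sizeof(float))
        };
        Task[] inFlight = { Task.CompletedTask, Task.CompletedTask };
        Task streamTail = Task.CompletedTask;  // last operation queued on the "stream"
        var device = new float[size];          // single simulated device array (like dev_a)
        var sums = new List<float>();
        try
        {
            for (int i = 0; i < batches.Count; i++)
            {
                int slot = i % 2;
                // Do not overwrite a staging post that a queued copy still reads.
                inFlight[slot].Wait();
                // Host-side copy; it overlaps with the async copy from the other slot.
                Marshal.Copy(batches[i], 0, staging[slot], size);
                IntPtr src = staging[slot];
                // Chaining on streamTail models "same stream id": copies run in order.
                streamTail = streamTail.ContinueWith(_ =>
                {
                    Marshal.Copy(src, device, 0, size);
                    float sum = 0;
                    foreach (float v in device) sum += v;
                    sums.Add(sum);
                });
                inFlight[slot] = streamTail;
            }
            streamTail.Wait();
            return sums;
        }
        finally
        {
            streamTail.Wait();  // make sure no queued copy still reads the staging posts
            foreach (IntPtr p in staging) Marshal.FreeHGlobal(p);
        }
    }
}
```

Calling UploadAll with the batches {1, 2}, {3, 4} and {5, 6} and a size of 2 returns the sums 3, 7 and 11, which shows that every batch reached the simulated device intact and in order even though the host copies and the queued copies overlap.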
CUDAfy V1.5 Update
In V1.5 of CUDAfy this procedure has been simplified by the introduction of smart copy. Basically, make some staging posts using HostAllocate, each the same size as the number of elements (count) being transferred. Then enable smart copy via EnableSmartCopy and use the overloads of the async transfer methods that take a staging post as their last argument.
[Test]
public void Test_smartCopyToDevice()
{
    var mod = CudafyModule.TryDeserialize();
    if (mod == null || !mod.TryVerifyChecksums())
    {
        mod = CudafyTranslator.Cudafy();
        mod.Serialize();
    }
    _gpu.LoadModule(mod);
    _gpuuintBufferIn = _gpu.Allocate<uint>(N);
    _gpuuintBufferOut = _gpu.Allocate<uint>(N);
    int batchSize = 8;
    int loops = 6;
    Stopwatch sw = Stopwatch.StartNew();
    for (int x = 0; x < loops; x++)
    {
        for (int i = 0; i < batchSize; i++)
        {
            _gpu.CopyToDevice(_uintBufferIn, 0, _gpuuintBufferIn, 0, N);
            _gpu.Launch(N / 512, 512, "DoubleAllValues", _gpuuintBufferIn, _gpuuintBufferOut);
            _gpu.CopyFromDevice(_gpuuintBufferOut, 0, _uintBufferOut, 0, N);
        }
    }
    long time = sw.ElapsedMilliseconds;
    Console.WriteLine(time);
    // Now with smart copy
    // Make some "staging posts". Do not go overboard with these since pinned memory is scarce.
    IntPtr[] stagingPostIn = new IntPtr[batchSize];
    IntPtr[] stagingPostOut = new IntPtr[batchSize];
    for (int i = 0; i < batchSize; i++)
    {
        stagingPostIn[i] = _gpu.HostAllocate<uint>(N);
        stagingPostOut[i] = _gpu.HostAllocate<uint>(N);
    }
    _gpu.EnableSmartCopy();
    sw.Restart();
    for (int x = 0; x < loops; x++)
    {
        for (int i = 0; i < batchSize; i++)
            _gpu.CopyToDeviceAsync(_uintBufferIn, 0, _gpuuintBufferIn, 0, N, i + 1, stagingPostIn[i]);
        for (int i = 0; i < batchSize; i++)
            _gpu.LaunchAsync(N / 256, 256, i + 1, "DoubleAllValues", _gpuuintBufferIn, _gpuuintBufferOut);
        for (int i = 0; i < batchSize; i++)
            _gpu.CopyFromDeviceAsync(_gpuuintBufferOut, 0, _uintBufferOut, 0, N, i + 1, stagingPostOut[i]);
        for (int i = 0; i < batchSize; i++)
            _gpu.SynchronizeStream(i + 1);
    }
    time = sw.ElapsedMilliseconds;
    Console.WriteLine(time);
    _gpu.DisableSmartCopy();
    for (int i = 0; i < N; i++)
        _uintBufferIn[i] *= 2;
    Assert.IsTrue(Compare(_uintBufferIn, _uintBufferOut));
    ClearOutputsAndGPU();
}