How do I use local memory in OpenCL?
Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, uses local memory. With your squaring kernel, you could stage the data in an intermediate local buffer. Remember to pass in the additional `__local` parameter: __kernel void square( __global float *input, __global float *output, __local …
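Here is a minimal sketch of what that staging could look like. The original signature is cut off, so this is an assumption and not the exact kernel from the answer. The parameter name `scratch`, the `const` qualifier on `input`, the CPU shim inside the `#ifndef __OPENCL_VERSION__` blocks, and the helper `run_square_emulated` are all hypothetical. The shim is only there so the kernel logic can be sanity-checked without an OpenCL device. An OpenCL compiler defines `__OPENCL_VERSION__` and skips it.

```c
/* Sketch only: valid OpenCL C, plus a hypothetical CPU shim for quick checks. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define __local
#define CLK_LOCAL_MEM_FENCE 0
static size_t emu_global_id, emu_local_id;   /* set by the emulation loop */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_global_id; }
static size_t get_local_id(unsigned dim)  { (void)dim; return emu_local_id; }
/* No-op here: the shim runs work-items one after another, which is only
   valid because each work-item reads back the slot it wrote itself. */
static void barrier(int flags) { (void)flags; }
#endif

__kernel void square(__global const float *input,
                     __global float *output,
                     __local float *scratch)  /* assumed: one float per work-item */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    scratch[lid] = input[gid];       /* stage from global into local memory */
    barrier(CLK_LOCAL_MEM_FENCE);    /* wait until the work-group's local writes are visible */
    output[gid] = scratch[lid] * scratch[lid];
}

#ifndef __OPENCL_VERSION__
/* Hypothetical helper: emulates n work-items in groups of local_size.
   Assumes local_size <= 64 and that n is a multiple of local_size. */
void run_square_emulated(const float *in, float *out, size_t n, size_t local_size)
{
    float scratch[64];
    for (size_t g = 0; g < n; ++g) {
        emu_global_id = g;
        emu_local_id = g % local_size;
        square(in, out, scratch);
    }
}
#endif
```

On a real host, the local buffer is not backed by a `cl_mem` object. You size it with `clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL)`, passing `NULL` as the value, where `kernel` and `local_size` stand in for your own host variables. For a plain squaring kernel the staging gains nothing in speed. It pays off when work-items in a group reuse each other's data, as in the matrix transpose sample.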