AMD just took, and then retracted, a major step forward in the whole ‘fusion’ concept, enabling profiling across heterogeneous cores. Although it may sound minor, it is a huge step forward in the usability of the whole paradigm.
Yesterday, AMD (AMD) put up a ‘blog’ post about CodeAnalyst 3.2, the AMD profiling tool that is currently on v3.1. The page disappeared within a very short time, but not before a sharp-eyed reader captured it. Before the conspiracy theorists go nuts, someone probably just hit the button to post the release instead of the one to schedule it, so it will probably be back up in full soon.
The post detailed some of the features of CodeAnalyst 3.2, including Bulldozer (12h family) support, CPU/memory utilization timelines, Visual Studio integration, and the aforementioned heterogeneous profiling. The last of these is by far the biggest addition.
Although it may not sound like much, this part of the release is the key. “If you captured OpenCL information, that will also be shown on the timeline. The timeline has an easy navigation for zooming into the most minute call, while retaining a relative sense of the entire profile. Each GPU device with OpenCL activity will be displayed. A chart for each thread with OpenCL API calls will display the function durations, with double-click, two-way navigation to a detailed data table of the function traces. Kernel and data transfer events are logged and shown in the respective command queues, with the ability to see the latency involved with enqueued events waiting in parallel.”
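To make the ‘latency involved with enqueued events’ part concrete, here is a minimal sketch of the arithmetic a timeline like this has to do. OpenCL event profiling reports four timestamps per command (queued, submit, start, end, in nanoseconds), and the gaps between them are the waits and the run time. The class name, function name, and numbers below are made up for illustration; this is not how CodeAnalyst itself is implemented.

```python
from dataclasses import dataclass


@dataclass
class EventTimes:
    """Four timestamps in nanoseconds, mirroring OpenCL event profiling info."""
    queued: int  # CL_PROFILING_COMMAND_QUEUED: host enqueued the command
    submit: int  # CL_PROFILING_COMMAND_SUBMIT: command handed to the device
    start: int   # CL_PROFILING_COMMAND_START: device began executing
    end: int     # CL_PROFILING_COMMAND_END: device finished executing


def breakdown(ev: EventTimes) -> dict:
    """Split one command's lifetime into queue wait, launch wait, and run time."""
    return {
        "queue_wait_ns": ev.submit - ev.queued,
        "launch_wait_ns": ev.start - ev.submit,
        "run_ns": ev.end - ev.start,
    }


# Hypothetical kernel event; the values are invented for the example.
kernel = EventTimes(queued=1_000, submit=1_400, start=2_600, end=9_600)
print(breakdown(kernel))
```

With real data, a large queue or launch wait next to a short run time is the sort of thing that would jump out on a per-queue timeline.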
OpenCL has the ability to pick a target for your code to run on, CPU or GPU, and have it ‘just work’, at least in theory. Given the disparity between CPU and GPU tools, sending things to the GPU usually meant looking at your code with the tool equivalent of welding goggles and a divining rod. Trying to find bottlenecks in your code across multiple types of execution units simultaneously made coding for the PS3 seem like lighthearted fun, even counting the inevitable Sony lawsuit.
Since AMD is heavily promoting GPU compute, even holding a conference on the subject, they have a vested interest in people using OpenCL and similar technologies. If coding for a device is a pain, and optimization/debugging is an advanced study in masochism, coders will just say no. AMD finally seems to understand that concept, and is actually making the tools people want and need now, with releases coming thick and fast. This is what they needed to do a few years ago, but it is still a welcome change.
From here, the next big step, possibly the final major hurdle, is to make a system that transparently dispatches threads to the appropriate device. To do that, you need to know what ‘appropriate device’ means, and the major metric there is performance. For that, you need a tool that can see both CPU and GPU performance counters, data transfer events, and queues/latencies. Now do you see the direction that CodeAnalyst 3.2 is moving us in?
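As a rough sketch of what ‘appropriate device’ could mean once those numbers are available, the snippet below picks whichever device has the lowest total of compute, data transfer, and queue wait time. The function name, dictionary keys, and timings are all hypothetical, and a real scheduler would have far more to weigh than this.

```python
def pick_device(measurements: dict) -> str:
    """Return the name of the device with the lowest total measured time.

    measurements maps a device name to a dict of timings in milliseconds
    with the (hypothetical) keys compute_ms, transfer_ms, and queue_wait_ms.
    """
    def total(m: dict) -> float:
        return m["compute_ms"] + m["transfer_ms"] + m["queue_wait_ms"]

    return min(measurements, key=lambda name: total(measurements[name]))


# Invented numbers: the GPU computes faster, but copying the data over
# costs more than the compute time it saves.
samples = {
    "cpu": {"compute_ms": 12.0, "transfer_ms": 0.0, "queue_wait_ms": 0.1},
    "gpu": {"compute_ms": 2.0, "transfer_ms": 11.0, "queue_wait_ms": 0.5},
}
print(pick_device(samples))
```

The point of the made-up numbers is that raw compute speed alone would pick the wrong device here, which is why transfer events and queue latencies have to be visible alongside the performance counters.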