Default example is broken and non-representative

May 4, 2015 at 4:17 PM
The 'hello world' example provided here is wrong on many levels.

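First, with 1024*1024 elements you get inconsistent results due to overflow (more precisely, a float total loses integer precision once it passes 2^24, so the answer depends on summation order). You probably don't want to show users results that diverge between GPU and CPU.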
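
The divergence can be reproduced on the CPU alone. The sketch below assumes that the GPU reduce combines elements in a tree-like order while the CPU accumulate adds left to right; that ordering is an assumption for illustration, not something taken from the library source. The helper names sequential_sum and pairwise_sum are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Left-to-right sum, the order std::accumulate uses.
float sequential_sum(const std::vector<float>& v)
{
  float total = 0.0f;
  for (float x : v) total += x;
  return total;
}

// Pairwise (tree) sum over v[lo, hi), standing in for a parallel reduction.
float pairwise_sum(const std::vector<float>& v, std::size_t lo, std::size_t hi)
{
  assert(lo <= hi && hi <= v.size());
  if (hi - lo == 0) return 0.0f;
  if (hi - lo == 1) return v[lo];
  const std::size_t mid = lo + (hi - lo) / 2;
  return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid, hi);
}
```

With the input {16777216.0f, 1.0f, 1.0f}, sequential_sum returns 16777216 while pairwise_sum(v, 0, v.size()) returns 16777218, because 16777216 + 1 is not representable as a float and rounds back down. Neither order is wrong; they simply round at different points.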

Second, the code is broken: the library now has a reduce instead of an accumulate. There's also the MINMAX business, which should probably be mentioned (who won't run into this problem?).
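
For readers who have not met it, here is a minimal sketch of the usual workaround. It assumes that "the MINMAX business" refers to the min and max macros that the Windows headers define unless NOMINMAX is set; the post does not spell this out, so treat that reading as a guess. The function clamp_to_range is hypothetical and only there to exercise std::min and std::max.

```cpp
// Assumption: the clash comes from <windows.h>, which defines min and max
// as macros unless NOMINMAX is defined before the header is included.
#ifndef NOMINMAX
#define NOMINMAX
#endif
#ifdef _WIN32
#include <windows.h>
#endif

#include <algorithm>
#include <cassert>

// Parenthesizing the names is a second line of defense: a function-like
// macro is not expanded when its name is not directly followed by '('.
int clamp_to_range(int value, int lo, int hi)
{
  assert(lo <= hi);
  return (std::max)(lo, (std::min)(value, hi));
}
```

Defining NOMINMAX project-wide in the compiler settings has the same effect and avoids depending on include order.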

Finally, and this is the critical issue, the CPU is much faster than the GPU for this example. On my machine I get 44 msec on the GPU vs 2 msec on the CPU. Even when I deliberately add work (e.g., calculating x = sin(x)) on both the CPU and the GPU, I still get 38 msec on the GPU vs 15 msec on the CPU.
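
Part of the GPU figure may be one-time cost rather than steady-state work: the first launch of each kernel typically pays for runtime initialization and kernel compilation, and the data has to reach the accelerator. Whether that explains the gap here is not established. A common way to check is to run the workload a few times untimed and then average several timed runs. The sketch below uses only the standard library; average_msec is a hypothetical helper, not part of any AMP library.

```cpp
#include <cassert>
#include <chrono>

// Calls work() `warmups` times untimed, then returns the average wall time
// in milliseconds over `runs` timed calls.
template <typename Work>
double average_msec(Work work, int warmups, int runs)
{
  assert(warmups >= 0 && runs > 0);
  for (int i = 0; i < warmups; ++i) work();

  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  for (int i = 0; i < runs; ++i) work();
  const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
  return elapsed.count() / runs;
}
```

Wrapping the GPU sequence and the CPU sequence each in a lambda and passing them to average_msec would show how much of the difference survives once first-call overhead is excluded. The GPU lambda should end with a call that brings the result back to the host, so the timed region covers the whole computation.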

Here's the source code I'm using:
void consume_library()
{
  const int count = 1024*1024;
  array<float> data(count);
  array_view<float, 1> view(data);
  // first call runs before the timer starts
  amp_stl_algorithms::iota(begin(view), end(view), 1.0f);

  Timer timer(true);
  amp_stl_algorithms::iota(begin(view), end(view), 1.0f);
  auto last = amp_stl_algorithms::remove_if(begin(view), end(view),
    [=](const float& v) restrict(amp) { return int(v) % 2 == 1; });
  // interesting note: sinf will not work below (restrict!), need <amp_math.h>
  amp_stl_algorithms::transform(begin(view), end(view), begin(view),
    [=](float f) restrict(amp) { return precise_math::sin(f); });
  float total = amp_stl_algorithms::reduce(begin(view), last, 0.0f);
  auto elapsed = timer.Elapsed().count();
  cout << setprecision(0) << fixed << total << " on GPU in " << elapsed << "msec." << endl;

  vector<float> v(count);
  // restart the timer so the CPU figure does not include the GPU run
  timer.Reset();
  std::iota(begin(v), end(v), 1.0f);
  auto l = std::remove_if(begin(v), end(v), [=](const float& z) { return int(z) % 2 == 1; });
  std::transform(begin(v), end(v), begin(v), sinf);
  total = accumulate(begin(v), l, 0.0f, [=](float f1, float f2) { return f1 + f2; });
  elapsed = timer.Elapsed().count();
  cout << setprecision(0) << fixed << total << " on CPU in " << elapsed << "msec." << endl;
}

And here is the Timer class:
#pragma once

#include <chrono>
#include <ostream>

class Timer
{
  using clock = std::chrono::high_resolution_clock;
  using msec = std::chrono::milliseconds;
  clock::time_point start_;

public:
  // restart the measurement from now
  void Reset()
  {
    start_ = clock::now();
  }

  // whole milliseconds since the last Reset()
  msec Elapsed() const
  {
    return std::chrono::duration_cast<msec>(clock::now() - start_);
  }

  explicit Timer(bool run = false)
  {
    if (run) Reset();
  }

  template<typename T, typename Traits>
  friend std::basic_ostream<T, Traits>& operator<<(std::basic_ostream<T, Traits>& out, const Timer& timer)
  {
    return out << timer.Elapsed().count();
  }
};