Don McCrady, Principal Development Lead
Parallel Computing Platform
June 7-10
Some cool new C++
Parallel Iteration
Tricks for reducing shared state
Asynchronous Agents
Concurrent containers
Demo: N-Bodies
Part of the C++ Runtime
No new libraries to link in
PPL: Parallel Pattern Library
Agents: Asynchronous Agents Library
Abstracts away the notion of threads
Tasks are computations that may be run in parallel
Use PPL & Agents to express your potential concurrency
Let the runtime map it to the available concurrency
Scale from 1 to 256 cores
int
x = 5;
int
y = 7;
auto
functor =
[x,&y] (
int
z) {
y += x - z;
};
functor(3);
cout << y;
class
_FT {
public
:
_FT(
int
x,
int
& y)
: _x(x), _y(y)
{ }
void operator
()(
int
z) {
_y += _x - z;
}
private
:
int
_x;
int
& _y;
};
int
x = 5;
int
y = 7;
_FT functor(x, y);
functor(3);
cout << y;
#include
<vector>
#include
<algorithm>
using namespace
std;
vector<
int
> v = …;
foreach(v.begin(), v.end(), [&v] (
int
item) {
cout << item << endl;
});
Lambdas make functional programming palatable in C++
#include
<ppl.h>
using namespace
Concurrency;
parallel_for(0, 1000, [] (
int
i) {
work(i);
});
parallel_for iterates over a range in parallel
•
Order of iteration is
indeterminate.
•
Cores may come and
go.
•
Ranges may be
stolen by newly idle
cores.
parallel_for(0, 1000, [] (
int
i) {
work(i);
});
Core 4
Core 3
Core 1
work(0…249)
work(500…749)
work(750…999)
Core 2
work(250…499)
parallel_for considerations:
Designed for unbalanced loop bodies
An idle core can steal a portion of another core’s range of work
Supports cancellation
Early exit in search scenarios
For fixed-
sized loop bodies that don’t need cancellation,
consider parallel_for_fixed from the sample pack.
parallel_for(0, yBound, [] (
int
y) {
for
(
int
x=0; x < xBound; ++x) {
complex c(minReal + deltaReal * x,
minImag + deltaImag * y);
Color pixel = ComputeMandelBrotColor(c);
…
}
});
Parallelize outer loops first
Usually plenty of outer loop iterations to spread out to all cores
Inner loops do sufficient work to overcome parallel overheads
#include
<ppl.h>
using namespace
Concurrency;
vector<
int
> v = …;
parallel_for_each(v.begin(), v.end(), [] (
int
i) {
work(i);
});
parallel_for_each iterates over an STL container in
parallel
Works best with containers that support random-
access iterators:
std::vector, std::array, std::deque,
Concurrency::concurrent_vector, …
Works okay, but with higher overhead on containers
that support forward (or bi-di) iterators:
std::list, std::map, …
Shared state kills scalability of parallel
iteration
critical_section cs;
double
sum = 0;
parallel_for(0, 1000, [&sum, &cs] (
int
i) {
cs.lock();
if (SomeCondition(i))
sum += SomeComputation(i);
cs.unlock();
SomeFurtherComputation(i);
});
• High contention: entire
loop is serialized.
• Cache thrashing.
• Potential thread
explosion.
Reduce contention if possible.
critical_section cs;
double
sum = 0;
parallel_for(0, 1000, [&sum, &cs] (
int
i) {
if (SomeCondition(i)) {
cs.lock();
sum += SomeComputation(i);
cs.unlock();
}
SomeFurtherComputation(i);
});
• Contention potentially
reduced by moving lock
inside the if-statement.
• Still thrashes the cache.
Use combinable for per-thread computations.
Each thread has its own state; no shared state.
Operations must be commutative.
combinable<
double
> sums;
parallel_for(0, 1000, [&sums] (
int
i) {
if (SomeCondition(i))
sums.local() += SomeComputation(i);
SomeFurtherComputation(i);
});
double
sum = sums.combine(std::plus<
double
>());
• Practically zero
contention.
• No cache thrashing.
Demo: Relatively Prime Numbers
Not all patterns map to loops or tasks.
Pipelines, state machines, producer/consumer
Agent: an asynchronous object that communicates
through message passing.
Message Blocks: participants in message-passing which
transport from source to target.
Message: encapsulates state that is transferred between
message blocks.
Message blocks for
storing data
unbounded_buffer<T
>
overwrite_buffer<T>
single_assignment<T>
Message blocks for
pipelining
transformer<T,U>
call<T>
Message blocks for
joining data
choice
join
Send and receive
send, asend
receive
try_receive
send
unbounded_buffer
transformer
(reverse)
receive
“glorp”`
propagat
e
“glorp”`
propagat
e
“prolg”`
class
ReverserAgent :
public
Concurrency::agent
{
private
:
transformer<string,string> reverser;
public
:
unbounded_buffer<string> inputBuffer;
ReverseAgent()
: reverser([] (string in) -> string {
string reversed(in);
reverse(reversed.begin(), reversed.end());
return
reversed;
})
{
inputBuffer.link_target(&reverser);
}
protected:
virtual void
run();
};
void
ReverserAgent::run() {
for
(;;) {
string s = receive(&reverser);
if (s ==
"pots"
) {
done();
return
;
}
cout <<
"Received message : "
<< s << endl;
}
}
void
main()
{
ReverserAgent reverseAgent;
reverseAgent.start();
for
(;;) {
string s;
cin >> s;
send(reverseAgent.inputBuffer, s);
if
(s ==
"stop"
)
break
;
}
agent::wait(&reverseAgent);
}
Demo: String Reverse Agent
Two thread-safe, lock-free containers provided:
concurrent_vector<T>:
Lock-free push_back, element access, and iteration
No deletion!
concurrent_queue<T>:
Lock-free push and pop
Sample pack adds:
concurrent_unordered_map<T,U>
concurrent_set<T>
#include
<ppl.h>
#include
<concurrent_vector.h>
using namespace
Concurrency;
concurrent_vector<
int
> carmVec;
parallel_for(2, 5000000, [&carmVec](
int
i) {
if
(is_carmichael(i))
carmVec.push_back(i);
});
#include
<ppl.h>
#include
<concurrent_queue.h>
using namespace
Concurrency;
concurrent_queue<
int
> itemQueue;
parallel_invoke([&itemQueue] {
// Produce 1000 items
for
(
int
i=0; i<1000; ++i)
itemQueue.push(i);
},
[&itemQueue] {
// Consume 1000 items
for
(
int
i=0; i<1000; ++i) {
int
result = -1;
while
(!itemQueue.try_pop(result))
Context::Yield();
ProcessItem(result);
}
});
The “Many Core Shift” is happening
VS2010 with the Concurrency Runtime can help
Use PPL & Agents to express your potential concurrency
Let the runtime figure out the actual concurrency
Parallel iteration can help your application scale
Asynchronous Agents provide isolation from shared state
Concurrent collections are scalable and lock-free
Parallel Computing Developer Center
ConcRT Sample Pack
http://code.msdn.com/concrtextras
Native Concurrency Blog
http://blogs.msdn.com/nativeconcurrency
Forums
bool
is_carmichael(
const int
n) {
if
(n < 2) {
return
false; }
int
k = n;
for
(
int
i = 2; i <= k / i; ++i) {
if
(k % i == 0) {
if
((k / i) % i == 0) {
return
false; }
if
((n - 1) % (i - 1) != 0) {
return
false; }
k /= i;
i = 1;
}
}
return
k != n && (n - 1) % (k - 1) == 0;
}