Don McCrady, Principal Development Lead
Parallel Computing Platform
June 7-10
Some cool new C++
Parallel Iteration
Tricks for reducing shared state
Asynchronous Agents
Concurrent containers
Demo: N-Bodies
Part of the C++ Runtime
No new libraries to link in
PPL: Parallel Pattern Library
Agents: Asynchronous Agents Library
Abstracts away the notion of threads
Tasks are computations that may be run in parallel
Use PPL & Agents to express your potential concurrency
Let the runtime map it to the available concurrency
Scale from 1 to 256 cores
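For example, you can express two independent computations as tasks in a task_group and let the runtime decide whether they actually run in parallel. A minimal sketch (DoPhysics and DoRendering are made-up placeholder functions, not from the slides):

#include <ppl.h>
#include <iostream>
using namespace Concurrency;

void DoPhysics()   { std::cout << "physics" << std::endl; }    // placeholder work
void DoRendering() { std::cout << "rendering" << std::endl; }  // placeholder work

int main() {
    task_group tg;
    tg.run([] { DoPhysics(); });     // a task that may run on another core
    tg.run([] { DoRendering(); });   // a second potentially concurrent task
    tg.wait();                       // block until both tasks have completed
}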
The functor way:

class _FT {
public:
    _FT(int x, int& y)
        : _x(x), _y(y)
    { }
    void operator()(int z) {
        _y += _x - z;
    }
private:
    int _x;
    int& _y;
};

int x = 5;
int y = 7;
_FT functor(x, y);
functor(3);
cout << y;

The lambda way:

int x = 5;
int y = 7;
auto functor =
    [x,&y] (int z) {
        y += x - z;
    };
functor(3);
cout << y;
Lambdas make functional programming palatable in C++
#include <algorithm>
#include <vector>
#include <iostream>
using namespace std;

vector<int> v = ... ;
for_each(v.begin(), v.end(), [&v] (int item) {
    cout << item << endl;
});
parallel_for iterates over a range in parallel
#include <ppl.h>
using namespace Concurrency;

parallel_for(0, 1000, [] (int i) {
    work(i);
});
parallel_for(0, 1000, [] (int i) {
    work(i);
});

[Diagram: the range is split across the available cores:
 Core 1: work(0...249)    Core 2: work(250...499)
 Core 3: work(500...749)  Core 4: work(750...999)]

Order of iteration is indeterminate.
Cores may come and go.
Ranges may be stolen by newly idle cores.
parallel_for considerations:
Designed for unbalanced loop bodies
An idle core can steal a portion of another core's range of work
Supports cancellation
Early exit in search scenarios (a cancellation sketch follows below)
For fixed-sized loop bodies that don't need cancellation,
consider parallel_for_fixed from the sample pack.
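The slides don't show the early-exit pattern itself; one common approach with the VS2010 PPL is to run the parallel_for inside a task_group and cancel the group when the target is found. A rough sketch (matches() and the search bound are made-up placeholders):

#include <ppl.h>
#include <iostream>
using namespace Concurrency;

bool matches(int i) { return i == 123456; }   // hypothetical predicate

int main() {
    long foundIndex = -1;   // assumes at most one match; otherwise use an interlocked write
    task_group tg;
    tg.run_and_wait([&] {
        parallel_for(0, 1000000, [&] (int i) {
            if (matches(i)) {
                foundIndex = i;
                tg.cancel();   // cooperatively stops scheduling further iterations
            }
        });
    });
    std::cout << "found: " << foundIndex << std::endl;
}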
Parallelize outer loops first
Usually plenty of outer loop iterations to spread out to all cores
Inner loops do sufficient work to overcome parallel overheads
parallel_for(0, yBound, [] (int y) {
    for (int x=0; x < xBound; ++x) {
        complex<double> c(minReal + deltaReal * x,
                          minImag + deltaImag * y);
        Color pixel = ComputeMandelBrotColor(c);
        ...
    }
});
parallel_for_each iterates over an STL container in parallel
#include <ppl.h>
using namespace Concurrency;

vector<int> v = ... ;
parallel_for_each(v.begin(), v.end(), [] (int i) {
    work(i);
});
Works best with containers that support random-access iterators:
std::vector, std::array, std::deque, Concurrency::concurrent_vector, ...
Works okay, but with higher overhead, on containers that support forward (or bidirectional) iterators:
std::list, std::map, ... (a sketch of the forward-iterator case follows below)
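As a quick illustration of the forward-iterator case (this sketch is not from the slides; work() is the same placeholder used in the earlier examples):

#include <ppl.h>
#include <list>
using namespace Concurrency;

void work(int i);   // placeholder, as above

void process_list() {
    std::list<int> items;
    for (int i = 0; i < 1000; ++i)
        items.push_back(i);

    // Forward iterators are accepted, but partitioning a linked list
    // costs more than partitioning a random-access container.
    parallel_for_each(items.begin(), items.end(), [] (int i) {
        work(i);
    });
}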
Shared state kills scalability of parallel iteration
critical_section cs;
double sum = 0;
parallel_for(0, 1000, [&sum, &cs] (int i) {
    cs.lock();
    if (SomeCondition(i))
        sum += SomeComputation(i);
    cs.unlock();
    SomeFurtherComputation(i);
});

High contention: entire loop is serialized.
Cache thrashing.
Potential thread explosion.
Reduce contention if possible.
critical_section cs;
double sum = 0;
parallel_for(0, 1000, [&sum, &cs] (int i) {
    if (SomeCondition(i)) {
        cs.lock();
        sum += SomeComputation(i);
        cs.unlock();
    }
    SomeFurtherComputation(i);
});

Contention potentially reduced by moving the lock inside the if-statement.
Still thrashes the cache.
Use combinable for per-thread computations.
Each thread has its own state; no shared state.
Operations must be commutative.
" Practically zero
combinable sums;
contention.
parallel_for(0, 1000, [&sums] (int i) {
" No cache thrashing.
if (SomeCondition(i))
sums.local() += SomeComputation(i);
SomeFurtherComputation(i);
});
double sum = sums.combine(std::plus());
Demo: Relatively Prime Numbers
Not all patterns map to loops or tasks:
pipelines, state machines, producer/consumer.
Agent: an asynchronous object that communicates through message passing.
Message block: a participant in message passing that transports messages from source to target.
Message: encapsulates the state that is transferred between message blocks.
Message blocks for pipelining: transformer, call
Message blocks for storing data: unbounded_buffer, overwrite_buffer, single_assignment
Message blocks for joining data: choice, join
Send and receive: send, asend, receive, try_receive
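A minimal, standalone sketch of send and receive with an unbounded_buffer (not from the slides):

#include <agents.h>
#include <iostream>
using namespace Concurrency;

int main() {
    unbounded_buffer<int> buf;   // FIFO message queue
    send(buf, 42);               // synchronously offer a message to the buffer
    int value = receive(buf);    // block until a message is available, then take it
    std::cout << value << std::endl;
}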
[Diagram: the string "glorp" is passed to send(), flows into an unbounded_buffer,
which propagates it to a transformer (reverse); receive() then yields "prolg".]
class ReverserAgent : public Concurrency::agent
{
private:
    transformer<string, string> reverser;
public:
    unbounded_buffer<string> inputBuffer;

    ReverserAgent()
        : reverser([] (string in) -> string {
            string reversed(in);
            reverse(reversed.begin(), reversed.end());
            return reversed;
        })
    {
        inputBuffer.link_target(&reverser);
    }
protected:
    virtual void run();
};
void ReverserAgent::run() {
    for (;;) {
        string s = receive(&reverser);
        if (s == "pots") {
            done();
            return;
        }
        cout << "Received message: " << s << endl;
    }
}
int main()
{
    ReverserAgent reverseAgent;
    reverseAgent.start();
    for (;;) {
        string s;
        cin >> s;
        send(reverseAgent.inputBuffer, s);
        if (s == "stop")
            break;
    }
    agent::wait(&reverseAgent);
}
Demo: String Reverse Agent
Two thread-safe, lock-free containers provided:
concurrent_vector:
Lock-free push_back, element access, and iteration
No deletion!
concurrent_queue:
Lock-free push and pop
Sample pack adds:
concurrent_unordered_map
concurrent_set
#include <ppl.h>
#include <concurrent_vector.h>
using namespace Concurrency;

concurrent_vector<int> carmVec;
parallel_for(2, 5000000, [&carmVec](int i) {
    if (is_carmichael(i))
        carmVec.push_back(i);
});
#include <ppl.h>
#include <concurrent_queue.h>
using namespace Concurrency;

concurrent_queue<int> itemQueue;
parallel_invoke(
    [&itemQueue] { // Produce 1000 items
        for (int i=0; i<1000; ++i)
            itemQueue.push(i);
    },
    [&itemQueue] { // Consume 1000 items
        for (int i=0; i<1000; ++i) {
            int result = -1;
            while (!itemQueue.try_pop(result))
                Context::Yield();
            ProcessItem(result);
        }
    });
The Many Core Shift is happening
VS2010 with the Concurrency Runtime can help
Use PPL & Agents to express your potential concurrency
Let the runtime figure out the actual concurrency
Parallel iteration can help your application scale
Asynchronous Agents provide isolation from shared state
Concurrent collections are scalable and lock-free
Parallel Computing Developer Center
http://msdn.com/Concurrency
ConcRT Sample Pack
http://code.msdn.com/concrtextras
Native Concurrency Blog
http://blogs.msdn.com/nativeconcurrency
Forums
http://social.msdn.microsoft.com/Forums/en-US/category/parallelcomputing
// Korselt's criterion: n is a Carmichael number iff n is composite,
// squarefree, and (p - 1) divides (n - 1) for every prime factor p of n.
bool is_carmichael(const int n) {
    if (n < 2) { return false; }
    int k = n;
    for (int i = 2; i <= k / i; ++i) {
        if (k % i == 0) {
            if ((k / i) % i == 0) { return false; }       // repeated factor: not squarefree
            if ((n - 1) % (i - 1) != 0) { return false; } // (i - 1) must divide (n - 1)
            k /= i;
            i = 1;   // restart trial division on the reduced k
        }
    }
    // k is now the largest remaining prime factor; k != n ensures n is composite.
    return k != n && (n - 1) % (k - 1) == 0;
}