Parallelizing many nested for loops in OpenMP C++

Hi, I am new to C++. I wrote a code that runs, but it is slow because of many nested for loops, and I want to speed it up with OpenMP. Can anyone guide me? I tried putting '#pragma omp parallel' before the ip loop, and inside this loop I used '#pragma omp parallel for' before the it loop, but it does not work:
#pragma omp parallel
for(int ip=0; ip !=nparticle; ip++){
    inf14>>r>>xp>>yp>>zp;
    zp/=sqrt(gamma2);
    counter++;
    double para[7]={0,0,Vz,x0-xp,y0-yp,z0-zp,0};
    if(ip>=0 && ip<=43){
        #pragma omp parallel for
        for(int it=0;it<NT;it++){
            para[6]=PosT[it];
            for(int ix=0;ix<NumX;ix++){
                para[3]=PosX[ix]-xp;
                for(int iy=0;iy<NumY;iy++){
                    para[4]=PosY[iy]-yp;
                    for(int iz=0;iz<NumZ;iz++){
                        para[5]=PosZ[iz]-zp;
                        int position=it*NumX*NumY*NumZ+ix*NumY*NumZ+iy*NumZ+iz;
                        rotation(para,&Field[3*position]);
                        MagX[position] +=chg*Field[3*position];
                        MagY[position] +=chg*Field[3*position+1];
                        MagZ[position] +=chg*Field[3*position+2];
                    }
                }
            }
        }
    }
}
My rotation function also contains an open-ended integration loop, as shown below:
for(int i=1;;i++){
    gsl_integration_qag(&F, 10*i, 10*i+10, 1.0e-8, 1.0e-8, 100, 2, w, &temp, &error);
    result+=temp;
    if(abs(temp/result)<ACCURACY){
        break;
    }
}
I am using the GSL libraries as well. How can I speed this up, or how should I apply OpenMP here?

Do not nest parallel pragmas inside another parallel pragma: you might overload the machine by creating more threads than it can handle. I would establish the parallelization in the outer loop (if it is big enough):
#pragma omp parallel for
for(int ip=0; ip !=nparticle; ip++)
Also make sure you do not have any race conditions between threads (e.g. read-after-write, RAW).
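For the loops in the question specifically, a hedged sketch of that structure could look like the following (variable names are taken from the question; std::vector is assumed available; counter is dropped for brevity). The particle data is read serially up front because file streams are not thread-safe, para becomes a per-particle local, a local scratch array replaces the shared Field buffer since it is unclear from the excerpt whether the per-position field values must be kept, and the Mag accumulations use atomic updates because different particles write to the same position:
// Read the particle data serially first -- file streams are not thread-safe.
std::vector<double> xs(nparticle), ys(nparticle), zs(nparticle);
for (int ip = 0; ip < nparticle; ip++) {
    inf14 >> r >> xs[ip] >> ys[ip] >> zs[ip];
    zs[ip] /= sqrt(gamma2);
}
#pragma omp parallel for
for (int ip = 0; ip < nparticle; ip++) {
    if (ip > 43) continue;                       // same ip <= 43 filter as in the question
    double xp = xs[ip], yp = ys[ip], zp = zs[ip];
    double para[7] = {0, 0, Vz, x0 - xp, y0 - yp, z0 - zp, 0};   // local to this particle
    for (int it = 0; it < NT; it++) {
        para[6] = PosT[it];
        for (int ix = 0; ix < NumX; ix++) {
            para[3] = PosX[ix] - xp;
            for (int iy = 0; iy < NumY; iy++) {
                para[4] = PosY[iy] - yp;
                for (int iz = 0; iz < NumZ; iz++) {
                    para[5] = PosZ[iz] - zp;
                    int position = it*NumX*NumY*NumZ + ix*NumY*NumZ + iy*NumZ + iz;
                    double field[3];             // local scratch instead of the shared Field array
                    rotation(para, field);
                    // several particles accumulate into the same position, hence atomic
                    #pragma omp atomic
                    MagX[position] += chg*field[0];
                    #pragma omp atomic
                    MagY[position] += chg*field[1];
                    #pragma omp atomic
                    MagZ[position] += chg*field[2];
                }
            }
        }
    }
}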
Advice: if you do not get a good speed-up, a common practice is to iterate in chunks rather than one increment at a time. For instance:
int num_threads = 1;
#pragma omp parallel
{
    #pragma omp single
    {
        num_threads = omp_get_num_threads();
    }
}
int chunkSize = 20; // Define your own chunk here
for (int position = 0; position < total; position += (chunkSize*num_threads)) {
    int endOfChunk = position + (chunkSize*num_threads);
    if (endOfChunk > total) endOfChunk = total; // clamp the last block
    #pragma omp parallel for
    for (int ip = position; ip < endOfChunk; ip += chunkSize) {
        // Code
    }
}

If you don't have inter-loop dependencies, you can use the collapse clause to parallelize multiple loops altogether. Example:
void scale( int N, int M, float A[N][M], float B[N][M], float alpha ) {
    #pragma omp for collapse(2)
    for( int i = 0; i < N; i++ ) {
        for( int j = 0; j < M; j++ ) {
            A[i][j] = alpha * B[i][j];
        }
    }
}
I suggest you check out the OpenMP C/C++ cheat sheet (PDF), which contains all the specifications for loop parallelization.
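Applied to the loop nest from the question (with the surrounding ip loop left serial, and assuming OpenMP 3.0 or later for collapse), a hedged sketch could collapse the four inner loops, provided para is not shared between threads; here the offsets are simply computed inside the loop body:
#pragma omp parallel for collapse(4)
for (int it = 0; it < NT; it++) {
    for (int ix = 0; ix < NumX; ix++) {
        for (int iy = 0; iy < NumY; iy++) {
            for (int iz = 0; iz < NumZ; iz++) {
                // private copy of para, built from the question's variables
                double para[7] = {0, 0, Vz,
                                  PosX[ix] - xp, PosY[iy] - yp, PosZ[iz] - zp,
                                  PosT[it]};
                int position = it*NumX*NumY*NumZ + ix*NumY*NumZ + iy*NumZ + iz;
                rotation(para, &Field[3*position]);
                // each (it,ix,iy,iz) maps to a unique position, so no race here
                MagX[position] += chg*Field[3*position];
                MagY[position] += chg*Field[3*position+1];
                MagZ[position] += chg*Field[3*position+2];
            }
        }
    }
}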

Related

OpenMP/C++: Parallel for loop with reduction afterwards - best practice?

Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction in the end. So that's what I would like to do, I think.
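For reference, omp atomic only accepts scalar updates, so a per-component fallback (which I would rather avoid) would presumably look roughly like this sketch, using Eigen's data() accessor:
// sketch only: update each component of the Vector3d separately
const Eigen::Vector3d fj = f(j);
double* vj = velocity[j].data();
for (int c = 0; c < 3; ++c)
{
    #pragma omp atomic
    vj[c] += fj[c];
}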
I have come up with this:
#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));
    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    }
    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also the question whether this is the best approach still holds.
Edit: As requested in a comment, the reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
You're doing an array reduction. I have described this several times (e.g. in "reducing an array in openmp" and "fill histograms array reduction in parallel with openmp without using a critical section"). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int vsize = velocity.size();
    #pragma omp single
    velocitya.resize(vsize*nthreads);
    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
              Eigen::Vector3d(0,0,0));
    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
    }
    #pragma omp for schedule(static)
    for(int i=0; i<vsize; i++) {
        for(int t=0; t<nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}
This method requires extra care and tuning due to false sharing, which I have not done here.
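For what it's worth, a hedged sketch of such tuning (padding each thread's slice of the scratch array up to whole 64-byte cache lines; the sizes and the plain double payload are placeholders for the real data) could look like this:
#include <omp.h>
#include <vector>

int main() {
    const int vsize  = 1000;                        // placeholder size
    const int line   = 64 / sizeof(double);         // elements per assumed cache line
    const int padded = ((vsize + line - 1) / line) * line;
    std::vector<double> velocity(vsize, 0.0), velocitya;
    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int ithread  = omp_get_thread_num();
        #pragma omp single
        velocitya.assign((size_t)padded * nthreads, 0.0);
        #pragma omp for schedule(static)
        for (int i = 0; i < vsize; i++)
            velocitya[(size_t)ithread*padded + i] += 1.0;   // stand-in for the real work
        #pragma omp for schedule(static)
        for (int i = 0; i < vsize; i++)
            for (int t = 0; t < nthreads; t++)
                velocity[i] += velocitya[(size_t)t*padded + i];
    }
    return velocity[0] > 0 ? 0 : 1;                 // keep the result observable
}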
As to which method is better you will have to test.
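Regarding the reduction clause error in the question: OpenMP data-sharing clauses only accept plain variables, not class members, so one possible workaround (a sketch assuming a compiler with OpenMP 4.0 user-defined reductions; the Eigen::Vector3d payload is simplified to double here, and the cluster sets to vectors) is to reduce into a local vector and copy it back afterwards:
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct ShapeMatching {
    std::vector<double> velocity;                   // simplified stand-in payload

    void accumulate(const std::vector<std::vector<int>>& clusters) {
        // reduce into a local variable, since 'velocity' itself cannot appear
        // in a data-sharing clause
        std::vector<double> sum(velocity.size(), 0.0);
        #pragma omp declare reduction(vplus : std::vector<double> :             \
            std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(),      \
                           omp_out.begin(), std::plus<double>()))               \
            initializer(omp_priv = std::vector<double>(omp_orig.size(), 0.0))
        #pragma omp parallel for reduction(vplus : sum)
        for (std::size_t i = 0; i < clusters.size(); ++i)
            for (int j : clusters[i])
                sum[j] += 1.0;                      // stand-in for f(j)
        for (std::size_t i = 0; i < velocity.size(); ++i)
            velocity[i] += sum[i];
    }
};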

openmp latency for inside for

I have a piece of code that I want to parallelize, but the OpenMP program is much slower than the serial version. What is wrong with my implementation? This is the code of the program:
#include <iostream>
#include <gsl/gsl_math.h>
#include "Chain.h"
using namespace std;

int main(){
    int const N=1000;
    int timeSteps=100;
    double delta=0.0001;
    double qq[N];
    Chain ch(N);
    ch.initCond();
    for (int t=0; t<timeSteps; t++){
        ch.changeQ(delta*t);
        ch.calMag_i();
        ch.calForce001();
    }
    ch.printSomething();
}
The Chain.h is
class Chain{
public:
    int N;
    double *q;
    double *mx;
    double *my;
    double *force;
    Chain(int const Np);
    void initCond();
    void changeQ(double delta);
    void calMag_i();
    void calForce001();
};
And the Chain.cpp is
#include <cmath>      // cos, sin
#include <iostream>   // cout
#include <omp.h>      // omp_get_wtime
#include "Chain.h"
using namespace std;

Chain::Chain(int const Np){
    this->N = Np;
    this->q = new double[Np];
    this->mx = new double[Np];
    this->my = new double[Np];
    this->force = new double[Np];
}
void Chain::initCond(){
    for (int i=0; i<N; i++){
        q[i] = 0.0;
        force[i] = 0.0;
    }
}
void Chain::changeQ(double delta){
    int i=0;
    #pragma omp parallel
    {
        #pragma omp for
        for (int i=0; i<N; i++){
            q[i] = q[i] + delta*i + 1.0*i/N;
        }
    }
}
void Chain::calMag_i(){
    int i =0;
    #pragma omp parallel
    {
        #pragma omp for
        for (i=0; i<N; i++){
            mx[i] = cos(q[i]);
            my[i] = sin(q[i]);
        }
    }
}
void Chain::calForce001(){
    int i;
    int j;
    double fij =0.0;
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp for private(j, fij)
        for (i=0; i<N; i++){
            force[i] = 0.0;
            for (j=0; j<i; j++){
                fij = my[i]*mx[j] - mx[i]*my[j];
                #pragma omp critical
                {
                    force[i] += fij;
                    force[j] += -fij;
                }
            }
        }
    }
    double time = omp_get_wtime() - start_time;
    cout <<"time = " << time <<endl;
}
So the methods changeQ() and calMag_i() are in fact faster than the serial code, but my problem is calForce001(). The execution times are:
with openMP 3.939s
without openMP 0.217s
Now, clearly I'm doing something wrong, or the code can't be parallelized. Any help would be useful.
Thanks in advance.
Carlos
Edit:
To clarify the question, I added omp_get_wtime() calls to measure the execution time of calForce001(); the times for one execution are:
with omp: 0.0376656
without omp: 0.00196766
So with OpenMP the method is about 20 times slower.
I also measured the time for the calMag_i() method:
with omp: 3.3845e-05
without omp: 9.9516e-05
For this method OpenMP is about 3 times faster.
I hope this confirms that the latency problem is in the calForce001() method.
There are three reasons why you don't benefit from any speedup.
1. You have #pragma omp parallel all over your code. What this pragma does is start the "team of threads". At the end of the block, this team is disbanded. This is quite costly. Removing those and using #pragma omp parallel for instead of #pragma omp for will start the team upon first encounter and put it to sleep after each block. This made the application 4x faster for me.
2. You use #pragma omp critical. On most platforms, this will force the use of a mutex - which is heavily contended because all threads want to write to that variable at the same time. So, don't use a critical section here. You could use atomic updates, but in this case, that won't make much of a difference - see the third item. Just removing the critical section improved the speed by another 3x (a race-free way to restructure this loop is sketched at the end of this answer).
3. Parallelism only makes sense when you have an actual workload. All of your code is too small to benefit from parallelism. There's simply too little workload to win back the time lost on starting/waking/destroying the threads. If your workload were ten times this, some of the parallel for statements would make sense. But especially Chain::calForce001() will never be worth it if you have to do atomic updates.
With respect to programming style: you're programming in C++. Please use local scope variables wherever you can - in e.g. Chain::calForce001(), use a local double fij inside the inner loop. That saves you from having to write private clauses. Compilers are smart enough to optimize that. Correct scoping allows for better optimizations.
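For illustration, here is a hedged rewrite of calForce001() along those lines: it drops the critical section by exploiting the antisymmetry fij(i,j) = -fij(j,i), so every force[i] is owned by exactly one thread. This reproduces the serial result, at the cost of roughly twice the arithmetic:
void Chain::calForce001(){
    double start_time = omp_get_wtime();
    #pragma omp parallel for
    for (int i=0; i<N; i++){
        double sum = 0.0;                 // local scope, no private clause needed
        for (int j=0; j<N; j++){          // the j==i term is zero
            sum += my[i]*mx[j] - mx[i]*my[j];
        }
        force[i] = sum;                   // each force[i] is written by one thread only
    }
    double time = omp_get_wtime() - start_time;
    cout <<"time = " << time <<endl;
}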

Why does a for loop containing three linearly scaling for loops not scale linearly?

Edit2: I believe this problem can be boiled down to a less specific example. Why is it that if I run three parallel for loops, each of which scales roughly linearly, inside an enclosing for loop, the whole thing no longer scales even close to linearly? That is, when I comment out the second two loops within my s for loop I obtain a linear speed-up, but when I don't, I don't.
I've tried several implementations of what essentially amounts to filling an array with calculated values. I'm attempting to parallelize it using OpenMP on a shared-memory system. I've worked from the basic layout, shown in my full code at the bottom, in all cases. I've tried using a simple #pragma omp for inside a parallel region, and I've tried manually splitting up the for loop into components based on the thread number, like so:
for(int i=thread_number; i<Nx/num_threads; i++)
{
...
}
From what I understand, this is how OpenMP normally splits up for loops, and thus, as expected, I experienced a negligible difference in performance. Next I tried splitting up the for loops (the outer i for loops in the full code) such that the threads would be accessing adjacent locations, like so:
for(int i=thread_number; i<Nx; i+=num_threads)
{
...
}
This seemed to be more efficient than the original method, but it still became slower as I increased the number of processors. I understand that due to the shared memory it won't scale linearly, but I don't understand why even 2 processors are slower than the non-threaded version.
Edit: I've added a more accessible and runnable imitation code below; I found that with just this I am able to increase the speed. However, I'm still unable to obtain anything close to the linear speed-up I expected. I'm compiling with g++ and the flags -std=c++11 -lm -O3 -fopenmp -lpthread. I'm getting times of around 54 s using one thread and 41 s using two threads. I should also note that the input_q[ss]=1 lines which follow each i-iterating for loop were placed there as a reminder (to myself) that these loops cannot be combined, due to changes to the ****_q arrays between the for loops.
More accessible/runnable imitation code:
#include <iostream>
#include <math.h>
#include <omp.h>
#include <boost/timer.hpp>

#define Ns 100
#define Nx 64
#define Ny 64
#define Nz 64
#define num_threads 1 // Set this to >1 for parallel tests

int main()
{
    omp_set_num_threads(num_threads);
    int ss=0;
    const double ds=1.0/300;

    double * wds = new double[Nx*Ny*Nz];
    double * kds = new double[Nx*Ny*Nz];
    double * w = new double[Nx*Ny*Nz];
    double * k = new double[Nx*Ny*Nz];
    double * q = new double[Nx*Ny*Nz*(Ns+1)];
    double * input_q;
    double * transformed_q;
    double * final_q;
    input_q=(double*)malloc(sizeof(double)*Nx*Ny*Nz);
    transformed_q=(double*)malloc(sizeof(double)*Nx*Ny*Nz);
    final_q=(double*)malloc(sizeof(double)*Nx*Ny*Nz);

    boost::timer t;
    for(int a=0; a<100;a++)
    {
        #pragma omp parallel for private(ss)
        for(int i=0;i<Nx;i++){
            for(int j=0;j<Ny;j++){
                for(int l=0;l<Nz;l++){
                    kds[l+Nz*(j+Ny*i)]=exp((-ds)*k[l+Nz*(j+Ny*i)]);
                    wds[l+Nz*(j+Ny*i)]=exp((-0.5)*ds*w[l+Nz*(j+Ny*i)]);
                }
            }
        }
        //////////////// This is where it scales poorly //////////////
        for(int s=0;s<Ns;s++){
            #pragma omp parallel for private(ss)
            for(int i=0;i<Nx;i++){
                for(int j=0;j<Ny;j++){
                    for(int l=0;l<Nz;l++){
                        ss=l+Nz*(j+Ny*i);
                        input_q[ss]=q[s+Ns*(l+Nz*(j+Ny*i))]*wds[ss];
                    }
                }
            }
            input_q[ss]=1; // This is a Fourier transform in the real program
            #pragma omp parallel for private(ss)
            for(int i=0;i<Nx;i++){
                for(int j=0;j<Ny;j++){
                    for(int l=0;l<Nz;l++){
                        ss=l+Nz*(j+Ny*i);
                        transformed_q[ss]*=kds[ss];
                    }
                }
            }
            input_q[ss]=1; // This is a Fourier transform in the real program
            #pragma omp parallel for private(ss)
            for(int i=0;i<Nx;i++){
                for(int j=0;j<Ny;j++){
                    for(int l=0;l<Nz;l++){
                        ss=l+Nz*(j+Ny*i);
                        q[s+1+Ns*(l+Nz*(j+Ny*i))]=(final_q[ss]*wds[ss])/(8.0*Nx*Ny*Nz);
                    }
                }
            }
        }
    }
    std::cout<<"Time: " <<t.elapsed()/num_threads<<std::endl;
    return 0;
}
Full Code (original excerpt):
#pragma omp parallel private(ss)
{
#pragma omp for
for(int i=0;i<Nx;i++){
for(int j=0;j<Ny;j++){
for(int l=0;l<Nz;l++){
kds(i,j,l)=exp((-ds)*k(i,j,l));
wds(i,j,l)=exp((-0.5)*ds*w(i,j,l));
}
}
}
if(sign==1){
#pragma omp for
for(int i=0;i<Nx;i++){
for(int j=0;j<Ny;j++){
for(int l=0;l<Nz;l++){
q(i,j,l,0)=qint(i,j,l);
}
}
}
for(int s=0;s<Ns;s++){
#pragma omp for
for(int i=0;i<Nx;i++){
for(int j=0;j<Ny;j++){
for(int l=0;l<Nz;l++){
ss=l+Nz*(j+Ny*i);
input_q[ss]=q(i,j,l,s)*wds(i,j,l);
}
}
}
#pragma omp barrier
#pragma omp single
{
fftw_execute(forward_plan);
}
#pragma omp for
for(int i=0;i<Nx;i++){
for(int j=0;j<Ny;j++){
for(int l=0;l<Nz;l++){
ss=l+Nz*(j+Ny*i);
transformed_q[ss]*=kds(i,j,l);
}
}
}
#pragma omp barrier
#pragma omp single
{
fftw_execute(inverse_plan);
}
}
for(int i=0;i<Nx;i++){
for(int j=0;j<Ny;j++){
for(int l=0;l<Nz;l++){
ss=l+Nz*(j+Ny*i);
q(i,j,l,s+1)=((final_q[ss]*wds(i,j,l))/(8.0*Nx*Ny*Nz));
}
}
}
}

Influence on the static scheduling overhead in OpenMP

I thought about which factors would influence the static scheduling overhead in OpenMP.
In my opinion it is influenced by:
CPU performance
specific implementation of the OpenMP run-time library
the number of threads
But am I missing further factors? Maybe the size of the tasks, ...?
And furthermore: Is the overhead linearly dependent on the number of iterations?
In this case I would expect that having static scheduling and 4 cores, the overhead increases linearly with 4*i iterations. Correct so far?
EDIT:
I am only interested in the static (!) scheduling overhead itself. I am not talking about thread start-up overhead and time spent in synchronisation and critical section overhead.
You need to separate the overhead for OpenMP to create a team/pool of threads from the overhead for each thread to operate on its separate set of iterations in a for loop.
Static scheduling is easy to implement by hand (which is sometimes very useful). Let's consider what I consider the two most important static schedules, schedule(static) and schedule(static,1); then we can compare them to schedule(dynamic,chunk).
#pragma omp parallel for schedule(static)
for(int i=0; i<N; i++) foo(i);
is equivalent to (but not necessarily equal to)
#pragma omp parallel
{
    int start = omp_get_thread_num()*N/omp_get_num_threads();
    int finish = (omp_get_thread_num()+1)*N/omp_get_num_threads();
    for(int i=start; i<finish; i++) foo(i);
}
and
#pragma omp parallel for schedule(static,1)
for(int i=0; i<N; i++) foo(i);
is equivalent to
#pragma omp parallel
{
    int ithread = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    for(int i=ithread; i<N; i+=nthreads) foo(i);
}
From this you can see that it's quite trivial to implement static scheduling and so the overhead is negligible.
On the other hand if you want to implement schedule(dynamic) (which is the same as schedule(dynamic,1)) by hand it's more complicated:
int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
    #pragma omp atomic capture
    i = cnt++;
    if(i>=N) break;
    foo(i);
}
This requires OpenMP >=3.1. If you wanted to do this with OpenMP 2.0 (for MSVC) you would need to use critical like this
int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
    #pragma omp critical
    i = cnt++;
    if(i>=N) break;
    foo(i);
}
Here is an equivalent to schedule(dynamic,chunk) (I have not optimized this using atomic accesses):
int cnt = 0;
int chunk = 5;
#pragma omp parallel
{
    int start, finish;
    do {
        #pragma omp critical
        {
            start = cnt;
            finish = cnt+chunk < N ? cnt+chunk : N;
            cnt += chunk;
        }
        for(int i=start; i<finish; i++) foo(i);
    } while(finish<N);
}
Clearly using atomic accesses is going to cause more overhead. This also shows why using larger chunks for schedule(dynamic,chunk) can reduce the overhead.
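If you want to measure the pure scheduling overhead yourself, a minimal hedged benchmark along these lines (arbitrary size, near-empty loop body) can make the difference visible:
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int N = 1 << 22;
    std::vector<double> a(N);

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) a[i] = i * 0.5;     // near-empty body
    double t_static = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < N; i++) a[i] = i * 0.5;
    double t_dynamic = omp_get_wtime() - t0;

    std::printf("static: %g s   dynamic,1: %g s\n", t_static, t_dynamic);
    return a[N/2] > 0 ? 0 : 1;                      // keep 'a' observable
}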

Why my C code is slower using OpenMP

I'm trying to do multi-threaded programming on the CPU using OpenMP. I have lots of for loops which are good candidates to be parallelized. I attached a part of my code here. When I use the first #pragma omp parallel for reduction, my code is faster, but when I try to use the same directive to parallelize other loops, it gets slower. Does anyone have any idea why?
.
.
.
omp_set_dynamic(0);
omp_set_num_threads(4);
float *h1=new float[nvi];
float *h2=new float[npi];
while(tol>0.001)
{
    std::fill_n(h2, npi, 0);
    int k,i;
    float h222=0;
    #pragma omp parallel for private(i,k) reduction (+: h222)
    for (i=0;i<npi;++i)
    {
        int p1=ppi[i];
        int m = frombus[p1];
        for (k=0;k<N;++k)
        {
            h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                    + B[m-1][k]*sin(del[m-1]-del[k]));
        }
        h2[i]=h222;
    }
    //*********** h3*****************
    std::fill_n(h3, nqi, 0);
    float h333=0;
    #pragma omp parallel for private(i,k) reduction (+: h333)
    for (int i=0;i<nqi;++i)
    {
        int q1=qi[i];
        int m = frombus[q1];
        for (int k=0;k<N;++k)
        {
            h333 += v[m-1]*v[k]*(G[m-1][k]*sin(del[m-1]-del[k])
                    - B[m-1][k]*cos(del[m-1]-del[k]));
        }
        h3[i]=h333;
    }
    .
    .
    .
}
I don't think your OpenMP code gives the same result as the code without OpenMP. Let's just concentrate on the h2[i] part of the code (since the h3[i] part has the same logic). There is a dependency of h2[i] on the index i (i.e. h2[1] depends on h2[0], because h222 keeps accumulating). The OpenMP reduction you're doing won't give the correct result. If you want to do the reduction with OpenMP you need to do it on the inner loop, like this:
float h222 = 0;
for (int i=0; i<npi; ++i) {
    int p1=ppi[i];
    int m = frombus[p1];
    #pragma omp parallel for reduction(+:h222)
    for (int k=0;k<N; ++k) {
        h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                + B[m-1][k]*sin(del[m-1]-del[k]));
    }
    h2[i] = h222;
}
However, I don't know if that will be very efficient. An alternative method is to fill h2[i] in parallel in the outer loop without a reduction and then take care of the dependency serially. Even though the serial loop is not parallelized, it should have only a small effect on the computation time, since it does not contain the inner loop over k. This should give the same result with and without OpenMP and still be fast.
#pragma omp parallel for
for (int i=0; i<npi; ++i) {
    int p1=ppi[i];
    int m = frombus[p1];
    float h222 = 0;
    for (int k=0;k<N; ++k) {
        h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                + B[m-1][k]*sin(del[m-1]-del[k]));
    }
    h2[i] = h222;
}
//take care of the dependency serially
for(int i=1; i<npi; i++) {
    h2[i] += h2[i-1];
}
Keep in mind that creating and destroying threads is a time-consuming process; clock the execution time for the process and see for yourself. You only use the parallel reduction twice, which may be faster than a serial reduction, but the initial cost of creating the threads may still be higher. Try parallelizing the outermost loop (if possible) to see if you can obtain a speedup.
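If the repeated region start-up inside the while loop turns out to matter, a hedged variant of the second approach above (reusing the question's variables) fuses the h2 and h3 passes into a single parallel region with two worksharing loops, and handles both running sums serially afterwards:
#pragma omp parallel
{
    #pragma omp for nowait
    for (int i=0; i<npi; ++i) {
        int m = frombus[ppi[i]];
        float sum = 0;
        for (int k=0; k<N; ++k) {
            sum += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                   + B[m-1][k]*sin(del[m-1]-del[k]));
        }
        h2[i] = sum;
    }
    #pragma omp for
    for (int i=0; i<nqi; ++i) {
        int m = frombus[qi[i]];
        float sum = 0;
        for (int k=0; k<N; ++k) {
            sum += v[m-1]*v[k]*(G[m-1][k]*sin(del[m-1]-del[k])
                   - B[m-1][k]*cos(del[m-1]-del[k]));
        }
        h3[i] = sum;
    }
}
// running-sum dependencies handled serially, as above
for (int i=1; i<npi; i++) h2[i] += h2[i-1];
for (int i=1; i<nqi; i++) h3[i] += h3[i-1];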

Resources