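I want to count the number of floating-point operations (FLOPs) that my code performs on my machine. To do that, I want to make sure I am counting the operations in the innermost loop correctly.
My understanding is that the relevant number is the count of operations inside the vectorized loop: the total of the operations outside the B-spline kernel plus those inside the kernel.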
int compute_density_3d_ref(int N, double h,
                           double* restrict x, double* restrict y,
                           double* restrict z, double* restrict nu,
                           double* restrict rho){
    const double inv_h = 1./h;
    const double kernel_constant = w_bspline_3d_constant(h);
    #pragma omp parallel for
    for(size_t ii=0;ii<N;ii+=1){
        double xii = x[ii];
        double yii = y[ii];
        double zii = z[ii];
        double rhoii = 0.0;
        #pragma omp simd reduction(+:rhoii) aligned(x,y,z,nu)
        for(size_t jj=0;jj<N;jj+=1){
            double q = 0.;
            double xij = xii-x[jj];   // One operation
            double yij = yii-y[jj];   // One operation
            double zij = zii-z[jj];   // One operation
            q += xij*xij;             // Two operations
            q += yij*yij;             // Two operations
            q += zij*zij;             // Two operations
            q = sqrt(q)*inv_h;        // Two operations
            // Two operations plus the number of operations in w_bspline
            rhoii += nu[jj]*w_bspline_3d_simd(q); //*w_bspline_3d(q); // box->w(sqrt(dist),h);
        }
        rho[ii] = kernel_constant*rhoii;
    }
    return 0;
}
The functions it calls are:
double w_bspline_3d_constant(double h){
    return 3./(2.*M_PI*h*h*h);
}
#pragma omp declare simd
double w_bspline_3d(double q){
    double wq = 0.0;
    double wq1 = (0.6666666666666666 - q*q + 0.5*q*q*q);     // Six operations
    double wq2 = 0.16666666666666666*(2.-q)*(2.-q)*(2.-q);   // Six operations
    if(q<2.)
        wq = wq2;
    if(q<1.)
        wq = wq1;
    return wq;
}
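One reason the kernel count could come out lower than twelve is that an optimizing compiler may reuse repeated subexpressions such as `q*q` and `(2.-q)`. The sketch below writes that sharing out by hand so the operations can be counted directly. The name `w_shared` is hypothetical and not part of my code, and what a given compiler actually emits (including any fused multiply-add) would have to be checked in the generated assembly.

```c
/* Hypothetical variant of w_bspline_3d with the shared subexpressions
 * written out explicitly. Same piecewise formula, fewer source-level
 * operations: 9 instead of 12. */
static double w_shared(double q) {
    double q2 = q * q;                                     /* 1 mul */
    double t  = 2. - q;                                    /* 1 sub */
    double wq1 = 0.6666666666666666 - q2 + 0.5 * q2 * q;   /* 1 sub, 2 mul, 1 add */
    double wq2 = 0.16666666666666666 * t * t * t;          /* 3 mul */
    double wq = 0.0;
    if (q < 2.)
        wq = wq2;
    if (q < 1.)
        wq = wq1;
    return wq;
}

/* Source-level operation count of w_shared, taken from the comments above. */
static int w_shared_flops(void) {
    return 1 + 1 + 4 + 3;
}
```

Both polynomials are still evaluated for every `q`, as in the original kernel, so the count does not depend on which branch is selected.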
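I am counting sqrt as a single operation, on the assumption that the CPU can issue one (vector) sqrt instruction per cycle, say with AVX256. I am also counting additions and multiplications as equivalent floating-point operations.
I would like to know whether I am doing this count correctly, or whether I am overestimating (or even underestimating) the number of operations involved in the computation.
Thanks in advance.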
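To make the tally explicit, here is a minimal sketch that adds up the per-iteration counts exactly as they appear in the code comments. The enum and function names are hypothetical, and the numbers rest on the assumptions stated above: sqrt counts as one operation, additions and multiplications count equally, and both kernel polynomials are evaluated on every iteration.

```c
#include <stdint.h>

/* Per-iteration counts copied from the comments in the listings. */
enum {
    FLOPS_DIFF       = 3,  /* xij, yij, zij: one subtraction each          */
    FLOPS_DIST2      = 6,  /* three multiplies plus three adds into q      */
    FLOPS_SQRT_SCALE = 2,  /* sqrt (counted as one) and multiply by inv_h  */
    FLOPS_ACCUM      = 2,  /* nu[jj]*w(q) and the add into rhoii           */
    FLOPS_KERNEL     = 12  /* wq1 (six) and wq2 (six)                      */
};

/* Operations per (ii, jj) pair in the innermost loop. */
static uint64_t flops_per_pair(void) {
    return FLOPS_DIFF + FLOPS_DIST2 + FLOPS_SQRT_SCALE
         + FLOPS_ACCUM + FLOPS_KERNEL;
}

/* Total for N particles: N*N inner iterations plus one multiply by
 * kernel_constant per outer iteration. The one-time setup (1./h and
 * w_bspline_3d_constant) is left out. */
static uint64_t flops_total(uint64_t n) {
    return n * n * flops_per_pair() + n;
}
```

Under these assumptions the inner loop accounts for 13 operations outside the kernel and 12 inside it, so for example `flops_total(1000)` gives 25001000.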