How can numpy be so much faster than my Fortran routine?

Question

Your Fortran implementation suffers two major shortcomings:

You mix IO and computations (and read from the file entry by entry).
You don’t use vector/matrix operations.

This implementation does perform the same operation as yours and is faster by a factor of 20 on my machine:

program test
  integer gridsize,unit
  real mini,maxi,mean
  real, allocatable :: tmp (:,:,:)

  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize))

  open(unit=unit,file="T.out",status="old",access="stream",&
       form='unformatted',action='read')
  read(unit=unit) tmp

  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)
  mean = sum(tmp)/gridsize**3
  print *, mini, maxi, mean

end program

The idea is to read in the whole file into one array tmp in one go. Then, I can use the functions MAXVAL, MINVAL, and SUM on the array directly.

For the accuracy issue: Simply using double precision values and doing the conversion on the fly as

mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))

only marginally increases the calculation time. I tried performing the operation element-wise and in slices, but that did only increase the required time at the default optimization level.

At -O3, the element-wise addition performs ~3 % better than the array operation. The difference between double and single precision operations is less than 2% on my machine – on average (the individual runs deviate by far more).

Here is a very fast implementation using LAPACK:

program test
  integer gridsize,unit, i, j
  real mini,maxi
  integer  :: t1, t2, rate
  real, allocatable :: tmp (:,:,:)
  real, allocatable :: work(:)
!  double precision :: mean
  real :: mean
  real :: slange

  call system_clock(count_rate=rate)
  call system_clock(t1)
  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))

  open(unit=unit,file="T.out",status="old",access="stream",&
       form='unformatted',action='read')
  read(unit=unit) tmp

  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)

!  mean = sum(tmp)/gridsize**3
!  mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
  mean = 0.d0
  do j=1,gridsize
    do i=1,gridsize
      mean = mean + slange('1', gridsize, 1, tmp(:,i,j),gridsize, work)
    enddo !i
  enddo !j
  mean = mean / gridsize**3

  print *, mini, maxi, mean
  call system_clock(t2)
  print *,real(t2-t1)/real(rate)

end program

This uses the single precision matrix 1-norm SLANGE on matrix columns. The run-time is even faster than the approach using single precision array functions – and does not show the precision issue.

Leave a Comment Cancel reply