A Performance Puzzle


A Performance Puzzle

Gregory Wright

Hi,

Something Haskell has lacked for a long time is a good medium-duty linear system solver based on the LU decomposition.  There are bindings to the usual C/Fortran libraries, but not one in pure Haskell.  (An example "LU factorization" routine that does not do partial pivoting has been around for years, but lacking pivoting it can fail unexpectedly on well-conditioned inputs.  Another Haskell LU decomposition using partial pivoting is around, but it uses an inefficient representation of the pivot matrix, so it's not suited to solving systems of more than 100 x 100, say.)

By medium duty I mean a linear system solver that can handle systems of (1000s) x (1000s) and uses Crout's efficient in-place algorithm.  In short, a program that does everything short of exploiting SIMD vector instructions for solving small subproblems.

Instead of complaining about this, I have written a little library that implements one.  It contains an LU factorization function and an LU system solver.  The LU factorization also returns the parity of the pivots ( = (-1)^(number of row swaps) ) so it can be used to calculate determinants.  I used Gustavson's recursive (imperative) version of Crout's method.  The implementation is quite simple and deserves to be better known by people using functional languages to do numeric work.  The library can be downloaded from GitHub: https://github.com/gwright83/luSolve
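
To make the determinant remark concrete: once you have the parity and the diagonal of U, the determinant is just their product, as in the sketch below (placeholder names, not the library's exact API):

    -- det A = (-1)^(number of row swaps) * product of the diagonal of U,
    -- since P A = L U with L having a unit diagonal.
    determinantFromLU :: Double    -- parity, i.e. (-1)^(number of row swaps)
                      -> [Double]  -- diagonal entries of U
                      -> Double
    determinantFromLU parity diagU = parity * product diagU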

The performance scales as expected (as n^3: a linear system 10 times larger in each dimension takes 1000 times longer to solve):

Benchmark luSolve-bench: RUNNING...
benchmarking LUSolve/luFactor 100 x 100 matrix
time                 1.944 ms   (1.920 ms .. 1.980 ms)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 1.981 ms   (1.958 ms .. 2.009 ms)
std dev              85.64 μs   (70.21 μs .. 107.7 μs)
variance introduced by outliers: 30% (moderately inflated)

benchmarking LUSolve/luFactor 500 x 500 matrix
time                 204.3 ms   (198.1 ms .. 208.2 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 203.3 ms   (201.2 ms .. 206.2 ms)
std dev              3.619 ms   (2.307 ms .. 6.231 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking LUSolve/luFactor 1000 x 1000 matrix
time                 1.940 s    (1.685 s .. 2.139 s)
                     0.998 R²   (0.993 R² .. 1.000 R²)
mean                 1.826 s    (1.696 s .. 1.880 s)
std dev              93.63 ms   (5.802 ms .. 117.8 ms)
variance introduced by outliers: 19% (moderately inflated)

Benchmark luSolve-bench: FINISH
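
As a rough sanity check of the cubic scaling -- just back-of-the-envelope arithmetic on the numbers above, not an additional benchmark -- scaling the 100 x 100 time by (n/100)^3 predicts the larger runs closely:

    -- Predicted time in milliseconds at size n, scaling the measured
    -- 1.944 ms at n = 100 as (n / 100)^3.
    predictMs :: Double -> Double
    predictMs n = 1.944 * (n / 100) ^ (3 :: Int)

    -- predictMs 500  ~  243 ms   (measured: 204.3 ms)
    -- predictMs 1000 ~ 1944 ms   (measured: 1.940 s)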


The puzzle is why the overall performance is so poor.  When I solve random 1000 x 1000 systems using the linsys.c example file from the Recursive LAPACK (ReLAPACK) library -- which implements the same algorithm -- the average time is only 26 ms.  (I have tweaked ReLAPACK's dgetrf.c so that it doesn't use optimized routines for small matrices.  As near as I can tell, the C and Haskell versions should be doing the same thing.)

The Haskell version runs 75 times slower.  This is the puzzle.

My Haskell version uses a mutable matrix of unboxed doubles (from Kai Zhang's matrices library).  Matrix reads and writes are unsafe, so there is no overhead from bounds checking.
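
For readers who haven't used it, the access pattern is roughly the following -- a simplified sketch of a dense matrix as a flat unboxed vector in row-major order with unchecked element access, not the matrices library's actual code:

    import Control.Monad.Primitive (PrimMonad, PrimState)
    import qualified Data.Vector.Unboxed.Mutable as VUM

    -- Rows, columns, and the flat row-major storage.
    data MMatrix s = MMatrix !Int !Int !(VUM.MVector s Double)

    readM :: PrimMonad m => MMatrix (PrimState m) -> Int -> Int -> m Double
    readM (MMatrix _ nc v) i j = VUM.unsafeRead v (i * nc + j)

    writeM :: PrimMonad m => MMatrix (PrimState m) -> Int -> Int -> Double -> m ()
    writeM (MMatrix _ nc v) i j x = VUM.unsafeWrite v (i * nc + j) x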

Let's look at the result of profiling:

        Tue Jul 31 21:07 2018 Time and Allocation Profiling Report  (Final)

           luSolve-hspec +RTS -N -p -RTS

        total time  =     7665.31 secs   (7665309 ticks @ 1000 us, 1 processor)
        total alloc = 10,343,030,811,040 bytes  (excludes profiling overheads)

COST CENTRE           MODULE                            SRC                                                      %time %alloc

unsafeWrite           Data.Matrix.Dense.Generic.Mutable src/Data/Matrix/Dense/Generic/Mutable.hs:(38,5)-(39,38)   17.7   29.4
basicUnsafeWrite      Data.Vector.Primitive.Mutable     Data/Vector/Primitive/Mutable.hs:115:3-69                 14.7   13.0
unsafeRead            Data.Matrix.Dense.Generic.Mutable src/Data/Matrix/Dense/Generic/Mutable.hs:(34,5)-(35,38)   14.2   20.7
matrixMultiply.\.\.\  Numeric.LinearAlgebra.LUSolve     src/Numeric/LinearAlgebra/LUSolve.hs:(245,54)-(249,86)    13.4   13.5
readByteArray#        Data.Primitive.Types              Data/Primitive/Types.hs:184:30-132                         9.0   15.5
basicUnsafeRead       Data.Vector.Primitive.Mutable     Data/Vector/Primitive/Mutable.hs:112:3-63                  8.8    0.1
triangularSolve.\.\.\ Numeric.LinearAlgebra.LUSolve     src/Numeric/LinearAlgebra/LUSolve.hs:(382,45)-(386,58)     5.2    4.5
matrixMultiply.\.\    Numeric.LinearAlgebra.LUSolve     src/Numeric/LinearAlgebra/LUSolve.hs:(244,54)-(249,86)     4.1    0.3
primitive             Control.Monad.Primitive           Control/Monad/Primitive.hs:152:3-16                        3.8    0.0
basicUnsafeRead       Data.Vector.Unboxed.Base          Data/Vector/Unboxed/Base.hs:278:813-868                    3.3    0.0
basicUnsafeWrite      Data.Vector.Unboxed.Base          Data/Vector/Unboxed/Base.hs:278:872-933                    1.5    0.0
triangularSolve.\.\   Numeric.LinearAlgebra.LUSolve     src/Numeric/LinearAlgebra/LUSolve.hs:(376,33)-(386,58)     1.3    0.1

<snip>


A large amount of time is spent on the invocations of unsafeRead and unsafeWrite.  This is a bit suspicious -- it looks as if these calls may not be inlined.  In the Data.Vector.Unboxed.Mutable library, which provides the underlying linear vector of storage locations, the unsafeRead and unsafeWrite functions are declared INLINE.  Could this be a failure of the 'matrices' library to mark its unsafeRead/Write functions as INLINE or SPECIALIZE as well?
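
Concretely, the marking in question would amount to pragmas like these on the wrapper functions (shown against the simplified readM/writeM sketch above, not as a patch to the actual library):

    -- Ask GHC to inline the wrappers at every call site, so the
    -- readDoubleArray#/writeDoubleArray# primops land directly in the
    -- caller's inner loop.
    {-# INLINE readM #-}
    {-# INLINE writeM #-}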

On the other hand, looking at the core (.dump-simpl) of the library doesn't show any dictionary passing, and access to the matrix seems to be through GHC.Prim.writeDoubleArray# and GHC.Prim.readDoubleArray#.

If this program took three to five times longer, I would not be concerned, but a factor of seventy-five indicates that I've missed something important.  Can anyone tell me what it is?

Best Wishes,

Greg



Re: A Performance Puzzle

Vanessa McHale
Looking at your benchmarks, you may be benchmarking the wrong thing. The function you are benchmarking is runLUFactor, which generates random matrices in addition to factoring them.


Re: A Performance Puzzle

Gregory Wright
That's an interesting point.  Could the generation of the random matrix be that slow?  Something to check.

In my comparison with dgetrf.c from ReLAPACK, I also used random matrices, but measured the execution time from the start of the factorization, so I did not include the generation of the random matrix.

The one piece of evidence that still points to a performance problem is the scaling, since the execution time goes quite accurately as n^3 for n x n linear systems.  I would expect the time for generation of a random matrix, even if done very inefficiently, to scale as n^2.



Re: A Performance Puzzle

Vanessa McHale

I agree, I think the main problem would still remain once that was accounted for, but it may be worth doing correctly nonetheless :)

The only thing I can think of off the top of my head is using array rather than vector. I am not sure that would fix performance, but it would make it closer to the C implementation. You might also consider looking at a heap profile or using threadscope.
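
For reference, here is a minimal sketch of the array-based alternative, using an unboxed mutable STUArray -- an illustration only, not a drop-in replacement for the library's matrix type:

    import Control.Monad.ST (ST)
    import Data.Array.ST (STUArray, newArray, readArray, writeArray)

    -- An n x n mutable unboxed matrix of Doubles, indexed by (row, col).
    newSquareMatrix :: Int -> ST s (STUArray s (Int, Int) Double)
    newSquareMatrix n = newArray ((0, 0), (n - 1, n - 1)) 0

    -- Example in-place operation: scale row i of an n x n matrix by c.
    scaleRow :: STUArray s (Int, Int) Double -> Int -> Int -> Double -> ST s ()
    scaleRow m n i c =
        mapM_ (\j -> readArray m (i, j) >>= writeArray m (i, j) . (* c)) [0 .. n - 1]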



Re: A Performance Puzzle

Claude Heiland-Allen-3
In reply to this post by Gregory Wright
Hi Gregory,

On 03/08/18 01:16, Gregory Wright wrote:
> That's an interesting point.  Could the generation of the random
> matrix be that slow?  Something to check.
It's not that it's slow by itself; I think it's that the CAF mVals :: [Double] is retained, taking ~40MB of heap, which slows down GC.

Using criterion's `env` isn't so hard, and gets a much nicer looking
heap profile graph.  See new benchmark code attached.
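
The shape of the change is roughly this (a sketch only: `luFactorFromList` and `randomEntries` are placeholder names, not the attached code or the library's API):

    import Control.Monad (replicateM)
    import Criterion.Main (bench, bgroup, defaultMain, env, nf)
    import System.Random (randomRIO)

    -- Placeholder so the sketch stands alone; the real benchmark would
    -- build the n x n matrix from these entries and run luFactor on it.
    luFactorFromList :: Int -> [Double] -> [Double]
    luFactorFromList _ entries = entries

    -- Generate the entries inside an IO action handed to `env`, so they
    -- are forced before timing starts and there is no top-level CAF
    -- holding the whole list alive for the duration of the run.
    randomEntries :: Int -> IO [Double]
    randomEntries n = replicateM (n * n) (randomRIO (-1, 1))

    main :: IO ()
    main = defaultMain
      [ bgroup "LUSolve"
          [ env (randomEntries n) $ \entries ->
              bench ("luFactor " ++ show n ++ " x " ++ show n ++ " matrix") $
                nf (luFactorFromList n) entries
          | n <- [100, 500, 1000]
          ]
      ]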

Graphs:
https://mathr.co.uk/tmp/luSolve-bench.svg
https://mathr.co.uk/tmp/luSolve-bench-env.svg

>> On 08/02/2018 05:27 PM, Gregory Wright wrote:
>>> benchmarking LUSolve/luFactor 1000 x 1000 matrix
>>>
>>> time                 1.940 s    (1.685 s .. 2.139 s)
>>>                      0.998 R²   (0.993 R² .. 1.000 R²)
>>> mean                 1.826 s    (1.696 s .. 1.880 s)
>>>
I started at mean 1.50s with your code.
>>>
>>> std dev              93.63 ms   (5.802 ms .. 117.8 ms)
>>> variance introduced by outliers: 19% (moderately inflated)
>>>

Making the `env` change and compiling with -fllvm (as suggested in
#haskell on irc.freenode.net, for a 4x speed boost) brought my time for
that benchmark to mean 257ms. +RTS -s tells me productivity is 99.1%,
which is pretty high.
I compiled the benchmark by hand for best speed, as cabal seems to add
-prof which slows the bench down slightly.
I also compiled without -threaded, because the code isn't parallelized
afaict, and parallel GC can be a bottleneck (is this still true?).


Claude
--
https://mathr.co.uk



Re: A Performance Puzzle

Dominic Steinitz-2
In reply to this post by Gregory Wright

Great news :)

Dominic Steinitz
Twitter: @idontgetoutmuch





Re: A Performance Puzzle

Dominic Steinitz-2
I just tried using Mersenne to generate the random matrices rather than `System.Random`. It is about 15% faster, so I don't think the actual RNG process is the culprit.

Dominic Steinitz
Twitter: @idontgetoutmuch




Re: A Performance Puzzle

Bertram
In reply to this post by Gregory Wright
Hi Gregory,


There are ways to improve the code; in particular, one can change the
matrix multiplication to accumulate the result of multiplying a row
by a column:

    -- Accumulate the dot product of row i of a with column j of b,
    -- threading the partial sum s through the loop as state.
    numLoopState 0 (ca - 1) 0 $ \s k -> do
        aik <- MU.unsafeRead a (i, k)
        bkj <- MU.unsafeRead b (k, j)
        return $! s + aik * bkj

At the end of the day, GHC's native code generator is pretty terrible
for tight inner loops as they arise in matrix multiplication. The
above results in an assembly language loop that does lots of useless
shuffling of registers (saving and restoring them on the stack), index
recomputation, and even checking a pointer tag to see if some closure
has already been evaluated, all in the innermost loop. (Full code at
end.)

If you use -fllvm, the code becomes quite a bit faster, and the inner
loop looks amazingly decent; there is no saving of state anymore, no
tag check, and the index computation has mostly disappeared thanks to
strength reduction:

  4114a0:       f2 0f 10 0f             movsd  (%rdi),%xmm1
     ; fetch entry from first matrix
  4114a4:       f2 0f 59 0b             mulsd  (%rbx),%xmm1
     ; multiply by entry from second matrix
  4114a8:       f2 0f 58 c1             addsd  %xmm1,%xmm0
     ; add to accumulator
  4114ac:       49 ff c5                inc    %r13
     ; increment loop counter
  4114af:       4c 01 e3                add    %r12,%rbx
     ; advance pointer into second matrix
  4114b2:       48 83 c7 08             add    $0x8,%rdi
     ; advance pointer into first matrix
  4114b6:       4c 39 ee                cmp    %r13,%rsi
     ; test for end of loop
  4114b9:       75 e5                   jne    4114a0 <cgbl_info$def+0xe0>

But this could be quite a bit faster if the code were vectorized, for
example, by processing 4 columns at the same time. I would expect a
good implementation of BLAS to do that.

To speed matrix multiplication up on a higher level, Strassen
multiplication comes to mind, though "at the price of a somewhat
reduced numerical stability" (Wikipedia).

Cheers,

Bertram


NCG inner loop:

_cdUi:
        cmpq %rbx,%r10          ; check loop upper bound
        je _cdUx                ; exit loop
_cdUr:
        movq $block_cdUp_info,-96(%rbp)
        movq %rbx,%r11          ; save %rbx in %r11
        movq %rcx,%rbx
        movq %rdx,-88(%rbp)
        movq %rsi,-80(%rbp)
        movq %rax,-72(%rbp)
        movq %rdi,-64(%rbp)
        movq %r8,-56(%rbp)
        movq %r9,-48(%rbp)
        movq %r11,-40(%rbp)
        movq %rcx,-32(%rbp)
        movq %r14,-24(%rbp)
        movq %r10,-16(%rbp)
        movsd %xmm0,-8(%rbp)    ; save lots of state
        addq $-96,%rbp
        testb $7,%bl            ; check whether (rbx) is evaluated
        jne _cdUp               ; yes -> skip closure
_cdUs:
        jmp *(%rbx)             ; enter closure
.align 8
        .quad   122571
        .quad   30
block_cdUp_info:
_cdUp:
        movq 8(%rbp),%rax       ; was: %rdx
        movq 16(%rbp),%rcx      ; was: %rsi
        movq 24(%rbp),%rdx      ; was: %rax
        movq 32(%rbp),%rsi      ; was: %rdi
        movq 40(%rbp),%rdi      ; was: %r8
        movq 48(%rbp),%r8       ; was: %r9
        movq 56(%rbp),%r9       ; was: %r11
        movq 64(%rbp),%r10      ; was: %rcx
        movq 72(%rbp),%r11      ; was: %r14
        movq 80(%rbp),%r14      ; was: %r10
        movq %rax,64(%rsp)      ; spill, was: %rdx
        movq %r14,%rax
        movq %rcx,72(%rsp)      ; spill, was: %rsi
        movq 64(%rsp),%rcx
        imulq %rcx,%rax
        addq %r11,%rax
        movq %rsi,%rcx
        addq %rax,%rcx
        movq 72(%rsp),%rax
        addq %rcx,%rax
        movq 7(%rbx),%rbx
        movq 64(%rsp),%rcx
        imulq %rcx,%rbx
        addq %r14,%rbx
        movq %rdi,%rcx
        addq %rbx,%rcx
        movq 72(%rsp),%rbx
        addq %rcx,%rbx
        movsd 16(%rdx,%rbx,8),%xmm0     ; read one array entry
        mulsd 16(%rdx,%rax,8),%xmm0     ; multiply by other array entry
        movsd 88(%rbp),%xmm1            ; add accumulator (was: %xmm0)
        addsd %xmm0,%xmm1
        addq $96,%rbp
        incq %r14
_nefT:
        movsd %xmm1,%xmm0       ; update accumulator
        movq %r10,%rcx          ; was: %rcx
        movq %r14,%r10          ; was: %r10 (now + 1)
        movq %r11,%r14          ; was: %r9
        movq %r9,%rbx           ; was: %r11 (where we saved %rbx earlier)
        movq %r8,%r9            ; was: %r9
        movq %rdi,%r8           ; was: %r8
        movq %rsi,%rdi          ; was: %rdi
        movq %rdx,%rax          ; was: %rax
        movq 72(%rsp),%rsi      ; was: %rsi
        movq 64(%rsp),%rdi      ; was: %rdx ???
        jmp _cdUi               ; loop

Re: A Performance Puzzle

Gregory Wright
In reply to this post by Claude Heiland-Allen-3

Hi Claude,

Building with LLVM is an excellent idea.  I see the same 4x performance improvement that you noted.  I also tried building the benchmarks with and without the '-threaded' option and saw no difference in run times.  Perhaps the threaded GC issues are behind us.

I'll incorporate your changes into the repo on GitHub.  Thank you.

If the obvious deficiencies have been fixed by building with LLVM, then the next place to look for improvement is the matrix multiplication.  I profiled the test last night and have not yet digested the results, but matrixMultiply still stands out as taking a lot of time.

Best Wishes,

Greg

