Discussion:
loop vs matrix multiplication: speed vs memory
Juliette Salexa
2010-04-05 02:14:06 UTC
Hello,

I've had a problem that's been bothering me for a couple of weeks now and wanted to hear other people's opinions.

Apart from cases like this: http://www.mathworks.com/matlabcentral/newsreader/view_thread/255653#664380

where the for loop is actually more efficient than all vectorized alternatives,
as a rule of thumb I think most people would agree that MATLAB's matrix multiplication is much faster than doing the summation in a for loop.

The problem I have is that when the matrices become extremely large, even the sparse representation takes up >100GB of memory, so its multiplication by a vector still ends up being slow.

If I were to do the summation in a for loop instead of using matrix-vector multiplication,
I could delete elements of the summation as they are added to the total sum ...
thereby saving on the storage costs of storing ALL elements of the sum in a big matrix.

In other words, the matrix-vector multiplication way stores ALL elements of the sum, even after some of them might not be needed anymore - while doing the summation in a for loop allows one to generate the elements on the fly, and delete them after they've been added to the total sum.
-----
In MATLAB, is the matrix-vector multiplication not itself just a for loop that is implemented efficiently because it was compiled to machine code?

If that's true, then if I code a for loop summation and compile it to machine code (using Fortran or C), would it be just as fast as matrix-vector multiplication? Or is there something else that makes the matrix-vector multiplication more efficient?

In each iteration of the for loop I might have to call an external function, or read something from a file, in order to figure out what is going to be added to the total sum. Would this slow down my code considerably?

I'm hoping there's a way to code a for loop that's just as fast as matrix-vector multiplication but doesn't store unnecessary elements
Bruno Luong
2010-04-05 07:43:04 UTC
Post by Juliette Salexa
*snip*
A for-loop by itself is not slow; what is slow is the overhead of the code inside the iteration body. Statements such as calling another function, creating a matrix, creating a scalar, or deleting them all cost time, let alone reading from a file. A for-loop in C or Fortran is fast because it usually manipulates the basic data type of the computer: a bare, large chunk of numbers in an array (a MATLAB array carries an overhead before this bare array can be accessed directly by lower-level code).

If you need to read a file to build your matrix elements, then building the matrix elements will surely cost more time than any matrix-vector multiplication, regardless of whether the latter is carried out with a for-loop or with built-in function(s).

In any case, using the PROFILER is a recommended strategy to detect what takes time in your code (note that the profiler only provides an approximation of what's going on, but it is largely enough to start with).
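For instance, a minimal profiling session looks like this (the workload below is just an arbitrary stand-in; replace it with your own code):

% Profile an arbitrary stand-in workload.
profile on
A = sprand(2000, 2000, 0.01);   % build a sparse test matrix
y = A * rand(2000, 1);          % multiply it by a random vector
profile off
profile viewer                  % opens a report of where time was spent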

Bruno
Matt J
2010-04-05 10:49:03 UTC
Post by Juliette Salexa
The problem I have is that when the matrices become extremely large, even the sparse representation takes up >100GB of memory, so its multiplication by a vector still ends up being slow.
========

It might be worth describing the application, so that we can assess if there is an unnecessary inefficiency here. I've never seen a situation requiring anything close to a 100GB sparse matrix.
James Tursa
2010-04-05 15:36:05 UTC
Post by Juliette Salexa
*snip*
I have re-read your post several times now and am still not sure what your exact question is. Are you asking specifically about the operation A*x where A is a 2-D matrix and x is a 1-D vector? And are you asking how MATLAB does this and whether you could do it more efficiently (or as efficiently) by hand coding your own and compiling it? If so, the answer generally is no. For full matrices, MATLAB calls a 3rd-party highly optimized BLAS library for this (and similar) operations, and it is unlikely you will be able to meet or beat this library for speed. Presumably the writers of this library have taken into account the cache size etc. of the machine to optimize memory access for the operation. I doubt there is any significant memory-wasting code in the library. My experience in writing my mtimesx submission has shown only some limited cases, involving complex variables, where hand-written code can compete with it.
Post by Juliette Salexa
In other words, the matrix-vector multiplication way stores ALL elements of the sum, even after some of them might not be needed anymore
It seems you are implying that the MATLAB BLAS library does this for the A * x operation and you have some evidence of it. What exact operation in particular do you think stores all elements of a sum in memory before adding them together?

If you want to try it, there are several implementations of the BLAS library routines for full matrices available for free on the web, both Fortran and C source code. You can download them and try them out. e.g., the Fortran ones can be found at www.netlib.org. I seriously doubt that you will meet or beat the MATLAB BLAS library for any of these self-compiled codes, however (excepting my own hand written code in mtimesx for some specialized cases).
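As a rough illustration of how hard the built-in is to beat, here is a sketch of the comparison (the size is arbitrary; pick n so the full matrix fits in RAM):

% Compare the BLAS-backed built-in A*x against an explicit loop.
n = 5000;                   % arbitrary size
A = rand(n);
x = rand(n,1);

tic; y1 = A*x; toc          % built-in: calls the optimized BLAS

y2 = zeros(n,1);            % preallocate the result
tic
for i = 1:n
    y2(i) = A(i,:)*x;       % one dot product per row
end
toc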

James Tursa

(You can store sparse matrices > 100GB in memory on your computer?)
Juliette Salexa
2010-04-06 20:02:19 UTC
Post by James Tursa
Are you asking specifically about the operation A* x where A is a 2D matrix and x is a 1d vector?
Thank you very much James,

Yes, I apologize for not being clear
Post by James Tursa
It seems you are implying that the MATLAB BLAS library does this for the A * x operation and you have some evidence of it. What exact operation in particular do you think stores all elements of a sum in memory before adding them together?
I realize that I was very unclear in my initial post (sorry!)

In order to multiply a (10^5) x (10^5) matrix by a (10^5 x 1) vector, I need to have that matrix and vector stored in MATLAB in the first place, which would require hundreds of gigabytes of RAM. Even if it weren't stored in MATLAB, but just stored in a file on my hard drive (which would of course slow me down significantly), I wouldn't want to store all of the elements of the matrix-vector multiplication at the same time.

Instead, let's say I did the matrix-vector multiplication with my own for-loop.
Let's say M(i,j) is my matrix and V(k) is my vector.

I create the elements of M(i,j) on the fly as they are needed, multiply them by the appropriate V(k) values, and then delete them when they are no longer needed.

For example,

M(1,1) = rand;
total = M(1,1)*V(1);   % 'total' avoids shadowing the built-in SUM
clear M

M(1,2) = rand;
total = total + M(1,2)*V(2);
clear M

M(1,3) = rand;
total = total + M(1,3)*V(3);
clear M

.... looped ....

*Of course M(1,1) could just have been M, in the above example, but I put the indices for illustrative purposes*

In this way, I don't need to store *all elements at the same time* , which would take a lot of space, but instead I can just store the ones I need at the moment.
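(As a sketch, the looped version of the same idea would look like the following, where generate_element is just a hypothetical stand-in for whatever instructions actually produce M(i,j):)

% On-the-fly accumulation of one row of M*V.
% generate_element is hypothetical: it stands in for however
% M(i,j) is really computed.
n = numel(V);
total = 0;
for j = 1:n
    m = generate_element(1, j);   % create one element...
    total = total + m*V(j);       % ...use it, then let it be overwritten
end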
-------------------------
Although using MATLAB's internal matrix-vector multiplication function would require storage of huge arrays, the summation itself would be done extremely fast.

The disadvantage of my above for-loop is that although the memory requirement is negligible, the computation is very slow (because the elements are not generated by RAND, but by a more complicated set of instructions).

I was wondering, if I did my above for-loop in Fortran instead of MATLAB, whether the speed of my summation could compete with MATLAB's speed for matrix-vector multiplication. That way I could save on memory cost without sacrificing too much speed.

*I hope I'm making sense ... =) *
Post by James Tursa
And are you asking how does MATLAB do this and could you do it more efficiently (or as efficiently) by hand coding your own and compiling it? If so, the answer generally is no. For full matrices, MATLAB calls a 3rd party highly optimized BLAS library for this (and similar) operations and it is unlikely you will be able to meet or beat this library for speed. Presumably the writers of this library have taken into account the cache size etc. of the machine to optimize memory access etc. for the operation.
Is it possible for the BLAS library to have optimized code for my matrix-vector multiplication, where the elements are created and deleted only when necessary?
Post by James Tursa
(You can store sparse matrices > 100GB in memory on your computer?)
Not my own computer, but I submit my jobs to a cluster: 16 CPUs, each with 32GB of RAM!
Post by Matt J
It might be worth describing the application, so that we can assess if there is an unnecessary inefficiency here. I've never seen a situation requiring anything close to a 100GB sparse matrix.
Thank you Matt J, it's a Feynman integral ... a summation over 49^12 paths, each with a different amplitude, with no symmetry or cancellations ... I've thought about it for a while, people have written papers about this since the early 1990's, and the 100GB matrix is pretty much as small as it gets.



Also, thank you Bruno for your comments, those were helpful too

Juliette
Bruno Luong
2010-04-06 20:20:18 UTC
Post by Juliette Salexa
Thank you Matt J, it's a Feynman integral .. summation over 49^12 paths, each with different amplitude, with no symmetry or cancellations ...
Are you sure you can do it with your supercomputer?

Even if you can compute 10^9 paths in 1 second, it will take about 6075 years to compute all the paths:

round(49^12/(1e9*3600*24*365))

Bruno
Juliette Salexa
2010-04-06 20:39:23 UTC
Post by Bruno Luong
Even if you can compute 10^9 paths in 1 second, it will takes about 6075 years to compute all the terms of paths.
round(49^12/(1e9*3600*24*365))
Bruno
You're right, sorry, I screwed that up big time.
The dimension of the Hilbert space is 7, so I should have said 7^12 or 49^6 paths.

A complex double precision matrix with 7^12 elements is ~220 GB, and I can cut it down to around 100GB by ignoring the paths with very small amplitude.

It's still a pretty computationally expensive problem, but I think doable and very worthwhile!
Matt J
2010-04-06 20:52:05 UTC
Post by Juliette Salexa
Although using matlab's internal matrix-vector multiplication function would require storage of huge arrays, the summation itself would be done extremely fast.
The disadvantage of my above forloop is that although the memory requirement is negligible, the computation is very slow (because the elements are not generated by RAND, but by a more complicated set of instructions).
================


No, that's not at all clear. Even if there were a RAM chip big enough to hold a matrix of this size, and even if MATLAB had access to all of this RAM, the bottleneck in summation speed comes from the need to transfer the matrix data between the RAM and the CPU. It's not at all clear that these data transfers would be faster than the computations you would use to generate the M(i,j) elements on the fly. Assessing that would, at minimum, require more detail about how the M(i,j) are generated.

Anyway, to answer your question: people often get much faster code, even for much smaller matrices than this, by generating their M(i,j) and doing the summation on the fly. It's done all the time in tomographic imaging, for example. However, it does depend, as I said, on how much computation the on-the-fly operations really require and how this compares to bus speed.
Juliette Salexa
2010-04-06 23:59:07 UTC
Thank you Matt J,
Post by Matt J
No, that's not at all clear. Even if there was a RAM chip big enough to hold a matrix of this size and even if MATLAB had access to all of this RAM, the bottleneck in summation speed comes from the need to transfer the matrix data between the RAM and CPU. It's not at all clear that these data transfers would be faster than the computations you would use to generate the M(i,j) elements on the fly. Assessing that would, at minimum require my detail about how the M(i,j) are generated.
Okay, well I still have to work out how the M(i,j)'s will be calculated. I suspect they will be something like: A(3,4)*B(1,4)*C(2,4)*D(2,2)*...*P(4,3), where A through P are stored 16x16 matrices, and the indices are each chosen from 1:4 based on what i and j are.


I would suspect (just intuitively) that these computations would be more expensive than the data transfers between RAM and CPU.

On my laptop with 4GB ram,

For n = 117649, it takes almost negligible time to multiply an nxn sparse matrix by an nx1 vector:

tic; A=(sprand(n,n,0.001));toc;
Elapsed time is 82.586749 seconds.

tic; B=(sprand(n,1,1));toc;
Elapsed time is 0.001320 seconds.

tic; A*B; toc;
Elapsed time is 0.016662 seconds.

Attempting to create A=(sprand(n,n,0.002)) failed because I ran out of memory (probably background processes) ... so I think the computation above was pushing the memory limit of my computer

The majority of the time was spent creating the matrix A, and almost zero time doing the summation.

[btw, any idea why it takes almost no time to create the vector B, while it takes more than a minute to create the matrix A? A only has 118 times the number of non-zero elements of B!!!]
Post by Matt J
Anyway, to answer your question, I guess, people often get a lot faster code, even for much smaller matrices then this, by generating their M(i,j) and doing the summation on the fly. It's done all the time in tomographic imaging for example. However, it does depend, as I said on how much computation the on the fly operations really require and how this compares to bus speed.
I think this for-loop method that I described in my last post has another advantage: it is easily parallelized (it can be done using PARFOR, and then the partial sums on each machine can be added together at the end), while with matrix-vector multiplication I don't see how it could be parallelized easily (but I'm just guessing here and could be totally wrong).
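(A sketch of that idea, again with a hypothetical generate_element standing in for the real computation; PARFOR treats total as a reduction variable, so the partial sums are combined automatically:)

% Parallel on-the-fly accumulation; total is a parfor reduction variable.
% generate_element is hypothetical.
n = numel(V);
total = 0;
parfor j = 1:n
    total = total + generate_element(1, j)*V(j);
end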
Steven Lord
2010-04-07 14:23:47 UTC
Post by Juliette Salexa
Thank you Matt J,
*snip*
Post by Juliette Salexa
On my laptop with 4GB ram,
For n=117649
tic; A=(sprand(n,n,0.001));toc;
Elapsed time is 82.586749 seconds.
tic; B=(sprand(n,1,1));toc;
Elapsed time is 0.001320 seconds.
tic; A*B; toc;
Elapsed time is 0.016662 seconds.
Attempting to create A=(sprand(n,n,0.002)) failed because I ran out of
memory (probably background processes) ... so I think the computation
above was pushing the memory limit of my computer
The majority of the time was spent creating the matrix A, and almost zero
time doing the summation.
[btw, any idea why it takes almost no time to create the vector B, while
it takes more than a minute to create the matrix A ? A only has 118 times
the number of non-zero elements as B!!! ]
A has n times as many columns as B. Due to the way sparse matrices are
stored, this has a significant effect on how much memory is required, and
that memory allocation is likely what took most of the time in the creation
of A. If your call that creates A was the first call to SPRAND in your
session, that could also have contributed somewhat to the amount of time
needed to create A.
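For large sparse matrices, one allocation-friendly pattern (sketched here with made-up sizes, and RANDI just fabricating indices for illustration) is to collect the triplets first and call SPARSE once:

% Build a large sparse matrix in one SPARSE call from (i,j,s) triplets,
% rather than assigning elements one at a time (which reallocates).
n = 117649;
nnzA = 1e6;                 % arbitrary number of nonzeros
i = randi(n, nnzA, 1);      % row indices (fabricated for illustration)
j = randi(n, nnzA, 1);      % column indices
s = rand(nnzA, 1);          % values
A = sparse(i, j, s, n, n);  % duplicate (i,j) pairs are summed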
--
Steve Lord
***@mathworks.com
comp.soft-sys.matlab (CSSM) FAQ: http://matlabwiki.mathworks.com/MATLAB_FAQ
Matt J
2010-04-07 21:54:04 UTC
Post by Juliette Salexa
The majority of the time was spent creating the matrix A, and almost zero time doing the summation.
===============

Apples and oranges. sprand is written in M-code, whereas the summation is executed almost entirely in optimized C-code.
Matt J
2010-04-07 23:10:07 UTC
Post by Juliette Salexa
I would suspect (just intuitively) that these computations would be more expensive than the data transfers btwn RAM and CPU.
=====

Even if that's true, the fact is that in the real world you would have no choice but to store the matrix on the hard drive, not in RAM, if you were to precompute it. So, what you really need to be analyzing is whether the computation time of M(i,j) outweighs the data transfer time from disk, rather than the data transfer time from RAM.

If it does, it might be worthwhile to pre-compute your M and store it on the hard drive after all. Note, also, that this might not rule out using MATLAB to do the matrix-vector multiplication: MATLAB can access data stored on disk as if it were an in-memory array using the MEMMAPFILE function. You could try it, I suppose...
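(A rough sketch of that route, assuming the matrix was precomputed and written to a flat binary file as consecutive column panels; the file name, sizes, and layout here are all made up, and V is assumed to already exist as an n-by-1 vector:)

% Stream a precomputed M from disk with MEMMAPFILE and accumulate
% y = M*V panel by panel. 'M.bin' is assumed to hold consecutive
% n-by-blockCols double panels in column-major order.
n = 117649;                  % 7^6, as in the earlier posts
blockCols = 343;             % made up; chosen so it divides n exactly
m = memmapfile('M.bin', 'Format', {'double', [n blockCols], 'blk'});

y = zeros(n,1);
for k = 1:(n/blockCols)
    cols = (k-1)*blockCols + (1:blockCols);
    y = y + m.Data(k).blk * V(cols);   % one disk read per panel
end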

Typically, though, for matrices of this size (100GB), people don't expect pre-computation to be optimum and opt for on-the-fly computation of the M(i,j). However, I don't know how thoroughly this assessment is ever made.
Juliette Salexa
2010-04-10 22:46:05 UTC
Post by Matt J
in the real world, you would have no choice but to store the matrix on hard drive, if you were to precompute it, as opposed to RAM. So, what you really need to be analyzing is whether the computation time of M(i,j) outweighs data transfer time from disk, rather than with data transfer from RAM.
Thanks Matt J,

I've calculated the RAM requirements, and for the cluster that I'm using, storing the matrix won't be a problem.
Post by Matt J
Apples and oranges. sprand is written in M-code, whereas the summation is executed almost entirely in optimized C-code
Thanks for pointing that out; I realize it was a bad example now that I see the summation had an unfair advantage. I'm starting to see that what I asked originally is not easy to analyze and needs to be examined on a case-by-case basis.
vortse a
2010-04-11 12:15:07 UTC
If I were you, I would write the for-loop to compute on the fly a submatrix of M large enough to be stored in RAM, do the multiplication of that submatrix with the corresponding portion of V, and add this to your sum. This way you will drastically reduce the number of loop iterations you have to perform, and you can optimise the submatrix size for performance. Doesn't the computation of the elements of M get a speedup from vectorisation as well?
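(A sketch of that blocking scheme; generate_block is hypothetical, standing in for whatever vectorised code produces one panel of M, and V is assumed to exist:)

% Block-wise on-the-fly mat-vec: accumulate y = M*V over column panels.
% generate_block is hypothetical.
n = numel(V);
blockCols = 1000;                    % tune so one panel fits in RAM
y = zeros(n,1);
for j0 = 1:blockCols:n
    cols = j0:min(j0+blockCols-1, n);
    Mblk = generate_block(cols);     % n-by-numel(cols) panel of M
    y = y + Mblk*V(cols);            % use the panel, then discard it
end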
Juliette Salexa
2010-04-12 16:41:04 UTC
Post by vortse a
If I were you I would write the forloop to compute on the fly a submatrix of M large enough to be stored in the RAM, do the multiplication of said submatrix with the corresponding portion of V and add this to your sum. This way you will drastically reduce the number of loops you will have to perform and you can optimise the submatrix size for performance. Doesn't the computation of the elements of M get a speedup from vectorisation as well?
Hello, thank you for the suggestion,

Using submatrices seems like a very sensible thing to do.

As for whether or not the computations to create the matrix M can be sped up by vectorization: that's another problem I've been thinking about, which is probably even bigger than the question I posted. It would involve constructing a tensor with as many elements as the for-loop has iterations, whose rank is somewhere around log base 4 of the number of elements in the for-loop ...

In theory I expect that this approach would be faster than using a nested for-loop (a for-loop inside each iteration of the for-loop for the matrix-vector multiplication), because very many iterations of the loop could be taken care of at once (one can imagine constructing M from the outer product of two vectors; that would be faster than calculating the elements of M with a for-loop) ... but I think it would depend on how well the tensorPackage that I'm using is coded. I'm not sure if the overhead of using the tensorPackage would outweigh the cost of calculating the elements in a for-loop.
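(To illustrate the outer-product remark with a toy comparison; sizes are arbitrary:)

% Toy comparison: building M as an outer product vs. element by element.
n = 2000;
u = rand(n,1);
v = rand(n,1);

tic; M1 = u*v.'; toc          % vectorized outer product

M2 = zeros(n);                % preallocated, then filled in a loop
tic
for i = 1:n
    for j = 1:n
        M2(i,j) = u(i)*v(j);
    end
end
toc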

Thanks,
Juliette.
Yves
2010-04-17 12:13:06 UTC
Dear all,

I encountered a similar problem, where MATLAB's matrix multiplication turned out to be inefficient. The problem was the following:

I needed to compute y = A*x, where A is a large square matrix with non-zero elements only in the first row and first column. I noticed that coding this by hand was a lot more efficient than using MATLAB's multiplication. Here's the code to show you:

------------------------
clear all

N = 10000;
gk = 0.01;

A = zeros(N,N);

x = zeros(N,1);
x(1) = 1;

A(1,2:end) = ones(1,N-1)*gk;
A(2:end,1) = -ones(N-1,1)*gk;

%multiplication method 1
tic
y1=A*x;
toc

%multiplication method 2
y2=zeros(N,1);
tic
y2(1) = A(1,:)*x;
y2(2:end) = A(2:end,1)*x(1);
toc
----------------------

It turns out that on my computer, for N = 10000, the first method is more than 100 times slower than the second method. Like James, I was under the impression that MATLAB was highly optimized when it comes to matrix operations. Does anybody know what's going on here?

Best,
Yves



Yi Cao
2010-04-17 14:53:05 UTC
Post by Yves
Dear all,
------------------------
clear all
N = 10000;
gk = 0.01;
A = zeros(N,N);
x = zeros(N,1);
x(1) = 1;
A(1,2:end) = ones(1,N-1)*gk;
A(2:end,1) = -ones(1,N-1)*gk;
%multiplication method 1
tic
y1=A*x;
toc
%multiplication method 2
y2=zeros(N,1);
tic
y2(1) = A(1,:)*x;
y2(2:end) = A(2:end,1)*x(1);
toc
----------------------
It turns out that on my computer for N=10000 the first method is more than 100 times slower than the second method. As James I was under the impression that matlab was highly optimized when it comes to matrix operations. Does anybody know what's going on here?
Best,
Yves
Your matrix A is sparse, but you treated it as a full matrix when you computed y = A*x. Your second method took advantage of the sparsity. You can try

B = sparse(A);
tic
y3 = B*x;
toc

to see how efficient sparse matrix-vector multiplication is.

HTH
Yi
