remove duplicate rows in sparse matrix

Discussion:

(too old to reply)

Ross Anderson

2010-07-09 20:22:05 UTC

Hi all,

I have Ax=B where A and B are sparse. I want to reduce this so any row ax=b is unique, but I don't have the resources to make A and B full first. If I could, I could say

Afull = [full(A) full(B)];
[newmat,index] = unique(Afull,'rows','first');
repeatedIndex = setdiff(1:size(Afull,1),index);
Af = full(A);
Bf = full(B);
Af(repeatedIndex,:) = [];
Bf(repeatedIndex)=[];
A = sparse(Af);
B = sparse(Bf);

but is there a way to do this without un-sparsifying A and B?
I have the components of A and B in vectors
A = sparse(rows(:,1),rows(:,2),rows(:,3),numrows,sizeX);
and I can guarantee that for a row the non-zero elements are in ascending order. I was thinking I might search the matrix rows for a block [rows(:,2) rows(:,3)] with a different rows(:,1). But that seems overly complicated and slow.

Suggestions?

Bruno Luong

2010-07-10 09:45:19 UTC

Permalink

Post by Ross Anderson
Hi all,
I have Ax=B where A and B are sparse. I want to reduce this so any row ax=b is unique, but I don't have the resources to make A and B full first. If I could, I could say
Afull = [full(A) full(B)];
[newmat,index] = unique(Afull,'rows','first');
repeatedIndex = setdiff(1:size(Afull,1),index);
Af = full(A);
Bf = full(B);
Af(repeatedIndex,:) = [];
Bf(repeatedIndex)=[];
A = sparse(Af);
B = sparse(Bf);
but is there a way to do this without un-sparsifying A and B?
I have the components of A and B in vectors
A = sparse(rows(:,1),rows(:,2),rows(:,3),numrows,sizeX);
and I can guarantee that for a row the non-zero elements are in ascending order. I was thinking I might search the matrix rows for a block [rows(:,2) rows(:,3)] with a different rows(:,1). But that seems overly complicated and slow.
Suggestions?

I can't see any short and efficient way to Matlab code this problem. But here are few ideas:

We can define a relational order between two rows r1 and r2 as following

r1 < r2 <=> issorted([r1; r2], 'rows');

Post by Ross Anderson

r1=ceil(10*rand(1,5));
r2=ceil(10*rand(1,5));
[r1; r2]

ans =

1 1 9 7 4
10 1 5 4 8

Post by Ross Anderson

issorted([r1; r2],'rows')

ans =

1

Now you can use this specific definition to sort the rows of the concatenated sparse matrix. You have to code your own sorting algorithm though to make it works on SPARSE without converting to FULL, because the comparison operator is specific and no built-in sorting is able to do that.

After sorting, you can easily remove duplicated rows because they are now grouped together.

Two practical considerations: It is better to program a sorting without moving the matrix data, but keep the sorted index. You might also work on COLUMN rather than ROW because extraction column on sparse matrix is much faster than row due to the internal storing.

Bruno

Bruno Luong

2010-07-10 10:21:05 UTC

Permalink

The SORTROWS command seems to accept SPARSE matrix, but it is very slow possibly because of slow ROW accessing I mentioned earlier.

In any case, I dig out one of my implementation of HEAPSORT below. With this code you could have few more possibilities to play around with (reverse column/row, inplace mex implementation of comparison operator, etc...).

% Bruno

function a = heapsort(a, ltfun, varargin)
% function a = heapsort(a, ltfun, param1, ...)
%
% Sort the vector A with custumized comparison function.
% ltfun(ai, aj, param1, ...) must return TRUE when ai < aj,
% where ai, aj are two elements of a(:), "<" is user-specific relation
%
% Method: Heapsort algorithm.
%
% Example:
% % Data
% a = ceil(10*rand(5,7))
% % Function to compare two rows of matrix a
% ltfun = @(i,j,a) issorted(a([i j],:),'rows');
%
% idx = heapsort(1:size(a,1),ltfun,a);
%
% asorted = a(idx,:)
% issorted(asorted, 'rows')
%
% Author: Bruno Luong <brunoluong@????.com>

% Defauut comparison function
if nargin<2 || isempty(ltfun)
ltfun = @lt;
end

k = length(a);

%%
% sucessively insert elements to the heap
for last=1:k
child = last;
parent = floor(child/2);
ai = a(last);
% up-heap loop of ai
while parent>0
% heap condition violated
if ltfun(a(parent), ai, varargin{:}) % a(parent) < ai % comparison
a(child) = a(parent);
child = parent;
parent = floor(child/2);
else
break % of while-loop, heap ok
end
end % while loop

a(child) = ai; % here is the final home of ai
end % for-loop

%%
% Remove heap root (currently largest), replace it by the last element
% then make modified heap valid again
for last=k:-1:1
ai = a(last); % try to move ai at the root then move it down
a(last) = a(1); % push the max (root of the heap) to the position
parent = 1;
% down-heap loop of ai
while true
c1 = 2*parent;
if c1 >= last % no child
break
end
left = a(c1);
if c1 == last-1 % there is no right leaf
child = c1;
ac = left;
else
c2 = c1+1;
right = a(c2);
if ltfun(left,right,varargin{:}) % left < right % which leaf is the largest
child = c2;
ac = right;
else
child = c1;
ac = left;
end
end % if

% heap condition violated
if ltfun(ai,ac,varargin{:}) % ai < ac % comparison
a(parent) = ac;
parent = child;
else
break % of while-loop, heap ok
end

end % while loop

a(parent) = ai; % here is the final home of ai
end % for-loop

end % heapsort

Bruno Luong

2010-07-10 13:15:23 UTC

Permalink

SORTROWS on sparse matrix is definitively a consuming task, 6 minutes for matrix of size 100 thousands.

a=sprand(1e5,1e5,1e-4);
tic; b=sortrows(a); toc

Elapsed time is 361.659520 seconds.

Bruno

2010-07-10 20:59:06 UTC

Permalink

one of the other solutions is outlined below
- note: run this in a function(!)...

function atest
% the data
n=1000000;
a=zeros(n,3);
b=zeros(n,3);
a([1,3,9],:)=repmat(10*(1:3),3,1);
b([3,4,8],:)=repmat(10*(1:3),3,1);
a(10,:)=ceil(100*rand(1,3));
b([2,5],:)=ceil(-100*rand(2,3));
as=sparse(a);
bs=sparse(b);
% the engine
tic;
[ax,ay,av]=find([as,bs]);
ss=unique([ax,ay,av],'rows','first');
r=sparse(ss(:,1),ss(:,2),ss(:,3));
r=unique(full(r),'rows','first');
t1=toc;
tic;
da=unique([as,bs],'rows','first'); % <- ori data
t2=toc;
% the result
disp(isequal(r,full(da)));
disp([{'t1';'t2';'gain'},num2cell([t1;t2;t2/t1])]);
%{
% wintel: ic2/2*2.6ghz/2gb/winxp.sp3.32/r2010a.32
1 % <- equal results
't1' [0.0011105]
't2' [ 0.43159]
'gain' [ 388.63]
% ru =
0 0 0 -45 -77 -66
0 0 0 -45 -66 -35
0 0 0 0 0 0
0 0 0 10 20 30
10 20 30 0 0 0
10 20 30 10 20 30
24 12 61 0 0 0
%}
end

us

Matt J

2010-07-11 00:18:05 UTC

Permalink

============

I don't see why you think you needed to unsparsify. UNIQUE seems to accept sparse input just fine, at least in R2009b. So I'm wondering why the solution isn't as simple as the following:

n=size(A,2);
Aug=unique( [A,B], 'rows', 'first');

A=Aug(:,1:n);
B=Aug(:,n+1:end);