
Sparse Matrix in Data Processing

It is wasteful to store zero elements in a sparse matrix, especially for incrementally constructed data. When building tf-idf or bag-of-words features, or saving a graph adjacency matrix, inefficient sparse matrix storage can lead to memory errors. To circumvent this problem, efficient sparse matrix storage formats are the way to go.

Sparse matrix

A matrix is sparse when its sparsity is greater than 0.5, where the sparsity of a matrix is the # of zero-valued elements divided by the total # of elements (which is equal to 1 minus the density of the matrix).[2]
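As a quick check, sparsity can be computed directly; a minimal sketch with NumPy (the array here is purely illustrative):

import numpy as np

A = np.array([[1, 0, 0],
              [0, 2, 0],
              [0, 0, 0]])
sparsity = 1.0 - np.count_nonzero(A) / A.size
print(sparsity)  # 6 zeros out of 9 elements: 0.67 > 0.5, so A counts as sparse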

Storage

Each entry in the array represents an element of the matrix and is accessed by two indices $i$ (row index) and $j$ (column index).
Sparse matrix formats can substantially reduce memory requirements by storing only the non-zero entries.

Formats can be divided into:

  • those that support efficient modification, such as DOK, LIL, or COO, which are typically used to construct the matrices;
  • those that support efficient access and matrix operations, such as CSR or CSC.

Dictionary of keys (DOK)

  • DOK stores a dictionary that maps (row, column)-pairs to the value of the elements. Elements that are missing from the dictionary are zeros.
  • DOK based sparse matrix is efficient for constructing sparse matrices incrementally.
  • Allows for efficient O(1) access of individual elements. Duplicates are not allowed. Can be efficiently converted to a coo_matrix once constructed.[3]
class scipy.sparse.dok_matrix(
    arg1, shape=None, dtype=None, copy=False
)

# with a dense matrix, D
dok_matrix(D)

# with a sparse matrix, S
dok_matrix(S)

# create the matrix with initial shape (M, N); dtype is optional, defaulting to dtype='d'
dok_matrix((M, N), [dtype])

"""
attributes:
---------------------
dtype (dtype): Data type of the matrix
shape (2-tuple): Shape of the matrix
ndim (int): Number of dimensions (this is always 2)
nnz: Number of nonzero elements
"""

List of lists (LIL)

  • LIL stores one list per row, with each entry containing the column index and the value, where entries are sorted by column index for faster indexing.[4]
  • Also efficient for constructing sparse matrices incrementally.
class scipy.sparse.lil_matrix(
    arg1, shape=None, dtype=None, copy=False)
""" Row-based list of lists sparse matrix """

lil_matrix(D) # with a dense matrix or rank-2 ndarray D

lil_matrix(S) # with another sparse matrix S (equivalent to S.tolil())

lil_matrix((M, N), [dtype]) # to construct an empty matrix with shape (M, N); dtype is optional, defaulting to dtype='d'.

"""
attributes:
---------------------
dtype: dtype. Data type of the matrix

shape: 2-tuple. Shape of the matrix.

ndim: int. Number of dimensions (this is always 2)

nnz: Number of stored values, including explicit zeros.

data: LIL format data array of the matrix

rows: LIL format row index array of the matrix
"""

Advantages of the LIL format

  • supports flexible slicing
  • changes to the matrix sparsity structure are efficient

Disadvantages of the LIL format

  • arithmetic operations LIL + LIL are slow (consider CSR or CSC)
  • slow column slicing (consider CSC)
  • slow matrix vector products (consider CSR or CSC)

Intended Usage

  • LIL is a convenient format for constructing sparse matrices
  • once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations (see the sketch below)
  • consider using the COO format when constructing large matrices
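A minimal sketch of this workflow (the entries are illustrative):

import scipy.sparse as sp

L = sp.lil_matrix((4, 4))
L[0, 1] = 3                 # cheap change to the sparsity structure
L[2, :] = [0, 0, 7, 0]      # flexible row slicing/assignment
A = L.tocsr()               # convert for fast arithmetic and matvec
print(A.dot([1, 1, 1, 1]))  # matrix vector product, done in CSR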

Coordinate list (COO)

  • COO stores a list of (row, column, value) tuples, where entries are sorted first by row index and then by column index. Also known as the ijv or triplet format.[5]
  • Also efficient for constructing sparse matrices incrementally.
  • A[i[k], j[k]] = data[k]
class scipy.sparse.coo_matrix(
    arg1, shape=None, dtype=None, copy=False)
"""
coo_matrix(D) # with a dense matrix D

coo_matrix(S) # with another sparse matrix S (equivalent to S.tocoo())

coo_matrix((M, N), [dtype]) # to construct an empty matrix with shape (M, N); dtype is optional, defaulting to dtype='d'.

coo_matrix((data, (i, j)), [shape=(M, N)]) # to construct from three arrays:
1. data[:]   the entries of the matrix, in any order
2. i[:]      the row indices of the matrix entries
3. j[:]      the column indices of the matrix entries
Where A[i[k], j[k]] = data[k]. When shape is not specified, it is inferred from the index arrays.
"""

Advantages of the COO format

  1. facilitates fast conversion among sparse formats
  2. permits duplicate entries (see example)
  3. very fast conversion to and from CSR/CSC formats

Disadvantages of the COO format

  • does not directly support arithmetic operations and slicing

Intended Usage

  1. COO is a fast format for constructing sparse matrices
  2. Once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations
  3. By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)
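The duplicate-summing behavior can be verified with a short sketch (the values are illustrative):

import numpy as np
import scipy.sparse as sp

row  = np.array([0, 0, 1])
col  = np.array([0, 0, 2])
data = np.array([1, 2, 3])
A = sp.coo_matrix((data, (row, col)), shape=(2, 3))
print(A.tocsr().toarray())  # duplicate entries at (0, 0) are summed: [[3 0 0], [0 0 3]]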

Compressed sparse row (CSR, CRS or Yale format)

The compressed sparse row (CSR), compressed row storage (CRS), or Yale format stores the matrix in three 1-dim arrays (data, col_indices, row_indptr), containing respectively the non-zero values, their column indices, and the per-row start offsets.
Let NNZ denote the # of nonzero entries.[6]

  • The arrays data (non-zero values) and col_indices (column indices) are of length NNZ.
  • The array row_indptr has one element per row plus one (length M + 1 for an M-row matrix) and encodes the index in data where each row starts.
  • The first value of row_indptr is always 0 and the last is always NNZ.

Instantiate with:

  1. csr_matrix(D) , with a dense matrix or rank-2 ndarray D
  2. csr_matrix(S) , with another sparse matrix S (equivalent to S.tocsr())
  3. csr_matrix((M, N), [dtype]) , to construct an empty matrix with shape (M, N); dtype is optional, defaulting to dtype='d'.
  4. csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)]), where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].
  5. csr_matrix((data, indices, indptr), [shape=(M, N)]), where for each row i (see the sketch below):
    • col = indices[indptr[i]: indptr[i+1]]
    • A[i, col] = data[indptr[i]: indptr[i+1]]
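Form 5 mirrors the CSC example in the next section; a small sketch using the same arrays, interpreted row-wise:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> indptr = np.array([0, 2, 3, 6])          # row_indptr
>>> indices = np.array([0, 2, 2, 0, 1, 2])   # col_indices
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])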

Advantages of the CSR format

  • efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
  • efficient row slicing
  • fast matrix vector products

Disadvantages of the CSR format

  • slow column slicing operations (consider CSC)
  • changes to the sparsity structure are expensive (consider LIL or DOK)

Compressed sparse column (CSC or CCS)

CSC is similar to CSR except that values are read first by column, storing row indices for each value.

>>> indptr = np.array([0, 2, 3, 6]) # col_indptr
>>> indices = np.array([0, 2, 2, 0, 1, 2]) # row_indices
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csc_matrix((data, indices, indptr), shape=(3, 3)).toarray()
array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]])

Advantages of the CSC format

  • efficient arithmetic operations CSC + CSC, CSC * CSC, etc.
  • efficient column slicing
  • fast matrix vector products (CSR, BSR may be faster)

Disadvantages of the CSC format

  • slow row slicing operations (consider CSR)
  • changes to the sparsity structure are expensive (consider LIL or DOK)

Shuffle sparse matrix

When generating batches of training/test data, sklearn.utils.shuffle can help.

from sklearn.utils import shuffle
import numpy as np
import scipy.sparse as sp

# data, row, col, shape, x, y, seed, num_samples, bsz are defined elsewhere
# creating coo matrix, then convert to CSR to allow row slicing
x_coo = sp.coo_matrix((data, (row, col)), shape=shape)
x_csr = x_coo.tocsr()
# shuffle the dense arrays and the sparse matrix consistently
x, y, x_csr = shuffle(x, y, x_csr, random_state=seed)
# in a data generator
while True:
    for offset in range(0, num_samples, bsz):
        x_batch = np.array(x[offset: offset + bsz])
        y_batch = np.array(y[offset: offset + bsz])
        x_csr_batch = x_csr[offset: offset + bsz, :]  # sparse batch slice
        # yield x_batch, y_batch, x_csr_batch

Feed to TensorFlow

Convert to dense matrix

x_dense_batch = x_csr_batch.todense()
# then feed to `tf.placeholder`

Feed to tf.sparse.placeholder

import numpy as np
import tensorflow as tf
sess = tf.compat.v1.Session()

sparse_input = tf.compat.v1.sparse.placeholder(tf.float32, (None, num_words), name='sp_placeholder')
x_coo = x_csr.tocoo()
sp_vals = tf.compat.v1.SparseTensorValue(np.array([x_coo.row, x_coo.col]).T, x_coo.data, x_coo.shape)
# feed dict
sess.run(train_op, {sparse_input: sp_vals})

After feeding into the TensorFlow computational graph, use tf.sparse ops, since a SparseTensor cannot be directly fed into a dense Tensor op!

  1. Use tf.sparse

    w = tf.Variable(tf.random.normal([dim1, dim2], mean=.0, stddev=.01))
    b = tf.Variable(tf.random.normal([dim2], mean=.0, stddev=.01))
    sp_x = tf.sparse.sparse_dense_matmul(sparse_input, w) + b
  2. Convert the SparseTensor to dense Tensor

    tf.sparse.to_dense(
        sp_input, default_value=None, validate_indices=True, name=None
    )
    """
    Args:
    -------------------
    sp_input: SparseTensor
    default_value: Scalar value to set for indices not specified in sp_input. Defaults to zero.
    validate_indices: boolean (default True). If True, indices are checked to assure sorted lexicographic order and no repeats.
    name (optional): a name prefix for returned tensors.
    """

    dense_x = tf.sparse.to_dense(sparse_input)

References