# Yekun's Note

Machine learning notes and writeup.

It is wasteful to store zeros elements in a sparse matrix, especially for incrementally data. When constructing tf-idf and bag-of-words features or saving graph ajacent matrix, non-efficient sparse matrix storage might lead to the memory error. To circumvent this problems, efficient sparse matrix storage is a choice.

# Sparse matrix

A martix is sparse when its sparsity is greater than 0.5, where the sparsity of a matrix is the # of zero-valued elements divided by the total # of elements (which is equal to 1 minus the density of the matrix).[2]

# Storage

Each entry in the array represents the element $a_{i,j}$ of the matrix and is accessed by the two indices $i$ (row index) and $j$ (colomn index).
Sparse matrix can reduce sustantial memory requirements by storing only the non-zero entries.

Formats can be divided into:

those support efficient modification, such as DOK, LIL or COO, which are typically used to construct the matrices.
those that support efficient access and matrix operations, such as CSR or CSC.

## Dictionary of keys (DOK)

• DOK is storing a dictionary that mays (row, column)-pairs to the valu of the elements. Elements that are missing from the dictionary are zeros.
• DOK based sparse matrix is efficient for constructing sparse matrices incrementally.
• Allows for efficient O(1) access of individual elements. Duplicates are not allowed. Can be efficiently converted to a coo_matrix once constructed.[3]

## List of lists (LIL)

• LIL stores one list per row, with each entry containing the column index and the value, where entries are sorted by column index for faster indexing.[4]
• Also efficient for constructing sparse matrices incrementally

• supports flexible slicing
• changes to the matrix sparsity structure are efficient

• arithmetic operations LIL + LIL are slow (consider CSR or CSC)
• slow column slicing (consider CSC)
• slow matrix vector products (consider CSR or CSC)

Intended Usage

LIL is a convenient format for constructing sparse matrices
once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations
consider using the COO format when constructing large matrices

## Coordinate list (COO)

• COO stores a list of (row, column, value) tuples, where entries are sorted first by row index and then by column index. ALso known as the ijv or triplet format.[5]
• Also efficient for constructing sparse matrices incrementally
• A[i[k], j[k]] = data[k]

1. facilitates fast conversion among sparse formats
2. permits duplicate entries (see example)
3. very fast conversion to and from CSR/CSC formats

• does not directly support arithmetic operations and slicing

Intended Usage

1. COO is a fast format for constructing sparse matrices
2. Once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations
3. By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)

## Compressed sparse row (CSR, CRS or Yale format)

The compressed sparse row (CSR) or compressed row storage (CRS) or Yale format stores the matrix in three 1-dim arrays (data, col_indices, row_indptr), respectively containing non-zero values.
Let NNZ denote the # of nonzero entries.[6]

• The arrays data (non-zero values) and col_indices (column indices) are of length NNZ.
• The array row_indptr has one element per row in the matrix and encodes the index in data where the given row starts.
• The first value of row_indptr is always 0 and the last is always set to NNZ.

Insantiate with:

1. csr_matrix(D) , with a dense matrix or rank-2 ndarray D
2. csr_matrix(S) , with another sparse matrix S (equivalent to S.tocsr())
3. csr_matrix((M, N), [dtype]) , to construct an empty matrix with shape (M, N) dtype is optional, defaulting to dtype=’d’.
4. csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)]), where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].
5. csr_matrix((data, indices, indptr), [shape=(M, N)]).
• row = row_indptr[i: i+1]
• col = indices[row]
• A[row,col] = data[row]

• efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
• efficient column slicing
• fast matrix vector products

• slow column slicing operations (consider CSC)
• changes to the sparsity structure are expensive (consider LIL or DOK)

## Compressed sparse column (CSC or CCS)

CSC is similar to CSR except that values are read first by column, storing row indices for each value.

• efficient arithmetic operations CSC + CSC, CSC * CSC, etc.
• efficient column slicing
• fast matrix vector products (CSR, BSR may be faster)

• slow row slicing operations (consider CSR)
• changes to the sparsity structure are expensive (consider LIL or DOK)

# Shuffle sparse matrix

When generating batches of training/test data, sklearn.utils.shuffle can help.

# Feed to TensorFlow

## Feed to tf.sparse.placeholder

After feeding into the tensorflow computational graph, use tf.sparse since SparseTensor cannot be directly feed into a Tensor op!

1. Use tf.sparse

2. Convert the SparseTensor to dense Tensor