Shell Command Zoo in Data Processing

Bash commands are powerful for data processing. This post records some tips I have picked up while processing data, and will be kept updated.

Data statistics

Handle duplicates

  • Obtain the unique lines of a file:

    sort -u "<filename>" -o "<output_filename>"

    # slightly slower
    sort "<filename>" | uniq > "<output_filename>"
  • Random sort

    # -R: sort by a random hash of the keys
    # (identical lines stay adjacent; use shuf for a true shuffle)
    sort -R "<filename>"
  • Count the occurrences of each line (a frequency-table sketch follows this list):

    # -i: ignore case
    # -c: prefix each line with its number of occurrences
    sort "<filename>" | uniq -ic
  • Print duplicate lines:

    # print all duplicate lines (-i: ignore case)
    sort "<filename>" | uniq -iD

    # print each duplicate line only once
    sort "<filename>" | uniq -id
    # or case-sensitive:
    sort "<filename>" | uniq -d

Count the occurrences of a string “xxx”:

  1. vim mode
    :%s/xxx//gn
    The n flag reports the number of matches without performing the substitution.
  2. grep
    $ grep -o "xxx" filename | wc -l
    To count occurrences of either “xxx” or “yyy”:
    $ grep -o "xxx\|yyy" filename | wc -l

Train / dev / test set split

  1. Shuffle the data
  • Given the unsplit data file “data.tsv”, shuffle its lines:
    $ shuf "data.tsv" -o "shuffle.tsv"

  2. Split into train / dev sets
  • Count the # of lines:
    $ wc -l "shuffle.tsv"
  • Put the first 90% of lines into the train set and the last 10% into the dev set:
    $ head -n <#0.9count> "shuffle.tsv" > "train.tsv"
    $ tail -n <#0.1count> "shuffle.tsv" > "dev.tsv"
  3. Split into train / dev / test sets
  • Count the # of lines:
    $ wc -l "shuffle.tsv"
  • Split 80%/10%/10% into the train/dev/test sets (a scripted version follows below):
    $ head -n <#0.8count> "shuffle.tsv" > "train.tsv"
    $ sed -n "<#0.8count+1>,<#0.9count>p" "shuffle.tsv" > "dev.tsv"
    $ tail -n <#0.1count> "shuffle.tsv" > "test.tsv"
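The line counts need not be computed by hand. A minimal sketch that derives the 80%/10%/10% split sizes from wc -l (assumes shuffle.tsv from the previous step; integer division leaves any remainder lines in the test set):

total=$(wc -l < "shuffle.tsv")
train=$(( total * 8 / 10 ))   # 80%
dev=$(( total / 10 ))         # 10%

head -n "$train" "shuffle.tsv" > "train.tsv"
sed -n "$(( train + 1 )),$(( train + dev ))p" "shuffle.tsv" > "dev.tsv"
tail -n +"$(( train + dev + 1 ))" "shuffle.tsv" > "test.tsv"   # the rest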

Check data

  1. Print a specific line of a given file:
    sed -n <line_num>p <filename.txt>
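Relatedly, sed can print a range of lines, and awk offers an equivalent (both standard tools):

# print lines 100-110
sed -n '100,110p' <filename.txt>

# the same with awk
awk 'NR >= 100 && NR <= 110' <filename.txt>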

Job running

Redirection

Redirect output to file

Redirect the standard output (stdout) and standard error (stderr) to an output file.

[bash_command] > "<out_file_name>" 2>&1

# 0 is stdin
# 1 is stdout
# 2 is stderr

  • File descriptor 1 is the standard output (stdout), and file descriptor 2 is the standard error (stderr).
  • & indicates that what follows is a file descriptor and not a filename.
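Building on the file descriptors above, stdout and stderr can also go to separate files, or stderr can be discarded:

# stdout and stderr to separate files
[bash_command] > "out.log" 2> "err.log"

# discard stderr
[bash_command] 2> /dev/null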

Output stdout to both the screen and a file

tee copies its standard input to each FILE, and also to standard output.

[bash_command] | tee "<out_filename>"
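Because the pipe carries only stdout, merge stderr into it first if errors should also reach the file; tee -a appends instead of overwriting:

# capture stderr as well
[bash_command] 2>&1 | tee "<out_filename>"

# append instead of overwriting
[bash_command] | tee -a "<out_filename>"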

Run jobs in background

Ampersand (&)

An ampersand (&) runs the command as a subprocess (i.e., a child process of the current bash session), which will terminate when the session exits.

[bash_command] &

nohup

nohup catches the hangup signal (SIGHUP), so the subprocess keeps running after the current session is closed.

nohup [bash_command]

The foreground job can be suspended by pressing Ctrl + Z. Ctrl + Z does not work when & is used, since background jobs do not receive keyboard signals.
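For completeness, bash's job-control built-ins manage such background or suspended jobs; a short sketch with sleep standing in for a real command:

sleep 100 &    # run in the background; bash prints the job number and PID
jobs           # list the current shell's jobs
fg %1          # bring job 1 to the foreground (Ctrl + Z suspends it again)
bg %1          # resume a suspended job in the background
disown %1      # detach job 1 so the shell will not send it SIGHUP on exit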

nohup + ampersand(&) + redirection

nohup [bash_command] > "<out.filename>" 2>&1 &
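After launching a job this way, it can be monitored through its log file and stopped via its PID (standard tools; placeholders as above):

# follow the log as it grows
tail -f "<out.filename>"

# find the PID, then stop the job
pgrep -f "<command_name>"
kill "<pid>"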

File management

  • Find files larger than 100M in the current directory (variations follow this list):

    $ find . -type <type-name> -size +/-<file-size>
    # e.g. find . -type f -size +100M
    # <type-name>: d: directory, f: file
    # +: larger than, -: smaller than
    # <file-size> units: k/M/G
  • Find large directories:

    $ du -h --max-depth=1

    # -m: sizes in MB, so the numeric sort works
    $ du -m --max-depth=2 | sort -nr | head -12
  • Report the amount of disk space used:

    # report all top-level dirs
    cd /
    du -sh *

    # one level of subdirectories
    du -lh --max-depth=1
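Two handy variations on the above: act on the matched files directly with find -exec, and keep du's human-readable output sortable with GNU sort -h:

# list files over 100M together with their sizes
find . -type f -size +100M -exec ls -lh {} \;

# largest directories first, human-readable sizes (GNU sort)
du -h --max-depth=1 | sort -hr | head -12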

Count

# count the # of lines in a file
$ wc -l <filepath>

# count the # of files in a directory (non-recursive)
$ ls | wc -l

# count the # of files recursively in a directory
$ find <directory> -type f | wc -l
# or (tree prints a directory/file count at the end)
tree <directory>
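A couple of related counts that come up often (standard tools; the *.tsv pattern is just an example):

# count non-empty lines
grep -c . <filepath>

# count lines across many files (per file, plus a total)
find <directory> -name "*.tsv" -print0 | xargs -0 wc -l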

Unzip archives with Chinese filenames

# -O: decode archive filenames using the GBK encoding
unzip -O GBK <filename>.zip
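-O requires an unzip build with charset support (common on Linux distributions); assuming that, the decoded filenames can be previewed before extraction:

# list the archive contents with GBK-decoded filenames
unzip -l -O GBK <filename>.zip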