Bash is a powerful tool for data processing. This post collects tips that I have come across while processing data, and it will be updated over time.

Data statistics

Handle duplicate

Obtain the unique lines of a file:

sort -u "<filename>" -o "<output_filename>"

# slightly slower
sort "<filename>" | uniq > "<output_filename>"
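As a quick sanity check that the two variants give the same result, here is a minimal sketch on a made-up sample file. The file names and contents are placeholders I chose for illustration.

```shell
# Build a small sample file with duplicate lines (hypothetical data).
workdir=$(mktemp -d)
printf 'b\na\nb\nc\na\n' > "$workdir/sample.txt"

# Variant 1: sort and deduplicate in a single step.
sort -u "$workdir/sample.txt" -o "$workdir/unique_1.txt"

# Variant 2: sort first, then drop adjacent duplicates with uniq.
sort "$workdir/sample.txt" | uniq > "$workdir/unique_2.txt"

# cmp -s exits with status 0 when the two files are identical.
if cmp -s "$workdir/unique_1.txt" "$workdir/unique_2.txt"; then
    same="yes"
else
    same="no"
fi
echo "identical: $same"
cat "$workdir/unique_1.txt"
```

Both output files should contain the three lines a, b and c.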

Random sort

# -R: sort in random order
sort -R "<filename>"
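One thing worth knowing: with GNU sort, -R orders lines by a random hash of their content, so identical lines stay next to each other. The sketch below illustrates this on a made-up file and contrasts it with shuf (used later in this post), which permutes the lines without grouping them. The file names are placeholders.

```shell
workdir=$(mktemp -d)
printf 'a\nb\na\nb\na\n' > "$workdir/dup.txt"

# sort -R hashes each line, so equal lines always end up adjacent.
sort -R "$workdir/dup.txt" > "$workdir/sort_r.txt"

# Because duplicates are adjacent, uniq collapses the output into 2 groups.
n_groups=$(uniq "$workdir/sort_r.txt" | wc -l)

# shuf produces a random permutation of all lines, duplicates included.
shuf "$workdir/dup.txt" -o "$workdir/shuf.txt"
shuffled_lines=$(wc -l < "$workdir/shuf.txt")

echo "groups after sort -R: $((n_groups)), lines after shuf: $((shuffled_lines))"
```

So for data that may contain repeated lines, shuf is probably the safer choice when a true shuffle is needed.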

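Count the occurrences of each line

# -f: fold case while sorting, so lines that differ only in case are adjacent
# -i: ignore case when comparing lines
# -c: prefix each line with its number of occurrences
sort -f "<filename>" | uniq -ic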
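A common follow-up is to rank the lines by frequency. The sketch below, on made-up labels, adds a second numeric sort on the count column. The file name and the data are assumptions for illustration only.

```shell
workdir=$(mktemp -d)
printf 'cat\ndog\nCat\nbird\ndog\ncat\n' > "$workdir/labels.txt"

# sort -f folds case so that "cat" and "Cat" end up adjacent,
# uniq -ic counts them as one group,
# sort -k1,1nr orders the groups by count, largest first.
ranking=$(sort -f "$workdir/labels.txt" | uniq -ic | sort -k1,1nr)
printf '%s\n' "$ranking"

# The first field of the first row is the highest count.
top_count=$(printf '%s\n' "$ranking" | head -n 1 | awk '{print $1}')
echo "highest count: $top_count"
```

Appending `| head -n <N>` to the pipeline keeps only the N most frequent lines.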

Print duplicate lines

# print all duplicate lines (case-insensitive)
sort -f "<filename>" | uniq -iD

# print each duplicated line only once (case-insensitive)
sort -f "<filename>" | uniq -iD | uniq -i
# or case-sensitive:
sort "<filename>" | uniq -D | uniq

Count the occurrences of a string “xxx”

In Vim:
:%s/xxx//gn

With grep:
$ grep -o "xxx" filename | wc -l

To count several strings, e.g. “xxx” and “yyy”:
$ grep -o "xxx\|yyy" filename | wc -l
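It is easy to confuse counting matches with counting matching lines. The following sketch, on a made-up text file, shows the difference between `grep -c` and `grep -o | wc -l`. The file name and contents are placeholders.

```shell
workdir=$(mktemp -d)
printf 'xxx yyy xxx\nno match here\nyyy\n' > "$workdir/text.txt"

# grep -c counts matching LINES: only lines 1 and 3 match.
line_count=$(grep -c 'xxx\|yyy' "$workdir/text.txt")

# grep -o prints every match on its own line, so wc -l counts OCCURRENCES.
match_count=$(grep -o 'xxx\|yyy' "$workdir/text.txt" | wc -l)

# -F treats the pattern as a fixed string, which is safer when the
# string contains regex metacharacters such as "." or "*".
fixed_count=$(grep -oF 'xxx' "$workdir/text.txt" | wc -l)

echo "lines: $((line_count)), matches: $((match_count)), fixed: $((fixed_count))"
```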

Training and dev set split

Given an unsplit data file “data.tsv”:

Shuffle the lines

$ shuf "data.tsv" -o "shuffle.tsv"

Split into train/dev sets

Count the number of lines:
$ wc -l "shuffle.tsv"

Put 90% of the lines into the train set and 10% into the dev set:

$ head -n <#0.9count> "shuffle.tsv" > "train.tsv"
$ tail -n <#0.1count> "shuffle.tsv" > "dev.tsv"
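The line counts do not have to be worked out by hand. As a minimal sketch, assuming a 50-line stand-in for "data.tsv" generated with seq, the shell can compute them with integer arithmetic:

```shell
workdir=$(mktemp -d)
seq 1 50 > "$workdir/data.tsv"          # stand-in for the real data file

shuf "$workdir/data.tsv" -o "$workdir/shuffle.tsv"

# Reading from stdin makes wc print the bare number without the file name.
total=$(wc -l < "$workdir/shuffle.tsv")
n_train=$(( total * 9 / 10 ))           # integer division rounds down
n_dev=$(( total - n_train ))            # the remainder goes to the dev set

head -n "$n_train" "$workdir/shuffle.tsv" > "$workdir/train.tsv"
tail -n "$n_dev" "$workdir/shuffle.tsv" > "$workdir/dev.tsv"

wc -l "$workdir/train.tsv" "$workdir/dev.tsv"
```

Deriving the dev size as `total - n_train` rather than as a second percentage keeps the two files from overlapping or dropping a line when the total is not divisible by 10.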

Split into train / dev / test sets

Count the number of lines:
$ wc -l "shuffle.tsv"

Split 80%/10%/10% into train/dev/test sets:

$ head -n <#0.8count> "shuffle.tsv" > "train.tsv"
$ sed -n "<#0.8count+1>,<#0.9count>p" "shuffle.tsv" > "dev.tsv"
$ tail -n <#0.1count> "shuffle.tsv" > "test.tsv"
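The three-way split can be scripted in the same spirit. This is only a sketch under the assumption of a 100-line stand-in file; the variable names `train_end` and `dev_end` are mine. It uses `tail -n +K` for the test set so that the last part always takes exactly the remaining lines.

```shell
workdir=$(mktemp -d)
seq 1 100 > "$workdir/data.tsv"         # stand-in for the real data file
shuf "$workdir/data.tsv" -o "$workdir/shuffle.tsv"

total=$(wc -l < "$workdir/shuffle.tsv")
train_end=$(( total * 8 / 10 ))         # last line of the train set
dev_end=$(( total * 9 / 10 ))           # last line of the dev set

head -n "$train_end" "$workdir/shuffle.tsv" > "$workdir/train.tsv"
sed -n "$(( train_end + 1 )),${dev_end}p" "$workdir/shuffle.tsv" > "$workdir/dev.tsv"
# tail -n +K starts printing at line K, so the test set takes the rest.
tail -n "+$(( dev_end + 1 ))" "$workdir/shuffle.tsv" > "$workdir/test.tsv"

wc -l "$workdir/train.tsv" "$workdir/dev.tsv" "$workdir/test.tsv"
```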

Check data

Print a specific line of a file:
sed -n "<line_num>p" "<filename.txt>"
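On large files it can help to make sed quit right after printing, and the same syntax also accepts a line range. A small sketch on a generated file (the file name and its "row N" contents are made up):

```shell
workdir=$(mktemp -d)
seq 1 1000 | sed 's/^/row /' > "$workdir/big.txt"

# Print line 3 only, then quit so sed does not scan the rest of the file.
line3=$(sed -n '3{p;q;}' "$workdir/big.txt")

# Print an inclusive range of lines (5 through 7).
range=$(sed -n '5,7p' "$workdir/big.txt")

printf '%s\n' "$line3"
printf '%s\n' "$range"
```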

Job running

Redirection

Redirect output to file

Redirect the standard output (stdout) and standard error (stderr) to an output file.

[bash_command] > "<out_file_name>" 2>&1

# 0 is stdin
# 1 is stdout
# 2 is stderr

File descriptor 1 is the standard output (stdout), and file descriptor 2 is the standard error (stderr).
& indicates that what follows is a file descriptor and not a filename.
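The order of the redirections matters, which is easy to get wrong. The sketch below uses a hypothetical function `noisy` that writes one line to each stream, and compares the correct order with the reversed one.

```shell
workdir=$(mktemp -d)

# Hypothetical command that writes one line to stdout and one to stderr.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Correct order: stdout is sent to the file first, then stderr follows it.
noisy > "$workdir/both.log" 2>&1

# Reversed order: stderr is pointed at the ORIGINAL stdout (captured here
# by the command substitution), and only stdout ends up in the file.
leaked=$(noisy 2>&1 > "$workdir/stdout_only.log")
echo "escaped the file: $leaked"

# The two streams can also be written to separate files.
noisy > "$workdir/out.log" 2> "$workdir/err.log"
```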

Output the stdout to both screen and file

tee copies its standard input to each given file and also to standard output, so the output shows up on screen and is saved at the same time.

[bash_command] | tee "<out_filename>"
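Note that a plain pipe only passes stdout to tee. As a sketch, using a hypothetical function `step` that writes to both streams, stderr can be merged in before the pipe, and `-a` appends instead of overwriting:

```shell
workdir=$(mktemp -d)

# Hypothetical command that writes one line to stdout and one to stderr.
step() {
    echo "progress message"
    echo "warning message" >&2
}

# Merge stderr into stdout before the pipe so tee records both streams.
step 2>&1 | tee "$workdir/run.log"

# -a appends to the log file instead of overwriting it.
step 2>&1 | tee -a "$workdir/run.log"

log_lines=$(wc -l < "$workdir/run.log")
echo "lines in log: $((log_lines))"
```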

Run jobs in background

ampersand(`&`)

An ampersand (&) runs the command in the background as a child process of the current bash session. The job is usually terminated when the session ends.

[bash_command] &

`nohup`

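nohup makes the command ignore the hangup signal (SIGHUP), i.e. the process keeps running after the current session is closed.

nohup [bash_command]

When run in the foreground like this, the job can be suspended by pressing Ctrl + Z. Ctrl + Z does not work when the command is started in the background with &.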

`nohup` + ampersand(`&`) + redirection

nohup [bash_command] > "<out.filename>" 2>&1 &
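Once a job runs in the background, it is handy to remember its process ID so that it can be checked or stopped later. A minimal sketch, where `sleep 30` stands in for the real command and the log and PID file names are assumptions:

```shell
workdir=$(mktemp -d)

# Start a long-running placeholder job ("sleep 30" stands in for real work).
nohup sleep 30 > "$workdir/job.log" 2>&1 &

# $! holds the PID of the most recent background job; save it for later.
job_pid=$!
echo "$job_pid" > "$workdir/job.pid"

# kill -0 sends no signal; it only checks that the process exists.
if kill -0 "$job_pid" 2>/dev/null; then
    state="running"
else
    state="stopped"
fi
echo "job $job_pid is $state"

# Stop the job when it is no longer needed, then reap it.
kill "$(cat "$workdir/job.pid")"
wait "$job_pid" 2>/dev/null || true
```

Saving the PID to a file is what makes it possible to stop the job from a later session, where `$!` is no longer available.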

File management

Find files by size in the current directory (e.g. larger than 100M):

$ find . -type <type-name> -size +/-<file-size>
# e.g. find . -type f -size +100M 
# <type-name>: d: directory, f: file
# +: >, -: <
# <file-size>: k/M/G
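To see which of the matching files are the largest, the result can be sorted by size. The sketch below assumes GNU find (for `-printf`) and uses small made-up files with a 10k threshold so that it runs quickly; with real data the threshold would be something like +100M.

```shell
workdir=$(mktemp -d)
# Create three placeholder files of 1 KiB, 20 KiB and 300 KiB.
head -c 1024   /dev/zero > "$workdir/small.bin"
head -c 20480  /dev/zero > "$workdir/medium.bin"
head -c 307200 /dev/zero > "$workdir/large.bin"

# -size +10k keeps files larger than 10 KiB.
# -printf '%s\t%p\n' (GNU find) prints the size in bytes, a tab, and the path,
# so sort -nr puts the largest file first.
found=$(find "$workdir" -type f -size +10k -printf '%s\t%p\n' | sort -nr)
printf '%s\n' "$found"

# Path of the largest file: second tab-separated field of the first row.
largest=$(printf '%s\n' "$found" | head -n 1 | cut -f2)
echo "largest: $largest"
```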

Find large directories

$ du -h --max-depth=1

# sizes in MB, the 12 largest entries first
$ du -m --max-depth=2 | sort -nr | head -12

Report disk usage

# report all dirs
cd /
du -sh *

# subdir
du -lh --max-depth=1

Count

# count the # of lines in a file
$ wc -l <filepath>

# count the # of entries in a directory (non-recursive; subdirectories are counted too)
$ ls | wc -l

# count the # of files recursively in a directory
$ find <directory> -type f | wc -l
# or (tree prints the directory and file totals on its last line)
$ tree <directory>
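For datasets stored as one folder per split or per class, a per-directory count is often more useful than a single total. A small sketch on a made-up directory layout (the folder and file names are placeholders):

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/train" "$workdir/dev"
touch "$workdir/train/a.txt" "$workdir/train/b.txt" "$workdir/train/c.txt"
touch "$workdir/dev/a.txt"

# For each immediate subdirectory, count the regular files below it.
report=$(
    for dir in "$workdir"/*/; do
        count=$(find "$dir" -type f | wc -l)
        # $((count)) strips any padding that wc may add around the number.
        printf '%s\t%s\n' "$(basename "$dir")" "$((count))"
    done
)
printf '%s\n' "$report"
```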

Unzip archives with Chinese file names

# -O: character encoding of the file names stored in the archive (GBK here)
unzip -O GBK <filename>.zip

Yekun's Note

Shell Command Zoo in Data Processing
