Bash is a powerful tool for data processing. This post collects tips I have picked up while processing data, and I will keep it updated.
Data statistics
Handle duplicates
Obtain the unique lines of a file:
sort -u "<filename>" -o "<output_filename>"
# slightly slower
sort "<filename>" | uniq > "<output_filename>"
Random sort
# -R: sort in random order (identical lines stay grouped together)
sort -R "<filename>"
Count the occurrences of each line
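# -f: fold case while sorting, so lines that differ only in case end up adjacent
# -i: ignore case when comparing lines
# -c: prefix each line with its number of occurrences
sort -f "<filename>" | uniq -ic
Print duplicate lines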
# print all duplicate lines (case-insensitive)
sort -f "<filename>" | uniq -iD
# print one copy of each duplicate line (case-insensitive)
sort -f "<filename>" | uniq -iD | uniq -i
# or case-sensitive:
sort "<filename>" | uniq -D | uniq
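To see what the commands above print, here is a small sketch run on a throwaway file. The file contents and the temporary path are made up for illustration, and `LC_ALL=C` is set only so that the sort order is the same on every machine. It uses `uniq -id`, which prints one copy of each repeated line directly.

```shell
#!/usr/bin/env bash
# Hypothetical sample data, written to a temporary file.
export LC_ALL=C
tmp=$(mktemp)
printf '%s\n' apple Banana apple banana cherry > "$tmp"

# Unique lines (case-sensitive): Banana and banana are both kept.
unique=$(sort -u "$tmp")

# Occurrences of each line, ignoring case. sort -f folds case so that
# Banana and banana end up adjacent, which uniq requires.
counts=$(sort -f "$tmp" | uniq -ic)

# One copy of every line that occurs more than once, ignoring case.
dups=$(sort -f "$tmp" | uniq -id)

printf 'unique:\n%s\ncounts:\n%s\nduplicates:\n%s\n' "$unique" "$counts" "$dups"
rm -f "$tmp"
```

Without `sort -f`, a case-sensitive sort can leave "Banana" and "banana" far apart, and `uniq -i` would then miss them because it only compares adjacent lines.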
Count the occurrences of a string “xxx”.
vim mode
:%s/xxx//gn
grep
$ grep -o "xxx" filename | wc -l
To count occurrences of any of several strings [“xxx”, “yyy”]:
$ grep -o "xxx\|yyy" filename | wc -l
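The reason for `grep -o ... | wc -l` rather than `grep -c` is that `-c` counts matching lines, not matches. A minimal sketch on made-up text shows the difference; the sample lines and variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample text: "xxx" appears three times on two lines.
tmp=$(mktemp)
printf '%s\n' 'xxx and xxx' 'yyy only' 'one more xxx' > "$tmp"

# grep -c counts matching LINES, so the first line counts once.
line_count=$(grep -c 'xxx' "$tmp")

# grep -o prints each match on its own line, so wc -l counts OCCURRENCES.
occurrence_count=$(grep -o 'xxx' "$tmp" | wc -l)

# -E enables alternation without backslashes.
either_count=$(grep -oE 'xxx|yyy' "$tmp" | wc -l)

echo "lines=$line_count occurrences=$occurrence_count either=$either_count"
rm -f "$tmp"
```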
Training and dev set split
- Split train and dev sets:
Given an unsplit data file “data.tsv”
Shuffle the lines:
$ shuf "data.tsv" -o "shuffle.tsv"
Count the number of lines:
$ wc -l "shuffle.tsv"
Put 90% of the lines into the train set and the remaining 10% into the dev set:
# <num_train>: 0.9 x the line count; <num_dev>: the remaining lines
$ head -n <num_train> "shuffle.tsv" > "train.tsv"
$ tail -n <num_dev> "shuffle.tsv" > "dev.tsv"
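The steps above can be chained so that the line counts are computed rather than typed by hand. This is a sketch under a few assumptions: GNU `shuf` is available, the 100-line `data.tsv` is generated only for the demo, and `workdir`, `num_train` and `num_dev` are names introduced here.

```shell
#!/usr/bin/env bash
# Demo data: 100 numbered lines in a temporary directory.
workdir=$(mktemp -d)
seq 1 100 > "$workdir/data.tsv"

shuf "$workdir/data.tsv" -o "$workdir/shuffle.tsv"

total=$(wc -l < "$workdir/shuffle.tsv")   # number of lines after shuffling
num_train=$(( total * 9 / 10 ))           # integer arithmetic, rounds down
num_dev=$(( total - num_train ))          # the remainder goes to the dev set

head -n "$num_train" "$workdir/shuffle.tsv" > "$workdir/train.tsv"
tail -n "$num_dev"   "$workdir/shuffle.tsv" > "$workdir/dev.tsv"

wc -l "$workdir/train.tsv" "$workdir/dev.tsv"
```

Computing `num_dev` as the remainder guarantees that every line lands in exactly one of the two files, even when the total is not divisible by 10.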
Check data
- Print a specific line of a file:
sed -n '<line_num>p' "<filename.txt>"
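A short sketch of the same `sed` idiom on a made-up five-line file, including a range and a variant that stops reading once the wanted line has been printed (useful on large files). The file and variable names are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical five-line file.
tmp=$(mktemp)
printf 'line %d\n' 1 2 3 4 5 > "$tmp"

# Print only line 3; -n suppresses sed's default printing of every line.
third=$(sed -n '3p' "$tmp")

# Print lines 2 through 4.
middle=$(sed -n '2,4p' "$tmp")

# Quit right after printing line 3 instead of reading to the end.
third_fast=$(sed -n '3{p;q;}' "$tmp")

printf '%s\n---\n%s\n---\n%s\n' "$third" "$middle" "$third_fast"
rm -f "$tmp"
```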
Job running
Redirection
Redirect output to file
Redirect the standard output (stdout) and standard error (stderr) to an output file.
[bash_command] > "<out_file_name>" 2>&1
# 0 is stdin
# 1 is stdout
# 2 is stderr
- File descriptor 1 is the standard output (stdout), and file descriptor 2 is the standard error (stderr). & indicates that what follows is a file descriptor and not a filename.
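The order of the two redirections matters, because `2>&1` copies wherever descriptor 1 points at that moment. The sketch below uses a hypothetical helper, `emit`, that writes one line to each stream, and counts how many lines reach the file in each case.

```shell
#!/usr/bin/env bash
# Hypothetical helper that writes one line to each stream.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

out=$(mktemp)

# Usual order: stdout goes to the file first, then stderr is pointed
# at wherever stdout currently goes (the file).
emit > "$out" 2>&1
both=$(wc -l < "$out")

# Reversed order: stderr is pointed at the current stdout (the terminal)
# before stdout is moved to the file, so only stdout lands in the file.
emit 2>&1 > "$out"
only_stdout=$(wc -l < "$out")

echo "both=$both only_stdout=$only_stdout"
rm -f "$out"
```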
Write stdout to both the screen and a file
tee copies its standard input to each FILE, and also to standard output.
[bash_command] | tee "<out_filename>"
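Since a pipe carries only stdout, `tee` does not see stderr unless it is merged first. A minimal sketch, again with a hypothetical `emit` helper; the `> /dev/null` after `tee` only keeps the demo quiet and would be dropped in real use so the output still appears on screen.

```shell
#!/usr/bin/env bash
# Hypothetical helper that writes one line to each stream.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

log=$(mktemp)

# Only stdout passes through the pipe; stderr bypasses tee.
emit | tee "$log" > /dev/null
stdout_only=$(wc -l < "$log")

# Merge stderr into stdout before the pipe so tee receives both.
emit 2>&1 | tee "$log" > /dev/null
merged=$(wc -l < "$log")

# -a appends to the file instead of overwriting it.
echo "another run" | tee -a "$log" > /dev/null
appended=$(wc -l < "$log")

echo "stdout_only=$stdout_only merged=$merged appended=$appended"
rm -f "$log"
```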
Run jobs in background
ampersand (&)
The ampersand (&) starts the command as a background child process of the current bash session. The job is normally terminated when that session ends, for example when the terminal is closed.
[bash_command] &
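A small sketch of working with a background job from a script: `$!` gives its PID, `kill -0` checks whether it is still alive, and `wait` collects its exit status. Here `sleep 1` is only a stand-in for real work, and the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Start a stand-in job in the background.
sleep 1 &
job_pid=$!            # $! holds the PID of the most recent background job

# The shell is free to do other things while the job runs.
if kill -0 "$job_pid" 2>/dev/null; then   # kill -0 only tests that the PID exists
  state="running"
else
  state="finished"
fi

wait "$job_pid"       # block until the job exits
status=$?             # exit status of the background job
echo "state=$state status=$status"
```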
nohup
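nohup makes the command ignore the hangup signal (SIGHUP), i.e. the process keeps running after the current session is closed.
nohup [bash_command]
While it runs in the foreground, it can be suspended with Ctrl + z (or terminated with Ctrl + c). Ctrl + z does not work when the command is started with &.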
nohup + ampersand (&) + redirection
nohup [bash_command] > "<out.filename>" 2>&1 &
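Because Ctrl + z no longer reaches a job started this way, it helps to record the PID so the job can be stopped later with `kill`. The following is a sketch, not a recipe: the `sh -c '...'` command is a stand-in for a long-running job, and the log and PID file paths are temporary names chosen for the demo.

```shell
#!/usr/bin/env bash
# Stand-in for a long job: prints a line, then sleeps.
# exec replaces sh with sleep, so the saved PID is the sleeping process.
log=$(mktemp)
nohup sh -c 'echo started; exec sleep 30' > "$log" 2>&1 &
job_pid=$!
echo "$job_pid" > "$log.pid"    # keep the PID so the job can be found later

sleep 1                          # give the job a moment to write its first line
if grep -q '^started$' "$log"; then saw_start="yes"; else saw_start="no"; fi

# Later, possibly from a new session: stop the job using the saved PID.
kill "$(cat "$log.pid")"
wait "$job_pid" 2>/dev/null      # reap the job; only possible from the same shell
echo "saw_start=$saw_start"
rm -f "$log" "$log.pid"
```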
File management
Find files larger than 100M in the current directory:
$ find . -type <type-name> -size +/-<file-size>
# e.g. find . -type f -size +100M
# <type-name>: d: directory, f: file
# +: >, -: <
# <file-size>: k/M/G
Find large directories
$ du -h --max-depth=1
# -m reports sizes in MiB as plain numbers, so sort -n can order them
$ du -m --max-depth=2 | sort -nr | head -12
Report the amount of disk space used
# report all dirs
cd /
du -sh *
# subdir
du -lh --max-depth=1
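To try the `find -size` and `du` commands from this section safely, the sketch below builds a throwaway directory with one small and one large file. The directory layout, file names and sizes are invented for the demo, and `--max-depth` assumes GNU `du`.

```shell
#!/usr/bin/env bash
# Hypothetical sandbox with one small and one large file.
dir=$(mktemp -d)
mkdir -p "$dir/a" "$dir/b"
head -c 1024    /dev/zero > "$dir/a/small.bin"   # 1 KiB
head -c 2097152 /dev/zero > "$dir/b/large.bin"   # 2 MiB

# Regular files larger than 1 MiB.
big_files=$(find "$dir" -type f -size +1M)

# Regular files smaller than 10 KiB.
small_files=$(find "$dir" -type f -size -10k)

# Per-directory usage in KiB, largest first (-k keeps the numbers sortable).
du -k --max-depth=1 "$dir" | sort -nr

echo "big: $big_files"
echo "small: $small_files"
rm -rf "$dir"
```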
Count
# count the # of lines in a file