软件介绍之Samtools / 四六文摘

咱们《生信技能树》的B站有一个lncRNA数据分析实战，缺乏配套笔记，所以我们安排了100个lncRNA组装案例文献分享，以及这个流程会用到的100个软件的实战笔记教程！

下面是100个lncRNA组装流程的软件的笔记教程

SAMtools是一个用于操作sam和bam文件的工具合集。能够实现二进制查看、格式转换、排序及合并等功能，结合sam格式中的flag、tag等信息，还可以完成比对结果的统计汇总。同时利用linux中的grep、awk等操作命令，还可以大大扩展samtools的使用范围与功能。包含有许多命令，我这里主要介绍几个:

samtools官网：http://www.htslib.org/

一、软件安装

使用conda安装

conda install samtools

二、Samtools的用法

1.samtools view

samtools view主要用来转换SAM/BAM/CRAM文件。将sam文件与bam文件互换；然后对bam文件进行各种操作，比如数据的排序(sort)和提取(这些操作是对bam文件进行的，因而当输入为sam文件的时候，不能进行该操作)；最后将排序或提取得到的数据输出为bam或sam（默认的）格式

常用参数：

-b output BAM # 该参数设置输出 BAM 格式，默认下输出是 SAM 格式文件 -h print header for the SAM output # 默认下输出的 sam 格式文件不带 header，该参数设定输出sam文件时带 header 信息 -H print SAM header only (no alignments) # 仅仅输出文件的头文件 -S input is SAM # 默认下输入是 BAM 文件，若是输入是 SAM 文件，则最好加该参数，否则有时候会报错。 -u uncompressed BAM output (force -b) # 该参数的使用需要有-b参数，能节约时间，但是需要更多磁盘空间。 -c print only the count of matching records # 仅输出匹配的统计记录 -L FILE only include reads overlapping this BED FILE [null] # 仅包括和bed文件存在overlap的reads -o FILE output file name [stdout] # 输出文件的名称 -F INT only include reads with none of the FLAGS in INT present [0] # 过滤flag，仅输出指定FLAG值的序列 -q INT only include reads with mapping quality >= INT [0] # 比对的最低质量值，一般认为20就为unique比对了，可以结合上述-bF参数使用使用提取特定的比对结果 -@ Number of additional threads to use [0] # 指使用的线程数

具体例子：

# 将sam文件转换成bam文件 samtools view -bS abc.sam > abc.bam


# BAM转换为SAM

samtools view -h -o out.sam out.bam
# 提取比对到参考序列上的比对结果

samtools view -bF 4 abc.bam > abc.F.bam
# 提取paired reads中两条reads都比对到参考序列上的比对结果，只需要把两个4+8的值12作为过滤参数即可

samtools view -bF 12 abc.bam > abc.F12.bam
# 提取没有比对到参考序列上的比对结果

samtools view -bf 4 abc.bam > abc.f.bam
# 提取bam文件中比对到caffold1上的比对结果，并保存到sam文件格式

samtools view abc.bam scaffold1 > scaffold1.sam
# 提取scaffold1上能比对到30k到100k区域的比对结果

samtools view abc.bam scaffold1:30000-100000 $gt; scaffold1_30k-100k.sam

# 根据fasta文件，将 header 加入到 sam 或 bam 文件中 samtools view -T genome.fasta -h scaffold1.sam > scaffold1.h.sam

具体例子：

2.samtools sort

samtools sort用来对SAM/BAM/CRAM文件进行排序，按最左坐标排序，或使用-n时按读取名称排序。默认情况下，排序后的输出被写到标准输出，或者在使用-o时写到指定的文件(out.bam)。此命令还将创建临时文件tmpprefixv .%d。当整个对齐数据无法装入内存时(通过-m选项控制)，根据需要使用bam。

常用参数：

-l INT # 设置输出文件压缩等级。0-9，0是不压缩，9是压缩等级最高。不设置此参数时，使用默认压缩等级； -m INT # 设置每个线程运行时的内存大小，可以使用K，M和G表示内存大小。 -n # 设置按照read名称进行排序； -o FILE # 设置最终排序后的输出文件名； -O FORMAT # 设置最终输出的文件格式，可以是bam，sam或者cram，默认为bam； -T PREFIX # 设置临时文件的前缀； -@ INT #设置排序和压缩的是线程数量，默认是单线程。# tmp.bam 按照序列名称排序，并将结果输出到tmp.sort.bam samtools sort -n tmp.bam -o tmp.sort.bam samtools view tmp.sort.bam

3.samtools index

必须对bam文件进行默认情况下的排序后，才能进行index。否则会报错。建立索引后将产生后缀为.bai的文件，用于快速的随机处理。很多情况下需要有bai文件的存在，特别是显示序列比对情况下。比如samtool的tview命令就需要；gbrowse2显示reads的比对图形的时候也需要。IGV显示比对情况也需要。

常用参数

-b # 创建一个BAI索引。当不使用格式选项时，这是当前的默认设置。 -c # 创建CSI索引。默认情况下，索引的最小间隔大小为2^14，与BAI格式使用的固定值相同。 -m INT # 创建CSI索引，最小间隔大小为2^INT。 -@, --threads INT # 除了主线程[0]之外，要使用的输入/输出压缩线程数。

4.samtools flagstat

samtools flagstat用于给出BAM文件的比对结果。

常用参数:

-@ INT # 设置读取文件时要使用的额外线程数。 -O FORMAT # 设置输出格式。FORMAT可以设置为'default'， 'json'或'tsv'来选择默认的，json或标签分隔值输出格式。如果不使用此选项，将选择默认格式。

3.merge和cat

merge将多个已经sort了的bam文件融合成一个bam文件。融合后的文件不需要则是已经sort过了的。而cat命令不需要将bam文件进行sort。Usage: samtools merge [-nurlf] [-h inh.sam] [-b <bamlist.fofn>] <out.bam> <in1.bam> [<in2.bam> ... <inN.bam>]

Options: -n Input files are sorted by read name # 输入文件是经过sort -n的 -t TAG Input files are sorted by TAG value # 输入文件是经过sort -t的 -r Attach RG tag (inferred from file names) # 添加上RG标签 -u Uncompressed BAM output # 输出未压缩的bam -f Overwrite the output BAM if exist # 覆盖已经存在的bam -1 Compress level 1 # 1倍压缩 -l INT Compression level, from 0 to 9 [-1] # 指定压缩倍数 -R STR Merge file in the specified region STR [all] -h FILE Copy the header in FILE to <out.bam> [in1.bam]$ samtools cat Usage: samtools cat [options] <in1.bam> [... <inN.bam>] samtools cat [options] <in1.cram> [... <inN.cram>]


Concatenate BAM or CRAM files, first those in <bamlist.fofn>, then those

on the command line.

Options: -b FILE list of input BAM/CRAM file names, one per line -h FILE copy the header from FILE [default is 1st input file] -o FILE output BAM/CRAM --no-PG do not add a PG line --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE -@, --threads INT Number of additional threads to use [0] --verbosity INT Set level of verbosity

三、SAMtools实战

1.格式转换(samtools view)

和hisat2连用，将结果sam文件转为bam并排序

批量运行脚本

mkdir 04.mapping ls /home/data/lihe/lncRNA_project/03.trim/*_1.fq.gz > 1 ls /home/data/lihe/lncRNA_project/03.trim/*_2.fq.gz > 2 ls /home/data/lihe/lncRNA_project/03.trim/*_1.fq.gz | cut -d "/" -f 7 | cut -d "_" -f 1 > 0 paste 0 1 2 > config


cd 04.mapping
cat > 04.hg38_mapping.sh

index=/home/data/server/reference/index/hisat/hg38/genome

config=$1

number1=$2

number2=$3

cat $1 | while read id

do

    if((i%$number1==$number2))

    then

    arr=(${id})

    fq1=${arr[1]}

    fq2=${arr[2]}

    sample=${arr[0]}

    hisat2 -p 2 -x ${index} -1 $fq1 -2 $fq2 | samtools sort -@ 6 -o ./${sample}.bam -

    fi    ## end for number1

    i=$((i+1))

done

for i in {0..6} do (nohup bash 04.hg38_mapping.sh config 7 $i 1>log.$i.txt 2>&1 & ) done

输出文件：

1.6G 2月 4 23:29 SRR8984523_aln.sorted.bam 1.4G 2月 4 23:41 SRR8984524_aln.sorted.bam 1.5G 2月 4 23:47 SRR8984525_aln.sorted.bam 1.5G 2月 4 23:55 SRR8984526_aln.sorted.bam 1.5G 2月 5 00:02 SRR8984527_aln.sorted.bam 1.7G 2月 4 23:43 SRR8984528_aln.sorted.bam 1.7G 2月 4 23:51 SRR8984529_aln.sorted.bam 1.7G 2月 5 00:00 SRR8984530_aln.sorted.bam 1.7G 2月 4 23:58 SRR8984531_aln.sorted.bam 1.5G 2月 4 23:50 SRR8984532_aln.sorted.bam 1.6G 2月 5 08:59 SRR8984533_aln.sorted.bam 1.6G 2月 4 23:43 SRR8984534_aln.sorted.bam

2. 建立索引(samtools index)和查看reads比对情况(samtools flagstat)**

批量运行脚本

ls /home/data/lihe/lncRNA_project/04.mapping/*.bam > 1 ls /home/data/lihe/lncRNA_project/04.mapping/*.bam | cut -d "/" -f 7 | cut -d "_" -f 1 | cut -d "." -f 1 > 0 paste 0 1 > config


cat > index.sh
config=$1

number1=$2

number2=$3

cat $1 | while read id

do

    if((i%$number1==$number2))

    then

    arr=(${id})

    input=${arr[1]}

    sample=${arr[0]}

    samtools index -@ 4 ${input} ./${sample}.bam.bai && samtools flagstat ${input}>$sample.txt

    fi    ## end for number1

    i=$((i+1))

done
for i  in {0..4}

do 

(nohup bash index.sh  config 5 $i 1>log.$i.txt 2>&1 & )

done

输出文件：

ll -lh *bai|cut -d " " -f 5- 3.1M 2月 4 23:29 SRR8984523_aln.sorted.bam.bai 3.0M 2月 4 23:41 SRR8984524_aln.sorted.bam.bai 3.0M 2月 4 23:48 SRR8984525_aln.sorted.bam.bai 3.0M 2月 4 23:55 SRR8984526_aln.sorted.bam.bai 3.0M 2月 5 00:02 SRR8984527_aln.sorted.bam.bai 3.1M 2月 4 23:43 SRR8984528_aln.sorted.bam.bai 3.0M 2月 4 23:51 SRR8984529_aln.sorted.bam.bai 3.0M 2月 5 00:00 SRR8984530_aln.sorted.bam.bai 3.0M 2月 5 09:16 SRR8984531_aln.sorted.bam.bai 2.9M 2月 4 23:51 SRR8984532_aln.sorted.bam.bai 2.8M 2月 5 09:00 SRR8984533_aln.sorted.bam.bai 2.9M 2月 4 23:43 SRR8984534_aln.sorted.bam.bai

samtools flagstat用于统计文件比对情况：

ll -lh SRR*txt|cut -d " " -f 5- 395 2月 4 23:30 SRR8984523.txt 394 2月 4 23:41 SRR8984524.txt 394 2月 4 23:48 SRR8984525.txt 394 2月 4 23:56 SRR8984526.txt 394 2月 5 00:03 SRR8984527.txt 394 2月 4 23:44 SRR8984528.txt 394 2月 4 23:52 SRR8984529.txt 394 2月 5 00:01 SRR8984530.txt 394 2月 5 09:16 SRR8984531.txt 394 2月 4 23:51 SRR8984532.txt 394 2月 5 09:00 SRR8984533.txt 394 2月 4 23:43 SRR8984534.txt

结果示例：SRR8984523.txt

51977802 + 0 in total (QC-passed reads + QC-failed reads) #总共的reads数 4063860 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 50969572 + 0 mapped (98.06% : N/A) #总体上reads的匹配率 47913942 + 0 paired in sequencing #有多少reads是属于paired reads 23956971 + 0 read1 #reads1中的reads数 23956971 + 0 read2 #reads2中的reads数 45980036 + 0 properly paired (95.96% : N/A) #完美匹配的reads数：比对到同一条参考序列，并且两条reads之间的距离符合设置的阈值 46308722 + 0 with itself and mate mapped #paired reads中两条都比对到参考序列上的reads数 596990 + 0 singletons (1.25% : N/A) #单独一条匹配到参考序列上的reads数，和上一个相加，则是总的匹配上的reads数。 232008 + 0 with mate mapped to a different chr #paired reads中两条分别比对到两条不同的参考序列的reads数 210169 + 0 with mate mapped to a different chr (mapQ>=5) #同上一个，只是其中比对质量>=5的reads的数量

样品比对信息统计脚本：

##secondary表示比对到参考基因组多个位置的reads数 dir='/home/lihe/lncrna/align' outfile='mapinfo.txt' cd $dir ##先判断输出文件是否存在，方便多次修改调试脚本 if [ -e $outfile ]; then rm -rf $outfile fi ls *txt|while read id;do(echo $id|awk -F "." '{print $1}'>>sample);done #记住'{printf "%'"'"'18.0f\n",$0}'表示用“,”作为千位分隔符分隔整数，18.2f表示保留两位小数 ls *txt|while read id;do(cat $id|cut -d " " -f 1|sed -n "1p"|awk '{printf "%'"'"'18.0f\n",$0}' >>totalreads);done ls *txt|while read id;do(cat $id|cut -d " " -f 1|sed -n "2p"|awk '{printf "%'"'"'18.0f\n",$0}'>>secondary);done ls *txt|while read id;do(cat $id|cut -d " " -f 4,5|sed -n '5p'|awk -F "(" '{print $2}' >>mapratio);done echo "sample totalreads secondary mapratio" >$outfile paste -d "\t" sample totalreads secondary mapratio >>$outfile #删除中间文件 rm sample totalreads secondary mapratio

样品比对信息统计表：mapinfo.txt

sample totalreads secondary mapratio SRR8984523 53,974,859 10,235,416 98.59% SRR8984524 47,342,726 8,640,278 98.60% SRR8984525 50,990,953 9,651,296 98.60% SRR8984526 50,927,636 8,225,220 98.51% SRR8984527 49,300,389 7,642,290 98.36% SRR8984528 57,093,942 9,488,991 98.56% SRR8984529 57,548,074 8,818,589 98.56% SRR8984530 57,896,032 9,355,084 98.58% SRR8984531 57,832,433 9,652,116 98.60% SRR8984532 52,957,312 8,960,173 98.70% SRR8984533 55,501,832 9,690,769 98.67% SRR8984534 54,603,442 9,155,215 98.63%

3.提取chr15:89495978-895100020

samtools view -h SRR10744252.bam chr15:89495978-895100020 | samtools view -Sb - > small.bam

# bam转fastq nohup bedtools -i small.bam -fq tmp1.fq -fq2 tmp2.fq &

4.使用samtools tview使用下面的命令查看chr1:160000-160100区域的比对情况

samtools tview -p chr1:160000-160100 SRR10744251.bam

5.使用less命令分别查看test.sam，test.bam文件，为什么bam文件会输出乱码？使用samtools view命令再试试看？

# less命令可以正常查看sam⽂件，但是不能正常查看bam⽂文件，因为bam⽂文件是二进制文件，所以需要使⽤ samtools view SRR10744251.bam

文末友情推荐

与十万人一起学生信，你值得拥有下面的学习班：

软件介绍之Samtools