转录组上游分析错误(报错大赏)
希望所有学员都可以站在生信技能树的舞台上发光发热!
因为平时做单细胞比较多。。突然想试试自己还会不会常规转录组的上游分析。。记忆一下各个软件的使用流程也是不错的
首先选择数据 SRX8632887: GSM4645172: M1h rep2; Mus musculus; RNA-Seq
因为 ebi 数据库没有这个数据的相关信息因为这个数据刚刚发布,所以我就使用了 prefetch 下载。
#是一个小鼠的 rna-seq 数据
prefetch SRR12108837
# 然后验证一下
vdb-validate SRR12108837.sra
#结果 显示 完整的没有问题!
2021-09-19T15:07:16 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' metadata: md5 ok
2021-09-19T15:07:16 vdb-validate.2.10.7 info: Table 'SEQUENCE' metadata: md5 ok
2021-09-19T15:07:17 vdb-validate.2.10.7 info: Column 'QUALITY': checksums ok
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Column 'READ': checksums ok
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Column 'SPOT_GROUP': checksums ok
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' contains only unaligned reads
2021-09-19T15:07:19 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' is consistent
下一步就是把这个 sra 转变为 fastqc
代码如下:
fastq-dump --gzip --split-files -O ${fqdir} SRR12108837.sra
#因为是双端测序生成如下两个文件!
SRR12108837_1.fastq.gz SRR12108837_2.fastq.gz
#看一下输出日志
(base) t010204@bio1:~/SRR12108837$ cat FASTQ_DUMP.txt
nohup: ignoring input
Read 23482567 spots for SRR12108837.sra
Written 23482567 spots for SRR12108837.sra
#没有问题。 进行下一步
数据质控!
# 用 fastqc + multiqc 质控
fastqc -t 10 -o $qcdir $fqdir/SRR*.fastq.gz
# 使用MultiQc整合FastQC结果
multiqc *.zip
质控显示这个数据还是有一定问题的!上图!fastqc+multiqc
可以看到这个测序数据有明显的的GC双峰!
接下来 trim 过滤
因为是双端测序,所以trim_galore命令需要加上--paired 的参数,其他参数可以--help搜一下~
trim_galore -j 20 --phred33 -q 30 --length 36 --stringency 3 --fastqc --paired --max_n 3 -o $cleandata $rawdata/SRR12108837_1.fastq.gz\ $rawdata/SRR12108837_2.fastq.gz
为了快点儿进行。。这里选择了20线程。。。。对不起。。
过滤之后进行数据比对!这里用hisat2
#因为我是小鼠的数据,所以选择 ensemble 数据库的
#选择了v100 版本的 fasta 和 100版本的gtf 文件
hisat2-build Mus_musculus.GRCm38.dna.primary_assembly.fa
Mus_musculus.GRCm38.dna.genome
#再构建一个 cdna 版本的 Mus_musculus.GRCm38.cdna.all.fa
hisat2-build Mus_musculus.GRCm38.cdna.all.fa
Mus_musculus.GRCm38.cdna
看一下日志,有无报错。
primary_assembly.fa 索引文件 1小时20min
Returning block of 242335964 for bucket 8
Exited GFM loop
fchr[A]: 0
fchr[C]: 773280124
fchr[G]: 1325927941
fchr[T]: 1878618059
fchr[$]: 2652783500
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 888467914 bytes to primary GFM file: Mus_musculus.GRCm38.dna.genome.1.ht2
Wrote 663195880 bytes to secondary GFM file: Mus_musculus.GRCm38.dna.genome.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 1165549977 bytes to primary GFM file: Mus_musculus.GRCm38.dna.genome.5.ht2
Wrote 675339278 bytes to secondary GFM file: Mus_musculus.GRCm38.dna.genome.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HierEbwt constructor
Headers:
len: 2652783500
gbwtLen: 2652783501
nodes: 2652783501
sz: 663195875
gbwtSz: 663195876
lineRate: 6
offRate: 4
offMask: 0xfffffff0
ftabChars: 10
eftabLen: 0
eftabSz: 0
ftabLen: 1048577
ftabSz: 4194308
offsLen: 165798969
offsSz: 663195876
lineSz: 64
sideSz: 64
sideGbwtSz: 48
sideGbwtLen: 192
numSides: 13816581
numLines: 13816581
gbwtTotLen: 884261184
gbwtTotSz: 884261184
reverse: 0
linearFM: Yes
Total time for call to driver() for forward index: 01:20:57
(base) t010204@bio1:~/cankaojiyinzu$
Mus_musculus.GRCm38.cdna.all.fa 版本 10分钟
Returning block of 40310005 for bucket 7
Exited GFM loop
fchr[A]: 0
fchr[C]: 57236287
fchr[G]: 110196845
fchr[T]: 164136962
fchr[$]: 218769745
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 110015105 bytes to primary GFM file: Mus_musculus.GRCm38.cdna.1.ht2
Wrote 54692444 bytes to secondary GFM file: Mus_musculus.GRCm38.cdna.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 1053264143 bytes to primary GFM file: Mus_musculus.GRCm38.cdna.5.ht2
Wrote 54828298 bytes to secondary GFM file: Mus_musculus.GRCm38.cdna.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HierEbwt constructor
Headers:
len: 218769745
gbwtLen: 218769746
nodes: 218769746
sz: 54692437
gbwtSz: 54692437
lineRate: 6
offRate: 4
offMask: 0xfffffff0
ftabChars: 10
eftabLen: 0
eftabSz: 0
ftabLen: 1048577
ftabSz: 4194308
offsLen: 13673110
offsSz: 54692440
lineSz: 64
sideSz: 64
sideGbwtSz: 48
sideGbwtLen: 192
numSides: 1139426
numLines: 1139426
gbwtTotLen: 72923264
gbwtTotSz: 72923264
reverse: 0
linearFM: Yes
Total time for call to driver() for forward index: 00:10:36
(base) t010204@bio1:~/cankaojiyinzu$
两种索引方法日志内容一样~~均无报错!接下来就进行比对啦!
写一个简单的 shell 脚本
# !/bin/bash
#index=~/cankaojiyinzu/Mus_musculus.GRCm38.dna.genome
index=~/cankaojiyinzu/Mus_musculus.GRCm38.cdna
inputdir=~/cleandata
outdir=~/cleandata/hisatt
# 单个样本比对
hisat2 -p 20 -x ${index} -1 ${inputdir}/SRR12108837_1_val_1.fq.gz -2 ${inputdir}/SRR12108837_2_val_2.fq.gz -S ${outdir}/SRR12108837.Hisat_cdna_aln.sam
看一下比对结果情况:先看dna的 比对到 96.9%
(base) t010204@bio1:~/cleandata$ cat hi_wrong.txt
nohup: ignoring input
23480224 reads; of these:
23480224 (100.00%) were paired; of these:
1295673 (5.52%) aligned concordantly 0 times
18771102 (79.94%) aligned concordantly exactly 1 time
3413449 (14.54%) aligned concordantly >1 times
----
1295673 pairs aligned concordantly 0 times; of these:
47378 (3.66%) aligned discordantly 1 time
----
1248295 pairs aligned 0 times concordantly or discordantly; of these:
2496590 mates make up the pairs; of these:
1455247 (58.29%) aligned 0 times
900163 (36.06%) aligned exactly 1 time
141180 (5.65%) aligned >1 times
96.90% overall alignment rate
(base) t010204@bio1:~/cleandata$
再看一下 cdna 的 比对到 92.12%
(base) t010204@bio1:~/cleandata$ cat hi_cdna_wrong.txt
nohup: ignoring input
23480224 reads; of these:
23480224 (100.00%) were paired; of these:
2727958 (11.62%) aligned concordantly 0 times
9665614 (41.16%) aligned concordantly exactly 1 time
11086652 (47.22%) aligned concordantly >1 times
----
2727958 pairs aligned concordantly 0 times; of these:
32392 (1.19%) aligned discordantly 1 time
----
2695566 pairs aligned 0 times concordantly or discordantly; of these:
5391132 mates make up the pairs; of these:
3699423 (68.62%) aligned 0 times
692537 (12.85%) aligned exactly 1 time
999172 (18.53%) aligned >1 times
92.12% overall alignment rate
用 Samtools 把比对的 Sam文件 转为二进制 bam 文件
# sam转bam
samtools sort -@ 20 -o SRR12108837.Hisat_aln.sorted.bam SRR12108837.Hisat_aln.sam
# 对bam建索引
samtools index SRR12108837.Hisat_aln.sorted.bam SRR12108837.Hisat_aln.sorted.bam.bai
看一下sam 文件和 bam 文件的大小情况
(base) t010204@bio1:~/cleandata/hisatt$ ll -ah
total 68G
drwxrwxr-x 2 t010204 t010204 4.0K 9月 20 00:29 ./
drwxrwxr-x 5 t010204 t010204 4.0K 9月 20 00:01 ../
-rw-rw-r-- 1 t010204 t010204 44 9月 19 21:21 index.txt
-rw-rw-r-- 1 t010204 t010204 87 9月 20 00:26 samtool_cdna2_sort.txt
-rw-rw-r-- 1 t010204 t010204 87 9月 19 21:04 samtool_cdna_sort.txt
-rw-rw-r-- 1 t010204 t010204 87 9月 19 21:00 samtool_sort.txt
-rw-rw-r-- 1 t010204 t010204 23G 9月 19 20:13 SRR12108837.Hisat_aln.sam
-rw-rw-r-- 1 t010204 t010204 1.9G 9月 19 21:02 SRR12108837.Hisat_aln.sorted.bam
-rw-rw-r-- 1 t010204 t010204 2.1M 9月 19 21:20 SRR12108837.Hisat_aln.sorted.bam.bai
-rw-rw-r-- 1 t010204 t010204 3.1G 9月 20 00:29 SRR12108837.Hisat_cdna2_aln.sorted.bam
-rw-rw-r-- 1 t010204 t010204 37G 9月 19 20:30 SRR12108837.Hisat_cdna_aln.sam
-rw-rw-r-- 1 t010204 t010204 3.1G 9月 19 21:06 SRR12108837.Hisat_cdna_aln.sorted.bam
-rw-rw-r-- 1 t010204 t010204 6.4M 9月 19 21:22 SRR12108837.Hisat_cdna_aln.sorted.bam.bai
可以看到 Sam 文件比 bam 大很多。而且 cDNA要比dna 比对得到的结果更大。(可以看到有两个cdna的bam文件,是因为害怕这里出错,所以用samtools 运行了两遍,看看两遍大小是否一致!)
走featureCounts定量流程
写一个 featurecounts shell 脚本因为之前用v100 .fa作为索引,所以 这里用v100 gtf 文件作为注释文件,保证版本一致性。
# !/bin/bash
gtf=~/cankaojiyinzu/Mus_musculus.GRCm38.100.chr.gtf.gz
inputdir=~/cleandata/hisatt
outdir=~/cleandata/feature_counts
featureCounts -T 20 -p -t exon -g gene_id -a $gtf -o $outdir/all.cdna_id.txt $inputdir/*.Hisat_cdna_aln.sorted.bam
看一下结果!dna 比对了 66.5% 0.39 分钟
nohup: ignoring input
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v2.0.1
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| o SRR12108837.Hisat_aln.sorted.bam ||
|| ||
|| Output file : all.id.txt ||
|| Summary : all.id.txt.summary ||
|| Annotation : Mus_musculus.GRCm38.100.chr.gtf.gz (GTF) ||
|| Dir for temp files : /home/data/t010204/cleandata/feature_counts ||
|| ||
|| Threads : 20 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file Mus_musculus.GRCm38.100.chr.gtf.gz ... ||
|| Features : 843402 ||
|| Meta-features : 55401 ||
|| Chromosomes/contigs : 22 ||
|| ||
|| Process BAM file SRR12108837.Hisat_aln.sorted.bam... ||
|| Paired-end reads are included. ||
|| Total alignments : 27509024 ||
|| Successfully assigned alignments : 18300209 (66.5%) ||
|| Running time : 0.39 minutes ||
|| ||
|| Write the final count table. ||
|| Write the read assignment summary. ||
|| ||
|| Summary of counting results can be found in file "/home/data/t010204/clea ||
|| ndata/feature_counts/all.id.txt.summary" ||
|| ||
\\============================================================================//
再看一下cdna 的 成功比对0 ?????
nohup: ignoring input
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v2.0.1
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| o SRR12108837.Hisat_cdna2_aln.sorted.bam ||
|| ||
|| Output file : all.cdna2_id.txt ||
|| Summary : all.cdna2_id.txt.summary ||
|| Annotation : Mus_musculus.GRCm38.100.gtf.gz (GTF) ||
|| Dir for temp files : /home/data/t010204/cleandata/feature_counts ||
|| ||
|| Threads : 20 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file Mus_musculus.GRCm38.100.gtf.gz ... ||
|| Features : 843712 ||
|| Meta-features : 55487 ||
|| Chromosomes/contigs : 45 ||
|| ||
|| Process BAM file SRR12108837.Hisat_cdna2_aln.sorted.bam... ||
|| Paired-end reads are included. ||
|| Total alignments : 45117185 ||
|| Successfully assigned alignments : 0 (0.0%) ||
|| Running time : 0.65 minutes ||
|| ||
|| Write the final count table. ||
|| Write the read assignment summary. ||
|| ||
|| Summary of counting results can be found in file "/home/data/t010204/clea ||
|| ndata/feature_counts/all.cdna2_id.txt.summary" ||
|| ||
\\============================================================================//
cdna的那个比对后,不能靠gtf定量, 因为gtf是描述基因在dna的坐标。
cdna是没有坐标的,染色体都没有!! 这才懂得为什么出现没有比对上的情况!!!
人傻了。。