希望所有学员都可以站在生信技能树的舞台上发光发热！

下面是2020生信入门班学员的随机投稿

因为平时做单细胞比较多。。突然想试试自己还会不会常规转录组的上游分析。。记忆一下各个软件的使用流程也是不错的

首先选择数据 SRX8632887: GSM4645172: M1h rep2; Mus musculus; RNA-Seq

因为 ebi 数据库没有这个数据的相关信息因为这个数据刚刚发布，所以我就使用了 prefetch 下载。

#是一个小鼠的 rna-seq 数据 prefetch SRR12108837 # 然后验证一下 vdb-validate SRR12108837.sra #结果显示完整的没有问题！ 2021-09-19T15:07:16 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' metadata: md5 ok 2021-09-19T15:07:16 vdb-validate.2.10.7 info: Table 'SEQUENCE' metadata: md5 ok 2021-09-19T15:07:17 vdb-validate.2.10.7 info: Column 'QUALITY': checksums ok 2021-09-19T15:07:19 vdb-validate.2.10.7 info: Column 'READ': checksums ok 2021-09-19T15:07:19 vdb-validate.2.10.7 info: Column 'SPOT_GROUP': checksums ok 2021-09-19T15:07:19 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' contains only unaligned reads 2021-09-19T15:07:19 vdb-validate.2.10.7 info: Database 'SRR12108837.sra' is consistent

下一步就是把这个 sra 转变为 fastqc

代码如下:

fastq-dump --gzip --split-files -O ${fqdir} SRR12108837.sra #因为是双端测序生成如下两个文件！ SRR12108837_1.fastq.gz SRR12108837_2.fastq.gz #看一下输出日志 (base) t010204@bio1:~/SRR12108837$ cat FASTQ_DUMP.txt nohup: ignoring input Read 23482567 spots for SRR12108837.sra Written 23482567 spots for SRR12108837.sra #没有问题。进行下一步

数据质控！

# 用 fastqc + multiqc 质控 fastqc -t 10 -o $qcdir $fqdir/SRR*.fastq.gz # 使用MultiQc整合FastQC结果 multiqc *.zip

质控显示这个数据还是有一定问题的！上图！fastqc+multiqc

可以看到这个测序数据有明显的的GC双峰!

接下来 trim 过滤

因为是双端测序，所以trim_galore命令需要加上--paired 的参数,其他参数可以--help搜一下~

trim_galore -j 20 --phred33 -q 30 --length 36 --stringency 3 --fastqc --paired --max_n 3 -o $cleandata $rawdata/SRR12108837_1.fastq.gz\ $rawdata/SRR12108837_2.fastq.gz

为了快点儿进行。。这里选择了20线程。。。。对不起。。

过滤之后进行数据比对！这里用hisat2

#因为我是小鼠的数据，所以选择 ensemble 数据库的 #选择了v100 版本的 fasta 和 100版本的gtf 文件

hisat2-build Mus_musculus.GRCm38.dna.primary_assembly.fa Mus_musculus.GRCm38.dna.genome #再构建一个 cdna 版本的 Mus_musculus.GRCm38.cdna.all.fa hisat2-build Mus_musculus.GRCm38.cdna.all.fa Mus_musculus.GRCm38.cdna

看一下日志，有无报错。

primary_assembly.fa 索引文件 1小时20min

Returning block of 242335964 for bucket 8 Exited GFM loop fchr[A]: 0 fchr[C]: 773280124 fchr[G]: 1325927941 fchr[T]: 1878618059 fchr[$]: 2652783500 Exiting GFM::buildToDisk() Returning from initFromVector Wrote 888467914 bytes to primary GFM file: Mus_musculus.GRCm38.dna.genome.1.ht2 Wrote 663195880 bytes to secondary GFM file: Mus_musculus.GRCm38.dna.genome.2.ht2 Re-opening _in1 and _in2 as input streams Returning from GFM constructor Returning from initFromVector Wrote 1165549977 bytes to primary GFM file: Mus_musculus.GRCm38.dna.genome.5.ht2 Wrote 675339278 bytes to secondary GFM file: Mus_musculus.GRCm38.dna.genome.6.ht2 Re-opening _in5 and _in5 as input streams Returning from HierEbwt constructor Headers: len: 2652783500 gbwtLen: 2652783501 nodes: 2652783501 sz: 663195875 gbwtSz: 663195876 lineRate: 6 offRate: 4 offMask: 0xfffffff0 ftabChars: 10 eftabLen: 0 eftabSz: 0 ftabLen: 1048577 ftabSz: 4194308 offsLen: 165798969 offsSz: 663195876 lineSz: 64 sideSz: 64 sideGbwtSz: 48 sideGbwtLen: 192 numSides: 13816581 numLines: 13816581 gbwtTotLen: 884261184 gbwtTotSz: 884261184 reverse: 0 linearFM: Yes Total time for call to driver() for forward index: 01:20:57 (base) t010204@bio1:~/cankaojiyinzu$

Mus_musculus.GRCm38.cdna.all.fa 版本 10分钟

Returning block of 40310005 for bucket 7 Exited GFM loop fchr[A]: 0 fchr[C]: 57236287 fchr[G]: 110196845 fchr[T]: 164136962 fchr[$]: 218769745 Exiting GFM::buildToDisk() Returning from initFromVector Wrote 110015105 bytes to primary GFM file: Mus_musculus.GRCm38.cdna.1.ht2 Wrote 54692444 bytes to secondary GFM file: Mus_musculus.GRCm38.cdna.2.ht2 Re-opening _in1 and _in2 as input streams Returning from GFM constructor Returning from initFromVector Wrote 1053264143 bytes to primary GFM file: Mus_musculus.GRCm38.cdna.5.ht2 Wrote 54828298 bytes to secondary GFM file: Mus_musculus.GRCm38.cdna.6.ht2 Re-opening _in5 and _in5 as input streams Returning from HierEbwt constructor Headers: len: 218769745 gbwtLen: 218769746 nodes: 218769746 sz: 54692437 gbwtSz: 54692437 lineRate: 6 offRate: 4 offMask: 0xfffffff0 ftabChars: 10 eftabLen: 0 eftabSz: 0 ftabLen: 1048577 ftabSz: 4194308 offsLen: 13673110 offsSz: 54692440 lineSz: 64 sideSz: 64 sideGbwtSz: 48 sideGbwtLen: 192 numSides: 1139426 numLines: 1139426 gbwtTotLen: 72923264 gbwtTotSz: 72923264 reverse: 0 linearFM: Yes Total time for call to driver() for forward index: 00:10:36 (base) t010204@bio1:~/cankaojiyinzu$

两种索引方法日志内容一样~~均无报错！接下来就进行比对啦！

写一个简单的 shell 脚本

# !/bin/bash #index=~/cankaojiyinzu/Mus_musculus.GRCm38.dna.genome index=~/cankaojiyinzu/Mus_musculus.GRCm38.cdna inputdir=~/cleandata outdir=~/cleandata/hisatt # 单个样本比对 hisat2 -p 20 -x ${index} -1 ${inputdir}/SRR12108837_1_val_1.fq.gz -2 ${inputdir}/SRR12108837_2_val_2.fq.gz -S ${outdir}/SRR12108837.Hisat_cdna_aln.sam

看一下比对结果情况:先看dna的比对到 96.9%

(base) t010204@bio1:~/cleandata$ cat hi_wrong.txt nohup: ignoring input 23480224 reads; of these: 23480224 (100.00%) were paired; of these: 1295673 (5.52%) aligned concordantly 0 times 18771102 (79.94%) aligned concordantly exactly 1 time 3413449 (14.54%) aligned concordantly >1 times ---- 1295673 pairs aligned concordantly 0 times; of these: 47378 (3.66%) aligned discordantly 1 time ---- 1248295 pairs aligned 0 times concordantly or discordantly; of these: 2496590 mates make up the pairs; of these: 1455247 (58.29%) aligned 0 times 900163 (36.06%) aligned exactly 1 time 141180 (5.65%) aligned >1 times 96.90% overall alignment rate (base) t010204@bio1:~/cleandata$

再看一下 cdna 的比对到 92.12%

(base) t010204@bio1:~/cleandata$ cat hi_cdna_wrong.txt nohup: ignoring input 23480224 reads; of these: 23480224 (100.00%) were paired; of these: 2727958 (11.62%) aligned concordantly 0 times 9665614 (41.16%) aligned concordantly exactly 1 time 11086652 (47.22%) aligned concordantly >1 times ---- 2727958 pairs aligned concordantly 0 times; of these: 32392 (1.19%) aligned discordantly 1 time ---- 2695566 pairs aligned 0 times concordantly or discordantly; of these: 5391132 mates make up the pairs; of these: 3699423 (68.62%) aligned 0 times 692537 (12.85%) aligned exactly 1 time 999172 (18.53%) aligned >1 times 92.12% overall alignment rate

用 Samtools 把比对的 Sam文件转为二进制 bam 文件

# sam转bam samtools sort -@ 20 -o SRR12108837.Hisat_aln.sorted.bam SRR12108837.Hisat_aln.sam # 对bam建索引 samtools index SRR12108837.Hisat_aln.sorted.bam SRR12108837.Hisat_aln.sorted.bam.bai

看一下sam 文件和 bam 文件的大小情况

(base) t010204@bio1:~/cleandata/hisatt$ ll -ah total 68G drwxrwxr-x 2 t010204 t010204 4.0K 9月 20 00:29 ./ drwxrwxr-x 5 t010204 t010204 4.0K 9月 20 00:01 ../ -rw-rw-r-- 1 t010204 t010204 44 9月 19 21:21 index.txt -rw-rw-r-- 1 t010204 t010204 87 9月 20 00:26 samtool_cdna2_sort.txt -rw-rw-r-- 1 t010204 t010204 87 9月 19 21:04 samtool_cdna_sort.txt -rw-rw-r-- 1 t010204 t010204 87 9月 19 21:00 samtool_sort.txt -rw-rw-r-- 1 t010204 t010204 23G 9月 19 20:13 SRR12108837.Hisat_aln.sam -rw-rw-r-- 1 t010204 t010204 1.9G 9月 19 21:02 SRR12108837.Hisat_aln.sorted.bam -rw-rw-r-- 1 t010204 t010204 2.1M 9月 19 21:20 SRR12108837.Hisat_aln.sorted.bam.bai -rw-rw-r-- 1 t010204 t010204 3.1G 9月 20 00:29 SRR12108837.Hisat_cdna2_aln.sorted.bam -rw-rw-r-- 1 t010204 t010204 37G 9月 19 20:30 SRR12108837.Hisat_cdna_aln.sam -rw-rw-r-- 1 t010204 t010204 3.1G 9月 19 21:06 SRR12108837.Hisat_cdna_aln.sorted.bam -rw-rw-r-- 1 t010204 t010204 6.4M 9月 19 21:22 SRR12108837.Hisat_cdna_aln.sorted.bam.bai

可以看到 Sam 文件比 bam 大很多。而且 cDNA要比dna 比对得到的结果更大。(可以看到有两个cdna的bam文件，是因为害怕这里出错，所以用samtools 运行了两遍，看看两遍大小是否一致！)

走featureCounts定量流程

写一个 featurecounts shell 脚本因为之前用v100 .fa作为索引，所以这里用v100 gtf 文件作为注释文件，保证版本一致性。

# !/bin/bash gtf=~/cankaojiyinzu/Mus_musculus.GRCm38.100.chr.gtf.gz inputdir=~/cleandata/hisatt outdir=~/cleandata/feature_counts


featureCounts -T 20 -p -t exon -g gene_id -a $gtf -o $outdir/all.cdna_id.txt  $inputdir/*.Hisat_cdna_aln.sorted.bam

看一下结果！dna 比对了 66.5% 0.39 分钟

nohup: ignoring input


==========     _____ _    _ ____  _____  ______          _____  

        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 

          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |

            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |

              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |

        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/

   v2.0.1
//========================== featureCounts setting ===========================\\

||                                                                            ||

||             Input files : 1 BAM file                                       ||

||                           o SRR12108837.Hisat_aln.sorted.bam               ||

||                                                                            ||

||             Output file : all.id.txt                                       ||

||                 Summary : all.id.txt.summary                               ||

||              Annotation : Mus_musculus.GRCm38.100.chr.gtf.gz (GTF)         ||

||      Dir for temp files : /home/data/t010204/cleandata/feature_counts      ||

||                                                                            ||

||                 Threads : 20                                               ||

||                   Level : meta-feature level                               ||

||              Paired-end : yes                                              ||

||      Multimapping reads : not counted                                      ||

|| Multi-overlapping reads : not counted                                      ||

||   Min overlapping bases : 1                                                ||

||                                                                            ||

||          Chimeric reads : counted                                          ||

||        Both ends mapped : not required                                     ||

||                                                                            ||

\\============================================================================//
//================================= Running ==================================\\

||                                                                            ||

|| Load annotation file Mus_musculus.GRCm38.100.chr.gtf.gz ...                ||

||    Features : 843402                                                       ||

||    Meta-features : 55401                                                   ||

||    Chromosomes/contigs : 22                                                ||

||                                                                            ||

|| Process BAM file SRR12108837.Hisat_aln.sorted.bam...                       ||

||    Paired-end reads are included.                                          ||

||    Total alignments : 27509024                                             ||

||    Successfully assigned alignments : 18300209 (66.5%)                     ||

||    Running time : 0.39 minutes                                             ||

||                                                                            ||

|| Write the final count table.                                               ||

|| Write the read assignment summary.                                         ||

||                                                                            ||

|| Summary of counting results can be found in file "/home/data/t010204/clea  ||

|| ndata/feature_counts/all.id.txt.summary"                                   ||

||                                                                            ||

\\============================================================================//

再看一下cdna 的成功比对0 ？？？？？