你真的懂Illumina数据质量控制吗?
1. FastQC察看
2. 进行reads的修剪和过滤
接头序列:
1 2 3 4 |
>PrefixPE/1 TACACTCTTTCCCTACACGACGCTCTTCCGATCT >PrefixPE/2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT |
Trimmomatic等通常的质控软件。
接头序列:technote_nextera_matepair_data_processing.pdf
针对此类数据的处理软件主要是:nextclip和skewer,从文章结果来看后者略优。【Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads】
处理软件:nextclip (同时移除PCR duplicates)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
./nextclip -d -i ~/AS/raw_reads/AS8K_R1.fastq -j ~/AS/raw_reads/AS8K_R2.fastq -o output
NextClip v1.3.2 n: 18 Creating hash tables for duplicate storage... Opening input filename /public/home/zpxu/AS/raw_reads/AS8K_R1.fastq |
如上,若出现too much rehashing!! Rehash=26的错误信息则增大[-n | --number_of_reads] Approximate number of reads (default 20,000,000)参数值;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 |
./nextclip -d -e -i ~/AS/raw_reads/AS3K_R1.fastq -j ~/AS/raw_reads/AS3K_R2.fastq -o 33AS3K -n 30000000
NextClip v1.3.2 n: 19 Creating hash tables for duplicate storage... Opening input filename /public/home/zpxu/AS/raw_reads/AS3K_R1.fastq Hash: Counting duplicates... 8% [======== ]Warning: count (999) exceeds maximum - treated as 999 SUMMARY Strict match parameters: 34, 18 Number of read pairs: 72966745 R1 Num reads with adaptor: 16173479 22.17 % R2 Num reads with adaptor: 14902833 20.42 % Total pairs in category A: 11389962 15.61 % Total pairs in category B: 3422082 4.69 % Total pairs in category C: 4565902 6.26 % Total pairs in category D: 8649889 11.85 % Total pairs in category E: 308404 0.42 % Total usable pairs: 10130967 13.88 % Category B became E: 90789 0.12 % Done. Completed in 4414 seconds. |
结果文件中的A,B和C category合并后用于后续分析。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
./skewer -m mp -i ~/AS/raw_reads/AS8K_R1.fastq ~/AS/raw_reads/AS8K_R2.fastq -o AS8K -t 5 .--. .-. : .--': :.-. `. `. : `'.' .--. .-..-..-. .--. .--. _`, :: . `.' '_.': `; `; :' '_.': ..' `.__.':_;:_;`.__.'`.__.__.'`.__.':_; skewer v0.2.2 [April 4, 2016] Parameters used: -- 3' end adapter sequence (-x): AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA -- junction adapter sequence (-j): CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG -- maximum error ratio allowed (-r): 0.100 -- maximum indel error ratio allowed (-d): 0.030 -- minimum read length allowed after trimming (-l): 18 -- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected) -- minimum overlap length for junction adapter detection (-k): 19 -- redistribute reads based on junction information (-i): yes Sat Sep 2 21:43:52 2017 >> started |=================================================>| (100.00%) Sat Sep 2 22:53:41 2017 >> done (4120.338s) 61751767 read pairs processed; of these: 2750894 ( 4.45%) non-junction read pairs filtered out by contaminant control 6682698 (10.82%) short read pairs filtered out after trimming by size control 31458966 (50.94%) empty read pairs filtered out after trimming by size control 20859209 (33.78%) read pairs available; of these: 17141247 (82.18%) trimmed read pairs available after processing 3717962 (17.82%) untrimmed read pairs available after processing log has been saved to "/public/home/zpxu/AS/raw_reads/AS8K_R1-trimmed.log". |
After trimming and quality filtering, 56% of long-insert reads from each of the three mate-pair libraries and 95% of paired-end reads were retained on average.
实际同样数据运行结果比较
FastQ files | read length | median | mean | stdev | FF | FR | RF | RR |
---|---|---|---|---|---|---|---|---|
AS3K_R2_nextclip.fq.is.txt | 83 | 2510 | 2307.246 | 764.2089 | 36 | 781 | 9164 | 19 |
AS5K_R2_nextclip.fq.is.txt | 84 | 4444 | 3974.787 | 1506.76 | 54 | 773 | 9127 | 46 |
AS8K_R2_nextclip.fq.is.txt | 81 | 5825 | 4733.265 | 2530.765 | 150 | 1167 | 8543 | 140 |
AS3K-trimmed-pair2.fastq.is.txt.skewer | 102 | 2493 | 2319.141 | 725.1056 | 26 | 1912 | 8049 | 13 |
AS5K-trimmed-pair2.fastq.is.txt.skewer | 106 | 4460 | 4042.562 | 1443.632 | 30 | 2333 | 7608 | 29 |
AS8K-trimmed-pair2.fastq.is.txt.skewer | 111 | 5945 | 4935.393 | 2446.029 | 112 | 3607 | 6212 | 69 |
3. FastUniq 去除 paired reads 的PCR重复
建议先trim,然后在来用这个软件来去除dup,因为,这个软件是比较以后,随机保留相同的pair的中一个,如果不先trim,容易保留质量差的哪一个,而且即使trim后,它也能处理不同长度的pair。 【每日一生信—FastUniq去除paired reads的duplicates】
1 2 3 4 |
cat AS285.list AS285A_R1.clean.fastq AS285A_R2.clean.fastq fastuniq -i AS285.list -o AS285A_R1.rd.clean.fastq -p AS285A_R2.rd.clean.fastq |
或者多个文库写在同一个input_list.txt时输出结果会将多个文库合并成一个文件;
1 2 3 4 5 6 |
cat input_list.txt input_R1_1.fastq input_R1_2.fastq input_R2_1.fastq input_R2_2.fastq fastuniq -i input_list.txt -t q -o output_1.fastq -p output_2.fastq -c 1 |
报错:内存问题,在大内存节点运行。
1 | Error in Reading pair-end FASTQ sequence! |
4. 进行reads 的纠正
BLESS和Musket有相似的纠正结果,前者一直报错;
BLESS
1 2 3 4 |
source /public/home/software/.bashrc module load BLESS/1.02 cd /public/home/zpxu/AS/clean_reads bless -read1 AS3k_R1.rd.clean.fastq -read2 AS3k_R2.rd.clean.fastq -prefix ../bless/AS3k_R12.rd.clean -kmerlength 31 |
报错:
1 2 3 |
Checking input read files
ERROR: Irregular quality score range 35-75 |
Musket - a multistage k-mer spectrum based corrector
1 | musket AS485_R1.rd.clean.fastq AS485_R2.rd.clean.fastq -omulti AS485 -inorder -p 10 |
至此,经过Trim,去PCR duplicates和纠正后的reads可用于后续的基因组组装等其他分析。