你真的懂Illumina数据质量控制吗?

1. FastQC察看

2. 进行reads的修剪和过滤

Short-insert paired end reads

接头序列:

1
2
3
4
>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

Trimmomatic等通常的质控软件。

Long Mate Pair libraries

接头序列:technote_nextera_matepair_data_processing.pdf

针对此类数据的处理软件主要是:nextclip和skewer,从文章结果来看后者略优。【Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads】

处理软件:nextclip (同时移除PCR duplicates)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
./nextclip -d -i ~/AS/raw_reads/AS8K_R1.fastq -j ~/AS/raw_reads/AS8K_R2.fastq -o output

NextClip v1.3.2

n: 18
               b: 100
         Entries: 26214400
      Entry size: 24
 Memory required: 856 MB

Creating hash tables for duplicate storage...
Hash:
unique kmers: 0
Capacity: 26214400
Occupied: 0.00%
Pruned: 0 (-nan%)
Collisions:
Adaptor: CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG

Opening input filename /public/home/zpxu/AS/raw_reads/AS8K_R1.fastq
Opening input filename /public/home/zpxu/AS/raw_reads/AS8K_R2.fastq
Opening output file output_A_R1.fastq
Opening output file output_A_R2.fastq
Opening output file output_B_R1.fastq
Opening output file output_B_R2.fastq
Opening output file output_C_R1.fastq
Opening output file output_C_R2.fastq
Opening output file output_D_R1.fastq
Opening output file output_D_R2.fastq
Hash:
unique kmers: 25546550
Capacity: 26214400
Occupied: 97.45%
Pruned: 0 (0.00%)
Collisions:
        tries 0: 32059454
        tries 1: 649509
        tries 2: 192228
        tries 3: 76759
        tries 4: 35001
        tries 5: 17641
        tries 6: 9035
        tries 7: 4943
        tries 8: 2774
        tries 9: 1495
too much rehashing!! Rehash=26

如上,若出现too much rehashing!! Rehash=26的错误信息则增大[-n | --number_of_reads] Approximate number of reads (default 20,000,000)参数值;

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
./nextclip -d -e -i ~/AS/raw_reads/AS3K_R1.fastq -j ~/AS/raw_reads/AS3K_R2.fastq -o 33AS3K -n 30000000

NextClip v1.3.2

n: 19
               b: 100
         Entries: 52428800
      Entry size: 24
 Memory required: 1456 MB

Creating hash tables for duplicate storage...
Hash:
unique kmers: 0
Capacity: 52428800
Occupied: 0.00%
Pruned: 0 (-nan%)
Collisions:
Adaptor: CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG

Opening input filename /public/home/zpxu/AS/raw_reads/AS3K_R1.fastq
Opening input filename /public/home/zpxu/AS/raw_reads/AS3K_R2.fastq
Opening output file 33AS3K_A_R1.fastq
Opening output file 33AS3K_A_R2.fastq
Opening output file 33AS3K_B_R1.fastq
Opening output file 33AS3K_B_R2.fastq
Opening output file 33AS3K_C_R1.fastq
Opening output file 33AS3K_C_R2.fastq
Opening output file 33AS3K_D_R1.fastq
Opening output file 33AS3K_D_R2.fastq
Opening output file 33AS3K_E_R1.fastq
Opening output file 33AS3K_E_R2.fastq
Warning: read shorter than minimum read size (64) - ignoring
GC bases: 9705583313  AT bases: 12168077887

Hash:
unique kmers: 28281698
Capacity: 52428800
Occupied: 53.94%
Pruned: 0 (0.00%)
Collisions:
        tries 0: 72912204

Counting duplicates...

8%      [========                                                                                            ]Warning: count (999) exceeds maximum - treated as 999
10%     [==========                                                                                          ]Warning: count (999) exceeds maximum - treated as 999
20%     [====================                                                                                ]Warning: count (999) exceeds maximum - treated as 999
23%     [=======================                                                                             ]Warning: count (999) exceeds maximum - treated as 999
25%     [=========================                                                                           ]Warning: count (999) exceeds maximum - treated as 999
67%     [===================================================================                                 ]Warning: count (999) exceeds maximum - treated as 999
83%     [===================================================================================                 ]Warning: count (999) exceeds maximum - treated as 999
100%    [====================================================================================================]

SUMMARY

Strict match parameters: 34, 18
   Relaxed match parameters: 32, 17
          Minimum read size: 25
                  Trim ends: 19

Number of read pairs: 72966745
  Number of duplicate pairs: 44626706  61.16 %
Number of pairs containing N: 54541     0.07 %

R1 Num reads with adaptor: 16173479  22.17 %
  R1 Num with external also: 4375021   6.00 %
      R1 long adaptor reads: 11092453  15.20 %
         R1 reads too short: 5081026   6.96 %
    R1 Num reads no adaptor: 12162760  16.67 %
 R1 no adaptor but external: 5248876   7.19 %

R2 Num reads with adaptor: 14902833  20.42 %
  R2 Num with external also: 4406543   6.04 %
      R2 long adaptor reads: 9987006   13.69 %
         R2 reads too short: 4915827   6.74 %
    R2 Num reads no adaptor: 13433406  18.41 %
 R2 no adaptor but external: 5653578   7.75 %

Total pairs in category A: 11389962  15.61 %
        A pairs long enough: 5627734   7.71 %
          A pairs too short: 5762228   7.90 %
A external clip in 1 or both: 18225     0.02 %
    A bases before clipping: 3416988600
      A total bases written: 749338798

Total pairs in category B: 3422082   4.69 %
        B pairs long enough: 1695273   2.32 %
          B pairs too short: 1726809   2.37 %
B external clip in 1 or both: 47947     0.07 %
    B bases before clipping: 1026624600
      B total bases written: 323696037

Total pairs in category C: 4565902   6.26 %
        C pairs long enough: 2610991   3.58 %
          C pairs too short: 1954911   2.68 %
C external clip in 1 or both: 143843    0.20 %
    C bases before clipping: 1369770600
      C total bases written: 509149505

Total pairs in category D: 8649889   11.85 %
        D pairs long enough: 3667738   5.03 %
          D pairs too short: 4982151   6.83 %
D external clip in 1 or both: 5627647   7.71 %
    D bases before clipping: 2594966700
      D total bases written: 899148840

Total pairs in category E: 308404    0.42 %
        E pairs long enough: 196969    0.27 %
          E pairs too short: 111435    0.15 %
E external clip in 1 or both: 37111     0.05 %
    E bases before clipping: 92521200
      E total bases written: 29751268

Total usable pairs: 10130967  13.88 %
            All long enough: 13798705  18.91 %
   All categories too short: 14537534  19.92 %
     Duplicates not written: 44630506  61.17 %

Category B became E: 90789     0.12 %
        Category C became E: 217615    0.30 %
         Overall GC content: 44.37 %

Done. Completed in 4414 seconds.

结果文件中的A,B和C category合并后用于后续分析。

处理软件: skewer
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
./skewer -m mp -i ~/AS/raw_reads/AS8K_R1.fastq ~/AS/raw_reads/AS8K_R2.fastq -o AS8K -t 5
.--. .-.
: .--': :.-.
`. `. : `'.' .--. .-..-..-. .--. .--.
_`, :: . `.' '_.': `; `; :' '_.': ..'
`.__.':_;:_;`.__.'`.__.__.'`.__.':_;
skewer v0.2.2 [April 4, 2016]
Parameters used:
-- 3' end adapter sequence (-x):        AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
-- junction adapter sequence (-j):      CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
-- maximum error ratio allowed (-r):    0.100
-- maximum indel error ratio allowed (-d):      0.030
-- minimum read length allowed after trimming (-l):     18
-- file format (-f):            Sanger/Illumina 1.8+ FASTQ (auto detected)
-- minimum overlap length for junction adapter detection (-k):  19
-- redistribute reads based on junction information (-i):       yes
Sat Sep  2 21:43:52 2017 >> started
|=================================================>| (100.00%)
Sat Sep  2 22:53:41 2017 >> done (4120.338s)
61751767 read pairs processed; of these:
2750894 ( 4.45%) non-junction read pairs filtered out by contaminant control
6682698 (10.82%) short read pairs filtered out after trimming by size control
31458966 (50.94%) empty read pairs filtered out after trimming by size control
20859209 (33.78%) read pairs available; of these:
17141247 (82.18%) trimmed read pairs available after processing
3717962 (17.82%) untrimmed read pairs available after processing
log has been saved to "/public/home/zpxu/AS/raw_reads/AS8K_R1-trimmed.log".
两类reads的去除比例

After trimming and quality filtering, 56% of long-insert reads from each of the three mate-pair libraries and 95% of paired-end reads were retained on average.

实际同样数据运行结果比较

FastQ files read length median mean stdev FF FR RF RR
AS3K_R2_nextclip.fq.is.txt 83 2510 2307.246 764.2089 36 781 9164 19
AS5K_R2_nextclip.fq.is.txt 84 4444 3974.787 1506.76 54 773 9127 46
AS8K_R2_nextclip.fq.is.txt 81 5825 4733.265 2530.765 150 1167 8543 140
AS3K-trimmed-pair2.fastq.is.txt.skewer 102 2493 2319.141 725.1056 26 1912 8049 13
AS5K-trimmed-pair2.fastq.is.txt.skewer 106 4460 4042.562 1443.632 30 2333 7608 29
AS8K-trimmed-pair2.fastq.is.txt.skewer 111 5945 4935.393 2446.029 112 3607 6212 69

3. FastUniq 去除 paired reads 的PCR重复

建议先trim,然后在来用这个软件来去除dup,因为,这个软件是比较以后,随机保留相同的pair的中一个,如果不先trim,容易保留质量差的哪一个,而且即使trim后,它也能处理不同长度的pair。 【每日一生信—FastUniq去除paired reads的duplicates】

单个文库/input_list.txt 多次运行
1
2
3
4
cat AS285.list
AS285A_R1.clean.fastq
AS285A_R2.clean.fastq
fastuniq -i AS285.list -o AS285A_R1.rd.clean.fastq -p AS285A_R2.rd.clean.fastq

或者多个文库写在同一个input_list.txt时输出结果会将多个文库合并成一个文件;

1
2
3
4
5
6
cat input_list.txt
input_R1_1.fastq
input_R1_2.fastq
input_R2_1.fastq
input_R2_2.fastq
fastuniq -i input_list.txt -t q -o output_1.fastq -p output_2.fastq -c 1

报错:内存问题,在大内存节点运行。

1 Error in Reading pair-end FASTQ sequence!

4. 进行reads 的纠正

BLESSMusket有相似的纠正结果,前者一直报错;

BLESS

1
2
3
4
source /public/home/software/.bashrc
module load BLESS/1.02
cd /public/home/zpxu/AS/clean_reads
bless -read1 AS3k_R1.rd.clean.fastq -read2 AS3k_R2.rd.clean.fastq -prefix ../bless/AS3k_R12.rd.clean -kmerlength 31

报错:

1
2
3
Checking input read files

ERROR: Irregular quality score range 35-75

Musket - a multistage k-mer spectrum based corrector

1 musket AS485_R1.rd.clean.fastq AS485_R2.rd.clean.fastq -omulti AS485 -inorder -p 10

至此,经过Trim,去PCR duplicates和纠正后的reads可用于后续的基因组组装等其他分析。

(0)

相关推荐