生信编程9.bedtools从bam文件中提取目的序列
有一些五六年前的学生们都成长为了各个生物信息学相关公司的小领导,而且他们都有了自己的公众号,知乎号,也算是一番人物。最近他们跟我反馈面试找不到或者说很难直接考核筛选到认真干活的生信工程师,挺有意思的。让我想起来了早在生信技能树论坛创立之初我为了引流,而规划的200个生信工程师面试题。值得继续分享:
使用bedtools进行坐标映射:
将bam文件转换为bed文件
bedtools bamtobed -i ../CA_N_805_RNA_recal.bam > CA_N_805_RNA_recal.bed
结果文件如下
chr1 13146 13296 ST-E00192:663:HL5C3CCXY:4:2221:19055:55561/1 60 +
chr1 13244 13394 ST-E00192:663:HL5C3CCXY:4:2221:19055:55561/2 60 -
chr1 14184 14334 ST-E00192:663:HL5C3CCXY:4:1207:15168:3436/1 60 +
取出bed文件中与编辑位点有交集的记录,并同时输出两个文件的原始记录
bedtools intersect -a CA_N_805_RNA_recal.bed -b ../CA_N_805_RNA.AtoI_nonAlu_rmHomopolymer.avinput -wo > CA_N_805_RNA.AtoI_nonAlu_rmHomopolymer.bed
结果文件如下
chr1 14462 14612 ST-E00192:663:HL5C3CCXY:4:2205:20811:67709/1 60 + chr1 14610 14610 T C het 495.64 51 0
chr1 14462 14612 ST-E00192:663:HL5C3CCXY:4:2205:20882:67867/1 60 + chr1 14610 14610 T C het 495.64 51 0
chr1 14466 14616 ST-E00192:663:HL5C3CCXY:4:2101:13514:7778/1 60 + chr1 14610 14610 T C het 495.64 51 0
将bam文件转换为fastq文件
bedtools bamtofastq -i ../CA_N_805_RNA_recal.bam -fq CA_N_805_RNA_recal_1.fq -fq2 CA_N_805_RNA_recal_2.fq
结果文件如下
@ST-E00192:663:HL5C3CCXY:4:2221:19055:55561/1
CTGAGGAAGGAGAAGGGGATGCACTGTTGGGGAGGCAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTGTCCT
+
RR7;8;EEBDFCFFCDDDFFBCFCFBCGBDDDFCDCFCCFBCDFCFDEFFCC@FGDCC@FDFABFD@@FC;FFCCCFCDDC?EECFCCBEB?EEEBCCEEFCEAB?EBBEECBBABEC?EADEB?EBBAEECBEB??;CBEB??EABDNT
将fastq文件转换为fasta文件
perl -alne '{print $_ if $.%4==1|$.%4==2}' CA_N_805_RNA_recal_1.fq | perl -alne '{s/@/>/g;print}' > CA_N_805_RNA_recal_1.fa
perl -alne '{print $_ if $.%4==1|$.%4==2}' CA_N_805_RNA_recal_2.fq | perl -alne '{s/@/>/g;print}' > CA_N_805_RNA_recal_2.fa
结果文件如下
>ST-E00192:663:HL5C3CCXY:4:2221:19055:55561/1
CTGAGGAAGGAGAAGGGGATGCACTGTTGGGGAGGCAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTGTCCT
>ST-E00192:663:HL5C3CCXY:4:1207:15168:3436/1
TGTCACTGAAACCTTTTTTGTGGGAGACTATTCCTCCCATCTGCAACAGCTGCCCCTGCTGACTGCCCTTCTCTCCTCCCTCTCATCCCAGAGAAACAGGTCAGCTGGGAGCTTCTGCCCCCACTGCCTAGGGACCAACAGGGGCAGGAG
合并两个fasta文件
cat CA_N_805_RNA_recal_1.fa CA_N_805_RNA_recal_2.fa > CA_N_805_RNA_recal.fa
使用python脚本根据bed文件中与编辑位点有交集的记录从fasta文件中提取出序列
import argparse
import pysam
parser = argparse.ArgumentParser()
parser.add_argument('bed',help='input bed file',type=str)
parser.add_argument('fasta',help='input fasta',type=str)
args = parser.parse_args()
bed = args.bed
fasta = args.fasta
fa = pysam.FastaFile(fasta)
with open(bed,'r') as f:
for i in f.readlines():
if i.strip() != '':
arr = i.strip().split('\t')
if arr[3] in fa.references:
seq = fa.fetch(arr[3])
print(">%s \n %s" % (arr[3],seq))
python extract_seq.py CA_N_805_RNA.AtoI_nonAlu_rmHomopolymer.bed CA_N_805_RNA_recal.fa > CA_N_805_RNA_recal_to_blat.fa
得到的目的序列就是之前想通过python脚本输出的序列,用来进行blat比对
文末友情推荐