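Software for the lncRNA assembly pipeline: bedtools
Our Bilibili channel, 《生信技能树》 (Bioinformatics Skill Tree), hosts a hands-on lncRNA data analysis course, but it has no accompanying notes. To fill that gap, we are publishing literature walkthroughs of 100 lncRNA assembly case studies, along with hands-on tutorial notes for 100 software tools used in the pipeline.
BEDTools is a toolkit for comparing, manipulating, and annotating genomic features. Genomic features are usually stored as Browser Extensible Data (BED) or General Feature Format (GFF) files and can be visualized and compared in the UCSC Genome Browser. bedtools provides a few dozen tools (subcommands) for processing genomic data, including: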
intersect Find overlapping intervals in various ways.
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
coverage Compute the coverage over defined intervals.
map Apply a function to a column for each overlapping interval.
genomecov Compute the coverage over an entire genome.
merge Combine overlapping/nearby intervals into a single interval.
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
shift Adjust the position of intervals.
subtract Remove intervals based on overlaps b/w two files.
slop Adjust the size of intervals.
flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
random Generate random intervals in a genome.
shuffle Randomly redistribute intervals in a genome.
sample Sample random records from file using reservoir sampling.
spacing Report the gap lengths between intervals in a file.
annotate Annotate coverage of features from multiple files.
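Typical, frequently used functions are listed below. The names in parentheses are the legacy standalone commands; current releases expose the same tools as subcommands, e.g. bedtools intersect.
Format conversion: BAM to BED (bamToBed), and BED to other formats (bedToBam, bedToIgv);
Set operations on genomic coordinates: intersection (intersectBed, windowBed), "nearest neighbor" (closestBed), complement (complementBed), union (mergeBed), and difference (subtractBed);
Coverage calculation (coverageBed, genomeCoverageBed).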
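I. Installation
Install with conda (bedtools is distributed through the bioconda channel):
conda install -c conda-forge -c bioconda bedtools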
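A quick sanity check after installation can save confusion later. The sketch below only tests whether a bedtools executable is on PATH and prints its version; the function name check_bedtools is made up for this note, and the hint it prints assumes the tool was installed into a conda environment.

```shell
# Hypothetical helper: report the bedtools version if the executable is on PATH.
check_bedtools() {
    if command -v bedtools >/dev/null 2>&1; then
        bedtools --version
    else
        echo "bedtools not found; activate the conda environment it was installed into"
        return 1
    fi
}

# "|| true" keeps a script running even when bedtools is missing.
check_bedtools || true
```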
II. Using bedtools window
Once installation is complete, run bedtools window -h to view the tool's help text.
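1. Usage:
bedtools window [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>
2. Common options:
-w # base pairs added on both sides of each entry in A (symmetric window; default 1000)
-l # base pairs added upstream (to the left) of each entry in A (default 1000)
-r # base pairs added downstream (to the right) of each entry in A (default 1000)
-sw # define -l and -r relative to the strand of each entry in A
-sm / -Sm # report only hits in B on the same / opposite strand as A
-u # report each entry in A once if any overlap is found in B
-c # for each entry in A, report the number of overlaps with B
-v # report only entries in A that have no overlap with B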
III. Input files
BED, GFF, or VCF files
IV. Running the command
Like bedtools intersect, window searches for features in A and B that overlap.
However, window adds a specified number (1000, by default) of base pairs upstream and downstream of each feature in A. In effect, this allows features in B that are “near” features in A to be detected.
bedtools window -a DEL.gtf \
-b protein_coding_gene.gtf \
-l 10000 -r 10000 > test.txt
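Parameter notes:
-a DEL.gtf -b protein_coding_gene.gtf # for each feature in DEL.gtf (A), report the features in protein_coding_gene.gtf (B) that overlap its window
-l 10000 # extend the search window 10,000 bp upstream (to the left) of each A feature
-r 10000 # extend the search window 10,000 bp downstream (to the right) of each A feature
Unless -sw is given, upstream and downstream are defined by coordinate (left and right), not by the strand of the feature.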
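To make the effect of -l and -r concrete, here is a minimal awk sketch of the window logic on BED input. It is an illustration under stated assumptions, not bedtools' own implementation: the function name window_sketch, the file names, and all coordinates are invented, strand is ignored (as when -sw is not used), and B is simply held in memory.

```shell
# Toy inputs with made-up coordinates (BED: 0-based start, end exclusive).
tmp=$(mktemp -d)
printf 'chr1\t20000\t21000\tlnc1\n' > "$tmp/a.bed"
printf 'chr1\t5000\t9000\tgeneFar\n'          >  "$tmp/b.bed"
printf 'chr1\t12000\t15000\tgeneNear\n'       >> "$tmp/b.bed"
printf 'chr1\t30500\t32000\tgeneRight\n'      >> "$tmp/b.bed"
printf 'chr2\t20000\t21000\tgeneOtherChrom\n' >> "$tmp/b.bed"

# Hypothetical helper. Usage: window_sketch A.bed B.bed LEFT RIGHT
window_sketch() {
    awk -v l="$3" -v r="$4" 'BEGIN { FS = OFS = "\t" }
        NR == FNR { b[++n] = $0; next }        # first file read is B: keep every line
        {
            ws = $2 - l; if (ws < 0) ws = 0    # window start, clamped at 0
            we = $3 + r                        # window end
            for (i = 1; i <= n; i++) {
                split(b[i], f, "\t")
                # report B when it is on the same chromosome and overlaps the window
                if (f[1] == $1 && f[2] < we && f[3] > ws) print $0, b[i]
            }
        }' "$2" "$1"
}

# Pairs lnc1 with geneNear and geneRight; geneFar and geneOtherChrom are skipped.
window_sketch "$tmp/a.bed" "$tmp/b.bed" 10000 10000
```

With bedtools installed, bedtools window -a a.bed -b b.bed -l 10000 -r 10000 on the same two files should report the same pairs. Setting both flanks to 0 reduces the sketch to a plain overlap test, which on this toy data returns nothing.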
V. Output
chr1 8416627 8422722 + transcript_id "MSTRG.299.44" chr1 8352397 8848921 - gene_name "RERE"
chr1 16142499 16142858 + transcript_id "MSTRG.518.1" chr1 16124337 16156069 - gene_name "EPHA2"
chr1 20981406 20984251 + transcript_id "MSTRG.624.1" chr1 20806292 21176888 - gene_name "EIF4G3"
chr1 39634613 39639494 + transcript_id "MSTRG.1249.4" chr1 39623435 39639643 gene_name "HEYL"
chr1 44423896 44512709 + transcript_id "MSTRG.1392.8" chr1 44405194 44651724 + gene_name "RNF220"
chr1 53720323 53734052 + transcript_id "MSTRG.1665.7" chr1 53506237 53738106 - gene_name "GLIS1"
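For downstream work it is often handy to reduce each hit to a transcript/gene pair. The sketch below assumes the listing above is whitespace-separated text in which the transcript_id and gene_name tags are each followed by a quoted value without spaces; the function name pairs_from_window is made up for this note. It looks the tags up by name instead of by column number, so it still works if a column (such as a strand) is missing from a line.

```shell
# Hypothetical helper: read window output on stdin, print "transcript<TAB>gene".
pairs_from_window() {
    awk '{
        tid = ""; gene = ""
        for (i = 1; i < NF; i++) {
            if ($i == "transcript_id") tid  = $(i + 1)
            if ($i == "gene_name")     gene = $(i + 1)
        }
        gsub(/[";]/, "", tid)      # strip quotes and any trailing semicolon
        gsub(/[";]/, "", gene)
        if (tid != "" && gene != "") print tid "\t" gene
    }'
}

# Example using the first output line shown above.
printf '%s\n' 'chr1 8416627 8422722 + transcript_id "MSTRG.299.44" chr1 8352397 8848921 - gene_name "RERE"' \
    | pairs_from_window
```

On a real result file the same function could be fed with pairs_from_window < test.txt, provided the file carries these two tags.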