Junior labmate, everything you asked for is right here

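Foreword

I have recently mentored several interns and rotation students one after another, and with each of them I end up repeating a lot of the same material. Since I can expect to mentor quite a few more people over the next few years, I decided to write some of it down so that from now on I can simply send a link.

This post lists some commonly used bioinformatics tools and a few remarkable websites. Nearly every tool comes with a brief description of what it does and the address of its official site.

Common bioinformatics tools

FASTQ format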

  • SRAtoolkit

    • Website https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

    • The tool for downloading public data from the SRA database

  • fastx toolkit

    • a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing

    • Offers all sorts of small utilities, such as extracting the reverse complement of a sequence.

    • Website http://hannonlab.cshl.edu/fastx_toolkit/

  • fastqc

    • A quality control tool for high throughput sequence data

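    • Assesses the quality of sequencing data

    • Website https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

  • MultiQC

    • Aggregate results from bioinformatics analyses across many samples into a single report

    • Generates the quality reports for many datasets in one go, which saves time and effort and makes comparison easy; supports fastqc

    • Website https://github.com/ewels/MultiQC

    • Website http://multiqc.info/docs/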

  • Trim Galore

    • A wrapper around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries

    • Comes from the same group as fastqc and can be combined with it to clean raw data.

    • Website https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

  • Trimmomatic

    • A flexible read trimming tool for Illumina NGS data

    • A tool dedicated to cleaning Illumina sequencing data

    • Website http://www.usadellab.org/cms/index.php?page=trimmomatic

  • khmer

    • working with DNA shotgun sequencing data from genomes, transcriptomes, metagenomes, and single cells.

    • Can filter raw sequencing data, among other things

    • Website http://khmer.readthedocs.io/en/v2.1.1/user/scripts.htm
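
The reverse-complement operation mentioned under fastx toolkit is simple enough to sketch in a few lines. The Python below only illustrates the operation itself; the function name `reverse_complement` is made up for this example, and it is not how fastx toolkit implements it.

```python
# Lookup table mapping each DNA base to its complement.
# N (unknown base) maps to itself; lowercase input is handled as well.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the whole string.
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("GATTACA"))  # prints TGTAATC
```

For real FASTQ files the quality string has to be reversed along with the sequence, which is one reason to use an existing tool rather than a hand-written script.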


BED format

  • bedops

    • the fast, highly scalable and easily-parallelizable genome analysis toolkit

    • Website https://bedops.readthedocs.io/en/latest/index.html

    • Handles BED files in every way you could want, and runs faster than bedtools

  • bedtools

    • a powerful toolset for genome arithmetic

    • Website http://bedtools.readthedocs.io/en/latest/index.html

    • The best-known toolkit for BED files, although it does not come from the same group as samtools
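
To give a feel for what "genome arithmetic" means, here is a minimal Python sketch of the most common operation, intersecting two sets of intervals. It assumes BED-style coordinates (0-based, half-open) and compares every pair of intervals, so it only suits toy inputs. The `Interval` type and the `intersect` function are made up for this example and are not the API of bedtools or bedops.

```python
from typing import List, Tuple

# (chrom, start, end) with 0-based, half-open coordinates, as in BED files.
Interval = Tuple[str, int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping part of every pair of intervals from a and b."""
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue  # intervals on different chromosomes never overlap
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:  # a non-empty half-open interval means real overlap
                overlaps.append((chrom_a, start, end))
    return overlaps


if __name__ == "__main__":
    peaks = [("chr1", 100, 200), ("chr1", 300, 400)]
    genes = [("chr1", 150, 350), ("chr2", 100, 200)]
    print(intersect(peaks, genes))
```

The real tools work on sorted files and avoid the all-against-all comparison, which is where their speed comes from.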


SAM/BAM format

  • samtools

    • Utilities for the Sequence Alignment/Map (SAM) format

    • Website http://www.htslib.org/doc/samtools.html

    • This one is all you need


SNP (VCF/BCF)

  • GATK

    • Website https://software.broadinstitute.org/gatk/documentation/

    • The most widely used software

  • bcftools

    • utilities for variant calling and manipulating VCFs and BCFs

    • Website http://www.htslib.org/doc/bcftools.html

    • Performs all kinds of operations on VCF files

  • vcftools

    • Website https://vcftools.github.io/man_latest.html

    • Similar to bcftools

  • SnpEff

    • Genetic variant annotation and effect prediction toolbox

    • Well suited to annotating SNPs

    • Manual http://snpeff.sourceforge.net/SnpEff_manual.html

    • Website http://snpeff.sourceforge.net/

    • Can also annotate ChIP-seq results

    • Supports non-coding annotation, such as histone modifications

  • samtools mpileup

    • Utilities for the Sequence Alignment/Map (SAM) format

    • Website http://www.htslib.org/doc/samtools.html


ChIP-seq/motif

  • MACS

    • Model-based Analysis of ChIP-Seq

    • Mainly used for narrow peaks, such as those produced by some histone modifications (H3K4me3 and H3K9/27ac) and by transcription factors, which are usually associated with sharp and isolated peaks

    • Website http://liulab.dfci.harvard.edu/MACS/README.html

  • MACS2

    • Website https://github.com/taoliu/MACS

    • The upgraded version of MACS; it can also call broad peaks

  • SICER

    • highly recommended for a practical ChIP-seq experiment design and can be used to account for local biases resulting from read mappability, DNA repeats, local GC content

    • Website https://www.genomatix.de/online_help/help_regionminer/sicer.html

    • Came out as a challenger to MACS; mainly used to find broader peaks, such as those of H3K9me3 and H3K36me3.


Large sequence alignment

Several commonly used programs for aligning long sequences

  • MUMmer

    • rapid alignment of very large DNA and amino acid sequences

    • Website http://mummer.sourceforge.net/examples/

    • Website http://mummer.sourceforge.net/manual/

  • GMAP

    • GMAP: A Genomic Mapping and Alignment Program for mRNA and EST Sequences

    • Website http://research-pub.gene.com/gmap/

  • BLAT

    • Blat produces two major classes of alignments: at the DNA level between two sequences that are of 95% or greater identity, but which may include large inserts; and at the protein or translated DNA level between sequences that are of 80% or greater identity and may also include large inserts.

    • Website https://genome.ucsc.edu/goldenpath/help/blatSpec.html


Short read alignment

Short-read alignment, that is, aligning next-generation sequencing data

  • BWA

    • Burrows-Wheeler Alignment Tool

    • mapping low-divergent sequences against a large reference genome

    • It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.

    • Website http://bio-bwa.sourceforge.net/bwa.shtml

    • Website https://github.com/lh3/bwa

  • GSNAP

    • Genomic Short-read Nucleotide Alignment Program

    • Website http://research-pub.gene.com/gmap/

  • Bowtie

    • works best when aligning short reads to large genomes

    • does not yet report gapped alignments

    • Website http://bowtie-bio.sourceforge.net/manual.shtml

  • Bowtie2

    • ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences

    • supports gapped, local, and paired-end alignment modes

    • Website http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#reporting

    • The difference from the previous generation is its support for gapped alignments

  • HISAT2

    • The successor to Tophat, based on HISAT and Bowtie2

    • HISAT2 is somewhat faster than STAR

    • Website http://ccb.jhu.edu/software/hisat2/manual.shtml

  • STAR

    • Spliced Transcripts Alignment to a Reference

    • Website https://github.com/alexdobin/STAR

    • Website https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf


Genome-guided assembly

  • stringtie

    • highly efficient assembler of RNA-Seq alignments into potential transcripts

    • Relatively accurate at discovering alternative splicing

    • Website https://ccb.jhu.edu/software/stringtie/

  • Cufflinks

    • Hardly used any more...

  • IDP

    • Isoform Detection and Prediction tool

    • gmap+hisat2, that is, it combines long-read and short-read alignment; works well

    • Website https://www.healthcare.uiowa.edu/labs/au/IDP/IDP_manual.asp


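De novo assembly/gene prediction

Taken together, the tools below cover a workflow that runs from assembly through annotation to evaluating the result

  • Trinity

    • Tends to predict long alternatively spliced isoforms

    • Newer versions have moved away from the earlier over-prediction and become more conservative

    • Fairly resource-hungry; as a rule of thumb, allocate 6-10 GB of memory per CPU

    • Can assemble transcriptomes with or without a reference genome

    • Website https://github.com/trinityrnaseq/trinityrnaseq/wiki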

  • oases

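    • The N50 it produces is usually fairly high

    • Has some advantage in detecting lowly expressed genes

    • De novo transcriptome assembler for very short reads

    • Website https://github.com/dzerbino/oases

  • PASA (bundles BLAT and GMAP)

    • Once you have the assembled fasta file, PASA can be used to predict gene structure

    • Gene Structure Annotation and Analysis Using PASA

    • Website http://pasapipeline.github.io/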

  • Maker

    • Gene prediction

    • can be used for de novo annotation of newly sequenced genomes, for updating existing annotations to reflect new evidence, or just to combine annotations, evidence, and quality control statistics

    • Website http://www.yandell-lab.org/software/maker.html

  • TransRate

    • reference free quality assessment of de novo transcriptome assemblies

    • Website http://hibberdlab.com/transrate/

    • A dedicated assembly quality assessment tool with three assessment modes.

  • DETONATE

    • DE novo TranscriptOme rNa-seq Assembly with or without the Truth Evaluation

    • Website https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0553-5

  • BUSCO

    • Its assessment approach differs somewhat from the two above

    • based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs
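
Since N50 comes up above as a quality measure for assemblies, here is a minimal Python sketch of how it is usually defined: sort the contig (or transcript) lengths from longest to shortest and report the length at which the running total first reaches half of the total assembly length. The function name `n50` is made up for this example; assembly evaluation tools report this statistic directly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or transcript lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the running total covers half the assembly.
        if running * 2 >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 8
```

A high N50 only says that much of the assembly sits in long sequences. For transcriptomes it can be inflated by wrongly merged transcripts, which is why the dedicated evaluation tools listed above are worth running as well.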


Estimating transcript abundance

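These tools fall into two groups, alignment-based and alignment-free: RSEM and eXpress are alignment-based, while the other two (kallisto and salmon) are alignment-free.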

  • RSEM

    • RNA-Seq by Expectation-Maximization

    • Website https://deweylab.github.io/RSEM/README.html

  • eXpress

    • quantifying the abundances of a set of target sequences from sampled subsequences

    • Website https://pachterlab.github.io/eXpress/overview.html

  • kallisto

    • Blazingly fast

    • Its abundance estimates show little sample-specific or read-length bias

    • quantifying abundances of transcripts from RNA-Seq data

    • Website https://pachterlab.github.io/kallisto/

  • salmon

    • quantifying the expression of transcripts using RNA-seq data

    • Website https://combine-lab.github.io/salmon/

    • Also very fast


Read count

  • htseq-count

    • Website http://htseq.readthedocs.io/en/release_0.9.1/

    • For counting reads, this is all you need


Differential expression

Matching the earlier steps, the tools here also fall into three categories: read-based, assembly-based, and alignment-free.

  • limma

    • Linear Models for Microarray Data

    • Website http://bioconductor.org/packages/release/bioc/html/limma.html

    • Used for analysing microarray data

  • DESeq

    • Website http://bioconductor.org/packages/release/bioc/html/DESeq.html

  • DESeq2

    • Performs relatively well among these tools

    • Website http://bioconductor.org/packages/release/bioc/html/DESeq2.html

  • DEGseq

    • Identify Differentially Expressed Genes from RNA-seq data

    • Website http://www.bioconductor.org/packages/2.6/bioc/html/DEGseq.html

  • edgeR

    • Empirical Analysis of Digital Gene Expression Data in R

    • Website http://www.bioconductor.org/packages/release/bioc/html/edgeR.html

  • Ballgown

    • Its accuracy is sometimes not very good

    • facilitate flexible differential expression analysis of RNA-Seq data

    • organize, visualize, and analyze the expression measurements for your transcriptome assembly.

    • Website https://github.com/alyssafrazee/ballgown

  • sleuth

    • Used together with kallisto

    • Website https://pachterlab.github.io/sleuth/about


Data visualization

Data visualization tools can be divided into local and online ones

  • IGV

    • Integrative Genomics Viewer

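    • Website http://software.broadinstitute.org/software/igv/

    • The go-to choice for viewing analysis results locally

  • jbrowse

    • The go-to choice for presenting data publicly or sharing it with collaborators; fast and good-looking.

    • Website http://jbrowse.org/code/JBrowse-1.10.2/docs/tutorial/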

  • DEIVA

    • Interactive Visual Analysis of differential gene expression test results

    • Website http://hypercubed.github.io/DEIVA/

    • An online tool for visualizing differential expression

  • Heatmapper

    • expression-based heat maps

    • pairwise distance maps

    • correlation maps

    • Website http://www.heatmapper.ca/

    • An online tool for drawing all kinds of heat maps

  • START

    • visualize RNA-seq data starting with count data

    • Website https://kcvi.shinyapps.io/START/

    • A set of Shiny-based tools for visualizing RNA-seq data


A few remarkable websites

  • biostars

    • Website https://www.biostars.org

  • R book

    • Website http://r4ds.had.co.nz/

  • python guide

    • Website http://docs.python-guide.org/en/latest/

  • Biopython

    • Website http://biopython.org/DIST/docs/tutorial/Tutorial.html

  • Rosalind

    • Website http://rosalind.info/problems/list-view/

  • bioinformatics tools

    • Website https://omictools.com/

    • Website https://bioinformatics.ca/links_directory/

  • Data Visualisation Catalogue

    • Website http://datavizcatalogue.com/index.html

That is all for now. There are other tools that I rarely use myself, and I have left them out so as not to burden anyone; I will add more later.
