使用inferCNV来推断airway转录组数据的拷贝数变异
前面我们介绍了单细胞转录组表达矩阵可以推断CNV的文献出处及历史,也简单过了一半broad开发的inferCNV软件,但是只运行了其测试数据,远远不够。单细胞转录组数据分析CNV使用broad出品的inferCNV来对单细胞转录组数据推断CNV信息现在我们来,使用inferCNV来推断airway转录组数据的拷贝数变异,其实主要就是如何制作input文件给inferCNV这个软件,要制作的数据文件如下:测试数据example/example_expression.txt是一个含有两万多个基因,七百多个细胞的表达矩阵。另外一个文件 example/example_genomic_positions.txt 是表达矩阵里面的两万多个基因的基因组坐标。首先制作表达矩阵这里自己去搜索了解airway转录组数据数据集咯,代码如下:library(airway)library(edgeR)library(DESeq2)data(airway)airwayexprSet=assay(airway)geneLists=rownames(exprSet)keepGene=rowSums(cpm(exprSet)>0) >=2table(keepGene);dim(exprSet)dim(exprSet[keepGene,])exprSet=exprSet[keepGene,]rownames(exprSet)=geneLists[keepGene]boxplot(exprSet,las=2)# CPM normalized counts.exprSet=log2(cpm(exprSet)+1)boxplot(exprSet,las=2)exprSet[1:4,1:4]可以看到表达矩阵如下;> exprSet[1:4,1:4]SRR1039508 SRR1039509 SRR1039512 SRR1039513ENSG00000000003 5.083273 4.633369 5.147338 4.802578ENSG00000000419 4.562474 4.826860 4.672375 4.647983ENSG00000000457 3.765373 3.610970 3.507874 3.562638ENSG00000000460 1.966187 1.972398 1.366276 1.726062接下来制作基因坐标文件这里需要下载gencode数据库或者其他数据库的人类基因组注释文件,我这里直接给代码吧:cat gencode.v25.annotation.gtf|perl -alne '{next unless $F[2] eq "gene";print}'|grep -w HAVANA |cut -f 1,4,5,9| cut -d";" -f 1,2,4|sed 's/gene_id//g'|sed 's/gene_type//g'|sed 's/gene_name//g'|sed 's/;//g'| sed 's/\"//g'|perl -alne '{/(ENSG\d+)/;print "$1\t$_"}' >human.gene.positions可能并不是太好理解,但是代码可以直接运行,得到的基因组坐标文件如下;ENSG00000223972 chr1 11869 14409 ENSG00000223972.5 transcribed_unprocessed_pseudogene DDX11L1ENSG00000227232 chr1 14404 29570 ENSG00000227232.5 unprocessed_pseudogene WASH7PENSG00000243485 chr1 29554 31109 ENSG00000243485.4 lincRNA MIR1302-2ENSG00000237613 chr1 3455436081 ENSG00000237613.2 lincRNA FAM138AENSG00000268020 chr1 52473 53312 ENSG00000268020.3 unprocessed_pseudogene OR4G4PENSG00000240361 chr1 62948 63887 ENSG00000240361.1 unprocessed_pseudogene OR4G11PENSG00000186092 chr1 69091 70008 ENSG00000186092.4 protein_coding OR4F5ENSG00000238009 chr1 89295 133723 ENSG00000238009.6 lincRNA RP11-34P13.7ENSG00000239945 chr1 89551 91105 ENSG00000239945.1 lincRNA RP11-34P13.8ENSG00000233750 chr1 131025 134836 ENSG00000233750.3 processed_pseudogene CICP27这些坐标有五万多个基因信息,但是上面的表达矩阵只有三万多个基因,所以需要对应一下,代码如下;pos=read.table('human.gene.positions')exprSet=exprSet[rownames(exprSet) %in% pos[,1],]write.table(exprSet,'airway_exprSet.txt',quote = F,sep = '\t')pos = pos[match(rownames(exprSet),pos[,1]),1:4]write.table(pos,'airway_pos.txt',row.names = F,col.names = F,sep = '\t',quote = F)dim(exprSet)dim(pos)信息如下:> head(pos)V1 V2 V3 V449139 ENSG00000000003 chrX 100627109 10063999145840 ENSG00000000419 chr20 50934867 509585553181 ENSG00000000457 chr1 169849631 1698942673176 ENSG00000000460 chr1 169662007 169854080756 ENSG00000000938 chr1 27612064 276352773509 ENSG00000000971 chr1 196651878 196747504> dim(exprSet)[1] 26599 8> dim(pos)[1] 26599 4运行inferCNV软件咯到这里,准备工作就结束啦,根据前面的教程,我们直接运行该软件,但是这个时候不能仅仅是用官网的测试参数了,实际上想成功运行这个程序也是非常麻烦的。软件由一个 500 行左右的运行脚本加上 2000多行的实际代码组成,参数实际介绍如下:data:Expression matrix (genes X samples), assumed to be log2(TPM+1) .gene_order:Ordering of the genes (data's rows) according to their genomic location To include all genes use 0.cutoff:Cut-off for the average expression of genes to be used for CNV inference.reference_obs:Column names of the subset of samples (data's columns) that should be used as references. If not given, the average of all samples will be the reference.transform_data:Indicator to log2 + 1 transformwindow_length:Length of the window for the moving average (smoothing). Should be an odd integer.max_centered_threshold:The maximum value a a value can have after centering. Also sets a lower bound of -1 * this value.noise_threshold:The minimum difference a value can be from the average reference in order for it not to be removed as noise.num_ref_groups:The number of reference groups of a list of indicies for each group of reference indices in relation to reference_obs.out_path:The path to what to save the pdf as. The raw data is also written to this path but with the extension .txt .plot_steps:If true turns on plotting intermediate steps.contig_tail:Length of the tail removed from the ends of contigs.method_bound:Method to use for bounding values in the visualization.lower_bound_vis:Lower bound to normalize data to for visualization.upper_bound_vis:Upper bound to normalize data to for visualization.根据我的理解,这个airway数据是bulk转录组测序,就8个样本,所以就没有ref了,我的运行代码如下;../inferCNV/scripts/inferCNV.R \--cutoff 4.5 \--noise_filter 0.3 \--output_dir test \--vis_bound_threshold " -1,1" \airway_exprSet.txt airway_pos.txt软件倒是可以成功运行,就是出来的图有点尴尬。
有两个可能原因,首先是因为我们没有规定者8个样本哪些是正常对照,所以软件曲8个样本的平均值作为对照,这样的话这8个样本本来就不是癌症样本,CNV事件本来就少的可怜。另一个可能是我对软件运行的两个参数理解不够透彻,阈值没有设置好。不过不要紧,接下来我们就拿CCLE的表达矩阵测试一下。