生信编程8.ID转换 / 四六文摘

有一些五六年前的学生们都成长为了各个生物信息学相关公司的小领导，而且他们都有了自己的公众号，知乎号，也算是一番人物。最近他们跟我反馈面试找不到或者说很难直接考核筛选到认真干活的生信工程师，挺有意思的。让我想起来了早在生信技能树论坛创立之初我为了引流，而规划的200个生信工程师面试题。值得继续分享：

200个生信工程师面试考题

为什么要进行ID转化？

简单来说，ID转换就是找到对应的关系表，然后用bash或者字典对应一下即可。但是也可以很复杂：

在进行ID转换之前，我们需要回答一些问题：

有多少种ID？

IDs	解释	来源
entrez ID	自于NCBI旗下的Entrez gene数据库所使用的编号	Entrez Gene数据库（NCBI中的Gene数据库）
EnsembleID	Ensembl数据库的ID编号	Ensembl基因组数据库
Gene Symbol	HUGO Gene Symbol(也叫做HGNC Symbol，即基因符号)是HGNC组织对基因进行命名描述的一个缩写标识符（如:TP53）	HGNC（(HUGO Gene Nomenclature Committee）人类基因命名委员会
RefSeq ID	RefSeq 有一套特殊的 Accesion Number（就是我们通常用的RefSeq ID）	RefSeq参考序列数据库
probeset ID	芯片数据中的探针ID
PubmedID	相当于文献的身份证号
[Omim ID]	OMIM中收集整理的表型（疾病）和基因均会有一个唯一的ID

什么ID最权威

Entrez ID是目前国际上最权威的Gene ID编号

ID是一一对应的吗？

gene的symbol与entrez ID绝对不是一一对应的

suppressMessages(library(org.Hs.eg.db)) all_symbol=mappedkeys(org.Hs.egSYMBOL2EG)

all_EGID =mappedkeys(org.Hs.egSYMBOL) tmp=as.list(org.Hs.egSYMBOL2EG[all_symbol]) #tmp=as.list(org.Hs.egSYMBOL[all_EGID ]) tmp=unlist(lapply(tmp,length)) tmp=tmp[tmp>1] as.list(org.Hs.egSYMBOL2EG[names(tmp)])

结果展示：

$HBD [1] "3045" "100187828"


$RNR1

[1] "4549" "6052"
$RNR2

[1] "4550" "6053"
$TEC

[1] "7006"      "100124696"
$MEMO1

[1] "7795"  "51072"
$MMD2

[1] "221938"    "100505381"
$DEL11P13

[1] "100528024" "107648861"

$`TRNAV-CAC` [1] "107985614" "107985615" "107985753"

这说明有多个entrez ID对应一个symbol的现象出现，但是没有一个symbol对应多个entrez ID的现象，而且entrez ID也会过期

ID有版本吗？

有的，ID版本还会过期。Ensembl 数据库非常贴心的为我们提供了ID History Converter工具帮助使用者进行ID的新旧版本的转换。

多种ID的对应关系

rm(list = ls()) library(org.Hs.eg.db) eg2symbol=toTable(org.Hs.egSYMBOL) #gene_id->symbol eg2name=toTable(org.Hs.egGENENAME) #gene id->gene name eg2alis = toTable(org.Hs.egALIAS2EG) #gene id-> alias gene ID基因别名(多个基因别名对应一个gene id)


#split函数的功能是将向量x中的数据根据f进行分组

eg2alis_list = lapply(split(eg2alis, eg2alis$gene_id), function(x)

  {paste0(x[,2],collapse = ";")}) #将geneID与alis的对应关系进
GeneList=mappedLkeys(org.Hs.egSYMBOL)
if(GeneList[1] %in% eg2symbol$symbol){

  symbols=GeneList

  geneIDs=eg2symbol[match(symbols,eg2symbol$symbol),'gene_id']

}else{

  geneIDs=GeneList

  symbols=eg2symbol[match(geneIDs,eg2symbol$gene_id),'symbol']

}
#match函数返回它的第一个参数在第二个参数中的(第一个)匹配的位置

geneNames = eg2name[match(geneIDs,eg2name$gene_id),'gene_name']

geneAlias=sapply(geneIDs,function(x) {

  ifelse(is.null(eg2alis_list[[x]]),"no_alias",eg2alis_list[[x]])

})
createLink <- function(base,val){

  sprintf('<a href="%s" class="btn btn-link" target="_blank">%s</a>',base,val)

}
gene_info = data.frame(symbols=symbols,

                       geneIDs=createLink(paste0("http://www.ncbi.nlm.nih.gov/gene/",geneIDs),geneIDs),

                       geneNames=geneNames,

                       geneAlias=geneAlias,

                       stringsAsFactors = F)
file='all_gene_bioconductor.html'

y <- DT::datatable(gene_info,escape = F,rownames=F)

library(DT)

y <- DT::datatable(gene_info,escape = F,rownames = F)

DT::saveWidget(y,file)

结果文件如下：

转换affymetrix芯片的探针ID：得到的变量my_probe就是我们需要的探针ID

rm(list = ls()) library("hgu95av2.db") ls("package:hgu95av2.db") probe2entrezID = toTable(hgu95av2ENTREZID) probe2symbol=toTable(hgu95av2SYMBOL) probe2genename=toTable(hgu95av2GENENAME)


my_probe = sample(unique(mappedLkeys(hgu95av2ENTREZID)),30)

tmp1 = probe2symbol[match(my_probe,probe2symbol$probe_id),]

tmp2 = probe2entrezID[match(my_probe,probe2entrezID$probe_id),]

tmp3 = probe2genename[match(my_probe,probe2genename$probe_id),]

all_IDs<-merge(merge(tmp1,tmp2,by='probe_id',all=TRUE),tmp3,by='probe_id',all=T)

结果文件如下：

probe_id symbol gene_id gene_name 1 1324_at RAD9A 5883 RAD9 checkpoint clamp component A 2 1583_at TNFRSF1B 7133 TNF receptor superfamily member 1B 3 32194_at CEBPZ 10153 CCAAT enhancer binding protein zeta 4 32227_at SRGN 5552 serglycin 5 32658_at VPS52 6293 VPS52 subunit of GARP complex 6 32745_at MRPL40 64976 mitochondrial ribosomal protein L40

转换illumina探针ID

library("illuminaHumanv4.db") ls('package:illuminaHumanv4.db') probe2entrezID = toTable(illuminaHumanv4ENTREZID) probe2symbol = toTable(illuminaHumanv4SYMBOL) probe2genename = toTable(illuminaHumanv4GENENAME)


my_probe = sample(unique(mappedLkeys(illuminaHumanv4ENTREZID)),30)

illu_1<-probe2symbol[match(my_probe,probe2symbol$probe_id),] illu_2<-probe2entrezID[match(my_probe,probe2entrezID$probe_id),] illu_3<-probe2genename[match(my_probe,probe2genename$probe_id),] all_illu<-merge(merge(illu_1,illu_2,by='probe_id',all=T),illu_3,by='probe_id',all=T)

结果文件如下：

probe_id symbol gene_id gene_name 1 ILMN_1651950 TPST1 8460 tyrosylprotein sulfotransferase 1 2 ILMN_1652037 UPP2 151531 uridine phosphorylase 2 3 ILMN_1670343 MAGI1 9223 membrane associated guanylate kinase, WW and PDZ domain containing 1 4 ILMN_1684256 EFNA3 1944 ephrin A3 5 ILMN_1685045 CHI3L2 1117 chitinase 3 like 2 6 ILMN_1689625 RAX 30062 retina and anterior neural fold homeobox

芯片探针与基因的对应关系的获取方式

去基因芯片厂家的官网直接下载manifest文件
从NCBI的GPL平台下载：http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6947
直接使用bioconductor的包，根据对应的GPL平台号下载对应的探针ID

library(Biobase) library(GEOquery) gset <- getGEO("GSE119054", GSEMatrix = TRUE, getGPL = FALSE) #下载GEO数据集 if(length(gset) > 1) idx <- grep("GPL19615", attr(gset, "names")) else idx <- 1 #从中抓取出目标的数据集 gset <- gset[[idx]] class(gset) annotation(gset) #过的芯片数据的平台号，获得注释好


#安装注释包AnnoProbe,包含探针和基因symbol

library(devtools)

#install_github("jmzeng1314/AnnoProbe")

library(AnnoProbe) 

gpl <- "GPL19615" #通过annotation(gset)获得

probe2gene = idmap(gpl, type = "pipe") #根据自己的平台号获取对象的probeID和symbol

head(probe2gene)

dim(probe2gene)

#进行ID转换 expr <- exprs(gset) #获得表达矩阵，行名就是probe_ID expr <- cbind(expr, rownames(expr)) #将行名作为独立列，便于之后匹配ID colnames(expr)[ncol(expr)] = c("probe_id") #保持expr中的probe_id的列名与probe2gene中的一样 head(expr) expr_symbol <- merge(expr, probe2gene,by="probe_id") #根据probe_id进行合并，检查最后两列，就能得到你需要的probe_ID和symbol head(expr_symbol) write.csv(expr_symbol,"D:/music/Documents/GSE119054_expr_symbol.csv")

Python实现ID转换

以affymetrix芯片探针为例：

需要的文件：

probe_id.txt

403_s_at 36238_at 1324_at 40876_at 38501_s_at 36179_at 39922_at 36649_at 503_at 32194_at 35767_at

probe2entrezID.txt

1000_at 5595 1001_at 7075 1002_f_at 1557 1003_s_at 643 1004_at 643 1005_at 1843 1006_at 4319 1008_f_at 5610 1009_at 3094 100_g_at 5875

probe2genename.txt

1000_at mitogen-activated protein kinase 3 1001_at tyrosine kinase with immunoglobulin like and EGF like domains 1 1002_f_at cytochrome P450 family 2 subfamily C member 19 1003_s_at C-X-C motif chemokine receptor 5 1004_at C-X-C motif chemokine receptor 5 1005_at dual specificity phosphatase 1 1006_at matrix metallopeptidase 10 1008_f_at eukaryotic translation initiation factor 2 alpha kinase 2 1009_at histidine triad nucleotide binding protein 1 100_g_at Rab geranylgeranyltransferase subunit alpha

'probe2symbol.txt'

1000_at MAPK3 1001_at TIE1 1002_f_at CYP2C19 1003_s_at CXCR5 1004_at CXCR5 1005_at DUSP1 1006_at MMP10 1008_f_at EIF2AK2 1009_at HINT1 100_g_at RABGGTA

python代码：

import pandas as pd from collections import OrderedDict


probe2entrezID=OrderedDict()

with open('probe2entrezID.txt','r') as f:

     for line in f:

         arr = line.strip().split(' ')

         if arr[0] not in probe2entrezID:

            probe2entrezID[arr[0]] = []

            probe2entrezID[arr[0]].append(arr[1])

         else:

            probe2entrezID[arr[0]].append(arr[1])
probe2genename=OrderedDict()

with open('probe2genename.txt','r') as f:

      for line in f:

          arr = line.strip().split(' ')

          if arr[0] in probe2entrezID:

             probe2entrezID[arr[0]].append(arr[1])

          else:

             probe2genename[arr[0]] = []

             probe2genename[arr[0]].append(arr[1])
probe2symbol=OrderedDict()

with open('probe2symbol.txt','r') as f:

     for line in f:

         arr = line.strip().split(' ')

         if arr[0] in probe2entrezID:

            probe2entrezID[arr[0]].append(arr[1])

         elif arr[0] in probe2genename:

            probe2genename[arr[0]].append(arr[1])

         else:

            probe2symbol[arr[0]] = []

            probe2symbol[arr[0]].append(arr[1])

            

probes=[]

with open('my_probe.txt','r') as f:

     for line in f:

         probes.append(line.strip())
Aimed_IDs={key:value for key,value in probe2entrezID.items() if key in probes}

print(Aimed_IDs)
IDs_pd=pd.DataFrame.from_dict(Aimed_IDs,orient='index',dtype=None)

IDs_pd.columns=['EntrezID','GeneName','Symbol']

print(IDs_pd)

结果如下：

EntrezID GeneName Symbol 1324_at 5883 RAD9 RAD9A 1583_at 7133 TNF TNFRSF1B 32194_at 10153 CCAAT CEBPZ 32227_at 5552 serglycin SRGN 32658_at 6293 VPS52 VPS52 32745_at 64976 mitochondrial MRPL40 33547_i_at 2524 fucosyltransferase FUT2 34435_at 366 aquaporin AQP9 34935_at 2327 flavin FMO2 35767_at 11345 GABA GABARAPL2

生信编程8.ID转换

为什么要进行ID转化？

多种ID的对应关系

转换affymetrix芯片的探针ID：得到的变量my_probe就是我们需要的探针ID

转换illumina探针ID

芯片探针与基因的对应关系的获取方式

Python实现ID转换

文末友情推荐

相关推荐