Bioinformatics Notes | A General Workflow for Text Mining

I. The general text-mining workflow

Reference:

http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know

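See what patterns these 45 articles share.

Draw a word cloud from the titles of the CNS-level (Cell, Nature, Science) papers of the TCGA project.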

Step 1: Create a text file

The text can be a local file or come from the web.

Step 2 : Install and load the required packages

# Install
install.packages("tm")           # for text mining
install.packages("SnowballC")    # for text stemming
install.packages("wordcloud")    # word-cloud generator
install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

Step 3 : Text mining

# Option 1: read a local file
text <- readLines('data/text/text.txt')
# Option 2: read the text file from the internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)
# Load the data as a corpus
docs <- Corpus(VectorSource(text))

VectorSource(x): a vector source interprets each element of the vector x as one document.

Other functions for reading data:

  • read.csv() is used for reading comma-separated value (csv) files, where a comma “,” is used as the field separator

  • read.delim() is used for reading tab-separated values (.txt) files
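As a small illustration of the two readers above, the sketch below parses inline strings instead of files. The column names and values are made up for the example.

```r
# Both readers accept a `text` argument, which is handy for quick experiments
# when no file is at hand. The data below are hypothetical.
csv_text <- "word,freq\ncancer,3\nmir,2"
tsv_text <- "word\tfreq\ncancer\t3\nmir\t2"

from_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)   # comma-separated
from_tsv <- read.delim(text = tsv_text, stringsAsFactors = FALSE) # tab-separated

print(from_csv)
print(from_tsv)
```

With real data, replace the `text` argument with a file path.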

Inspect the content of the document

inspect(docs)

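Text transformation

Cleaning the text starts with transformations, such as removing special characters. This is done with the tm_map() function, which replaces special characters like “/”, “@” and “|” with a space. The next steps are to remove unnecessary whitespace and convert the text to lower case.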

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
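To see what the toSpace transformer does to a single string, here is a base-R sketch that applies the same gsub() call outside of tm. The helper name to_space and the sample string are made up for illustration.

```r
# Same substitution that toSpace wraps, applied to a plain character vector.
to_space <- function(x, pattern) gsub(pattern, " ", x)

x <- "read/write @home a|b"
x <- to_space(x, "/")
x <- to_space(x, "@")
# "|" means "or" in a regular expression, so it has to be escaped
x <- to_space(x, "\\|")
print(x)
```

Each special character becomes a space, so runs of spaces can appear; they are collapsed later by stripWhitespace.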

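The tm_map() function is also used to remove unnecessary whitespace, convert the text to lower case, and remove common stop words such as "the" and "we".

Stop words carry almost no information because they are so common in a language, so it is useful to remove them before further analysis. For stopwords(), the supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish. Language names are case sensitive.

You can also remove numbers and punctuation by passing removeNumbers and removePunctuation to tm_map().

Another important preprocessing step is stemming, which reduces words to their root form. In other words, the process strips suffixes from words to simplify them and recover their common origin. For example, stemming reduces the words “moving”, “moved” and “movement” to the root word “move”.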

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
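The order of the cleaning steps matters, and it can be easier to follow on a single string. The sketch below mimics the same sequence in base R. It is an illustration only, not tm's implementation: the function clean_text, the two-word stop list and the sample sentence are all made up, and unlike stripWhitespace it also trims leading and trailing spaces.

```r
# Base-R imitation of the cleaning sequence: lower case, numbers,
# stop words, punctuation, whitespace.
clean_text <- function(x, stop_words) {
  x <- tolower(x)
  x <- gsub("[[:digit:]]+", "", x)
  # Remove whole words only, hence the word boundaries
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("[[:punct:]]+", "", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

cleaned <- clean_text("The 12 Cancer types, and the Atlas!", c("the", "and"))
print(cleaned)
```

Lower-casing comes first so that "The" and "the" are both caught by a lower-case stop list.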

Step 4 : Build a term-document matrix

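Once the text is cleaned, the next step is to count how often each word occurs in order to identify popular or trending topics. With the TermDocumentMatrix() function from the text mining package (tm), you can build a term-document matrix, a table containing the frequency of each word.

TermDocumentMatrix() can be used as follows: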

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
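To make the shape of a term-document matrix concrete, the following base-R sketch builds one by hand for three tiny, made-up documents: one row per term, one column per document. It only illustrates the data structure and is not how tm builds it internally.

```r
# Three toy documents, already cleaned and lower-cased (hypothetical data)
toy_docs <- c("cancer genome atlas", "cancer cell", "genome cell cell")

tokens <- strsplit(toy_docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Count each term in each document; rows = terms, columns = documents
tdm <- sapply(tokens, function(tk) as.integer(table(factor(tk, levels = terms))))
dimnames(tdm) <- list(terms, paste0("doc", seq_along(toy_docs)))
print(tdm)

# Total frequency per term, most frequent first (same idea as rowSums(m) above)
freq <- sort(rowSums(tdm), decreasing = TRUE)
print(freq)
```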

Step 5 : Generate the Word cloud

The importance of words can be illustrated with a word cloud, as follows:

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

Word Association

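Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related. It can be used to find which words co-occur with the most frequent words in survey responses, which helps reveal the context around those words.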

# Find associations
# TextDoc_dtm is a term-document matrix, like dtm from Step 4
findAssocs(TextDoc_dtm, terms = c("good","work","health"), corlimit = 0.25)

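You can modify the script above to find terms associated with every word that occurs at least 50 times, instead of hard-coding the terms in the script.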

# Find associations for words that occur at least 50 times
findAssocs(TextDoc_dtm, terms = findFreqTerms(TextDoc_dtm, lowfreq = 50), corlimit = 0.25)
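The association scores can be pictured as correlations between rows of the term-document matrix. The base-R sketch below, on a made-up 3 × 5 matrix, correlates the counts of one term with those of every other term and keeps the ones at or above a cutoff. It assumes a plain Pearson correlation, which is my understanding of what findAssocs() reports; treat it as an approximation of the idea rather than the package's code.

```r
# Hypothetical term-document matrix: 3 terms (rows) x 5 documents (columns)
tdm <- rbind(
  cancer   = c(1, 1, 0, 1, 0),
  cervical = c(1, 1, 0, 0, 0),
  liver    = c(0, 0, 1, 0, 1)
)

# Correlate the "cancer" row with every other term across documents
assoc <- cor(t(tdm))["cancer", ]
assoc <- assoc[names(assoc) != "cancer"]

# Keep only associations at or above the cutoff, as corlimit does
corlimit <- 0.25
kept <- round(assoc[assoc >= corlimit], 2)
print(kept)
```

Here "cervical" appears in two of the three documents that mention "cancer", so it passes the cutoff, while "liver" never co-occurs with "cancer" and is dropped.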

Sentiment Scores

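[Personal opinion: this is less suited to the natural sciences and more useful for the social sciences.] Sentiment can be classified as positive, neutral or negative. It can also be expressed on a numeric scale to better convey how positive or negative the sentiment in a body of text is.

This example uses the syuzhet package to generate sentiment scores. The package has four sentiment dictionaries and offers a way to access the sentiment extraction tool developed by the NLP group at Stanford. The get_sentiment() function accepts two arguments: a character vector (of sentences or words) and a method. The chosen method determines which of the four available sentiment extraction methods is used. The four methods are syuzhet (the default), bing, afinn and nrc. Each method uses a different scale and therefore returns slightly different results. Note that the result of the nrc method is more than a numeric score and needs additional interpretation, which is beyond the scope of this article. The description of the get_sentiment() function comes from: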

https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?

library("syuzhet")
# regular sentiment score using get_sentiment() function and method of your choice
# please note that different methods may have different scales
syuzhet_vector <- get_sentiment(text, method="syuzhet")
# see the first row of the vector
head(syuzhet_vector)
# see summary statistics of the vector
summary(syuzhet_vector)
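To show the general idea behind dictionary-based scoring, here is a toy base-R sketch. The four-word lexicon, its weights and the sample sentences are invented for illustration; they are not the syuzhet dictionaries, and get_sentiment() does considerably more than this.

```r
# A made-up lexicon: each word carries a positive or negative weight
toy_lexicon <- c(good = 1, great = 2, bad = -1, terrible = -2)

# Score one sentence by summing the weights of the words found in the lexicon
score_sentence <- function(s, lexicon) {
  words <- strsplit(tolower(s), "[^a-z]+")[[1]]
  sum(lexicon[words], na.rm = TRUE)  # words outside the lexicon count as 0
}

sentences <- c("A great and good result", "A terrible outcome", "No opinion here")
scores <- vapply(sentences, score_sentence, numeric(1),
                 lexicon = toy_lexicon, USE.NAMES = FALSE)
print(scores)
```

Because each dictionary assigns its own weights, scores from different methods are not directly comparable, which matches the remark above about different scales.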

Further reading: https://www.red-gate.com/simple-talk/databases/sql-server/bi-sql-server/text-mining-and-sentiment-analysis-with-r/

II. Exercise 1: Look for patterns in these articles

A Novel Copolymer Poly(Lactide-co-b-Malic Acid).pdf
A novel miRNA identified in GRSF1 complex drives the metastasis via the PIK3R3_AKT_NF-κB and TIMP3_MMP9 pathways in cervical cancer cells.pdf
A novel microRNA identified in hepatocellular carcinomas is __responsive to LEF1 and facilitates proliferation and epithelial- mesenchymal transition via targeting of NFIX.pdf
B4GALT3 up-regulation by miR-27a contributes to the oncogenic.pdf
C14orf28 downregulated by miR-519d contributes to oncogenicity and regulates apoptosis and EMT in colorectal __cancer.pdf
Contribution of hydrophobichydrophilic modification on cationic chains.pdf
DCLK1 promotes epithelial-mesenchymal transition via the PI3K_Akt_NF-κB pathway in colorectal cancer.pdf
DNA Methylation-mediated Repression of miR-941 Enhances.pdf
Downregulation of PPP2R5E expression by miR-23a suppresses.pdf
Downregulation of TNFRSF19 and RAB43 by a novel miRNA, miR-HCC3, promotes proliferation and epithelial–mesenchymal __transition in hepatocellular carcinoma cells.pdf
GRSF1-mediated MIR-G-1 promotes malignant behavior and nuclear autophagy by directly upregulating TMED5 and LMNB1 __in cervical cancer cells.pdf
HBV-encoded miR-2 functions as an oncogene by downregulating TRIM35 but upregulating RAN in liver cancer __cells.pdf
HBx-induced MiR-1269b in NF-κB dependent manner upregulates cell division cycle 40 homolog (CDC40) to promote proliferation and migration in hepatoma cells.pdf
ICP4-induced miR-101 attenuates HSV-1 replication.pdf
INPP1 up-regulation by miR-27a contributes to the growth, migration and invasion of human cervical cancer.pdf
KDM4B-mediated epigenetic silencing of miRNA-615-5p augments RAB24 to facilitate malignancy of hepatoma cells.pdf
LncRNA RSU1P2 contributes to tumorigenesis by acting as a _ceRNA against let-7a in cervical cancer cells.pdf
LncRNA n335586_miR-924_CKMT1A axis contributes to cell migration and invasion in hepatocellular carcinoma cells.pdf
Long non-coding RNA Unigene56159 promotes.pdf
MiR-124 represses vasculogenic mimicry and cell motility by.pdf
MiR-23a Facilitates the Replication of.pdf
MiR-346 Up-regulates Argonaute 2 (AGO2) Protein Expression to Augment the Activity of.pdf
MiR-HCC2 Up-regulates BAMBI and ELMO1 Expression to Facilitate the Proliferation and EMT of Hepatocellular Carcinoma Cells.pdf
MicroRNA-142-3p, a new regulator of RAC1, suppresses the migration.pdf
MicroRNA-19a and -19b regulate cervical carcinoma cell proliferation.pdf
MicroRNA-214 Suppresses Growth and Invasiveness.pdf
NF-κB-modulated miR-130a targets TNF in cervical cancer cells.pdf
PIWIL4 regulates cervical cancer cell line growth and is involved in.pdf
TCDD-induced antagonism of MEHP-mediated migration and __invasion partly involves aryl hydrocarbon receptor in MCF7 breast cancer cells.pdf
USP14 de-ubiquitinates vimentin and miR-320a modulates USP14 and vimentin to contribute to malignancy in gastric _cancer cells.pdf
miR-10a suppresses colorectal cancer metastasis by modulating __the epithelial-to-mesenchymal transition and anoikis.pdf
miR-1228 promotes the proliferation and.pdf
miR-17-5p up-regulates YES1 to modulate the cell cycle progression and apoptosis in ovarian cancer cell lines.pdf
miR-212132.pdf
miR-23a Targets Interferon Regulatory Factor 1 and.pdf
miR-23a promotes IKKa expression but.pdf
miR-24-3p Suppresses Malignant Behavior of Lacrimal Adenoid Cystic Carcinoma by Targeting PRKCH to Regulate p53_p21 Pathway.pdf
miR-30a reverses TGF-β2-induced migration and EMT in posterior capsular opacification by targeting Smad2.pdf
miR-371-5p down-regulates pre mRNA processing factor 4 homolog B.pdf
miR-377-3p drives malignancy characteristics via upregulating GSK-3 expression and activating NF-κB pathway in hCRC cells.pdf
miR-3928v is induced by HBx via NF-kB_EGR1 and contributes __to hepatocellular carcinoma malignancy by down-regulating VDAC3.pdf
miR-484 suppresses proliferation and epithelial–mesenchymal __transition by targeting ZEB1 and SMAD2 in cervical cancer cells.pdf
miR-490-3p Modulates Cell Growth and Epithelial to Mesenchymal Transition of Epithelial to Mesenchymal Transition of Targeting Endoplasmic Reticulum-Golgi Intermediate Compartment Protein 3 (ERGIC3).pdf
miR-639 Expression Is Silenced by DNMT3A-Mediated Hypermethylation and Functions as a Tumor Suppressor in Liver Cancer Cells.pdf
microRNA-34a-Upregulated Retinoic Acid-Inducible Gene-I Promotes Apoptosis and Delays Cell Cycle Transition in Cervical Cancer Cells.pdf

Full code:

# install.packages('wordcloud2')
# devtools::install_github("lchiffon/wordcloud2")
# In the end, wordcloud2 version 0.2.0 was installed from a local file
# Install
# install.packages("tm")           # for text mining
# install.packages("SnowballC")    # for text stemming
# install.packages("wordcloud")    # word-cloud generator
# install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

text <- readLines('data/text/text.txt')
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Strip the ".pdf" extension; the dot is escaped so that it matches literally
docs <- tm_map(docs, toSpace, "\\.pdf")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("characterization", "molecular", "comprehensive",
                                    'cell', 'analysis','landscape'))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
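One detail worth checking when stripping the ".pdf" extension: the pattern handed to gsub() is a regular expression, where an unescaped dot matches any character. The base-R sketch below compares the two patterns on a made-up string.

```r
# Hypothetical input containing "pdf" inside a word as well as a real extension
x <- "mapdf report.pdf"

loose  <- gsub(".pdf", " ", x)     # "." matches any character, so "apdf" is hit too
strict <- gsub("\\.pdf", " ", x)   # only a literal ".pdf" is replaced

print(loose)
print(strict)
```

For the article titles above the difference is unlikely to change the word counts, but the escaped form is the safer habit.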

Next, look at the terms most closely associated with the frequent keywords "cancer" and "mir":

findAssocs(dtm, terms = c("cancer","mir"), corlimit = 0.25)
findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 10), corlimit = 0.25)

$cancer
             cervical             apoptosis            colorectal            metastasis
                 0.56                  0.36                  0.36                  0.29
epithelialmesenchymal             functions                 liver
                 0.29                  0.29                  0.29

III. Exercise 2: Official TCGA project publications

The official TCGA project publications are listed at: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/publications

Comprehensive genomic characterization defines human glioblastoma genes and core pathways
Integrated genomic analyses of ovarian carcinoma
Comprehensive molecular characterization of human colon and rectal cancer
Comprehensive molecular portraits of human breast tumours
Comprehensive genomic characterization of squamous cell lung cancers
Integrated genomic characterization of endometrial carcinoma
Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia
Comprehensive molecular characterization of clear cell renal cell carcinoma
The Cancer Genome Atlas Pan-Cancer analysis project
The somatic genomic landscape of glioblastoma
Comprehensive molecular characterization of urothelial bladder carcinoma
Comprehensive molecular profiling of lung adenocarcinoma
Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin
The Somatic Genomic Landscape of Chromophobe Renal Cell Carcinoma
Comprehensive molecular characterization of gastric adenocarcinoma
Integrated genomic characterization of papillary thyroid carcinoma
Comprehensive genomic characterization of head and neck squamous cell carcinomas
Genomic Classification of Cutaneous Melanoma
Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas
Comprehensive Molecular Portraits of Invasive Lobular Breast Cancer
The Molecular Taxonomy of Primary Prostate Cancer
Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma
Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma
Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas
Integrated genomic characterization of oesophageal carcinoma
Comprehensive Molecular Characterization of Pheochromocytoma and Paraganglioma
Integrated Molecular Characterization of Uterine Carcinosarcoma
Integrative Genomic Analysis of Cholangiocarcinoma Identifies Distinct IDH-Mutant Molecular Profiles
Integrated genomic and molecular characterization of cervical cancer
Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma
Integrative Analysis Identifies Four Molecular and Clinical Subsets in Uveal Melanoma
Integrated Genomic Characterization of Pancreatic Ductal Adenocarcinoma
Comprehensive Molecular Characterization of Muscle-Invasive Bladder Cancer
Comprehensive and Integrated Genomic Characterization of Adult Soft Tissue Sarcomas
The Integrated Genomic Landscape of Thymic Epithelial Tumors
Pan-cancer Alterations of the MYC Oncogene and Its Proximal Network across the Cancer Genome Atlas
Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines
Molecular Characterization and Clinical Relevance of Metabolic Expression Subtypes in Human Cancers
Systematic Analysis of Splice-Site-Creating Mutations in Cancer
Somatic Mutational Landscape of Splicing Factor Genes and Their Functional Consequences across 33 Cancer Types
The Cancer Genome Atlas Comprehensive Molecular Characterization of Renal Cell Carcinoma
Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context
Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images
Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas
Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas
Driver Fusions and Their Implications in the Development and Treatment of Human Cancers
Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas
Integrated Genomic Analysis of the Ubiquitin Pathway across Cancer Types
SnapShot: TCGA-Analyzed Tumors
The Cancer Genome Atlas: Creating Lasting Value beyond Its Data
Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation
Oncogenic Signaling Pathways in The Cancer Genome Atlas
Perspective on Oncogenic Processes at the End of the Beginning of Cancer Genomics
Comprehensive Characterization of Cancer Driver Genes and Mutations
An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics
Pathogenic Germline Variants in 10,389 Adult Cancers
A Pan-Cancer Analysis of Enhancer Expression in Nearly 9000 Patient Samples
Genomic and Functional Approaches to Understanding Cancer Aneuploidy
A Comprehensive Pan-Cancer Molecular Study of Gynecologic and Breast Cancers
Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas
lncRNA Epigenetic Landscape Analysis Identifies EPIC1 as an Oncogenic lncRNA that Interacts with MYC and Promotes Cell-Cycle Progression in Cancer
The Immune Landscape of Cancer
Integrated Molecular Characterization of Testicular Germ Cell Tumors
Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients
A Pan-Cancer Analysis Reveals High-Frequency Genetic Alterations in Mediators of Signaling by the TGF-β Superfamily
Integrative Molecular Characterization of Malignant Pleural Mesothelioma
The chromatin accessibility landscape of primary human cancers
Comprehensive Molecular Characterization of the Hippo Signaling Pathway in Cancer
Before and After: Comparison of Legacy and Harmonized TCGA Genomic Data Commons' Data
Comprehensive Analysis of Genetic Ancestry and Its Molecular Correlates in Cancer
Whole-genome characterization of lung adenocarcinomas lacking alterations in the RTK/RAS/RAF pathway

Full code:

# Assumes tm, wordcloud and RColorBrewer are already loaded, as in Exercise 1
library("wordcloud2")

TCGALett <- readLines('data/text/TCGA-literature.txt')
docs <- Corpus(VectorSource(TCGALett))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("characterization", "molecular", "comprehensive",
                                    'cell', 'analysis','landscape'))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

findAssocs(dtm, terms = c("cancer","genomic"), corlimit = 0.25)
findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 10), corlimit = 0.25)
wordcloud2(d,size = 0.6)
> findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 10), corlimit = 0.25)
$genomic
numeric(0)

$carcinoma
         renal      papillary       analyses        ovarian    endometrial
          0.57           0.40           0.28           0.28           0.28
         clear     urothelial    chromophobe        thyroid adrenocortical
          0.28           0.28           0.28           0.28           0.28
   oesophageal hepatocellular
          0.28           0.28

$integrated
      analyses        ovarian    endometrial        thyroid    oesophageal
          0.27           0.27           0.27           0.27           0.27
carcinosarcoma        uterine       cervical         ductal     pancreatic
          0.27           0.27           0.27           0.27           0.27
      sarcomas           soft         tissue     epithelial         thymic
          0.27           0.27           0.27           0.27           0.27
     ubiquitin      analytics          drive        outcome        quality
          0.27           0.27           0.27           0.27           0.27
      resource       survival           germ     testicular
          0.27           0.27           0.27           0.27

$cancer
       pan      atlas     genome    project   oncogene   proximal    context regulation
      0.56       0.54       0.42       0.31       0.31       0.31       0.31       0.31
  supports  targeting activation    detects        myc     across
      0.31       0.31       0.31       0.31       0.30       0.28
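The term "pan" in the output comes from titles such as "Pan-Cancer": in this exercise the hyphen is turned into a space before punctuation is removed, so hyphenated words split into two terms. In Exercise 1 the hyphen was left to removePunctuation, which is why a fused term like "epithelialmesenchymal" shows up there. The base-R sketch below reproduces both behaviours on a made-up title, with gsub() calls standing in for the tm transformations.

```r
title <- "Pan-Cancer analysis of epithelial-mesenchymal transition"

# Hyphen replaced by a space first (as in this exercise): two separate terms
split_terms <- strsplit(tolower(gsub("-", " ", title)), " ", fixed = TRUE)[[1]]

# Hyphen simply deleted as punctuation (as in Exercise 1): one fused term
fused_terms <- strsplit(tolower(gsub("[[:punct:]]+", "", title)), " ", fixed = TRUE)[[1]]

print(split_terms)
print(fused_terms)
```

Which behaviour is preferable depends on whether the hyphenated compound should count as one concept or two.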

For tips on making word clouds look better, see the article "R绘图笔记 | 词云图的绘制" (R Plotting Notes | Drawing Word Clouds).
