Bioinformatics Notes | A General Workflow for Text Mining






Step 1: Create a text file


Step 2: Install and load the required packages

# Install
install.packages("tm")           # for text mining
install.packages("SnowballC")    # for text stemming
install.packages("wordcloud")    # word-cloud generator
install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

Step 3: Text mining

# Read a local file
text <- readLines('data/text/text.txt')
# Or read the text file from the internet (this overwrites the text read above)
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)
# Load the data as a corpus
docs <- Corpus(VectorSource(text))



  • read.csv() is used for reading comma-separated values (.csv) files, where a comma "," is the field separator

  • read.delim() is used for reading tab-separated values (.txt) files
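As a small illustration of the two readers listed above, the sketch below parses the same made-up table once as comma-separated and once as tab-separated text. The inline strings and the word counts are invented for the example and stand in for real files.

```r
# Inline text stands in for real files so the sketch is self-contained.
# The words and counts are made up for illustration.
csv_text <- "word,freq\ndream,11\nfreedom,13"
tsv_text <- "word\tfreq\ndream\t11\nfreedom\t13"

# read.csv() splits fields on commas by default
csv_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.delim() splits fields on tabs by default
tsv_df <- read.delim(text = tsv_text, stringsAsFactors = FALSE)

print(csv_df)
print(tsv_df)
```

With real data, the `text =` argument would be replaced by a file path, for example a path under data/text/.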

Inspect the content of the document




toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
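To see what the toSpace transformation does to a single string, here is a base-R sketch that applies the same gsub() call outside of tm. The function name to_space and the input line are made up for the example.

```r
# Plain-function version of the toSpace idea:
# replace every match of `pattern` with a single space.
to_space <- function(x, pattern) gsub(pattern, " ", x)

line <- "mail me@example.com|see docs/guide"  # made-up input
line <- to_space(line, "/")
line <- to_space(line, "@")
# "|" means alternation in a regular expression, so it has to be escaped
line <- to_space(line, "\\|")
print(line)
```

In the tm version, content_transformer() wraps the same function so that tm_map() can apply it to every document in the corpus.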

The tm_map() function is used to remove unnecessary whitespace, convert the text to lower case, and remove common stop words such as "the" and "we".




# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stop words as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
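The cleaning steps above can be approximated on a plain character vector with base R. This is only a rough sketch: the helper clean_text and its four-word stop list are hypothetical, the real stopwords("english") list is much longer, and the final trimws() call is an extra step added here for tidy output.

```r
# Rough base-R approximation of the cleaning steps, for illustration only.
# `clean_text` and its tiny stop-word list are hypothetical.
clean_text <- function(x, stop_words = c("the", "we", "a", "is")) {
  x <- tolower(x)                                   # lower case
  x <- gsub("[[:digit:]]+", "", x)                  # remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # remove stop words as whole words
  x <- gsub("[[:punct:]]+", "", x)                  # remove punctuation
  x <- gsub("\\s+", " ", x)                         # collapse runs of whitespace
  trimws(x)                                         # extra: drop leading/trailing spaces
}

cleaned <- clean_text("The 2 dreams: we HOLD   these truths!")
print(cleaned)
```

The word-boundary markers keep "the" from being cut out of "these", which is why stop words are matched as whole words rather than as substrings.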

Step 4: Build a term-document matrix

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
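For intuition about what the term-document matrix holds, the following base-R sketch builds a small one by hand: rows are terms, columns are documents, and each cell is a count. The three documents are made up, and this is not how tm builds its matrix internally.

```r
# Three made-up, already-cleaned documents
docs_clean <- c("dream freedom ring", "freedom ring", "dream dream nation")

# Split each document into words and collect the vocabulary
tokens <- strsplit(docs_clean, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Count every term in every document: rows = terms, columns = documents
m <- vapply(
  tokens,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
)
rownames(m) <- terms

# Same summary as in the tm workflow: total frequency per term, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v, stringsAsFactors = FALSE)
print(m)
print(d)
```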

Step 5: Generate the word cloud


set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
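In the wordcloud() call, min.freq drops words below a frequency threshold and max.words caps how many words are drawn, with the least frequent ones dropped first. The base-R sketch below mimics that selection on a made-up frequency table. It only illustrates which words would be eligible and draws nothing.

```r
# Made-up frequency table
d <- data.frame(
  word = c("dream", "freedom", "ring", "nation", "hope"),
  freq = c(11, 13, 12, 2, 1),
  stringsAsFactors = FALSE
)

min_freq  <- 2  # words below this frequency are left out
max_words <- 3  # keep at most this many words, most frequent first

keep <- d[d$freq >= min_freq, ]
keep <- keep[order(keep$freq, decreasing = TRUE), ]
keep <- head(keep, max_words)
print(keep)
```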

Word Association


# Find associations (dtm is the term-document matrix built in Step 4)
findAssocs(dtm, terms = c("good", "work", "health"), corlimit = 0.25)

# Find associations for words that occur at least 50 times
findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 50), corlimit = 0.25)
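The scores returned by findAssocs() are correlations between term counts across documents, and corlimit is the lower bound for reporting a term. The sketch below reproduces that idea with cor() on a small made-up matrix. It is a simplified illustration and may differ from tm's implementation in details such as rounding.

```r
# Made-up term-document counts: rows = terms, columns = five documents
m <- rbind(
  good   = c(1, 0, 2, 0, 1),
  work   = c(1, 0, 2, 0, 0),
  health = c(0, 1, 0, 1, 0)
)

target   <- "good"
corlimit <- 0.25

# Correlate the target term's counts with every other term's counts
others <- setdiff(rownames(m), target)
assoc <- sapply(others, function(term) cor(m[target, ], m[term, ]))

# Keep only terms whose correlation reaches the limit
assoc <- sort(round(assoc[assoc >= corlimit], 2), decreasing = TRUE)
print(assoc)
```

Here "work" tends to appear in the same documents as "good" and is kept, while "health" is negatively correlated and falls below the limit.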

Sentiment Scores




# get_sentiment() comes from the syuzhet package
library("syuzhet")
# regular sentiment score using get_sentiment() function and method of your choice
# please note that different methods may have different scales
syuzhet_vector <- get_sentiment(text, method = "syuzhet")
# see the first elements of the vector
head(syuzhet_vector)
# see summary statistics of the vector
summary(syuzhet_vector)
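Lexicon-based sentiment scoring can be pictured as looking up each word in a dictionary of scores and adding up the hits. The sketch below does that with a tiny hypothetical lexicon. The words, scores, and the helper score_sentence are invented for illustration and are not the syuzhet dictionary or its algorithm.

```r
# Tiny hypothetical lexicon: word -> score (not the syuzhet dictionary)
lexicon <- c(good = 1, hope = 1, dream = 0.5, bad = -1, despair = -1)

# Score one sentence: strip punctuation, lower-case, split into words,
# then sum the scores of the words found in the lexicon
score_sentence <- function(sentence, lexicon) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]
  sum(lexicon[words], na.rm = TRUE)
}

sentences <- c("I have a dream and hope.", "Bad days bring despair!", "The sky is blue.")
scores <- vapply(sentences, score_sentence, numeric(1),
                 lexicon = lexicon, USE.NAMES = FALSE)
print(scores)
```

Because each method brings its own dictionary and scale, scores from different methods are not directly comparable.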


II. Assignment 1: Look for patterns in these papers

A Novel Copolymer Poly(Lactide-co-b-Malic Acid).pdf
A novel miRNA identified in GRSF1 complex drives the metastasis via the PIK3R3_AKT_NF-κB and TIMP3_MMP9 pathways in cervical cancer cells.pdf
A novel microRNA identified in hepatocellular carcinomas is __responsive to LEF1 and facilitates proliferation and epithelial- mesenchymal transition via targeting of NFIX.pdf
B4GALT3 up-regulation by miR-27a contributes to the oncogenic.pdf
C14orf28 downregulated by miR-519d contributes to oncogenicity and regulates apoptosis and EMT in colorectal __cancer.pdf
Contribution of hydrophobichydrophilic modification on cationic chains.pdf
DCLK1 promotes epithelial-mesenchymal transition via the PI3K_Akt_NF-κB pathway in colorectal cancer.pdf
DNA Methylation-mediated Repression of miR-941 Enhances.pdf
Downregulation of PPP2R5E expression by miR-23a suppresses.pdf
Downregulation of TNFRSF19 and RAB43 by a novel miRNA, miR-HCC3, promotes proliferation and epithelial–mesenchymal __transition in hepatocellular carcinoma cells.pdf
GRSF1-mediated MIR-G-1 promotes malignant behavior and nuclear autophagy by directly upregulating TMED5 and LMNB1 __in cervical cancer cells.pdf
HBV-encoded miR-2 functions as an oncogene by downregulating TRIM35 but upregulating RAN in liver cancer __cells.pdf
HBx-induced MiR-1269b in NF-κB dependent manner upregulates cell division cycle 40 homolog (CDC40) to promote proliferation and migration in hepatoma cells.pdf
ICP4-induced miR-101 attenuates HSV-1 replication.pdf
INPP1 up-regulation by miR-27a contributes to the growth, migration and invasion of human cervical cancer.pdf
KDM4B-mediated epigenetic silencing of miRNA-615-5p augments RAB24 to facilitate malignancy of hepatoma cells.pdf
LncRNA RSU1P2 contributes to tumorigenesis by acting as a _ceRNA against let-7a in cervical cancer cells.pdf
LncRNA n335586_miR-924_CKMT1A axis contributes to cell migration and invasion in hepatocellular carcinoma cells.pdf
Long non-coding RNA Unigene56159 promotes.pdf
MiR-124 represses vasculogenic mimicry and cell motility by.pdf
MiR-23a Facilitates the Replication of.pdf
MiR-346 Up-regulates Argonaute 2 (AGO2) Protein Expression to Augment the Activity of.pdf
MiR-HCC2 Up-regulates BAMBI and ELMO1 Expression to Facilitate the Proliferation and EMT of Hepatocellular Carcinoma Cells.pdf
MicroRNA-142-3p, a new regulator of RAC1, suppresses the migration.pdf
MicroRNA-19a and -19b regulate cervical carcinoma cell proliferation.pdf
MicroRNA-214 Suppresses Growth and Invasiveness.pdf
NF-κB-modulated miR-130a targets TNF in cervical cancer cells.pdf
PIWIL4 regulates cervical cancer cell line growth and is involved in.pdf
TCDD-induced antagonism of MEHP-mediated migration and __invasion partly involves aryl hydrocarbon receptor in MCF7 breast cancer cells.pdf
USP14 de-ubiquitinates vimentin and miR-320a modulates USP14 and vimentin to contribute to malignancy in gastric _cancer cells.pdf
miR-10a suppresses colorectal cancer metastasis by modulating __the epithelial-to-mesenchymal transition and anoikis.pdf
miR-1228 promotes the proliferation and.pdf
miR-17-5p up-regulates YES1 to modulate the cell cycle progression and apoptosis in ovarian cancer cell lines.pdf
miR-212132.pdf
miR-23a Targets Interferon Regulatory Factor 1 and.pdf
miR-23a promotes IKKa expression but.pdf
miR-24-3p Suppresses Malignant Behavior of Lacrimal Adenoid Cystic Carcinoma by Targeting PRKCH to Regulate p53_p21 Pathway.pdf
miR-30a reverses TGF-β2-induced migration and EMT in posterior capsular opacification by targeting Smad2.pdf
miR-371-5p down-regulates pre mRNA processing factor 4 homolog B.pdf
miR-377-3p drives malignancy characteristics via upregulating GSK-3 expression and activating NF-κB pathway in hCRC cells.pdf
miR-3928v is induced by HBx via NF-kB_EGR1 and contributes __to hepatocellular carcinoma malignancy by down-regulating VDAC3.pdf
miR-484 suppresses proliferation and epithelial–mesenchymal __transition by targeting ZEB1 and SMAD2 in cervical cancer cells.pdf
miR-490-3p Modulates Cell Growth and Epithelial to Mesenchymal Transition of Epithelial to Mesenchymal Transition of Targeting Endoplasmic Reticulum-Golgi Intermediate Compartment Protein 3 (ERGIC3).pdf
miR-639 Expression Is Silenced by DNMT3A-Mediated Hypermethylation and Functions as a Tumor Suppressor in Liver Cancer Cells.pdf
microRNA-34a-Upregulated Retinoic Acid-Inducible Gene-I Promotes Apoptosis and Delays Cell Cycle Transition in Cervical Cancer Cells.pdf


#install.packages('wordcloud2')
#devtools::install_github("lchiffon/wordcloud2")
# In the end, wordcloud2 version 0.2.0 was installed from a local file
# Install
# install.packages("tm") # for text mining
# install.packages("SnowballC") # for text stemming
# install.packages("wordcloud") # word-cloud generator
# install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
text <- readLines('data/text/text.txt')
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# The pattern is a regular expression, so the dot is escaped to match a literal ".pdf"
docs <- tm_map(docs, toSpace, "\\.pdf")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stop words as a character vector
docs <- tm_map(docs, removeWords, c("characterization", "molecular", "comprehensive", 'cell', 'analysis', 'landscape'))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
# shape and size are wordcloud2() arguments, not wordcloud() arguments, so they are omitted here
wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
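One detail worth checking when stripping the .pdf suffix: toSpace hands its pattern to gsub() as a regular expression, where an unescaped dot matches any character. The sketch below contrasts the two patterns on a made-up string.

```r
x <- "Xpdf reader notes.pdf"  # made-up string

# Unescaped: "." matches any character, so "Xpdf" is replaced as well
unescaped <- gsub(".pdf", " ", x)

# Escaped: only the literal ".pdf" suffix is replaced
escaped <- gsub("\\.pdf", " ", x)

print(unescaped)
print(escaped)
```

For file names that only contain "pdf" as the extension the two patterns behave the same, so the escaped form is simply the safer habit.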


findAssocs(dtm, terms = c("cancer","mir"), corlimit = 0.25)
findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 10), corlimit = 0.25)

cervical apoptosis colorectal metastasis
0.56 0.36 0.36 0.29
epithelialmesenchymal functions liver
0.29 0.29 0.29

III. Assignment 2: Official papers of the TCGA project


Comprehensive genomic characterization defines human glioblastoma genes and core pathways
Integrated genomic analyses of ovarian carcinoma
Comprehensive molecular characterization of human colon and rectal cancer
Comprehensive molecular portraits of human breast tumours
Comprehensive genomic characterization of squamous cell lung cancers
Integrated genomic characterization of endometrial carcinoma
Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia
Comprehensive molecular characterization of clear cell renal cell carcinoma
The Cancer Genome Atlas Pan-Cancer analysis project
The somatic genomic landscape of glioblastoma
Comprehensive molecular characterization of urothelial bladder carcinoma
Comprehensive molecular profiling of lung adenocarcinoma
Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin
The Somatic Genomic Landscape of Chromophobe Renal Cell Carcinoma
Comprehensive molecular characterization of gastric adenocarcinoma
Integrated genomic characterization of papillary thyroid carcinoma
Comprehensive genomic characterization of head and neck squamous cell carcinomas
Genomic Classification of Cutaneous Melanoma
Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas
Comprehensive Molecular Portraits of Invasive Lobular Breast Cancer
The Molecular Taxonomy of Primary Prostate Cancer
Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma
Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma
Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas
Integrated genomic characterization of oesophageal carcinoma
Comprehensive Molecular Characterization of Pheochromocytoma and Paraganglioma
Integrated Molecular Characterization of Uterine Carcinosarcoma
Integrative Genomic Analysis of Cholangiocarcinoma Identifies Distinct IDH-Mutant Molecular Profiles
Integrated genomic and molecular characterization of cervical cancer
Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma
Integrative Analysis Identifies Four Molecular and Clinical Subsets in Uveal Melanoma
Integrated Genomic Characterization of Pancreatic Ductal Adenocarcinoma
Comprehensive Molecular Characterization of Muscle-Invasive Bladder Cancer
Comprehensive and Integrated Genomic Characterization of Adult Soft Tissue Sarcomas
The Integrated Genomic Landscape of Thymic Epithelial Tumors
Pan-cancer Alterations of the MYC Oncogene and Its Proximal Network across the Cancer Genome Atlas
Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines
Molecular Characterization and Clinical Relevance of Metabolic Expression Subtypes in Human Cancers
Systematic Analysis of Splice-Site-Creating Mutations in Cancer
Somatic Mutational Landscape of Splicing Factor Genes and Their Functional Consequences across 33 Cancer Types
The Cancer Genome Atlas Comprehensive Molecular Characterization of Renal Cell Carcinoma
Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context
Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images
Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas
Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas
Driver Fusions and Their Implications in the Development and Treatment of Human Cancers
Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas
Integrated Genomic Analysis of the Ubiquitin Pathway across Cancer Types
SnapShot: TCGA-Analyzed Tumors
The Cancer Genome Atlas: Creating Lasting Value beyond Its Data
Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation
Oncogenic Signaling Pathways in The Cancer Genome Atlas
Perspective on Oncogenic Processes at the End of the Beginning of Cancer Genomics
Comprehensive Characterization of Cancer Driver Genes and Mutations
An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics
Pathogenic Germline Variants in 10,389 Adult Cancers
A Pan-Cancer Analysis of Enhancer Expression in Nearly 9000 Patient Samples
Genomic and Functional Approaches to Understanding Cancer Aneuploidy
A Comprehensive Pan-Cancer Molecular Study of Gynecologic and Breast Cancers
Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas
lncRNA Epigenetic Landscape Analysis Identifies EPIC1 as an Oncogenic lncRNA that Interacts with MYC and Promotes Cell-Cycle Progression in Cancer
The Immune Landscape of Cancer
Integrated Molecular Characterization of Testicular Germ Cell Tumors
Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients
A Pan-Cancer Analysis Reveals High-Frequency Genetic Alterations in Mediators of Signaling by the TGF-β Superfamily
Integrative Molecular Characterization of Malignant Pleural Mesothelioma
The chromatin accessibility landscape of primary human cancers
Comprehensive Molecular Characterization of the Hippo Signaling Pathway in Cancer
Before and After: Comparison of Legacy and Harmonized TCGA Genomic Data Commons' Data
Comprehensive Analysis of Genetic Ancestry and Its Molecular Correlates in Cancer
Whole-genome characterization of lung adenocarcinomas lacking alterations in the RTK/RAS/RAF pathway


TCGALett <- readLines('data/text/TCGA-literature.txt')
docs <- Corpus(VectorSource(TCGALett))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stop words as a character vector
docs <- tm_map(docs, removeWords, c("characterization", "molecular", "comprehensive", 'cell', 'analysis', 'landscape'))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
# shape and size are wordcloud2() arguments, not wordcloud() arguments, so they are omitted here
wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
findAssocs(dtm, terms = c("cancer", "genomic"), corlimit = 0.25)
findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 10), corlimit = 0.25)
# wordcloud2() comes from the wordcloud2 package
library("wordcloud2")
wordcloud2(d, size = 0.6)

> findAssocs(dtm, terms = findFreqTerms(dtm, lowfreq = 10), corlimit = 0.25)
$genomic
numeric(0)

$carcinoma
         renal      papillary       analyses        ovarian    endometrial
          0.57           0.40           0.28           0.28           0.28
         clear     urothelial    chromophobe        thyroid adrenocortical
          0.28           0.28           0.28           0.28           0.28
   oesophageal hepatocellular
          0.28           0.28

$integrated
      analyses        ovarian    endometrial        thyroid    oesophageal
          0.27           0.27           0.27           0.27           0.27
carcinosarcoma        uterine       cervical         ductal     pancreatic
          0.27           0.27           0.27           0.27           0.27
      sarcomas           soft         tissue     epithelial         thymic
          0.27           0.27           0.27           0.27           0.27
     ubiquitin      analytics          drive        outcome        quality
          0.27           0.27           0.27           0.27           0.27
      resource       survival           germ     testicular
          0.27           0.27           0.27           0.27

$cancer
       pan      atlas     genome    project   oncogene   proximal    context regulation
      0.56       0.54       0.42       0.31       0.31       0.31       0.31       0.31
  supports  targeting activation    detects        myc     across
      0.31       0.31       0.31       0.31       0.30       0.28

For tips on making word clouds look better, see the article "R Plotting Notes | Drawing Word Clouds" (R绘图笔记 | 词云图的绘制).

