[Data Preprocessing] Stemming

Author : tmlab / Date : 2016. 10. 26. 20:35 / Category : Text Mining/R

Installing and loading packages automatically

  • Compare the requested packages against the libraries currently installed
  • If a package is missing, install it and then load it
  • If it is already installed, just load it
# rcv: for each package name in x, install it if it is not already installed,
# then load it with library()
rcv <- function(x) {
    for (i in x) {
        if (!is.element(i, .packages(all.available = TRUE))) {
            install.packages(i)
        }
        library(i, character.only = TRUE)
    }
}
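
  • The same install-or-load pattern can also be written with require(), which returns FALSE when a package is not installed; a minimal sketch (the package vector is simply the one used in the next step):
# Sketch: require() returns FALSE for missing packages, so it can drive
# the install-then-load step directly.
for (p in c("twitteR", "SnowballC", "tm")) {
    if (!require(p, character.only = TRUE)) {
        install.packages(p)
        library(p, character.only = TRUE)
    }
}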

Stemming

  • Use the twitteR package
  • Fetch the tweets of the given account (here RDataMining) as a list
  • consumer_key, consumer_secret, access_token, and access_secret are assumed to already hold your own Twitter API credentials
rcv(c("twitteR","SnowballC","tm"))
## Loading required package: NLP
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
tweets <- userTimeline("RDataMining", n = 3200)  # fetch up to 3200 tweets from @RDataMining
  • Check the data
  • twListToDF converts the list of tweets above into a data frame
(n.tweet <- length(tweets))
## [1] 466
tweets[1:5]
## [[1]]
## [1] "RDataMining: Seminar: Exploring causal relationships in observational data, Prof. Jiuyong Li. Canberra, 4:15pm Wed 16 Nov 2016. https://t.co/tXBWBfv01J"
## 
## [[2]]
## [1] "RDataMining: Three Research Scholarships (PhD or Research Master) in Data Science &amp; Analytics, based in Canberra. Apply by 24 Oct https://t.co/CUj6IRoWzg"
## 
## [[3]]
## [1] "RDataMining: Slides and other materials for the R and Data Mining Short Course at University of Canberra are now available at https://t.co/xKwtcnvjj7"
## 
## [[4]]
## [1] "RDataMining: Canberra HealthHack 2016, 14th - 16th October https://t.co/YJiAKCODdJ"
## 
## [[5]]
## [1] "RDataMining: @AliMAllaith sorry, seems not. Please check the link for details."
tweets.df <- twListToDF(tweets)
str(tweets.df)
## 'data.frame':    466 obs. of  16 variables:
##  $ text         : chr  "Seminar: Exploring causal relationships in observational data, Prof. Jiuyong Li. Canberra, 4:15pm Wed 16 Nov 2016. https://t.co"| __truncated__ "Three Research Scholarships (PhD or Research Master) in Data Science &amp; Analytics, based in Canberra. Apply by 24 Oct https:"| __truncated__ "Slides and other materials for the R and Data Mining Short Course at University of Canberra are now available at https://t.co/x"| __truncated__ "Canberra HealthHack 2016, 14th - 16th October https://t.co/YJiAKCODdJ" ...
##  $ favorited    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ favoriteCount: num  3 2 10 0 0 3 1 0 4 8 ...
##  $ replyToSN    : chr  NA NA NA NA ...
##  $ created      : POSIXct, format: "2016-10-26 08:36:01" "2016-10-17 14:01:54" ...
##  $ truncated    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ replyToSID   : chr  NA NA NA NA ...
##  $ id           : chr  "791196673649233920" "788017193019518976" "783785179684798464" "781420267885072384" ...
##  $ replyToUID   : chr  NA NA NA NA ...
##  $ statusSource : chr  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" ...
##  $ screenName   : chr  "RDataMining" "RDataMining" "RDataMining" "RDataMining" ...
##  $ retweetCount : num  1 2 6 1 0 1 1 0 1 7 ...
##  $ isRetweet    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ retweeted    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ longitude    : logi  NA NA NA NA NA NA ...
##  $ latitude     : logi  NA NA NA NA NA NA ...
  • Convert the tweet text into a Corpus with the tm package
  • Apply transformation functions to the Corpus via content_transformer (see the note after the preprocessing output below)
    • tolower converts all text to lowercase
    • Preprocessing that removes punctuation, numbers, URLs, etc.
myCorpus <- Corpus(VectorSource(tweets.df$text))            # build a corpus from the tweet text
myCorpus <- tm_map(myCorpus, content_transformer(tolower))  # lowercase everything
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)

# custom transformation: strip URLs (http followed by alphanumeric characters)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)

# standard English stopwords plus "available" and "via", but keep "r" and "big"
myStopwords <- c(stopwords("english"), "available", "via")
myStopwords <- setdiff(myStopwords, c("r", "big"))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, PlainTextDocument)  # restore the PlainTextDocument class
myCorpusCopy <- myCorpus                         # keep an unstemmed copy for stemCompletion later

# print the first 10 preprocessed documents
for (i in 1:10) {
  cat(paste0("[[", i, "]] "))
  writeLines(as.character(myCorpus[[i]]))
}
## [[1]] seminar exploring causal relationships observational data prof jiuyong li canberra pm wed nov 
## [[2]] three research scholarships phd research master data science amp analytics based canberra apply oct 
## [[3]] slides materials r data mining short course university canberra now 
## [[4]] canberra healthhack th th october 
## [[5]] alimallaith sorry seems please check link details
## [[6]] phd scholarships data science analytics canberra australia 
## [[7]] free halfday short course r data mining university canberra ampm fri oct seats limited 
## [[8]] hamishbr will travel canberra future interested giving talk canberra data scientists meetup thanks
## [[9]] getting started apache spark free ebook 
## [[10]] using natural language processing nontextual data mllib presentation hadoop summit melbourne
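
  • A note on the custom removeURL step above: depending on the tm version, passing a plain function to tm_map can leave bare character vectors instead of PlainTextDocument objects, which is why the corpus is mapped through PlainTextDocument afterwards. A minimal alternative sketch, assuming tm >= 0.6, wraps the function in content_transformer so the document class is preserved and that extra step is not needed:
# Sketch: content_transformer() keeps each document a PlainTextDocument,
# making tm_map(myCorpus, PlainTextDocument) unnecessary.
removeURL <- content_transformer(function(x) gsub("http[[:alnum:]]*", "", x))
myCorpus  <- tm_map(myCorpus, removeURL)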

stemDocument

  • Reduces each word to its stem (root form)
# apply the Snowball stemmer to every document
myCorpus1 <- tm_map(myCorpus, stemDocument)

for (i in 1:10) {
  cat(paste0("[[", i, "]]"))
  writeLines(as.character(myCorpus1[[i]]))
}
## [[1]]seminar explor causal relationship observ data prof jiuyong li canberra pm wed nov
## [[2]]three research scholarship phd research master data scienc amp analyt base canberra appli oct
## [[3]]slide materi r data mine short cours univers canberra now
## [[4]]canberra healthhack th th octob
## [[5]]alimallaith sorri seem pleas check link detail
## [[6]]phd scholarship data scienc analyt canberra australia
## [[7]]free halfday short cours r data mine univers canberra ampm fri oct seat limit
## [[8]]hamishbr will travel canberra futur interest give talk canberra data scientist meetup thank
## [[9]]get start apach spark free ebook
## [[10]]use natur languag process nontextu data mllib present hadoop summit melbourn
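
  • The same Snowball (Porter) stemmer is also exposed directly by the SnowballC package as wordStem(), which is handy for checking how individual words are reduced; a minimal sketch (the word vector is simply taken from the tweets above):
# wordStem() stems a character vector word by word; the results correspond
# to the stems seen in documents [[1]] and [[3]] above:
# "explor" "relationship" "observ" "mine"
wordStem(c("exploring", "relationships", "observational", "mining"),
         language = "english")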

stemCompletion

  • Completes each stemmed word back to one of the words actually used in the original documents, which serve as the dictionary
# Complete each stem against the dictionary corpus and rebuild the document
stemCompletion_mod <- function(x, dict = myCorpusCopy) {
  words <- stemCompletion(unlist(strsplit(as.character(x), " ")),
                          dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(words, collapse = " ")))
}
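
  • For reference, the underlying tm function stemCompletion() can also be called directly on a character vector of stems, with the unstemmed corpus as the dictionary; a minimal sketch using the objects defined above (the stems are taken from the stemmed output):
# complete a few stems against the unstemmed copy of the corpus;
# type = "shortest" picks the shortest matching word
stemCompletion(c("explor", "scholarship", "cours"),
               dictionary = myCorpusCopy, type = "shortest")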

# complete the stems in every document, using the unstemmed corpus as the dictionary
myCorpus2 <- NULL
for (i in 1:n.tweet) {
  myCorpus2[[i]] <- stemCompletion_mod(myCorpus1[[i]], myCorpusCopy)
}
myCorpus2 <- Corpus(VectorSource(myCorpus2))  # turn the completed texts back into a Corpus
  • Check the converted documents
  • Note that sorri (the stem of sorry) comes back as NA below: stemCompletion only completes a stem to dictionary words that begin with it, and no word in the corpus starts with sorri
for (i in 1:10) {
  cat("Before: ", as.character(myCorpus[[i]]), "\n")
  cat("After : ", as.character(myCorpus2[[i]]), "\n")
  print("------------")
}
## Before:  seminar exploring causal relationships observational data prof jiuyong li canberra pm wed nov 
## After :  seminar exploring causal relationships observational data prof jiuyong li canberra pm wed nov 
## [1] "------------"
## Before:  three research scholarships phd research master data science amp analytics based canberra apply oct 
## After :  three research scholarships phd research master data scienc amp analytics based canberra applied oct 
## [1] "------------"
## Before:  slides materials r data mining short course university canberra now 
## After :  slide materials r data miner short course university canberra now 
## [1] "------------"
## Before:  canberra healthhack th th october 
## After :  canberra healthhack th th october 
## [1] "------------"
## Before:  alimallaith sorry seems please check link details 
## After :  alimallaith NA seems please check link details 
## [1] "------------"
## Before:  phd scholarships data science analytics canberra australia 
## After :  phd scholarships data scienc analytics canberra australia 
## [1] "------------"
## Before:  free halfday short course r data mining university canberra ampm fri oct seats limited 
## After :  free halfday short course r data miner university canberra ampm fri oct seats limit 
## [1] "------------"
## Before:  hamishbr will travel canberra future interested giving talk canberra data scientists meetup thanks 
## After :  hamishbr will travel canberra future interested give talk canberra data scientist meetup thank 
## [1] "------------"
## Before:  getting started apache spark free ebook 
## After :  get start apache spark free ebook 
## [1] "------------"
## Before:  using natural language processing nontextual data mllib presentation hadoop summit melbourne 
## After :  use natural language process nontextual data mllib presenting hadoop summit melbourne 
## [1] "------------"

