[nltk Books] Chapter 2. Accessing Text Corpora and Lexical Resources

Author : tmlab / Date : 2016. 10. 27. 17:55 / Category : Text Mining/Python

2. Accessing Text Corpora and Lexical Resources

자연어처리의 실질적 작업은 보통 large bodies of linguistic data나 corpora를 사용함.

Goal of this chapter
1. 유용한 text corpora와 lexical resources는 무엇이 있는가?, 그리고 Python에서 어떻게 접근하는가?
2. 이 작업을 위한 가장 유용한 Python construct들은 무엇이 있는가?
3. Python코드를 작성할 때 우리가 스스로를 반복하는 것을 어떻게 피할 수 있는가?(How do we avoid repeating ourselves when writing Python code?)

1. Accessing Text Corpora

1.1 Gutenberg Corpus

약 25,000 개의 무료 전자책을 포함한 Project Gutenberg electronic text archive에 접근하여 일부만 가져옴

In [1]:

import nltk

In [2]:

from nltk.corpus import gutenberg
gutenberg.fileids()

Out[2]:

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [3]:

emma = gutenberg.words("austen-emma.txt")
len(emma)

Out[3]:

In [4]:

print(gutenberg.raw("austen-emma.txt")[:1000])

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness o

text indexing하기
- chap1과는 다르게, nltk.corpus를 calling하여 concordance 명령어를 쓰기 위해선 nltk.Text로 감싸줄 필요가 있다
예제

In [5]:

emma1 = nltk.Text(emma)
emma1.concordance("surprize")

Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity ` 
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on 
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; 
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr . 
 sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
 . It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

Gutenberg가 포함하고 있는 파일들의 간략한 정보 살펴보기

In [6]:

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words),round(num_words/num_sents),round(num_words/num_vocab),fileid)

5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt

gutenberg.sents

In [7]:

macbeth_sentences = gutenberg.sents("shakespeare-macbeth.txt")
print(macbeth_sentences,"\n")
print(macbeth_sentences[1116],"\n")
longest_len = max(len(s) for s in macbeth_sentences)
print([s for s in macbeth_sentences if len(s) == longest_len])

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...] 

['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';', 'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble'] 

[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]

nltk.corpus의 words(), raw(), sents()로부터 접근할 수 있는 method들
- part-of-speech tags, dialogue tags, syntactic trees 등
- 후에 나옴

1.2 Web and Chat Text

Web data: 총 6개
1. Firefox discussion forum
2. (추측) king arthur의 대본
3. conversations overheard in New York
4. the movie script of Pirates of the Carribean
5. personal advertisements
6. wine reviews

In [8]:

from nltk.corpus import webtext as web
print(web.fileids(),"\n\n")
for fileid in web.fileids():
    print(fileid, web.raw(fileid)[:65],"\n")

['firefox.txt', 'grail.txt', 'overheard.txt', 'pirates.txt', 'singles.txt', 'wine.txt'] 


firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se 

grail.txt SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop 

overheard.txt White guy: So, do you have any plans for this evening?
Asian girl 

pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr 

singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun 

wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb

chat data
- data source
  - instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators
- 10,000 Posts, 유저이름 대체됨, 개인정보 삭제됨
- 총 15개의 파일, 데이터 수집날짜 및 나이대 수집
- 파일 이름은 날짜, 채팅방, post 갯수
  - e.g., 10-19-20s-706posts는 2006/10/19, 20대 대화방, 706포스트
- 2006년도 기준

In [9]:

from nltk.corpus import nps_chat as chat
print("FILE: ",chat.fileids(),"\n")
chatroom = chat.posts('10-19-20s_706posts.xml')
print("예제: ",chatroom[123])

FILE:  ['10-19-20s_706posts.xml', '10-19-30s_705posts.xml', '10-19-40s_686posts.xml', '10-19-adults_706posts.xml', '10-24-40s_706posts.xml', '10-26-teens_706posts.xml', '11-06-adults_706posts.xml', '11-08-20s_705posts.xml', '11-08-40s_706posts.xml', '11-08-adults_705posts.xml', '11-08-teens_706posts.xml', '11-09-20s_706posts.xml', '11-09-40s_706posts.xml', '11-09-adults_706posts.xml', '11-09-teens_706posts.xml'] 

예제:  ['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

1.3 Brown Corpus

first million-word electronic corpus of English, created in 1961 at Brown University
다양한 장르의 500개의 소스로 만들어진 corpus
e.g.,

ID	File	Genre	Description
A16	ca16	news	Chicago Tribune: Society Reportage
B02	cb02	editorial	Christian Science Monitor: Editorials
C17	cc17	reviews	Time Magazine: Reviews
D12	cd12	religion	Underwood: Probing the Ethics of Realtors
E36	ce36	hobbies	Norling: Renting a Car in Europe
F25	cf25	lore	Boroff: Jewish Teenage Culture
G22	cg22	belles_lettres	Reiner: Coping with Runaway Technology
H15	ch15	government	US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17	cj19	learned	Mosteller: Probability with Statistical Applications
K04	ck04	fiction	W.E.B. Du Bois: Worlds of Color
L13	cl13	mystery	Hitchens: Footsteps in the Night
M01	cm01	science_fiction	Heinlein: Stranger in a Strange Land
N14	cn15	adventure	Field: Rattlesnake Ridge
P12	cp12	romance	Callaghan: A Passion in Rome
R06	cr06	humor	Thurber: The Future, If Any, of Comedy

In [10]:

from nltk.corpus import brown
print(brown.categories(),"\n")
print(brown.words(categories="news"),"\n")
print(brown.words(fileids=['cg22']),"\n")
print(brown.sents(categories=["news","editorial","reviews"]))

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction'] 

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] 

['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...] 

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

brown corpus는 서로다른 장르의 systematic한 차이점을 보는데 유용함
- 특정 modal(법조동사)의 사용 빈도를 알아보기

In [11]:

news_text = brown.words(categories="news")
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ["can","could","may","might","must","will"]
for m in modals:
    print(m+":",fdist[m],end = " ")

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

각 장르별로 알아보기

In [12]:

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories = genre))
genres = ["news","religion","hobbies","science_fiction","romance","humor"]
cfd.tabulate(conditions=genres,samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13

1.4 Reuters Corpus

10,788개의 뉴스 문서, 130만개의 단어로 이루어짐
90개의 토픽, test/training set으로 구분 됨

In [13]:

from nltk.corpus import reuters as rt
print(rt.fileids()[:6],"\n")
print(rt.categories())

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839'] 

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']

brown과 다르게 category가 중첩됨

In [14]:

print(rt.categories('training/9865'),"\n")
print(rt.categories('training/9880'),"\n")
print(rt.fileids("barley"),"\n")

['barley', 'corn', 'grain', 'wheat'] 

['money-fx'] 

['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', 'test/18024', 'test/18263', 'test/18908', 'test/19275', 'test/19668', 'training/10175', 'training/1067', 'training/11208', 'training/11316', 'training/11885', 'training/12428', 'training/13099', 'training/13744', 'training/13795', 'training/13852', 'training/13856', 'training/1652', 'training/1970', 'training/2044', 'training/2171', 'training/2172', 'training/2191', 'training/2217', 'training/2232', 'training/3132', 'training/3324', 'training/395', 'training/4280', 'training/4296', 'training/5', 'training/501', 'training/5467', 'training/5610', 'training/5640', 'training/6626', 'training/7205', 'training/7579', 'training/8213', 'training/8257', 'training/8759', 'training/9865', 'training/9958']

또한 타이틀도 포함되어 있음
- 대문자로 저장됨

In [15]:

print(rt.words("training/9865")[:14])

['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']

1.5 Inaugural Address Corpus

55개의 텍스트

In [16]:

import matplotlib.pyplot as plt
%matplotlib nbagg

In [17]:

>>> from nltk.corpus import inaugural
>>> print(inaugural.fileids(),"\n")

>>> print([fileid[:4] for fileid in inaugural.fileids()])

['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt'] 

['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']

america와 citizen의 빈도 보기

In [18]:

>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target))
>>> cfd.plot()

1.6 Annotated Text Corpora

nltk에서 접근가능한 corpus들 예시

Corpus	Compiler	Contents
Brown Corpus	Francis, Kucera	15 genres, 1.15M words, tagged, categorized
CESS Treebanks	CLiC-UB	1M words, tagged and parsed (Catalan, Spanish)
Chat-80 Data Files	Pereira & Warren	World Geographic Database
CMU Pronouncing Dictionary	CMU	127k entries
CoNLL 2000 Chunking Data	CoNLL	270k words, tagged and chunked
CoNLL 2002 Named Entity	CoNLL	700k words, pos- and named-entity-tagged (Dutch, Spanish)
CoNLL 2007 Dependency Treebanks (sel)	CoNLL	150k words, dependency parsed (Basque, Catalan)
Dependency Treebank	Narad	Dependency parsed version of Penn Treebank sample
FrameNet	Fillmore, Baker et al	10k word senses, 170k manually annotated sentences
Floresta Treebank	Diana Santos et al	9k sentences, tagged and parsed (Portuguese)
Gazetteer Lists	Various	Lists of cities and countries
Genesis Corpus	Misc web sources	6 texts, 200k words, 6 languages
Gutenberg (selections)	Hart, Newby, et al	18 texts, 2M words
Inaugural Address Corpus	CSpan	US Presidential Inaugural Addresses (1789-present)
Indian POS-Tagged Corpus	Kumaran et al	60k words, tagged (Bangla, Hindi, Marathi, Telugu)
MacMorpho Corpus	NILC, USP, Brazil	1M words, tagged (Brazilian Portuguese)
Movie Reviews	Pang, Lee	2k movie reviews with sentiment polarity classification
Names Corpus	Kantrowitz, Ross	8k male and female names
NIST 1999 Info Extr (selections)	Garofolo	63k words, newswire and named-entity SGML markup
Nombank	Meyers	115k propositions, 1400 noun frames
NPS Chat Corpus	Forsyth, Martell	10k IM chat posts, POS-tagged and dialogue-act tagged
Open Multilingual WordNet	Bond et al	15 languages, aligned to English WordNet
PP Attachment Corpus	Ratnaparkhi	28k prepositional phrases, tagged as noun or verb modifiers
Proposition Bank	Palmer	113k propositions, 3300 verb frames
Question Classification	Li, Roth	6k questions, categorized
Reuters Corpus	Reuters	1.3M words, 10k news documents, categorized
Roget's Thesaurus	Project Gutenberg	200k words, formatted text
RTE Textual Entailment	Dagan et al	8k sentence pairs, categorized
SEMCOR	Rus, Mihalcea	880k words, part-of-speech and sense tagged
Senseval 2 Corpus	Pedersen	600k words, part-of-speech and sense tagged
SentiWordNet	Esuli, Sebastiani	sentiment scores for 145k WordNet synonym sets
Shakespeare texts (selections)	Bosak	8 books in XML format
State of the Union Corpus	CSPAN	485k words, formatted text
Stopwords Corpus	Porter et al	2,400 stopwords for 11 languages
Swadesh Corpus	Wiktionary	comparative wordlists in 24 languages
Switchboard Corpus (selections)	LDC	36 phonecalls, transcribed, parsed
Univ Decl of Human Rights	United Nations	480k words, 300+ languages
Penn Treebank (selections)	LDC	40k words, tagged and parsed
TIMIT Corpus (selections)	NIST/LDC	audio files and transcripts for 16 speakers
VerbNet 2.1	Palmer et al	5k verbs, hierarchically organized, linked to WordNet
Wordlist Corpus	OpenOffice.org et al	960k words and 20k affixes for 8 languages
WordNet 3.0 (English)	Miller, Fellbaum	145k synonym sets

1.7 Corpora in Other Languages

다양한 언어의 corpora들이 있음

In [19]:

print(nltk.corpus.cess_esp.words(),"\n")
print(nltk.corpus.floresta.words(),"\n")
print(nltk.corpus.indian.words('hindi.pos'),"\n")
print(nltk.corpus.udhr.fileids()[:10],"\n")
print(nltk.corpus.udhr.words("Korean_Hankuko-UTF8")[:14],"\n")
print(nltk.corpus.udhr.words('Javanese-Latin1')[11:],"\n")

['El', 'grupo', 'estatal', 'Electricité_de_France', ...] 

['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...] 

['पूर्ण', 'प्रतिबंध', 'हटाओ', ':', 'इराक', 'संयुक्त', ...] 

['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1', 'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1', 'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1'] 

['세', '계', '인', '권', '선', '언', '전', '문', '모든', '인류', '구성원의', '천부의', '존엄성과', '동등하고'] 

['Saben', 'umat', 'manungsa', 'lair', 'kanthi', 'hak', ...]

udhr
- the Universal Declaration of Human Rights in over 300 languages

In [20]:

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...           (lang, len(word))
...           for lang in languages
...           for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)

1.8 Text Corpus Structure

e.g.,

nltk의 기초 함수

Example	Description
fileids()	the files of the corpus
fileids([categories])	the files of the corpus corresponding to these categories
categories()	the categories of the corpus
categories([fileids])	the categories of the corpus corresponding to these files
raw()	the raw content of the corpus
raw(fileids=[f1,f2,f3])	the raw content of the specified files
raw(categories=[c1,c2])	the raw content of the specified categories
words()	the words of the whole corpus
words(fileids=[f1,f2,f3])	the words of the specified fileids
words(categories=[c1,c2])	the words of the specified categories
sents()	the sentences of the whole corpus
sents(fileids=[f1,f2,f3])	the sentences of the specified fileids
sents(categories=[c1,c2])	the sentences of the specified categories
abspath(fileid)	the location of the given file on disk
encoding(fileid)	the encoding of the file (if known)
open(fileid)	open a stream for reading the given corpus file
root	if the path to the root of locally installed corpus
readme()	the contents of the README file of the corpus

1.9 Loading your own Corpus

PlaintextCorpusReader사용
1. corpus_root 설정
2. PlaintextCorpusReader사용

In [21]:

from nltk.corpus import PlaintextCorpusReader as pcr
corpus_root = "./"
wordlists = pcr(corpus_root,".*")

In [22]:

print(wordlists.fileids(),"\n")
print(wordlists.words("thesis.txt"))

['.ipynb_checkpoints/nltk chap2-checkpoint.ipynb', 'nltk chap2.ipynb', 'thesis.txt'] 

['국문', '요약', '최근', '주요언론에서도', '다룰', '정도로', '부정적인', ...]

In [23]:

wordlists.sents()

Out[23]:

[['{', '"', 'cells', '":', '[', '{', '"', 'cell_type', '":', '"', 'markdown', '",', '"', 'metadata', '":', '{},', '"', 'source', '":', '[', '"#', '2', '.'], ['Accessing', 'Text', 'Corpora', 'and', 'Lexical', 'Resources', '\\', 'n', '",', '"\\', 'n', '",', '"', '자연어처리의', '실질적', '작업은', '보통', 'large', 'bodies', 'of', 'linguistic', 'data나', 'corpora를', '사용함', '.\\', 'n', '",', '"\\', 'n', '",', '"+', 'Goal', 'of', 'this', 'chapter', '\\', 'n', '",', '"', '1', '.'], ...]

BracketParseCorpusReader 사용

In [ ]:

>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
>>> file_pattern = r".*/wsj_.*\.mrg"
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()

['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]

>>> len(ptb.sents())

49208

>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]

['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the',
'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", 'Anastasio',
'Somoza', ',', 'as', 'imperial', 'as', 'Ferdinand', 'Marcos', 'of', 'the', 'Philippines',
'or', 'as', 'bloody', 'as', 'Haiti', "'s", 'Baby', Doc', 'Duvalier', '.']

2. Conditional Frequency Distributions

단순 빈도 분석이 아니라, 서로다른 장르나 카테고리에서 단어들의 빈도를 비교분석 하는 것
e.g.,

2.1 Conditions and Events

각 조건 당 이벤트가 발생하는 것
- pair임

2.2 Counting Words by Genre

In [24]:

>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))

파일이 너무 큼, 특정 장르만 선택하기

In [25]:

genre_word = [(genre,word)
             for genre in ["news","romance"]
             for word in brown.words(categories=genre)]
len(genre_word)

Out[25]:

In [26]:

print(genre_word[:4],"\n")
print(genre_word[-4:])

[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')] 

[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]

해당 데이터만 가지고 CFD구하기

In [27]:

cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd)
print(cfd.conditions())

<ConditionalFreqDist with 2 conditions>
['news', 'romance']

In [28]:

print(cfd["news"])
print(cfd["romance"])
print(cfd["romance"].most_common(10))

<FreqDist with 14394 samples and 100554 outcomes>
<FreqDist with 8452 samples and 70022 outcomes>
[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502), ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993)]

2.3 Plotting and Tabulating Distributions

상기 취임연설문

In [29]:

>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target))

상기 세계 인권 선언문

In [30]:

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...           (lang, len(word))
...           for lang in languages
...           for word in udhr.words(lang + '-Latin1'))

영어와 독일어 비교
- conditions
  - 조건을 주는 parameter, default는 all
- samples
  - sample 갯수 제한(글자 수)

In [31]:

>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...              samples=range(10), cumulative=True)

                  0    1    2    3    4    5    6    7    8    9 
       English    0  185  525  883  997 1166 1283 1440 1558 1638 
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275

상기의 결과는 Set처리 안한 결과임

2.4 Generating Random Text with Bigrams

In [32]:

from nltk.util import bigrams
>>> sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
...   'and', 'the', 'earth', '.']
>>> list(nltk.bigrams(sent))

Out[32]:

[('In', 'the'),
 ('the', 'beginning'),
 ('beginning', 'God'),
 ('God', 'created'),
 ('created', 'the'),
 ('the', 'heaven'),
 ('heaven', 'and'),
 ('and', 'the'),
 ('the', 'earth'),
 ('earth', '.')]

In [33]:

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

In [34]:

cfd["living"]

Out[34]:

FreqDist({',': 1,
          '.': 1,
          'creature': 7,
          'soul': 1,
          'substance': 2,
          'thing': 4})

In [35]:

generate_model(cfd,"living")

living creature that he said , and the land of the land of the land

CFD의 일반적으로 사용되는 method들

Example	Description
cfdist = ConditionalFreqDist(pairs)	create a conditional frequency distribution from a list of pairs
cfdist.conditions()	the conditions
cfdist[condition]	the frequency distribution for this condition
cfdist[condition][sample]	frequency for the given sample for this condition
cfdist.tabulate()	tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)	tabulation limited to the specified samples and conditions
cfdist.plot()	graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)	graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2	test if samples in cfdist1 occur less frequently than in cfdist2

3. More Python: Reusing Code

실행하고 싶은 코드로 돌아가 한줄 재실행
- >>>를 해당 코드 앞에 적어 놓으면 interpret 하게 실행됨
실행하고 싶은 코드를 .py파일로 만들어 import하기
함수로 만들어 사용하기

In [36]:

>>> def lexical_diversity(my_text_data):
...     word_count = len(my_text_data)
...     vocab_size = len(set(my_text_data))
...     diversity_score = vocab_size / word_count
...     return diversity_score

In [37]:

>>> from nltk.corpus import genesis
>>> kjv = genesis.words('english-kjv.txt')
>>> lexical_diversity(kjv)

Out[37]:

0.06230453042623537

4. Lexical Resources

Lexicon
- collection of words and/or phrases along with associated information such as part of speech and sense definitions.

e.g., 일반적 Lexicon 예시

4.1 Wordlist Copora

단어 목록만으로 존재하는 Copora
- 자주 안 쓰이거나, 철자가 틀린 단어 찾기

In [38]:

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

In [39]:

print(unusual_words(nltk.corpus.gutenberg.words("austen-sense.txt"))[:6])

['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused']

stopwords

In [40]:

from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

In [41]:

>>> def content_fraction(text):
...     stopwords = nltk.corpus.stopwords.words('english')
...     content = [w for w in text if w.lower() not in stopwords]
...     return len(content) / len(text)
...
>>> content_fraction(nltk.corpus.reuters.words())

Out[41]:

0.735240435097661

4.2 A Pronouncing Dictionary

In [42]:

>>> entries = nltk.corpus.cmudict.entries()
>>> len(entries)

Out[42]:

In [43]:

>>> for entry in entries[42371:42379]:
...     print(entry)

('fir', ['F', 'ER1'])
('fire', ['F', 'AY1', 'ER0'])
('fire', ['F', 'AY1', 'R'])
('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M'])
('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M'])
('firearms', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M', 'Z'])
('firearms', ['F', 'AY1', 'R', 'AA2', 'R', 'M', 'Z'])
('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L'])

4.3 Comparative Wordlists

Swadesh wordlists
- 200개의 단어들을 다국어로 저장

In [44]:

from nltk.corpus import swadesh
print(swadesh.fileids(),"\n")
print(swadesh.words("en"))

['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk'] 

['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', 'thick', 'heavy', 'small', 'short', 'narrow', 'thin', 'woman', 'man (adult male)', 'man (human being)', 'child', 'wife', 'husband', 'mother', 'father', 'animal', 'fish', 'bird', 'dog', 'louse', 'snake', 'worm', 'tree', 'forest', 'stick', 'fruit', 'seed', 'leaf', 'root', 'bark (from tree)', 'flower', 'grass', 'rope', 'skin', 'meat', 'blood', 'bone', 'fat (noun)', 'egg', 'horn', 'tail', 'feather', 'hair', 'head', 'ear', 'eye', 'nose', 'mouth', 'tooth', 'tongue', 'fingernail', 'foot', 'leg', 'knee', 'hand', 'wing', 'belly', 'guts', 'neck', 'back', 'breast', 'heart', 'liver', 'drink', 'eat', 'bite', 'suck', 'spit', 'vomit', 'blow', 'breathe', 'laugh', 'see', 'hear', 'know (a fact)', 'think', 'smell', 'fear', 'sleep', 'live', 'die', 'kill', 'fight', 'hunt', 'hit', 'cut', 'split', 'stab', 'scratch', 'dig', 'swim', 'fly (verb)', 'walk', 'come', 'lie', 'sit', 'stand', 'turn', 'fall', 'give', 'hold', 'squeeze', 'rub', 'wash', 'wipe', 'pull', 'push', 'throw', 'tie', 'sew', 'count', 'say', 'sing', 'play', 'float', 'flow', 'freeze', 'swell', 'sun', 'moon', 'star', 'water', 'rain', 'river', 'lake', 'sea', 'salt', 'stone', 'sand', 'dust', 'earth', 'cloud', 'fog', 'sky', 'wind', 'snow', 'ice', 'smoke', 'fire', 'ashes', 'burn', 'road', 'mountain', 'red', 'green', 'yellow', 'white', 'black', 'night', 'day', 'year', 'warm', 'cold', 'full', 'new', 'old', 'good', 'bad', 'rotten', 'dirty', 'straight', 'round', 'sharp', 'dull', 'smooth', 'wet', 'dry', 'correct', 'near', 'far', 'right', 'left', 'at', 'in', 'with', 'and', 'if', 'because', 'name']

In [45]:

>>> fr2en = swadesh.entries(['fr', 'en'])
>>> print(fr2en[:6],"\n")

>>> translate = dict(fr2en)
>>> print(translate['chien'],"\n")

>>> print(translate['jeter'])

[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ('vous', 'you (plural)'), ('ils, elles', 'they')] 

dog 

throw

4.4 Shoebox and Toolbox Lexicons

linguists for managing data tool
- ps: part-of-speech
- ge: gloss-into-english
- ex: exmple sentence in Rotokas
- xp: translate Tok Pisin
- xe: translate English

In [47]:

>>> from nltk.corpus import toolbox
>>> toolbox.entries('rotokas.dic')[:2]

Out[47]:

[('kaa',
  [('ps', 'V'),
   ('pt', 'A'),
   ('ge', 'gag'),
   ('tkp', 'nek i pas'),
   ('dcsv', 'true'),
   ('vx', '1'),
   ('sc', '???'),
   ('dt', '29/Oct/2005'),
   ('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'),
   ('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
   ('xe', 'Apoka is gagging from food while talking.')]),
 ('kaa',
  [('ps', 'V'),
   ('pt', 'B'),
   ('ge', 'strangle'),
   ('tkp', 'pasim nek'),
   ('arg', 'O'),
   ('vx', '2'),
   ('dt', '07/Oct/2006'),
   ('ex', 'Rera rauroro rera kaarevoi.'),
   ('xp', 'Em i holim pas em na nekim em.'),
   ('xe', 'He is holding him and strangling him.'),
   ('ex', 'Iroiro-ia oirato okoearo kaaivoi uvare rirovira kaureoparoveira.'),
   ('xp', 'Ol i pasim nek bilong man long rop bikos em i save bikhet tumas.'),
   ('xe',
    "They strangled the man's neck with rope because he was very stubborn and arrogant."),
   ('ex',
    'Oirato okoearo kaaivoi iroiro-ia. Uva viapau uvuiparoi ra vovouparo uva kopiiroi.'),
   ('xp',
    'Ol i pasim nek bilong man long rop. Olsem na em i no pulim win olsem na em i dai.'),
   ('xe',
    "They strangled the man's neck with a rope. And he couldn't breathe and he died.")])]

5. WordNet

semantic 기반 사전
- nltk는 영어 wordnet을 포함하고 있음
- 총 155,287단어와 117,659 동음이의어 셋 포함

5.1 Senses and Synonyms

e.g.,

a. Benz is credited with the invention of the motorcar.

b. Benz is credited with the invention of the automobile.

In [48]:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')

Out[48]:

[Synset('car.n.01')]

In [49]:

>>> wn.synset('car.n.01').lemma_names()

Out[49]:

['car', 'auto', 'automobile', 'machine', 'motorcar']

In [50]:

>>> print(wn.synset('car.n.01').definition(),"\n")
>>> print(wn.synset('car.n.01').examples())

a motor vehicle with four wheels; usually propelled by an internal combustion engine 

['he needs a car to get to work']

In [51]:

print(wn.synset('car.n.01').lemmas(),"\n")
print(wn.lemma('car.n.01.automobile'),"\n")
print(wn.lemma('car.n.01.automobile').synset(),"\n")
print(wn.lemma('car.n.01.automobile').name())

[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'), Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')] 

Lemma('car.n.01.automobile') 

Synset('car.n.01') 

automobile

In [52]:

print(wn.synsets("car"),"\n")
>>> for synset in wn.synsets('car'):
...     print(synset.lemma_names())

[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')] 

['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']

5.2 The WordNet Hierarchy

e.g.,
wordnet을 이용하면 손쉽게 하위어를 찾을 수 있음

In [55]:

motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())

Out[55]:

['Model_T',
 'S.U.V.',
 'SUV',
 'Stanley_Steamer',
 'ambulance',
 'beach_waggon',
 'beach_wagon',
 'bus',
 'cab',
 'compact',
 'compact_car',
 'convertible',
 'coupe',
 'cruiser',
 'electric',
 'electric_automobile',
 'electric_car',
 'estate_car',
 'gas_guzzler',
 'hack',
 'hardtop',
 'hatchback',
 'heap',
 'horseless_carriage',
 'hot-rod',
 'hot_rod',
 'jalopy',
 'jeep',
 'landrover',
 'limo',
 'limousine',
 'loaner',
 'minicar',
 'minivan',
 'pace_car',
 'patrol_car',
 'phaeton',
 'police_car',
 'police_cruiser',
 'prowl_car',
 'race_car',
 'racer',
 'racing_car',
 'roadster',
 'runabout',
 'saloon',
 'secondhand_car',
 'sedan',
 'sport_car',
 'sport_utility',
 'sport_utility_vehicle',
 'sports_car',
 'squad_car',
 'station_waggon',
 'station_wagon',
 'stock_car',
 'subcompact',
 'subcompact_car',
 'taxi',
 'taxicab',
 'tourer',
 'touring_car',
 'two-seater',
 'used-car',
 'waggon',
 'wagon']

상위 패스 찾기

In [56]:

print(motorcar.hypernyms(),"\n")
paths = motorcar.hypernym_paths()
print(len(paths),"\n")
print([synset.name() for synset in paths[0]],"\n")
print([synset.name() for synset in paths[1]],"\n")

[Synset('motor_vehicle.n.01')] 

2 

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01'] 

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

root 찾기

In [57]:

>>> motorcar.root_hypernyms()

Out[57]:

[Synset('entity.n.01')]

5.3 More Lexical Relations

meronyms나 holonyms도 찾을 수 있음
- the parts of a tree are its trunk, crown, and so on; the part_meronyms().
- The substance a tree is made of includes heartwood and sapwood; the substance_meronyms().
- A collection of trees forms a forest; the member_holonyms()
한국식 해석
- meronyms - 반가운 얼굴을 보았다(얼굴 - 사람)
- holonyms - 얼굴에는 눈,코,입 등이 존재

In [58]:

print(wn.synset('tree.n.01').part_meronyms(),"\n")
print(wn.synset('tree.n.01').substance_meronyms(),"\n")
print(wn.synset('tree.n.01').member_holonyms())

[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')] 

[Synset('heartwood.n.01'), Synset('sapwood.n.01')] 

[Synset('forest.n.01')]

To see just how intricate things can get, consider the word mint, which has several closely-related senses. We can see that mint.n.04 is part of mint.n.02 and the substance from which mint.n.05 is made.

In [59]:

>>> for synset in wn.synsets('mint', wn.NOUN):
...     print(synset.name() + ':', synset.definition())

batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government

In [60]:

print(wn.synset('mint.n.04').part_holonyms(),"\n")
print(wn.synset('mint.n.04').substance_holonyms())

[Synset('mint.n.02')] 

[Synset('mint.n.05')]

There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments:

In [61]:

>>> print(wn.synset('walk.v.01').entailments(),"\n")
>>> print(wn.synset('eat.v.01').entailments(),"\n")
>>> print(wn.synset('tease.v.03').entailments(),"\n")

[Synset('step.v.01')] 

[Synset('chew.v.01'), Synset('swallow.v.01')] 

[Synset('arouse.v.07'), Synset('disappoint.v.01')]

반의어

In [62]:

>>> print(wn.lemma('supply.n.02.supply').antonyms(),"\n")
>>> print(wn.lemma('rush.v.01.rush').antonyms(),"\n")
>>> print(wn.lemma('horizontal.a.01.horizontal').antonyms(),"\n")
>>> print(wn.lemma('staccato.r.01.staccato').antonyms(),"\n")

[Lemma('demand.n.02.demand')] 

[Lemma('linger.v.04.linger')] 

[Lemma('inclined.a.02.inclined'), Lemma('vertical.a.01.vertical')] 

[Lemma('legato.r.01.legato')]

5.4 Semantic Similarity

In [63]:

>>> right = wn.synset('right_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')

In [64]:

print(right.lowest_common_hypernyms(minke),"\n")
print(right.lowest_common_hypernyms(orca),"\n")
print(right.lowest_common_hypernyms(tortoise),"\n")
print(right.lowest_common_hypernyms(novel))

[Synset('baleen_whale.n.01')] 

[Synset('whale.n.02')] 

[Synset('vertebrate.n.01')] 

[Synset('entity.n.01')]

In [65]:

>>> print(wn.synset('baleen_whale.n.01').min_depth(),"\n")
>>> print(wn.synset('whale.n.02').min_depth(),"\n")
>>> print(wn.synset('vertebrate.n.01').min_depth(),"\n")
>>> print(wn.synset('entity.n.01').min_depth())

In [66]:

>>> right.path_similarity(minke)

Out[66]:

0.25

In [67]:

>>> right.path_similarity(orca)

Out[67]:

0.16666666666666666

In [68]:

>>> right.path_similarity(tortoise)

Out[68]:

0.07692307692307693

In [69]:

>>> right.path_similarity(novel)

Out[69]:

0.043478260869565216

저작자표시 비영리 변경금지

'Text Mining/Python' 관련 글

[형태소 분석] Konlpy 사용하기

Date : 2017.02.17

[nltk Books] Chapter 5. Categorizing and Tagging Words

Date : 2016.10.27

[nltk Books] Chapter 1. Computing with Language: Texts and Words

Date : 2016.10.27

[텍스트마이닝] TF-IDF

Date : 2016.10.22

Admin

05-17 01:00

Contact Us

Address
경기도 수원시 영통구 원천동 산5번지 아주대학교 다산관 429호

E-mail
textminings@gmail.com

Phone
031-219-2910

« 2024/05 »
일	월	화	수	목	금	토
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31