[nltk Books] Chapter 5. Categorizing and Tagging Words

Author : tmlab / Date : 2016. 10. 27. 17:56 / Category : Text Mining/Python

3. Processing Raw Text

  • 1. How can we write programs to access text from local files and from the Web?
  • 2. How can we split documents up into individual words?
  • 3. How can we write programs to produce formatted output and save it?
  • Tokenization, plus finding and replacing patterns with regular expressions
  • http://www.nltk.org/book/ch03.html

4. Writing Structured Programs

  • How can you write well-structured, readable programs that you and others will be able to re-use easily?
  • How do the fundamental building blocks work, such as loops, functions and assignment?
  • What are some of the pitfalls with Python programming and how can you avoid them?
  • Writing functions and for loops (procedural vs. declarative style)
  • http://www.nltk.org/book/ch04.html

5. Categorizing and Tagging Words

1. Using a Tagger

  • A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word
  • The POS tagger attaches the appropriate part-of-speech tag to each word
In [1]:
import nltk, re, pprint
from nltk import word_tokenize
In [2]:
text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)
Out[2]:
[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
  • Another example, this time including some homonyms
  • refuse and permit both appear as a present tense verb (VBP) and a noun (NN)
  • Homonyms can receive different tags depending on how they are used.
In [3]:
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)
Out[3]:
[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]
  • Many of these categories arise from superficial analysis of the distribution of words in text.
  • Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner).
  • The text.similar() method takes a word w, finds all contexts w1 w w2, then finds all words w' that appear in the same contexts, i.e. w1 w' w2.
  • It shows words that occur in the same contexts as the given word.
In [4]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words()) # lowercase all words
text.similar('woman')
man time day year car moment world family house country child boy
state job way war girl place word work
In [5]:
text.similar('bought')
made said put done seen had found left given heard brought got been
was set told took in felt that
In [6]:
text.similar('over')
in on to of and for with from at by that into as up out down through
is all about
In [7]:
text.similar('the')
a his this their its her an that our any all one these my in your no
some other and

2. Tagged Corpora

2.1 Representing Tagged Tokens

  • A tagged token is represented as a tuple of (word, tag)
In [8]:
tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token
Out[8]:
('fly', 'NN')
In [9]:
sent = '''The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.'''
In [10]:
[nltk.tag.str2tuple(t) for t in sent.split()]
Out[10]:
[('The', 'AT'),
 ('grand', 'JJ'),
 ('jury', 'NN'),
 ('commented', 'VBD'),
 ('on', 'IN'),
 ('a', 'AT'),
 ('number', 'NN'),
 ('of', 'IN'),
 ('other', 'AP'),
 ('topics', 'NNS'),
 (',', ','),
 ('AMONG', 'IN'),
 ('them', 'PPO'),
 ('the', 'AT'),
 ('Atlanta', 'NP'),
 ('and', 'CC'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('purchasing', 'VBG'),
 ('departments', 'NNS'),
 ('which', 'WDT'),
 ('it', 'PPS'),
 ('said', 'VBD'),
 ('``', '``'),
 ('ARE', 'BER'),
 ('well', 'QL'),
 ('operated', 'VBN'),
 ('and', 'CC'),
 ('follow', 'VB'),
 ('generally', 'RB'),
 ('accepted', 'VBN'),
 ('practices', 'NNS'),
 ('which', 'WDT'),
 ('inure', 'VB'),
 ('to', 'IN'),
 ('the', 'AT'),
 ('best', 'JJT'),
 ('interest', 'NN'),
 ('of', 'IN'),
 ('both', 'ABX'),
 ('governments', 'NNS'),
 ("''", "''"),
 ('.', '.')]
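As a side note, the conversion also runs in reverse: nltk.tag.tuple2str turns a (word, tag) tuple back into the word/TAG string, so the two helpers round-trip:

```python
import nltk

# str2tuple parses 'word/TAG'; tuple2str is its inverse
tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token)                      # ('fly', 'NN')
print(nltk.tag.tuple2str(tagged_token))  # fly/NN
```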

2.2 Reading Tagged Corpora

  • Note that part-of-speech tags are conventionally written in uppercase; this has been standard practice since the Brown Corpus was published.
In [11]:
nltk.corpus.brown.tagged_words()
Out[11]:
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...]
In [12]:
nltk.corpus.brown.tagged_words(tagset='universal')
Out[12]:
[(u'The', u'DET'), (u'Fulton', u'NOUN'), ...]
In [13]:
print(nltk.corpus.nps_chat.tagged_words())
[(u'now', 'RB'), (u'im', 'PRP'), (u'left', 'VBD'), ...]
In [14]:
nltk.corpus.conll2000.tagged_words()
Out[14]:
[(u'Confidence', u'NN'), (u'in', u'IN'), ...]
In [15]:
nltk.corpus.treebank.tagged_words()
Out[15]:
[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), ...]
  • Not all corpora employ the same set of tags
  • Initially we want to avoid the complications of these tagsets, so we use a built-in mapping to the "Universal Tagset"
  • Each corpus uses its own tagset; passing tagset='universal' maps them all onto one common tagset
In [16]:
nltk.corpus.brown.tagged_words(tagset='universal')
Out[16]:
[(u'The', u'DET'), (u'Fulton', u'NOUN'), ...]
In [17]:
nltk.corpus.treebank.tagged_words(tagset='universal')
Out[17]:
[(u'Pierre', u'NOUN'), (u'Vinken', u'NOUN'), ...]

2.3 A Universal Part-of-Speech Tagset

Tag    Meaning              English Examples
ADJ    adjective            new, good, high, special, big, local
ADP    adposition           on, of, at, with, by, into, under
ADV    adverb               really, already, still, early, now
CONJ   conjunction          and, or, but, if, while, although
DET    determiner, article  the, a, some, most, every, no, which
NOUN   noun                 year, home, costs, time, Africa
NUM    numeral              twenty-four, fourth, 1991, 14:24
PRT    particle             at, on, out, over per, that, up, with
PRON   pronoun              he, their, her, its, my, I, us
VERB   verb                 is, say, told, given, playing, would
.      punctuation marks    . , ; !
X      other                ersatz, esprit, dunno, gr8, univeristy

Check the most frequent tags in the Brown Corpus:

In [18]:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()
Out[18]:
[(u'NOUN', 30654),
 (u'VERB', 14399),
 (u'ADP', 12355),
 (u'.', 11928),
 (u'DET', 11389),
 (u'ADJ', 6706),
 (u'ADV', 3349),
 (u'CONJ', 2717),
 (u'PRON', 2535),
 (u'PRT', 2264),
 (u'NUM', 2166),
 (u'X', 92)]

2.4 Nouns

Word          After a determiner                           Subject of the verb
woman         the woman who I saw yesterday ...            the woman sat down
Scotland      the Scotland I remember as a child           Scotland has five million people
book          the book I bought yesterday                  this book recounts the colonization of Australia
intelligence  the intelligence displayed by the child ...  Mary's intelligence impressed her teachers
  • Tagging can exploit the statistics of two-word sequences (bigrams).
  • To begin with, we construct a list of bigrams whose members are themselves word-tag pairs such as (('The', 'DET'), ('Fulton', 'NP')) and (('Fulton', 'NP'), ('County', 'N')).
  • Then we construct a FreqDist from the tag parts of the bigrams.
  • Extracting the tag parts of the bigrams lets us see which tags precede NOUN
In [19]:
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
fdist = nltk.FreqDist(noun_preceders)
[tag for (tag, _) in fdist.most_common()]
Out[19]:
[u'NOUN',
 u'DET',
 u'ADJ',
 u'ADP',
 u'.',
 u'VERB',
 u'CONJ',
 u'NUM',
 u'ADV',
 u'PRT',
 u'PRON',
 u'X']

2.5 Verbs

2.6 Adjectives and Adverbs

2.7 Unsimplified Tags

In [20]:
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())
In [21]:
tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
(u'NN', [(u'year', 137), (u'time', 97), (u'state', 88), (u'week', 85), (u'man', 72)])
(u'NN$', [(u"year's", 13), (u"world's", 8), (u"state's", 7), (u"nation's", 6), (u"company's", 6)])
(u'NN$-HL', [(u"Golf's", 1), (u"Navy's", 1)])
(u'NN$-TL', [(u"President's", 11), (u"University's", 3), (u"League's", 3), (u"Gallery's", 3), (u"Army's", 3)])
(u'NN-HL', [(u'cut', 2), (u'Salary', 2), (u'condition', 2), (u'Question', 2), (u'business', 2)])
(u'NN-NC', [(u'eva', 1), (u'ova', 1), (u'aya', 1)])
(u'NN-TL', [(u'President', 88), (u'House', 68), (u'State', 59), (u'University', 42), (u'City', 41)])
(u'NN-TL-HL', [(u'Fort', 2), (u'City', 1), (u'Commissioner', 1), (u'Grove', 1), (u'House', 1)])
(u'NNS', [(u'years', 101), (u'members', 69), (u'people', 52), (u'sales', 51), (u'men', 46)])
(u'NNS$', [(u"children's", 7), (u"women's", 5), (u"men's", 3), (u"janitors'", 3), (u"taxpayers'", 2)])
(u'NNS$-HL', [(u"Dealers'", 1), (u"Idols'", 1)])
(u'NNS$-TL', [(u"Women's", 4), (u"States'", 3), (u"Giants'", 2), (u"Officers'", 1), (u"Bombers'", 1)])
(u'NNS-HL', [(u'years', 1), (u'idols', 1), (u'Creations', 1), (u'thanks', 1), (u'centers', 1)])
(u'NNS-TL', [(u'States', 38), (u'Nations', 11), (u'Masters', 10), (u'Rules', 9), (u'Communists', 9)])
(u'NNS-TL-HL', [(u'Nations', 1)])

2.8 Exploring Tagged Corpora

  • Use the tagged_words() method to look at the part-of-speech tags of particular words:
  • We can check which tags follow a given word
  • For example, the tags of the words that come after often:
In [22]:
brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()
VERB  ADV  ADP  ADJ    .  PRT 
  37    8    7    6    4    2 

3. Mapping Words to Properties Using Python Dictionaries

  • 3.1 Indexing Lists vs Dictionaries
  • 3.2 Dictionaries in Python
  • 3.3 Defining Dictionaries
  • 3.4 Default Dictionaries
  • 3.5 Incrementally Updating a Dictionary
  • 3.6 Complex Keys and Values
  • 3.7 Inverting a Dictionary
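These subsections are only listed here; a minimal sketch of the dictionary idioms they cover (default dictionaries, incremental updating, and inverting a dictionary):

```python
from collections import defaultdict

# Incrementally count tags: defaultdict(int) avoids KeyError on unseen keys
counts = defaultdict(int)
for (word, tag) in [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ('the', 'DET')]:
    counts[tag] += 1
print(counts['DET'])   # 2

# Invert a word -> tag dictionary into tag -> [words]
pos = {'colorless': 'ADJ', 'ideas': 'NOUN', 'sleep': 'VERB', 'furiously': 'ADV'}
pos2 = defaultdict(list)
for word, tag in pos.items():
    pos2[tag].append(word)
print(pos2['ADJ'])     # ['colorless']
```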

4. Automatic Tagging

  • The tag of a word depends on the word and its context within a sentence.
  • For this reason, we will be working with data at the level of (tagged) sentences rather than words.
In [23]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

4.1 The Default Tagger

  • The simplest possible tagger assigns the same tag to each token.
  • Every token simply gets the tag that is most frequent in the corpus
In [24]:
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
nltk.FreqDist(tags).max()
Out[24]:
u'NN'
In [25]:
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)
Out[25]:
[('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('green', 'NN'),
 ('eggs', 'NN'),
 ('and', 'NN'),
 ('ham', 'NN'),
 (',', 'NN'),
 ('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('them', 'NN'),
 ('Sam', 'NN'),
 ('I', 'NN'),
 ('am', 'NN'),
 ('!', 'NN')]
In [26]:
default_tagger.evaluate(brown_tagged_sents)
Out[26]:
0.13089484257215028

4.2 The Regular Expression Tagger

  • The regular expression tagger assigns tags to tokens on the basis of matching patterns.
  • Words are tagged according to the patterns they match.
In [27]:
patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers (escaped dot matches a literal decimal point)
    (r'.*', 'NN')                     # nouns (default)
]
In [28]:
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
Out[28]:
[(u'``', 'NN'),
 (u'Only', 'NN'),
 (u'a', 'NN'),
 (u'relative', 'NN'),
 (u'handful', 'NN'),
 (u'of', 'NN'),
 (u'such', 'NN'),
 (u'reports', 'NNS'),
 (u'was', 'NNS'),
 (u'received', 'VBD'),
 (u"''", 'NN'),
 (u',', 'NN'),
 (u'the', 'NN'),
 (u'jury', 'NN'),
 (u'said', 'NN'),
 (u',', 'NN'),
 (u'``', 'NN'),
 (u'considering', 'VBG'),
 (u'the', 'NN'),
 (u'widespread', 'NN'),
 (u'interest', 'NN'),
 (u'in', 'NN'),
 (u'the', 'NN'),
 (u'election', 'NN'),
 (u',', 'NN'),
 (u'the', 'NN'),
 (u'number', 'NN'),
 (u'of', 'NN'),
 (u'voters', 'NNS'),
 (u'and', 'NN'),
 (u'the', 'NN'),
 (u'size', 'NN'),
 (u'of', 'NN'),
 (u'this', 'NNS'),
 (u'city', 'NN'),
 (u"''", 'NN'),
 (u'.', 'NN')]
In [29]:
regexp_tagger.evaluate(brown_tagged_sents)
Out[29]:
0.20326391789486245

4.3 The Lookup Tagger

  • Let's find the hundred most frequent words and store their most likely tag.
  • We can then use this information as the model for a "lookup tagger"
  • Each word is tagged with its most likely tag, learned from already-tagged data
In [30]:
fd = nltk.FreqDist(brown.words(categories='news')) # word frequencies in Brown news
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news')) # tag frequencies conditioned on word
most_freq_words = fd.most_common(100) # the 100 most frequent words
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words) # most likely tag for each of the 100
baseline_tagger = nltk.UnigramTagger(model=likely_tags) # lookup tagger built from the 100-word model
baseline_tagger.evaluate(brown_tagged_sents)
Out[30]:
0.45578495136941344
In [31]:
sent = brown.sents(categories='news')[3]
baseline_tagger.tag(sent)
Out[31]:
[(u'``', u'``'),
 (u'Only', None),
 (u'a', u'AT'),
 (u'relative', None),
 (u'handful', None),
 (u'of', u'IN'),
 (u'such', None),
 (u'reports', None),
 (u'was', u'BEDZ'),
 (u'received', None),
 (u"''", u"''"),
 (u',', u','),
 (u'the', u'AT'),
 (u'jury', None),
 (u'said', u'VBD'),
 (u',', u','),
 (u'``', u'``'),
 (u'considering', None),
 (u'the', u'AT'),
 (u'widespread', None),
 (u'interest', None),
 (u'in', u'IN'),
 (u'the', u'AT'),
 (u'election', None),
 (u',', u','),
 (u'the', u'AT'),
 (u'number', None),
 (u'of', u'IN'),
 (u'voters', None),
 (u'and', u'CC'),
 (u'the', u'AT'),
 (u'size', None),
 (u'of', u'IN'),
 (u'this', u'DT'),
 (u'city', None),
 (u"''", u"''"),
 (u'.', u'.')]
  • Let's back the lookup tagger off to a default tagger and re-evaluate it.
In [34]:
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
baseline_tagger.evaluate(brown_tagged_sents)
Out[34]:
0.5817769556656125
  • Let's run the backed-off tagger on a sentence.
In [35]:
sent = brown.sents(categories='news')[3]
baseline_tagger.tag(sent)
Out[35]:
[(u'``', u'``'),
 (u'Only', 'NN'),
 (u'a', u'AT'),
 (u'relative', 'NN'),
 (u'handful', 'NN'),
 (u'of', u'IN'),
 (u'such', 'NN'),
 (u'reports', 'NN'),
 (u'was', u'BEDZ'),
 (u'received', 'NN'),
 (u"''", u"''"),
 (u',', u','),
 (u'the', u'AT'),
 (u'jury', 'NN'),
 (u'said', u'VBD'),
 (u',', u','),
 (u'``', u'``'),
 (u'considering', 'NN'),
 (u'the', u'AT'),
 (u'widespread', 'NN'),
 (u'interest', 'NN'),
 (u'in', u'IN'),
 (u'the', u'AT'),
 (u'election', 'NN'),
 (u',', u','),
 (u'the', u'AT'),
 (u'number', 'NN'),
 (u'of', u'IN'),
 (u'voters', 'NN'),
 (u'and', u'CC'),
 (u'the', u'AT'),
 (u'size', 'NN'),
 (u'of', u'IN'),
 (u'this', u'DT'),
 (u'city', 'NN'),
 (u"''", u"''"),
 (u'.', u'.')]
In [36]:
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

display()

5. N-Gram Tagging

5.1 Unigram Tagging

  • Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.
  • For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is most often used as an adjective
In [38]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])
Out[38]:
[(u'Various', u'JJ'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'apartments', u'NNS'),
 (u'are', u'BER'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'terrace', u'NN'),
 (u'type', u'NN'),
 (u',', u','),
 (u'being', u'BEG'),
 (u'on', u'IN'),
 (u'the', u'AT'),
 (u'ground', u'NN'),
 (u'floor', u'NN'),
 (u'so', u'QL'),
 (u'that', u'CS'),
 (u'entrance', u'NN'),
 (u'is', u'BEZ'),
 (u'direct', u'JJ'),
 (u'.', u'.')]
In [39]:
unigram_tagger.evaluate(brown_tagged_sents)
Out[39]:
0.9349006503968017

5.2 Separating the Training and Testing Data

  • Now that we are training a tagger on some data, we must be careful not to test it on the same data
  • A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would also be useless for tagging new text.
In [41]:
size = int(len(brown_tagged_sents) * 0.9)
size
Out[41]:
4160
In [42]:
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
Out[42]:
0.8120203329014253

5.3 General N-Gram Tagging

  • An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens
  • 1-gram tagger is another term for a unigram tagger; 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers
  • In the example of an n-gram tagger shown in 5.1, we have n=3
  • we consider the tags of the two preceding words in addition to the current word.
  • An n-gram tagger picks the tag that is most likely in the given context.

In [44]:
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(brown_sents[2007])
Out[44]:
[(u'Various', u'JJ'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'apartments', u'NNS'),
 (u'are', u'BER'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'terrace', u'NN'),
 (u'type', u'NN'),
 (u',', u','),
 (u'being', u'BEG'),
 (u'on', u'IN'),
 (u'the', u'AT'),
 (u'ground', u'NN'),
 (u'floor', u'NN'),
 (u'so', u'CS'),
 (u'that', u'CS'),
 (u'entrance', u'NN'),
 (u'is', u'BEZ'),
 (u'direct', u'JJ'),
 (u'.', u'.')]
In [45]:
unseen_sent = brown_sents[4203]
bigram_tagger.tag(unseen_sent)
Out[45]:
[(u'The', u'AT'),
 (u'population', u'NN'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'Congo', u'NP'),
 (u'is', u'BEZ'),
 (u'13.5', None),
 (u'million', None),
 (u',', None),
 (u'divided', None),
 (u'into', None),
 (u'at', None),
 (u'least', None),
 (u'seven', None),
 (u'major', None),
 (u'``', None),
 (u'culture', None),
 (u'clusters', None),
 (u"''", None),
 (u'and', None),
 (u'innumerable', None),
 (u'tribes', None),
 (u'speaking', None),
 (u'400', None),
 (u'separate', None),
 (u'dialects', None),
 (u'.', None)]
In [46]:
bigram_tagger.evaluate(test_sents)
Out[46]:
0.10276088906608193

5.4 Combining Taggers

  • combine the results of a bigram tagger, a unigram tagger, and a default tagger
    • 1) Try tagging the token with the bigram tagger.
    • 2) If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
    • 3) If the unigram tagger is also unable to find a tag, use a default tagger.
In [47]:
t0 = nltk.DefaultTagger('NN')  # tag everything as a noun
t1 = nltk.UnigramTagger(train_sents, backoff=t0) # unigram tagger, backing off to t0
t2 = nltk.BigramTagger(train_sents, backoff=t1)  # bigram tagger, backing off to t1
t2.evaluate(test_sents)
Out[47]:
0.844911791089405

5.5 Tagging Unknown Words

  • Tag out-of-vocabulary words as UNK first, then train the n-gram tagger so it learns context-dependent tags for UNK

5.6 Storing Taggers

  • Storing a trained tagger and loading it later is more convenient than retraining it every time
  • Let's save our tagger t2 to a file t2.pkl
In [48]:
# save the tagger
from pickle import dump
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()
In [49]:
# load the tagger
from pickle import load
input = open('t2.pkl', 'rb')
tagger = load(input)
input.close()
In [50]:
text = """The board's action shows what free enterprise is up against in our complex maze of regulatory laws ."""
tokens = text.split()
tagger.tag(tokens)
Out[50]:
[('The', u'AT'),
 ("board's", u'NN$'),
 ('action', 'NN'),
 ('shows', u'NNS'),
 ('what', u'WDT'),
 ('free', u'JJ'),
 ('enterprise', 'NN'),
 ('is', u'BEZ'),
 ('up', u'RP'),
 ('against', u'IN'),
 ('in', u'IN'),
 ('our', u'PP$'),
 ('complex', u'JJ'),
 ('maze', 'NN'),
 ('of', u'IN'),
 ('regulatory', 'NN'),
 ('laws', u'NNS'),
 ('.', u'.')]

5.7 Performance Limitations

  • What is the performance ceiling of an n-gram tagger?
  • In the trigram example, roughly 5% of contexts are ambiguous
  • Tagging errors can be inspected with a confusion matrix

6. Transformation-Based Tagging

  • A potential issue with n-gram taggers is the size of the n-gram table
  • A second issue concerns context: the only information an n-gram tagger considers from prior context is tags, not the words themselves
  • Brill tagging is an inductive, transformation-based tagging method
  • Brill tagging performs very well using only a tiny fraction of the size of n-gram tagger models
  • We will examine the operation of two rules
    • (a) Replace NN with VB when the previous word is TO
    • (b) Replace TO with IN when the next tag is NNS.
    • The text is first tagged with the unigram tagger, then the rules are applied to fix the errors.
  • Brill taggers have another interesting property: the rules are linguistically interpretable.
Phrase   to  increase  grants  to  states  for  vocational  rehabilitation
Unigram  TO  NN        NNS     TO  NNS     IN   JJ          NN
Rule 1       VB
Rule 2                         IN
Output   TO  VB        NNS     IN  NNS     IN   JJ          NN
Gold     TO  VB        NNS     IN  NNS     IN   JJ          NN
In [ ]:
nltk.tag.brill.demo()
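The two rules in the table above can be sketched as a simple post-correction pass over the unigram tagger's output. This is a hypothetical helper for illustration, not the actual nltk.tag.brill implementation:

```python
# Illustrative sketch of the two Brill rules, applied left to right
def apply_brill_rules(tagged):
    out = list(tagged)
    for i, (word, tag) in enumerate(tagged):
        # (a) Replace NN with VB when the previous word is tagged TO
        if tag == 'NN' and i > 0 and out[i - 1][1] == 'TO':
            out[i] = (word, 'VB')
        # (b) Replace TO with IN when the next tag is NNS
        elif tag == 'TO' and i + 1 < len(tagged) and tagged[i + 1][1] == 'NNS':
            out[i] = (word, 'IN')
    return out

# The unigram tagging from the table above
unigram_output = [('to', 'TO'), ('increase', 'NN'), ('grants', 'NNS'), ('to', 'TO'),
                  ('states', 'NNS'), ('for', 'IN'), ('vocational', 'JJ'),
                  ('rehabilitation', 'NN')]
print(apply_brill_rules(unigram_output))
# matches the Gold row: TO VB NNS IN NNS IN JJ NN
```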

7. How to Determine the Category of a Word

7.1 Morphological Clues : -ness, -ment

7.2 Syntactic Clues : before a noun

7.3 Semantic Clues : new coinages tend to be nouns

7.4 New Words :

7.5 Morphology in Part of Speech Tagsets

Form   Category              Tag
go     base                  VB
goes   3rd singular present  VBZ
gone   past participle       VBN
going  gerund                VBG
went   simple past           VBD

