[nltk Books] Chapter 1. Computing with Language: Texts and Words

Author : tmlab / Date : 2016. 10. 27. 17:54 / Category : Text Mining/Python

1. Computing with Language: Texts and Words

1.2 Getting started with NLTK

Import the NLTK natural language processing module and download the sample data sets used by nltk.book.

In [1]:
import nltk
nltk.download("all")
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/jester/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nltk_data]    | Downloading package cess_esp to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package cess_esp is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package city_database to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package city_database is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package comparative_sentences to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package comparative_sentences is already up-to-
[nltk_data]    |       date!
[nltk_data]    | Downloading package comtrans to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package comtrans is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    | Downloading package conll2007 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package conll2007 is already up-to-date!
[nltk_data]    | Downloading package crubadan to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package crubadan is already up-to-date!
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package dependency_treebank is already up-to-date!
[nltk_data]    | Downloading package europarl_raw to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package europarl_raw is already up-to-date!
[nltk_data]    | Downloading package floresta to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package floresta is already up-to-date!
[nltk_data]    | Downloading package framenet_v15 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package framenet_v15 is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package ieer to /home/jester/nltk_data...
[nltk_data]    |   Package ieer is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package indian to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package indian is already up-to-date!
[nltk_data]    | Downloading package jeita to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package jeita is already up-to-date!
[nltk_data]    | Downloading package kimmo to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package kimmo is already up-to-date!
[nltk_data]    | Downloading package knbc to /home/jester/nltk_data...
[nltk_data]    |   Package knbc is already up-to-date!
[nltk_data]    | Downloading package lin_thesaurus to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package lin_thesaurus is already up-to-date!
[nltk_data]    | Downloading package mac_morpho to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package mac_morpho is already up-to-date!
[nltk_data]    | Downloading package machado to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package machado is already up-to-date!
[nltk_data]    | Downloading package masc_tagged to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package masc_tagged is already up-to-date!
[nltk_data]    | Downloading package moses_sample to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package moses_sample is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Downloading package nombank.1.0 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package nombank.1.0 is already up-to-date!
[nltk_data]    | Downloading package nps_chat to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package nps_chat is already up-to-date!
[nltk_data]    | Downloading package omw to /home/jester/nltk_data...
[nltk_data]    |   Package omw is already up-to-date!
[nltk_data]    | Downloading package opinion_lexicon to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package opinion_lexicon is already up-to-date!
[nltk_data]    | Downloading package paradigms to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package paradigms is already up-to-date!
[nltk_data]    | Downloading package pil to /home/jester/nltk_data...
[nltk_data]    |   Package pil is already up-to-date!
[nltk_data]    | Downloading package pl196x to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package pl196x is already up-to-date!
[nltk_data]    | Downloading package ppattach to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package ppattach is already up-to-date!
[nltk_data]    | Downloading package problem_reports to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package problem_reports is already up-to-date!
[nltk_data]    | Downloading package propbank to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package propbank is already up-to-date!
[nltk_data]    | Downloading package ptb to /home/jester/nltk_data...
[nltk_data]    |   Package ptb is already up-to-date!
[nltk_data]    | Downloading package product_reviews_1 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package product_reviews_1 is already up-to-date!
[nltk_data]    | Downloading package product_reviews_2 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package product_reviews_2 is already up-to-date!
[nltk_data]    | Downloading package pros_cons to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package pros_cons is already up-to-date!
[nltk_data]    | Downloading package qc to /home/jester/nltk_data...
[nltk_data]    |   Package qc is already up-to-date!
[nltk_data]    | Downloading package reuters to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package reuters is already up-to-date!
[nltk_data]    | Downloading package rte to /home/jester/nltk_data...
[nltk_data]    |   Package rte is already up-to-date!
[nltk_data]    | Downloading package semcor to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package semcor is already up-to-date!
[nltk_data]    | Downloading package senseval to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package senseval is already up-to-date!
[nltk_data]    | Downloading package sentiwordnet to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package sentiwordnet is already up-to-date!
[nltk_data]    | Downloading package sentence_polarity to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package sentence_polarity is already up-to-date!
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package shakespeare is already up-to-date!
[nltk_data]    | Downloading package sinica_treebank to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package sinica_treebank is already up-to-date!
[nltk_data]    | Downloading package smultron to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package smultron is already up-to-date!
[nltk_data]    | Downloading package state_union to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package state_union is already up-to-date!
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package stopwords is already up-to-date!
[nltk_data]    | Downloading package subjectivity to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package subjectivity is already up-to-date!
[nltk_data]    | Downloading package swadesh to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package swadesh is already up-to-date!
[nltk_data]    | Downloading package switchboard to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package switchboard is already up-to-date!
[nltk_data]    | Downloading package timit to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package timit is already up-to-date!
[nltk_data]    | Downloading package toolbox to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package toolbox is already up-to-date!
[nltk_data]    | Downloading package treebank to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package udhr to /home/jester/nltk_data...
[nltk_data]    |   Package udhr is already up-to-date!
[nltk_data]    | Downloading package udhr2 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package udhr2 is already up-to-date!
[nltk_data]    | Downloading package unicode_samples to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package unicode_samples is already up-to-date!
[nltk_data]    | Downloading package universal_treebanks_v20 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package universal_treebanks_v20 is already up-to-
[nltk_data]    |       date!
[nltk_data]    | Downloading package verbnet to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package verbnet is already up-to-date!
[nltk_data]    | Downloading package webtext to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package webtext is already up-to-date!
[nltk_data]    | Downloading package wordnet to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package wordnet is already up-to-date!
[nltk_data]    | Downloading package wordnet_ic to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package wordnet_ic is already up-to-date!
[nltk_data]    | Downloading package words to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package words is already up-to-date!
[nltk_data]    | Downloading package ycoe to /home/jester/nltk_data...
[nltk_data]    |   Package ycoe is already up-to-date!
[nltk_data]    | Downloading package rslp to /home/jester/nltk_data...
[nltk_data]    |   Package rslp is already up-to-date!
[nltk_data]    | Downloading package hmm_treebank_pos_tagger to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package hmm_treebank_pos_tagger is already up-to-
[nltk_data]    |       date!
[nltk_data]    | Downloading package maxent_treebank_pos_tagger to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package maxent_treebank_pos_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package universal_tagset to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package universal_tagset is already up-to-date!
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package maxent_ne_chunker is already up-to-date!
[nltk_data]    | Downloading package punkt to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package punkt is already up-to-date!
[nltk_data]    | Downloading package book_grammars to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package book_grammars is already up-to-date!
[nltk_data]    | Downloading package sample_grammars to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package sample_grammars is already up-to-date!
[nltk_data]    | Downloading package spanish_grammars to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package spanish_grammars is already up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package large_grammars to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package large_grammars is already up-to-date!
[nltk_data]    | Downloading package tagsets to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package tagsets is already up-to-date!
[nltk_data]    | Downloading package snowball_data to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package snowball_data is already up-to-date!
[nltk_data]    | Downloading package bllip_wsj_no_aux to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package bllip_wsj_no_aux is already up-to-date!
[nltk_data]    | Downloading package word2vec_sample to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package word2vec_sample is already up-to-date!
[nltk_data]    | Downloading package panlex_swadesh to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package panlex_swadesh is already up-to-date!
[nltk_data]    | Downloading package mte_teip5 to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package mte_teip5 is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /home/jester/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package panlex_lite to
[nltk_data]    |     /home/jester/nltk_data...
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-1-d22ab9bfeafe> in <module>()
      1 import nltk
----> 2 nltk.download("all")

KeyboardInterrupt: 
  • Load all of the data from nltk.book.
  • All of the data in nltk.book is already tokenized.
In [2]:
import nltk
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

1.3 Searching text

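  • This section looks at several ways to search for a particular word within a text.
  • concordance() shows every occurrence of a word in the text, along with its surrounding context.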
In [3]:
text1.concordance('monstrous')

# Q1. Try searching for other words with the concordance function.
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
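To get a feel for what a concordance does, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name, the `width` parameter, and the toy token list are assumptions made for illustration. It matches case-insensitively, as the output above does (both "monstrous" and "Monstrous" are listed).

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Toy token list (hypothetical data, not taken from text1).
toy_tokens = "the monstrous whale and the Monstrous Pictures of whales".split()
for line in concordance(toy_tokens, "monstrous"):
    print(line)
```

Because nltk.book texts are already tokenized, the same idea applies to them directly: the search walks over a list of tokens rather than a raw string.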
  • The similar() function
    • It shows words that appear in contexts similar to those of "monstrous".
In [4]:
text1.similar('monstrous')
horrible maddens careful exasperate uncommon tyrannical wise fearless
curious christian untoward true mean pitiable trustworthy domineering
singular passing puzzled contemptible
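One way to read "similar context" is to treat the context of a word as the pair (previous word, next word) and rank other words by how many such pairs they share with the target. The sketch below follows that simplified reading in plain Python. The scoring rule and all names are assumptions, and NLTK's own ranking may differ.

```python
from collections import Counter, defaultdict


def similar(tokens, word, num=5):
    """Rank other words by the number of (previous, next) contexts shared with `word`."""
    contexts = defaultdict(set)
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur.lower()].add((prev.lower(), nxt.lower()))

    target = word.lower()
    target_contexts = contexts[target]
    scores = Counter()
    for other, other_contexts in contexts.items():
        shared = len(other_contexts & target_contexts)
        if other != target and shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(num)]


# Toy token list (hypothetical data): "curious" and "horrible" each share
# one context with "monstrous".
toy_tokens = (
    "a monstrous size and a curious size and "
    "a monstrous bulk and a horrible bulk"
).split()
print(similar(toy_tokens, "monstrous"))
```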
  • common_contexts() shows the contexts shared by two or more words.
  • How these are computed is covered later on, so we will come back to it then.
In [5]:
text2.common_contexts(['monstrous','very'])
is_pretty be_glad a_lucky a_pretty am_glad
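As a rough illustration of what a shared context is, the sketch below collects the previous_next pairs around each word and keeps only the pairs that occur with every word in the list. It is a simplified stand-in written in plain Python, not NLTK's implementation, and the toy sentence is made up.

```python
def common_contexts(tokens, words):
    """Return the sorted previous_next contexts in which every word in `words` occurs."""
    wanted = {w.lower() for w in words}
    seen = {w: set() for w in wanted}
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        cur = cur.lower()
        if cur in wanted:
            seen[cur].add(f"{prev.lower()}_{nxt.lower()}")
    # Keep only the contexts that appear with all of the requested words.
    return sorted(set.intersection(*seen.values()))


# Toy token list (hypothetical data, not taken from text2).
toy_tokens = (
    "she is very pretty and he is monstrous pretty and "
    "I am very glad and I am monstrous glad"
).split()
print(common_contexts(toy_tokens, ["monstrous", "very"]))
```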
  • dispersion_plot() visualizes how the given words are distributed across the whole text.
    • Here it shows how the words are distributed across 220 years of US presidential inaugural addresses.
In [7]:
import matplotlib
%matplotlib nbagg

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
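The data behind a dispersion plot is just the list of token offsets at which each word occurs. The sketch below computes those offsets and draws a crude text version of the plot without matplotlib. The function names, the `bins` parameter, and the exact, case-sensitive matching are assumptions for illustration and do not describe how NLTK draws its figure.

```python
def dispersion(tokens, words):
    """Map each word to the token offsets at which it occurs (exact, case-sensitive match)."""
    offsets = {word: [] for word in words}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


def render(tokens, words, bins=20):
    """Draw one text row per word, marking the bins in which the word occurs."""
    rows = []
    for word, positions in dispersion(tokens, words).items():
        row = ["."] * bins
        for position in positions:
            row[position * bins // len(tokens)] = "|"
        rows.append(f"{word:>10} {''.join(row)}")
    return rows


# Toy token list (hypothetical data, not taken from text4).
toy_tokens = "citizens freedom duties citizens America freedom America America".split()
for row in render(toy_tokens, ["citizens", "America"], bins=8):
    print(row)
```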

1.4 Counting vocabulary

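  • This section looks at several ways to count the words in a text.
  • What is a token? It is the technical name for a sequence of characters treated as a unit, such as a word, a punctuation mark, or an emoticon.
    • Examples include hairy, his, ! and :).
  • len() returns the number of tokens in the data.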
In [8]:
len(text3)
Out[8]:
44764
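The texts in nltk.book come already tokenized, which is why len() can count tokens directly. To see what tokens look like when starting from a raw string, here is a small regex-based sketch. The pattern is an assumption chosen for this example and is not the tokenizer NLTK uses.

```python
import re

# Emoticons such as :) first, then runs of word characters, then any other single symbol.
TOKEN_PATTERN = re.compile(r"[:;]-?[()DP]|\w+|[^\w\s]")


def tokenize(text):
    """Split raw text into word, punctuation, and emoticon tokens."""
    return TOKEN_PATTERN.findall(text)


toy_tokens = tokenize("his hairy dog barked ! :)")
print(toy_tokens)
print(len(toy_tokens))  # number of tokens, punctuation and emoticon included
```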
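  • word type: a distinct word form that appears among the tokens.
  • type: the broader term, covering not only word types but every distinct item among the tokens, including punctuation marks and numbers.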
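The difference between tokens, types, and word types is easy to check on a small token list. In the sketch below the list is made up, and lowercasing the words before counting word types is an assumption for this example (the notebook cell that follows does not lowercase).

```python
# Toy token list (hypothetical data, already tokenized like the nltk.book texts).
toy_tokens = ["In", "the", "beginning", "God", "created", "the",
              "heaven", "and", "the", "earth", "."]

# Types: every distinct token, punctuation included.
types = set(toy_tokens)

# Word types: distinct alphabetic tokens, lowercased so "The" and "the" would count once.
word_types = {tok.lower() for tok in toy_tokens if tok.isalpha()}

print(len(toy_tokens))
print(len(types))
print(len(word_types))
```

The repeated "the" is why there are fewer types than tokens, and the "." is why there are fewer word types than types.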
In [9]:
print(sorted(set(text3)))  # show every type in text3
print(len(set(text3)))  # show the number of types in text3
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', 'Adam', 'Adbeel', 'Admah', 'Adullamite', 'After', 'Aholibamah', 'Ahuzzath', 'Ajah', 'Akan', 'All', 'Allonbachuth', 'Almighty', 'Almodad', 'Also', 'Alvah', 'Alvan', 'Am', 'Amal', 'Amalek', 'Amalekites', 'Ammon', 'Amorite', 'Amorites', 'Amraphel', 'An', 'Anah', 'Anamim', 'And', 'Aner', 'Angel', 'Appoint', 'Aram', 'Aran', 'Ararat', 'Arbah', 'Ard', 'Are', 'Areli', 'Arioch', 'Arise', 'Arkite', 'Arodi', 'Arphaxad', 'Art', 'Arvadite', 'As', 'Asenath', 'Ashbel', 'Asher', 'Ashkenaz', 'Ashteroth', 'Ask', 'Asshur', 'Asshurim', 'Assyr', 'Assyria', 'At', 'Atad', 'Avith', 'Baalhanan', 'Babel', 'Bashemath', 'Be', 'Because', 'Becher', 'Bedad', 'Beeri', 'Beerlahairoi', 'Beersheba', 'Behold', 'Bela', 'Belah', 'Benam', 'Benjamin', 'Beno', 'Beor', 'Bera', 'Bered', 'Beriah', 'Bethel', 'Bethlehem', 'Bethuel', 'Beware', 'Bilhah', 'Bilhan', 'Binding', 'Birsha', 'Bless', 'Blessed', 'Both', 'Bow', 'Bozrah', 'Bring', 'But', 'Buz', 'By', 'Cain', 'Cainan', 'Calah', 'Calneh', 'Can', 'Cana', 'Canaan', 'Canaanite', 'Canaanites', 'Canaanitish', 'Caphtorim', 'Carmi', 'Casluhim', 'Cast', 'Cause', 'Chaldees', 'Chedorlaomer', 'Cheran', 'Cherubims', 'Chesed', 'Chezib', 'Come', 'Cursed', 'Cush', 'Damascus', 'Dan', 'Day', 'Deborah', 'Dedan', 'Deliver', 'Diklah', 'Din', 'Dinah', 'Dinhabah', 'Discern', 'Dishan', 'Dishon', 'Do', 'Dodanim', 'Dothan', 'Drink', 'Duke', 'Dumah', 'Earth', 'Ebal', 'Eber', 'Edar', 'Eden', 'Edom', 'Edomites', 'Egy', 'Egypt', 'Egyptia', 'Egyptian', 'Egyptians', 'Ehi', 'Elah', 'Elam', 'Elbethel', 'Eldaah', 'EleloheIsrael', 'Eliezer', 'Eliphaz', 'Elishah', 'Ellasar', 'Elon', 'Elparan', 'Emins', 'En', 'Enmishpat', 'Eno', 'Enoch', 'Enos', 'Ephah', 'Epher', 'Ephra', 'Ephraim', 'Ephrath', 'Ephron', 'Er', 'Erech', 'Eri', 'Es', 'Esau', 'Escape', 'Esek', 'Eshban', 'Eshcol', 'Ethiopia', 
'Euphrat', 'Euphrates', 'Eve', 'Even', 'Every', 'Except', 'Ezbon', 'Ezer', 'Fear', 'Feed', 'Fifteen', 'Fill', 'For', 'Forasmuch', 'Forgive', 'From', 'Fulfil', 'G', 'Gad', 'Gaham', 'Galeed', 'Gatam', 'Gather', 'Gaza', 'Gentiles', 'Gera', 'Gerar', 'Gershon', 'Get', 'Gether', 'Gihon', 'Gilead', 'Girgashites', 'Girgasite', 'Give', 'Go', 'God', 'Gomer', 'Gomorrah', 'Goshen', 'Guni', 'Hadad', 'Hadar', 'Hadoram', 'Hagar', 'Haggi', 'Hai', 'Ham', 'Hamathite', 'Hamor', 'Hamul', 'Hanoch', 'Happy', 'Haran', 'Hast', 'Haste', 'Have', 'Havilah', 'Hazarmaveth', 'Hazezontamar', 'Hazo', 'He', 'Hear', 'Heaven', 'Heber', 'Hebrew', 'Hebrews', 'Hebron', 'Hemam', 'Hemdan', 'Here', 'Hereby', 'Heth', 'Hezron', 'Hiddekel', 'Hinder', 'Hirah', 'His', 'Hitti', 'Hittite', 'Hittites', 'Hivite', 'Hobah', 'Hori', 'Horite', 'Horites', 'How', 'Hul', 'Huppim', 'Husham', 'Hushim', 'Huz', 'I', 'If', 'In', 'Irad', 'Iram', 'Is', 'Isa', 'Isaac', 'Iscah', 'Ishbak', 'Ishmael', 'Ishmeelites', 'Ishuah', 'Isra', 'Israel', 'Issachar', 'Isui', 'It', 'Ithran', 'Jaalam', 'Jabal', 'Jabbok', 'Jac', 'Jachin', 'Jacob', 'Jahleel', 'Jahzeel', 'Jamin', 'Japhe', 'Japheth', 'Jared', 'Javan', 'Jebusite', 'Jebusites', 'Jegarsahadutha', 'Jehovahjireh', 'Jemuel', 'Jerah', 'Jetheth', 'Jetur', 'Jeush', 'Jezer', 'Jidlaph', 'Jimnah', 'Job', 'Jobab', 'Jokshan', 'Joktan', 'Jordan', 'Joseph', 'Jubal', 'Judah', 'Judge', 'Judith', 'Kadesh', 'Kadmonites', 'Karnaim', 'Kedar', 'Kedemah', 'Kemuel', 'Kenaz', 'Kenites', 'Kenizzites', 'Keturah', 'Kiriathaim', 'Kirjatharba', 'Kittim', 'Know', 'Kohath', 'Kor', 'Korah', 'LO', 'LORD', 'Laban', 'Lahairoi', 'Lamech', 'Lasha', 'Lay', 'Leah', 'Lehabim', 'Lest', 'Let', 'Letushim', 'Leummim', 'Levi', 'Lie', 'Lift', 'Lo', 'Look', 'Lot', 'Lotan', 'Lud', 'Ludim', 'Luz', 'Maachah', 'Machir', 'Machpelah', 'Madai', 'Magdiel', 'Magog', 'Mahalaleel', 'Mahalath', 'Mahanaim', 'Make', 'Malchiel', 'Male', 'Mam', 'Mamre', 'Man', 'Manahath', 'Manass', 'Manasseh', 'Mash', 'Masrekah', 'Massa', 'Matred', 'Me', 'Medan', 
'Mehetabel', 'Mehujael', 'Melchizedek', 'Merari', 'Mesha', 'Meshech', 'Mesopotamia', 'Methusa', 'Methusael', 'Methuselah', 'Mezahab', 'Mibsam', 'Mibzar', 'Midian', 'Midianites', 'Milcah', 'Mishma', 'Mizpah', 'Mizraim', 'Mizz', 'Moab', 'Moabites', 'Moreh', 'Moreover', 'Moriah', 'Muppim', 'My', 'Naamah', 'Naaman', 'Nahath', 'Nahor', 'Naphish', 'Naphtali', 'Naphtuhim', 'Nay', 'Nebajoth', 'Neither', 'Night', 'Nimrod', 'Nineveh', 'Noah', 'Nod', 'Not', 'Now', 'O', 'Obal', 'Of', 'Oh', 'Ohad', 'Omar', 'On', 'Onam', 'Onan', 'Only', 'Ophir', 'Our', 'Out', 'Padan', 'Padanaram', 'Paran', 'Pass', 'Pathrusim', 'Pau', 'Peace', 'Peleg', 'Peniel', 'Penuel', 'Peradventure', 'Perizzit', 'Perizzite', 'Perizzites', 'Phallu', 'Phara', 'Pharaoh', 'Pharez', 'Phichol', 'Philistim', 'Philistines', 'Phut', 'Phuvah', 'Pildash', 'Pinon', 'Pison', 'Potiphar', 'Potipherah', 'Put', 'Raamah', 'Rachel', 'Rameses', 'Rebek', 'Rebekah', 'Rehoboth', 'Remain', 'Rephaims', 'Resen', 'Return', 'Reu', 'Reub', 'Reuben', 'Reuel', 'Reumah', 'Riphath', 'Rosh', 'Sabtah', 'Sabtech', 'Said', 'Salah', 'Salem', 'Samlah', 'Sarah', 'Sarai', 'Saul', 'Save', 'Say', 'Se', 'Seba', 'See', 'Seeing', 'Seir', 'Sell', 'Send', 'Sephar', 'Serah', 'Sered', 'Serug', 'Set', 'Seth', 'Shalem', 'Shall', 'Shalt', 'Shammah', 'Shaul', 'Shaveh', 'She', 'Sheba', 'Shebah', 'Shechem', 'Shed', 'Shel', 'Shelah', 'Sheleph', 'Shem', 'Shemeber', 'Shepho', 'Shillem', 'Shiloh', 'Shimron', 'Shinab', 'Shinar', 'Shobal', 'Should', 'Shuah', 'Shuni', 'Shur', 'Sichem', 'Siddim', 'Sidon', 'Simeon', 'Sinite', 'Sitnah', 'Slay', 'So', 'Sod', 'Sodom', 'Sojourn', 'Some', 'Spake', 'Speak', 'Spirit', 'Stand', 'Succoth', 'Surely', 'Swear', 'Syrian', 'Take', 'Tamar', 'Tarshish', 'Tebah', 'Tell', 'Tema', 'Teman', 'Temani', 'Terah', 'Thahash', 'That', 'The', 'Then', 'There', 'Therefore', 'These', 'They', 'Thirty', 'This', 'Thorns', 'Thou', 'Thus', 'Thy', 'Tidal', 'Timna', 'Timnah', 'Timnath', 'Tiras', 'To', 'Togarmah', 'Tola', 'Tubal', 'Tubalcain', 'Twelve', 'Two', 
'Unstable', 'Until', 'Unto', 'Up', 'Upon', 'Ur', 'Uz', 'Uzal', 'We', 'What', 'When', 'Whence', 'Where', 'Whereas', 'Wherefore', 'Which', 'While', 'Who', 'Whose', 'Whoso', 'Why', 'Wilt', 'With', 'Woman', 'Ye', 'Yea', 'Yet', 'Zaavan', 'Zaphnathpaaneah', 'Zar', 'Zarah', 'Zeboiim', 'Zeboim', 'Zebul', 'Zebulun', 'Zemarite', 'Zepho', 'Zerah', 'Zibeon', 'Zidon', 'Zillah', 'Zilpah', 'Zimran', 'Ziphion', 'Zo', 'Zoar', 'Zohar', 'Zuzims', 'a', 'abated', 'abide', 'able', 'abode', 'abomination', 'about', 'above', 'abroad', 'absent', 'abundantly', 'accept', 'accepted', 'according', 'acknowledged', 'activity', 'add', 'adder', 'afar', 'afflict', 'affliction', 'afraid', 'after', 'afterward', 'afterwards', 'aga', 'again', 'against', 'age', 'aileth', 'air', 'al', 'alive', 'all', 'almon', 'alo', 'alone', 'aloud', 'also', 'altar', 'altogether', 'always', 'am', 'among', 'amongst', 'an', 'and', 'angel', 'angels', 'anger', 'angry', 'anguish', 'anointedst', 'anoth', 'another', 'answer', 'answered', 'any', 'anything', 'appe', 'appear', 'appeared', 'appease', 'appoint', 'appointed', 'aprons', 'archer', 'archers', 'are', 'arise', 'ark', 'armed', 'arms', 'army', 'arose', 'arrayed', 'art', 'artificer', 'as', 'ascending', 'ash', 'ashamed', 'ask', 'asked', 'asketh', 'ass', 'assembly', 'asses', 'assigned', 'asswaged', 'at', 'attained', 'audience', 'avenged', 'aw', 'awaked', 'away', 'awoke', 'back', 'backward', 'bad', 'bade', 'badest', 'badne', 'bak', 'bake', 'bakemeats', 'baker', 'bakers', 'balm', 'bands', 'bank', 'bare', 'barr', 'barren', 'basket', 'baskets', 'battle', 'bdellium', 'be', 'bear', 'beari', 'bearing', 'beast', 'beasts', 'beautiful', 'became', 'because', 'become', 'bed', 'been', 'befall', 'befell', 'before', 'began', 'begat', 'beget', 'begettest', 'begin', 'beginning', 'begotten', 'beguiled', 'beheld', 'behind', 'behold', 'being', 'believed', 'belly', 'belong', 'beneath', 'bereaved', 'beside', 'besides', 'besought', 'best', 'betimes', 'better', 'between', 'betwixt', 'beyond', 
'binding', 'bird', 'birds', 'birthday', 'birthright', 'biteth', 'bitter', 'blame', 'blameless', 'blasted', 'bless', 'blessed', 'blesseth', 'blessi', 'blessing', 'blessings', 'blindness', 'blood', 'blossoms', 'bodies', 'boldly', 'bondman', 'bondmen', 'bondwoman', 'bone', 'bones', 'book', 'booths', 'border', 'borders', 'born', 'bosom', 'both', 'bottle', 'bou', 'boug', 'bough', 'bought', 'bound', 'bow', 'bowed', 'bowels', 'bowing', 'boys', 'bracelets', 'branches', 'brass', 'bre', 'breach', 'bread', 'breadth', 'break', 'breaketh', 'breaking', 'breasts', 'breath', 'breathed', 'breed', 'brethren', 'brick', 'brimstone', 'bring', 'brink', 'broken', 'brook', 'broth', 'brother', 'brought', 'brown', 'bruise', 'budded', 'build', 'builded', 'built', 'bulls', 'bundle', 'bundles', 'burdens', 'buried', 'burn', 'burning', 'burnt', 'bury', 'buryingplace', 'business', 'but', 'butler', 'butlers', 'butlership', 'butter', 'buy', 'by', 'cakes', 'calf', 'call', 'called', 'came', 'camel', 'camels', 'camest', 'can', 'cannot', 'canst', 'captain', 'captive', 'captives', 'carcases', 'carried', 'carry', 'cast', 'castles', 'catt', 'cattle', 'caught', 'cause', 'caused', 'cave', 'cease', 'ceased', 'certain', 'certainly', 'chain', 'chamber', 'change', 'changed', 'changes', 'charge', 'charged', 'chariot', 'chariots', 'chesnut', 'chi', 'chief', 'child', 'childless', 'childr', 'children', 'chode', 'choice', 'chose', 'circumcis', 'circumcise', 'circumcised', 'citi', 'cities', 'city', 'clave', 'clean', 'clear', 'cleave', 'clo', 'closed', 'clothed', 'clothes', 'cloud', 'clusters', 'co', 'coat', 'coats', 'coffin', 'cold', 'colours', 'colt', 'colts', 'come', 'comest', 'cometh', 'comfort', 'comforted', 'comi', 'coming', 'command', 'commanded', 'commanding', 'commandment', 'commandments', 'commended', 'committed', 'commune', 'communed', 'communing', 'company', 'compassed', 'compasseth', 'conceal', 'conceive', 'conceived', 'conception', 'concerning', 'concubi', 'concubine', 'concubines', 'confederate', 
'confound', 'consent', 'conspired', 'consume', 'consumed', 'content', 'continually', 'continued', 'cool', 'corn', 'corrupt', 'corrupted', 'couch', 'couched', 'couching', 'could', 'counted', 'countenance', 'countries', 'country', 'covenant', 'covered', 'covering', 'created', 'creature', 'creepeth', 'creeping', 'cried', 'crieth', 'crown', 'cru', 'cruelty', 'cry', 'cubit', 'cubits', 'cunning', 'cup', 'current', 'curse', 'cursed', 'curseth', 'custom', 'cut', 'd', 'da', 'dainties', 'dale', 'damsel', 'damsels', 'dark', 'darkne', 'darkness', 'daughers', 'daught', 'daughte', 'daughter', 'daughters', 'day', 'days', 'dea', 'dead', 'deal', 'dealt', 'dearth', 'death', 'deceitfully', 'deceived', 'deceiver', 'declare', 'decreased', 'deed', 'deeds', 'deep', 'deferred', 'defiled', 'defiledst', 'delight', 'deliver', 'deliverance', 'delivered', 'denied', 'depart', 'departed', 'departing', 'deprived', 'descending', 'desire', 'desired', 'desolate', 'despised', 'destitute', 'destroy', 'destroyed', 'devour', 'devoured', 'dew', 'did', 'didst', 'die', 'died', 'digged', 'dignity', 'dim', 'dine', 'dipped', 'direct', 'discern', 'discerned', 'discreet', 'displease', 'displeased', 'distress', 'distressed', 'divide', 'divided', 'divine', 'divineth', 'do', 'doe', 'doer', 'doest', 'doeth', 'doing', 'dominion', 'done', 'door', 'dost', 'doth', 'double', 'doubled', 'doubt', 'dove', 'down', 'dowry', 'drank', 'draw', 'dread', 'dreadful', 'dream', 'dreamed', 'dreamer', 'dreams', 'dress', 'dressed', 'drew', 'dried', 'drink', 'drinketh', 'drinking', 'driven', 'drought', 'drove', 'droves', 'drunken', 'dry', 'duke', 'dukes', 'dunge', 'dungeon', 'dust', 'dwe', 'dwell', 'dwelled', 'dwelling', 'dwelt', 'e', 'ea', 'each', 'ear', 'earing', 'early', 'earring', 'earrings', 'ears', 'earth', 'east', 'eastward', 'eat', 'eaten', 'eatest', 'edge', 'eight', 'eighteen', 'eighty', 'either', 'elder', 'elders', 'eldest', 'eleven', 'else', 'embalm', 'embalmed', 'embraced', 'emptied', 'empty', 'end', 'ended', 'endued', 
'endure', 'enemies', 'enlarge', 'enmity', 'enough', 'enquire', 'enter', 'entered', 'entreated', 'envied', 'erected', 'errand', 'escape', 'escaped', 'espied', 'establish', 'established', 'ev', 'even', 'evening', 'eventide', 'ever', 'everlasting', 'every', 'evil', 'ewe', 'ewes', 'exceeding', 'exceedingly', 'excel', 'excellency', 'except', 'exchange', 'experience', 'ey', 'eyed', 'eyes', 'fa', 'face', 'faces', 'fai', 'fail', 'failed', 'faileth', 'fainted', 'fair', 'fall', 'fallen', 'falsely', 'fame', 'families', 'famine', 'famished', 'far', 'fashion', 'fast', 'fat', 'fatfleshed', 'fath', 'fathe', 'father', 'fathers', 'fatness', 'faults', 'favour', 'favoured', 'fear', 'feared', 'fearest', 'feast', 'fed', 'feeble', 'feebler', 'feed', 'feeding', 'feel', 'feet', 'fell', 'fellow', 'felt', 'fema', 'female', 'fetch', 'fetched', 'fetcht', 'few', 'fie', 'field', 'fierce', 'fifteen', 'fifth', 'fifty', 'fig', 'fill', 'filled', 'find', 'findest', 'findeth', 'finding', 'fine', 'finish', 'finished', 'fir', 'fire', 'firmame', 'firmament', 'first', 'firstborn', 'firstlings', 'fish', 'fishes', 'five', 'flaming', 'fle', 'fled', 'fleddest', 'flee', 'flesh', 'flo', 'floc', 'flock', 'flocks', 'flood', 'floor', 'fly', 'fo', 'foal', 'foals', 'folk', 'follow', 'followed', 'following', 'folly', 'food', 'foolishly', 'foot', 'for', 'forbid', 'force', 'ford', 'foremost', 'foreskin', 'forgat', 'forget', 'forgive', 'forgotten', 'form', 'formed', 'former', 'forth', 'forty', 'forward', 'fou', 'found', 'fountain', 'fountains', 'four', 'fourscore', 'fourteen', 'fourteenth', 'fourth', 'fowl', 'fowls', 'freely', 'friend', 'friends', 'fro', 'from', 'frost', 'fruit', 'fruitful', 'fruits', 'fugitive', 'fulfilled', 'full', 'furnace', 'furniture', 'fury', 'gard', 'garden', 'garmen', 'garment', 'garments', 'gat', 'gate', 'gather', 'gathered', 'gathering', 'gave', 'gavest', 'generatio', 'generation', 'generations', 'get', 'getting', 'ghost', 'giants', 'gift', 'gifts', 'give', 'given', 'giveth', 'giving', 
'glory', 'go', 'goa', 'goat', 'goats', 'gods', 'goest', 'goeth', 'going', 'gold', 'golden', 'gone', 'good', 'goodly', 'goods', 'gopher', 'got', 'gotten', 'governor', 'gr', 'grace', 'gracious', 'graciously', 'grap', 'grapes', 'grass', 'grave', 'gray', 'gre', 'great', 'greater', 'greatly', 'green', 'grew', 'grief', 'grieved', 'grievous', 'grisl', 'grisled', 'gro', 'ground', 'grove', 'grow', 'grown', 'guard', 'guiding', 'guiltiness', 'guilty', 'gutters', 'h', 'ha', 'habitations', 'had', 'hadst', 'hairs', 'hairy', 'half', 'halted', 'han', 'hand', 'handfuls', 'handle', 'handmaid', 'handmaidens', 'handmaids', 'hands', 'hang', 'hanged', 'hard', 'hardly', 'harlot', 'harm', 'harp', 'harvest', 'hast', 'haste', 'hasted', 'hastened', 'hastily', 'hate', 'hated', 'hath', 'have', 'haven', 'having', 'hazel', 'he', 'head', 'heads', 'healed', 'health', 'heap', 'hear', 'heard', 'hearken', 'hearkened', 'heart', 'hearth', 'hearts', 'heat', 'heav', 'heaven', 'heavens', 'heed', 'heel', 'heels', 'heifer', 'height', 'heir', 'held', 'help', 'hence', 'henceforth', 'her', 'herb', 'herd', 'herdmen', 'herds', 'here', 'herein', 'herself', 'hid', 'hide', 'high', 'hil', 'hills', 'him', 'himself', 'hind', 'hindermost', 'hire', 'hired', 'his', 'hith', 'hither', 'hold', 'hollow', 'home', 'honey', 'honour', 'honourable', 'hor', 'horror', 'horse', 'horsemen', 'horses', 'host', 'hotly', 'hou', 'hous', 'house', 'household', 'households', 'how', 'hundred', 'hundredfo', 'hundredth', 'hunt', 'hunter', 'hunting', 'hurt', 'husba', 'husband', 'husbandman', 'if', 'ill', 'image', 'images', 'imagination', 'imagined', 'in', 'increase', 'increased', 'indeed', 'inhabitants', 'inhabited', 'inherit', 'inheritance', 'iniquity', 'inn', 'innocency', 'instead', 'instructor', 'instruments', 'integrity', 'interpret', 'interpretation', 'interpretations', 'interpreted', 'interpreter', 'into', 'intreat', 'intreated', 'ir', 'is', 'isles', 'issue', 'it', 'itself', 'jewels', 'joined', 'joint', 'journey', 'journeyed', 'journeys', 
'jud', 'judge', 'judged', 'judgment', 'just', 'justice', 'keep', 'keeper', 'kept', 'ki', 'kid', 'kids', 'kill', 'killed', 'kind', 'kindled', 'kindly', 'kindness', 'kindred', 'kinds', 'kine', 'king', 'kingdom', 'kings', 'kiss', 'kissed', 'kn', 'knead', 'kneel', 'knees', 'knew', 'knife', 'know', 'knowest', 'knoweth', 'knowing', 'knowledge', 'known', 'la', 'labour', 'lack', 'lad', 'ladder', 'lade', 'laded', 'laden', 'lads', 'laid', 'lamb', 'lambs', 'lamentati', 'lamp', 'lan', 'land', 'lands', 'language', 'large', 'last', 'laugh', 'laughed', 'law', 'lawgiver', 'laws', 'lay', 'lead', 'leaf', 'lean', 'leanfleshed', 'leap', 'leaped', 'learned', 'least', 'leave', 'leaves', 'led', 'left', 'length', 'lentiles', 'lesser', 'lest', 'let', 'li', 'lie', 'lien', 'liest', 'lieth', 'life', 'lift', 'lifted', 'light', 'lighted', 'lightly', 'lights', 'like', 'likene', 'likeness', 'linen', 'lingered', 'lion', 'little', 'live', 'lived', 'lives', 'liveth', 'living', 'lo', 'lodge', 'lodged', 'loins', 'long', 'longedst', 'longeth', 'look', 'looked', 'loose', 'lord', 'lords', 'loss', 'loud', 'love', 'loved', 'lovest', 'loveth', 'lower', 'lying', 'm', 'ma', 'made', 'magicians', 'magnified', 'maid', 'maiden', 'maidservants', 'make', 'male', 'males', 'man', 'mandrakes', 'manner', 'many', 'mark', 'marriages', 'married', 'marry', 'marvelled', 'mast', 'master', 'matter', 'may', 'mayest', 'me', 'mead', 'meadow', 'meal', 'mean', 'meanest', 'meant', 'measures', 'meat', 'meditate', 'meet', 'meeteth', 'men', 'menservants', 'mention', 'merchant', 'merchantmen', 'mercies', 'merciful', 'mercy', 'merry', 'mess', 'messenger', 'messengers', 'messes', 'met', 'mi', 'midst', 'midwife', 'might', 'mightier', 'mighty', 'milch', 'milk', 'millions', 'mind', 'mine', 'mirth', 'mischief', 'mist', 'mistress', 'mock', 'mocked', 'mocking', 'money', 'month', 'months', 'moon', 'more', 'moreover', 'morever', 'morning', 'morrow', 'morsel', 'morter', 'most', 'mother', 'mou', 'mount', 'mountain', 'mountains', 'mourn', 
'mourned', 'mourning', 'mouth', 'mouths', 'moved', 'moveth', 'moving', 'much', 'mules', 'multiplied', 'multiply', 'multiplying', 'multitude', 'must', 'my', 'myrrh', 'myself', 'n', 'na', 'naked', 'nakedness', 'name', 'named', 'names', 'nati', 'natio', 'nation', 'nations', 'nativity', 'ne', 'near', 'neck', 'needeth', 'needs', 'neither', 'never', 'next', 'nig', 'nigh', 'night', 'nights', 'nine', 'nineteen', 'ninety', 'no', 'none', 'noon', 'nor', 'north', 'northward', 'nostrils', 'not', 'nothing', 'nought', 'nourish', 'nourished', 'now', 'number', 'numbered', 'numbering', 'nurse', 'nuts', 'o', 'oa', 'oak', 'oath', 'obeisance', 'obey', 'obeyed', 'observed', 'obtain', 'occasion', 'occupation', 'of', 'off', 'offended', 'offer', 'offered', 'offeri', 'offering', 'offerings', 'office', 'officer', 'officers', 'oil', 'old', 'olive', 'on', 'one', 'ones', 'only', 'onyx', 'open', 'opened', 'openly', 'or', 'order', 'organ', 'oth', 'other', 'ou', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'overcome', 'overdrive', 'overseer', 'oversig', 'overspread', 'overtake', 'overthrew', 'overthrow', 'overtook', 'own', 'oxen', 'parcel', 'part', 'parted', 'parts', 'pass', 'passed', 'past', 'pasture', 'path', 'pea', 'peace', 'peaceable', 'peaceably', 'peop', 'people', 'peradventure', 'perceived', 'perfect', 'perform', 'perish', 'perpetual', 'person', 'persons', 'physicians', 'piece', 'pieces', 'pigeon', 'pilgrimage', 'pillar', 'pilled', 'pillows', 'pit', 'pitch', 'pitched', 'pitcher', 'pla', 'place', 'placed', 'places', 'plagued', 'plagues', 'plain', 'plains', 'plant', 'planted', 'played', 'pleasant', 'pleased', 'pleaseth', 'pleasure', 'pledge', 'plenteous', 'plenteousness', 'plenty', 'pluckt', 'point', 'poor', 'poplar', 'portion', 'possess', 'possessi', 'possession', 'possessions', 'possessor', 'posterity', 'pottage', 'poured', 'poverty', 'pow', 'power', 'praise', 'pray', 'prayed', 'precious', 'prepared', 'presence', 'present', 'presented', 'preserve', 'preserved', 'pressed', 'prevail', 
'prevailed', 'prey', 'priest', 'priests', 'prince', 'princes', 'pris', 'prison', 'prisoners', 'proceedeth', 'process', 'profit', 'progenitors', 'prophet', 'prosper', 'prospered', 'prosperous', 'protest', 'proved', 'provender', 'provide', 'provision', 'pulled', 'punishment', 'purchase', 'purchased', 'purposing', 'pursue', 'pursued', 'put', 'putting', 'quart', 'quickly', 'quite', 'quiver', 'raiment', 'rain', 'rained', 'raise', 'ram', 'rams', 'ran', 'rank', 'raven', 'ravin', 'reach', 'reached', 'ready', 'reason', 'rebelled', 'rebuked', 'receive', 'received', 'red', 'redeemed', 'refrain', 'refrained', 'refused', 'regard', 'reign', 'reigned', 'remained', 'remaineth', 'remember', 'remembered', 'remove', 'removed', 'removing', 'renown', 'rent', 'repented', 'repenteth', 'replenish', 'report', 'reproa', 'reproach', 'reproved', 'require', 'required', 'requite', 'reserved', 'respect', 'rest', 'rested', 'restore', 'restored', 'restrained', 'return', 'returned', 'reviv', 'reward', 'rewarded', 'ri', 'rib', 'ribs', 'rich', 'riches', 'rid', 'ride', 'rider', 'right', 'righteous', 'righteousness', 'rightly', 'ring', 'ringstraked', 'ripe', 'rise', 'risen', 'riv', 'river', 'rode', 'rods', 'roll', 'rolled', 'roof', 'room', 'rooms', 'rose', 'roughly', 'round', 'rouse', 'royal', 'rul', 'rule', 'ruled', 'ruler', 'rulers', 'run', 's', 'sa', 'sac', 'sack', 'sackcloth', 'sacks', 'sacrifice', 'sacrifices', 'sad', 'saddled', 'sadly', 'said', 'saidst', 'saith', 'sake', 'sakes', 'salt', 'salvation', 'same', 'sanctified', 'sand', 'sat', 'save', 'saved', 'saving', 'savour', 'savoury', 'saw', 'sawest', 'say', 'saying', 'scarce', 'scarlet', 'scatter', 'scattered', 'sceptre', 'sea', 'searched', 'seas', 'season', 'seasons', 'second', 'secret', 'secretly', 'see', 'seed', 'seedtime', 'seeing', 'seek', 'seekest', 'seem', 'seemed', 'seen', 'seest', 'seeth', 'selfsame', 'selfwill', 'sell', 'send', 'sent', 'separate', 'separated', 'sepulchre', 'sepulchres', 'serpent', 'serva', 'servan', 'servant', 
'servants', 'serve', 'served', 'service', 'set', 'seven', 'sevenfold', 'sevens', 'seventeen', 'seventeenth', 'seventh', 'seventy', 'sewed', 'sh', 'shadow', 'shall', 'shalt', 'shamed', 'shaved', 'she', 'sheaf', 'shear', 'sheaves', 'shed', 'sheddeth', 'sheep', 'sheepshearers', 'shekel', 'shekels', 'shepherd', 'shepherds', 'shew', 'shewed', 'sheweth', 'shield', 'ships', 'shoelatchet', 'shore', 'shortly', 'shot', 'should', 'shoulder', 'shoulders', 'shouldest', 'shrank', 'shrubs', 'shut', 'si', 'side', 'sight', 'signet', 'signs', 'silv', 'silver', 'sin', 'since', 'sinew', 'sinners', 'sinning', 'sir', 'sist', 'sister', 'sit', 'six', 'sixteen', 'sixth', 'sixty', 'skins', 'slain', 'slaughter', 'slay', 'slayeth', 'sle', 'sleep', 'slept', 'slew', 'slime', 'slimepits', 'small', 'smell', 'smelled', 'smite', 'smoke', 'smoking', 'smooth', 'smote', 'so', 'sod', 'softly', 'sojourn', 'sojourned', 'sojourner', 'sold', 'sole', 'solemnly', 'some', 'son', 'songs', 'sons', 'soon', 'sore', 'sorely', 'sorrow', 'sort', 'sou', 'sought', 'soul', 'souls', 'south', 'southward', 'sow', 'sowed', 'space', 'spake', 'spare', 'spe', 'speak', 'speaketh', 'speaking', 'speckl', 'speckled', 'spee', 'speech', 'speed', 'speedily', 'spent', 'spi', 'spicery', 'spices', 'spies', 'spilled', 'spirit', 'spoil', 'spoiled', 'spoken', 'sporting', 'spotted', 'spread', 'springing', 'sprung', 'staff', 'stalk', 'stand', 'standest', 'stars', 'state', 'statutes', 'stay', 'stayed', 'ste', 'stead', 'steal', 'steward', 'still', 'stink', 'sto', 'stole', 'stolen', 'stone', 'stones', 'stood', 'stooped', 'stopped', 'store', 'storehouses', 'stories', 'straitly', 'strakes', 'strange', 'stranger', 'strangers', 'straw', 'street', 'strength', 'strengthened', 'stretched', 'stricken', 'strife', 'stript', 'strive', 'strong', 'stronger', 'strove', 'struggled', 'stuff', 'subdue', 'submit', 'substance', 'subtil', 'subtilty', 'such', 'suck', 'suffered', 'summer', 'sun', 'supplanted', 'sure', 'surely', 'surety', 'sustained', 'sware', 
'swear', 'sweat', 'sweet', 'sword', 'sworn', 'tabret', 'tak', 'take', 'taken', 'talked', 'talking', 'tar', 'tarried', 'tarry', 'teeth', 'tell', 'tempt', 'ten', 'tender', 'tenor', 'tent', 'tenth', 'tents', 'terror', 'th', 'than', 'that', 'the', 'thee', 'their', 'them', 'themselv', 'themselves', 'then', 'thence', 'there', 'thereby', 'therefore', 'therein', 'thereof', 'thereon', 'these', 'they', 'thi', 'thicket', 'thigh', 'thin', 'thine', 'thing', 'things', 'think', 'third', 'thirteen', 'thirteenth', 'thirty', 'this', 'thistles', 'thither', 'thoroughly', 'those', 'thou', 'though', 'thought', 'thoughts', 'thousand', 'thousands', 'thread', 'three', 'threescore', 'threshingfloor', 'throne', 'through', 'throughout', 'thus', 'thy', 'thyself', 'tidings', 'till', 'tiller', 'tillest', 'tim', 'time', 'times', 'tithes', 'to', 'togeth', 'together', 'toil', 'token', 'told', 'tongue', 'tongues', 'too', 'took', 'top', 'tops', 'torn', 'touch', 'touched', 'toucheth', 'touching', 'toward', 'tower', 'towns', 'tr', 'trade', 'traffick', 'trained', 'travail', 'travailed', 'treasure', 'tree', 'trees', 'trembled', 'trespass', 'tribes', 'tribute', 'troop', 'troubled', 'trough', 'troughs', 'tru', 'true', 'truly', 'truth', 'turn', 'turned', 'turtledove', 'twel', 'twelve', 'twentieth', 'twenty', 'twice', 'twins', 'two', 'unawares', 'uncircumcised', 'uncovered', 'under', 'understand', 'understood', 'ungirded', 'unit', 'unleavened', 'until', 'unto', 'up', 'upon', 'uppermost', 'upright', 'upward', 'urged', 'us', 'utmost', 'vagabond', 'vail', 'vale', 'valley', 'vengeance', 'venison', 'verified', 'verily', 'very', 'vessels', 'vestures', 'victuals', 'vine', 'vineyard', 'violence', 'violently', 'virgin', 'vision', 'visions', 'visit', 'visited', 'voi', 'voice', 'void', 'vow', 'vowed', 'vowedst', 'w', 'wa', 'wages', 'wagons', 'waited', 'walk', 'walked', 'walketh', 'walking', 'wall', 'wander', 'wandered', 'wandering', 'war', 'ward', 'was', 'wash', 'washed', 'wast', 'wat', 'watch', 'water', 'watered', 
'watering', 'waters', 'waxed', 'waxen', 'way', 'ways', 'we', 'wealth', 'weaned', 'weapons', 'wearied', 'weary', 'week', 'weep', 'weig', 'weighed', 'weight', 'welfare', 'well', 'wells', 'went', 'wentest', 'wept', 'were', 'west', 'westwa', 'whales', 'what', 'whatsoever', 'wheat', 'whelp', 'when', 'whence', 'whensoever', 'where', 'whereby', 'wherefore', 'wherein', 'whereof', 'whereon', 'wherewith', 'whether', 'which', 'while', 'white', 'whither', 'who', 'whole', 'whom', 'whomsoever', 'whoredom', 'whose', 'whosoever', 'why', 'wi', 'wick', 'wicked', 'wickedly', 'wickedness', 'widow', 'widowhood', 'wife', 'wild', 'wilderness', 'will', 'willing', 'wilt', 'wind', 'window', 'windows', 'wine', 'winged', 'winter', 'wise', 'wit', 'with', 'withered', 'withheld', 'withhold', 'within', 'without', 'witness', 'wittingly', 'wiv', 'wives', 'wo', 'wolf', 'woman', 'womb', 'wombs', 'women', 'womenservan', 'womenservants', 'wondering', 'wood', 'wor', 'word', 'words', 'work', 'worse', 'worship', 'worshipped', 'worth', 'worthy', 'wot', 'wotteth', 'would', 'wouldest', 'wounding', 'wrapped', 'wrath', 'wrestled', 'wrestlings', 'wrong', 'wroth', 'wrought', 'y', 'ye', 'yea', 'year', 'yearn', 'years', 'yesternight', 'yet', 'yield', 'yielded', 'yielding', 'yoke', 'yonder', 'you', 'young', 'younge', 'younger', 'youngest', 'your', 'yourselves', 'youth']
2789
  • Measure the diversity of the vocabulary used in the document (lexical diversity).
In [10]:
len(set(text3)) / len(text3) 
Out[10]:
0.06230453042623537
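  • Distinct words (types) account for about 6% of all tokens in the document.
  • Put differently, each type occurs about 16 times on average across the document.
  • Exercise: compute the number of tokens, the number of types, and the lexical diversity for each of text1 through text9, and interpret the results.
  • Measure how often a specific word occurs and what share of the document it makes up.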
In [11]:
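print(text3.count("smote"))
# number of times "smote" occurs in the document
print(len(text3))
print(100 * text3.count('smote') / len(text3))
# percentage of the document's tokens that are "smote"

# Q3. Compute the percentage of "lol" in text5 and think about how to interpret it.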
5
44764
0.01116968992940756
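
The two measurements above can be wrapped in small helper functions. The following is a minimal sketch on a hand-made token list rather than `text3`; the names `lexical_diversity` and `percentage` are chosen here for illustration. It also shows that the average number of occurrences per type is simply the reciprocal of the lexical diversity.

```python
def lexical_diversity(tokens):
    """Ratio of distinct words (types) to total words (tokens)."""
    return len(set(tokens)) / len(tokens)


def percentage(count, total):
    """Share of `total` taken up by `count`, in percent."""
    return 100 * count / total


# Hypothetical toy data, not taken from text3.
tokens = ["in", "the", "beginning", "god", "created",
          "the", "heaven", "and", "the", "earth"]

diversity = lexical_diversity(tokens)                 # types / tokens
avg_uses_per_type = len(tokens) / len(set(tokens))    # reciprocal of diversity
the_share = percentage(tokens.count("the"), len(tokens))

print(diversity, avg_uses_per_type, the_share)
```

For text3, 1 / 0.0623 is roughly 16, which is where the figure of about 16 occurrences per type comes from.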

3. Computing with Language: Simple Statistics.

EDA (exploratory data analysis) of text data

3.1 Frequency Distributions

In [12]:
fdist1 = FreqDist(text1)
print(fdist1)  # samples = types, outcomes = tokens
<FreqDist with 19317 samples and 260819 outcomes>
In [13]:
type(fdist1)
Out[13]:
nltk.probability.FreqDist
In [14]:
fdist1.most_common(50)
Out[14]:
[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982),
 ("'", 2684),
 ('-', 2552),
 ('his', 2459),
 ('it', 2209),
 ('I', 2124),
 ('s', 1739),
 ('is', 1695),
 ('he', 1661),
 ('with', 1659),
 ('was', 1632),
 ('as', 1620),
 ('"', 1478),
 ('all', 1462),
 ('for', 1414),
 ('this', 1280),
 ('!', 1269),
 ('at', 1231),
 ('by', 1137),
 ('but', 1113),
 ('not', 1103),
 ('--', 1070),
 ('him', 1058),
 ('from', 1052),
 ('be', 1030),
 ('on', 1005),
 ('so', 918),
 ('whale', 906),
 ('one', 889),
 ('you', 841),
 ('had', 767),
 ('have', 760),
 ('there', 715),
 ('But', 705),
 ('or', 697),
 ('were', 680),
 ('now', 646),
 ('which', 640),
 ('?', 637),
 ('me', 627),
 ('like', 624)]
In [15]:
fdist1['whale']
Out[15]:
906
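
`FreqDist` is a subclass of Python's `collections.Counter`, so the calls above can be tried out without downloading any corpus. A minimal sketch, assuming a made-up token list; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical toy tokens standing in for text1.
tokens = ["the", "whale", ",", "the", "sea", ",", "the", "ship", "."]

fdist = Counter(tokens)

num_types = len(fdist)              # distinct items ("samples")
num_tokens = sum(fdist.values())    # total items ("outcomes")
top_two = fdist.most_common(2)      # highest counts first
whale_count = fdist["whale"]        # lookup by word
missing_count = fdist["harpoon"]    # unseen words count as 0

print(num_types, num_tokens, top_two, whale_count, missing_count)
```

As in Out[14], punctuation marks are tokens too, so they are counted alongside ordinary words.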
In [16]:
fdist1.plot(10,cumulative=True)
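
# The plot itself is not reproduced here. As a rough sketch of what a cumulative frequency plot shows, the running total of the most frequent tokens' counts can be computed by hand. This is an illustration built from the numbers printed above, not NLTK's plotting code.

```python
from itertools import accumulate

# Top five (token, count) pairs copied from Out[14] above.
top_counts = [(",", 18713), ("the", 13721), (".", 6862), ("of", 6536), ("and", 6024)]
total_tokens = 260819  # number of outcomes reported by print(fdist1)

# Running sum of the counts, in rank order.
cumulative = list(accumulate(count for _, count in top_counts))

for (token, _), running in zip(top_counts, cumulative):
    share = 100 * running / total_tokens
    print(f"{token!r}: {running} ({share:.1f}% of all tokens)")
```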
In [18]:
fdist1.hapaxes()  # returns the words that occur exactly once (hapaxes)
Out[18]:
['recede',
 'clayey',
 'hardest',
 'greedy',
 'transferringly',
 'corkscrew',
 'salamander',
 'languishing',
 'insupportable',
 'dun',
 'circumnavigations',
 'lunges',
 'rogues',
 'gizzard',
 'stifle',
 'bedraggled',
 'subdivide',
 'nosed',
 'crimsoned',
 'Tusked',
 'Carson',
 'bravadoes',
 'Vermont',
 'discreetly',
 'Grenadier',
 'bought',
 'Want',
 'Pressing',
 'fatherless',
 'enchantment',
 'deformity',
 'hoarded',
 'Maccabees',
 'Cholo',
 'blusterer',
 'Anyhow',
 'genteel',
 'forethrown',
 'FEET',
 'hieroglyphical',
 'Partners',
 'Created',
 'Plains',
 'speckled',
 'splendors',
 'BRACTON',
 'overbalance',
 'absorbingly',
 'strays',
 'meddling',
 'Bentham',
 'boatmen',
 'couldst',
 'vignettes',
 'APPLICATION',
 'swagger',
 'silences',
 'Herman',
 'Caw',
 'Kedron',
 'servile',
 'Randolphs',
 'paregoric',
 'clenching',
 'literal',
 'amputations',
 'analogical',
 'COWPER',
 'bush',
 'lover',
 'frisky',
 'Respectively',
 'Retribution',
 'grievances',
 'incommodiously',
 'arrah',
 'swum',
 'Mistress',
 'honouring',
 'repute',
 'Plunge',
 'sinker',
 'OAKES',
 'stultifying',
 'ungracious',
 'unskilful',
 'illness',
 'gross',
 'End',
 'Doctor',
 'reclines',
 'ruffled',
 'augmented',
 'cutlets',
 'wainscots',
 'Lights',
 'infantileness',
 'agonies',
 'enticing',
 'Roses',
 ';"--',
 'invertedly',
 'honesty',
 'bounteous',
 'CHACE',
 'rechurned',
 'gaffs',
 'Kills',
 'madden',
 'penetrating',
 'FIGURED',
 'studious',
 'Pity',
 'insufficient',
 'YARD',
 'satirizing',
 'stumbled',
 'tunnel',
 'bucks',
 'Detached',
 'comparable',
 'bleakness',
 'retraced',
 'mountaineers',
 'Giver',
 'orchestra',
 'analysis',
 'enthrone',
 'Saul',
 'Turkey',
 'soles',
 'Led',
 'ruffed',
 'railways',
 'weeps',
 'grasps',
 'droves',
 'summits',
 'sulphurous',
 'Fountain',
 'drench',
 'recrossing',
 'douse',
 'Belated',
 'hornpipe',
 'Ombay',
 'reproachfully',
 'dissolutions',
 'kine',
 'alpacas',
 'graved',
 'thrills',
 'sobriety',
 'voided',
 'unrolling',
 'cinders',
 'uncomfortableness',
 'THAR',
 'outdone',
 'radical',
 'Archbishop',
 '129',
 'bumpers',
 'AROUND',
 'guarding',
 'FIRMLY',
 'prose',
 'couples',
 'Channel',
 'scorchingly',
 'missent',
 'chases',
 'wipe',
 'RED',
 'particles',
 'Louisiana',
 'baulks',
 'mannikin',
 'paralysed',
 'missionaries',
 'FEGEE',
 'metaphysically',
 'habitual',
 'exception',
 'crackers',
 'Usually',
 'Miserable',
 'familyless',
 'perpetuates',
 'Warmest',
 'conjure',
 'aliment',
 'autumn',
 'bearings',
 'vacillations',
 'assistance',
 'indite',
 'characterized',
 'gamboge',
 'outspreadingly',
 'spraining',
 'maddens',
 'patches',
 'landscapes',
 'tenement',
 'nestling',
 'Power',
 'hussar',
 'ebbs',
 'manoeuvred',
 'persuasiveness',
 'Orleans',
 'appoint',
 'pledges',
 'conceives',
 'maccaroni',
 'mooted',
 'unendurable',
 'fixing',
 'Zeuglodon',
 'HIMSELF',
 'ulceration',
 'fencer',
 'footman',
 'complain',
 'evanescence',
 'silks',
 'drugging',
 'Inasmuch',
 'lacks',
 'mystically',
 'Heart',
 'Levanter',
 'shan',
 'searches',
 'shave',
 'fattening',
 'senate',
 'unlock',
 'imposed',
 'needful',
 'Exploring',
 'CHAPTERS',
 'impeach',
 'odoriferous',
 'giddy',
 'favoured',
 'horrifying',
 'warfare',
 'deliriums',
 'Help',
 'cloves',
 'Decapitation',
 'unscientific',
 'roundingly',
 'transit',
 'straddling',
 'revels',
 'marchings',
 'Bellies',
 'beaters',
 'sagacity',
 'Patience',
 'threatens',
 'wasps',
 'Earthsman',
 'soliloquizer',
 'Constantine',
 'exterminated',
 'Japans',
 'repent',
 'Anak',
 'Pampas',
 'infatuation',
 'pelvis',
 'abjectly',
 'seignories',
 'flanked',
 'SOLANDER',
 'RESPECTABLE',
 'dissent',
 'catalogue',
 'transformed',
 'SCREWS',
 'append',
 'refining',
 'religionists',
 'resent',
 'Sachem',
 'unwound',
 'starved',
 'corruption',
 'middling',
 'overgrowing',
 'outyell',
 'peltry',
 'Netherlands',
 'unchanged',
 'farmers',
 'consulting',
 'TWISTED',
 'ATTACK',
 'slaughter',
 'NOSTRIL',
 'Sixteen',
 'Secretary',
 'groin',
 'Measured',
 '5TH',
 'GOES',
 'Either',
 'capsize',
 'divined',
 'elevations',
 'persuading',
 'threading',
 'WAY',
 'bladder',
 'tipping',
 'dungeoned',
 'symbolically',
 'tiny',
 'Oft',
 'IAN',
 'educated',
 'tissues',
 'Pottsfich',
 'Mesopotamian',
 'Utter',
 'interrupt',
 'shallowest',
 'clearer',
 'penning',
 'reprehensible',
 'classical',
 'snuffling',
 'Fisheries',
 'libertines',
 'muffledness',
 'apertures',
 'clad',
 'sashless',
 'unbiased',
 'confines',
 'foible',
 'REQUIEM',
 'jostle',
 'lancers',
 'judgmatically',
 'ungainly',
 'consecrating',
 'unequal',
 'comets',
 'brutal',
 'expert',
 'creative',
 'conducting',
 'Meshach',
 'bat',
 'Probably',
 'frankincense',
 'saving',
 'backwardly',
 'spoiling',
 'roundly',
 'launch',
 'paternity',
 'simplicity',
 'strutting',
 'Pompey',
 'perusal',
 'chastisements',
 'peeringly',
 'panel',
 'bewildered',
 'prophesies',
 'bountifully',
 'bilocular',
 'initials',
 'aggregations',
 'stair',
 'habituated',
 'blisters',
 'appeal',
 'durable',
 'Judges',
 'fared',
 'Stylites',
 'forswears',
 'depose',
 'disembowelments',
 'betokening',
 'gleamings',
 'Sphinx',
 'cynical',
 'chalking',
 'remembrances',
 'engrossing',
 'freight',
 'tapered',
 'dispenses',
 'derision',
 'burnish',
 'developing',
 'popularize',
 'Bungle',
 'Feegeeans',
 'deposited',
 'murmuring',
 'limpid',
 'feebler',
 'elevation',
 'gowns',
 'cleets',
 'Stammering',
 'Horner',
 'Said',
 'Mohawk',
 'inscribed',
 'profitable',
 'Slid',
 'preservers',
 'fetches',
 'Pooh',
 'disinfecting',
 'HAILS',
 'WIDOW',
 'fastener',
 'MEN',
 'outspread',
 'Cold',
 'blisteringly',
 'keg',
 'kneepans',
 'Earl',
 'acerbities',
 'uncheered',
 'imminglings',
 'overtaking',
 'intrepidly',
 'MSS',
 'asses',
 'Trinity',
 'bevy',
 'extinguishing',
 'waxy',
 'liberally',
 'amidst',
 'thirteenth',
 'agrees',
 'confessed',
 'juniper',
 'ungentlemanly',
 'uncapturable',
 'slippered',
 'PARLIAMENT',
 'overtakes',
 'Stoic',
 'acquiescence',
 'lovings',
 'jeering',
 'treating',
 'Horrible',
 'allegory',
 'Tiger',
 'loungingly',
 'CURRENTS',
 'headmost',
 'dalliance',
 'Fiery',
 'weazel',
 'surest',
 'bigamist',
 'mace',
 'napping',
 'glee',
 'betaken',
 'ceti',
 'Calais',
 'superincumbent',
 'aforesaid',
 'substantiate',
 'motionlessly',
 'Rocky',
 'sagged',
 'parried',
 'results',
 'erudition',
 '76',
 'cultivate',
 'domineered',
 'spiracles',
 'Damocles',
 'soladoes',
 'stilts',
 'thoughtfulness',
 'confabulations',
 'unsays',
 'promissory',
 'FLOOD',
 'contingent',
 'mobbing',
 'fiendish',
 'complaints',
 'paine',
 'reverential',
 'Growlands',
 'issues',
 'submits',
 'shindy',
 'leopard',
 'imperceptibly',
 'ST',
 'cats',
 'tolling',
 'STRAPS',
 'slaughtering',
 'pulsations',
 'bestowal',
 'butteries',
 'Savage',
 'sayst',
 'Bag',
 'herrings',
 'rudeness',
 'exhaustive',
 'Future',
 'SCATTER',
 'le',
 'dispute',
 'descendants',
 'brackish',
 'rocket',
 'Champagne',
 'LINE',
 'Somehow',
 'ejaculated',
 'Europa',
 'scrambled',
 'bodied',
 'amounted',
 'ARE',
 'metropolis',
 'sleeplessness',
 'Saco',
 'Starboard',
 'props',
 'Baling',
 'sweethearts',
 'destinations',
 '32',
 'digester',
 'inanimate',
 'cartloads',
 'unrifled',
 '24',
 'shod',
 'vintage',
 'hermaphroditical',
 'pirouetting',
 'Bendigoes',
 'Juba',
 'childlessness',
 'entrenchments',
 'sentinels',
 'usefulness',
 'suffused',
 'uncontinented',
 'worldly',
 'bordering',
 'privations',
 'ELIZABETH',
 'disorder',
 'HOMEWARD',
 'fumbled',
 'Melville',
 'sixteenth',
 'overleap',
 'bushy',
 'predominating',
 'Lieutenant',
 'professed',
 'Joe',
 'gauntleted',
 'collecting',
 'damsels',
 'CONVERSATIONS',
 'advised',
 'finical',
 'ruptured',
 'joist',
 'Affected',
 'abominate',
 'pitcher',
 'watchmen',
 'commentator',
 'whets',
 'hoop',
 '18',
 'dentists',
 'abstraction',
 'sheered',
 'emptying',
 'interlacings',
 'bartered',
 'brats',
 'Crockett',
 'giddily',
 'invariability',
 'dictionaries',
 'anticipative',
 'damning',
 'wept',
 "?'--'",
 'lingers',
 'Wise',
 'Matsmai',
 'bamboo',
 'Remembering',
 'tucking',
 'remind',
 'Chartering',
 'Helena',
 'Carthage',
 'rending',
 'Argo',
 'Monsoons',
 'nibbling',
 'crashing',
 'sordid',
 'gnashing',
 'uniqueness',
 'entreated',
 'skrimshandering',
 'deepeningly',
 'fencing',
 'voluntary',
 'Slope',
 'Devils',
 'modifies',
 'Gurry',
 'Physiognomy',
 'Spinoza',
 'KEDGER',
 'loon',
 'mannerly',
 'mildewed',
 'swaller',
 'stig',
 'Bonneterre',
 'batten',
 'PEQUOD',
 'assigns',
 'hacked',
 'reliance',
 'encasing',
 'packs',
 'wrapall',
 'convulsive',
 'designates',
 'MAT',
 'stereotype',
 'sprouts',
 'parade',
 'keeper',
 'Hindoos',
 'perspective',
 'GRIMLY',
 '131',
 'inspectingly',
 'bejuggled',
 'sooth',
 'trending',
 'FLASHES',
 'blessing',
 'sinecure',
 'analytic',
 'Mason',
 'phiz',
 'undervalue',
 'concernments',
 'mornings',
 'grooved',
 'rifled',
 'ENSUING',
 'slanderous',
 'complimentary',
 'scolds',
 'hastier',
 'plebeian',
 'toughness',
 'exceeds',
 'flyin',
 'painters',
 'communities',
 'equanimity',
 'Muffled',
 'imported',
 'wickedness',
 'supernaturalism',
 'Ebony',
 'braining',
 'Sodom',
 'journeyman',
 'raked',
 'arbitrary',
 'experiments',
 'layeth',
 'expresses',
 'waxes',
 'sharpest',
 'pallet',
 'YORK',
 'blazed',
 'exaggerating',
 'inferentially',
 'tanned',
 'drawlingly',
 'Deep',
 'watergate',
 'inkstand',
 'digressively',
 'TROIL',
 'Physiognomist',
 'spectral',
 'Arter',
 'fallacious',
 'heartwoes',
 'kindhearted',
 'enchanter',
 'patriot',
 'Proceed',
 'KETOS',
 'Cattegat',
 'Aldrovandi',
 'dissemble',
 'keeling',
 'Exception',
 'fastenings',
 'heath',
 'breedeth',
 'Coke',
 'panels',
 'contingencies',
 'Melancthon',
 'tumblers',
 'canonicals',
 'unreluctantly',
 'Judea',
 'slatternly',
 'Shake',
 'salamed',
 'thwack',
 'quadrupeds',
 'hilarity',
 'lengthen',
 'goring',
 'CAP',
 'tellin',
 'voyaged',
 'axles',
 'Keeping',
 'knockings',
 'heraldic',
 'HORRID',
 'Dame',
 'knightly',
 'magnify',
 'admonitory',
 'flickering',
 'fibrous',
 'Stir',
 'thunderings',
 'regardless',
 'toadstools',
 'intensities',
 'ungovernable',
 'Saxon',
 'Olassen',
 'Bottom',
 'butchering',
 'gown',
 'hie',
 'disjointedly',
 'firmer',
 'unbecomingness',
 'Horned',
 '15',
 'reservoirs',
 'bump',
 'Befooled',
 'communicated',
 'freshening',
 'convicts',
 'siding',
 'delectable',
 'RICHARDSON',
 'panellings',
 'cosmopolite',
 'songster',
 'Sultan',
 'boon',
 'overruns',
 'patrolled',
 'reap',
 'uniformity',
 'leadership',
 'knaves',
 'glitters',
 'vats',
 'bitterer',
 'palpableness',
 'Wretched',
 'markest',
 'Swimming',
 'incidental',
 'overrunningly',
 'cheered',
 'affecting',
 'cylindrically',
 'Split',
 'Ochotsh',
 'whalin',
 'blackest',
 'BREACH',
 'unappalled',
 'torsoes',
 'Common',
 'asunder',
 'spectre',
 'wigwams',
 'sigh',
 'Jig',
 'sinks',
 'patentees',
 'rural',
 'Hydriote',
 'disrated',
 'cypher',
 'changeful',
 'mustered',
 'aesthetics',
 'Scorpio',
 'demonstrations',
 'distilled',
 'Won',
 'quaff',
 'Canadian',
 'bulge',
 'OCTAVOES',
 'fished',
 'osseous',
 'Regent',
 'derisive',
 'Satanic',
 'Ezekiel',
 'Gros',
 'knob',
 'untidy',
 'inuendoes',
 'squatting',
 'premised',
 'elevate',
 'inactive',
 'BOARD',
 'palavering',
 'DARKENS',
 'beholder',
 'forges',
 'pertains',
 'weaves',
 'chilled',
 'Abominable',
 'Kingdom',
 'exotic',
 'Expedition',
 'cleats',
 'cannikin',
 'princess',
 '49',
 'allowances',
 'Welding',
 'slacken',
 'Raise',
 'spiral',
 'Whosoever',
 'nipper',
 'soils',
 'drilled',
 'incomputable',
 'amuck',
 'pedestals',
 'clusters',
 'ineffably',
 'BELOW',
 'abstained',
 'basso',
 'dissociated',
 'instructions',
 'rumor',
 'indigenous',
 'Regarded',
 'Immense',
 'coax',
 'slink',
 'unsettled',
 'undetached',
 'Yoke',
 'PROGRESS',
 'tee',
 'abbreviation',
 'showest',
 'gruff',
 'haunt',
 'Pascal',
 'caput',
 'passively',
 'parcelling',
 'sends',
 'Pandects',
 'Englander',
 'Cancer',
 'Roll',
 'transfix',
 'sprat',
 'gnaw',
 'tumults',
 'empties',
 'retires',
 'glutinous',
 'ERROMANGOAN',
 'gesticulated',
 'harvesting',
 'piercer',
 'Berlin',
 'deathful',
 'peer',
 'Cave',
 'midship',
 'Lit',
 'Cato',
 'BIT',
 'detects',
 'Tattoo',
 'anomalous',
 'masted',
 'enkindling',
 'frayed',
 'worried',
 'shrinked',
 'oilpainting',
 'amputated',
 'Advancement',
 'rugged',
 'passionlessness',
 'pealing',
 'deserving',
 'CHEERLY',
 'sinned',
 'grows',
 'perturbation',
 'Friar',
 'crumpled',
 'hostility',
 'mud',
 'undressing',
 'localness',
 'soulless',
 'BEWARE',
 'intervene',
 'tougher',
 'wrinkling',
 'tubes',
 'truths',
 'shipyards',
 'lookouts',
 'solaces',
 'Paint',
 'lurches',
 'crucified',
 'aboriginalness',
 'trover',
 'Tekel',
 'shyness',
 '73',
 'exhort',
 'Astronomy',
 'samphire',
 'hallo',
 'albatrosses',
 'sympathetical',
 'spice',
 'predicament',
 'miner',
 'adjust',
 'punctiliously',
 'hymns',
 'coyings',
 'marge',
 'irregularity',
 ...]
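
A hapax is simply a word whose count is exactly 1. The filter can be sketched on a toy token list, with `collections.Counter` standing in for `FreqDist` (an assumption made to keep the example free of corpus downloads).

```python
from collections import Counter

# Hypothetical toy tokens, not taken from text1.
tokens = ["call", "me", "ishmael", "call", "me", "sailor", "some", "years", "ago"]

counts = Counter(tokens)

# Keep only the words that were seen exactly once.
hapaxes = sorted(word for word, n in counts.items() if n == 1)
print(hapaxes)
```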
  • Q4. Run the same EDA steps on text2.

3.2 Fine-grained Selection of Words

In [19]:
V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)
Out[19]:
['CIRCUMNAVIGATION',
 'Physiognomically',
 'apprehensiveness',
 'cannibalistically',
 'characteristically',
 'circumnavigating',
 'circumnavigation',
 'circumnavigations',
 'comprehensiveness',
 'hermaphroditical',
 'indiscriminately',
 'indispensableness',
 'irresistibleness',
 'physiognomically',
 'preternaturalness',
 'responsibilities',
 'simultaneousness',
 'subterraneousness',
 'supernaturalness',
 'superstitiousness',
 'uncomfortableness',
 'uncompromisedness',
 'undiscriminating',
 'uninterpenetratingly']
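The list above starts with 'CIRCUMNAVIGATION' and 'Physiognomically' because sorted() compares strings by character code, so capitalized words come before lowercase ones. The following is a minimal sketch of the same set, filter, and sort pattern. The token list and the length threshold of 10 are made up for illustration and are not taken from text1.

```python
# Toy tokens standing in for text1 (hypothetical data, for illustration only).
tokens = ["whale", "Circumnavigation", "sea", "apprehensiveness",
          "whale", "Physiognomically", "boat", "apprehensiveness"]

vocab = set(tokens)  # drop duplicate tokens
long_words = [w for w in vocab if len(w) > 10]

# Default sort: words starting with a capital come before lowercase ones.
by_codepoint = sorted(long_words)

# Case-insensitive sort, if plain alphabetical order is wanted instead.
by_alpha = sorted(long_words, key=str.lower)

print(by_codepoint)
print(by_alpha)
```

If capitalized and lowercase variants should be merged rather than just reordered, lowercasing the tokens before building the set is one option.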
  • To extract meaningful words, select words that are neither too short (e.g. articles) nor too rare (e.g. odd technical terms).
In [20]:
fdist5 = FreqDist(text5)
sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
Out[20]:
['#14-19teens',
 '#talkcity_adults',
 '((((((((((',
 '........',
 'Question',
 'actually',
 'anything',
 'computer',
 'cute.-ass',
 'everyone',
 'football',
 'innocent',
 'listening',
 'remember',
 'seriously',
 'something',
 'together',
 'tomorrow',
 'watching']
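The combined length and frequency filter can be tried without the corpus. This sketch swaps in collections.Counter for FreqDist (in NLTK 3, FreqDist is built on Counter, so lookups such as counts[w] behave the same way). The tokens and both thresholds are invented for the example.

```python
from collections import Counter

# Hypothetical chat-like tokens (not the real text5 corpus).
tokens = ("something lol something hi anything something "
          "lol anything hi supercalifragilistic lol").split()

counts = Counter(tokens)  # plays the role of FreqDist(text5) here

# Keep words longer than 7 characters that occur more than once.
selected = sorted(w for w in set(tokens) if len(w) > 7 and counts[w] > 1)
print(selected)
```

The long word that appears only once is dropped by the frequency condition, and the frequent short words are dropped by the length condition, which is the effect the filter in In [20] is after.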

3.3 Collocations and Bigrams

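  • Once the frequent words are known, this is the easiest way to find out in what context those words appear.
  • collocation - a sequence of words that occur together unusually often.
  • bigram - a pairing of consecutive words in a sequence, two at a time.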
In [22]:
list(nltk.bigrams(['more', 'is', 'said', 'than', 'done']))
Out[22]:
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
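The pairing itself is simple enough to write by hand. As a rough sketch (the helper name bigrams below is defined locally and is not NLTK's function), zipping a list with itself shifted by one position gives the same pairs as Out[22]:

```python
def bigrams(words):
    """Pair each word with the word that follows it."""
    return list(zip(words, words[1:]))

pairs = bigrams(['more', 'is', 'said', 'than', 'done'])
print(pairs)
```

nltk.bigrams returns a generator rather than a list, which is why the cell above wraps it in list().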
  • To see the collocations in tokenized data at a glance:
In [23]:
text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
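To get a feel for where such pairs come from, the sketch below counts bigrams that recur and skips pairs containing very common words. This is only a frequency-based approximation under stated assumptions: the tokens and the stopword set are invented, and NLTK's collocations() scores candidate pairs in a more refined way than raw counts.

```python
from collections import Counter

# Hypothetical tokens; real input would be a tokenized text such as text4.
tokens = ("the united states and the united states of america "
          "thank the american people and the united nations").split()

# Small hand-picked stopword set (an assumption for this sketch).
stopwords = {"the", "and", "of"}

# Count adjacent word pairs in which neither word is a stopword.
pair_counts = Counter(
    (a, b) for a, b in zip(tokens, tokens[1:])
    if a not in stopwords and b not in stopwords
)

# Keep pairs seen at least twice, most frequent first.
frequent = [(pair, n) for pair, n in pair_counts.most_common() if n >= 2]
print(frequent)
```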

3.4 Counting other Things

  • Word lengths can be counted as well.
In [24]:
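[len(w) for w in text1]  # length of every token (not displayed, since it is not the cell's last expression)
fdist = FreqDist(len(w) for w in text1)
print(fdist)
fdist  # word length: number of occurrences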
<FreqDist with 19 samples and 260819 outcomes>
Out[24]:
FreqDist({1: 47933,
          2: 38513,
          3: 50223,
          4: 42345,
          5: 26597,
          6: 17111,
          7: 14399,
          8: 9966,
          9: 6428,
          10: 3528,
          11: 1873,
          12: 1053,
          13: 567,
          14: 177,
          15: 70,
          16: 22,
          17: 12,
          18: 1,
          20: 1})
In [25]:
fdist.most_common()
Out[25]:
[(3, 50223),
 (1, 47933),
 (4, 42345),
 (2, 38513),
 (5, 26597),
 (6, 17111),
 (7, 14399),
 (8, 9966),
 (9, 6428),
 (10, 3528),
 (11, 1873),
 (12, 1053),
 (13, 567),
 (14, 177),
 (15, 70),
 (16, 22),
 (17, 12),
 (18, 1),
 (20, 1)]
In [26]:
fdist.max()
Out[26]:
3
In [27]:
fdist[3]
Out[27]:
50223
In [28]:
fdist.freq(3)  # words of length 3 make up about 19% of all tokens
# The frequency-distribution functions that NLTK provides are summarized in Table 3.1.
Out[28]:
0.19255882431878046
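The values returned by max() and freq(3) can be checked by hand from the counts in Out[24]. The sketch below uses collections.Counter in place of FreqDist, so the three expressions are stand-ins for fdist.N(), fdist.max(), and fdist.freq(3) rather than the NLTK calls themselves.

```python
from collections import Counter

# Word-length counts copied from Out[24] above.
length_counts = Counter({
    1: 47933, 2: 38513, 3: 50223, 4: 42345, 5: 26597, 6: 17111,
    7: 14399, 8: 9966, 9: 6428, 10: 3528, 11: 1873, 12: 1053,
    13: 567, 14: 177, 15: 70, 16: 22, 17: 12, 18: 1, 20: 1,
})

total = sum(length_counts.values())                      # like fdist.N()
most_common_length = length_counts.most_common(1)[0][0]  # like fdist.max()
share_of_3 = length_counts[3] / total                    # like fdist.freq(3)

print(total, most_common_length, share_of_3)
```

In other words, freq(3) is the count for length 3 divided by the total number of tokens, 50223 / 260819.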
  • Q4. Plot the distribution above and think about how it could be used.

5. Automatic Natural Language Understanding.

  • This section describes recent natural language technologies. It should help you gain insight into research methods, so please read it.

