Part I

I wrote this notebook while learning Natural Language Processing in Python. It's a work in progress.

#Import the NLTK library, which is used throughout this notebook for processing text
import nltk
nltk.data.path.append("/media/newhd/PyScripts/nltk")
#Import the sample corpus of well-known books
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
#Load the complete novel from a plain-text file (not the copy stored in the NLTK corpus)
with open('SenseAndSensibility.txt','r') as f:
    novel=f.read()
#Check the type of the loaded text
type(novel)
str
#Check the type of the same book (Sense and Sensibility) stored in the corpus under the name 'text2'
type(text2)
nltk.text.Text
#Look at the start of the novel
novel[0:99]

#This returns the first 99 characters, not words: slicing a str indexes individual characters
'SENSE AND SENSIBILITY\n\nby Jane Austen\n\n(1811)\n\n\n\n\nCHAPTER 1\n\n\nThe family of Dashwood had long been '

Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens , perhaps at the same time throwing away certain characters, such as punctuation.

#Split the text on spaces, tabs, and newlines, then check the first 50 words
import re
novel=re.split(r'[ \t\n]+', novel)
novel[0:50]

#Looks better now
['SENSE',
 'AND',
 'SENSIBILITY',
 'by',
 'Jane',
 'Austen',
 '(1811)',
 'CHAPTER',
 '1',
 'The',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex.',
 'Their',
 'estate',
 'was',
 'large,',
 'and',
 'their',
 'residence',
 'was',
 'at',
 'Norland',
 'Park,',
 'in',
 'the',
 'centre',
 'of',
 'their',
 'property,',
 'where,',
 'for',
 'many',
 'generations,',
 'they',
 'had',
 'lived',
 'in',
 'so',
 'respectable',
 'a',
 'manner',
 'as',
 'to']
type(novel)
#Note that after applying re.split, novel is a list of all the words in the book
list
#Count the total number of words in the book
len(novel)
118566
#Sort the distinct words alphabetically and check the first 50
sorted(set(novel))[:50]


#Looks good, except that punctuation such as quotation marks stays attached to some words (and one empty string appears)
['',
 '"\'Tis',
 '"\'Twill',
 '"A',
 '"About',
 '"Add',
 '"Ah!',
 '"Ah!"',
 '"Ah!--no,--have',
 '"Ah,',
 '"All',
 '"Almost',
 '"And',
 '"Another',
 '"Are',
 '"As',
 '"At',
 '"Ay,',
 '"Aye,',
 '"Bartlett\'s',
 '"Beautifully',
 '"Because,"',
 '"Being',
 '"Bond',
 '"Brandon',
 '"But',
 '"But,',
 '"By',
 '"Can',
 '"Certainly',
 '"Certainly,',
 '"Certainly,"',
 '"Certainly--and',
 '"Certainly."',
 '"Choice!--how',
 '"Civil!--Did',
 '"Cleveland!"--she',
 '"Colonel',
 '"Come',
 '"Come,',
 '"Concealing',
 '"Could',
 '"DEAR',
 '"Dear',
 '"Dear,',
 '"Dearest',
 '"Depend',
 '"Devonshire!',
 '"Did',
 '"Disappointment?"']
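The quotation marks stuck to the words above come from splitting on whitespace only. One possible alternative, sketched here with the standard `re` module, is to extract word-like pieces directly instead of splitting on separators. The helper name `simple_words` and the pattern are my own assumptions for illustration, not part of NLTK, and the sample string is made up from fragments of the output above.

```python
import re

# Hypothetical pattern: runs of letters, optionally joined by a single
# apostrophe or hyphen (e.g. "bartlett's", "drawing-room").
WORD_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def simple_words(text):
    """Lowercase the text and return only its word-like pieces."""
    return WORD_RE.findall(text.lower())

sample = '"Ah!--no,--have you?" said Elinor. "\'Tis Bartlett\'s Buildings."'
words = simple_words(sample)
print(words)
```

This drops all punctuation, which also discards tokens such as '(1811)', so it is a trade-off rather than a strict improvement.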
#Convert all words to lowercase before measuring the book's vocabulary
tokens = [word.lower() for word in novel]
#Check the vocabulary of the book
vocab=set(tokens)
len(vocab)

#The book contains 12880 distinct tokens (a word with attached punctuation, e.g. 'large,', counts as its own token)
12880
#Verify a few of these tokens
tokens[10:60]
['family',
 'of',
 'dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'sussex.',
 'their',
 'estate',
 'was',
 'large,',
 'and',
 'their',
 'residence',
 'was',
 'at',
 'norland',
 'park,',
 'in',
 'the',
 'centre',
 'of',
 'their',
 'property,',
 'where,',
 'for',
 'many',
 'generations,',
 'they',
 'had',
 'lived',
 'in',
 'so',
 'respectable',
 'a',
 'manner',
 'as',
 'to',
 'engage',
 'the',
 'general',
 'good',
 'opinion',
 'of',
 'their',
 'surrounding',
 'acquaintance.',
 'the']
#Lexical richness of the text: distinct tokens divided by total tokens
len(set(tokens))/len(tokens)

#So about 10.9% of the tokens are distinct
0.1086314795135199
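The ratio above is often called the type-token ratio. As a small worked sketch, the function below computes it for any token list. The name `lexical_richness` and the toy sentence are assumptions for illustration; the last line only re-checks the arithmetic with the two counts reported earlier (12880 distinct tokens out of 118566).

```python
def lexical_richness(tokens):
    """Return distinct tokens divided by total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy example: 14 tokens, 11 of them distinct.
toy = "the family of dashwood had long been settled in the centre of the county".split()
print(lexical_richness(toy))

# Re-check of the figure reported in this notebook.
print(round(12880 / 118566, 4))
```

A shorter text usually scores higher on this measure, so it is safest to compare texts of similar length.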

Finding Word Frequencies

According to Wikipedia, “the” is the most frequent word in English text. Let's verify this using Sense and Sensibility.

#Check the lengths of the distinct tokens, longest first
word_lengths=sorted((len(word) for word in vocab), reverse=True)
word_lengths[0:24]

#Tokens as long as 40 and 26 characters! Let's find out what they are
[40,
 26,
 26,
 23,
 23,
 23,
 23,
 22,
 22,
 22,
 21,
 21,
 21,
 20,
 20,
 20,
 20,
 20,
 20,
 19,
 19,
 19,
 19,
 19]
#Find the longest tokens in the book
for word in vocab:
    if len(word)>20:
        print(word)

#These are not single words: the text file writes dashes as "--" with no surrounding spaces,
#so splitting on whitespace leaves the words on either side joined together
willoughby.--remember
confiding--everything
thunderbolt.--thunderbolts
somersetshire.--there,
preparation!--day!--in
drawing-room.--nobody
foolish--business--this
distinguished--as--they
letter-writing?--delicate--tender--truly
elinor?"--hesitatingly
everything;--marianne's
embarrassment.--whether
delaford.--delaford,--that
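The joined tokens above can be separated by also treating a run of two or more hyphens as a separator. This is a minimal sketch, assuming that single hyphens (as in "letter-writing") should be kept; the helper name `split_on_dashes` is hypothetical.

```python
import re

# A few of the over-long tokens printed above.
long_tokens = [
    "willoughby.--remember",
    "foolish--business--this",
    "letter-writing?--delicate--tender--truly",
]

def split_on_dashes(token):
    """Split a token on runs of two or more hyphens, dropping empty pieces."""
    return [piece for piece in re.split(r"-{2,}", token) if piece]

for token in long_tokens:
    print(token, "->", split_on_dashes(token))
```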
#Find the frequency of each word in the book using NLTK's built-in functions;
#first convert the tokens to an nltk.Text object
text = nltk.Text(tokens)
type(text)

#The result is of type nltk.text.Text
nltk.text.Text
#Build a frequency distribution (FreqDist comes from the nltk.book import above)
fd = FreqDist(text)
fd.most_common(50) #The 50 most frequently occurring words

#'the' is indeed the most frequent word. Most of the top words are stop words, so let's remove these and check again
[('the', 4071),
 ('to', 4049),
 ('of', 3538),
 ('and', 3255),
 ('her', 2230),
 ('a', 2030),
 ('in', 1896),
 ('was', 1783),
 ('i', 1672),
 ('she', 1501),
 ('it', 1273),
 ('that', 1235),
 ('be', 1231),
 ('as', 1194),
 ('for', 1187),
 ('not', 1186),
 ('his', 1002),
 ('he', 998),
 ('had', 983),
 ('with', 968),
 ('you', 882),
 ('at', 813),
 ('have', 801),
 ('by', 733),
 ('but', 708),
 ('is', 675),
 ('on', 663),
 ('my', 587),
 ('so', 579),
 ('could', 554),
 ('which', 552),
 ('all', 545),
 ('from', 530),
 ('mrs.', 520),
 ('would', 506),
 ('their', 496),
 ('they', 496),
 ('very', 491),
 ('no', 482),
 ('him', 431),
 ('been', 426),
 ('were', 424),
 ('what', 392),
 ('any', 385),
 ('every', 373),
 ('this', 362),
 ('than', 360),
 ('your', 358),
 ('more', 356),
 ('elinor', 349)]
#Remove stop words
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_text = [w for w in tokens if w not in stop_words]
#Check the 50 most frequent words again
filtered_text = nltk.Text(filtered_text)
new_fd = FreqDist(filtered_text)
new_fd.most_common(50)
[('could', 554),
 ('mrs.', 520),
 ('would', 506),
 ('every', 373),
 ('elinor', 349),
 ('said', 342),
 ('must', 279),
 ('one', 262),
 ('much', 258),
 ('marianne', 252),
 ('"i', 228),
 ('might', 209),
 ('though', 206),
 ('miss', 206),
 ('elinor,', 198),
 ('think', 189),
 ('never', 184),
 ('know', 172),
 ('without', 171),
 ('mr.', 168),
 ('may', 167),
 ('marianne,', 165),
 ('soon', 163),
 ('see', 159),
 ('time', 155),
 ('nothing', 154),
 ('her,', 153),
 ('thing', 142),
 ('first', 141),
 ('colonel', 141),
 ('it,', 140),
 ('great', 139),
 ('make', 136),
 ('little', 136),
 ('edward', 134),
 ('made', 133),
 ('two', 131),
 ('ever', 129),
 ('it.', 126),
 ('dashwood', 123),
 ('good', 123),
 ('lady', 122),
 ('well', 121),
 ('always', 121),
 ('give', 120),
 ('shall', 119),
 ('even', 117),
 ('mother', 115),
 ('sir', 112),
 ('say', 111)]
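In the list above, 'elinor' and 'elinor,' are counted separately, and 'it,' and 'it.' slip past the stop-word filter because of their attached punctuation. One way to merge such variants, sketched here with `collections.Counter` (which `FreqDist` subclasses), is to strip leading and trailing punctuation before counting. The function name `merge_punctuation_variants` is an assumption; the counts are copied from the output above.

```python
import string
from collections import Counter

# Counts copied from the output above for a few tokens.
raw_counts = Counter({
    "elinor": 349, "elinor,": 198,
    "marianne": 252, "marianne,": 165,
    "it,": 140, "it.": 126,
})

def merge_punctuation_variants(counts):
    """Sum the counts of tokens that differ only by leading/trailing punctuation."""
    merged = Counter()
    for token, n in counts.items():
        cleaned = token.strip(string.punctuation)
        if cleaned:  # skip tokens made of punctuation only
            merged[cleaned] += n
    return merged

merged = merge_punctuation_variants(raw_counts)
print(merged.most_common())
```

Stripping punctuation before the stop-word filter, rather than after it, would also let 'it' be removed as a stop word.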
#Check where some of these words occur across the book using a dispersion plot
text.dispersion_plot(["mrs.","elinor", "marianne", "lady", "mother"])

#Elinor and Marianne are the two main protagonists of the novel

[Figure: lexical dispersion plot of "mrs.", "elinor", "marianne", "lady", and "mother"]

#Plot the counts of the 50 most frequent words
new_fd.plot(50)

[Figure: frequency plot of the 50 most frequent words after stop-word removal]

#Check the contexts in which some of these frequent words appear (stop words are already removed, so the lines read as fragments)
filtered_text.concordance("elinor")
Displaying 25 of 349 matches:
esemblance mother strikingly great. elinor saw, concern, excess sister's sensi
h appeared amiable, loved daughter, elinor returned partiality. contrary every
observe approve farther, reflection elinor chanced one day make difference sis
prehended merits; persuasion regard elinor perhaps assisted penetration; reall
!--but must allow difference taste. elinor feelings, therefore may overlook it
ing, said subject; kind approbation elinor described excited drawings people,
hall see imperfection face, heart." elinor started declaration, sorry warmth b
se words again, leave room moment." elinor could help laughing. "excuse me," s
draw himself, delightful would be!" elinor given real opinion sister. could co
izement. knowledge this, impossible elinor feel easy subject. far depending re
nce expense sudden removal, beloved elinor exposed another week insinuations.
ght secure approbation answer sent. elinor always thought would prudent settle
principally tended. separate edward elinor far object ever; wished show mrs. j
d wishes, would kept it; discretion elinor prevailed. wisdom limited number se
 sister's sake, turned eyes towards elinor see bore attacks, earnestness gave
 see bore attacks, earnestness gave elinor far pain could arise common-place r
g invitation, talked coming barton. elinor expect already?" "i never mentioned
ther! edward's farewell distinction elinor me: good wishes affectionate brothe
wind, pitied fears prevented mother elinor sharing delightful sensations. "is
ted hold till seated chair parlour. elinor mother rose amazement entrance, eye
's estimation faultless marianne's; elinor saw nothing censure propensity, str
y propriety, displayed want caution elinor could approve, spite marianne could
idicule justly annexed sensibility. elinor obliged, though unwillingly, believ
mself, pointed assurance affection. elinor could surprised attachment. wished
discourse. already repeated history elinor three four times; elinor's memory e
filtered_text.concordance("think")
Displaying 25 of 189 matches:
mpoverishing dreadful degree. begged think subject. could answer rob child, chi
 occasions, much little. one, least, think done enough them: even themselves, h
knowing may expect," said lady, "but think expectations: question is, afford do
tion is, afford do." "certainly--and think may afford give five hundred pounds
ependence." "undoubtedly; thanks it. think secure, expected, raises gratitude a
d half it; giving more, quite absurd think it. much able give something." "upon
verything amiable. love already." "i think like him," said elinor, "when know h
taste drawing!" replied elinor, "why think so? draw himself, indeed, great plea
ies improving it. ever way learning, think would drawn well. distrusts judgment
der deficient general taste. indeed, think may say cannot, behaviour perfectly
ighest opinion world goodness sense. think every thing worthy amiable." "i sure
ance, perceived. present, know well, think really handsome; least, almost so. s
o. say you, marianne?" "i shall soon think handsome, elinor, now. tell love bro
r. "i attempt deny," said she, "that think highly him--that greatly esteem, lik
ion conduct opinions, never disposed think amiable; much mistaken edward aware
ttage, assured everything done might think necessary, situation pleased her. ea
g, plenty money, dare say shall, may think building. parlors small parties frie
n old bachelor. mrs. dashwood, could think man five years younger herself, exce
eny absurdity accusation, though may think intentionally ill-natured. colonel b
ce happen woman single seven twenty, think colonel brandon's thirty-five object
ntioned her, course must." "i rather think mistaken, talking yesterday getting
n, "i see be. setting cap now, never think poor brandon." "that expression, sir
or, soon left them, "for one morning think done pretty well. already ascertaine
dy remembers talk to." "that exactly think him," cried marianne. "do boast it,
make amends left behind, could teach think norland less regret ever. neither la

Part of Speech Tagging

POS tagging is the process of marking each word in a text (corpus) with its part of speech, based on both its definition and its context.

#Part-of-speech tagging
#Let's try POS tagging on this famous line from Sense and Sensibility
sentence = "If I could but know his heart, everything would become easy"
sentence_tokens = nltk.word_tokenize(sentence)
sentence_tokens
['If',
 'I',
 'could',
 'but',
 'know',
 'his',
 'heart',
 ',',
 'everything',
 'would',
 'become',
 'easy']
#Tag each token with its part of speech
tagged = nltk.pos_tag(sentence_tokens)
tagged
[('If', 'IN'),
 ('I', 'PRP'),
 ('could', 'MD'),
 ('but', 'CC'),
 ('know', 'VB'),
 ('his', 'PRP$'),
 ('heart', 'NN'),
 (',', ','),
 ('everything', 'NN'),
 ('would', 'MD'),
 ('become', 'VB'),
 ('easy', 'JJ')]
#Chunk noun phrases with a regular-expression grammar: an optional determiner, any adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
print(result)
(S
  If/IN
  I/PRP
  could/MD
  but/CC
  know/VB
  his/PRP$
  (NP heart/NN)
  ,/,
  (NP everything/NN)
  would/MD
  become/VB
  easy/JJ)

Finding N-Grams

An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.

#Find frequent collocations (word pairs that occur together unusually often)
text.collocations()
mrs. jennings; lady middleton; mrs. jennings,; colonel brandon; sir
john; mrs. dashwood; every thing; mrs. jennings's; colonel brandon,;
mrs. ferrars; said she,; colonel brandon's; sir john,; said elinor,;
said he,; great deal; mrs. dashwood,; john dashwood; replied elinor,;
every body
#Another way: generate every bigram in order
from nltk.util import ngrams
bigrams=ngrams(tokens,2)
print(list(bigrams)[0:5])
[('sense', 'and'), ('and', 'sensibility'), ('sensibility', 'by'), ('by', 'jane'), ('jane', 'austen')]
#And trigrams
trigrams=ngrams(tokens,3)
print(list(trigrams)[0:5])
[('sense', 'and', 'sensibility'), ('and', 'sensibility', 'by'), ('sensibility', 'by', 'jane'), ('by', 'jane', 'austen'), ('jane', 'austen', '(1811)')]
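To make the definition concrete, n-grams can also be built with plain Python by zipping a token list against shifted copies of itself, then counted to find the most frequent ones. This is a minimal sketch; `make_ngrams` is a hypothetical helper, not an NLTK function, and the toy sentence is made up.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

toy = "mrs. jennings and colonel brandon met mrs. jennings".split()

# Count every bigram and show the most frequent one.
bigram_counts = Counter(make_ngrams(toy, 2))
print(bigram_counts.most_common(1))
```

`nltk.util.ngrams` returns an iterator rather than a list, which is why the cells above wrap it in `list()` before slicing.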

Stemming & Lemmatization

For grammatical reasons, documents use different forms of a word, such as organize, organizes, and organizing. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

#Manual stemming
#Define the suffixes to strip
stemmer = nltk.RegexpStemmer('ing$|es$')
for token in tokens[0:1000]:
    stemmed_token=stemmer.stem(token)
    if token!=stemmed_token:
        print(token,stemmed_token)

#surrounding to surround, inheriting to inherit and remaining to remain look right, but results such as thing to th and coming to com show how naive this approach is.
surrounding surround
coming com
inheriting inherit
remaining remain
providing provid
having hav
cunning cunn
living liv
including includ
thing th
fortunes fortun
besides besid
remaining remain
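Results such as "thing" to "th" suggest one cheap guard: only strip a suffix when enough of the word is left over. The sketch below uses the standard `re` module rather than NLTK; the function name `strip_suffix` and the threshold of four letters are arbitrary assumptions for illustration.

```python
import re

SUFFIX_RE = re.compile(r"(ing|es)$")

def strip_suffix(word, min_stem=4):
    """Remove a trailing 'ing' or 'es' only if at least min_stem letters remain."""
    stem = SUFFIX_RE.sub("", word)
    return stem if len(stem) >= min_stem else word

for word in ["surrounding", "remaining", "thing", "coming", "having", "cunning"]:
    print(word, strip_suffix(word))
```

The guard leaves "thing", "coming", and "having" untouched but still turns "cunning" into "cunn", so it reduces the damage without removing it.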
#Let's try NLTK's Porter stemmer
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for token in tokens[0:100]:
    stemmed_token = ps.stem(token)
    if token!=stemmed_token:
        print(token,stemmed_token)
sense sens
sensibility sensibl
family famili
settled settl
estate estat
was wa
residence resid
was wa
centre centr
many mani
lived live
respectable respect
engage engag
general gener
surrounding surround
this thi
estate estat
was wa
single singl
lived live
very veri
advanced advanc
many mani
years year
his hi
housekeeper housekeep
his hi
happened happen
years year
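#Lemmatize with WordNet; lemmatize() treats every word as a noun unless a POS is given,
#which is why 'was' becomes 'wa' and 'as' becomes 'a' below
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens[0:100]]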
['sense',
 'and',
 'sensibility',
 'by',
 'jane',
 'austen',
 '(1811)',
 'chapter',
 '1',
 'the',
 'family',
 'of',
 'dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'sussex.',
 'their',
 'estate',
 'wa',
 'large,',
 'and',
 'their',
 'residence',
 'wa',
 'at',
 'norland',
 'park,',
 'in',
 'the',
 'centre',
 'of',
 'their',
 'property,',
 'where,',
 'for',
 'many',
 'generations,',
 'they',
 'had',
 'lived',
 'in',
 'so',
 'respectable',
 'a',
 'manner',
 'a',
 'to',
 'engage',
 'the',
 'general',
 'good',
 'opinion',
 'of',
 'their',
 'surrounding',
 'acquaintance.',
 'the',
 'late',
 'owner',
 'of',
 'this',
 'estate',
 'wa',
 'a',
 'single',
 'man,',
 'who',
 'lived',
 'to',
 'a',
 'very',
 'advanced',
 'age,',
 'and',
 'who',
 'for',
 'many',
 'year',
 'of',
 'his',
 'life,',
 'had',
 'a',
 'constant',
 'companion',
 'and',
 'housekeeper',
 'in',
 'his',
 'sister.',
 'but',
 'her',
 'death,',
 'which',
 'happened',
 'ten',
 'year']

Sentences & Words

How many words can a sentence have? Some sources say that sentences longer than 25 words aren’t accessible.

#Find and separate all the sentences in the novel
from nltk import sent_tokenize
with open('SenseAndSensibility.txt','r') as f:
    book=f.read()
sentence_list=sent_tokenize(book)

#Print sentences 6 through 30 of the book
for sentence in sentence_list[5:30]:
    print(sentence,"\n\n")
His
attachment to them all increased.


The constant attention of Mr. and
Mrs. Henry Dashwood to his wishes, which proceeded not merely from
interest, but from goodness of heart, gave him every degree of solid
comfort which his age could receive; and the cheerfulness of the
children added a relish to his existence.


By a former marriage, Mr. Henry Dashwood had one son: by his present
lady, three daughters.


The son, a steady respectable young man, was
amply provided for by the fortune of his mother, which had been large,
and half of which devolved on him on his coming of age.


By his own
marriage, likewise, which happened soon afterwards, he added to his
wealth.


To him therefore the succession to the Norland estate was not
so really important as to his sisters; for their fortune, independent
of what might arise to them from their father's inheriting that
property, could be but small.


Their mother had nothing, and their
father only seven thousand pounds in his own disposal; for the
remaining moiety of his first wife's fortune was also secured to her
child, and he had only a life-interest in it.


The old gentleman died: his will was read, and like almost every other
will, gave as much disappointment as pleasure.


He was neither so
unjust, nor so ungrateful, as to leave his estate from his nephew;--but
he left it to him on such terms as destroyed half the value of the
bequest.


Mr. Dashwood had wished for it more for the sake of his wife
and daughters than for himself or his son;--but to his son, and his
son's son, a child of four years old, it was secured, in such a way, as
to leave to himself no power of providing for those who were most dear
to him, and who most needed a provision by any charge on the estate, or
by any sale of its valuable woods.


The whole was tied up for the
benefit of this child, who, in occasional visits with his father and
mother at Norland, had so far gained on the affections of his uncle, by
such attractions as are by no means unusual in children of two or three
years old; an imperfect articulation, an earnest desire of having his
own way, many cunning tricks, and a great deal of noise, as to outweigh
all the value of all the attention which, for years, he had received
from his niece and her daughters.


He meant not to be unkind, however,
and, as a mark of his affection for the three girls, he left them a
thousand pounds a-piece.


Mr. Dashwood's disappointment was, at first, severe; but his temper was
cheerful and sanguine; and he might reasonably hope to live many years,
and by living economically, lay by a considerable sum from the produce
of an estate already large, and capable of almost immediate
improvement.


But the fortune, which had been so tardy in coming, was
his only one twelvemonth.


He survived his uncle no longer; and ten
thousand pounds, including the late legacies, was all that remained for
his widow and daughters.


His son was sent for as soon as his danger was known, and to him Mr.
Dashwood recommended, with all the strength and urgency which illness
could command, the interest of his mother-in-law and sisters.


Mr. John Dashwood had not the strong feelings of the rest of the
family; but he was affected by a recommendation of such a nature at
such a time, and he promised to do every thing in his power to make
them comfortable.


His father was rendered easy by such an assurance,
and Mr. John Dashwood had then leisure to consider how much there might
prudently be in his power to do for them.


He was not an ill-disposed young man, unless to be rather cold hearted
and rather selfish is to be ill-disposed: but he was, in general, well
respected; for he conducted himself with propriety in the discharge of
his ordinary duties.


Had he married a more amiable woman, he might
have been made still more respectable than he was:--he might even have
been made amiable himself; for he was very young when he married, and
very fond of his wife.


But Mrs. John Dashwood was a strong caricature
of himself;--more narrow-minded and selfish.


When he gave his promise to his father, he meditated within himself to
increase the fortunes of his sisters by the present of a thousand
pounds a-piece.


He then really thought himself equal to it.


The
prospect of four thousand a-year, in addition to his present income,
besides the remaining half of his own mother's fortune, warmed his
heart, and made him feel capable of generosity.-- "Yes, he would give
them three thousand pounds: it would be liberal and handsome!


It would
be enough to make them completely easy.
#Total sentences found
len(sentence_list)
4835
#Sentence-to-word ratio (its reciprocal, about 24.5, is the mean number of whitespace-separated words per sentence)
len(sentence_list)/len(tokens)
0.040778975422971174
#Number of tokens in each sentence
import numpy as np

word_distribution=[]
for sentence in sentence_list:
    temp_tokens=nltk.word_tokenize(sentence)
    total_words=len(temp_tokens)
    word_distribution.append(total_words)

print(np.mean(word_distribution))

#Each sentence has roughly 29 tokens on average (word_tokenize counts punctuation marks as separate tokens)
29.23453981385729
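The mean alone does not say how many sentences exceed the 25-word guideline mentioned earlier. A small summary like the one below could be applied to `word_distribution`. It is a sketch using only the standard library; the function name `summarize_lengths` and the toy lengths are assumptions, not results from the novel.

```python
from statistics import mean, median

def summarize_lengths(lengths, threshold=25):
    """Return the mean, the median, and the share of sentences longer than threshold."""
    over = sum(1 for n in lengths if n > threshold)
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "share_over": over / len(lengths),
    }

# Toy sentence lengths; one very long sentence pulls the mean above the median.
toy_lengths = [6, 12, 18, 24, 30, 90]
summary = summarize_lengths(toy_lengths)
print(summary)
```

Reporting the median alongside the mean helps when a few very long sentences skew the average.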
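#Plot this distribution
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)

#histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(word_distribution, kde=True, stat="density")
plt.show()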

[Figure: distribution of the number of tokens per sentence]

Building a classifier

Can we build a classifier to predict the gender of the four main protagonists of the novel? I will discuss this in Part 2 of this notebook.