PMs Visualized

As I mentioned in the last post, thanks to Tableau Blog Finder I have discovered several data blogs and am learning new tricks. After seeing this post on DataRemixed, which visualizes the history of US presidents, and given that the next parliamentary elections of India are just around the corner, I was inspired to create a similar visualization for the prime ministers of India. I fumbled on a few points but got it done, and here is the result of my attempts. Click on the image to view the interactive dashboard:

PMs_of_India

Some observations:

  • The first thing I observed was that, compared to the original dashboard on DataRemixed and a few similar ones, I was visualizing a much shorter period of time – 66 years. It was a very eventful period, though.
  • Uttar Pradesh has given the country 8 prime ministers.
  • 3 prime ministers died while in office.
  • Only 2 ex-prime ministers are alive as of today.
  • Outside the two major political parties, INC and BJP, the longest rule was under prime minister Morarji Desai of the Janata Party. It was also the first non-Congress government of India.

* Gulzarilal Nanda served as acting prime minister of the country twice – after the deaths of Jawaharlal Nehru and Lal Bahadur Shastri.

Text mining, Tableau and Jane Austen

“Jane Austen’s corpus is modest in number but magnificent in achievement.”

I read this sentence somewhere and, given that I have been exploring text mining these days, I got to thinking about exploring Jane Austen’s words, which are certainly magnificent in achievement.

First things first. Get data. I decided to concentrate on her six novels – Sense and Sensibility, Pride and Prejudice, Emma, Mansfield Park, Northanger Abbey and Persuasion – and obtained their texts from the Project Gutenberg website.

After some manual clean-up in Notepad++, I did the rest of the work using the tm package in R. The corpus was processed in a standard way – words were converted to lower case, and numbers and punctuation were removed.
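
For readers who want to try the same clean-up outside R, here is a minimal sketch of those steps using only Python's standard library. It is not the tm code used for this post; the function names `clean_text` and `tokenize` are made up for illustration, and replacing punctuation with a space (rather than deleting it) is my own assumption, chosen so that words joined by dashes do not merge.

```python
import re
import string

# Curly quotes, dashes and ellipses are not in string.punctuation,
# so they are listed here explicitly (written as escapes for clarity).
EXTRA_PUNCTUATION = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026"

def clean_text(raw: str) -> str:
    """Lower-case the text, then strip digits and punctuation."""
    text = raw.lower()
    # Remove numbers.
    text = re.sub(r"\d+", " ", text)
    # Replace every punctuation mark with a space (an assumption of this sketch).
    table = str.maketrans({ch: " " for ch in string.punctuation + EXTRA_PUNCTUATION})
    text = text.translate(table)
    # Collapse the runs of whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()

def tokenize(raw: str) -> list[str]:
    """Split the cleaned text into words."""
    return clean_text(raw).split()

sample = "Chapter 1. It is a truth universally acknowledged, that a single man..."
tokens = tokenize(sample)
print(len(tokens), tokens[:5])
```

One side effect of this choice is that contractions such as "didn’t" split into two tokens, which may or may not be what you want.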

  • How many words are in each book? I counted every word, including articles, to find this number.
  • How many unique words are in each book? The words were stemmed and then counted to get this number.
  • Which words are used frequently? To answer this question, I first removed the stop words of the language from the corpus. I saw that titles like ‘Miss’, ‘Mrs’ and ‘Mr’ also had very high frequency, so I decided to remove them, too. All words longer than two characters were used to build the term-document matrix, which was then used to count word frequencies.
  • Austen heroines topped the frequency list. Not much of a surprise here. The name ‘Catherine’ has the highest frequency in ‘Northanger Abbey’, ‘Emma’ is the most frequent word in, well, ‘Emma’, and so on. So the next question arose: how often is a heroine mentioned? I counted only the given name of each heroine. They are also referred to as ‘Miss Morland’ or ‘Miss Woodhouse’, but I didn’t count those. I observed interesting things when I found answers to this question.
    • Fanny is the heroine whose name appears the most times in her book: 816 times in ‘Mansfield Park’, followed by Emma, whose name appears 786 times in ‘Emma’. (I was surprised to learn this; I had counted on Emma to top this list.)
    • As the novels have different lengths, this wasn’t a fair comparison. So I adjusted the frequencies by scaling every novel to 100,000 words. Now Catherine of ‘Northanger Abbey’, a much shorter novel at 77,249 words, tops the list. If the book had 100,000 words, her name would have appeared 554 times. If ‘Mansfield Park’ had 100,000 words, Fanny’s name would have appeared 511 times.
    • So, how about Austen heroes then? How often do they appear in the text? This question was tricky to answer, because some of them are referred to as ‘Mr Knightley’ or ‘Mr Darcy’ and some are referred to more by their given names, like ‘Edmund’ or ‘Edward’. The latter two are also referred to as ‘Mr Bertram’ and ‘Mr Ferrars’ respectively. So here I decided to depend on my knowledge of the books. For Edmund and Edward, the frequency of their given names was used, while for Mr Darcy, Mr Knightley, Captain Wentworth and Colonel Brandon, the frequency of their last names ‘darcy’, ‘knightley’, ‘wentworth’ and ‘brandon’ was used. Now, ‘darcy’ could be part of ‘Mr Darcy’ or ‘Miss Darcy’ (remember that the corpus was processed and titles were removed), so the frequency of ‘darcy’ does not give the exact number of times ‘Mr Darcy’ was mentioned, yet it gives a fair idea.
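
The counting and length adjustment described above can be sketched in a few lines of Python. This is an illustration only, not the R code behind the dashboard: the stop list is a tiny made-up sample, the function names `term_frequencies` and `per_100k` are hypothetical, and the sentence is invented rather than taken from the novels.

```python
from collections import Counter

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "her", "she"}
# Titles that were removed because of their very high frequency.
TITLES = {"miss", "mrs", "mr"}

def term_frequencies(tokens):
    """Count words longer than two characters, skipping stop words and titles."""
    kept = [t for t in tokens
            if len(t) > 2 and t not in STOP_WORDS and t not in TITLES]
    return Counter(kept)

def per_100k(count, total_words):
    """Rescale a raw count to what it would be in a 100,000-word text."""
    return count * 100_000 / total_words

# Invented, already-cleaned sample text (lower case, no punctuation).
tokens = "miss fanny was quiet and fanny read to edmund in the east room".split()
freq = term_frequencies(tokens)

print(freq.most_common(3))
# The denominator is the total word count, stop words included.
print(round(per_100k(freq["fanny"], len(tokens))))
```

As a quick sanity check on the scaling, a name that appears 50 times in a 20,000-word text would come out at 250 mentions per 100,000 words.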

All of these numbers and findings are visualized using a Tableau Public dashboard. Click to view the interactive dashboard.

Jane_Austen_Dashboard

By the way, welcome!