As the new Indian government presented its first Rail Budget in parliament on 8th July, 2014, I got thinking about railway accidents. The most recent one occurred on 25th June, 2014; it killed 4 people and injured several others. So, here is a visualization that shows all the rail accidents since 1891. There is no clear trend in the number of accidents or the number of casualties over time. The 1990s were the worst decade in recorded history in terms of deaths and injuries from railway accidents. The numbers for the current decade are alarming: within the first 5 years, 61 accidents have occurred, which is already the highest number for any decade, and these accidents have resulted in 633+ deaths and 1114+ injuries. Accurate numbers of deaths and injuries are not easily available, so the actual figures are bound to be higher. These numbers are a clear indication that the government needs to take strong steps to curb such accidents.
The data source is the list of rail incidents on Wikipedia: http://en.wikipedia.org/wiki/List_of_Indian_rail_incidents
Only accidents that occurred on Indian soil are considered. Incidents that occurred at the time of partition are not considered as they are different from a typical rail accident.
The list was cleaned first using OpenRefine and then several fields were extracted manually. Accident location coordinates were acquired using ggmap in R. This visualization uses the Storypoints feature of Tableau 8.2.
Facebook held the 2nd bi-annual Viz Cup on the 20th of last month. As I was almost 8000 miles away from the venue, it was not possible for me to take part. But, as I was curious about it and was looking for some interesting data sets to explore, I checked out the data sets that were available for use in the competition. As someone who wanted to become an astronaut for the longest time, I was drawn to the data set of meteorites, and that is what has been visualized here.
As I tried to understand what each feature meant by googling, I came across a very beautiful visualization: http://bolid.es/. The visualization that I have created draws inspiration from it and adds a little more. I had also wanted to use the features provided by the ‘Pages Shelf’ in Tableau for a while and have finally used them here.
So, here is the visualization of all the recorded meteorites.
What you need to know:
- A meteorite can be either an observed ‘Fall’ – someone observed it falling from the sky and then found it – or it can be just ‘Found’.
- Meteorites are mainly classified into three classes: Stony, Iron and Stony-Iron. More sophisticated classifications are available, but only these top-level classes are used here.
What are some of the observations?
- Meteorite falls are rare compared to finds.
- Meteorites have a virtually equal probability of falling anywhere on Earth, but they have a higher probability of being observed in areas with high population density. This can be seen in the map, as ‘Falls’ are located in India, Europe, the East Coast of the USA, China, etc. I am really curious about this phenomenon, as areas with high population density also have high levels of light pollution and still meteors are observed.
- ‘Finds’ have a high probability of being made in areas where there is less human activity and where certain other favourable circumstances exist (see the Wikipedia article for details: http://en.wikipedia.org/wiki/Meteorite#Finds). We can observe this in the map, as many ‘Finds’ are located in Antarctica, the Sahara desert, Western Australia, etc.
- In the earlier part of recorded history (before 1900), falls were numerous compared to finds. Systematic expeditions that started in the last quarter of the last century resulted in a really high number of meteorites being “found”.
Alright, get ready to observe an on-demand meteor shower. Hover over the meteor in the visualization and let the falling star guide your way (First click on the image to open the viz).
This visualization uses the Pages shelf and its playback controls, which do not work when the workbook is hosted online. Download the workbook and view it locally or with the free Tableau Reader to watch the stars fall.
Did you switch off lights for Earth Hour this year? I did and so did 162 countries and territories around the world as per official reports.
During the days approaching Earth Hour 2014, I wanted to create a visualization that showed how Earth Hour has spread since it started in Sydney, Australia in 2007, but I couldn’t find a data set of all the countries/cities that participated in the event in a given year. Then, yesterday, I found a page which did provide information about when a country started participating in the event: http://www.earthhour.org/earth-hour-around-world . There were some discrepancies in the information provided, and whenever that happened I took the help of Wikipedia and, of course, Google. Using the data set created this way, I have built the following visualization. Keep in mind that the data set is neither complete nor 100% accurate. When it says that a country started participating in the event in a particular year, it could mean official participation or participation by individuals or organizations. Even with these limitations, we can get an idea of how the event has spread worldwide, with 2009 being the tipping point.
(Click on the image to view the visualization.)
Fruits are good for health. No, that is not the reason I decided to explore the status of fruit production. I simply got wondering about the world’s favourite fruits. I thought it would be banana, apple or orange. But I was wrong. It is watermelon. That’s what fruit production data from the ‘Food and Agriculture Organization of the United Nations’ says. I downloaded data for the latest 21 years available (1992-2012), and watermelon tops the list with a total production of 1,597,666,215 tonnes. Banana comes second at 1,582,096,157 tonnes. Of course, regional favourites would be different.
So, if you, like me, are interested in knowing how fruit production has changed over the years, here is a dashboard. While watermelon production tops the chart in 2012, grapes had the honours in 1992, which again I found surprising. The world saw a great rise in fruit production: from 472,120,415 tonnes in 1992 to 822,028,049 tonnes in 2012 (a 74.11% rise). China and India are the top fruit producers, but China is way ahead of everyone else. In 2012, China’s share of total fruit production was 26.93%. India was second with 10.10% of total production. If you want to know more about fruit production in the last 21 years, click on the image to open an interactive dashboard.
- As said before, the data is from 1992 to 2012.
- Only culinary fruits were considered (for example, tomato is not a culinary fruit).
- Fruit images from openclipart.org
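The 74.11% rise quoted above is easy to verify from the two totals. Here is a minimal Python sketch of that arithmetic; the function name `percent_change` is my own choice for illustration, and only the two tonnage figures come from the post.

```python
def percent_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0

# World fruit production totals quoted in the post (tonnes)
total_1992 = 472_120_415
total_2012 = 822_028_049

rise = percent_change(total_1992, total_2012)
print(f"Rise from 1992 to 2012: {rise:.2f}%")
```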
As I mentioned in the last post, thanks to Tableau Blog Finder I have discovered several data blogs and I am learning new tricks. After seeing this post on DataRemixed, which visualizes the history of US presidents, and given that the next parliamentary elections of India are just around the corner, I was inspired to create a similar visualization for the prime ministers of India. I fumbled with a few points but got it done, and here is the result of my attempts. Click on the image to view the interactive dashboard:
- The first thing I observed was that, compared to the original dashboard on DataRemixed and a few similar ones, I was visualizing a much shorter period of time – 66 years. It was a very eventful period though.
- Uttar Pradesh has given the country 8 prime ministers.
- 3 prime ministers died while in office.
- Only 2 ex-prime ministers are alive as of today.
- Other than the two major political parties – INC and BJP, the longest rule was under prime minister Morarji Desai of the Janata Party. It was also the first non-Congress government of India.
* Gulzarilal Nanda served as acting prime minister of the country twice – after the deaths of Jawaharlal Nehru and Lal Bahadur Shastri.
This blog is mentioned in a visualization created by Tableau Digital Team. Tableau Public team called January ‘Data Blogging Month’ and asked bloggers to submit their blog details and the result is this visualization. I found it very helpful and I have already learned new tricks from fellow bloggers.
A few weeks ago I watched a news report on migratory birds being hunted and considered a culinary delicacy. After watching the news, I searched for a database that kept track of migratory birds over the years and came across MigrantWatch. It is a wonderful citizen science project which asks volunteers to report migrant sightings in India. The project started in 2007 and has collected a wealth of data so far. Now, my knowledge of bird migration is limited and I explored the data from a novice’s point of view. I had recently worked with the latest IUCN Red List, so I thought of combining these two datasets. The result is the following visualization, where I have tried to learn more about bird species with a conservation status of ‘Threatened’ (which includes ‘Critically Endangered’, ‘Endangered’ and ‘Vulnerable’ species). The visualization shows sightings of these threatened birds over the last 6 migratory seasons (from 2007-2008 to 2012-2013). The current season is still on and hence not included (a migratory season runs from July 1 to June 30).
Click on the image to access interactive dashboard:
Out of 264 migrant species sighted up to the last migratory season, 15 species have a threatened conservation status. Many of the world’s migratory species are in decline for various reasons, and threatened species especially will need to be closely watched. The dashboard shows when and where these species are sighted, which should be helpful to bird watchers who want to view these birds; if efforts are not made for their conservation, we may not be able to see them in the future.
A note about the number of sightings in each migratory season: the numbers should not be interpreted as an increase or decline in the population of a bird species. As the MigrantWatch project has gained popularity, more and more bird watchers are contributing their sightings and hence there are more records. And yet there are interesting observations. For example, the Sociable Plover (What a name! 🙂 ) is a bird which is found wintering in Gujarat and Rajasthan. In the 2011-2012 season, there were only 2 sightings. Was it because there were fewer visitors that season or because no one was looking for them? Data over more seasons is required to get a reasonable answer.
- Bird sighting reports: http://www.migrantwatch.in/data.php
- IUCN Red List Status of Birds: http://www.birdlife.org/datazone/species/search
Most of the images are from Wikipedia; cropped and resized as needed.
The rural healthcare system in India is a three-tier system.
- Sub-Centres (SCs): First contact point between the community and the healthcare system
  - Manned by at least 1 female and 1 male health worker
- Primary Health Centres (PHCs): First contact point between the rural community and a medical officer
  - Manned by a medical officer, supported by 14 paramedical and other staff consisting of nursing staff, pharmacists, health workers, laboratory technicians and others.
- Community Health Centres (CHCs): A referral centre for 4 PHCs.
  - Manned by 4 medical specialists (Surgeon, Gynaecologist, Paediatrician, Physician) and supported by 21 paramedical and other staff consisting of nursing staff, pharmacists, radiographers, laboratory technicians and others.
As per the 2011 census of India, the rural population is 83,34,63,448, which is 68.85% of the total population of 1,21,05,69,573. This population is served by the rural healthcare system and is affected by the shortfall of trained healthcare providers in the system. The following dashboard helps in understanding the situation in various states and union territories.
- Bihar, Odisha and Chhattisgarh are mainly rural states and they face a severe manpower shortage (check out the dashboard tab: How rural is my state?).
- There is a heavy shortage of the specialists who work at CHCs. Most of the states/UTs have a shortfall of more than 50%.
- There is also a severe shortage of male and female health assistants who work at SCs and PHCs. In 4 UTs and 3 states, there is a 100% shortfall of male health assistants. 2 UTs and 5 states have more than 80% of the positions for female health assistants unfilled. It should be noted here that in several states/UTs, staff is required but not a single position is sanctioned. Why is that so? Is it bureaucracy? Is it a lack of funds? This needs to be explored further.
Explore the dashboard yourself to find more interesting observations. Click the image to open the Tableau dashboard.
Recently, I came across this post on the Guardian Data Blog: http://www.theguardian.com/news/datablog/2013/nov/26/iucn-red-list-threatened-species-by-country-statistics
I decided to explore the latest updates in the IUCN Red List (http://www.iucnredlist.org/). The list is updated at least once a year. I concentrated on the changes in threat category status. As per the information given with the data (http://www.theguardian.com/news/datablog/2013/nov/26/iucn-red-list-threatened-species-by-country-statistics#data), status changes happen because of genuine reasons (genuine improvement or deterioration in a species’ status) or non-genuine reasons (status changes due to new information, improved knowledge of the criteria, taxonomic revision, incorrect data used previously, etc.).
The following dashboard explores these changes across various species types. Click to view the interactive dashboard.
“Jane Austen’s corpus is modest in number but magnificent in achievement.”
I read this sentence somewhere and given that I have been exploring text mining these days, I got thinking about exploring Jane Austen’s words which are certainly magnificent in achievement.
First things first. Get data. I decided to concentrate on her six novels – Sense and Sensibility, Pride and Prejudice, Emma, Mansfield Park, Northanger Abbey and Persuasion – and obtained their texts from Project Gutenberg website.
After doing some manual clean-up in Notepad++, further work was done using the tm package in R. The corpus was processed in a standard way – words were converted to lower case, and numbers and punctuation were removed.
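For readers who do not use R, the clean-up described above can be sketched in Python. This is only a rough analogue of those steps, not the code used for this post; the function name `preprocess` and the sample sentence are assumptions made for illustration.

```python
import re

def preprocess(text: str) -> list[str]:
    """Lower-case the text, delete digits and punctuation, and split into words."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)        # remove numbers
    text = re.sub(r"[^\w\s]|_", "", text)  # remove punctuation, including underscores
    return text.split()

# Made-up sample line in the style of a Project Gutenberg text
sample = "Chapter 1: Emma Woodhouse, handsome, clever, and rich!"
tokens = preprocess(sample)
print(tokens)       # cleaned words in their original order
print(len(tokens))  # total word count, articles included
```

Note that deleting punctuation outright joins hyphenated words together, so a real corpus may need a slightly different rule.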
- How many words in each book? I considered every word including articles to find this number.
- How many unique words in each book? The words were stemmed and then counted to get this number.
- Which words are frequently used? To answer this question, first of all, the stop words of the language were removed from the corpus. I saw that titles like ‘Miss’, ‘Mrs’ and ‘Mr’ also had very high frequency, so I decided to remove them, too. All the words with length more than two were used to build the term document matrix, which was then used to count word frequencies.
- Austen heroines topped the frequency list. Not much of a surprise here. So, the word or name ‘Catherine’ has the highest frequency in ‘Northanger Abbey’, ‘Emma’ is the most frequent word in, well, ‘Emma’, and so on. That raised the next question: how often is a heroine mentioned? I considered the given name of each heroine. They would also be referred to as ‘Miss Morland’ or ‘Miss Woodhouse’, but I didn’t count those. I observed interesting things when I found answers to this question.
- Fanny is the heroine whose name appears the most number of times in her book – ‘Mansfield Park’, 816 times; followed by Emma whose name appears 786 times in ‘Emma’. (I was surprised learning this. I had counted on Emma to top this list)
- As the novels have different lengths, this wasn’t a fair comparison. So, I adjusted the frequencies by fixing the number of words in each novel at 100,000. Now, Catherine of ‘Northanger Abbey’, which is a much shorter novel at 77,249 words, tops the list. If the book had 100,000 words, her name would have appeared 554 times. If ‘Mansfield Park’ had 100,000 words, Fanny’s name would have appeared 511 times.
- So, how about Austen heroes then? How often do they appear in the text? This question was tricky to answer, because some of them are referred to as ‘Mr Knightley’ or ‘Mr Darcy’ and some of them are referred to more by their given name, like ‘Edmund’ or ‘Edward’. The latter two are also referred to as ‘Mr Bertram’ and ‘Mr Ferrars’ respectively. So, here I decided to depend on my knowledge of the books. For Edmund and Edward, the frequency of their given names was used; while for Mr Darcy, Mr Knightley, Captain Wentworth and Colonel Brandon, the frequency of their last names ‘darcy’, ‘knightley’, ‘wentworth’ and ‘brandon’ was used. Now, ‘darcy’ could be part of ‘Mr Darcy’ or ‘Miss Darcy’ (remember that the corpus was processed and titles were removed), so the frequency of ‘darcy’ does not give the exact number of times ‘Mr Darcy’ was mentioned, yet it gives a fair idea.
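The counting and the per-100,000-words adjustment described in the list above can be sketched in Python as follows. Treat it as an illustration under stated assumptions rather than the actual analysis, which was done with a term document matrix in R: the tiny stop word list, the function names and the toy token list are all made up here. The raw count for ‘Catherine’ is not quoted in the post; 428 is simply the value implied by 554 per 100,000 words in a 77,249-word novel.

```python
from collections import Counter

TITLES = {"miss", "mrs", "mr"}
# Tiny illustrative stop word list (assumption); a real analysis would use a full list
STOP_WORDS = {"the", "and", "was", "her", "she", "that", "not", "had"}

def word_frequencies(tokens, stop_words=STOP_WORDS, titles=TITLES):
    """Count words longer than two letters that are neither stop words nor titles."""
    kept = [t for t in tokens
            if len(t) > 2 and t not in stop_words and t not in titles]
    return Counter(kept)

def per_100k(count, total_words):
    """Scale a raw count to a hypothetical novel of 100,000 words."""
    return count * 100_000 / total_words

# Toy token list (assumption), already lower-cased and stripped of punctuation
toy_tokens = ["emma", "and", "mr", "knightley", "miss", "emma", "was", "happy"]
freqs = word_frequencies(toy_tokens)
print(freqs.most_common(2))

# Implied raw count for 'Catherine' (assumption, see the note above)
catherine_rate = per_100k(428, 77_249)
print(round(catherine_rate))
```

The same scaling run backwards suggests that ‘Mansfield Park’ has roughly 816 × 100,000 / 511, or about 159,700, words.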
All of these numbers and findings are visualized using a Tableau Public dashboard. Click to view the interactive dashboard.
By the way, welcome!