Kaptaan in the light of Data: Exploratory and Sentiment Analysis - Part 1


With Pakistan's 2018 election just around the corner, this is probably the perfect pint of motivation to push the boundaries of mathematics and science and see what is revealed. For those who like their politics brewed with just a dash more data, this is part one of a series centered on exploring the data about the election.

The Pakistani Kaptaan, Imran Khan, is a major contender for the position of Pakistan's Prime Minister. Given his monumental rise in popularity since the last election, and especially his brief 'dharna' stunt in the capital, I thought it would be interesting to take a look at everything people have been saying about him. More importantly, I wanted to see whether the general public's sentiment around him has changed and, if it has, by how much.

With the elections nearly upon us, I hoped that, armed with data science and the arsenal of tools that come with it, I would be able to make an educated guess about how much Imran Khan has managed to achieve during the past few years and how it has affected people's opinion of him. I also wanted to see which notable events of the past few years got people talking about him (be it in a positive or negative light).


The Goal

‘Applying exploratory sentiment analysis on comments, tweets, Facebook, and forum posts to gauge the general interest and opinion about Imran Khan.’

The original aim was to extract the views of the Pakistani public using data from tweets, Facebook comments, political forums and comments on popular news websites. However, I faced several issues, from a limit on the number of tweets I could scrape to a new review policy implemented by Facebook, put into action after the recent Cambridge Analytica scandal. In the end, I turned to the websites of newspapers and news channels in Pakistan.

Only Dawn, one of Pakistan's oldest and most respected newspapers, had enough comments on each article to be worth analyzing. The problem? Dawn doesn't provide an API or any direct way to download and search through all the comments, so I decided to write my own web scraping script using R's 'rvest' package.
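To give a feel for what comment scraping involves, here is a minimal sketch in Python using only the standard library. It is not the rvest script used for this post: the sample HTML and the "comment__body" class name are assumptions made up for illustration, since Dawn's actual page markup is not shown here. A real scraper would also need to fetch the pages and handle nested markup.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page structure is not given in this post,
# so the "comment__body" class name is an assumption for illustration only.
SAMPLE_HTML = """
<article>
  <h2 class="story__title">Imran Khan addresses rally</h2>
  <div class="comment__body">Great speech.</div>
  <div class="comment__body">Not convinced.</div>
</article>
"""


class CommentExtractor(HTMLParser):
    """Collects the text of every <div class="comment__body"> element.

    Assumes comment divs do not contain nested <div> elements.
    """

    def __init__(self):
        super().__init__()
        self.comments = []
        self._in_comment = False

    def handle_starttag(self, tag, attrs):
        # Start a new comment when the (assumed) comment container opens.
        if tag == "div" and dict(attrs).get("class") == "comment__body":
            self._in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        # Any closing </div> ends the current comment.
        if tag == "div":
            self._in_comment = False

    def handle_data(self, data):
        # Accumulate text only while inside a comment container.
        if self._in_comment:
            self.comments[-1] += data


def extract_comments(html):
    """Return the stripped text of each comment found in `html`."""
    parser = CommentExtractor()
    parser.feed(html)
    return [text.strip() for text in parser.comments]


comments = extract_comments(SAMPLE_HTML)
print(len(comments), comments)
```

The same idea carries over to rvest, where a CSS selector picks out the comment nodes and their text is then extracted.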


Exploratory Analysis

I scraped 845 articles from Dawn containing 'Imran Khan' in the title, dating from 16th July, 2018, back to 2002. However, since I was focusing on comments, I only considered articles with at least one comment.

Here is the distribution of all the articles on Dawn's website with 'Imran Khan' in the title and with at least one published comment.


You can see a clear spike in 2014 — the year of the 'Azadi March', also known as the tsunami march. 2015 managed to keep up the hype but 2016 was the year when interest died down a little.

In 2017, with just a year to go till the elections, more articles started appearing. With 2018 barely halfway through, Dawn has written almost as many articles in six and a half months as it wrote in all of last year.

Note: These are only the articles with at least one comment.

Next, I decided to take a look at the number of comments (or the hype) generated each year. In total, I got 38,866 comments from 717 articles. Here is how they are distributed over the years:

“2015 was the year with the highest number of comments: 8709.”

Since 2011, the number of comments has gradually increased. It is worth noting that, during the same period, internet users in Pakistan also increased, from just 9% in 2011 to 10.9% in 2013. Even so, there is a clear spike in comments in 2014 (8648 comments), during the days of the Azadi March, which may also reflect the increasing interest in Imran Khan.
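The yearly counts come from a simple group-and-sum over the scraped articles. A minimal Python sketch of that aggregation is below; the records are invented placeholders rather than the Dawn data, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (publication date, number of comments).
# These values are placeholders, not the scraped Dawn data.
articles = [
    (date(2014, 8, 14), 120),
    (date(2014, 8, 20), 95),
    (date(2015, 3, 2), 60),
    (date(2016, 11, 1), 0),   # dropped below: no comments
    (date(2017, 7, 28), 40),
]


def yearly_totals(records):
    """Return {year: (article_count, comment_count)}.

    Only articles with at least one comment are counted,
    matching the filter described in the post.
    """
    totals = defaultdict(lambda: [0, 0])
    for published, n_comments in records:
        if n_comments < 1:
            continue
        totals[published.year][0] += 1
        totals[published.year][1] += n_comments
    return {year: tuple(pair) for year, pair in sorted(totals.items())}


totals = yearly_totals(articles)
for year, (n_articles, n_comments) in totals.items():
    print(year, n_articles, n_comments)
```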

Here is a look at the comment and article charts for 2014. See how the number of comments climbed as more articles were written about Imran Khan and his quest to overturn the government:

The articles labelled above the dotted line are the top 5% of articles from 2014 by number of comments.

There is an obvious and significant relationship between the number of articles and the total number of comments in a year. In August 2014, Imran Khan began his march, and 29 articles were written about him that month, attracting over two thousand comments. But did the number of comments per article increase at the same rate? To find out, I plotted the number of articles alongside the number of comments per article for each month of 2014:

Green — Number of Articles

Grey — Number of Comments Per Article

There isn't much to infer from this, except that the number of comments per article climbed as the number of articles climbed. However, a similar increase in the number of articles in November was not matched by a similar rise in comments per article. In conclusion, it seems that August was an anomaly, with more and more people wanting to talk about Imran Khan.
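The comparison above rests on one derived number per month: total comments divided by the number of articles. Here is a minimal Python sketch of that calculation, using made-up counts chosen only to mimic the pattern described (a busy August and a quieter November with the same number of articles).

```python
from collections import defaultdict

# Hypothetical (month, comments) pairs for one year; not the real Dawn data.
articles_2014 = [
    ("Aug", 90), ("Aug", 70), ("Aug", 80),
    ("Nov", 20), ("Nov", 30), ("Nov", 10),
    ("Jan", 15),
]


def comments_per_article(records):
    """Return {month: (article_count, mean_comments_per_article)}."""
    article_counts = defaultdict(int)
    comment_totals = defaultdict(int)
    for month, n_comments in records:
        article_counts[month] += 1
        comment_totals[month] += n_comments
    return {
        month: (article_counts[month], comment_totals[month] / article_counts[month])
        for month in article_counts
    }


monthly = comments_per_article(articles_2014)
for month, (n_articles, per_article) in monthly.items():
    print(month, n_articles, per_article)
```

In this toy data, August and November have the same number of articles but very different comments-per-article figures, which is the kind of gap the chart is meant to expose.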

Next, I tried the same technique for all the years to see how the number of comments per article has fared, and this gave me a better answer:


Surprisingly, the number of comments per article has gone down over the years, except for 2014 (which is to be expected given Imran Khan's abrupt rise in popularity). The median number of comments per article is 25, while the mean comes to around 45.8.
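A mean well above the median usually means a handful of heavily commented articles are pulling the average up. The Python sketch below illustrates the effect with invented numbers; they are not the scraped values, only a small example of a right-skewed distribution.

```python
from statistics import mean, median

# Hypothetical comment counts for seven articles (not the Dawn data).
# One heavily commented article skews the distribution to the right.
comment_counts = [5, 10, 20, 25, 30, 40, 190]

typical = median(comment_counts)   # middle value, robust to the outlier
average = mean(comment_counts)     # pulled upwards by the 190-comment article

print("median:", typical)
print("mean:", round(average, 1))
print("mean exceeds median:", average > typical)
```

For this kind of data, the median is the better description of a "typical" article, while the mean is more sensitive to viral ones.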

Here is a look at all the articles, plotted by number of comments. The articles labeled above the dotted line are the top 1% of articles by number of comments.
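One simple way to pick out the "top 1%" (or the top 5% used for the 2014 chart) is to sort by comment count and keep the first few entries. The sketch below assumes that rank-based approach; the exact cut-off rule used for the charts is not specified in this post, and the helper name and data are hypothetical.

```python
def top_percent(counts, percent):
    """Return the largest `percent`% of values (at least one), highest first.

    `percent` is an integer such as 1 or 5. The number of items kept is
    rounded up, so 1% of 717 articles keeps 8 of them.
    """
    keep = max(1, (len(counts) * percent + 99) // 100)  # integer ceiling
    return sorted(counts, reverse=True)[:keep]


# Hypothetical data: 200 articles with 1, 2, ..., 200 comments.
comment_counts = list(range(1, 201))

top_one = top_percent(comment_counts, 1)
top_five = top_percent(comment_counts, 5)
print(top_one)        # the two most-commented articles
print(len(top_five))  # ten articles
```

The smallest value in the returned list is where a dotted threshold line would sit on the chart.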


These 845 articles were written by 150 writers, although the number may be higher because my web scraping script only picked up the first writer listed (a Dawn article can credit multiple authors). After removing the articles written under generic staff bylines like ‘Dawn.com’, ‘The Newspaper’s Correspondent’, etc., I found the writers who have written the most about Imran Khan:


Does a writer’s writing style have an effect on the number of people who comment on his or her articles? To find out, I plotted the average number of comments per article each writer attracted on a bar chart. I only chose writers who have written at least five articles. Only 19 writers have written five or more articles and attracted an average of 25 comments per article.  
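The per-writer figure is a grouped average with a minimum-article filter. A minimal Python sketch of that step follows. The writer names and counts are invented, and dropping the generic bylines at this stage is an assumption based on the earlier clean-up; only the five-article threshold comes from the text.

```python
from collections import defaultdict
from statistics import mean

MIN_ARTICLES = 5  # threshold stated in the post
# Generic bylines named earlier in the post; excluding them here is an assumption.
GENERIC_BYLINES = {"Dawn.com", "The Newspaper's Correspondent"}

# Hypothetical (writer, comments) pairs; not the scraped Dawn data.
records = (
    [("Writer A", c) for c in (60, 70, 65, 80, 50)]
    + [("Writer B", c) for c in (10, 20)]
    + [("Dawn.com", c) for c in (5, 5, 5, 5, 5)]
)


def average_by_writer(rows, min_articles=MIN_ARTICLES):
    """Return {writer: mean comments per article} for prolific named writers."""
    by_writer = defaultdict(list)
    for writer, n_comments in rows:
        if writer not in GENERIC_BYLINES:
            by_writer[writer].append(n_comments)
    return {
        writer: mean(counts)
        for writer, counts in by_writer.items()
        if len(counts) >= min_articles
    }


averages = average_by_writer(records)
print(averages)
```

Writer B is dropped for having too few articles, which keeps one or two lucky posts from dominating the ranking.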

Irfan Haider has written articles like ‘Imran Khan joins civil disobedience movement, burns power bill’, ‘Imran accuses HRCP of promoting foreign agenda’, and ‘Imran Khan suspends Justice (r) Wajihuddin's PTI membership’, mostly around the 2014 - 2015 period, averaging 65 comments per article.

Here is a look at the top 5 writers who have generated the most hype in the comments section of Dawn’s website.



In conclusion, based on the data from Dawn, it is fair to say that interest in Imran Khan has grown considerably over the past few years, especially since his dharna in 2014. While 2016 may have been a somewhat slow year, he bounced back in 2017, generating a great deal of interest with his campaign to become the next Prime Minister of Pakistan.

To wrap things up, in Part 1 I used exploratory data analysis to visualize and summarize the main characteristics of the data. The next part will focus on sentiment analysis of the articles and the comments to gain more insight into the general opinion around Imran Khan.


Rehan is pursuing his Bachelor's in Computer Science from IBA and wants to make sense of the world of ones and zeros. He is also a freelance writer on technology.