Saturday, 12 August 2017

Word clouds: Text analysis for interesting patterns

This post utilises text analysis to identify trends in popular web pages. Of interest are home pages of news sites, Twitter pages of popular personalities and other sites where content changes frequently. Analysing words and projecting to a cloud provides a visual representation of text data which in turn helps to quickly perceive overall changes in trends and interests. Here the cloud is generated on content from Saturday and Sunday.

About the application:

The web application generates a word cloud for each submitted link at a given time. The application first generates a word count for each page using natural language tool kit and stores the results in a database. The counts are converted to a word cloud using opensource javascript libraries. Web links can be added to the application as needed. The word count task is performed asynchronously using celery worker nodes. 

Software built using:

Django1.10/Python3.5, Asynchronous tasks using Celery 4.1.0, Javascript, Natural Language Processing Tool Kit (NLTK), Postgres Database. 3 hosts. One for web application, one for celery workers and Database and third for the message broker. 

Home pages analysed:

RT.com
BBC.com
Google News
Fox News
Indian Prime Minister Twitter Page
Obama Twitter page
President Trump Twitter Page

Celery Django Architecture

>> Word cloud on Saturday to left and Sunday to right

RT.com

BBC.com

Fox News

Google News

Obama


 President Trump and  the Indian Prime Minister Twitter Handles