Analyzing RSS feeds

The aim of this project is to monitor ideas and trends in data journalism. For this we have set up a “Living lab” using:

  1. An RSS server to collect and manage the publications. For this we use Tiny Tiny RSS, an open source web-based news feed (RSS/Atom) reader and aggregator. It stores the content in its own MySQL database, which makes it easy to re-use the data.
  2. A script to strip all HTML code from the feeds.
  3. Our Semantic Analyzer, based on InterSystems iKNOW technology. Read about the basics of this technology.
  4. The integration of the content analysis results within the Business Intelligence Server.
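The cleaning step (item 2) can be illustrated with a minimal sketch. This is not the script used in the project; it only assumes Python's standard `html.parser` module, and the names `TextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse the whitespace left by tags.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Joining the text chunks with a space keeps words from adjacent paragraphs apart, at the cost of occasionally splitting a word that has inline markup in the middle.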

Case study: Analysing RSS feeds about data journalism

Over the last year we gathered 8,217 articles (RSS feed items) from 81 websites publishing stories and insights about data journalism.

The articles were stored in a MySQL database together with information about the source, the author, the publishing date, and a link to the original publication.

A script, running in the background, stripped all HTML code from the articles. The clean text was then loaded into our Semantic Analyzer, which retrieves:

  1. Main concepts according to frequency and spread
  2. Similar concepts
  3. Related concepts

Browsing the articles by “Top Concepts”

The Semantic Analyzing process lists concepts according to spread and frequency.

The first concepts in this list may not be that surprising, since they often reflect the criteria you applied when selecting the content, in this case “about data journalism”. Still, this way we see that the concept “journalism” is mentioned 1512 times, spread over 1154 articles.
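To make the two measures concrete: frequency is the total number of mentions, and spread is the number of articles with at least one mention. The sketch below counts both with a plain whole-word match. It is a simplified illustration, not how the Semantic Analyzer identifies concepts, and `frequency_and_spread` is a hypothetical name.

```python
import re


def frequency_and_spread(articles, concept):
    """Return (frequency, spread) of `concept` over a list of plain-text articles.

    frequency = total number of mentions across all articles
    spread    = number of articles that mention the concept at least once
    """
    pattern = re.compile(r"\b" + re.escape(concept) + r"\b", re.IGNORECASE)
    frequency = 0
    spread = 0
    for text in articles:
        mentions = len(pattern.findall(text))
        frequency += mentions
        if mentions:
            spread += 1
    return frequency, spread
```

A frequency that is higher than the spread, as with “journalism” above, simply means that some articles mention the concept more than once.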

But when we dig deeper by scrolling through the list of concepts, we discover (rather than search for) more interesting concepts. In this case, for example, the articles are also automatically categorized by:

  • Projects
  • Tools
  • Future
  • Examples
  • etc…

Getting insight by “Similar Concepts”

After selecting a top concept, the system gives us a list of similar concepts, which provides a much more usable and detailed categorization scheme. So when we talk about data journalism, we see that analysis is an important concept. But what kind of analysis do people write about?

As you can see above, the concept “analysis” is mentioned 235 times in 164 different articles. But the system is also capable of breaking down the concept “analysis” into more detailed keywords such as “data analysis”, “statistical analyses”, “text analysis” or “social network analysis”. So, without searching, we quickly discovered 9 articles about text analysis among the more than 8,200 articles about data journalism.
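The idea of breaking a top concept down into more specific ones can be sketched as follows. This is only an illustration under a strong assumption: that a list of concept phrases is already available and that “similar” means “a longer phrase containing the same word”. The function name `similar_concepts` is hypothetical, and the Semantic Analyzer itself may work quite differently.

```python
def similar_concepts(concepts, head):
    """Return the multi-word concepts that contain `head` as a whole word."""
    head = head.lower()
    matches = []
    for concept in concepts:
        words = concept.lower().split()
        # Keep only more specific phrases, not the head concept itself.
        if head in words and len(words) > 1:
            matches.append(concept)
    return sorted(matches)
```

Note that a plain word match like this would miss the plural form “statistical analyses”; some normalisation of word forms would be needed to catch it.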

Depending on your interests, you can easily browse a large amount of content and click through to the original publication for further reading.

People interested in education, for example, would easily retrieve this list.


Bringing in the meta-data and Business Intelligence

By bringing in the meta-data we can create lists based on different criteria. In the example below we have built an overview of the authors based on the concepts ‘technology’, ‘tools’, ‘social media’ and ‘analysis’. We can see that “Ben Lorica” mostly writes about ‘tools’, while “Anil” seems to be more interested in ‘social media’.
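An overview of this kind boils down to a cross-tabulation of one meta-data field (the author) against a set of concepts. The sketch below is a simplified illustration, not the Business Intelligence Server's implementation. It assumes each article is a dictionary with hypothetical keys `author` and `text`, and it uses a naive substring match to decide whether an article mentions a concept.

```python
from collections import defaultdict


def author_concept_overview(articles, concepts):
    """Count, per author, how many articles mention each concept."""
    overview = defaultdict(lambda: dict.fromkeys(concepts, 0))
    for article in articles:
        # Touch the author's row first so authors without matches still appear.
        counts = overview[article["author"]]
        text = article["text"].lower()
        for concept in concepts:
            if concept.lower() in text:
                counts[concept] += 1
    return dict(overview)
```

Each row of the result shows one author's article count per concept, so the dominant concept for an author is the largest value in that row.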



It is also possible to combine different concepts in a new list. This way we can easily find articles about tools for social media, or tools for analysis and see whether or not they are talking about money.
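Combining concepts can be thought of as intersecting the sets of articles that mention each concept. The sketch below shows that idea with a naive substring match. It assumes the articles are held in a dictionary mapping an article id to its clean text, and the function names are hypothetical.

```python
def articles_mentioning(articles, concept):
    """Return the set of ids of articles whose text mentions `concept`."""
    needle = concept.lower()
    return {
        article_id
        for article_id, text in articles.items()
        if needle in text.lower()
    }


def combine_concepts(articles, concepts):
    """Return the sorted ids of articles that mention every given concept."""
    id_sets = [articles_mentioning(articles, concept) for concept in concepts]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))
```

Calling `combine_concepts(articles, ["tools", "social media"])` would return the articles about tools for social media. Checking the resulting ids against `articles_mentioning(articles, "money")` then shows which of them also talk about money.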



Back to the original content and how it was analyzed

By selecting the concept “analysis” as a “Top Concept” we discovered 9 articles about “text analysis”. Since this is our main interest in the field of data journalism, this is a very nice result for us. The list was automatically distilled from the 8,217 articles we started from. We would never have been able to read all these articles.

Let’s have a look at the article that talks about 4 other web-based text analysis tools.

In the screenshot below we can see how the article was broken up into sentences, with each sentence analysed for concepts and relationships.
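A very rough impression of sentence-level indexing is given by the sketch below. It splits text at sentence-ending punctuation and records which concepts from a given list occur in each sentence. This is an assumption-laden simplification: the Semantic Analyzer identifies concepts and relationships itself, whereas this sketch needs the concept list up front, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Split text at '.', '!' or '?' followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def index_sentences(text, concepts):
    """Return (sentence, concepts found in that sentence) pairs."""
    indexed = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        found = [concept for concept in concepts if concept.lower() in lowered]
        indexed.append((sentence, found))
    return indexed
```

The naive splitting rule breaks on abbreviations such as “e.g.”, so it is only suitable for illustration.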

If you want, you can also take a quick look at the summary generated by the system. It’s not perfect, but it gives you a good idea of what the article is about.


For more information or access to the online data: contact us