BBC – Blogs – Internet blog – BBC News Lab: Linked data

The linked data prototyping platform – The News Juicer

To explore the linked data problem spaces productively, we quickly realised we needed a platform that would give us BBC News in a linked data context.

Over the course of six weeks we set up a prototyping platform in the cloud, codenamed The News Juicer because it ‘juiced’ the News archive for the key linked data concepts.

As new BBC News articles are published to the BBC website they are placed in a queue on the News Juicer for semantic annotation. This job is performed as a series of background processes, combining a natural language processing pipeline with human input to verify the results.

Step 1 – Extract named entities

The first step in the pipeline is to extract ‘named entities’ from the raw article text. These are occurrences of proper nouns such as ‘London’ or ‘Mr Cameron’ that we can later map to DBpedia concepts.

To extract these entities we use the CoreNLP framework developed by Stanford University. This suite includes a statistical model that has been trained to recognise mentions of people, organisations and places within news articles based on the Brown Corpus.

Step 2 – Match to DBpedia concepts

The named entity recognition stage leaves us with a list of candidate terms that can be matched to DBpedia concepts.

In many cases there is a direct mapping between the extracted entity and the DBpedia identifier. For example, the extracted entity ‘London Place’ maps directly to http://dbpedia.org/resource/London.

More interesting cases arise where the entity text may not match the context it is found in.
For example, many football articles return results such as ‘Liverpool Organisation’, referring to Liverpool FC rather than the city of Liverpool. In these cases we can use the DBpedia Lookup Service to perform a scoped query on the entity text.

Much more difficult to resolve are truly ambiguous entities such as ‘Newport Place’, which could refer to any of the Newports around the UK and worldwide.

The system currently takes a very naive approach, using the DBpedia concept with the closest matching identifier. At the moment this means all Newports found in BBC News articles are mapped to the DBpedia concept http://dbpedia.org/resource/Newport, which is the city of Newport in Wales.
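
To make the matching step concrete, here is a minimal Python sketch of the idea, not the News Juicer's actual code. It assumes an extracted entity arrives as text followed by a type label (as in ‘London Place’), and that the naive mapping simply turns the entity text into a DBpedia resource identifier. The function names and the small CANDIDATES table are hypothetical; the table stands in for results that a scoped DBpedia Lookup Service query might return.

```python
from typing import Dict, List, Set, Tuple
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"

# Hypothetical stand-in for scoped lookup results: entity text mapped to
# (DBpedia URI, set of type labels). A real system would query a service.
CANDIDATES: Dict[str, List[Tuple[str, Set[str]]]] = {
    "Liverpool": [
        ("http://dbpedia.org/resource/Liverpool", {"Place"}),
        ("http://dbpedia.org/resource/Liverpool_F.C.", {"Organisation"}),
    ],
}


def split_entity(tagged: str) -> Tuple[str, str]:
    """Split 'London Place' into ('London', 'Place').

    Assumes the last word is the type label; returns an empty type
    if the input is a single word.
    """
    text, sep, entity_type = tagged.strip().rpartition(" ")
    if not sep:
        return entity_type, ""
    return text, entity_type


def naive_dbpedia_uri(entity_text: str) -> str:
    """Naive mapping: treat the entity text itself as the identifier."""
    identifier = entity_text.strip().replace(" ", "_")
    return DBPEDIA_RESOURCE + quote(identifier)


def resolve(tagged: str,
            candidates: Dict[str, List[Tuple[str, Set[str]]]]) -> str:
    """Prefer a candidate whose type matches the entity type;
    otherwise fall back to the naive closest-identifier mapping."""
    text, entity_type = split_entity(tagged)
    for uri, types in candidates.get(text, []):
        if entity_type in types:
            return uri
    return naive_dbpedia_uri(text)
```

With this sketch, resolve("Liverpool Organisation", CANDIDATES) picks the football club because its type matches, while resolve("Newport Place", CANDIDATES) has no typed candidates and falls back to http://dbpedia.org/resource/Newport, which is exactly the ambiguity problem described above.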
