In Strategic Foresight as Knowledge Management, Walter Kehl explained what the Extractor on the Shaping Tomorrow website does. In this article, he explains the technical side of how it works. Also keep an eye out for the next installment, which will cover the future plans for the Extractor.
In this second blog contribution I want to explain a little bit about how the Extractor works and what kinds of technologies it uses. To recapitulate, the task of the Extractor is to extract all metadata and future-relevant sentences from a given URL that contains a news article or a similar piece of text. The Extractor does its work in three main steps:
- Downloading an HTML file and extracting the “clean text” from it
- Processing and analyzing the clean text
- Interpreting and presenting the results
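The three steps above can be sketched as a small pipeline. This is a minimal illustration under simplified assumptions, not Shaping Tomorrow's actual code; all function names are made up, and each step is reduced to a toy implementation:

```python
import re

def extract_clean_text(html: str) -> str:
    # Step 1 (toy version): strip all tags; the real step uses
    # many more hints, as described in the next section.
    return re.sub(r"<[^>]+>", " ", html).strip()

def analyze(text: str) -> list[str]:
    # Step 2 (toy version): split the clean text into rough
    # sentences for further linguistic processing.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def present(sentences: list[str]) -> dict:
    # Step 3 (toy version): package the findings for the user.
    return {"sentence_count": len(sentences), "sentences": sentences}

def run(html: str) -> dict:
    # The three steps chained together.
    return present(analyze(extract_clean_text(html)))
```

Each stage only depends on the output of the previous one, which is why the steps can be improved independently.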
Extracting Clean Text
If you have ever looked at the contents of an HTML file (you can do this by using commands like “View Source”), you might have seen a long sequence of HTML tags, JavaScript functions and many other cryptic-looking words and symbols. The actual text we want to read is often almost hidden and occupies only 10 percent or even less of the file. Also, at this stage, the Extractor cannot look at the content of the text – the task is like finding the main text of a website written in a foreign language.
To find the clean text, the Extractor uses a mixture of hints and indicators:
- Structural information (headings, divisions)
- Text cohesion and length
- Links, titles, meta information
Luckily, most people (and the tools they use) follow some kind of convention which can be “reverse-engineered”, but there are always web pages authored in very unusual and even awkward ways. It is therefore hard to identify the correct clean text 100% of the time, but the current success rate is probably above 95% (including download failures and similar problems).
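The hints listed above can be combined into a simple scoring heuristic. The sketch below is an assumption about how such a heuristic might look, not the Extractor's actual algorithm: it collects text per block-level element and keeps blocks that are long and not dominated by link text (the thresholds are arbitrary illustrations):

```python
import re
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    # Collects the text of each block-level element and counts how
    # many characters sit inside links, so blocks can be scored by
    # length and link density.
    BLOCK_TAGS = {"p", "div", "article", "section"}

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (text, link_chars) pairs
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data)

    def flush(self):
        text = re.sub(r"\s+", " ", "".join(self._text)).strip()
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

def clean_text(html: str) -> str:
    parser = BlockCollector()
    parser.feed(html)
    parser.flush()
    # Keep blocks that are reasonably long and mostly non-link text:
    # a rough proxy for cohesive article prose (cut-offs are made up).
    keep = [t for t, links in parser.blocks
            if len(t) > 40 and links / len(t) < 0.5]
    return " ".join(keep)
```

A navigation bar full of short links scores poorly on both criteria and is dropped, while a long paragraph of prose survives.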
Once the text is found, it is then run through a natural language processing “pipeline” which does the following:
- Split up the text into single sentences.
- Split up the sentences into individual words and determine their type (noun, verb, etc.).
- Do some simple syntactic analysis.
- Find words in the sentence which are already present in Shaping Tomorrow’s database. In this way we can determine whether the sentence contains words which have a special meaning for us: authors, organizations, keywords, countries, dates, change- and future-related words.
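The last two pipeline steps can be illustrated with a toy annotator. The word lists here are tiny stand-ins for Shaping Tomorrow's curated database (the real lists are much larger), and the function names are invented for this sketch:

```python
import re

# Toy stand-ins for the database of special terms; purely illustrative.
FUTURE_WORDS = {"will", "transformation", "emerging"}
COUNTRIES = {"germany", "japan"}

def split_sentences(text):
    # Naive sentence splitting on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercased word tokens; real pipelines also tag part of speech.
    return re.findall(r"[A-Za-z']+", sentence.lower())

def annotate(text):
    # For each sentence, record which known special words it contains.
    result = []
    for s in split_sentences(text):
        words = set(tokenize(s))
        result.append({
            "sentence": s,
            "future_words": sorted(words & FUTURE_WORDS),
            "countries": sorted(words & COUNTRIES),
        })
    return result
```

Matching tokens against known word lists is what lets a later stage decide which sentences matter without any deep understanding of the text.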
In natural language, things are often not as regular as the Extractor would need them to be: sentence boundaries are unclear, words can have multiple meanings and there are about a thousand different ways to write something as seemingly simple as a date… but it is still possible to extract a surprising amount of meaningful information.
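The date problem gives a feel for this irregularity. A common tactic (assumed here, not necessarily what the Extractor does) is to try a list of known formats in turn and give up rather than guess wrong; the formats shown are only a small sample of those found in the wild:

```python
from datetime import date, datetime
from typing import Optional

# A handful of the many date spellings a news page might use.
DATE_FORMATS = ["%Y-%m-%d", "%d %B %Y", "%B %d, %Y", "%d.%m.%Y"]

def parse_date(text: str) -> Optional[date]:
    # Try each known format; return None if nothing matches,
    # rather than risk a wrong guess.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Note that month names via `%B` are locale-dependent, so a real system would also have to handle non-English pages.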
Getting to Results
When all sentences are processed, we can finally extract the parts which are important for the user:
- Metadata like publication date, author, countries, organizations, keywords, etc. Whereas date and author are unique, keywords, organizations, countries etc. are only taken if they appear above a certain frequency limit.
- Change- and future-relevant text parts which are then used to fill in the “Changes” and “Implications” sections of the Shaping Tomorrow user interface. To do that, each sentence is checked for its “future score”: does it contain temporal information (“next decade”), does it contain future/change-oriented words (“transformation”), and some more criteria…
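A future score of this kind might be sketched as follows. The keyword lists and weights below are invented for illustration; the real criteria are more numerous and tuned:

```python
import re

# Illustrative cue lists, not the Extractor's real vocabulary.
TEMPORAL = {"2030", "2050", "decade", "tomorrow", "soon"}
CHANGE = {"will", "transformation", "emerging", "disrupt", "grow"}

def future_score(sentence: str) -> int:
    # Count temporal and change cues; temporal cues weigh double
    # (an arbitrary choice for this sketch).
    words = set(re.findall(r"[a-z0-9']+", sentence.lower()))
    return 2 * len(words & TEMPORAL) + len(words & CHANGE)

def future_sentences(sentences, threshold=2):
    # Keep only sentences scoring at or above the cut-off.
    return [s for s in sentences if future_score(s) >= threshold]
```

Sentences below the threshold are simply not shown to the user, which keeps the “Changes” and “Implications” lists short.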
This last step is the most challenging of all, because here the software comes closest to something like “understanding” the text. It helps that we have a very specific focus on the future – the Extractor does not need to build a full model of the text’s contents, but only of the future-relevant parts. Currently the software achieves a success rate of around 80%, which is already a considerable help for the user – and it keeps users alert, since they still have to give final approval to the Extractor’s results.
Of course this is not the end of the road, but just the beginning – 80% is already a good result, but we can do better. Real improvements can be made if we make more use of semantics and general domain knowledge – if we put more knowledge into the system to start with. More about this and other ideas for the future development of the Extractor will be the topic of the next blog entry.