Semantic analysis

I know it is a little unfair to talk about features that are not yet available in the downloadable versions, but I would like to share the thinking behind our experimental semantic features.

The “mechanical” recognition and extraction algorithms used in most views of the Hub are mostly based on a combination of DOM analysis (when dealing with HTML pages) and morphological recognition of objects and strings. These techniques are very efficient for simple data scraping, but they are not sufficient when we need to selectively extract data about certain themes or topics. We are currently adding semantic capabilities to our extractors (in professional applications only, for now).

At the moment, we are focusing only on statistical analysis of words and phrases, without performing any syntactic analysis of the texts. However, the results are very promising and seem to confirm our original ideas.

It is rather easy, for instance, to find documents about the Sun on Google. It becomes much more difficult, though, when you only want documents that treat the subject from, say, a medical point of view rather than giving an astronomical or physical description of the Sun.

Of course, querying ‘sun’ and ‘health’ will give better results in Google, but it will only find documents that contain both of these words, which is not really what we want. To make sure we find the most pertinent pages, the best Google query would probably be something like:

sun OR solar OR sunshine (…) health OR cancer OR skin OR pigmentation OR medicine (…)
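
To make the shape of such a query concrete, here is a minimal Python sketch that assembles it from groups of related words. The function name `build_query` and the word lists are purely illustrative assumptions; the point is that every synonym has to be supplied by hand.

```python
def build_query(*groups):
    """Join each group of related words with OR, then join the groups with spaces.

    Hypothetical helper for illustration only: it just mirrors the
    hand-written query shown above.
    """
    return " ".join(" OR ".join(terms) for terms in groups)


# Illustrative word lists; a real query would need many more terms.
subject = ["sun", "solar", "sunshine"]
angle = ["health", "cancer", "skin", "pigmentation", "medicine"]

print(build_query(subject, angle))
```

The longer the lists get, the more the query depends on the user guessing the right vocabulary in advance.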

We believe that semantic analysis of the pages is a more effective approach than looking for a mere series of keywords. In our system, the whole vocabulary of the document is compared to a large theme ontology, making the main topics stand out, even if the text doesn’t include the exact words that came to the user’s mind.
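
As a rough illustration of the idea, and not of our actual implementation, the following Python sketch scores a document against a tiny hand-made theme table using simple word counts. The `THEMES` dictionary, the function names and the scoring rule are all assumptions made for this example; a real theme ontology would be far larger and more structured.

```python
import re
from collections import Counter

# Hypothetical miniature "ontology": each theme maps to a set of related words.
THEMES = {
    "health": {"skin", "cancer", "melanoma", "sunburn", "vitamin",
               "dermatologist", "uv", "exposure"},
    "astronomy": {"star", "planet", "orbit", "corona", "fusion",
                  "galaxy", "telescope"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def theme_scores(text, themes=THEMES):
    """Return, for each theme, the share of the document's words that belong to it."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {
        theme: sum(counts[term] for term in terms) / total
        for theme, terms in themes.items()
    }


def main_theme(text, themes=THEMES):
    """Return the best-scoring theme, or None if no theme word occurs at all."""
    scores = theme_scores(text, themes)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


doc = ("Too much exposure to the sun can cause sunburn and raises the risk "
       "of melanoma, so a dermatologist may advise limiting UV exposure.")
print(main_theme(doc))
```

Note that the sample text never contains the word "health", yet the theme still stands out because of the surrounding vocabulary. This is purely statistical: no syntactic analysis is involved.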

However big the search engine’s cloud of servers may be, we also believe that the processing we can include in a client application would be much too demanding if it had to be executed on the server side. This is why we strongly believe that using the client’s CPU for in-depth analyses is a pertinent approach.


