Automating the Shaping of Metadata Extracted from a Company Website with Open Source Tools

Robert Viseur, Automating the Shaping of Metadata Extracted from a Company Website with Open Source Tools, International Journal of Advanced Computer Science and Applications (IJACSA), Special Issue on Natural Language Processing 2014

Abstract

As part of a market analysis process, the objective was to automate the identification of the activities and skills of a collection of enterprises, namely Belgian and French open source companies. To avoid manual annotation through visual inspection of website content, a tool chain was developed to collect the content of the websites and extract the important terms. Standard software libraries were identified to clean up HTML documents and to perform the part-of-speech tagging used for terminology extraction. This procedure is supplemented by the extraction and recognition of named entities. The terms extracted from the HTML pages of a company website were then merged and filtered, and a circular tag cloud was generated. This presentation facilitates the identification of important terms, which commonly correspond to the activities and technologies supported by the company. Several changes are planned for this prototype, including, in particular, support for texts in French, the mapping of extracted terms to the vocabulary of a classification scheme, and the automatic generation of dashboards to facilitate monitoring the evolution of the industrial sector.
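The clean-up, extraction, merging and filtering steps described above can be illustrated with a minimal Python sketch. It is not the tool chain used in the study: it relies only on the standard library, and a small stopword list stands in for the part-of-speech tagging step as a deliberate simplification. The names `TextExtractor`, `STOPWORDS`, `html_to_text`, `extract_terms`, `merge_site_terms` and the `min_count` parameter are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


# Assumption: a tiny stopword list replaces part-of-speech filtering here.
STOPWORDS = {"the", "and", "for", "with", "our", "of", "in", "to", "is"}


def html_to_text(html):
    """Clean-up step: reduce an HTML document to its text content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def extract_terms(text):
    """Extraction step: lowercase tokens, minus stopwords and short words."""
    words = re.findall(r"[a-z][a-z0-9+#-]*", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def merge_site_terms(pages, min_count=2):
    """Merge and filter step: count terms over all pages of one website
    and keep those seen at least min_count times."""
    counts = Counter()
    for html in pages:
        counts.update(extract_terms(html_to_text(html)))
    return {term: n for term, n in counts.items() if n >= min_count}
```

Calling `merge_site_terms` on the list of HTML pages collected from one website returns a dictionary mapping each retained term to its frequency. Such frequencies could then drive the font size of each term in a tag cloud, although the rendering itself is outside the scope of this sketch.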