The Internet has become a rich and invaluable source of information for businesses. To stay competitive, companies have to cope with this large volume of unstructured information. In this context, the CRAQ-Reverse project aims to provide a tool-supported methodology for web data extraction (also called web wrapping).
Web wrapping techniques transform unstructured web data sources into structured, semantically rich content that computers can interpret more easily and process automatically.
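As an illustration of the idea, the following minimal Python sketch applies a small set of extraction rules to an HTML fragment and returns a structured record. It is not Retroweb's rule format or implementation: the rule layout (field name mapped to a tag and a CSS class), the `RuleBasedWrapper` class, the `wrap` function and the sample page are all assumptions made for this example, and nested elements of the same tag are not handled.

```python
from html.parser import HTMLParser

# Hypothetical extraction rules: field name -> (tag, CSS class).
RULES = {
    "name": ("h2", "product-name"),
    "price": ("span", "price"),
}


class RuleBasedWrapper(HTMLParser):
    """Collects the text of elements that match an extraction rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.record = {}
        self._field = None  # field currently being captured, if any
        self._tag = None    # tag that opened the current capture

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for field, (rule_tag, rule_class) in self.rules.items():
            if tag == rule_tag and rule_class in classes:
                self._field = field
                self._tag = tag
                self.record[field] = ""

    def handle_data(self, data):
        # Accumulate text only while inside a matching element.
        if self._field is not None:
            self.record[self._field] += data

    def handle_endtag(self, tag):
        # Close the capture and trim surrounding whitespace.
        if self._field is not None and tag == self._tag:
            self.record[self._field] = self.record[self._field].strip()
            self._field = None
            self._tag = None


def wrap(html, rules=RULES):
    """Apply the extraction rules to an HTML string and return a dict."""
    parser = RuleBasedWrapper(rules)
    parser.feed(html)
    parser.close()
    return parser.record


# Invented sample page, used only to exercise the sketch.
PAGE = """
<html><body>
  <div class="product">
    <h2 class="product-name">Laboratory scale</h2>
    <span class="price"> 149.00 EUR </span>
  </div>
</body></html>
"""

record = wrap(PAGE)
```

Here the free-form page is reduced to a record with the named fields `name` and `price`, which a program can then store or query directly.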
The research team has developed Retroweb, a tool that generates extraction rules for web data sources (mostly web pages). The main benefit of Retroweb is its graphical interface for analysing web pages and extracting data, which makes the tool easy to use even for non-technical users. The generic approach adopted by CETIC allows Retroweb to be used in many contexts and applications: customised search engines, migration of (semi-)static web sites, toolboxes for competitive intelligence, etc. Technically, Retroweb is a Java 6 application based on the Eclipse framework; it uses the Firefox rendering engine to display HTML data.
The team has also developed strong expertise in document management and search engines. They have created a toolbox for crawling documents, extracting text from any common format (doc, pdf, html, rtf, ppt, etc.), and indexing document content.
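The indexing step of such a toolbox can be sketched as a small inverted index. This is only an illustrative Python sketch, not CETIC's toolbox: the function names `tokenize`, `build_index` and `search`, the document identifiers and the sample texts are assumptions, and the text is taken to have already been extracted from the original files.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented sample corpus: document id -> already extracted text.
DOCS = {
    "report.pdf": "Web data extraction for competitive intelligence",
    "notes.doc": "Search engines index document content",
    "page.html": "Extraction rules for web pages",
}

index = build_index(DOCS)
hits = search(index, "web extraction")
```

With this corpus, the query "web extraction" matches the two documents that contain both terms; a real search engine would add ranking, stemming and persistent storage on top of this structure.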
The project ended in mid-2008 with several positive achievements. The wide range of targeted applications has led to several missions in the fields of eHealth, document management, chemistry, and database management systems. Starting from a research prototype, Retroweb has been developed into a fully functional, finalised product. To encourage use of the tool, documentation has also been a major focus.
In the CRAQ-Reverse project, CETIC acts as project leader and R&D provider. CETIC provides the tool-supported methodology and transfers its know-how to local SMEs according to their specific needs.
The team's expertise in web data extraction, search engines and knowledge management has led to missions in a wide range of application domains. Besides developing its own tool for web data extraction, CETIC has notably implemented Illicopresto (Agoria), a web search engine focused on innovation in Wallonia, and ArcheWeb (DocLedge), a toolbox for competitive intelligence over the Internet.