A tool-supported method for web sites reverse engineering

This article describes a method for reverse engineering web sites. It is composed of five processes: web pages classification, HTML cleaning, semantic enrichment, data/schema extraction and schemas integration.

Date: 30 September 2003

Expertises

Data Science

About project

CRAQ - Reverse

Web pages classification

Web pages within a site can be gathered into semantic groups according to their informational content. A page type is a set of pages that relate to the same concept. For example, all pages describing the departments of a company belong to the page type "Department".

HTML cleaning

All HTML pages that should be analysed are transformed into well-formed XML documents in order to allow easy parsing and extraction.
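
As an illustration of this step, the following is a minimal sketch of HTML cleaning that uses only the Python standard library. It is not the tool used in the project. The class HtmlToXml, the function clean_html and the synthetic "page" root element are names chosen for this example. The sketch closes unclosed elements, ignores stray closing tags, escapes text and drops comments, doctypes and any tag or attribute whose name is not a simple XML name. It does not apply the nesting rules a browser would (for instance, a second <p> is nested inside the first rather than closing it).

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Assumption: only plain names (no namespace prefix) are kept.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


class HtmlToXml(HTMLParser):
    """Re-serialises HTML events as well-formed XML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the XML output
        self.stack = []  # currently open elements

    def handle_starttag(self, tag, attrs):
        if not NAME_RE.fullmatch(tag):
            return  # skip tags that are not valid simple XML names
        seen, parts = set(), []
        for name, value in attrs:
            if name in seen or not NAME_RE.fullmatch(name):
                continue  # XML forbids duplicate attributes
            seen.add(name)
            # Valueless attributes (e.g. "disabled") get their name as value.
            parts.append(f" {name}={quoteattr(name if value is None else value)}")
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{''.join(parts)}/>")
        else:
            self.out.append(f"<{tag}{''.join(parts)}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.stack:
            return  # stray closing tag: ignore it
        while self.stack:  # close everything left open inside this element
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()  # flush buffered text first
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")


def clean_html(html_text, root_tag="page"):
    """Return a well-formed XML string wrapping the cleaned page."""
    parser = HtmlToXml()
    parser.feed(html_text)
    parser.close()
    return f"<{root_tag}>{''.join(parser.out)}</{root_tag}>"


if __name__ == "__main__":
    xml_text = clean_html("<p>Dept <b>R&amp;D<br><p>Next")
    ET.fromstring(xml_text)  # raises ParseError if not well-formed
    print(xml_text)
```

Wrapping the output in a single root element is a design choice of the sketch: it guarantees one document element even when the source page has text or several elements at the top level.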

Semantic enrichment

Before data and schema extraction, we need to know, for each page type, which concepts are displayed and where they are located in the HTML tree. For example, a page type "Department" will be composed of the concepts "Name", "Address" and "Activities". All semantic information provided by the user during this step is stored in an XML document called the META file.
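
The description above does not give the actual layout of the META file, so the following sketch only illustrates the idea. It assumes a hypothetical layout in which a "meta" element names the page type and each "concept" element records a concept name and a path locating it in the cleaned page. The element names, the attribute names and the function build_meta are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_meta(page_type, concepts):
    """Build a META document for one page type (hypothetical layout).

    concepts maps a concept name to the path where it appears
    in the cleaned page.
    """
    meta = ET.Element("meta", {"pageType": page_type})
    for name, path in concepts.items():
        ET.SubElement(meta, "concept", {"name": name, "path": path})
    return meta


# The "Department" page type from the text, with made-up locations.
meta = build_meta("Department", {
    "Name": "./body/h1",
    "Address": "./body/p[@class='address']",
    "Activities": "./body/ul/li",
})
meta_xml = ET.tostring(meta, encoding="unicode")
```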

Data and schema extraction

When the META file has been completed (i.e. it contains sufficient information to describe all the pages of the same type), it can be used to extract data from the HTML pages and to store that data in an XML document. This data-oriented document has to comply with an XML Schema that is also automatically extracted from the META file.
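
The extraction step can be sketched as follows, under stated assumptions. The META layout shown here (a "meta" element with "concept" children carrying "name" and "path" attributes) is hypothetical, as are the sample page and the function extract. Paths are limited to the XPath subset supported by xml.etree.ElementTree. Deriving the XML Schema from the META file is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical META file for the "Department" page type.
META = """<meta pageType="Department">
  <concept name="Name" path="./body/h1"/>
  <concept name="Address" path="./body/p[@class='address']"/>
  <concept name="Activities" path="./body/ul/li"/>
</meta>"""

# A made-up page, already cleaned into well-formed XML.
PAGE = """<html><body>
  <h1>Research</h1>
  <p class="address">12 Example Street</p>
  <ul><li>Databases</li><li>Web engineering</li></ul>
</body></html>"""


def extract(meta_xml, page_xml):
    """Return a data-oriented element built from one cleaned page."""
    meta = ET.fromstring(meta_xml)
    page = ET.fromstring(page_xml)
    record = ET.Element(meta.get("pageType"))
    for concept in meta.findall("concept"):
        # One output element per node matched by the concept's path.
        for node in page.findall(concept.get("path")):
            item = ET.SubElement(record, concept.get("name"))
            item.text = "".join(node.itertext()).strip()
    return record


record = extract(META, PAGE)
print(ET.tostring(record, encoding="unicode"))
```

A concept whose path matches several nodes, such as "Activities" here, yields one element per match, which is one simple way to represent multi-valued concepts.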

Schemas integration and conceptualisation

If a site comprises several page types, their XML Schemas must be integrated into a single logical schema, which can then be conceptualized to represent the application domain covered by the whole web site.
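
The text does not say how integration is performed. As a very rough sketch of one possible first pass, the code below gathers the concepts of each page type into one structure and lists concept names that occur in more than one page type, as candidates for an analyst to review. The function integrate, the dictionary representation and the "Staff" page type are assumptions made for this example. A shared name does not by itself prove that two concepts are the same.

```python
def integrate(page_type_schemas):
    """Merge per-page-type concept lists into one structure.

    page_type_schemas maps a page type to its list of concept names.
    Returns the integrated mapping and the concept names shared by
    several page types (candidates for review, not confirmed matches).
    """
    integrated = {ptype: list(concepts)
                  for ptype, concepts in page_type_schemas.items()}
    owners = {}
    for ptype, concepts in page_type_schemas.items():
        for concept in dict.fromkeys(concepts):  # ignore repeats in a type
            owners.setdefault(concept, []).append(ptype)
    shared = {c: types for c, types in owners.items() if len(types) > 1}
    return integrated, shared


schemas = {
    "Department": ["Name", "Address", "Activities"],
    "Staff": ["Name", "Department"],  # hypothetical second page type
}
integrated, shared = integrate(schemas)
```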


The web site reverse engineering method

See the paper submitted for WSE 2003