XMLBase - Semi-Structured Data Management (Gestão de Dados Semi-estruturados) was a project that researched methods for the analysis, design and implementation of systems that manage semi-structured data distributed over the Internet.
XMLBase is a component-based framework for indexing and searching collections of XML (and HTML) documents. The framework has been used to run multiple performance measurements comparing strategies for storing, versioning, indexing and querying XML data collections.
This provided a validation environment, upon which we built a prototype Web application (the Tumba! search engine) that we used as a benchmark for comparing various alternative strategies for collecting, storing, indexing and querying Web data.
XMLBase has 6 main components:
- Versus: a Web meta-data manager, with versioning and parallel loading/updating capabilities.
- WebStore: a repository for Web contents.
- WebCAT: a Web Contents Analysis Tool.
- ViúvaNegra: a web crawler built upon Versus and WebCAT.
- SIDRA: a text indexing and ranking system for Web pages.
- XSuga: a combined XQuery/text engine for information selection from web data and meta-data.
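To make the roles in the list above more concrete, here is a minimal Python sketch of three ideas it mentions: keeping successive versions of a document, indexing its text, and restricting a text query by document structure. It is a toy under stated assumptions and not the XMLBase implementation. The class name `ToyXmlRepository`, its methods, and the example URL are all hypothetical, and the real components (Versus, SIDRA, XSuga) are not modelled.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


class ToyXmlRepository:
    """Hypothetical in-memory store (not an XMLBase API).

    Keeps every stored version of each document and an inverted index
    from terms to (url, version, element tag) postings.
    """

    def __init__(self):
        self.versions = defaultdict(list)  # url -> list of XML strings
        self.index = defaultdict(set)      # term -> {(url, version, tag)}

    def store(self, url, xml_text):
        """Add a new version of the document at `url` and index its text."""
        root = ET.fromstring(xml_text)
        self.versions[url].append(xml_text)
        version = len(self.versions[url])  # versions are numbered from 1
        for element in root.iter():
            # Only the text directly inside each element is indexed;
            # tail text after child elements is ignored in this sketch.
            for term in (element.text or "").lower().split():
                self.index[term].add((url, version, element.tag))
        return version

    def search(self, term, tag=None):
        """Return sorted postings for `term`, optionally limited to one tag."""
        hits = self.index.get(term.lower(), set())
        return sorted(h for h in hits if tag is None or h[2] == tag)


repo = ToyXmlRepository()
repo.store("http://example.org/a",
           "<page><title>Lisbon news</title><body>weather in Lisbon</body></page>")
repo.store("http://example.org/a",
           "<page><title>Porto news</title><body>weather in Porto</body></page>")

# "lisbon" occurs only in version 1, in both the title and the body.
print(repo.search("lisbon"))
# "news" occurs in the title of both versions.
print(repo.search("news", tag="title"))
```

Storing the element tag in each posting is one simple way to combine a text query with a structural restriction. A system at Web scale would need persistent storage, ranking, and a real query language such as XQuery, none of which this sketch attempts.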
Some of the software developed under XMLBase is used as part of Tumba!, a search engine for the Portuguese Web.
Our research included an implementation and a comparative evaluation of the alternatives for integrating and querying heterogeneous data across large-scale networks.
- A specification of a model for an XML data repository, with meta-data describing the organization and supporting versioning of the managed information.
- Query processing strategies for XML data repositories supporting XQuery and XSLT.
- POSI / SRI / 40193 / 2001
Daniel Gomes, André Santos, Mário J. Silva, Managing duplicates in a web archive. Proceedings of the 21st Annual ACM Symposium on Applied Computing (ACM-SAC-06), April 2006.
Miguel Costa, Mário J. Silva, Indexação Distribuída de Colecções Web de Larga Escala [Distributed Indexing of Large-Scale Web Collections]. IEEE Latin America Transactions, Special JISBD issue, 2005.
Miguel Costa, SIDRA: a Flexible Web Search System. Master Thesis, University of Lisbon, Faculty of Sciences, November 2004. Also available as Technical Report DI/FCUL TR 4-17.
Miguel Costa, Mário J. Silva, Distributed Index Creation of Large Scale Web Collections in the Sidra System. JISBD'2004, IX Jornadas de Ingeniería del Software y Bases de Datos, Málaga, November 2004.
Miguel Costa, Mário J. Silva, Optimizing Ranking Calculation in Web Search Engines: a Case Study. SBBD 2004, 19º Simpósio Brasileiro de Banco de Dados, Brasilia, October 2004.
Miguel Costa, Mário J. Silva, Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search. JISBD 2003, VIII Jornadas de Ingeniería del Software y Bases de Datos, Alicante, Spain, November 2003.
João P. Campos, Versus: a Web Data Repository with Time Support. Master Thesis, University of Lisbon, Faculty of Sciences, May 2003. Also available as Technical Report DI/FCUL 03-8.
Daniel Gomes, João P. Campos, Mário J. Silva, Versus: A Web Repository. WDAS-2002: Workshop on Distributed Data Structures, March 2002.
João P. Campos, Mário J. Silva, Versus: A Model for a Web Repository. CRC'01, 4ª Conferência de Redes de Computadores, Covilhã, Portugal, November 2001.