International Conference on Knowledge Discovery and Information Retrieval 2009

Ontological Warehousing on Semantically Indexed Data: Reusing Semantic Search Engine Ontologies to Develop Multidimensional Schemas

Filippo Sciarrone, Paolo Starace
f.sciarrone@openinformatica.org, p.starace@openinformatica.org

Business Intelligence Division, Open Informatica srl, Via dei Castelli Romani 12/A, Pomezia, Italy

ABSTRACT: In this poster we present a first experiment with a Business Intelligence solution that dynamically develops multidimensional OLAP schemas by reusing ontologies stored in concept and relation dictionaries and used by semantic indexing engines. The distinctive aspect of the proposed solution is the integration of ontology-based semantic indexing techniques for unstructured documents with dynamic management techniques for unbalanced hierarchies in a Data Warehouse.

The Two-Step Indexing Process: We adapted our solution to a pre-existing system, implemented to execute a two-step indexing process of unstructured documents. During the first step, from the unstructured docs layer to the terms set layer, the engine indexed each document, thus obtaining a set of index terms. In the second step, from the terms set layer to the ontologies layer, these terms were contextualized and associated with the concepts of predefined ontologies. In order to integrate our module with the aforesaid system, the following assumptions were imposed:
• The concepts included in the dictionary were exclusively linked by hypernymy and hyponymy relations;
• Each ontology was based on a hierarchic structure.
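The two steps can be illustrated with a minimal Python sketch. It is not the pre-existing engine: the stopword list, the `TERM_TO_CONCEPT` dictionary and both function names are assumptions made for illustration, and the real contextualization step is certainly richer than a dictionary lookup.

```python
import re

# Hypothetical concept dictionary: index term -> ontology concept.
TERM_TO_CONCEPT = {
    "olap": "Business Intelligence",
    "warehouse": "Data Warehouse",
    "ontology": "Semantic Web",
}

# Hypothetical stopword list used to discard non-informative tokens.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "on", "for", "with"}


def extract_terms(document: str) -> set[str]:
    """Step 1: unstructured docs layer -> terms set layer."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return {t for t in tokens if t not in STOPWORDS}


def map_to_concepts(terms: set[str]) -> set[str]:
    """Step 2: terms set layer -> ontologies layer (plain lookup here)."""
    return {TERM_TO_CONCEPT[t] for t in terms if t in TERM_TO_CONCEPT}


doc = "An ontology for the OLAP warehouse"
terms = extract_terms(doc)
concepts = map_to_concepts(terms)
```

Terms that have no entry in the dictionary are simply dropped in this sketch; how the actual engine treats them is not stated in the poster.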

The Star-Schema with Bridge Table: The adopted solution places a bridge table between the concept dimension and the facts. The goal of the bridge table is to help the OLAP engine aggregate data more quickly. The measure to be aggregated is the number of persons referring to each concept. The resulting fact table is therefore a factless fact table, because it carries no additive value.
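A possible shape for such a schema is sketched below with Python's built-in `sqlite3` module. All table names, column names and sample rows are hypothetical; the poster does not give the actual DDL. The fact table holds only keys, so the measure is obtained by counting persons through the bridge table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_dim (concept_id INTEGER PRIMARY KEY, name TEXT);
-- Bridge table: one row per (ancestor, descendant) pair, self pairs included.
CREATE TABLE concept_bridge (
    ancestor_id INTEGER, descendant_id INTEGER, distance INTEGER);
-- Factless fact table: one row per (person, concept) reference, no measure.
CREATE TABLE person_concept_fact (person_id INTEGER, concept_id INTEGER);
""")

conn.executemany("INSERT INTO concept_dim VALUES (?, ?)",
                 [(1, "Vehicle"), (2, "Car"), (3, "Bicycle")])
conn.executemany("INSERT INTO concept_bridge VALUES (?, ?, ?)",
                 [(1, 1, 0), (2, 2, 0), (3, 3, 0), (1, 2, 1), (1, 3, 1)])
conn.executemany("INSERT INTO person_concept_fact VALUES (?, ?)",
                 [(10, 1), (11, 2), (12, 2), (13, 3)])

# Count the persons referring to each concept or to any of its descendants.
rows = conn.execute("""
SELECT d.name, COUNT(DISTINCT f.person_id) AS persons
FROM concept_dim d
JOIN concept_bridge b ON b.ancestor_id = d.concept_id
JOIN person_concept_fact f ON f.concept_id = b.descendant_id
GROUP BY d.concept_id, d.name
ORDER BY d.concept_id
""").fetchall()
```

With this sample data, `rows` is `[("Vehicle", 4), ("Car", 2), ("Bicycle", 1)]`. `COUNT(DISTINCT ...)` is used so that a person referring to several descendants of the same concept is counted once.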

The Resulting Pivot Table: The Pentaho BI suite presents the data processed by the Mondrian OLAP engine through JSP libraries that generate pivot tables for navigating multidimensional cubes. The sum of the leaf values does not always equal the value of the top node, because there may be references that point only to the parent node; these must be added to the sum.
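The gap between the leaf sum and the top-node value can be shown with a small worked example in Python. The ontology fragment and the counts are invented for illustration, and the sketch assumes each person refers to exactly one concept, so counts can simply be added.

```python
# Hypothetical ontology fragment: parent -> children.
CHILDREN = {"Vehicle": ["Car", "Bicycle"], "Car": [], "Bicycle": []}

# Hypothetical number of persons referring directly to each concept.
DIRECT = {"Vehicle": 1, "Car": 2, "Bicycle": 1}


def rolled_up(concept: str) -> int:
    """Direct references plus the rolled-up totals of all children."""
    return DIRECT[concept] + sum(rolled_up(c) for c in CHILDREN[concept])


def leaves_under(concept: str) -> list[str]:
    """All leaf concepts below (or equal to) the given concept."""
    kids = CHILDREN[concept]
    if not kids:
        return [concept]
    return [leaf for k in kids for leaf in leaves_under(k)]


leaf_sum = sum(DIRECT[c] for c in leaves_under("Vehicle"))
top = rolled_up("Vehicle")
```

Here `leaf_sum` is 3 while `top` is 4: the difference is the one person who refers to "Vehicle" itself and to none of its children.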

An Example Ontology: To ensure hierarchic navigation, the ontology must be brought back to a tree structure. The presence of navigable cycles in the structure is ruled out, even theoretically, by the type of relation existing between the concepts, i.e., the part-of relation. One must therefore consider the management of Directed Acyclic Graphs (DAGs).
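The poster does not say how a DAG is brought back to a tree. One common option, shown here purely as an assumption, is to verify acyclicity and then unfold the DAG by giving a concept with several parents one copy per root-to-concept path. The sketch uses `graphlib` from the Python standard library (Python 3.9 or later); the sample concepts and function names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical DAG: concept -> list of parent concepts.
PARENTS = {
    "Vehicle": [],
    "Car": ["Vehicle"],
    "Toy": [],
    "Toy Car": ["Car", "Toy"],
}


def is_acyclic(parents: dict[str, list[str]]) -> bool:
    """True if the concept graph contains no cycle."""
    try:
        # static_order() raises CycleError while iterating if a cycle exists.
        tuple(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


def paths_to_roots(concept: str, parents: dict[str, list[str]]) -> list[list[str]]:
    """Every root-to-concept path; each path becomes one branch of the tree."""
    if not parents[concept]:
        return [[concept]]
    return [path + [concept]
            for p in parents[concept]
            for path in paths_to_roots(p, parents)]


assert is_acyclic(PARENTS)
branches = paths_to_roots("Toy Car", PARENTS)
```

In this example "Toy Car" appears in two branches, `["Vehicle", "Car", "Toy Car"]` and `["Toy", "Toy Car"]`, so the unfolded structure is a tree (or forest) that can be navigated hierarchically.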

The Overall Process: We developed a custom ETL module in order to integrate semantically indexed data with operational data. The custom ETL process performs the following operations:
(1) Builds the ontological tree, extracting it from the dictionary;
(2) Defines the dimension table;
(3) Inserts the ontology nodes;
(4) Defines the bridge table;
(5) Inserts a record for each identifiable path on the tree, along with the distance between the relevant nodes (including the zero-length path from a concept to itself).
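The five operations can be sketched in plain Python as follows. This is not the authors' ETL module: the `DICTIONARY` extract, the surrogate-key scheme and the function name `build_tables` are assumptions, and the input is assumed to already be a tree (each concept has at most one parent and there are no cycles).

```python
# Hypothetical dictionary extract: (child concept, parent concept) pairs.
DICTIONARY = [("Car", "Vehicle"), ("Bicycle", "Vehicle"), ("Sports Car", "Car")]


def build_tables(pairs):
    # (1) Build the ontological tree as a child -> parent map.
    parent = dict(pairs)
    nodes = sorted(set(parent) | set(parent.values()))
    # (2)-(3) Define the dimension table and load one row per ontology node.
    dimension = {name: key for key, name in enumerate(nodes, start=1)}
    # (4)-(5) Define the bridge table and load one row per path:
    # (ancestor key, descendant key, distance), zero-length path included.
    bridge = []
    for node in nodes:
        ancestor, distance = node, 0
        while ancestor is not None:
            bridge.append((dimension[ancestor], dimension[node], distance))
            ancestor, distance = parent.get(ancestor), distance + 1
    return dimension, bridge


dimension, bridge = build_tables(DICTIONARY)
```

For the four sample concepts the bridge table receives eight rows: four zero-length rows, three rows of distance 1, and one row of distance 2 from "Vehicle" to "Sports Car".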

Case Study

CONCLUSIONS and FUTURE WORK: The defined process is currently stable and yields positive results in a company environment. The fact-definition process can be improved by extending the logic to the join base of the data. In order to provide a complete BI service, the system must be able to perform several types of aggregations, not just the basic ones. For the future we plan to enhance indexed data management with the introduction of a cache-based engine, and to resolve problems related to the management of many-to-many relations.