Building Reference Ontologies from B2B XML Schema Files

Ivan Bedini 1, Benjamin Nguyen 2, Georges Gardarin 2

1 Orange Labs, 42 Rue des Coutures, 14000 Caen, France [email protected]

2 PRiSM Laboratory, University of Versailles, 45 Avenue des Etats-Unis, 78035 Versailles, France {benjamin.nguyen, georges.gardarin}@prism.uvsq.fr

Abstract. Building a reference ontology remains a hard manual task, sometimes assisted by software tools that facilitate information extraction from a textual corpus. Despite the widespread use of XML Schema files on the Internet, especially in the B2B domain, tools that offer a complete semantic analysis of XML schemas are rare. Such an analysis could be used to automatically produce a first-level conceptual model that can then be enriched, manually or automatically, using annotations. In this paper we report results from our experience of building a reference ontology from a consistent collection of XML Schema files defined by 23 B2B standards bodies. We show that several problems need to be solved before a complete and automatic tool can be developed; nevertheless, we have received interesting feedback from users that should encourage research in this direction. Our main contributions are the presentation of a new methodology, the introduction of an OWL-compatible concept model, and the exploitation of text mining techniques for ontology construction.

Keywords: e-business, Semantic Web, B2B, automatic ontology construction

1. Introduction

Over the past ten years, the Semantic Web wave has given a new vision of ontology usage for application integration systems. Researchers have produced several software tools for building ontologies [OntoEdit, Protégé, …] and for merging them pairwise [Chimaera, Prompt, Mafra, ...]. The goal is to produce an integrated ontology for a given domain, which is a useful tool to understand and map data exchanged by heterogeneous applications. More recently, developers have focused their efforts on ontology alignment systems [S-Match, Cupido, H-Match, …]; some results are shown in [OAEI]. Alignment techniques have overcome some of the complexities that appear when merging two ontologies and have highlighted the need for high-quality matching. The next step in this research field seems to be the new generation of Semantic Web applications proposed by Motta and Sabou [8]. The vision expressed by these authors targets a more ambitious integration of heterogeneous data, applied to more than two ontologies at a time. This approach better meets our requirements, as it integrates the advantages of using background knowledge and highlights the need for run-time, rather than design-time, semantic applications, i.e. applications capable of integrating data at any time in the application interoperability process, as opposed to integration performed when deciding on mapping rules between a fixed set of input ontologies.

Nevertheless, the approach of Motta and Sabou lacks two key elements for adoption in run-time application integration. The first is a methodology for building reference background knowledge from different ontology sources (often not expressed in RDF/OWL or similar, but as comments or free-text annotations); the second is a model capable of facilitating run-time matching of concepts. In fact, when integrating two or more knowledge sources, the main difficulties lie in (i) discovering possible relationships and (ii) validating the selected alignments. The former requires considerable computation time and several techniques in order to find possible relationships, while the latter demands human intervention or an existing reference knowledge representation.

The aim of this paper is to introduce and illustrate a new tool based on a simple semantic data model for building integrated knowledge. As we show, this data model is able to maintain a collective memory resource that facilitates meetings and collaboration for matching concepts in a given domain, and thus provides a solution to the difficulties mentioned above. We also introduce the construction of a reference ontology generated from a collection of XML schemas as a view of the semantic data model. The tool takes a collection of XML standards (e.g. UBL, OAGIS, …) as input and proposes an integrated ontology from them using related knowledge and text mining techniques.

2. The Semantic Data Model

The benefits of using background knowledge for ontology mapping and generation have been highlighted in [18] and already demonstrated in [2]. Another important issue is the time performance of data integration. As already pointed out in [2, 3], classical ontology alignment and mapping approaches focus on alignment precision and recall, which is of course the primary objective when mapping ontologies. However, they lack efficiency. This can be explained by three main reasons: (i) the computational complexity of the algorithms, as already discussed in [3]; (ii) the fact that algorithms compute measures between every pair of items of the two ontologies, even when they have nothing in common (like looking for similarities between “umbrella and sewing machine”¹); (iii) the lack of memorisation, which means that a comparison is performed every time two terms are met (like a “Sisyphean task”²), regardless of what has already been calculated. For these reasons, if we want to apply run-time mappings to data integration without a loss of efficiency, a model for representing and memorizing relationships between concepts is needed. In the following, we introduce our semantic model, called the Concept Model, and discuss its relationships to OWL and UML.

2.1. Concept Context

The aim of the Concept Model (CM) is to provide an efficient knowledge representation for automatic data integration. As in cognitive science, it is concerned with how information is stored and processed so that programs can exploit it and achieve the verisimilitude of human reasoning. Naturally, providing a complete model able to cover every possible existing mapping is not realistic a priori; we therefore focus here on mapping data for automatic integration. Flexibility and simplicity remain primary requirements. The concept model, which is the internal data representation of our tool, is able to store several kinds of information about the mappings between concepts. It is also extensible. The final resolution of the best match between ontology elements depends on the context of the instances to be integrated, and is therefore computed at execution time. Figure 1 provides a graphical view of all the types of relationships and information stored for a concept.

¹ Comte de Lautréamont, Les Chants de Maldoror, VI, Roman, 1869.

² In Greek mythology Sisyphus was compelled to roll a huge rock up a steep hill, but before he reached the top of the hill, the rock always escaped him and he had to begin again (Odyssey, xi. 593).

[Figure 1: diagram centred on a Concept node. Labels appearing in the diagram: Structural, Syntax, Semantic, Properties, PropertyOf, hasDataTypes, Properties Lattice, Stems, InstanceOf, Source, N-Grams, Abbreviations, RelatedTo, Words Lattice, Synonyms.]

Figure 1 - Concept Model definition

A concept c is defined as a tuple of values made up of a label l, a set of structural relationships Hc, a set of relations Rc and a set of originating instances I, where:
• c is the basic element of the model. In OWL it is an instance of “rdfs:Class”. In UML it is a class as well as an attribute of a class.
• The label l is a common word (simple or compound) that best represents the concept, but it is neither an invariable nor a unique characteristic. In OWL, as in UML, it is simply the element name.
• Hc is the set of structural relationships, which correspond to the subsumption hierarchy. In OWL, these are binary relations corresponding to “rdfs:subClassOf” for “PropertyOf” and to “rdf:Property” for properties. In addition, if the target “rdf:Property” is a leaf of the concept sub-tree, it is considered an “owl:DatatypeProperty”; otherwise it is considered an “owl:ObjectProperty”. In UML they are similar to class attributes or associations between classes. For example, if a concept Address has a concept Street and is at the same time a property of a class Person, then in UML it can be modelled as two classes, Person and Address, with an association named Residence between them, and Street as an attribute of Address. In OWL the same Address is a sub-class of Person and Person has Address as an object property named Residence, while Street is a datatype property of Address. This kind of relationship also allows a first basic approach to context discovery.
• Rc is the set of relations, which is itself partitioned into two subsets: the set of all assertions in which the relation is a semantic relation, and the set of all assertions in which the relation is a non-semantic relation. Structural relations could also be considered semantic relationships, similarly to the WordNet approach with meronymy and hyponymy relationships; however, for better compatibility with existing modelling notations and greater expressivity, it is preferable to keep these two types separate. In OWL synonyms are expressed as “owl:equivalentClass”. In UML no real correspondence exists.
• I is the set of originating instances of a concept. In OWL an instance i ∈ I may be an instance of a class c ∈ C via “rdf:type”.
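The concept tuple described above can be pictured as a small record type. The following is a minimal sketch only: the class name, the field names and the split of Hc into two sets are assumptions made for illustration, not the internal representation used by the tool.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One node of the Concept Model (all field names are illustrative)."""
    label: str                                      # l: most representative word
    properties: set = field(default_factory=set)    # Hc: concepts this one has
    property_of: set = field(default_factory=set)   # Hc: concepts containing it
    semantic: dict = field(default_factory=dict)    # Rc, semantic relations
    syntactic: dict = field(default_factory=dict)   # Rc, non-semantic relations
    instances: set = field(default_factory=set)     # I: originating elements

    def owl_property_kind(self) -> str:
        """Leaves map to datatype properties, other nodes to object properties."""
        if not self.properties:
            return "owl:DatatypeProperty"
        return "owl:ObjectProperty"


# The Person / Address / Street example from the text.
person = Concept("Person", properties={"Address"})
address = Concept("Address", properties={"Street"}, property_of={"Person"})
street = Concept("Street", property_of={"Address"})
```

In this sketch Street, being a leaf, is reported as a datatype property, while Address is reported as an object property, which mirrors the OWL reading given in the text.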

The special relationship marked “RelatedTo” in the concept model stands for relations between concepts that have previously been merged. In OWL it becomes “owl:sameAs”.

2.2. FCA Lattices Applications

Relationships based on Galois lattices have already been adopted for representing concept context in [12], with the associated Formal Concept Analysis (FCA) detailed in [13]. However, the approach followed in this paper is different, because ontology merging is not the aim of this work. The two lattices capture relationships that are particularly useful for discovering semantic and structural affinities between concepts. The first, the Words Lattice, represents structural relations between compound terms. The second, the Groups of Properties Lattice, builds frequent item-sets of concept properties, which make it possible to unveil not only similarities between concepts and m:n property mappings, but also concept polysemy. In this sense, concepts are not actual nodes of the lattices but rather elements linked to nodes of the graph. Following the FCA definition of a concept lattice, for the Words Lattice of the Concept Model the least upper bound (join) is represented by the set of all terms composing the label of a concept, while the greatest lower bound (meet) is represented by the set of all distinct words. For the Properties Lattice, the join is represented by the set of all groups of properties (where a group is composed of the properties of a concept), while the meet is represented by the set of all distinct concepts.

3. Mapping XML Schema to Semantic Graphs

Despite the large number of XML files available, current tools are mostly able to extract semantics only from text corpora or ontologies. Tools that can build ontologies from these files do exist, such as MAFRA [9] and PROMPT [10], but they treat an XML file as an ontology in itself rather than as an input resource to be mined. In this section we therefore define some basic rules for extracting valuable semantic information from such a corpus, together with the corresponding mapping to the concept model graph, which is also defined in this section.

3.1. Concept Model to Graph view

As explained above the aim of the concept model is to store information about existing matches between elements of an ontology. Basically the graph provides a physical view of concepts and their relationships.

Figure 2 - Example of Graph view of extracted concepts from XML Schema files

Concepts are distinguished as property concepts and class concepts:
• Properties are leaf nodes in the Hc hierarchy (see Section 2.1), and are illustrated as blue rounded rectangles;
• Classes are intermediary and root nodes in the Hc hierarchy, and are illustrated as red ellipses in the graph.
A more precise graph will show leaves as datatype properties and intermediary nodes as object properties.

We have currently identified and implemented the following relationships:
• Properties for Hc relationships, illustrated as black lines with a filled diamond;
• Shared Terms for links within the lattice of words, illustrated by blue lines;
• Synonymy for synonym relationships, illustrated as symmetric pink lines;
• Related for merged concepts, illustrated as red lines with a filled circle.
The list of relationships is extensible and other relations can be added. Figure 2 above shows an example of the graph view of the concept address as extracted automatically from a limited set of input XML files from B2B standards.

3.2. Information Extraction from XML Schemas: XML Mining

In this section, we introduce the proposed mapping from XSD structures and tags (as containers of semantic information) to the semantic model described above. We illustrate the technique with a fragment of a simple XSD file.

Listing 1 – An example of address XSD definition

The extraction task basically considers structural information, including XSD complex type tags, XSD element tags, XSD simple type tags, attribute names, sub-elements of identified elements and complex types, associations between elements (e.g.: ) and data types (e.g.: type="xs:dateTime"). Following these basic rules, from Listing 1 we are able to obtain:
• Semantics extracted: Shipping, Address, Delivery, Location, Postal, Country, Street, City, Code, Name.
• Classes: ShippingAddress, DeliveryLocation, PostalAddressType.
• Object properties: Country, CountryType.
• Datatype properties: Street, City, Name, Code, PostalCode (because the target type is an xs:simpleType).
• Relationships: ShippingAddress, DeliveryLocation relatedTo PostalAddressType; Street, City, Name, Country propertyOf PostalAddressType; Name, Code propertyOf Country (thus subPropertyOf PostalAddressType); Address synonym with Location.
Figure 3 shows how this example is rendered in the Concept Model graph view illustrated in the previous section.
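To make these extraction rules concrete, here is a minimal sketch in Python using only the standard library. The schema fragment embedded in it is hypothetical: it was written for this sketch so as to be consistent with the names listed above and is not the original Listing 1. The classification rules in `mine` are likewise a simplified assumption, not the rule set implemented by the tool.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment, written for this sketch only.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ShippingAddress" type="PostalAddressType"/>
  <xs:element name="DeliveryLocation" type="PostalAddressType"/>
  <xs:complexType name="PostalAddressType">
    <xs:sequence>
      <xs:element name="Street" type="xs:string"/>
      <xs:element name="City" type="xs:string"/>
      <xs:element name="PostalCode" type="PostalCodeType"/>
      <xs:element name="Country" type="CountryType"/>
    </xs:sequence>
  </xs:complexType>
  <xs:simpleType name="PostalCodeType">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
  <xs:complexType name="CountryType">
    <xs:sequence>
      <xs:element name="Code" type="xs:string"/>
      <xs:element name="Name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def mine(xsd_text):
    """Classify XSD names into classes, object and datatype properties."""
    root = ET.fromstring(xsd_text)
    complex_types = {ct.get("name"): ct
                     for ct in root.findall(XS + "complexType")}
    classes, object_props, datatype_props = set(), set(), set()
    property_of = set()

    # Nested elements are properties of the complex type containing them.
    for owner, ct in complex_types.items():
        for child in ct.iter(XS + "element"):
            property_of.add((child.get("name"), owner))
            if child.get("type") in complex_types:
                # Structured target: object property (element and its type).
                object_props.update({child.get("name"), child.get("type")})
            else:
                # Simple or built-in target type: datatype property.
                datatype_props.add(child.get("name"))

    # Top-level elements typed by a complex type are classes, as are
    # complex types never used as the type of a nested element.
    for el in root.findall(XS + "element"):
        if el.get("type") in complex_types:
            classes.add(el.get("name"))
    classes.update(set(complex_types) - object_props)
    return classes, object_props, datatype_props, property_of


classes, object_props, datatype_props, property_of = mine(XSD)
```

On this fragment the sketch yields the same classes, object properties and datatype properties as the list above; semantic relations such as the Address/Location synonymy need a dictionary and are outside its scope.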

[Figure 3: graph with the nodes Shipping Address, Delivery Location, Postal Address, Street, Address, City, PostalCode, Country, Code and Name.]

Figure 3 - Resulting graph from example of Listing 1

In practice, XML schemas are complex and additional XSD structures must be considered to obtain better-quality results. The detailed mapping for information extraction adopted in our implementation is shown in Table 1.

Table 1 – XSD Mining information extraction and correspondent mapping

XSD Structure | Action | Mapping
Include | Structures are loaded into the parser | -
Import | Same as include, taking namespace differences into account | -
complexType | Create or update concept structure | Concept class with attribute type
simpleType | Create or update new datatype structure | Concept datatype
extension and restriction | Update datatype properties | Datatype properties
Union | Update complexTypes properties | ComplexType properties
Any | Add or update datatype property of the correspondent concept | Concept datatype property
simpleContent | Create or update datatype | Concept datatype
Element (with attribute "ref") | Create or update concept structure and add Hc relations (between the correspondent concept and the container concept) | Concept class and relation of type property
Element with attributes "name" and "type" | Create new structure concept and relation to the correspondent concept defined in the attribute type | Concept class with attribute type
Element with only attribute "name" | Create new structure concept | Concept class without attribute type

4. Janus: The Reference Ontology Construction Tool

In this section, we present Janus, a tool we have developed that manages information extraction from XML Schema files into the semantic model presented in Section 2. We also introduce the first results obtained from the automatic construction of a B2B global ontology, which aims at helping to generate local ontologies.

4.1. Janus

Our tool implements an adaptation of several techniques originating from the text mining and information retrieval/extraction fields, applied to XML files (an approach we call XML Mining), in order to pre-process simple and compound terms from XML tags, such as XSD elements and XSD complex types. Figure 4 shows the overall architecture of Janus. Currently the first steps, corpus discovery and clustering, are performed by hand, taking advantage of the natural subdivision of B2B standards into business areas (the B2B use case and corpus source are described in the next section). This approach also helps us to better understand the feasibility of translations between different standards by measuring the “distance” between them. In the future we aim at crawling the Web and implementing a TF-IDF measure for clustering documents. Let us now detail the algorithm for term extraction and automatic taxonomy construction from XML tags.

Figure 4 - Janus overall architecture

Acquisition Step

The aim of this step is to organize the corpus source and to select useful terms for the ontology. The extraction tasks are:
1. XSD parsing and extraction of XML tag values for complex types, elements and simple types.
2. Checking for composite words (e.g.: on-line).
3. Checking for previously identified "useless" words, such as the systematic addition to a tag of words with unrelated meaning (e.g.: CommonData in UnitOfMeasureCommonData).
4. Splitting the compound terms forming the tag, using the UCC (upper camel case) convention, or ‘_’ or ‘-’ as separators, taking care of special cases (e.g.: PersonIDCode = person + id + code).
5. Checking for known abbreviations (e.g.: Addr = Address, PO = Purchase Order).
As output of this step we produce a set of extracted tags for each family in the form Term1_Term2_..._TermX (e.g.: ABIEPostalAddressType becomes ABIE_Postal_Address).
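Tasks 3 to 5 above can be sketched with a regular expression. This is an illustration under stated assumptions, not the code of Janus: the function name `split_tag`, the two lookup tables and the choice to strip "useless" words only when they appear as a suffix are all hypothetical, and the composite-word check of task 2 is not covered.

```python
import re

# Illustrative lookup tables; real ones would be curated for the corpus.
ABBREVIATIONS = {"addr": ["Address"], "po": ["Purchase", "Order"]}
USELESS_SUFFIXES = ("CommonData", "Type")

# A run of capitals followed by a capitalised word (ID in IDCode),
# a capitalised or lower-case word, a trailing run of capitals, or digits.
TERM = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_tag(tag):
    """Return the tag as Term1_Term2_..._TermX."""
    # Task 3: drop known "useless" words found at the end of the tag.
    for suffix in USELESS_SUFFIXES:
        if tag.endswith(suffix) and len(tag) > len(suffix):
            tag = tag[: -len(suffix)]
    terms = []
    # Task 4: split on '_' and '-', then on camel-case boundaries.
    for part in re.split(r"[_-]", tag):
        for term in TERM.findall(part):
            # Task 5: expand known abbreviations.
            terms.extend(ABBREVIATIONS.get(term.lower(), [term]))
    return "_".join(terms)


print(split_tag("PersonIDCode"))
print(split_tag("ABIEPostalAddressType"))
```

The first alternative of the pattern is what keeps runs of capitals such as ID or ABIE together instead of splitting them letter by letter.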

Normalisation Step

Extracted tag names may contain syntactic variations around a “core” concept; the data is therefore normalized with respect to linguistic, syntactic and semantic similarities around that "core" concept (e.g.: PostalAddress, DeliveryLocation, Address). During this step the machine cannot tell whether a term composing a tag is a real word or something else (an abbreviation, for example). Thus, in order to compute semantic similarities between tags and to cluster them better, we use a dictionary as an external resource to decide whether a term is a real word or not. In our case we have integrated WordNet version 3.0 (Miller, 1995). The tasks for this step are:
1. Case normalisation: all terms are converted to lower case;
2. Stop-word normalisation: words like “of”, “a”, “for”, … are removed;
3. Bad words detection: terms unknown to the dictionary are set aside;
4. Morphological and semantic normalisation, which consists in finding the stem and lemma form.

Build Taxonomy Step

The aim of this step is to create a first level of semantic relationships and hierarchy between the words of the taxonomy.
1. Calculate term frequencies.
2. Synonym check, applied to words belonging to the taxonomy itself.
3. Recompose tags. All tags are recomposed using their lemmas in order to detect similarities between terms (and thus between tags, and thus between concepts of the ontology that we are building).
4. Build Tags Lattice. Tags are usually composed of more than one word, thus: we build a graph, based on a Galois lattice, relating tags that share the same words (e.g. address and postal_address); we calculate the frequency of the graph nodes; and we remove the nodes that are insignificant (values below a threshold).

Filtering Step

In this step we analyze the words rejected by the first pass and try to detect false semantics present within a tag.
1. Bad words “reconciliation”. During this step we try to detect as many abbreviations as possible by applying a modified version of the N-Gram algorithm and the Levenshtein distance against terms that already exist within the taxonomy. We restrict ourselves to terms within the taxonomy because, if we used the complete dictionary, we would detect too many similar terms, most of them out of context.
2. Useless words detection. Using the lattice, we try to automatically detect words that present disproportionate relationships between graph nodes (like Type or CommonData) and therefore do not actually convey any semantics.
3. Finalize. Integrate new terms.
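Task 4 of the Build Taxonomy Step (relating tags that share words, counting node frequencies and pruning below a threshold) can be approximated as follows. This is a simplified sketch: it counts word sub-combinations rather than constructing a full Galois lattice, and the function name `words_lattice` and the `threshold` parameter are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def words_lattice(tags, threshold=1):
    """Map each word sub-combination to the number of tags containing it."""
    counts = Counter()
    for tag in tags:
        words = tag.split("_")
        nodes = set()
        for size in range(1, len(words) + 1):
            # combinations() preserves the word order of the tag.
            for combo in combinations(words, size):
                nodes.add("_".join(combo))
        counts.update(nodes)  # a tag counts at most once per node
    # Remove insignificant nodes (cardinality below the threshold).
    return {node: n for node, n in counts.items() if n >= threshold}


tags = ["address", "postal_address", "screening_postal_address",
        "delivery_receipt_location"]
lattice = words_lattice(tags)
core = max(lattice, key=lattice.get)  # most frequent node
```

With the four tags used in the Words Lattice example of the Merging Step, the most frequent node is address (cardinality 3), followed by postal and postal_address (cardinality 2). Enumerating every sub-combination is exponential in the number of words per tag, which is acceptable here only because tags contain few words.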

Merging Step

The aim of this step is to look for similarities within the matching/alignment analysis in order to provide a merged view of the elements defined in the previous tasks. Naming affinity is based on the Galois lattice method, with a term-frequency-based strategy, which lets us find the most representative concept carried by a tag at the semantic level. Consider, for example, the following tags: Address, PostalAddress, ScreeningPostalAddress and DeliveryReceiptLocation. The corresponding Words Lattice is illustrated in Figure 5. The number attached to each node represents the cardinality of the node itself.

[Figure 5: Words Lattice for the four tags, with node cardinalities: address (3), postal_address (2), postal (2), and cardinality 1 for screening_postal_address, screening_postal, screening_address, screening, delivery_receipt_location, delivery_receipt, delivery_location, receipt_location, delivery, receipt and location. The labels G1, G2 and G3 and a “Synonym (equivalent to)” legend also appear.]

Figure 5 - Example of Words Lattice

The main concept resulting from the merging of these four tags is represented by the node with the highest frequency, which in this case is Address (and not Screening_postal or Delivery_receipt!). It turns out that cardinality alone is not enough to highlight real term relevance (a term that appears in only one family but with high frequency can give false positives and false negatives). Results have therefore been improved by adopting a weighted Term Frequency measure. With these techniques we have been able to overcome the problem of compound terms in tags.

However, naming affinity is not enough to perform good matching, because it does not provide a view of the nature of concepts. For example, we are not able to say: whether a concept has several senses (polysemy); whether several concepts have the same sense, because terms may be synonyms in one context but not in another; and whether a concept is an important class or simply a property. For all these reasons we must look for structural affinity between concepts. To discover structural affinities between concepts we have also implemented an algorithm based on Galois lattices that produces a graph of the most frequent sets of properties.

The final merging is obtained by combining all the affinities found in the previous tasks. For each concept, we compute the list of concepts with at least one relationship stored in the concept model. The overall algorithm that performs the merging task is as follows:

cxList = BuildListOfReliableSimilarConcepts(ci, WordsLattice, PropertiesLattice, SemanticRelationships, SyntaxRelationships)
for each (cx belonging to cxList) do
    affinityValue = lookPropertiesAffinity(ci, cx)
    if ci.name = cx.name then
        if affinityValue > getStructThresholdEqual(reliabilityValue) then
            MergeConcepts(ci, cx)
    else if distance(ci.name, cx.name) < getSemanticThreshold(reliabilityValue) then
        if affinityValue > structThreshold then
            MergeConcepts(ci, cx)

Function MergeConcepts(ci, cx) {
    calculateMostFrequentConcept(ci, cx);
    calculate(ci.properties U cx.properties);
    merge(ci.relationships, cx.relationships);
    addRelatedToRelation(ci, cx);
}

Listing 2 – Overall algorithm of merging operation

Transform

By applying the mapping defined in Section 2, the Concept Model can be translated to UML (in XMI format), OWL or XSD.

Build Views Step

We have implemented some visualization methods to view our taxonomy. So far we have implemented the following views: as a list, as a tags lattice (with synonym relationships) and as a tag cloud. Others, like “Social Network of Word”, are under development. Figure 6 below shows a screenshot of the graphical user interface of Janus.

Figure 6 - Janus GUI Overview

5. Return on Experience

5.1. The B2B Use Case and Corpus Source

B2B provides an interesting use case for Semantic Web applications because, by its nature, it illustrates the problem of differing designs and structures for similar sets of concepts. As shown in the European e-business report [20], none of the existing approaches yet implements techniques based on semantics. The report shows that at least three out of four enterprises that conduct business exchanges with partners declare that they implement applications based on B2B standards solutions (at least in Europe).

To test our tool, we investigated more than 30 B2B standards. Not all are freely available and some require membership fees. Among the standards considered, only one does not yet produce XML Schema documents describing business messages, and none produces OWL/RDF ontologies. For this reason we decided, at least for the prototype, to consider only those standards offering XML Schema files and to focus our efforts on information retrieval for this format alone. Compared with textual corpora, it has the great advantage of defining a structure for elements (candidate concepts for the ontology), and it notably limits the difficulties of natural language interpretation. Conversely, some standards have limited semantics. Almost all organizations provide a package containing several XSD files: one for each specific message, one grouping common data, and others grouping common data type definitions and code lists. In the end we obtained a corpus source composed of a collection of 23 standards (listed in Table 3), with more than 2000 XSD files, which we considered sufficient to provide significant information about B2B business message definition practices and semantics. Other standards can be added in the future. From our experience, we can at least confirm that XML Schema is the solution most widely supported by consortiums and that it is becoming the de facto standard document format. It overtakes other formats such as the "old" EDI and the "new" RDF/OWL.

This is not the first experiment based on B2B standards, but as far as we know it remains the most complete. Giraldo and Reynaud [14] have already proposed a semi-automatic generation of an ontology from standard-based DTD files, but their solution is limited to the sole domain of tourism, which is defined in advance with great precision, and therefore the detection of relevant concepts does not produce conflicts between different representations. Other experiments, like [15, 16, 17], are closer to the e-business domain than to specific B2B artifacts. They mainly target product classifications and thesauri, which include well-defined hierarchical sets of concepts.

5.2. Automatically Building a B2B Domain Ontology

Table 3 summarizes the collection of B2B standards considered, together with some information about their declared compatibility with other organizations. For each standards body the table also gives the number of XML Schema files provided (or, in some cases, the files we considered), the total number of complex type and element tags, and the resulting number of words contributed to the reference ontology. The values in Table 3 do not take into account words shared between different standards. For this, Table 2 provides an interesting result from the sequential addition of words to a common global “B2B dictionary”. As we can see from Figure 7, when standards are added one at a time, even in random order, after half a dozen additions less than 20% of the words are really new. We obtain about 9% after the whole integration; these remaining words usually represent terms characterizing the standard. This ratio thus gives a first result of mapping B2B XML Schema standards to semantic nets: a dynamically constructed taxonomy evolves nicely, and a shared vocabulary emerges naturally.

Table 2 - Results from the families terms merging

Standard Body | Words | Dictionary Words | Addition %
X12 | 271 | 271 | 100.00
UBL | 274 | 436 | 60.22
OAGIS | 704 | 851 | 58.95
ACORD | 1162 | 1539 | 59.21
GS1 | 216 | 1564 | 11.57
FIX | 117 | 1606 | 35.90
ARTS | 734 | 1828 | 30.25
FpML | 544 | 1982 | 28.31
ETSO | 32 | 1983 | 3.13
CIDX | 437 | 2042 | 13.50
OTA | 552 | 2157 | 20.83
IFX | 446 | 2251 | 21.08
ISO 20022 | 256 | 2279 | 10.94
TWIST | 457 | 2305 | 5.69
HR-XML | 949 | 2468 | 17.18
ebInterface | 66 | 2472 | 6.06
AdsML | 301 | 2497 | 8.31
ebXML | 408 | 2529 | 7.84
PapiNet | 530 | 2678 | 28.11
PIDX | 341 | 2701 | 6.74
STAR | 1130 | 2844 | 12.65
AgXML | 216 | 2864 | 9.26
MISMO | 252 | 2887 | 9.13

[Figure 7: bar chart titled “Terms Addition Percentage”, with one bar per standard body in the order of Table 2, each labelled with its addition percentage.]

Figure 7 - Graph of sequential terms addition (measures are in percentage)
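The Addition % column is consistent with the ratio of newly added dictionary words to the words contributed by the standard; for UBL, for instance, (436 − 271) / 274 ≈ 60.22%. The sequential addition can be sketched as below. The function name and the toy vocabularies are invented for this example and are not taken from the corpus; each family is assumed to contain at least one word.

```python
def addition_percentages(families):
    """Add each family's words to a shared dictionary, in order.

    Returns one (name, words, dictionary_size, percent_new) row per family.
    """
    dictionary = set()
    rows = []
    for name, words in families:
        new_words = words - dictionary          # words not seen so far
        dictionary |= words
        percent = 100.0 * len(new_words) / len(words)
        rows.append((name, len(words), len(dictionary), round(percent, 2)))
    return rows


# Toy vocabularies, invented for the example.
families = [
    ("A", {"address", "street", "city", "country"}),
    ("B", {"address", "city", "postal", "code"}),
    ("C", {"address", "country", "code", "name", "party"}),
]
rows = addition_percentages(families)
for row in rows:
    print(row)
```

Because the dictionary only grows, the percentage for a given family depends on the order in which families are added, which is why the text stresses that the trend holds even for a random order.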

Table 3 - Presentation of the B2B standards involved and of the corresponding extraction of XML semantics

Standard Body | Business Area | Files | Tags | Dictionary words
ACORD | Insurance, reinsurance and related financial service | 8 | 5263 | 1162
AdsML | Graphics communication | 14 | 737 | 301
AgXML | Agriculture supply chain | 11 | 808 | 216
ARTS | Retail | 44 | 5853 | 734
CIDX | Chemical | 61 | 1881 | 437
ebXML | Cross industry | 74 | 1401 | 408
ebInterface | Invoice | 1 | 105 | 66
ETSO | Specific electric transaction | 1 | 27 | 32
FIX | Mainly banks, broker-dealers, exchanges and institutional investors | 18 | 552 | 117
FpML | Financial | 21 | 2124 | 544
GS1 | Supply chain for Healthcare, Defence, Transport & Logistics | 289 | 2360 | 216
HR-XML | Human Resource | 166 | 12717 | 949
IFX | Financial | 310 | 4256 | 446
ISO20022 | Financial | 74 | 11082 | 256
MISMO | Residential, commercial, eMortgage | 14 | 1432 | 252
OAGIS | Cross industry | 515 | 4584 | 704
OTA | Tourist | 233 | 3649 | 552
PapiNet | Paper | 42 | 1394 | 530
PIDX | Petroleum | 26 | 745 | 341
STAR | Automotive retail | 181 | 5518 | 1130
TWIST | Supply chain, payment | 18 | 2489 | 457
UBL | Invoicing, ordering | 11 | 650 | 274
X12 | Cross industry | 9 | 1349 | 271
Sum* | | 2141 | 70976 | 10395

Alliances column, entries in the order given in the source: X12, XBRL, HR-XML; ebXML, CIDX, RAPID ebXML, RAPID; ebXML SWIFT (ISO 20022), FpML; IFX, OAGIS, TWIST IFX, ACORD, ASC X12 ebXML; FIX, FIXML ebXML; ACORD; ebXML, CIDX; OAGIS, ebXML; FpML, FIX, SWIFT; ebXML.

* This sum does not take into account possible correspondences of common tags or words between different bodies.

The next step is the automatic construction of an ontology. The approach here is to take this taxonomy and to transform words into concepts as defined in Section 2. The underlying consideration is that not all these words are main concepts: some are classes, others are properties, and others derive from data types. The structural study led us to define property groups that provide enough information to define a first version of the ontology. For reasons of quality measurement and complexity, structural tests were executed over a small subset of input files: 8 files from 7 standards defining the Address concept. Table 4 summarizes some results, which provide a good summary of the main concepts and relationships of the source files. In fact, only one main concept, Contact, was not captured by the automatic constructor, while about 15% of relationships were “lost”.

Table 4 – Concepts definition results for a subset of Address definitions

Standard | Files | Elements | Structures
CIDX | 2 | 19 | 37
ebXML | 1 | 25 | 43
HR-XML | 1 | 24 | 25
OAGIS | 1 | 73 | 98
PapiNet | 1 | 25 | 30
STAR | 1 | 22 | 27
UBL | 1 | 39 | 65
Sum | 8 | 227 | 325

Concepts = 128
Relationships = 232
Root Concepts = 8 (country, preference, coordinate, status, position, address, location, measurement)
Class Concepts = 18
Properties = 110
Data Types = 73

5.3. Problems

Several problems remain to be solved. The complexity of the considered use case is evident:
• A classical ontology builder would have a computational complexity on the order of O(n!), where n is the number of structures.
• Noise detection (e.g., in UnitOfMeasureCommonData, CommonData has no semantic affinity with UoM and should be excluded from the TF calculation).
• Abbreviation detection (e.g., AcctRevRq, BankAcctTrnImgRevRq, Desc).
• Acronym detection (e.g., ShowBOMDataArea, SVATransaction).
• Sequential merging requires information that is lost in previous merging operations (each Galois node has a cardinality, family attendance, and frequency that are lost when relevant nodes are selected).
• Matching quality is fundamental and hard to achieve. Some improvements will come; the best would be an extensible library of rules.
• The lack of a reference ontology prevents us from obtaining good precision and recall measures (to be detailed).
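The noise, abbreviation, and acronym problems listed above can be illustrated with a small sketch that splits tag names into terms before counting term frequencies. It is only one possible treatment, not the system described in this paper: the lookup tables `ABBREVIATIONS` and `NOISE`, the helper `normalize`, and the rule that any run of two or more capitals is an acronym are assumptions, and the expansions given are illustrative guesses.

```python
import re
from collections import Counter

# Hypothetical lookup tables; a real system would need curated, extensible ones.
ABBREVIATIONS = {"acct": "account", "rq": "request", "desc": "description"}
NOISE = {"common", "data"}  # words assumed to carry no domain meaning

# Uppercase runs (acronyms), capitalized or lowercase words, and digit runs.
TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def normalize(tag):
    """Split a tag into terms, keeping acronyms, expanding known
    abbreviations, and dropping noise words."""
    terms = []
    for token in TOKEN.findall(tag):
        if len(token) > 1 and token.isupper():
            terms.append(token)  # assumed acronym, kept as a single term
            continue
        word = ABBREVIATIONS.get(token.lower(), token.lower())
        if word not in NOISE:
            terms.append(word)
    return terms


# Tag names taken from the examples in the list above.
TAGS = [
    "UnitOfMeasureCommonData",
    "AcctRevRq",
    "BankAcctTrnImgRevRq",
    "Desc",
    "ShowBOMDataArea",
    "SVATransaction",
]

# Term frequency over the normalized terms of all tags.
tf = Counter(term for tag in TAGS for term in normalize(tag))

for tag in TAGS:
    print(tag, "->", normalize(tag))
```

Abbreviations missing from the table, such as Rev, Trn, and Img here, pass through unexpanded, which shows why an extensible dictionary or rule library would be needed in practice.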

6. Conclusion …
