<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0">

<channel>
	<title>systover.net</title>
	
	<link>http://systover.net/blog</link>
	<description>With our heads in the clouds and our feet in the dirt...</description>
	<pubDate>Sat, 24 Jan 2009 16:55:35 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/systover" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="systover" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">systover</feedburner:emailServiceId><feedburner:feedburnerHostname xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>Unified Data Integration for Situation Management</title>
		<link>http://systover.net/blog/2009/01/01/unified-data-integration-for-situation-management/</link>
		<comments>http://systover.net/blog/2009/01/01/unified-data-integration-for-situation-management/#comments</comments>
		<pubDate>Fri, 02 Jan 2009 02:21:47 +0000</pubDate>
		<dc:creator>Suzanne Yoakum-Stover</dc:creator>
		
		<category><![CDATA[data modeling]]></category>

		<category><![CDATA[database]]></category>

		<category><![CDATA[dataspaces]]></category>

		<category><![CDATA[publications]]></category>

		<guid isPermaLink="false">http://systover.net/blog/?p=37</guid>
		<description><![CDATA[Printable copy of article
S. Yoakum-Stover, Ph.D.
Potomac Institute for Policy Studies
US Army CERDEC I2WD Information Exploitation Futures Lab

T. Malyuta, Ph.D.
New York City College of Technology
Computer Systems Technology Department

Abstract
We propose a new solution for data integration and semantic enrichment in support of Situation Management (SIMA).  Our solution applies to any modality (e.g. text, images, audio, signals [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;"><a title="Printable copy of article" href="http://systover.net/blog/wp-content/uploads/2009/01/sima2008_2134_2-1.pdf">Printable copy of article</a></p>
<p style="text-align: center">S. Yoakum-Stover, Ph.D.</p>
<p style="text-align: center">Potomac Institute for Policy Studies</p>
<p style="text-align: center">US Army CERDEC I2WD Information Exploitation Futures Lab</p>
<p style="text-align: center">
<p style="text-align: center">T. Malyuta, Ph.D.</p>
<p style="text-align: center">New York City College of Technology</p>
<p style="text-align: center">Computer Systems Technology Department</p>
<p style="text-align: center">
<h2>Abstract</h2>
<p style="text-align: justify; margin-left: 36pt">We propose a new solution for data integration and semantic enrichment in support of Situation Management (SIMA).  Our solution applies to any modality (e.g. text, images, audio, signals) and embraces the diversity of data sources, types, and models, placing no restrictions on processes, applications, or users.  It is database-centric and proceeds in stages to address the unified storage of structured data and its semantic enrichment in a way that remains viable in an Ultra-Large-Scale (ULS) systems environment.  The result is a layered data integration architecture that can accommodate any kind of data to coherently support the multiplicity of processing required for SIMA.</p>
<p style="text-align: justify; margin-left: 36pt">
<h2>Challenge of Data Integration in Situation Management</h2>
<p style="text-align: justify">Though generally scoped around a particular set of circumstances, or state of affairs, Situation Management (SIMA) is a mega-process occurring in a heterogeneous and volatile data space resulting from a cacophony of human and automated systems.  To understand a situation and engineer the means for managing it, we must organize its data space.  In particular, the heavy load of sophisticated processing for the anticipation, recognition, and influence of a situation must be girded with an architecture that enables data sourced from wildly disparate systems, having different modalities, structures, and semantics, to be integrated into one coherent body of situational knowledge.</p>
<p style="text-align: justify">
<p style="text-align: justify">In most business intelligence applications, data is integrated across information systems to support a choreographed interplay of services comprising an established set of business processes.  In contrast, the constituent events in SIMA typically entail information systems that are far more diverse and whose dynamic interplay is less scripted, less repeatable, and therefore less predictable.  Since many of these information systems capture data for completely different and unrelated purposes, and were never intended as participants in a coherent process, for SIMA we require a data architecture that enables them to be dynamically re-used or re-purposed.  Because every situation is unique and we cannot anticipate all the right &#8220;business processes,&#8221; we need the capability to quickly fuse data, often in high volumes, from an ad-hoc set of systems, sometimes together with knowledge asserted by analysts, in meaningful ways and on the fly.</p>
<p style="text-align: justify">
<p style="text-align: justify">Traditional approaches to data integration, both physical and virtual [Batini 1986, Parent 1998, Halevy 2005, Bernstein 2007], cannot accommodate the complexity, heterogeneity, and volatility of the SIMA data space.  In actual practice, the canonical data-models that underlie such approaches, including federation, are simply too rigid.  They cannot adapt their structure to handle new data sources, associations, processes, or applications without heavy manual intervention.  Moreover, such approaches generally result in the loss or distortion of data, semantics, and context, all of which may be useful or even critical in SIMA.  Even if such systems are initially successful, the IT costs of sustaining them, as well as the human costs resulting from their deficiencies, can be devastatingly high.</p>
<p style="text-align: justify">
<p style="text-align: justify">The scale and complexity of SIMA place it squarely in the domain of Ultra-Large-Scale (ULS) systems, which are characterized by decentralization; inherently conflicting, diverse, and unknowable requirements; heterogeneous, changing, and inconsistent elements; normal failures; continuous operation, evolution, and deployment; and immense scale along many dimensions [Northrop 2006].  As such, SIMA demands a supporting data architecture that remains viable in a freely evolving, interdependent collective of systems, people, policies, cultures, and economics, very little of which will ever be under our control.  Our objective is to define such a solution.</p>
<p style="text-align: justify">
<h2>Data Description Framework</h2>
<p style="text-align: justify">To organize the SIMA data space in a ULS systems environment, we enable semantic data integration by providing for the unified storage of structured data.  We embrace the diversity of domain-specific data-models by taking a data-model-agnostic approach wherein the integration model makes the least possible commitment to any particular data-model.  We achieve this by identifying the universal aspects inherent in all structured data and creating an integration model based on them.  A key aspect of our approach is that the character and meaning of the source data-model are preserved and made accessible by the data store.  The result is a data architecture that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints, in a way that addresses the needs of the SIMA Community today while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.</p>
<p style="text-align: justify">
<p style="text-align: justify">The key to devising a domain-neutral storage model for structured data is to decouple that which varies, namely vocabularies and, more generally, data-models, from that which remains constant, namely the source artifact and, ideally, the storage structure.  To achieve this, we consider structure, vocabulary, semantics, and constraints from a higher level of abstraction, from which we then distill a minimal set of elements sufficient to capture any data-model.  These are illustrated in Fig. 1 and defined as follows:</p>
<p style="text-align: justify">
<p style="text-align: justify"><strong><em>Sign: </em></strong>A <em>sign</em> is a chunk of data, either physically located within a tangible artifact or contained within an analyst&#8217;s mind.  Examples of the former include a string of text in a document, an object within an image, a segment of audio in an audio stream, and a spike in a signal.  As illustrated in Fig. 1, regardless of the type of medium, tangible signs are always associated with a physical extent (i.e. a quantifiable span, which we call a <em>mention</em>) within the artifact.  In contrast, signs that reside in an analyst&#8217;s mind become tangible when she writes down her thoughts.</p>
<p style="text-align: justify">
<p style="text-align: justify"><strong><em>Concept: </em></strong>A <em>concept</em> is an abstract idea, defined explicitly or implicitly by a source data-model.  For example, the nodes of an ontology, the tag set in an XML Schema Document (XSD), and the attribute / table names in a relational database all represent concepts.  <em>Concept</em> is an abstraction of such representations, which in the example of Fig. 1 includes <span style="font-family:Arial; font-size:10pt">Message</span>, <span style="font-family:Arial; font-size:10pt">Person</span>, and <span style="font-family:Arial; font-size:10pt">Body_text</span>.</p>
<p style="text-align: justify">
<p style="text-align: justify"><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata1.png" alt="" /><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata2.png" alt="" /></p>
<p style="text-align: justify">
<p style="text-align: justify"><strong><em>Predicate:</em></strong> A <em>predicate</em> is an abstract idea used to express a relationship between &#8220;things.&#8221;  Predicates are used in the formation of <em>statements</em> (described below) and may be defined either explicitly or implicitly by a source data-model.  For example, the arcs of an ontology and the attributes of an XML or database schema represent <em>predicates</em>.  In Fig. 1, <span style="font-family:Arial; font-size:10pt">To</span>, <span style="font-family:Arial; font-size:10pt">From</span>, and <span style="font-family:Arial; font-size:10pt">Body</span> represent <em>predicates</em>.</p>
<p style="text-align: justify">
<p style="text-align: justify"><strong><em>Term: </em></strong> A <em>term</em> is a disambiguated <em>mention</em> abstracted from the source artifact or asserting analyst.  The process of disambiguation associates a <em>mention</em> with a <em>concept</em>, implicitly using the <span style="font-family:Arial; font-size:10pt">IsInstanceOf</span> <em>predicate</em>.  However, not every such pairing results in a distinct <em>term</em>.  All <em>signs</em> that are identical, and that are identified as having the same meaning, are represented by a single <em>term</em>.  In the example of Fig. 1, <span style="font-family:Arial; font-size:10pt">Suzi IsInstanceOf Person</span> represents a <em>term</em>.</p>
<p style="text-align: justify">
<p style="text-align: justify"><strong><em>Statement: </em></strong>A <em>statement</em> encodes a binary relationship between a subject and an object mediated by a <em>predicate</em>. In our design, subject and object may be either a <em>term</em> or <em>statement</em>.  The simplest kind of <em>statement</em> is one in which subject and object are <em>terms</em>.  <em>Statements</em> in which the object is itself another <em>statement</em> represent reifications.  Finally, a <em>statement</em> in which both subject and object are other <em>statements</em> represents a relationship between <em>statements</em>.  In Fig. 1, we see three <em>statements</em>, all with the same subject, which is the <em>term</em> corresponding to the message itself.</p>
<p style="text-align: justify">
<p style="text-align: justify">This organization of these elementary constructs (sign, concept, predicate, term, and statement) defines a data reference model, which we call the Data Description Framework (DDF) [Yoakum 2008 DAMA].  Because it effectively decouples data from data-models and structured data from data-structures, it can encapsulate any sort of data-model and support any data-structure.  Because it binds knowledge to data, it enables deep data integration and semantic enrichment.  Because it provides a foundation for implementing a stable database, it serves as a practical data integration platform.</p>
<p style="text-align: justify">
<p><span id="more-37"></span>In the subsequent text, we represent mentions, concepts, and predicates using <span style="font-family:Arial; font-size:10pt">Arial</span> font.  Terms are denoted as<span style="font-family:Arial; font-size:10pt"> [mention, concept] (</span>e.g.<span style="font-family:Arial; font-size:10pt"> [Adam, Chemist]) </span>and statements are denoted using an intuitive triple representation, e.g. <span style="font-family:Arial; font-size:10pt">[Adam, Chemist] hasInventoryID [1001,InventoryID].</span></p>
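<p style="text-align: justify">The term and statement constructs, together with the notation just introduced, can be sketched in a few lines of Python.  This is only an illustration under our own assumptions, not the DDF storage model itself: the class names, the field names, and the <span style="font-family:Arial; font-size:10pt">asserts</span> predicate with its analyst term are hypothetical.</p>

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Term:
    """A disambiguated mention: a sign bound to a concept."""
    mention: str
    concept: str

    def __str__(self) -> str:
        return f"[{self.mention}, {self.concept}]"


@dataclass(frozen=True)
class Statement:
    """A binary relationship; subject and object may be terms or statements."""
    subject: Union["Term", "Statement"]
    predicate: str
    obj: Union["Term", "Statement"]

    def __str__(self) -> str:
        return f"{self.subject} {self.predicate} {self.obj}"


adam = Term("Adam", "Chemist")
owns = Statement(adam, "hasInventoryID", Term("1001", "InventoryID"))

# Reification: a statement whose object is itself a statement.
# The analyst term and the 'asserts' predicate are hypothetical.
claim = Statement(Term("analyst-1", "Analyst"), "asserts", owns)

print(owns)
print(claim)
```

<p style="text-align: justify">Because both classes are immutable and compared by value, two identical signs bound to the same concept collapse to a single term, in keeping with the definition of <em>term</em> above.</p>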
<h2>The Unified Data Space</h2>
<p style="text-align: justify">As illustrated in Fig. 2, the DDF forms a layer of data and semantics (Layer 2) lying between the indigenous source systems (Layer 1) and their knowledge models (Layer 3).   Layer 1 feeds the layers above, and Layers 2 and 3 interact:  Layer 3 provides semantic context for Layer 2 and Layer 2 participates in the formation of an overarching knowledge model in Layer 3.   Together Layers 2 and 3 form what we call the unified DDF data space.</p>
<p style="text-align: justify">
<p style="text-align: justify">
<h2>Illustrative Example</h2>
<p style="text-align: justify">To give the reader a more tangible understanding of the DDF, in this section we present a simplified example that illustrates:</p>
<p style="text-align: justify">
<ul>
<li>
<div style="text-align: justify">Loading three disparate data sources into the DDF</div>
</li>
<li>
<div style="text-align: justify">Surveying the resulting integrated data space</div>
</li>
<li>
<div style="text-align: justify">Enhancing the data space with additional semantic associations</div>
</li>
<li>Exploring the enriched data space</li>
</ul>
<p style="text-align: justify">
<p style="text-align: justify"><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata3.png" alt="" /><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata4.png" alt="" /></p>
<h3>Loading the DDF</h3>
<p style="text-align: justify">Loading structured data into a DDF store is a straightforward, mechanical Extract-Transform-Load (ETL) process.  This process maps the original data and semantics into the DDF using a pattern that depends primarily on the type of data source: for a relational source, for instance, the pattern needs only to capture the structure and semantics of the relational metamodel (not those of a specific instance).  For example, our prototype loader works out-of-the-box for most relational databases, extracting structure and data from the source&#8217;s data dictionary and relations as follows:</p>
<p style="text-align: justify">
<ul>
<li>
<div style="text-align: justify">Data instances &#8594; signs</div>
</li>
<li>
<div style="text-align: justify">Table attributes &#8594; concepts</div>
</li>
<li>
<div style="text-align: justify">Signs are bound to their respective concepts to form terms</div>
</li>
<li>
<div style="text-align: justify">Predicates are derived from non-key attributes (i.e. concepts) using &#8216;has&#8217; semantics.  For example the predicate derived from the concept <span style="font-family:Arial; font-size:10pt">Project</span> is <span style="font-family:Arial; font-size:10pt">hasProject.</span></div>
</li>
<li>
<div style="text-align: justify">Within a record, terms associated with primary key columns are semantically linked via derived predicates to terms associated with non-primary key columns to form statements.  For example, <span style="font-family:Arial; font-size:10pt">[Adam, ChemistName] hasProject [P1, Project].</span></div>
<p style="text-align: justify">
</li>
</ul>
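<p style="text-align: justify">The mapping rules above can be sketched as follows.  This is a minimal illustration assuming a single-column primary key and rows supplied as dictionaries; the function name, the tuple representation, and the second sample row are hypothetical and do not describe our prototype loader.</p>

```python
def load_relation(rows, key_column):
    """Map relational rows to DDF-style terms and statements.

    rows: list of dicts mapping column name to value.
    key_column: name of the (single) primary key column.
    Terms are (sign, concept) tuples; statements are
    (subject term, predicate, object term) tuples.
    """
    terms, statements = set(), set()
    for row in rows:
        key_term = (row[key_column], key_column)
        terms.add(key_term)
        for column, value in row.items():
            if column == key_column:
                continue
            term = (value, column)        # sign bound to its concept
            terms.add(term)
            predicate = "has" + column    # 'has' semantics, e.g. hasProject
            statements.add((key_term, predicate, term))
    return terms, statements


# Hypothetical rows standing in for one source table.
source_a = [
    {"ChemistName": "Adam", "Project": "P1"},
    {"ChemistName": "Ben", "Project": "P2"},
]
terms, statements = load_relation(source_a, key_column="ChemistName")
print((("Adam", "ChemistName"), "hasProject", ("P1", "Project")) in statements)
```

<p style="text-align: justify">Note that the function inspects only column names and key structure, never the meaning of a particular table, which is why the same pattern applies to any relational source.</p>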
<p style="text-align: justify"><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata5.png" alt="" /><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata6.png" alt="" />Figures 4 and 5 illustrate the result of the mechanical ETL for the three data sources shown in Fig. 3.  For the purpose of our illustration, we assume that everything from the sources presented in Fig. 3 is loaded, but this need not be the case. We may freely choose which parts of a data source to load and when to load them.  For example, we may choose to load specific views of the source data, or perhaps only the structure of a data source, lazily loading instances only when requested.  Finally, the DDF can (and should) capture any desired metadata associated with the source artifacts, the ETL process itself, the quality / strength of semantic and association facts, or any other aspects of the data space elements. For simplicity we do not illustrate this.</p>
<p style="text-align: justify">
<h3>Surveying the Unified Data Space Floor</h3>
<p style="text-align: justify">We refer to the integrated data space that results simply from loading data into the DDF as the Unified Data Space Floor.  We may explore this space through querying.  For example, we may observe the spectrum of semantics of the sign <span style="font-family:Arial; font-size:10pt"><em>Adam</em></span> by issuing a query that asks, &#8216;What is <span style="font-family:Arial; font-size:10pt">Adam</span>?&#8217;  The result set will include all the concepts associated with the sign <span style="font-family:Arial; font-size:10pt">Adam</span> across all sources (i.e. <span style="font-family:Arial; font-size:10pt">ChemistName</span> and <span style="font-family:Arial; font-size:10pt">Chemist</span>).  Note that this simple yet penetrating question cannot be answered by any traditional data integration solution.</p>
<p style="text-align: justify">
<p style="text-align: justify">Another simple but useful question that traditional data integration solutions cannot answer is:  &#8216;Which data elements (i.e. signs) in source B also appear in source C?&#8217;  The result is: <span style="font-family:Arial; font-size:10pt">E1001, E2119, </span>and<span style="font-family:Arial; font-size:10pt"> E3327</span>.  By looking at the range of concepts associated with this result set, one may glean useful insight for data-model harmonization.  For example, we find that <span style="font-family:Arial; font-size:10pt">E1001 </span>is associated with the concept <span style="font-family:Arial; font-size:10pt">InventoryID</span> in source B and <span style="font-family:Arial; font-size:10pt">EquipCode</span> in source C.  An analyst might therefore suspect that the two concepts are the same and, if this is confirmed, assert the equivalence at the data-model level.  Thus insight obtained by the analysis of data instances may be applied more broadly as knowledge at the data-model level.  This is but one example of how Layer 2 can inform Layer 3.</p>
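<p style="text-align: justify">Both questions reduce to simple lookups once every term records its sign, concept, and source.  The sketch below assumes an in-memory list of such triples; the sample rows are a hypothetical reconstruction of the example data, and the function names are ours.</p>

```python
from collections import defaultdict

# Hypothetical terms as (sign, concept, source) triples.
terms = [
    ("Adam", "ChemistName", "A"),
    ("Adam", "Chemist", "B"),
    ("E1001", "InventoryID", "B"),
    ("E2119", "InventoryID", "B"),
    ("E3327", "InventoryID", "B"),
    ("E1001", "EquipCode", "C"),
    ("E2119", "EquipCode", "C"),
    ("E3327", "EquipCode", "C"),
    ("L1", "lab", "C"),
]

concepts_by_sign = defaultdict(set)
signs_by_source = defaultdict(set)
for sign, concept, source in terms:
    concepts_by_sign[sign].add(concept)
    signs_by_source[source].add(sign)


def what_is(sign):
    """All concepts bound to a sign, across every source."""
    return sorted(concepts_by_sign.get(sign, set()))


def shared_signs(source_x, source_y):
    """Signs that appear in both of the given sources."""
    common = signs_by_source[source_x].intersection(signs_by_source[source_y])
    return sorted(common)


print(what_is("Adam"))           # ['Chemist', 'ChemistName']
print(shared_signs("B", "C"))    # ['E1001', 'E2119', 'E3327']
print(what_is("E1001"))          # ['EquipCode', 'InventoryID']
```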
<p><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata7.png" alt="" /><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata8.png" alt="" /></p>
<p style="text-align: justify">
<p style="text-align: justify">By chaining such queries, we can explore semantic associations and traverse the Unified Data Space Floor.  For example, we may ask:</p>
<p style="text-align: justify">
<ol>
<li>
<div>Query:  What terms are associated with the sign <span style="font-family:Arial; font-size:10pt">L1</span>?</div>
<p>Result: <span style="font-family:Arial; font-size:10pt">[E1001, EquipCode], [E3327, EquipCode]</span></p>
<p>Analyst thinks:  &#8216;This stuff is located in the same lab.&#8217;</p>
</li>
<li>
<div>Query:  What other concepts are associated with signs <span style="font-family:Arial; font-size:10pt">E1001</span> and <span style="font-family:Arial; font-size:10pt">E3327</span>?</div>
<p>Result: <span style="font-family:Arial; font-size:10pt">InventoryID</span> (from source B)</p>
<p>Analyst thinks:  &#8216;I wonder if <span style="font-family:Arial; font-size:10pt">EquipCode</span> is the same thing as <span style="font-family:Arial; font-size:10pt">InventoryID</span>.&#8217;</p>
</li>
<li>
<div>Query:  Which signs of <span style="font-family:Arial; font-size:10pt">EquipCode</span> match signs of <span style="font-family:Arial; font-size:10pt">InventoryID</span>?</div>
<p>Result: <span style="font-family:Arial; font-size:10pt">E1001, E2119, E3327</span></p>
<p>Analyst thinks:  &#8216;The concepts <span style="font-family:Arial; font-size:10pt"><a name="OLE_LINK1"></a>EquipCode</span> and <span style="font-family:Arial; font-size:10pt">InventoryID</span> probably do mean the same thing.&#8217;</p>
</li>
<li>
<div>Query:  What other concepts are associated with <span style="font-family:Arial; font-size:10pt">InventoryID</span>?</div>
<p>Result: <span style="font-family:Arial; font-size:10pt">Chemist</span></p>
</li>
<li>
<div>Query:  Which <span style="font-family:Arial; font-size:10pt">Chemists</span> are associated with <span style="font-family:Arial; font-size:10pt">[E1001, InventoryID]</span> and <span style="font-family:Arial; font-size:10pt">[E3327, InventoryID]</span>?</div>
<p>Result: <span style="font-family:Arial; font-size:10pt">[Adam, Chemist], [Mary, Chemist]</span></p>
<p>Analyst thinks:  &#8216;Adam and Mary have equipment in the same lab, so they probably know each other.&#8217;</p>
</li>
</ol>
<p style="text-align: justify">
<p style="text-align: justify">These queries illustrate the ability to perform &#8220;semantic drilling&#8221; into the DDF data space.  We can ask a series of questions that &#8220;surf&#8221; across the entire DDF data space unimpeded by barriers between source systems.  One need not have specific semantic knowledge of the source systems in order to explore the data space this way and to extract useful insight.  In the next section we illustrate how this insight may subsequently be inserted back into the data space, as additional information and knowledge, to produce further semantic enrichment and fusion.</p>
<p><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata9.png" alt="" /></p>
<h3>Enhancing the Data Space</h3>
<p style="text-align: justify">Up to this point, we have discussed the data integration and analytic power of the Unified Data Space Floor that results simply from the mechanical loading of data into Layer 2.  The breadth of integration, depth of semantic enrichment, and analytic power can all be dramatically improved when an analyst or an automated process builds upon this floor.  Such enhancement can be performed at the data instance level (Layer 2), the data-model level (Layer 3), or a combination of the two.  The first regards the assertion of new instances of DDF elements (i.e. signs, terms, concepts, predicates, and statements).  The second regards the enhancement or harmonization of source-specific data-models.  The third regards the association of concepts and predicates asserted in Layer 2 with existing knowledge models in Layer 3.</p>
<p style="text-align: justify">
<p style="text-align: justify">For example, as is illustrated in Fig. 5, we may introduce the predicate <span style="font-family:Arial; font-size:10pt">isEquivalent</span> and use it to assert the statement that <span style="font-family:Arial; font-size:10pt">[Ben, ChemistName] isEquivalent [Benjamin, Chemist]</span>.  Such statements, created at the data instance level, represent <em>data</em> integration.  In addition, we may assert new associations at the data-model level to achieve global <em>data-model</em> integration (e.g. harmonization).  This is illustrated in Fig. 6, wherein the concept <span style="font-family:Arial; font-size:10pt">ChemistName</span> is asserted to be the same as the concept <span style="font-family:Arial; font-size:10pt">Chemist</span>.  The result of this assertion is that the <em>meaning</em> of all <span style="font-family:Arial; font-size:10pt">ChemistName</span> terms becomes <span style="font-family:Arial; font-size:10pt">sameAs</span> the <em>meaning</em> of all <span style="font-family:Arial; font-size:10pt">Chemist</span> terms.</p>
<p style="text-align: justify">
<p style="text-align: justify"><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata10.png" alt="" /><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata11.png" alt="" /></p>
<h3>Exploring the Enriched Data Space</h3>
<p style="text-align: justify">As we explore the enriched data space, surfing semantics and drilling associations, we find that previously disjoint regions of the space become reachable via the newly asserted data and associations.  For example, having equated the concept <span style="font-family:Arial; font-size:10pt">ChemistName</span> with <span style="font-family:Arial; font-size:10pt">Chemist</span>, and <span style="font-family:Arial; font-size:10pt">InventoryID</span> with <span style="font-family:Arial; font-size:10pt">EquipCode</span>, an analyst can retrieve the projects that are located in a particular lab with essentially a single query.</p>
<p style="text-align: justify">
<p style="text-align: justify">Query:  Which terms are associated with <span style="font-family:Arial; font-size:10pt">[L1, lab]</span>?</p>
<p style="text-align: justify">Result:  <span style="font-family:Arial; font-size:10pt">[E1001, EquipCode], [Adam, Chemist], [P1, Project]</span></p>
<p style="text-align: justify">
<p style="text-align: justify">Fig. 6 shows how the asserted associations (dashed) at the data-model level enable additional associations (dotted) to be inferred.  This interplay of data and data-model integration is what ultimately allows us to &#8220;connect the dots.&#8221;</p>
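<p style="text-align: justify">One way to picture this inference is as graph reachability over statements after equivalent concepts have been folded together.  The sketch below is an illustration under our own assumptions rather than the DDF query mechanism: the three sample statements, the <span style="font-family:Arial; font-size:10pt">hasLab</span> predicate name, and the flat equivalence map (which is not transitive) are all hypothetical.</p>

```python
from collections import defaultdict, deque

# Hypothetical statements, one per source, as (subject, predicate, object).
statements = [
    (("Adam", "ChemistName"), "hasProject", ("P1", "Project")),         # source A
    (("Adam", "Chemist"), "hasInventoryID", ("E1001", "InventoryID")),  # source B
    (("E1001", "EquipCode"), "hasLab", ("L1", "lab")),                  # source C
]

# Concept-level assertions: each key is declared the same as its value.
same_as = {"ChemistName": "Chemist", "InventoryID": "EquipCode"}


def canonical(term, equivalences):
    """Rewrite a term so that equivalent concepts share one name."""
    sign, concept = term
    return (sign, equivalences.get(concept, concept))


def reachable(start, statements, equivalences):
    """Terms connected to start by any chain of statements."""
    neighbours = defaultdict(set)
    for subject, _predicate, obj in statements:
        s = canonical(subject, equivalences)
        o = canonical(obj, equivalences)
        neighbours[s].add(o)
        neighbours[o].add(s)
    start = canonical(start, equivalences)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbours[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen - {start})


# Floor only: the lab reaches nothing beyond its own source.
print(reachable(("L1", "lab"), statements, {}))
# Enriched space: chemist and project become reachable from the lab.
print(reachable(("L1", "lab"), statements, same_as))
```

<p style="text-align: justify">With no equivalences the traversal stops at <span style="font-family:Arial; font-size:10pt">[E1001, EquipCode]</span>; with the two concept-level assertions in place it also returns <span style="font-family:Arial; font-size:10pt">[Adam, Chemist]</span> and <span style="font-family:Arial; font-size:10pt">[P1, Project]</span>, matching the query result above.</p>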
<p style="text-align: justify">
<h2>Application to SIMA</h2>
<p style="text-align: justify">To enable the rapid, ad-hoc assimilation of diverse data into situational views useful for SIMA, we must overcome system, structural, and semantic barriers between data sourced from different systems.  As illustrated in Fig. 7, traditional data integration approaches attempt to achieve this by imposing a tight commitment to a particular data-model or integration schema (i.e. canonical data-model).  Unfortunately, choosing which of the source data elements to expose and mapping them to the canonical model inevitably leads to information loss or distortion, and the integration schema itself creates yet another semantic barrier.</p>
<p style="text-align: justify">In contrast, the DDF breaks the barriers between data sources to accommodate all within a single coherent data space.  Simply loading data into the DDF in a largely automated fashion produces a fundamental level of data unity: the Unified Data Space Floor.  No data-model harmonization need be performed, and yet non-trivial data integration results.  Upon this floor, the DDF supports the construction of deeper integration and semantic enrichment at both the data instance and data-model levels without prescribing or constraining the processing by which such enrichment may be achieved.  Any fusion or data integration method can be applied alone or in combination.  Moreover, unlike other integration approaches, new data and associations, regardless of their origin, join seamlessly into the unified data space.</p>
<p style="text-align: justify">
<p style="text-align: justify">The DDF data space also supports the complete spectrum of applications and clients, from generic (i.e. those operating at the level of the DDF structure) to specific (i.e. those that have knowledge of a particular source data-model).  Generic clients seamlessly span across the entire data space regardless of data source or associated data-model to perform analysis.  Such clients require no modification as new data or semantics are introduced.  Specific clients are able to operate with the same semantic depth in the DDF data space as they would on the source system itself since the DDF data space preserves the data and semantics of the source systems.  In other words, the expressiveness and search capability native to those systems are retained [Yoakum 2008 JDIQ].  As the data space is increasingly enriched with semantics that bridge data-models, the depth of specific clients is retained while their breadth increasingly widens toward that of a generic client.</p>
<p style="text-align: justify"><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata12.png" alt="" /><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata13.png" alt="" /><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0221-unifieddata14.png" alt="" /></p>
<h2>Conclusion</h2>
<p style="text-align: justify">Successfully executing the constellation of activities that comprise SIMA, particularly in support of decision-making, requires exploiting information within a dynamic, heterogeneous, and distributed data environment that is largely beyond our control.  The challenge, therefore, is to dynamically integrate data, information, and knowledge into one coherent intelligence repository to serve as a foundation for SIMA processes and operations.  Current practice is insufficient in the face of scale and complexity.</p>
<p style="text-align: justify">The approach presented in this paper overcomes the shortcomings of traditional data integration approaches using a framework, called the Data Description Framework, which enables the seamless integration of any structured data within and across data sources and models without the loss or distortion of data and semantics.  Moreover, the framework supports a practical, stable implementation using any standard database system.</p>
<p style="text-align: justify">The simple, mechanical loading of source data and semantics into the DDF creates a unified data space floor that exhibits a primary level of data integration unmatched by traditional integration approaches. No up-front, heavy investment in data-model harmonization is required – one simply pours data on the floor.  Deeper integration and semantic enrichment may then be pursued with any manual or automated processing operating either at the data instance or data-model levels.</p>
<p style="text-align: justify">The ultimate analytic power that is enabled by the DDF data space is essentially unlimited and exceeds that of any particular source system or traditional data integration solution at any level.  Having the power and flexibility required to organize the transient and complex SIMA data space, it provides the ideal foundation on which to pursue SIMA.</p>
<h2>Acknowledgements</h2>
<p style="text-align: justify">The authors would like to thank the following US Army CERDEC I2WD personnel for their continued support:  Mr. Anthony Lisuzzo, Director; Mr. Kesny Parent, DCGS-A Branch Chief; Ms. Virginia Goon, IXFL Manager; and Mr. Norbert Antunes, IXFL Computer Engineer.  This work was funded by US Army CERDEC I2WD under contract number W15P7T-06-D-A401/009.</p>
<h2>References</h2>
<p style="text-align: justify; margin-left: 36pt">[Batini 1986] Batini, C. <em>et al</em>. <em>A comparative analysis of methodologies for database schema integration</em>, ACM Computing Surveys, 18(4), 1986.</p>
<p style="text-align: justify; margin-left: 36pt">[Bernstein 2007] Bernstein, P. and Ho, H. <em>Model Management and Schema Mappings: Theory and Practice</em>, Proceedings of VLDB Conference, 2007.</p>
<p style="text-align: justify; margin-left: 36pt">[Halevy 2005] Halevy, A. <em>et al</em>. <em>Enterprise information integration: successes, challenges and controversies</em>, Proceedings of 24th International Conference on Management of Data, Baltimore, 2005.</p>
<p style="text-align: justify; margin-left: 36pt">[Northrop 2006] Northrop, L. <em>et al</em>. <em>Ultra-Large-Scale Systems: The Software Challenge of the Future</em>, Pittsburgh: Carnegie Mellon University, 2007. <a href="http://www.sei.cmu.edu/publications/books/engineering/uls.html">http://www.sei.cmu.edu/publications/books/engineering/uls.html</a></p>
<p style="text-align: justify; margin-left: 36pt">[Parent 1998] Parent, C. and Spaccapietra, S. <em>Issues and approaches of database integration</em>, Communications of the ACM, 41(5), 1998.</p>
<p style="text-align: justify; margin-left: 36pt">[Yoakum 2008 DAMA] Yoakum-Stover, S. and Malyuta, T. <em>Unified Integration Architecture for Intelligence Data</em>, DAMA International Europe Conference 2008, November 2008, London, UK.</p>
<p style="text-align: justify; margin-left: 36pt">[Yoakum 2008 JDIQ] Yoakum-Stover, S. and Malyuta, T. <em>Unified Architecture for Integrating Intelligence Data</em>, ACM Journal of Data and Information Quality, September 2008. Pending decision.</p>
]]></content:encoded>
			<wfw:commentRss>http://systover.net/blog/2009/01/01/unified-data-integration-for-situation-management/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Unified Architecture for Integrating Intelligence Data (full paper)</title>
		<link>http://systover.net/blog/2009/01/01/unified-architecture-for-integrating-intelligence-data-full-paper/</link>
		<comments>http://systover.net/blog/2009/01/01/unified-architecture-for-integrating-intelligence-data-full-paper/#comments</comments>
		<pubDate>Fri, 02 Jan 2009 02:18:46 +0000</pubDate>
		<dc:creator>Suzanne Yoakum-Stover</dc:creator>
		
		<category><![CDATA[data modeling]]></category>

		<category><![CDATA[database]]></category>

		<category><![CDATA[dataspaces]]></category>

		<category><![CDATA[publications]]></category>

		<guid isPermaLink="false">http://systover.net/blog/?p=22</guid>
		<description><![CDATA[S. Yoakum-Stover, Ph.D.
Potomac Institute for Policy Studies
US Army CERDEC I2WD Information Exploitation Futures Lab

T. Malyuta, Ph.D.
New York City College of Technology
Computer Systems Technology Department

August 24, 2008
Abstract
The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data into one single coherent repository of knowledge.  Current practice whereby all [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center">S. Yoakum-Stover, Ph.D.</p>
<p style="text-align: center">Potomac Institute for Policy Studies</p>
<p style="text-align: center">US Army CERDEC I2WD Information Exploitation Futures Lab</p>
<p style="text-align: center">
<p style="text-align: center">T. Malyuta, Ph.D.</p>
<p style="text-align: center">New York City College of Technology</p>
<p style="text-align: center">Computer Systems Technology Department</p>
<p style="text-align: center">
<p style="text-align: center">August 24, 2008</p>
<h2>Abstract</h2>
<p style="text-align: justify; margin-left: 36pt"><span style="font-size:10pt">The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data into one single coherent repository of knowledge.  Current practice whereby all data-models would be merged into a single &#8220;Uber-model&#8221; simply does not work.  We require a solution that remains viable in a freely evolving, interdependent collective of human and computational systems, very little of which will ever be under our control.  Our approach is database-centric and proceeds in stages.  The first addresses the unified storage of the broad spectrum of artifacts existing within the Intelligence Enterprise today regardless of modality or representation.  The second builds upon the foundation provided by the first to address the unified storage of structured data and semantic data integration.  In both we embrace the diversity of data-models employed throughout the Intelligence Community. The result is a layered data architecture that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints in a way that addresses today&#8217;s Intel needs while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.<br />
</span></p>
<h2>Introduction</h2>
<p style="text-align: justify">The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data stores and streams, both legacy and bleeding-edge, into one single coherent repository of knowledge.  Pieces of the Intel puzzle lie scattered in data silos sequestered by the very systems that served to create them.  Each of these systems, including most of today&#8217;s Army Programs of Record, was built as an end-to-end solution with its own sensors, processors, and data stores, implemented and operated to achieve a specific intelligence objective.  They were never meant to interoperate, share data, or even expose data beyond a narrow mission-focused enclave.  The advent of network technologies and protocols, which have effectively eliminated the physical barriers between systems, has done little to bridge the chasm between these data silos.  Although we can now transfer data over the wire, disparate and utterly incompatible data-models characterized by straightforward and subtle differences in vocabulary, structure, semantics, and constraints continue to stymie data integration efforts.</p>
<p style="text-align: justify">Data quality professionals widely recognize the importance of data integration and the need for efficient data integration approaches to redress a panoply of data quality problems [Lee 2006].  Unfortunately, current practice in data integration, whereby all data-models would be merged or harmonized, either physically or virtually [Batini 1986, Parent 1998, Halevy 2005, Bernstein 2007], fails to accommodate the demands of our fluid and rapidly growing Intelligence Enterprise.  The physical mapping of disparate models into a single canonical data-model [Omelayenko 2001] is simply untenable as the scale and complexity of their subjects quickly overwhelm our tools and methods.  Federation approaches share this defect and introduce new ones [Izydor 2007, Yero 2008].  In practice, these approaches provide only the illusion of data integration as they mainly integrate data-models, not the data itself, and in so doing confine all data to a model that is incapable of adapting itself or its contents as our knowledge about the domain evolves.</p>
<p style="text-align: justify">In all but the most constrained situations, what begins as a perfectly neat solution for a handful of systems quickly becomes intractable with scale, exposing not only the limitations of traditional implementations, but also of our grasp of the foundations of knowledge representation itself.  This phenomenon is but one early symptom of our evolution toward Ultra-Large-Scale (ULS) systems [Northrop 2006] and as such, invites a completely different approach - one that remains viable in a freely evolving, interdependent collective of systems, people, policies, cultures, and economics, very little of which will ever be under our control.  Our objective is to define such a solution.</p>
<h2>Conceptual Approach</h2>
<p style="text-align: justify">Our approach to integrating intelligence data in a ULS systems environment is data-centric (as opposed to data-model centric) and proceeds in stages.  The first addresses the unified storage of the entire spectrum of intelligence artifacts regardless of modality or representation.  The second stage builds upon the foundation provided by the first to address the unified storage of structured data to enable semantic data integration. A third stage (beyond the scope of this paper) addresses unified storage of knowledge models. In all stages we embrace the diversity of domain-specific data-models employed throughout the Intelligence Community by taking a data-model agnostic approach wherein the integration model makes the least possible commitment to any particular data-model.  In the case of &#8220;raw&#8221; artifacts, this means storing each according to its native representation without the application of structural or semantic transformations. In the case of structured artifacts, it means identifying the universal aspects inherent in all structured data and creating an integration model based on them.  A key aspect of our approach is that the character and meaning of the source data-model are preserved and made accessible by the data store.  The result is a layered Data Integration Framework that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints, in a way that addresses the needs of the Intelligence Community today while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.</p>
<h2>Scope</h2>
<p style="text-align: justify">The types of intelligence collected by sensors and systems today span the electromagnetic spectrum to include all manner of signals, audio, video, and images, in addition to so-called human intelligence (e.g. text artifacts such as reports, messages, web pages).  Our approach to data integration supports all of these simultaneously regardless of their underlying source data-model, or lack thereof.  It does not, however, <em>prescribe</em> a solution for data-model harmonization.  In particular, our approach imposes no relationship between the data-models to which the artifacts adhere.  It does, however, allow such relationships, created by external processes of any sort, to be effectively represented and integrated.</p>
<p style="text-align: justify"><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0218-unifiedarch1.png" alt="" /></p>
<p style="text-align: justify">As the <em>business</em> of intelligence is to develop and communicate understanding (which entails the collection, exploitation, and provisioning of intelligence), intelligence business <em>processing</em> includes any automated activity that moves intelligence artifacts with respect to the cognitive hierarchy (see Fig. 1).  This includes data collection, semantic enhancement and fusion from data to information to knowledge, and communication / collaboration to create understanding.  In these terms, Layer 1 of our Data Integration Framework supports an aspect of collection and rudimentary exploitation.  Layer 2 supports the processing by which data is enhanced with semantics to produce information, and the processing by which information is enhanced with richer associations to produce knowledge.  Layer 3 supports the management and integration of knowledge models, and Layer 4 supports human-computer interfaces through which the analyst &#8220;sees&#8221; all of this intelligence.  The scope of this paper is limited to Layers 1 and 2, which together support the provisioning of integrated intelligence at the level of data, information, and knowledge.  Layers 3 and 4 will be the subject of subsequent papers.</p>
<h2>Technical Approach</h2>
<p style="text-align: justify">The broad and ever-changing spectrum of intelligence artifacts existing within the Intelligence Enterprise today reflects a nearly equally broad and ever-changing spectrum of intelligence collectors, producers, and consumers.  The types of artifacts they generate vary tremendously in their modality (e.g. text, images, audio, video, signals) and representation (e.g. free text, XML, SQL, vector, raster).  As this diversity is beyond our control, we term all such artifacts &#8220;indigenous.&#8221;</p>
<p style="text-align: justify">In Layer 1 of our Data Integration Framework, we seek to integrate the entire spectrum of indigenous artifacts by simply collecting them in one (possibly distributed) database using standard means for physical and/or virtual data integration.  Crucial to our approach, however, is that we (a) avoid making any data or data-model transformations in the process of data ingestion and (b) make the least possible commitment to a data-model in the target storage schema.  Consequently, the Layer 1 database schema is quite simple and flat, exposing a minimal set of essential meta-data fields whose main purpose is to support back-tracking to the original artifact and/or source.  As illustrated in Fig. 2, the principal data element within a database record is the artifact itself, which is captured either physically or virtually (by way of a link or reference) in as close to its indigenous form as possible.</p>
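<p style="text-align: justify">As a rough illustration of how flat such a Layer 1 store could be, the following sketch uses Python&#8217;s standard sqlite3 module.  The table name, the column names, and the sample rows are assumptions made purely for illustration; they are not the fields shown in Fig. 2.  The only ideas taken from the text are that each record carries a few back-tracking meta-data fields and holds the artifact either physically (as bytes) or virtually (as a reference).</p>
<pre>
```python
import sqlite3

# Hypothetical Layer 1 table: a few back-tracking fields plus the artifact,
# stored either physically (content) or virtually (uri), never both.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE artifact (
        artifact_id INTEGER PRIMARY KEY,
        source      TEXT NOT NULL,  -- originating system (assumed field)
        media_type  TEXT NOT NULL,  -- representation hint (assumed field)
        ingested_at TEXT NOT NULL,  -- load timestamp (assumed field)
        content     BLOB,           -- tangible artifact, kept untransformed
        uri         TEXT,           -- virtual artifact, kept as a reference
        CHECK ((content IS NULL) != (uri IS NULL))
    )
    """
)

# A tangible artifact: the raw bytes are stored as-is.
conn.execute(
    "INSERT INTO artifact (source, media_type, ingested_at, content) "
    "VALUES (?, ?, ?, ?)",
    ("SystemA", "text/plain", "2008-08-24T00:00:00Z", b"raw message bytes"),
)

# A virtual artifact: only a link to the source is stored.
conn.execute(
    "INSERT INTO artifact (source, media_type, ingested_at, uri) "
    "VALUES (?, ?, ?, ?)",
    ("SystemB", "image/png", "2008-08-24T00:00:00Z",
     "file:///hypothetical/path/image.png"),
)

rows = conn.execute(
    "SELECT artifact_id, source, content IS NULL "
    "FROM artifact ORDER BY artifact_id"
).fetchall()
```
</pre>
<p style="text-align: justify">Nothing in this hypothetical table depends on what an artifact means or how it is structured, which is the sense in which the schema makes little commitment to any data-model.</p>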
<p style="text-align: justify">Using a familiar analogy, if each indigenous artifact were to represent a single piece of a colossal Intel jigsaw puzzle, then Layer 1 of our Data Integration Framework is just the box in which we keep all the pieces.  This most trivial form of integration has several important benefits:  It provides a manageable yet powerful and standard interface to the source data.  It gives us the option to either &#8220;lazily&#8221; load and cache data as &#8220;virtual artifacts&#8221; for performance&#8217;s sake, or persist and control data as &#8220;tangible artifacts&#8221; for the long term.  It provides &#8220;one stop shopping&#8221; access to the indigenous data for analysts who would otherwise need to navigate and obtain access to multiple disparate systems.  And most importantly, this universal indigenous data store establishes a foundation upon which deep data integration can be more effectively pursued.</p>
<p><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0218-unifiedarch2.png" alt="" /><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0218-unifiedarch3.png" alt="" /></p>
<h3>Structured Data</h3>
<p style="text-align: justify">Every analyst engaged in intelligence processing either creates or uses structured data.  Just as we do not control the sources or format of indigenous artifacts, we also do not control the various methods by which such artifacts might be structured or the data-models employed therein.  Thus, just as the objective of Layer 1 is to accommodate the diversity of indigenous artifacts regardless of type or format, the objective of Layer 2 is to accommodate the diversity of all structured data regardless of vocabulary, organization, representation, or semantics.</p>
<p style="text-align: justify">Structured data necessarily adheres to some sort of model, which in general specifies vocabulary, structure, semantics, and constraints.  Though not all data-models specify all of these, at minimum, every structured artifact entails a vocabulary reflecting a set of entity types (e.g. person, message) and an organization reflecting their relationships (e.g. message to person).  These basic elements are illustrated in the simplified example of Fig. 3.  Part (a) of the figure shows a short unstructured text message, and part (b) shows a data-model according to which a message might be structured.  Part (c) then shows the original message structured according to the data-model and part (d) shows how that structured message is typically persisted in a database.</p>
<p style="text-align: justify"><span id="more-22"></span></p>
<p style="text-align: justify">Notice how the database schema is tightly coupled to the data-model that was used to structure the data, and how the raw message is bound to the data-model by the database.  The data-model is imposed on the database, and the data itself is frozen into it such that no additional attributes or relationships are possible (without modifying the database schema).  This is a severe shortcoming considering the tremendous variety of ways in which a given artifact might be structured or enhanced with additional features and associations.  Even for the simple case shown in the figure, we can easily imagine data-models that use different entities (e.g. <span style="font-family:Arial; font-size:10pt">&#8216;Individual&#8217;</span> instead of <span style="font-family:Arial; font-size:10pt">&#8216;Person&#8217;</span>), different relationships (e.g. <span style="font-family:Arial; font-size:10pt">&#8216;Sender&#8217;</span> instead of <span style="font-family:Arial; font-size:10pt">&#8216;From&#8217;</span>), and different organizations (e.g. by including <span style="font-family:Arial; font-size:10pt">&#8216;MessageDate&#8217;</span>), not to mention the wealth of other information external to the message itself (e.g. about <span style="font-family:Arial; font-size:10pt">&#8216;Tanya&#8217;</span>) that might be brought to bear.</p>
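<p style="text-align: justify">The coupling described above can be reproduced in miniature with Python&#8217;s standard sqlite3 module.  The table and column names below are assumptions loosely modeled on the &#8216;Person&#8217; / &#8216;Message&#8217; example; they are not the actual schema of Fig. 3(d).  The sketch only demonstrates that an attribute the data-model did not anticipate (here a hypothetical message_date column) is rejected until the schema itself is altered.</p>
<pre>
```python
import sqlite3

# Hypothetical model-specific schema: the data-model is baked into the tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE message (
        message_id  INTEGER PRIMARY KEY,
        from_person INTEGER NOT NULL REFERENCES person(person_id),
        to_person   INTEGER NOT NULL REFERENCES person(person_id),
        body_text   TEXT NOT NULL
    );
    """
)


def columns(table):
    """Return the column names currently defined for a table."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


before = columns("message")

# An attribute the model did not foresee cannot be stored...
try:
    conn.execute(
        "INSERT INTO message (from_person, to_person, body_text, message_date) "
        "VALUES (1, 2, 'Bring lunch', 'July 4, 2007')"
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# ...until the database schema itself is modified.
conn.execute("ALTER TABLE message ADD COLUMN message_date TEXT")
after = columns("message")
```
</pre>
<p style="text-align: justify">Every new attribute or relationship forces a similar schema change, which is manageable for one small model but not for an open-ended population of models.</p>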
<p style="text-align: justify">In a ULS systems environment, it is simply unreasonable to presume that the data-models or the various processes, either automated or manual, that structure data can be controlled or constrained.  It is also unreasonable to presume that it is possible to anticipate the totality of their breadth or their application. To the contrary, the urgency and diversity driving our Intelligence Enterprise essentially guarantee that as many different methods for extracting entities, relationships, and events will be brought to bear as our imaginations and increasingly powerful technologies can support.  Thus, although we might like to enhance Layer 1 of our Data Integration Framework by exposing all possible extracted elements along with their properties and attributes in order to support efficient querying, introducing an ever-expanding array of fields and tables into a database is as impractical as attempting to accommodate every kind of data and purpose within a single canonical data-model.</p>
<p style="text-align: justify"><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0218-unifiedarch4.png" alt="" /></p>
<p style="text-align: justify">The challenge, therefore, is to build the next layer of the Data Integration Framework to accommodate structured data in a way that exposes that structure for use, without imposing the structure on the data store itself.  In other words, we must determine a method for storing and managing any kind of structured data, reflecting any data-model, so that it can be shared, efficiently exploited, and extended in unforeseen ways without requiring model-specific storage implementations.  In short, we seek a universal storage model for structured data.</p>
<h3>Data-Model Abstraction</h3>
<p style="text-align: justify">The key to devising a domain-neutral storage model for structured data is to decouple that which varies, namely vocabularies and, more generally, the data-models, from that which remains constant, namely the source artifact and, ideally, the storage structure.  To achieve this, we consider structure, vocabulary, semantics, and constraints from a higher level of abstraction, from which we then distill a minimal set of elements sufficient to capture any data-model.  These are defined as follows:</p>
<p style="text-align: justify"><strong><em>Sign: </em></strong>A <em>sign, </em><span style="font-family:Arial; font-size:10pt">g</span>, is a representation of a chunk of data, either physically located within a tangible artifact, or contained within an analyst&#8217;s mind.  Examples of the former include a string of text in a document; an object within an image; a segment of audio in an audio stream; a spike in a signal.  As illustrated in Fig. 4, regardless of the type of medium, a sign for tangible data is always associated with a physical extent within the artifact and has a quantifiable span, which we call a mention. In contrast, signs that reside in an analyst&#8217;s mind become tangible only when she writes down her thoughts.  We explicitly include such intangible signs here to support the analyst&#8217;s ability to assert information directly into the data store without having to first represent it in a physical artifact.  The set of all signs, <span style="font-family:Arial; font-size:10pt">G = {g<sub>i</sub>}</span>, spans across all data sources.  In the set, each element is unique: <span style="font-family:Arial; font-size:10pt">&#8704;i,j (i &#8800; j) g<sub>i</sub> &#8800; g<sub>j</sub></span>.  <span style="font-family:Arial; font-size:10pt">G</span> is the construct by which the DDF represents data.  From the text data shown in Fig. 4, signs <span style="font-family:Arial; font-size:10pt">G&#8217; = {&#8216;Suzi&#8217;, &#8216;Tanya&#8217;, &#8216;July 4, 2007&#8217;, &#8216;Bring lunch&#8217;, &#8216;Message1&#8217;}</span> contribute to <span style="font-family:Arial; font-size:10pt">G</span> (i.e. <span style="font-family:Arial; font-size:10pt">G&#8217; &#8838; G</span>), though many more signs may be identified even from this simple example.</p>
<p style="text-align: justify"><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0218-unifiedarch5.png" alt="" /></p>
<p style="text-align: justify"><strong><em>Concept: </em></strong>A <em>concept, </em><span style="font-family:Arial; font-size:10pt">c</span>, is a representation of an abstract idea, defined explicitly or implicitly by a source data-model.  For example, the nodes of an ontology, the tag set in an XML Schema Document (XSD), and the attribute / table names in a relational database all represent concepts. In the set of all concepts <span style="font-family:Arial; font-size:10pt">C = {c<sub>i</sub>}</span>, each element is unique: <span style="font-family:Arial; font-size:10pt">&#8704;i,j (i &#8800; j) c<sub>i</sub> &#8800; c<sub>j</sub></span>.  From the text data shown in Fig. 4, concepts <span style="font-family:Arial; font-size:10pt">C&#8217; = {&#8216;Message&#8217;, &#8216;Person&#8217;, &#8216;Body_text&#8217;}</span> contribute to the full set of concepts <span style="font-family:Arial; font-size:10pt">C</span> (i.e. <span style="font-family:Arial; font-size:10pt">C&#8217; &#8838; C</span>).</p>
<p style="text-align: justify"><strong><em>Predicate: </em></strong>A <em>predicate, </em><span style="font-family:Arial; font-size:10pt">p</span>, is a representation of an abstract idea used to express a relationship between &#8220;things.&#8221;  Predicates are used in the formation of <em>statements </em>(described below) and may be defined either explicitly or implicitly by a source data-model.  For example, the arcs of an ontology, and the attributes of an XML or database schema represent predicates.  In the set of all predicates <span style="font-family:Arial; font-size:10pt">P = {p<sub>i</sub>}</span>, each element is unique: <span style="font-family:Arial; font-size:10pt">&#8704;i,j (i &#8800; j) p<sub>i</sub> &#8800; p<sub>j</sub></span>.  The text example of Fig. 4 contributes predicates <span style="font-family:Arial; font-size:10pt">P&#8217; = {&#8216;To&#8217;, &#8216;From&#8217;, &#8216;Body&#8217;}</span> to the set of all predicates <span style="font-family:Arial; font-size:10pt">P</span> (i.e. <span style="font-family:Arial; font-size:10pt">P&#8217; &#8838; P</span>).  The only predicate that is &#8220;built into&#8221; (i.e. defined by) our storage model is the <span style="font-family:Arial; font-size:10pt">&#8216;IsInstanceOf&#8217;</span> predicate, which is used to disambiguate <em>signs</em> to form <em>terms</em> as described below.  Concepts and predicates are the constructs by which we link to data-models and, thereby, explicitly expose data-semantics.</p>
<p style="text-align: justify"><strong><em>Term: </em></strong>A <em>term, </em><span style="font-family:Arial; font-size:10pt">t<sub>ij</sub></span>, is an ordered pair <span style="font-family:Arial; font-size:10pt">&lt;g<sub>i</sub>, c<sub>j</sub>&gt;</span> where <span style="font-family:Arial; font-size:10pt">g<sub>i</sub> &#8712; G</span> and <span style="font-family:Arial; font-size:10pt">c<sub>j</sub> &#8712; C</span>.  Each term represents a disambiguated <em>sign</em>.  The process of disambiguation associates a <em>sign</em> with a <em>concept</em> using the <span style="font-family:Arial; font-size:10pt">&#8216;IsInstanceOf&#8217;</span> <em>predicate</em> (though not every sign from <span style="font-family:Arial; font-size:10pt">G</span> is necessarily disambiguated, and not every concept from <span style="font-family:Arial; font-size:10pt">C</span> is necessarily used for disambiguation).  In the set of all terms <span style="font-family:Arial; font-size:10pt">T = {t<sub>ij</sub>}</span>, each element is unique: <span style="font-family:Arial; font-size:10pt">&#8704;i,j,k,l (i &#8800; k</span> or <span style="font-family:Arial; font-size:10pt">j &#8800; l) t<sub>ij</sub> &#8800; t<sub>kl</sub></span>.  The text example of Fig. 4 contributes terms <span style="font-family:Arial; font-size:10pt">T&#8217; = {t<sub>1</sub>, t<sub>2</sub>, t<sub>3</sub>, t<sub>4</sub>}</span>, where <span style="font-family:Arial; font-size:10pt">t<sub>1</sub> = &lt;&#8216;Suzi&#8217;, &#8216;Person&#8217;&gt;, t<sub>2</sub> = &lt;&#8216;Tanya&#8217;, &#8216;Person&#8217;&gt;, t<sub>3</sub> = &lt;&#8216;Bring lunch&#8217;, &#8216;Body_text&#8217;&gt;, t<sub>4</sub> = &lt;&#8216;Message1&#8217;, &#8216;Message&#8217;&gt;</span>, to the complete set of terms <span style="font-family:Arial; font-size:10pt">T</span> (i.e. <span style="font-family:Arial; font-size:10pt">T&#8217; &#8838; T</span>).</p>
<p style="text-align: justify"><strong><em>Statement: </em></strong>A <em>statement</em>, <span style="font-family:Arial; font-size:10pt">s</span>, encodes a binary relationship between a subject and an object mediated by a predicate.  A statement is represented by an ordered triple <span style="font-family:Arial; font-size:10pt">s<sub>ijh</sub> = &lt;subject<sub>i</sub>, predicate<sub>j</sub>, object<sub>h</sub>&gt;</span>.  Among the set of all statements, each element is unique: <span style="font-family:Arial; font-size:10pt">&#8704; i,j,h,l,m,n (i &#8800; l or j &#8800; m or h &#8800; n) &#8658; s<sub>ijh</sub> &#8800; s<sub>lmn</sub></span>.  In our model, subject and object may each be either a <em>term</em> or a <em>statement</em>.  The simplest kind of <em>statement</em> is one in which subject and object are both <em>terms</em>: <span style="font-family:Arial; font-size:10pt">s0<sub>ijh</sub> = &lt;t<sub>i</sub>, p<sub>j</sub>, t<sub>h</sub>&gt;</span>.  <em>Statements</em> in which the object is itself another <em>statement</em> represent reifications: <span style="font-family:Arial; font-size:10pt">s1<sub>klm</sub> = &lt;t<sub>k</sub>, p<sub>l</sub>, s<sub>m</sub>&gt;</span>.  Finally, a <em>statement</em> in which both subject and object are other <em>statements</em> represents a relationship between <em>statements</em>: <span style="font-family:Arial; font-size:10pt">s2<sub>xyz</sub> = &lt;s<sub>x</sub>, p<sub>y</sub>, s<sub>z</sub>&gt;</span>.  The set of all statements is <span style="font-family:Arial; font-size:10pt">S = {s0<sub>ijh</sub>} &#8746; {s1<sub>klm</sub>} &#8746; {s2<sub>xyz</sub>}</span>.  The text example of Fig. 4 shows three <em>statements</em>: <span style="font-family:Arial; font-size:10pt">S&#8217; = {&lt;t<sub>4</sub>, to, t<sub>1</sub>&gt;, &lt;t<sub>4</sub>, from, t<sub>2</sub>&gt;, &lt;t<sub>4</sub>, body, t<sub>3</sub>&gt;}</span>, all with the same subject, which is the <em>term</em> corresponding to the message itself.  These statements contribute to the set of all statements, i.e. <span style="font-family:Arial; font-size:10pt">S&#8217; &#8838; S</span>.</p>
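<p style="text-align: justify">The term and statement definitions above can be illustrated with a minimal Python sketch.  The class names, the <span style="font-family:Arial; font-size:10pt">kind</span> method, and the use of Python sets to enforce uniqueness are assumptions made for illustration; they are not part of the DDF definition itself.</p>

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Term:
    """A sign disambiguated by a concept: t = (sign, concept)."""
    sign: str
    concept: str


@dataclass(frozen=True)
class Statement:
    """An ordered triple (subject, predicate, object)."""
    subject: Union[Term, "Statement"]
    predicate: str
    object: Union[Term, "Statement"]

    def kind(self):
        """Name the form of this statement: 's0', 's1' or 's2'."""
        subject_is_term = isinstance(self.subject, Term)
        object_is_term = isinstance(self.object, Term)
        if subject_is_term and object_is_term:
            return "s0"  # plain statement between two terms
        if subject_is_term:
            return "s1"  # reification: the object is a statement
        if not object_is_term:
            return "s2"  # relationship between two statements
        # A statement subject with a term object is not one of the three named forms.
        return "unclassified"


# The Fig. 4 example: four terms and three statements about one message.
t1 = Term("Suzi", "person")
t2 = Term("Tanya", "person")
t3 = Term("Bring lunch", "Body_text")
t4 = Term("Message1", "message")

# Frozen dataclasses are hashable, so a set keeps each term only once.
terms = {t1, t2, t3, t4, Term("Suzi", "person")}
statements = {
    Statement(t4, "to", t1),
    Statement(t4, "from", t2),
    Statement(t4, "body", t3),
}
```

<p style="text-align: justify">Because equality is structural, adding the term for &#8216;Suzi&#8217; a second time leaves the set unchanged, which mirrors the uniqueness condition stated for <span style="font-family:Arial; font-size:10pt">T</span> and <span style="font-family:Arial; font-size:10pt">S</span>.</p>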
<p style="text-align: justify">
<p style="text-align: justify">Note that the above definitions are formulated to be clear and unambiguous with respect to our particular approach and may not match those found in other literature.  Throughout the paper, we will denote instances of signs, concepts, predicates, terms, and statements using Arial font within single quotes (e.g. <span style="font-family:Arial; font-size:10pt">&#8216;person&#8217;</span>).</p>
<h2>DDF</h2>
<p style="text-align: justify">Abstracted from the milieu of all possible data-models, these elementary constructs (concept, predicate, sign, term, and statement) provide the fixed points of a data reference model that will ultimately form the basis of a practical data integration platform.  We call it the Data Description Framework (DDF).  Despite its simplicity, the DDF is an amazingly rich model that can be viewed from at least two different perspectives.  From one perspective, the DDF encompasses a synergistic combination of two higher order models lying along different dimensions of abstraction – one that is outward-looking (&#8220;extrospective&#8221;), one inward-looking (&#8220;introspective&#8221;).</p>
<p style="text-align: justify">
<p style="text-align: justify">The extrospective portion of the model is a meta-model formed by (a) <span style="font-family:Courier New">C</span> and <span style="font-family:Courier New">P</span>, which look outward to domain knowledge (represented in data / knowledge models), and (b) <span style="font-family:Courier New">G</span>, which looks outward toward the data.  Signs bring data into the DDF as first class entities which may then participate in various, unlimited conceptualizing relationships created by any sort of automated or manual process at any time.  Signs provide a fundamental level of <em>data</em> integration (that traditional approaches lack) resulting from having eliminated data-model barriers.  Concepts and predicates are to domain knowledge what signs are to data.  They are the mechanism by which such knowledge (typically encoded in domain-specific data / knowledge models) is linked into the DDF and exposed by our Data Integration Framework for use and re-use.</p>
<p style="text-align: justify">
<p style="text-align: justify">The introspective portion of the model is a semantic model formed by <span style="font-family:Courier New">T</span> and <span style="font-family:Courier New">S</span> which abstract data-model internals to expose structure in a uniform way.  Terms link instances to concepts, exposing the meaning of the data unambiguously with respect to the original source data-model.  Statements represent semantic relationships about, within, and between disambiguated data elements.</p>
<p style="text-align: justify">Together, the introspective and extrospective models that comprise the DDF enable both horizontal and vertical <em>data</em> integration. The extrospective abstraction bridges data and domain knowledge (vertical integration). The introspective abstraction bridges data structured by various disparate processes (horizontal integration) and binds the two outward-looking faces of the extrospective model to provide a comprehensive data integration model.</p>
<p style="text-align: justify">
<p style="text-align: justify">From the second perspective, the DDF may be regarded as a synergistic combination of two interaction patterns – one that decouples, one that binds.  DDF achieves decoupling in two ways.  First, as a higher order data-model abstraction, DDF effectively decouples data from <em>data-models</em>.  Thus, the DDF can encapsulate any sort of data regardless of the source data-model.  Second, as a higher order <em>data-structure</em>, DDF effectively decouples structured data from data storage structures.  Thus, the DDF can accommodate any data regardless of the source storage structure.  As a result, the DDF provides a practical foundation for implementing a stable database that can accommodate any sort of structured data.</p>
<p style="text-align: justify">The ways in which DDF implements binding are illustrated in Fig. 5.  Specifically, sign <span style="font-family:Arial; font-size:10pt">g</span> binds with concept <span style="font-family:Arial; font-size:10pt">c</span> to form term <span style="font-family:Arial; font-size:10pt">t</span>, and predicate <span style="font-family:Arial; font-size:10pt">p</span> binds with term <span style="font-family:Arial; font-size:10pt">t</span> to form statement <span style="font-family:Arial; font-size:10pt">s</span>.  The diagram also indicates that a predicate may bind a term and a statement to form a reification, or may bind a statement with a statement to form a statement relationship.  These bindings allow data to be integrated within and across data-models and continuously enriched into knowledge.</p>
<p style="text-align: justify">
<p style="text-align: justify"><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0218-unifiedarch6.png" alt="" />Together these interaction patterns make the DDF a powerful yet practical platform for data fusion.  Decoupling gives DDF the character of a universal data store and successive bindings progressively move intelligence artifacts (or their constituent elements) upward through the cognitive hierarchy.  The result is a universal data fusion platform that supports data structured by any means, unrestricted associations within and between them, and increasingly rich semantics.</p>
<h3>Expressiveness</h3>
<p style="text-align: justify">Although the expressiveness of the DDF is sufficient to capture the data and data-semantics of any structured data source, we illustrate this for the relational model since it is the most commonly used.  Similar arguments can be made for other model types, such as hierarchical, object-oriented, and graph.</p>
<p style="text-align: justify">
<p style="text-align: justify">In accordance with common relational formalism [Date 2004], a relation <span style="font-family:Arial; font-size:10pt">R</span> is defined by the set of attributes <span style="font-family:Arial; font-size:10pt">A = {A<sub>i</sub>}</span>.  The subset of attributes that comprises the primary key is denoted <span style="font-family:Arial; font-size:10pt">K = {K<sub>l</sub>}</span>, <span style="font-family:Arial; font-size:10pt">K &#8838; A</span>.  The set of all data values in <span style="font-family:Arial; font-size:10pt">R</span> is <span style="font-family:Arial; font-size:10pt">D = {d<sub>ij</sub>}</span>, where <span style="font-family:Arial; font-size:10pt">d<sub>ij</sub></span> is the value at the intersection of attribute <span style="font-family:Arial; font-size:10pt">A<sub>i</sub></span> and row <span style="font-family:Arial; font-size:10pt">W<sub>j</sub></span>.  We can integrate data and its original semantics from <span style="font-family:Arial; font-size:10pt">R</span> into a DDF data space consisting of <span style="font-family:Arial; font-size:10pt">G<sub>0</sub></span>, <span style="font-family:Arial; font-size:10pt">C<sub>0</sub></span>, <span style="font-family:Arial; font-size:10pt">P<sub>0</sub></span>, <span style="font-family:Arial; font-size:10pt">T<sub>0</sub></span>, and <span style="font-family:Arial; font-size:10pt">S<sub>0</sub></span> according to the following procedure:</p>
<p style="text-align: justify">
<ul>
<li>
<div style="text-align: justify">All attributes of <span style="font-family:Arial; font-size:10pt">R</span> are added to the set of concepts:</div>
<p style="text-align: justify; margin-left: 18pt"><span style="font-family:Arial; font-size:10pt">C = C<sub>0</sub> &#8746; A</span></p>
</li>
<li>
<div style="text-align: justify">Non-key attributes are added to the set of predicates:</div>
<p style="text-align: justify; margin-left: 18pt"><span style="font-family:Arial; font-size:10pt">P = P<sub>0</sub> &#8746; (A &#8722; K)</span></p>
</li>
<li>
<div style="text-align: justify"><span style="font-family:Arial; font-size:10pt">D&#8217; = { d&#8217;<sub>i</sub> }</span> is the set of unique values of <span style="font-family:Arial; font-size:10pt">D</span>: <span style="font-size:10pt"><span style="font-family:MS Mincho">?</span><span style="font-family:Arial">i,j  (i<sub><br />
</sub>? j) d<sup>&#8216;</sup><sub>i </sub>? d<sup>&#8216;</sup><sub>j</sub></span></span> . The values in <span style="font-family:Arial; font-size:10pt">D&#8217;</span> that are not already present in <span style="font-family:Arial; font-size:10pt">G<sub>0</sub></span> are added to the set of signs:</div>
<p style="text-align: justify; margin-left: 18pt"><span style="font-family:Arial; font-size:10pt">G = G<sub>0</sub> U (D&#8217;– G<sub>0</sub>)</span></p>
</li>
<li>
<div style="text-align: justify">We build the set of terms <span style="font-family:Arial; font-size:10pt">T<sub>R</sub> = {t<sub>ij</sub>}</span> where <span style="font-family:Arial; font-size:10pt">t<sub>ij</sub> = &lt;d<sub>ij</sub>, A<sub>i</sub>&gt;</span> and <span style="font-family:Arial; font-size:10pt">1 &#8804; i &#8804; n</span>, <span style="font-family:Arial; font-size:10pt">1 &#8804; j &#8804; m</span>. <span style="font-family:Arial; font-size:10pt">T&#8217;<sub>R</sub></span> is the subset of unique terms of <span style="font-family:Arial; font-size:10pt">T<sub>R</sub></span>. Terms of <span style="font-family:Arial; font-size:10pt">T&#8217;<sub>R</sub></span> are added to <span style="font-family:Arial; font-size:10pt">T<sub>0</sub></span>:</div>
<p style="text-align: justify; margin-left: 18pt"><span style="font-family:Arial; font-size:10pt">T = T<sub>0</sub> &#8746; T&#8217;<sub>R</sub></span></p>
</li>
<li>
<div style="text-align: justify">We build the set of statements <span style="font-family:Arial; font-size:10pt">S<sub>R</sub> =</span> {<span style="font-family:Arial; font-size:10pt">s<sub>ij</sub></span>} where <span style="font-family:Arial; font-size:10pt">s<sub>ij</sub> = &lt; &lt;d<sub>kj</sub>, K&gt;, A<sub>i</sub>, &lt;d<sub>ij</sub>, A<sub>i</sub>&gt; &gt;</span> and <span style="font-family:Arial; font-size:10pt">d<sub>kj</sub></span> represents the combination of values of the key attributes for the row <span style="font-family:Arial; font-size:10pt">W<sub>j</sub></span>.  Statements of  <span style="font-family:Arial; font-size:10pt">S<sub>R</sub></span> are added to <span style="font-family:Arial; font-size:10pt">S<sub>0</sub></span>:</div>
<p style="text-align: justify; margin-left: 18pt"><span style="font-family:Arial; font-size:10pt">S = S<sub>0</sub> &#8746; S<sub>R</sub></span></p>
</li>
</ul>
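<p style="text-align: justify">The five steps above can be sketched in Python as follows.  This is only an illustration under stated assumptions: the DDF data space is held as five in-memory sets, a term is a plain (value, concept) pair, the combined key value is encoded as a tuple, and the function name, argument names, and sample data are hypothetical.  The text does not say whether the composite key term is itself added to <span style="font-family:Arial; font-size:10pt">T</span>, so the sketch builds it only as a statement subject.</p>

```python
def integrate_relation(attributes, key, rows, ddf=None):
    """Fold one relation into a DDF data space held as five sets.

    attributes: ordered attribute names A
    key:        ordered primary-key attribute names K (a subset of A)
    rows:       list of dicts mapping attribute name to value
    ddf:        existing space with keys G, C, P, T, S (G0, C0, ... in the text)
    """
    ddf = ddf or {"G": set(), "C": set(), "P": set(), "T": set(), "S": set()}
    non_key = [a for a in attributes if a not in key]

    ddf["C"] |= set(attributes)  # step 1: all attributes become concepts
    ddf["P"] |= set(non_key)     # step 2: non-key attributes become predicates

    for row in rows:
        # Combined key value d_kj paired with K; the tuple encoding is an assumption.
        subject = (tuple(row[k] for k in key), tuple(key))
        for a in attributes:
            ddf["G"].add(row[a])       # step 3: unique values become signs
            ddf["T"].add((row[a], a))  # step 4: term (d_ij, A_i)
        for a in non_key:
            # step 5: statement (key term, A_i, value term)
            ddf["S"].add((subject, a, (row[a], a)))
    return ddf


# Made-up sample relation with key attribute "id".
people = integrate_relation(
    attributes=["id", "name", "city"],
    key=["id"],
    rows=[
        {"id": 1, "name": "Suzi", "city": "Albany"},
        {"id": 2, "name": "Tanya", "city": "Albany"},
    ],
)
```

<p style="text-align: justify">Using sets makes the uniqueness conditions automatic: the value &#8216;Albany&#8217; occurs in two rows but yields a single sign and a single term.</p>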
<p style="text-align: justify">
<p style="text-align: justify">The representation of <span style="font-family:Arial; font-size:10pt">R</span> in DDF is lossless (no loss or distortion of data and semantics, even though the semantics of <span style="font-family:Arial; font-size:10pt">R</span> are not explicitly represented in DDF) because we can restore <span style="font-family:Arial; font-size:10pt">R</span> from DDF:</p>
<ol>
<li>
<div style="text-align: justify"><span style="font-family:Arial; font-size:10pt">R</span> is contained in the statements <span style="font-family:Arial; font-size:10pt">S</span>.  Using process metadata (described in the following section and shown in Fig. 6), extract from <span style="font-family:Arial; font-size:10pt">S</span> the statements that originated from <span style="font-family:Arial; font-size:10pt">R</span>:</div>
<p style="text-align: justify; margin-left: 18pt"><span style="font-family:Arial; font-size:10pt">S<sub>R</sub> = {s<sub>ij</sub>}</span> where <span style="font-family:Arial; font-size:10pt">s<sub>ij</sub> = &lt; &lt;d<sub>kj</sub>, K&gt;, A<sub>i</sub>, &lt;d<sub>ij</sub>, A<sub>i</sub>&gt; &gt;</span></p>
</li>
<li>
<div style="text-align: justify">From <span style="font-family:Arial; font-size:10pt">S<sub>R</sub></span> restore the structure and rows of <span style="font-family:Arial; font-size:10pt">R</span> as follows:</div>
</li>
</ol>
<div>
<table style="border-collapse:collapse" border="0">
<colgroup><col style="width: 65px;"></col><col style="width: 65px;"></col><col style="width: 65px;"></col><col style="width: 65px;"></col><col style="width: 65px;"></col></colgroup>
<tbody>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  solid 0.5pt; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family: Times New Roman; font-size: 10pt; text-decoration: underline;"><strong>K</strong></span></p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Arial; font-size:10pt"><strong>A<sub>k+1</sub></strong></span></p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Arial; font-size:10pt"><strong>A<sub>k+2</sub></strong></span></p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Times New Roman; font-size:10pt"><strong>…</strong></span></p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Arial; font-size:10pt"><strong>A<sub>n</sub></strong></span></p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  solid 0.5pt; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k1</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+1,1</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+2,1</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>n1</sub></span></td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  solid 0.5pt; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k2</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+1,2</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+2,2</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>n2</sub></span></td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  solid 0.5pt; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  solid 0.5pt; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>km</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+1,m</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+2,m</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>nm</sub></span></td>
</tr>
</tbody>
</table>
</div>
<p>The process that was used to build combinations of values of the key attributes can be reversed to get to the relation in its original form:</p>
<div>
<table style="border-collapse:collapse" border="0">
<colgroup><col style="width: 65px;"></col><col style="width: 65px;"></col><col style="width: 65px;"></col><col style="width: 65px;"></col><col style="width: 65px;"></col><col style="width: 65px;"></col><col style="width: 65px;"></col></colgroup>
<tbody>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  solid 0.5pt; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Arial; font-size:10pt"><strong>A<sub>1</sub></strong></span></p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Times New Roman; font-size:10pt"><strong>. . .</strong></span></p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Arial; font-size:10pt"><strong>A<sub>k</sub></strong></span></p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Arial; font-size:10pt"><strong>A<sub>k+1</sub></strong></span></p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Arial; font-size:10pt"><strong>A<sub>k+2</sub></strong></span></p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Times New Roman; font-size:10pt"><strong>…</strong></span></p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  solid 0.5pt; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt">
<p style="text-align: center"><span style="font-family:Arial; font-size:10pt"><strong>A<sub>n</sub></strong></span></p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  solid 0.5pt; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>11</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k1</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+1,1</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+2,1</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>n1</sub></span></td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  solid 0.5pt; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>12</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k2</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+1,2</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+2,2</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>n2</sub></span></td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  solid 0.5pt; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  solid 0.5pt; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>1m</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>km</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+1,m</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>k+2,m</sub></span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Times New Roman; font-size:10pt">. . .</span></td>
<td style="padding-left: 7px; padding-right: 7px; border-top:  none; border-left:  none; border-bottom:  solid 0.5pt; border-right:  solid 0.5pt"><span style="font-family:Arial; font-size:10pt">d<sub>nm</sub></span></td>
</tr>
</tbody>
</table>
</div>
<p style="text-align: justify">
<p style="text-align: justify">Therefore, by the integration procedure described above, the data and data-semantics from <span style="font-family:Arial; font-size:10pt">R</span> are faithfully represented with the DDF.  The structure of <span style="font-family:Arial; font-size:10pt">R</span> itself and its identity integrity are explicitly captured in Layer 3.</p>
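<p style="text-align: justify">The restoration argument can likewise be sketched in Python.  The sketch assumes that the statements originating from <span style="font-family:Arial; font-size:10pt">R</span> have already been selected (the text does this with process metadata), that each statement is a nested tuple of the form ((key values, key attributes), attribute, (value, attribute)), and that every row has at least one non-key attribute.  The function name and sample data are hypothetical.</p>

```python
def restore_relation(statements, attributes, key):
    """Rebuild the rows of R from the statements that originated from it."""
    rows = {}
    for subject, predicate, obj in statements:
        key_values, _ = subject
        # First table in the text: one row per combined key value.
        # Unpacking the key back into its attributes gives the second table.
        row = rows.setdefault(key_values, dict(zip(key, key_values)))
        value, attribute = obj
        row[attribute] = value
    # Emit the columns in the original attribute order; row order is arbitrary.
    return [[row.get(a) for a in attributes] for row in rows.values()]


# Made-up statements for a relation with key attribute "id".
s_r = {
    (((1,), ("id",)), "name", ("Suzi", "name")),
    (((1,), ("id",)), "city", ("Albany", "city")),
    (((2,), ("id",)), "name", ("Tanya", "name")),
    (((2,), ("id",)), "city", ("Albany", "city")),
}
table = restore_relation(s_r, ["id", "name", "city"], ["id"])
```

<p style="text-align: justify">Grouping by the combined key value corresponds to the first table above, and unpacking that value into the individual key attributes corresponds to the second.</p>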
<p style="text-align: justify">
<p style="text-align: justify">This procedure further reveals two powerful and <em>distinguishing</em> features of the DDF:</p>
<ul>
<li>
<div style="text-align: justify">The DDF can accommodate data and data-semantics from structured sources without loss or distortion.</div>
</li>
<li>
<div style="text-align: justify">Data sources may be integrated within the DDF in a mechanical fashion without requiring prior knowledge or analysis of their domain-specific data-models.</div>
</li>
</ul>
<p style="text-align: justify">
<h3>Towards Implementation</h3>
<p style="text-align: justify">A universal storage model based on DDF can be implemented in a variety of ways (e.g. objects, relations, triples).  We chose to use a relational Dimensional Data Modeling (DDM) approach [Kimball 2002] mainly because it handily accommodates the capture and use of the kinds of metadata that the Intelligence Community favors.  In particular, we need to maintain not only metadata about the indigenous artifact itself (e.g. the who, what, when, and where of its creation and transmission), but also metadata regarding the processing by which signs, terms, and statements are created.  The former (i.e. contextual metadata) is captured in the Layer 1 storage structure as described previously.  The latter, which we term &#8220;process metadata,&#8221; must be accommodated in Layer 2.</p>
<p style="text-align: justify">
<p style="text-align: justify">A high-level, conceptual database view of the DDF storage model based on the DDM design pattern is depicted in Fig. 6.  Before discussing the diagram in detail, we begin with a very brief overview of the DDM.  In general, the DDM is a business-process-centric database design pattern that aims to decouple rapidly changing business metrics (e.g. stock quantities) from slowly changing business objects (e.g. stock items).  For each business process, it uses a star schema consisting of a central &#8220;fact-table&#8221; for storing quantitative metrics, linked to multiple &#8220;dimension-tables&#8221; for storing descriptive objects.  The DDM as a pattern is most effective when dimensions are re-usable across business processes and a natural separation of time scales exists between the rate at which new facts are added (fast) and the rate at which dimensions change (slow).</p>
<p style="text-align: justify">
<p style="text-align: justify">As reflected in Fig. 6, the essential intelligence &#8220;business processes&#8221; that the DDF captures are semantic disambiguation and association formation.  Thus, the DDF storage model consists of two main fact-tables, <span style="font-family:Arial; font-size:10pt">SemanticFact</span> and <span style="font-family:Arial; font-size:10pt">AssociationFact</span>. The <span style="font-family:Arial; font-size:10pt">SemanticFact</span> table records metrics relating to the formation and disambiguation of signs, and references dimension tables that record signs, concepts, and process metadata.  The signs themselves are represented using two tables, <span style="font-family:Arial; font-size:10pt">Sign</span> and <span style="font-family:Arial; font-size:10pt">Mention</span>.  The value of a mention is identified by the region of the artifact in which it is localized.  The boundary of such a region is recorded in the <span style="font-family:Arial; font-size:10pt">Mention</span> table.  The value of a sign may represent any number of source mentions that are exactly the same or are considered to be the same from the perspective of the process that extracts or identifies them.  The <span style="font-family:Arial; font-size:10pt">Concept</span> dimension records elements from the domain knowledge, which includes the source artifacts&#8217; data-models.  Each record in the <span style="font-family:Arial; font-size:10pt">SemanticFact</span> table binds a sign to a concept using <span style="font-family:Arial; font-size:10pt">&#8216;isInstanceOf&#8217;</span> semantics.</p>
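<p style="text-align: justify">The relationships among these tables can be sketched as follows.  This is only a rough illustration under stated assumptions: the table names follow the text, but every column name, key, and sample value is hypothetical, the process metadata is collapsed into a single table, and the actual physical schema of Fig. 6 may differ.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sign (
        sign_id INTEGER PRIMARY KEY,
        value   TEXT NOT NULL
    );
    -- One sign may stand for many mentions; each mention is localized in an artifact.
    CREATE TABLE Mention (
        mention_id  INTEGER PRIMARY KEY,
        sign_id     INTEGER NOT NULL REFERENCES Sign(sign_id),
        artifact_id INTEGER NOT NULL,   -- hypothetical key of a Layer 1 artifact
        region      TEXT NOT NULL       -- boundary of the region (format assumed)
    );
    CREATE TABLE Concept (
        concept_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    -- Simplified stand-in for the ProcessMetadata dimensions.
    CREATE TABLE ProcessMetadata (
        process_id INTEGER PRIMARY KEY,
        creator    TEXT NOT NULL
    );
    -- Each row binds a sign to a concept with 'isInstanceOf' semantics.
    CREATE TABLE SemanticFact (
        sign_id    INTEGER NOT NULL REFERENCES Sign(sign_id),
        concept_id INTEGER NOT NULL REFERENCES Concept(concept_id),
        process_id INTEGER NOT NULL REFERENCES ProcessMetadata(process_id)
    );
""")

conn.execute("INSERT INTO Sign VALUES (1, 'IBM')")
conn.executemany("INSERT INTO Mention VALUES (?, ?, ?, ?)",
                 [(1, 1, 7, "chars 0-3"), (2, 1, 9, "chars 120-123")])
conn.execute("INSERT INTO Concept VALUES (1, 'Company')")
conn.execute("INSERT INTO ProcessMetadata VALUES (1, 'text-extractor-A')")
conn.execute("INSERT INTO SemanticFact VALUES (1, 1, 1)")

# For each sign-to-concept binding, count the source mentions behind the sign.
bindings = conn.execute("""
    SELECT s.value, c.name, COUNT(m.mention_id)
    FROM SemanticFact f
    JOIN Sign s    ON s.sign_id = f.sign_id
    JOIN Concept c ON c.concept_id = f.concept_id
    JOIN Mention m ON m.sign_id = s.sign_id
    GROUP BY s.value, c.name
""").fetchall()
print(bindings)
```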
<p style="text-align: justify">
<p style="text-align: justify"><img src="http://systover.net/blog/wp-content/uploads/2009/01/010209-0218-unifiedarch7.png" alt="" /></p>
<p style="text-align: justify">The <span style="font-family:Arial; font-size:10pt">AssociationFact</span> table records metrics relating to the formation of associations and references dimension tables that record statements, predicates, and process metadata.  Recall that statements come in three types – an association between terms (i.e. statement), an association between a term and another statement (i.e. reification), and an association between two statements (i.e. statement relation).  These are accommodated by the three subclasses of the <span style="font-family:Arial; font-size:10pt">Statement</span> dimension, namely <span style="font-family:Arial; font-size:10pt">Statement0, Statement1, </span>and <span style="font-family:Arial; font-size:10pt">Statement2 </span>respectively.  The <span style="font-family:Arial; font-size:10pt">Predicate</span> dimension records predicates from the domain knowledge.</p>
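<p style="text-align: justify">The three statement types can be illustrated with a small object sketch.  The class names mirror the subclasses named above, and a term is taken to be a (sign, concept) pair; the field names, the choice of which side of a reification holds the term, and all sample values are assumptions made purely for illustration.</p>

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Term:
    sign: str
    concept: str


@dataclass(frozen=True)
class Statement0:
    """Association between two terms."""
    subject: Term
    predicate: str
    object: Term


@dataclass(frozen=True)
class Statement1:
    """Reification: association between a term and another statement."""
    subject: Term            # assumed: the term is placed on the subject side
    predicate: str
    object: "Statement"


@dataclass(frozen=True)
class Statement2:
    """Statement relation: association between two statements."""
    subject: "Statement"
    predicate: str
    object: "Statement"


Statement = Union[Statement0, Statement1, Statement2]

# Hypothetical sample data.
ibm = Term("IBM", "Company")
s0 = Statement0(ibm, "locatedIn", Term("Armonk", "City"))
s0_alt = Statement0(ibm, "locatedIn", Term("Austin", "City"))
s1 = Statement1(Term("analyst-17", "Analyst"), "asserts", s0)
s2 = Statement2(s0, "conflictsWith", s0_alt)

for stmt in (s0, s1, s2):
    print(type(stmt).__name__, stmt.predicate)
```

<p style="text-align: justify">Because statements may themselves appear inside other statements, assertions about provenance or about conflicts between sources can be recorded without altering the statements they refer to.</p>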
<p style="text-align: justify">
<p style="text-align: justify">The <span style="font-family:Arial; font-size:10pt">ProcessMetadata</span> package shown in Fig. 6 represents a collection of dimensional tables used to record operational and contextual metadata about the various external processes that create <span style="font-family:Arial; font-size:10pt">SemanticFact</span> and <span style="font-family:Arial; font-size:10pt">AssociationFact</span> records.  The particular elements and formulation of this metadata would be designed to support the information assurance needs of the Intelligence Community.  Typically these would include <span style="font-family:Arial; font-size:10pt">Date</span>, <span style="font-family:Arial; font-size:10pt">Time</span>, <span style="font-family:Arial; font-size:10pt">Creator</span>, and <span style="font-family:Arial; font-size:10pt">SecurityClassification</span> dimensions.</p>
<p style="text-align: justify">
<p style="text-align: justify">The DDF does not prescribe or constrain the processing by which the DDF storage model would be populated, and the nature of such processing depends on both the modality and the structure (or lack thereof) of the indigenous artifacts.  Nevertheless, to illustrate how DDF works, and to provide more insight into the relationship between external processes and our Data Integration Framework, the Appendix offers a brief discussion of the processing by which Layers 1 and 2 would be populated.</p>
<p style="text-align: justify">
<h2>Relation to Other Approaches</h2>
<p style="text-align: justify">A large body of work exists on data integration approaches [Batini 1986, <span style="font-family:Times New Roman">Parent 1998, Halevy 2005, </span>Bernstein 2007], many of which have contributed to successful Enterprise Information Integration solutions. However, because they are all based on some kind of data-model harmonization (i.e. mapping), they fail to provide a practical solution for ULS intelligence data integration.  In particular, data-model integration does not address <em>data</em> integration, which intelligence data processing requires. Physical data integration, typical of data warehouse applications, also requires heavy up-front data-model analysis and harmonization.  This activity is not only resource intensive, it often results in the loss and/or distortion of data and its semantics which, in the context of intelligence, may reduce the richness and power of the data.  DDF addresses the needs of the Intelligence Community by providing ad-hoc, lossless data integration without imposing a heavy pre-processing burden.</p>
<p style="text-align: justify"><span style="font-size:12pt">Because they are born from a similar abstraction, the elementary constructs at the foundation of our reference model &#8220;share DNA&#8221; with those of the Resource Description Framework (RDF) [RDF 2004].  In particular, DDF </span><span style="font-family:Arial; font-size:11pt">terms</span><span style="font-size:12pt"> are cousin to RDF </span><span style="font-family:Arial; font-size:11pt">resources</span><span style="font-size:12pt"> –  both existing at the atomic level of data as so-called &#8220;first class citizens&#8221; which may participate in arbitrary associations.  However, whereas RDF aims at exposing machine-processable semantics and supporting logical inference, DDF aims at data integration and breaking the barriers between data sources. Consequently, DDF reaches further down into data to explicitly capture the grounding of </span><span style="font-family:Arial; font-size:11pt">terms</span><span style="font-size:12pt"> within artifacts (and analyst&#8217;s thought) through the use of </span><span style="font-family:Arial; font-size:11pt">signs</span><span style="font-size:12pt">, and reaches up more broadly into knowledge models to expose data-semantics regardless of their machine processability. The fundamental difference is that RDF is an instance of a language for expressing semantic relationships, while DDF is a framework for data integration that can accommodate data represented by any language.  Thus, while DDF powerfully supports RDF, it neither requires nor replaces it.<br />
</span></p>
<p style="text-align: justify">
<p style="text-align: justify">The Object Management Group has defined four increasing levels of software program abstraction from implementation / platform to pure abstract model [MOF 2000].  Decoupling the program model from the implementation makes it possible to develop tooling that can automatically generate platform specific implementations by combining the program model with implementation specific configuration information.   Essentially a program instance = abstract model + specific &#8220;configuration&#8221; data.  In the case of DDF, we present increasing levels of abstraction of structured data from implementation / representation to pure conceptual model (i.e. from Layer 1 to Layer 3).  Decoupling the conceptual model from the implementation makes it possible to store variously structured data within a single DB.  Essentially, structured data = data + abstract conceptual model (i.e. DDF)  + specific data-model.</p>
<p style="text-align: justify">
<p style="text-align: justify">The Information Model Interoperability Reference Model [Melnik 2000; Omelayenko 2001], proposed for presenting information on the web, consists of three layers –  syntax, object, and semantic.  The syntax layer represents serialized data content, similar to our indigenous text artifacts.  The semantic layer provides semantics through data-models and languages, and the object layer provides a bridge between the two.  In contrast to the DDF however, the IMI does not provide a practical model for implementation of the layers and their interfaces.</p>
<p style="text-align: justify">
<p style="text-align: justify">The Data Reference Model (DRM) of the Federal Enterprise Architecture (FEA) aims to provide standards for the description, categorization, and sharing of data [DRF 2005].  Like DDF, the DRM entails a data-model metamodel, but unlike DDF it does not resolve the issues of data integration and unfortunately exhibits the typical shortcomings of most physical and virtual data integration approaches.</p>
<p style="text-align: justify">
<p>Finally, the Common Warehouse Model (CWM) [CWM 2001] offers a standardized approach (and tools that support it) for representing and mediating the automated interchange of metadata in warehouse applications that involve multiple data sources and data processing applications.   Being focused on metadata integration, as opposed to data integration, the CWM mainly addresses issues relating to Layer 3 of our Data Integration Framework.</p>
<h2>Current &amp; Future Work</h2>
<p style="text-align: justify">Today there is a deployed system called the Joint Intelligence Operational Capability in Iraq (JIOC-I) that essentially implements Layer 1 of our Data Integration Framework, though only for text artifacts.  Unfortunately, the JIOC-I by itself falls short of a complete integration solution because it does not address structured data in a way that exposes that structure to support further analytical processing and visualization.  In other words, it lacks Layer 2.  Consequently, there has been much criticism of the JIOC-I, along with various suggestions for &#8220;fixing&#8221; it (e.g. by extending the schema to accommodate structured data).  In contrast, we recognize the JIOC-I as a foundational element (that got it mostly right) and a first step toward a ULS intelligence system that integrates data while embracing data diversity.  Indeed, the JIOC-I was the inspiration that led us to develop the layers above, and the DDF in particular.</p>
<p style="text-align: justify">
<p style="text-align: justify">Implementations of Layers 1 and 2 of our Data Integration Framework are being developed and tested in the Army CERDEC I2WD Information Exploitation Futures Laboratory (IXFL).  As there are many possible physical implementations of the logical model, the challenge is to find one that optimally satisfies the functional (e.g. usability) and non-functional requirements (e.g. performance, manageability, and maintainability) of the Intelligence Community.  Beyond the physical schema development, we have implemented a data ingest system along with processes for structuring unstructured data in order to fully exercise the system.</p>
<p style="text-align: justify">
<p style="text-align: justify">Other key aspects of our Data Integration Framework are described elsewhere.  [Yoakum 2008 IQIS] highlights the low barrier to entry for data integration by describing the process for lossless mechanical data ingestion which requires no costly pre-processing or data-model harmonization. Data surfing, drilling, and discovery on the DDF unified data space are described in [Yoakum 2008 IQIS].  Finally, [Yoakum 2008 SIMA] addresses the utility of DDF in Situation Management – another activity that requires rapid, ad-hoc data integration.  Forthcoming papers will address Layer 3, insight and results from our DDF prototype work, and fundamental aspects relating to knowledge representation.</p>
<p style="text-align: justify">
<p style="text-align: justify">As they are developed, Layers 3 and 4 of our Data Integration Framework will provide fertile ground for entirely new work in knowledge interaction and perception.  Layer 3 will become a universal substrate on which to explore, discover, and encode relationships between knowledge models that go well beyond harmonization and integration to include, for example, dissonant perspectives which cannot and should not be &#8220;harmonized.&#8221;  Layer 4 provides the lenses through which the human user looks into this morass of knowledge, information, and data to explore and make sense of the object of his interest (e.g. a domain, situation, entity) according to a chosen perspective. Having all four layers present will close the loop between data and knowledge in both directions so that they may co-evolve to yield more complete and accurate understanding.  Atop the immense foundation of integrated data provided by Layers 1 and 2, Layers 3 and 4 will fuel the engines of ULS systems research for a very long way into the future.</p>
<h2>Conclusion</h2>
<p style="text-align: justify">The Intelligence Enterprise is inexorably evolving into an Ultra Large Scale Systems world that cannot, and will not, be constrained in its processes or products.  The data integration problem is but one early symptom of this burgeoning reality.  Although this knowledge does not provide a recipe for good solutions, it makes it rather easy to spot bad ones.  Unfortunately, current data integration approaches generally represent the latter.</p>
<p style="text-align: justify">
<p style="text-align: justify">In this paper, we have presented the first two layers of a multi-layer Data Integration Framework that enables deep semantic data integration in a ULS systems environment.  The model on which it is founded, the DDF, supports both horizontal and vertical data integration (i.e. across disparate data-models and from data to knowledge) by embracing the diversity of data / knowledge models and processes by which data is structured.  More importantly, the model admits a practical implementation (i.e. &#8220;hard running code&#8221;) that accommodates artifacts of any modality (e.g. text, audio, images, video, signals) in a single unified data store that enables true data fusion and the continuous enrichment of data into knowledge.  Awash in a sea of fragmented data, and driven by a palpable sense of urgency, we aspire to drive both the theory and practice of data integration forward.</p>
<h2>References</h2>
<p style="margin-left: 28pt"><span style="font-family:Times New Roman">[Batini 1986] Batini, C. <em>et al</em>. <em>A comparative analysis of methodologies for database schema integration</em>, ACM Computing Surveys, (18) 4, 1986.<br />
</span></p>
<p style="margin-left: 28pt">
<p>[Bernstein 2007] Bernstein P., Ho, H<span style="font-family:Arial">. </span><em>Model Management and Schema Mappings: Theory and Practice</em>, Proceedings of VLDB Conference, 2007.</p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt">[CWM 2001] Object Management Group &#8220;Common Warehouse Model (CWM) Specification&#8221;, OMG, 2001. <a href="http://www.omg.org/docs/ad/01-02-01.pdf">http://www.omg.org/docs/ad/01-02-01.pdf</a></p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt">[Date 2004] Date, C. <em>An Introduction to Database Systems, 8<sup>th</sup> edition, </em>Addison Wesley, 2004.</p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt">[DRF 2005] Federal Enterprise Architecture Program &#8220;The Data Reference Model&#8221;, 2005. <a href="http://www.whitehouse.gov/omb/egov/documents/DRM_2_0_Final.pdf">http://www.whitehouse.gov/omb/egov/documents/DRM_2_0_Final.pdf</a></p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt"><span style="font-family:Times New Roman">[Halevy 2005] Halevy, A. <em>et al</em>. <em>Enterprise information integration: successes, challenges and controversies</em>, Proceedings of 24th International Conference on Management of Data, Baltimore, 2005.<br />
</span></p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt">[Izydor 2007] <a href="http://www.dmreview.com/authors/1086246.html">Izydor</a>, C.  and <a href="http://www.dmreview.com/authors/1086247.html"> McCollum</a>, P. <em>B<span style="color:#373632">I, Process and Integration Trends</span></em>. DM Review Magazine, August 2007. <a href="http://www.dmreview.com/issues/20070801/1089409-1.html?portal=data_integration">http://www.dmreview.com/issues/20070801/1089409-1.html?portal=data_integration</a></p>
<p style="margin-left: 28pt">
<p style="margin-left: 36pt">[Kimball 2002]  Kimball, R. and Ross, M. <em>The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling</em>,  Wiley,  2002.</p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt"><span style="color:black">[Lee 2006] Lee, Y., Pipino, L., Funk, J., Wang, R. <em>Journey to Data Quality</em>, The MIT Press, Cambridge, MA,  2006</span></p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt">[Melnik 2000]  Melnik, S. and Decker, S.  <em>A layered approach to Information Modeling and Interoperability on the Web</em>. Proc. ECDL&#8217;00 Workshop on the Semantic Web, Lisbon, Portugal, Sept 2000. <a href="http://infolab.stanford.edu/~melnik/pub/sw00/">http://infolab.stanford.edu/~melnik/pub/sw00/</a>.</p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt">[MOF 2000] Object Management Group &#8220;MetaObject Facility (MOF) Specification&#8221;, OMG, 2000. <a href="http://www.omg.org/docs/formal/00-04-03.pdf">http://www.omg.org/docs/formal/00-04-03.pdf</a></p>
<p style="margin-left: 36pt">[Northrop 2006]  Northrop, L., <em>et al.</em>, <em>Ultra-Large-Scale Systems The Software Challenge of the Future</em>,  Pittsburgh: Carnegie Mellon University,  2007. <a href="http://www.sei.cmu.edu/publications/books/engineering/uls.html">http://www.sei.cmu.edu/publications/books/engineering/uls.html</a></p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt">[Omelayenko 2001]  Omelayenko, B. and Fensel, D.  <em>An Analysis of B2B Catalogue Integration Problems.</em> Proceedings of the International Conference on Enterprise Information Systems (ICEIS-2001), July 7-10, 2001, p. 945-952.</p>
<p style="margin-left: 36pt">
<p style="margin-left: 28pt"><span style="font-family:Times New Roman">[Parent 1998] Parent, C. and Spaccapietra, S. <em>Issues and approaches of database integration</em>, Communications of the ACM, 41(5), 1998.<br />
</span></p>
<p style="margin-left: 36pt">
<p style="margin-left: 28pt"><span style="font-family:Times New Roman">[RDF 2004] </span><span style="color:black">RDF Core Working Group</span><span style="font-family:Times New Roman"> &#8220;Resource Description Framework (RDF)&#8221;, W3C, 2004. <a href="http://www.w3.org/RDF/">http://www.w3.org/RDF/</a>.<br />
</span></p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt">[Steinberg 1998]  <span style="font-family:Times New Roman">Steinberg, N.,  Bowman, C. L. and White F. E. <em>Revision to the JDL Data Fusion Model</em>, Joint NATO/IRIS Conference, Quebec City, October 1998.<br />
</span></p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt">[Yero 2008] Yero, J. <span style="color:black"><em>Logical vs. Physical Data Integration: A Practical Decision Guide</em>,  The DAMA International Symposium &amp; Wilshire Meta-Data Conference. San-Diego, CA, 2008.<br />
</span></p>
<p style="margin-left: 28pt">
<p style="margin-left: 28pt">[Yoakum 2008 IQIS] Yoakum-Stover, S. and Malyuta, T. <em>Unified Architecture for Integrating Intelligence Data</em>, Proceedings of MIT Information Quality Industry Symposium, MIT, Cambridge, MA, 2008.</p>
<p style="margin-left: 28pt">[Yoakum 2008 DAMA] Yoakum-Stover, S. and Malyuta, T. <em>Unified Integration Architecture for Intelligence Data</em>, Proceedings of DAMA International Europe Conference, London, UK, 2008.</p>
<p style="margin-left: 28pt">[Yoakum 2008 SIMA] Yoakum-Stover, S. and Malyuta, T. <em>Unified Data Integration for Situation Management</em>, Proceedings of the 4th IEEE Workshop on Situation Management (SIMA 2008) at MILCOM 2008, San Diego CA, 2008.</p>
<h2>Appendix - Processing</h2>
<h3>Ingestion</h3>
<p style="text-align: justify">Consider first, processes that load indigenous artifacts into Layer 1 either physically or virtually so that they may be unambiguously referenced within Layer 2. Typically these are called ingestion processes. Such processes insert either the entire indigenous artifact, or a reference to its location within the authoritative data source, into Layer 1.  In addition, both artifact and process metadata are recorded in the appropriate metadata tables.  The former essentially provides a card catalogue for the artifact and the latter provides information assurance.</p>
<p style="text-align: justify">
<h3>Unstructured Information</h3>
<p style="text-align: justify">Processes that structure unstructured artifacts generate SemanticFact and AssociationFact records in Layer 2.  Each such process necessarily entails a particular data-model.  This data-model is persisted in Layer 3.  Concepts and predicates from the data-model (or references to them) are also persisted in the <span style="font-family:Arial; font-size:10pt">Concept</span> and <span style="font-family:Arial; font-size:10pt">Predicate</span> dimension tables of Layer 2 along with sufficient metadata to identify and retrieve the data-model source artifact (e.g. schema, ontology, etc.).</p>
<p style="text-align: justify">
<p style="text-align: justify">Unstructured information processing typically identifies all instances of the concepts within its data-model or type system.  For example, a given text extractor may identify all occurrences of  <span style="font-family:Arial; font-size:10pt">&#8216;IBM&#8217; </span>and associate them with the concept<span style="font-family:Arial; font-size:10pt"> &#8216;Company.&#8217; </span> Each such instance is represented as a DDF mention.  The position of each mention within the source artifact is recorded in the <span style="font-family:Arial; font-size:10pt">Mention</span> table (e.g. using <span style="font-family:Arial; font-size:10pt">beginChar</span>, <span style="font-family:Arial; font-size:10pt">endChar</span>) and a single record is added to the <span style="font-family:Arial; font-size:10pt">Sign</span> table using, for example, the actual contents of the span (<span style="font-family:Arial; font-size:10pt">&#8216;IBM&#8217;) </span>as the sign value.  Each disambiguation occurrence (i.e. the association made by the text extractor between a mention and a concept) is recorded in the <span style="font-family:Arial; font-size:10pt">SemanticFact</span> table along with appropriate process metadata, and a term consisting of <span style="font-family:Arial; font-size:10pt">&lt;sign, concept&gt; </span>is created in the <span style="font-family:Arial; font-size:10pt">Term</span> table (if such a term does not already exist).</p>
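<p style="text-align: justify">As a rough sketch of this step, the toy extractor below locates every occurrence of a surface string in a text artifact and produces the mentions, the single sign, and the term described above.  The function name, the dictionary layout, the sample text, and the convention that endChar is exclusive are all assumptions; a real extractor would be considerably more sophisticated than an exact string match.</p>

```python
import re


def extract_mentions(text, surface, concept):
    """Hypothetical extractor: exact-match search for one surface string."""
    # One mention per occurrence, localized by character offsets
    # (endChar is exclusive here, which is an assumed convention).
    mentions = [
        {"beginChar": m.start(), "endChar": m.end()}
        for m in re.finditer(re.escape(surface), text)
    ]
    # A single sign stands for all mentions with the same content.
    sign = {"value": surface}
    # The term pairs the sign value with the concept it instantiates.
    term = (sign["value"], concept)
    return sign, mentions, term


text = "IBM opened a lab. Analysts expect IBM to expand."
sign, mentions, term = extract_mentions(text, "IBM", "Company")

print(sign, term)
for mention in mentions:
    print(mention, text[mention["beginChar"]:mention["endChar"]])
```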
<p style="text-align: justify">
<p style="text-align: justify">Further semantic processing may identify relationships between elements within the artifact.  The elements themselves would have already been recorded as SemanticFacts.   For each such relationship, an AssociationFact is recorded along with appropriate process metadata, and a <span style="font-family:Arial; font-size:10pt">Statement</span> table entry is created.</p>
<p style="text-align: justify">
<p style="text-align: justify">Unstructured information processing of non-text artifacts is similar, the main differences being that entries in the <span style="font-family:Arial; font-size:10pt">Mention</span> table will have a different <span style="font-family:Arial; font-size:10pt">spanCoordinateType</span>, and the method for assigning a sign value will be different. For example, consider object recognition software that extracts faces from within an image of a crowd.  For each extracted face, the corresponding rectangular area of the image could be recorded in the <span style="font-family:Arial; font-size:10pt">Mention</span> table with the help of <span style="font-family:Arial; font-size:10pt">pixelUpperLeft</span> and  <span style="font-family:Arial; font-size:10pt">pixelLowerRight</span>, and a sign (e.g. <span style="font-family:Arial; font-size:10pt">&#8216;faceImage&#8217;)</span> would be assigned to all extracted mentions.</p>
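<p style="text-align: justify">One way to picture how a single Mention structure might serve both text and image artifacts is sketched below.  The field spanCoordinateType comes from the text, whereas the class layout, the coordinate-type labels, the helper functions, and the bounding boxes (standing in for the output of some face detector) are hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    artifact_id: int
    spanCoordinateType: str   # tells a reader how to interpret `span`
    span: tuple
    sign: str


def text_mention(artifact_id, begin_char, end_char, sign):
    # Text regions are delimited by character offsets.
    return Mention(artifact_id, "char", (begin_char, end_char), sign)


def image_mention(artifact_id, pixel_upper_left, pixel_lower_right, sign="faceImage"):
    # Image regions are rectangles given by two corner pixels.
    return Mention(artifact_id, "pixel", (pixel_upper_left, pixel_lower_right), sign)


# Assumed detector output for one crowd image: (upper-left, lower-right) corners.
face_boxes = [((10, 20), (60, 80)), ((200, 35), (248, 95))]
faces = [image_mention(42, upper_left, lower_right)
         for upper_left, lower_right in face_boxes]
company = text_mention(7, 0, 3, "IBM")

for mention in faces + [company]:
    print(mention)
```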
<p style="text-align: justify">
<h3>Extract-Transform-Load</h3>
<p style="text-align: justify">Consider next, Extract-Transform-Load (ETL) processes that pull data from other structured data sources, typically databases, into Layer 2.  The initial phase of the ETL loads the source data-model (e.g. database data dictionary) into Layer 3, and concepts and predicates (or their references) into the <span style="font-family:Arial; font-size:10pt">Concept</span> and <span style="font-family:Arial; font-size:10pt">Predicate</span> dimension tables of Layer 2.  Sufficient metadata to identify and retrieve the data-model source artifact (i.e. schema) is also stored.  Subsequent ETL processing, which entails a mapping to the DDF structure, inserts signs, terms, and statements into the <span style="font-family:Arial; font-size:10pt">SemanticFact</span> and <span style="font-family:Arial; font-size:10pt">AssociationFact</span> tables along with appropriate process metadata.</p>
<p style="text-align: justify">
<p style="text-align: justify">Because the ETL process needs only to capture the explicit semantics of the underlying model of the source (e.g. relational, hierarchical, graph…), one ETL can be developed for a whole class of data stores.  For example, a discussion of ETL for relational stores may be found in [Yoakum 2008 IQIS].</p>
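<p style="text-align: justify">To suggest why a single ETL can serve every relational source, the sketch below maps arbitrary rows onto concepts, predicates, terms, and statements using nothing but the table name, the key column, and the column names.  The mapping shown is an assumption chosen for brevity and is not the mapping described in [Yoakum 2008 IQIS]; the function name and sample rows are likewise hypothetical.</p>

```python
def etl_table(table, key_column, rows):
    """Hypothetical generic mapping of relational rows onto DDF-style structures."""
    concepts = {table}
    predicates = set()
    terms = set()          # a set, so an existing term is never duplicated
    statements = []
    for row in rows:
        # The row becomes a term: key value as sign, table as concept.
        row_term = (str(row[key_column]), table)
        terms.add(row_term)
        for column, value in row.items():
            if column == key_column:
                continue
            # Each non-key column supplies both a concept and a predicate.
            concept = f"{table}.{column}"
            concepts.add(concept)
            predicates.add(column)
            value_term = (str(value), concept)
            terms.add(value_term)
            # One statement per cell: row term, column predicate, value term.
            statements.append((row_term, column, value_term))
    return concepts, predicates, terms, statements


rows = [
    {"id": 1, "name": "IBM", "city": "Armonk"},
    {"id": 2, "name": "Acme", "city": "Armonk"},
]
concepts, predicates, terms, statements = etl_table("Company", "id", rows)

print(sorted(concepts))
print(sorted(predicates))
print(len(terms), "terms,", len(statements), "statements")
```

<p style="text-align: justify">Nothing in the function refers to a particular schema, so the same code would apply unchanged to any other table; only the source data-model loaded in the initial phase differs.</p>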
<p style="text-align: justify">
<h3>Interactive</h3>
<p>Finally, consider an interactive user interface that enables an analyst to assert semantic and association facts directly into the DDF. The analyst will have the option to use existing concepts, predicates, terms, and statements or to create new ones.  In the case of the latter, recorded and asserted mentions will reference the source analyst. Metadata recorded for manual processes will also reference the source analyst.</p>
]]></content:encoded>
			<wfw:commentRss>http://systover.net/blog/2009/01/01/unified-architecture-for-integrating-intelligence-data-full-paper/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Unified Architecture for Integrating Intelligence Data</title>
		<link>http://systover.net/blog/2008/12/30/unified-architecture-for-integrating-intelligence-data/</link>
		<comments>http://systover.net/blog/2008/12/30/unified-architecture-for-integrating-intelligence-data/#comments</comments>
		<pubDate>Tue, 30 Dec 2008 23:06:52 +0000</pubDate>
		<dc:creator>Suzanne Yoakum-Stover</dc:creator>
		
		<category><![CDATA[publications]]></category>

		<guid isPermaLink="false">http://systover.net/blog/?p=4</guid>
		<description><![CDATA[Abstract:
The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data into one single coherent repository of knowledge.  Current practice whereby all data-models would be merged into a single “Uber-model” simply does not work.  We require a solution that remains viable in a freely evolving, interdependent collective of [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Abstract:</strong></p>
<blockquote><p>The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data into one single coherent repository of knowledge.  Current practice whereby all data-models would be merged into a single “Uber-model” simply does not work.  We require a solution that remains viable in a freely evolving, interdependent collective of human and computational systems, very little of which will ever be under our control.  Our approach is database-centric and proceeds in stages.  The first addresses the unified storage of the broad spectrum of artifacts existing within the Intelligence Enterprise today regardless of modality or representation.  The second builds upon the foundation provided by the first to address the unified storage of structured data and semantic data integration.  In both we embrace the diversity of data-models employed throughout the Intelligence Community. The result is a layered data architecture that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints in a way that addresses today’s Intel needs while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.</p></blockquote>
<p>Full paper: <a href="http://systover.net/blog/wp-content/uploads/2008/12/20080824-ddf.pdf">Unified Architecture for Integrating Intelligence Data</a></p>
]]></content:encoded>
			<wfw:commentRss>http://systover.net/blog/2008/12/30/unified-architecture-for-integrating-intelligence-data/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
