A primary distinction between basic grid infrastructure and the requirements identified in caBIG and implemented in caGrid is the attention given to data modeling and semantics. caBIG adopts a model-driven architecture best practice and requires that all data types used on the grid are formally described, curated, and semantically harmonized. These efforts result in the identification of common data elements, controlled vocabularies, and object-based abstractions for all cancer research domains. caGrid leverages existing NCI data modeling infrastructure to manage, curate, and employ these data models. Data types are defined in caCORE UML and converted into ISO/IEC 11179 Administered Components, which are in turn registered in the Cancer Data Standards Repository (caDSR). The definitions draw from vocabulary registered in the Enterprise Vocabulary Services (EVS), and their relationships are thus semantically described.
In caGrid, both the client and service APIs are object oriented and operate over well-defined, curated data types. Clients and services communicate through the grid using Globus grid clients and Globus service infrastructure, respectively. The grid communication protocol is XML, and thus the client and service APIs must transform the transferred objects to and from XML. This XML serialization of caGrid objects is restricted: each object that travels on the grid must do so as XML that adheres to an XML schema registered in the Global Model Exchange (GME). Just as the caDSR and EVS define the properties, relationships, and semantics of caBIG data types, the GME defines the syntax of their XML serialization. Furthermore, Globus services are defined by the Web Service Description Language (WSDL). The WSDL describes the operations the service provides to the grid. In WSDL, the inputs and outputs of these operations, among other things, are defined by XML schemas (XSDs). Because caBIG requires that the inputs and outputs of service operations use only registered objects, these input and output data types are defined by XSDs that are registered in the GME. In this way, the XSDs are used both to describe the contract of the service and to validate the XML serialization of the objects it uses.
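The round trip between objects and schema-governed XML can be illustrated with a minimal sketch. It is not caGrid's serialization code: the `Gene` class, its fields, and the namespace URI are hypothetical, and the check shown only compares the qualified element name against the expected namespace. Full validation against a registered XSD would need a schema-aware library, which is not shown here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical namespace standing in for a schema registered in the GME.
GENE_NS = "gme://example.org/hypothetical/1.0/gene"


@dataclass
class Gene:
    """Hypothetical data type; real caGrid types come from registered models."""
    symbol: str
    name: str


def to_xml(gene: Gene) -> str:
    """Serialize a Gene as a namespace-qualified XML element."""
    elem = ET.Element(
        f"{{{GENE_NS}}}Gene",
        {"symbol": gene.symbol, "name": gene.name},
    )
    return ET.tostring(elem, encoding="unicode")


def from_xml(text: str) -> Gene:
    """Deserialize a Gene, rejecting XML from an unexpected namespace."""
    elem = ET.fromstring(text)
    expected_tag = f"{{{GENE_NS}}}Gene"
    if elem.tag != expected_tag:
        raise ValueError(f"unexpected element {elem.tag!r}")
    return Gene(symbol=elem.attrib["symbol"], name=elem.attrib["name"])
```

In this sketch the namespace plays the role of the registered schema identifier: both sides agree on it ahead of time, so a receiver can reject a payload that claims a different type before attempting to build an object from it.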
Proper semantic integration requires that each class and its attributes from the UML domain model be mapped to appropriate concepts in a controlled terminology. The caCORE SDK uses the NCI Thesaurus as its primary terminology source, but any well-structured, concept-based description logics terminology should in principle be suitable. The concept selection process can be entirely manual, or it can be partially automated using the Semantic Connector, a tool supplied by the caCORE SDK. The Semantic Connector takes the UML domain model expressed in XMI as input and uses the caCORE EVS APIs hosted at the NCI to search the NCI Thesaurus for appropriate concepts. Semantic annotations for classes and attributes are specified using tagged values in the UML domain model.
The UML domain model, annotated with semantic concept codes, contains a considerable amount of metadata about the ultimate system (both data and analytical services) that will be deployed to the grid. However, it is not in a form that is amenable to query and retrieval in a runtime environment, nor can it be easily queried by humans who want to use this information for other purposes. The UML domain model loader addresses these limitations by transforming and loading the models into the caDSR, which provides APIs that support runtime access to metadata. The UML domain model, annotated with semantic concept information, is exported to XMI format using a UML modeling tool such as Enterprise Architect. The exported XMI is then used as input to the UML domain model loader, which applies a set of mapping rules to load the metadata represented by Classes, Attributes, and Associations into entities of the caDSR. The following section details the UML to caDSR mapping rules.
Metadata represented in the UML domain model is mapped to caDSR administered component types using the following mapping rules:
- A UML Class is mapped to an Object Class, which according to the ISO 11179 specification represents a thing in the real world.
- An attribute of a UML Class is mapped to a Property, which according to the ISO 11179 specification represents an attribute of a real-world thing.
- The combination of a UML Class and one of its attributes is mapped to a Data Element Concept.
- The combination of a UML Class, one of its attributes, and the data type of that attribute is mapped to a Data Element, commonly referred to as a Common Data Element (CDE).
- The Project to which the UML domain model belongs is mapped to a Classification Scheme.
- Packages in the UML model (which may represent sub-projects within a project) are mapped to Classification Scheme Items.
- An Association between two classes is mapped to an Object Class Relationship.
Refer to the "Registration of Metadata in the caDSR" chapter of the caCORE SDK Programmer's Guide for complete details on loading UML domain models into the caDSR.
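The mapping rules above can be summarized in a short sketch. This is an illustration only, not the UML domain model loader: the `UmlClass`, `UmlAttribute`, `UmlAssociation`, and `map_model` names are hypothetical, the identifiers it builds (such as `Gene:symbol`) are an invented notation, and the real loader works from XMI and records far more detail than a list of pairs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class UmlAttribute:
    name: str
    datatype: str


@dataclass
class UmlClass:
    name: str
    package: str
    attributes: List[UmlAttribute] = field(default_factory=list)


@dataclass
class UmlAssociation:
    source: str
    target: str


def map_model(
    project: str,
    classes: List[UmlClass],
    associations: List[UmlAssociation],
) -> List[Tuple[str, str]]:
    """Return (administered component type, identifier) pairs for a model."""
    components: List[Tuple[str, str]] = []
    # Project -> Classification Scheme
    components.append(("Classification Scheme", project))
    # Each distinct package -> Classification Scheme Item
    for package in sorted({c.package for c in classes}):
        components.append(("Classification Scheme Item", package))
    for cls in classes:
        # UML Class -> Object Class
        components.append(("Object Class", cls.name))
        for attr in cls.attributes:
            # Attribute -> Property
            components.append(("Property", attr.name))
            # Class + attribute -> Data Element Concept
            components.append(("Data Element Concept", f"{cls.name}:{attr.name}"))
            # Class + attribute + data type -> Data Element (CDE)
            components.append(
                ("Data Element", f"{cls.name}:{attr.name}:{attr.datatype}")
            )
    # Association -> Object Class Relationship
    for assoc in associations:
        components.append(
            ("Object Class Relationship", f"{assoc.source}->{assoc.target}")
        )
    return components
```

For a hypothetical model with one class `Gene` that has a single `symbol` attribute of type `String`, the sketch yields one Object Class, one Property, one Data Element Concept, and one Data Element, alongside the Classification Scheme entries for the project and package.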
After a UML domain model is transformed, loaded, and curated in the caDSR, the model is ready to be used as the basis of an object-oriented grid client and service. All data movement in caGrid between client and service uses instances of Classes registered in the caDSR. caGrid requires that all data types used in the grid are registered in the caDSR and come from a given Project version. That is, even though Attributes and other items in the caDSR can be versioned individually, they must be associated with a specific Project version in order to be used on the grid. Several components of caGrid make use of the wealth of information in the caDSR. As mentioned above, grid services use registered data models as their information model. By doing so, they are able to advertise both the syntax and semantics of the model by exposing an export of the relevant caDSR information as service metadata. The details of the model used to expose this information are shown in the section below. Once the information is exposed in this model, caGrid leverages it for grid service advertisement and discovery. These processes are described in the discovery section. Finally, the information models registered in the caDSR are used as the conceptual foundation for the actual communication format used to exchange data on the grid. This process of serializing and deserializing data instances on the grid is detailed in the serialization overview.
All caGrid Services are expected to publish a set of standard metadata which draws heavily from the metadata registered in caDSR and EVS; it details the functionality of the service, and the institution providing it. The following sections describe these models.
The ServiceMetadata class is the main entry point for the standard service metadata. Shown below in the metadata domain, this model draws heavily on the common and service packages, also shown below. Instances of this model describe the grid service, its hosting environment, and the underlying semantics of the data models used by the service's operations.
caGrid Data Services, in addition to the caGrid standard service metadata, expose standard data service metadata (DomainModel), which details not only the UML Classes exposed by the service but also their relationships, such as associations and inheritance. This information describes the logical model over which data service queries are executed.
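To make the idea of a logical model concrete, the following sketch represents exposed classes, inheritance, and associations as plain data structures. It does not reproduce the actual DomainModel schema: `DomainModelSketch`, `ExposedClass`, `Association`, and all of their fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExposedClass:
    name: str
    attributes: List[str] = field(default_factory=list)
    superclass: Optional[str] = None  # name of the parent class, if any


@dataclass
class Association:
    source: str
    target: str
    role_name: str


@dataclass
class DomainModelSketch:
    classes: Dict[str, ExposedClass] = field(default_factory=dict)
    associations: List[Association] = field(default_factory=list)

    def add_class(self, cls: ExposedClass) -> None:
        self.classes[cls.name] = cls

    def all_attributes(self, class_name: str) -> List[str]:
        """Attributes of a class, with inherited attributes listed first."""
        attrs: List[str] = []
        seen = set()  # guards against accidental inheritance cycles
        current = self.classes.get(class_name)
        while current is not None and current.name not in seen:
            seen.add(current.name)
            attrs = current.attributes + attrs
            parent = current.superclass
            current = self.classes.get(parent) if parent else None
        return attrs

    def associated_classes(self, class_name: str) -> List[str]:
        """Names of classes reachable through one outgoing association."""
        return [a.target for a in self.associations if a.source == class_name]
```

A client holding such a description could, under these assumptions, work out which attributes and associated classes a query against a given class may refer to before sending the query to the service.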