This guide uses a number of terms that may be unfamiliar to readers. As you read, please refer back to this section for definitions of unfamiliar terms.
- Middleware:
Software that provides support for higher-level (application-level) software components and applications to execute and interact with each other. Middleware consists of a suite of components, services, tools, and a runtime system that can be employed collectively to develop and deploy applications and application-level software components.
- Federation:
A process that facilitates collective access to and use of multiple, disparate, and potentially independently developed resources, services, and applications. It is also the process of such resources and services agreeing on operation and interaction standards to enable collective access and use.
- Grid computing:
An architecture design and framework that encapsulates 1) applications, computational, networking, and storage resources, and services, which are deployed at geographically distributed locations and managed under different security and administrative domains; and 2) middleware and tools that enable federation of those applications, resources, and services.
- Service Oriented Architecture (SOA):
An architecture framework in which the functionality of a software component is accessible remotely and programmatically via well-defined interfaces. A service oriented architecture environment consists of software components (e.g., applications, tools, databases) that are loosely coupled to other software components and exchange information with each other and clients through messages. The most common realization of SOA is Web Services.
- Model Driven Architecture (MDA):
A software design and engineering approach for development of (interoperable) software systems. In MDA, the information structure and interface specifications of a software component are expressed as models, generally using the Unified Modeling Language (UML). These models can then be mapped to specific architecture or technology platform realizations.
- Analytical resource (and analytical service):
An application or software system that is wrapped as a service and that receives data, processes the data, and returns a data product.
- Data resource (and data service):
A database or database system that is accessible as a service and that manages one or more datasets and enables query and retrieval of data from these datasets.
- Administration and Security domain:
Infrastructure, systems, data, and people that must comply with a unified set of policies regarding access control, authentication, authorization, identity management, information exchange protocols, and trust relationships.
- Common Data Elements:
Common information building blocks that are published and well-defined and that can be reused to create more complex data models and structures. Common Data Elements (CDEs) are standardized terms for the collection and exchange of data. CDEs are metadata; they describe the type of data being collected, not the data itself. See https://wiki.nci.nih.gov/display/caDSR/CTEP+Common+Data+Elements and https://cabig.nci.nih.gov/overview/caBIG_core_concepts
- Controlled Vocabularies:
A controlled vocabulary is an agreed-upon set of standard terminology. For example, a controlled vocabulary might define "study" as "a detailed critical inspection" (rather than, say, "a room devoted to literary pursuits"). If two systems use the same controlled vocabulary, the terms in one system will match those in the other system. See https://cabig.nci.nih.gov/overview/caBIG_core_concepts
- Harmonization:
The process of registering a data model as common data elements annotated with terms from a controlled vocabulary in caBIG. The key concepts in harmonization are: 1) data model elements (attributes) are annotated with terms from a controlled vocabulary, so that their semantic meaning is well defined, and 2) existing common data elements are reused, when appropriate, to represent the elements of the data model so that semantically equivalent information is expressed in the same way in this and other data models.
- caCORE SDK:
A software development kit created by the NCI CBIIT to assist developers in implementing interoperable, caBIG-compatible, data-oriented systems, i.e., systems that manage datasets and enable querying and retrieval of data from them. See http://ncicb.nci.nih.gov/infrastructure/cacoresdk
- CQL:
The caGrid Query Language (CQL) is used to express queries against a data source using an object-oriented language.
- Index Service:
A core service in caGrid that provides standards-based support for advertisement, registration, and discovery of services in the caGrid environment. A service advertises its presence and its metadata to the Index Service. Clients can use the Index Service to discover available services based on service metadata.
- GAARDS:
Grid Authentication and Authorization with Reliably Distributed Services (GAARDS) is the security infrastructure of caGrid. It provides services and tools for the administration and enforcement of security policy in an enterprise Grid: 1) Grid user management, 2) identity federation, 3) trust management, 4) group/VO management, 5) access control policy management and enforcement, and 6) integration between existing security domains and the Grid security domain. It consists of (a) Dorian: a Grid service for the provisioning and management of Grid user accounts; (b) Grid Trust Service (GTS): a Grid-wide mechanism for maintaining and provisioning a federated trust fabric consisting of trusted certificate authorities, allowing Grid services to make authentication decisions against the most recent information; (c) Grid Grouper: a group-based authorization solution for the Grid; and (d) Authentication Service: a framework for issuing SAML assertions for existing credential providers so they may easily integrate with Dorian and other Grid credential providers.
- Dorian:
Dorian is a caGrid service that provides support for Grid account management, host certificate management, and other functions related to Grid security.
For other terms, please view the Glossary.
This guide provides an introduction to developing software using caGrid. It is targeted to software developers who are just getting started with caGrid and want to learn how to use caGrid to develop and deploy Grid services. You do not need to be familiar with the concepts of Grid computing, Service Oriented Architecture, or Model Driven Architecture. This document demonstrates the basics of caGrid and provides links to tutorials for hands-on experience developing Grid services. This guide also provides suggestions of additional technical information for those readers who are interested in the design and implementation of caGrid.
This guide is part of a more complete introduction to caGrid 1.4 as outlined in the caGrid 1.4 Quick Start.
caGrid is middleware designed to facilitate secure and federated access to information and analytical resources in a multi-institutional environment. Typically, resources available in this environment have been developed by independent groups. caGrid provides tools, libraries, and runtime support for: 1) resource providers to implement and deploy their analytical and data resources as secure, interoperable services and 2) resource consumers to discover available resources and use them (e.g., submit queries to multiple data sources and retrieve the query results).
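The provider and consumer roles described above can be pictured with a small, purely conceptual sketch. It does not use any caGrid API: `ServiceEntry`, `ToyIndex`, the metadata keys, and the example URLs are all hypothetical names invented for illustration. The assumption is only that providers advertise a service with some metadata and consumers look services up by that metadata.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """Hypothetical advertisement record; not a caGrid type."""
    url: str
    metadata: dict = field(default_factory=dict)


class ToyIndex:
    """In-memory stand-in for a discovery index (illustration only)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Re-registering the same URL replaces the earlier advertisement.
        self._entries[entry.url] = entry

    def discover(self, **criteria):
        # Return the URLs whose metadata matches every requested key/value pair.
        return sorted(
            url
            for url, entry in self._entries.items()
            if all(entry.metadata.get(k) == v for k, v in criteria.items())
        )


# Providers advertise their services (hypothetical URLs and metadata).
index = ToyIndex()
index.register(ServiceEntry("https://a.example.org/microarray",
                            {"kind": "data", "domain": "microarray"}))
index.register(ServiceEntry("https://b.example.org/cluster",
                            {"kind": "analytical", "domain": "microarray"}))

# A consumer discovers services by metadata, then contacts them directly.
data_services = index.discover(kind="data")
microarray_services = index.discover(domain="microarray")
```

In a real deployment the registry is itself a remote service and the metadata is far richer; the sketch only shows the register-then-discover pattern.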
caGrid is designed to solve the problem of sharing data and analytical resources in an environment where resources are hosted by multiple organizations and located in multiple administrative and security domains. In addition, caGrid works just as well within a single institution, providing the tools required to share data seamlessly across departments. For example, a research project may require integrative analysis of microarray, imaging, and clinical data. These datasets may be collected by different entities, such as shared resources and medical information warehouses, and may not be stored in a centralized system. caGrid can be used to create a "virtually centralized" data warehouse of such datasets. Each dataset is managed by the respective owner but is integrated as a virtually centralized data warehouse using caGrid service interfaces and tools so that a researcher can access data from any of those datasets through a common interface.
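As a rough illustration of the "virtually centralized" idea, the sketch below queries several independently owned datasets through one function and merges the results. It is not caGrid code: `federated_query`, the three in-memory sources, and the record fields are assumptions made for the example, and a real deployment would call remote services instead of local functions.

```python
def federated_query(sources, predicate):
    """Ask every source for its records and keep those matching the predicate.

    Each hit is tagged with the name of the source it came from, so the
    caller sees one combined result while each owner keeps its own data.
    """
    results = []
    for name, fetch in sources.items():
        for record in fetch():
            if predicate(record):
                results.append({"source": name, **record})
    return results


# Hypothetical datasets, each managed by a different owner.
sources = {
    "microarray": lambda: [{"patient": "P1", "gene": "TP53"},
                           {"patient": "P2", "gene": "BRCA1"}],
    "imaging": lambda: [{"patient": "P1", "modality": "MRI"}],
    "clinical": lambda: [{"patient": "P2", "visits": 3}],
}

# One common interface over all three datasets.
hits = federated_query(sources, lambda record: record["patient"] == "P1")
```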
Authentication and authorization controls can be used to limit access to the datasets. A key benefit of using caGrid is that caGrid makes it easy to evolve from sharing data within an institution to sharing data with external collaborators. In most cases, no new software needs to be deployed. Resources can be shared both within an institution and with external collaborators simply by changing the security access restrictions.
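The point that sharing can be widened by changing access restrictions, rather than by deploying new software, can be sketched as follows. This is a conceptual toy, not the caGrid security API: `SharedDataset`, the group names, and the policy representation are hypothetical.

```python
class SharedDataset:
    """Toy resource guarded by a set of authorized groups (illustration only)."""

    def __init__(self, records, allowed_groups):
        self._records = list(records)
        self.allowed_groups = set(allowed_groups)

    def read(self, user_groups):
        # Access is granted if the caller belongs to at least one allowed group.
        if self.allowed_groups.isdisjoint(user_groups):
            raise PermissionError("caller is not in an authorized group")
        return list(self._records)


dataset = SharedDataset(["sample-1", "sample-2"], allowed_groups={"internal-lab"})
external_user = {"partner-university"}

# Initially the dataset is shared only inside the institution.
try:
    dataset.read(external_user)
    denied_before = False
except PermissionError:
    denied_before = True

# Sharing with an external collaborator is a policy change, not a redeployment.
dataset.allowed_groups.add("partner-university")
visible_after = dataset.read(external_user)
```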
caGrid employs a Grid computing model. Grid computing refers to the notion of using distributed resources hosted at multiple institutions to solve large-scale, challenging problems in science and engineering. It was initially conceived as a mechanism to enable remote access to computational and storage machines across the administrative boundaries of supercomputer centers in order to solve large-scale, compute-intensive scientific and engineering problems. Over the years it has evolved into a platform made up of standards, tools, and middleware infrastructures for sharing data and analytical resources as well as computation and storage systems.
At its foundation, caGrid employs the basic principles of Grid computing and existing Grid computing tools, more specifically the Globus Toolkit, to enable access to remote and disparate data and analytical resources. As a user of caGrid, you will likely not need to know the details of Grid computing and Grid computing tools. These details are hidden from the caGrid user by higher level tools and middleware components provided by the core infrastructure. For the purposes of getting started with caGrid, it suffices to say that by using caGrid one can create an environment where resources are located at multiple institutions but can be accessed securely across institutional boundaries. Such an environment is referred to as a "Grid".
caGrid is a service-oriented system. In a service-oriented system, each resource is made available to the (Grid) environment as a service. A service wraps the functionality of the resource in a set of well-defined interfaces. These interfaces, and the associated client-side application programming interfaces, are used by client applications to interact with the resource. For example, a gene expression database, stored in a relational database system, may be wrapped as a service with two operations: query and insert. The query operation allows a client program to issue queries for the gene data. The insert operation can be used to insert data into the database. With a service-oriented interface, the client program does not directly interact with the relational database system. Because clients see only the service interface, a service developer can change the implementation, which is hidden from the user, without affecting clients. For example, a service developer can upgrade the service to use multiple threads in response to tighter performance requirements.
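The gene expression example can be sketched in a few lines. This is a minimal local illustration, assuming an in-memory SQLite database in place of the relational system and a plain Python class in place of a deployed Grid service; `GeneExpressionService`, the table layout, and the sample values are hypothetical.

```python
import sqlite3


class GeneExpressionService:
    """Hypothetical service facade: clients see query/insert, never the SQL."""

    def __init__(self):
        # The storage backend is an implementation detail hidden from clients.
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE expression (gene TEXT, sample TEXT, level REAL)"
        )

    def insert(self, gene, sample, level):
        self._db.execute(
            "INSERT INTO expression VALUES (?, ?, ?)", (gene, sample, level)
        )
        self._db.commit()

    def query(self, gene):
        rows = self._db.execute(
            "SELECT sample, level FROM expression WHERE gene = ? ORDER BY sample",
            (gene,),
        )
        return [{"sample": sample, "level": level} for sample, level in rows]


# A client uses only the two operations of the service interface.
service = GeneExpressionService()
service.insert("TP53", "S1", 2.5)
service.insert("TP53", "S2", 0.75)
service.insert("BRCA1", "S1", 1.0)
tp53 = service.query("TP53")
```

Because the client depends only on `insert` and `query`, the class could later switch to a different database or add threading without any change on the client side.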
Most SOA systems employ Web Services technologies as the underlying platform. Web Services provides access to services via standard web protocols. caGrid uses the Web Services Resource Framework (WSRF) standards.
The WSRF draws from the Web Services standards but extends them with concepts such as stateful services, service lifetime, and service context. These extensions enable the implementation of more efficient and richer services for scientific application scenarios. The caGrid infrastructure provides the Introduce toolkit for service providers to easily implement service stubs and service interfaces for their resources. The Introduce toolkit also provides support for client application developers to interact with remote services using high-level Java language APIs. You can find more information about Service-Oriented Architecture, Grid Computing, and the WSRF standards in the following references:
- Service-Oriented Architecture: http://en.wikipedia.org/wiki/Service-oriented_architecture
- Grid Computing: http://www.globus.org/alliance/publications/papers/anatomy.pdf
- Web Services Resource Framework (WSRF): http://www.globus.org/wsrf
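To give a feel for what "stateful services" and "service lifetime" mean, here is a small conceptual sketch. It is not the WSRF or Globus API: `ResourceHome`, its methods, and the explicit `now` parameter (used instead of a real clock so the example is deterministic) are assumptions for illustration. The idea shown is that a client creates a resource with a limited lifetime, accumulates state in it across calls, and loses access once the lifetime has passed.

```python
import itertools


class ResourceHome:
    """Toy manager of stateful resources with a termination time."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._resources = {}

    def create(self, now, lifetime_seconds):
        # Each resource gets a key the client uses in later calls.
        key = next(self._ids)
        self._resources[key] = {"expires": now + lifetime_seconds, "state": []}
        return key

    def append(self, key, item, now):
        self._lookup(key, now)["state"].append(item)

    def read(self, key, now):
        return list(self._lookup(key, now)["state"])

    def _lookup(self, key, now):
        resource = self._resources.get(key)
        if resource is None or now >= resource["expires"]:
            # Expired resources are discarded and can no longer be used.
            self._resources.pop(key, None)
            raise KeyError(f"resource {key} does not exist or has expired")
        return resource


home = ResourceHome()
key = home.create(now=0, lifetime_seconds=60)
home.append(key, "partial result 1", now=10)
home.append(key, "partial result 2", now=20)
snapshot = home.read(key, now=30)  # state persists between calls

try:
    home.read(key, now=60)  # lifetime has elapsed
    expired = False
except KeyError:
    expired = True
```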
caGrid draws from Model Driven Architecture (MDA). The MDA paradigm has gained popularity in recent years. This paradigm promotes the use of object-oriented design practices and rich metadata in order to facilitate implementation of interoperable systems. caGrid adopts an MDA approach to enable interoperability through object-oriented abstractions, common data elements, and controlled vocabularies. That is, client and service APIs in caGrid are object-oriented. These objects, in turn, are defined using common data elements and controlled vocabularies registered on the Grid. For example, the names of an object's fields are terms from the controlled vocabularies. In addition, the type of a field (Integer, String, etc.) matches the type specified in a common data element. The benefit of this approach is that resources are defined in one location (the vocabulary or common data element) and used to generate all Grid artifacts, which avoids re-modeling the same data at each Grid layer. A caGrid data service abstracts data as objects. Similarly, an analytical resource (e.g., an analysis program) implemented as a caGrid analytical service provides methods that take objects as input and return objects.
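The relationship between objects and common data elements can be illustrated with a small check. This is a sketch under stated assumptions, not how caGrid or its metadata registry works internally: the `COMMON_DATA_ELEMENTS` table, the field names, and the `conformance_errors` helper are all hypothetical. It shows the two rules from the paragraph above: field names must be registered terms, and field types must match the registered type.

```python
from dataclasses import dataclass

# Hypothetical registry: data element name -> required type.
COMMON_DATA_ELEMENTS = {
    "geneSymbol": str,
    "patientAge": int,
}


def conformance_errors(obj):
    """List the ways an object's fields deviate from the registered elements."""
    errors = []
    for name, value in vars(obj).items():
        expected = COMMON_DATA_ELEMENTS.get(name)
        if expected is None:
            errors.append(f"{name}: not a registered data element")
        elif not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


@dataclass
class GeneFinding:
    # Field names and types follow the (hypothetical) registry.
    geneSymbol: str
    patientAge: int


@dataclass
class AdHocFinding:
    # "age" is a locally invented name, so other systems cannot interpret it.
    geneSymbol: str
    age: str


ok = conformance_errors(GeneFinding("TP53", 54))
unregistered = conformance_errors(AdHocFinding("TP53", "54"))
wrong_type = conformance_errors(GeneFinding("TP53", "54"))
```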
While the caGrid infrastructure builds on several complex frameworks and standards, caGrid provides a suite of high-level tools and graphical user interfaces that make it easy to use. Most of the details of the underlying standards and frameworks and lower level middleware tools are hidden from the user. These tools and GUIs are covered extensively in caGrid tutorials, presented next.
So far we have provided introductory background information on caGrid. It is now time to start developing with caGrid. In the following section we will outline some of the basic steps to start using caGrid and provide links to relevant tutorials.
As mentioned above, this Getting Started guide is part of a larger quick start. Take a moment to refer to the beginning steps of the caGrid 1.4 Installation Quick Start if you have not yet installed caGrid.
There are several tutorials available for caGrid 1.4. With these tutorials, you can begin service development and also explore advanced features of the Introduce service development toolkit.
Congratulations, you have completed a set of tutorials directing you through the major caGrid components and are now a caGrid Service Developer!
For more detailed technical information on caGrid, please read the caGrid Technical Overview.
From here on, you will determine what you'll need from caGrid to build more complete Grid applications. A good place to start looking for information is in the caGrid 1.4 documentation. Also peruse the knowledgebase for articles on effectively developing with caGrid, troubleshooting, and more.
We encourage you to explore Community Projects. You will find information on many projects to help you accomplish your goals, including projects that are included in the official caGrid release.
Learn about Grid Communities that use caGrid. The Community Training Grid is a caGrid deployment specifically provided for you to develop and test services without worrying about impacting production Grids.
The larger caGrid user community is available to help you use caGrid. Learn more about support resources.