The Taverna Workbench lets users construct complex workflows from multiple types of components; each type of component is called a processor. These components may be located on different machines; Taverna orchestrates them and gathers the results for display in the workbench. The current version of Taverna (18.104.22.168) supports many types of processors, including the apiconsumer, beanshell, biomart, biomoby, java, soaplab and wsdl processors. Taverna also provides a set of Service Provider Interfaces (SPIs), which are extension points through which developers can add functionality for a specific purpose. Using the SPIs, a plug-in called taverna-gt4-processor has been developed so that caGrid users can add grid services to a Taverna workflow. With this processor, a Taverna workflow is aware of caGrid services and can orchestrate them.
Briefly, here are some useful web resources:
An up-to-date online demo showing the use of Taverna to model and execute caGrid workflows: https://webmeeting.nih.gov/p44387759/ (We STRONGLY recommend watching this first!)
GT4 Plug-in download: http://www-unix.mcs.anl.gov/~madduri/taverna/
Create CaGrid Workflow Using Taverna (this article): How to Create CaGrid Workflow Using Taverna
Before using the GT4 processor plug-in to create, run and monitor caGrid workflows with Taverna, users are strongly advised to read the Taverna 1.7 User Guide at http://www.mygrid.org.uk/usermanual1.7/user_guide.html, which covers the basics of creating a workflow in Taverna.
To use the GT4 processor, first add it to the Taverna workbench as a plug-in. Start Taverna, open Tools --> Plugin Manager --> Find New Plugins --> Add Plugin Site, and enter the name and URL of the site that hosts the plug-in (GT4 and http://www-unix.mcs.anl.gov/~madduri/taverna/, respectively, in the example below).
Then select "GT4 Processor 22.214.171.124" and click Install. It may take a while to download the Maven artifacts (jars and pom files) from remote repositories.
In Taverna, a scavenger represents a processor type. For example, a WSDL scavenger represents a web service with a WSDL description. When you add a WSDL scavenger to the Taverna workbench, all the port types and operations in that WSDL become visible, and each operation can be added to a workflow as a WSDL processor. Often the URL of a service of interest is not a "well known" value but something that is discovered at runtime. To locate caGrid services, we integrated the caGrid discovery API into the GT4 plug-in's scavenger. The GT4 scavenger has two usage scenarios:
- The simplest discovery scenario is to query the Index Service for all registered services. The scavenger returns an array of EPRs (endpoint references) together with the operations belonging to each EPR. Each EPR represents a valid caGrid service and can be used by Taverna to create processors that invoke operations on the corresponding service (detailed later). Right-click "Available Processors" in the top left-hand panel and then choose "Add new caGrid(GT4) Scavenger..."
Select the URL of the default caGrid index service, http://cagrid-index.nci.nih.gov:8080/wsrf/services/DefaultIndexService, in "Location (URL) of the index service" (keeping the other fields unchanged) and double-click "Send Service Query". The Taverna workbench then lists all the registered services together with their operations.
Then the GT4 scavenger is added to the Available Processors list in the workbench panel.
Double-click the GT4 scavenger to see the list of services and operations in the newly added scavenger.
- Semantic-based service query. The caGrid discovery APIs provide many discovery capabilities, ranging from "full text search" suitable for a freeform webpage-like interface, through simple text-based criteria such as operation names or concept codes, to complex criteria ("query by example") such as point-of-contact information or UML class criteria.
The GT4 plug-in supports semantic-based service queries. Users can enter up to three service query criteria, each with a corresponding value; three is enough for the current scenarios, and more can be added to the program upon request.
Multiple criteria can be combined. The initial GUI shows only one query criterion, but more can be added by clicking the "Add Service Query" button. For example, we can query for the caGrid services whose "Research Center" name is "Ohio State University", whose Service Name is "DICOMDataService", and which have the operation "PullOp".
The resulting list of services looks like this:
Now the user can add GT4 processors from the scavenger to a Taverna workflow. Section 3 shows how to do this.
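The combined query described above can be pictured as a conjunctive filter over the registered services. The following is only a minimal sketch under stated assumptions: it assumes the criteria are combined with a logical AND, and the `ServiceRecord` class, the `query` function and the registry contents are hypothetical illustrations, not the caGrid discovery API.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """Hypothetical stand-in for one service entry returned by a discovery query."""
    epr: str
    research_center: str
    service_name: str
    operations: list = field(default_factory=list)


def matches(service, criteria):
    """Return True only if the service satisfies every (criterion, value) pair."""
    checks = {
        "Research Center": lambda s, v: s.research_center == v,
        "Service Name": lambda s, v: s.service_name == v,
        "Operation": lambda s, v: v in s.operations,
    }
    return all(checks[name](service, value) for name, value in criteria)


def query(services, criteria):
    """Keep the services that match all criteria (assumed logical AND)."""
    return [s for s in services if matches(s, criteria)]


# Placeholder registry contents, invented for illustration only.
registry = [
    ServiceRecord("epr-1", "Ohio State University", "DICOMDataService", ["PullOp", "query"]),
    ServiceRecord("epr-2", "Ohio State University", "OtherService", ["query"]),
    ServiceRecord("epr-3", "Example Center", "DICOMDataService", ["PullOp"]),
]

criteria = [
    ("Research Center", "Ohio State University"),
    ("Service Name", "DICOMDataService"),
    ("Operation", "PullOp"),
]
result = query(registry, criteria)
print([s.epr for s in result])
```

Each additional criterion can only narrow the result, which is why three criteria are usually enough to single out one service.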
A Taverna workflow is made up of inputs and outputs, processors, XML splitters (which aggregate or split the input and output data of processors), data links, and control links.
These concepts are not explained in detail here; users are again advised to read the Taverna user manual for instructions.
Adding a processor: The first step is to add a new processor to an empty workflow. In the figure below, we add the processor findProjects: find the operation findProjects in Available Processors, right-click it and choose Add to model. The processor is added to a new workflow, which is shown in the diagram on the right.
Adding an XML splitter: In Taverna it is possible to provide the XML data needed by WSDL services directly, but users may find some XML data elements too verbose to handle. Taverna provides "XML splitters", which interrogate the data structure and present its internal data elements to the user. One XML splitter resolves the XML data structure by a single level, so several splitters may be needed when the XML data contains complex types nested over multiple levels. For example, the XML element parameters is the input of the processor findProjects, and it contains a text node context as its sub-element. By adding an XML splitter to the input port of findProjects, the user can enter the value of the element context directly. Double-click the data element to which you want to add the XML splitter and choose Add XML splitter.
A new splitter is added, with a data link to the processor to which the data element belongs.
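The single-level behaviour of a splitter can be sketched in a few lines. This is an illustration of the idea only, not Taverna's implementation: `build_input` and `split_output` are hypothetical helper names, and the `response` and `shortName` elements in the nested sample are invented for the example (only `parameters`, `context`, `Project` and `caCore` come from the text).

```python
import xml.etree.ElementTree as ET


def split_output(xml_text):
    """Expose the direct children of the root element, one level only."""
    root = ET.fromstring(xml_text)
    # Note: repeated child tags would overwrite each other in this simple sketch.
    return {child.tag: child for child in root}


def build_input(root_tag, fields):
    """Assemble a parent element from plain text values for its children."""
    root = ET.Element(root_tag)
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


# Input side: the user supplies only the value of "context".
print(build_input("parameters", {"context": "caCore"}))

# Output side: nested data needs one splitter per level.
nested = "<response><Project><shortName>demo</shortName></Project></response>"
level1 = split_output(nested)
level2 = split_output(ET.tostring(level1["Project"], encoding="unicode"))
print(level2["shortName"].text)
```

Reaching `shortName` takes two calls to `split_output`, which mirrors the need for one splitter per nesting level.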
Assigning a default value to a parameter: A parameter can be given a default value. For example, we can assign the value caCore to the parameter context (the input of the XML splitter parametersXML). Each time the workflow is started, the default value is assigned to the parameter.
Adding a data link: Data links exist between workflow inputs, processors and workflow outputs. For example, a data link between processors A and B feeds the output of A to the input of B. The figure below shows many data links; these were added automatically when we added XML splitters for the processors.
Data links can also be added manually. For example, if we want to feed the output of XML splitter parametersXML2 (i.e., XML element Project) to the input of XML splitter projectXML (also XML element Project), we can find the output in the Advanced model explorer panel, right click and choose the target to connect to.
In the figure below we can see that a data link between parametersXML2 and projectXML is added.
Adding a control link: Control links represent the control flow between processors. The target processor of a control link cannot start until the source processor completes.
Adding an input/output: In the Advanced model explorer, the Workflow inputs and Workflow outputs nodes are used to create workflow inputs and outputs respectively. Right-click Workflow inputs or Workflow outputs and select Create new input or Create new output. After these nodes are created, they can be connected to the processors.
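Data links and control links both constrain the order in which processors may run, so a workflow can be viewed as a dependency graph. The sketch below uses Python's standard `graphlib` module to compute one valid order; it is an illustration, not how Taverna schedules work. The wiring of the data links is assumed from the example in the text, and the `cleanup` processor with its control link is hypothetical.

```python
from graphlib import TopologicalSorter

# (source, target) pairs; the wiring is assumed from the example in the text.
data_links = [
    ("parametersXML", "findProjects"),
    ("findProjects", "parametersXML2"),
    ("parametersXML2", "projectXML"),
]
# Hypothetical control link: "cleanup" consumes no data from findProjects
# but must still wait for it to complete.
control_links = [("findProjects", "cleanup")]


def execution_order(data_links, control_links):
    """Return one valid execution order honoring both kinds of link."""
    sorter = TopologicalSorter()
    for source, target in data_links + control_links:
        sorter.add(target, source)  # the target depends on the source
    return list(sorter.static_order())


order = execution_order(data_links, control_links)
print(order)
```

The only difference between the two kinds of link in this model is that a data link also carries a value, while a control link carries nothing but the ordering constraint.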
A sample workflow: The sample workflow is made up of two processors, findProjects and findPackagesInProject, together with some XML splitters that process their input and output. Its purpose is straightforward. Step 1 uses the processor findProjects to get a list of projects related to a context, and Step 2 uses the processor findPackagesInProject to find all the packages in each of those projects. Because the output of the first step does not exactly match the input of the second, we add three XML splitters to transform the output of findProjects into the input of findPackagesInProject. These three XML splitters are getProject, projectXML and prepareProject; they are shown in purple, and the processors are shown in green. Another XML splitter, inputContext, lets the user enter the context of the projects to query. In this example, the variable context has a default value of caCore. An output node, projectsinformation, is created to store the package information for multiple projects.
Select "File"-->"Run workflow..." to run a workflow in the workbench.
The workbench then switches to the Results perspective, where the execution trace of the workflow and the status of each processor can be seen both as text and as a graph.
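The two-step structure of this sample workflow can be sketched as follows. The functions are local stubs named after the two operations; they do not call the real services, and the catalog of project and package names is placeholder data invented for illustration.

```python
# Placeholder catalog standing in for the remote service; the project and
# package names are invented for illustration.
CATALOG = {
    "caCore": {
        "ProjectA": ["pkg.a.one", "pkg.a.two"],
        "ProjectB": ["pkg.b.one"],
    }
}


def find_projects(context):
    """Stub for the findProjects operation: list the projects in a context."""
    return sorted(CATALOG.get(context, {}))


def find_packages_in_project(context, project):
    """Stub for findPackagesInProject: list the packages of one project."""
    return CATALOG[context][project]


def run_workflow(context="caCore"):
    """Step 1 finds the projects; step 2 runs once per project."""
    projects_information = {}
    for project in find_projects(context):
        projects_information[project] = find_packages_in_project(context, project)
    return projects_information


print(run_workflow())
```

The default argument plays the role of the default value caCore on the context input, and the returned dictionary corresponds to the projectsinformation output node.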
The intermediary output of processor findPackagesInProject:
The result of this sample workflow:
This workflow shows the coordinated use of two caGrid services: caDSR (Cancer Data Standards Repository) and EVS (Enterprise Vocabulary Services). caDSR defines a comprehensive set of standardized metadata descriptors for cancer research terminology used in information collection and analysis. EVS provides resources and services to meet NCI needs for controlled terminology and to facilitate the standardization of terminology and information systems across the Institute and the larger biomedical community. The sample workflow finds all the concepts related to a given context, for example caCore. It is made up of four processors, several XML splitters between them, and a beanshell processor that performs an XML transformation the XML splitters cannot do (XML splitters cannot handle XML attributes). To speed up the demo, we use two local Java widgets, "extract elements from a list", to filter out some intermediate results so that the workflow completes quickly. This reduction does not affect what the demo shows. The workflow proceeds in four steps:
- Use context information to invoke findProjects in caDSR, and get the project(s) information.
- Use project information to invoke findClassesInProject in caDSR, to get the classes' metadata.
- Use project and classes metadata to invoke findSemanticMetadataForClass in caDSR, to get the semantic data for classes, including conceptName.
- Use conceptName to invoke searchDescLogicConcept in EVS, to get detailed concept information.
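Two helper steps sit between the processors listed above: a transformation that deals with XML attributes, and a filter that trims intermediate lists. The sketch below illustrates both ideas in Python; it is not the beanshell script or the Java widget used in the actual workflow. The function names are hypothetical, and the sample fragment, including the choice of `conceptName` as an attribute, is invented for illustration.

```python
import xml.etree.ElementTree as ET


def attributes_to_elements(xml_text):
    """Rewrite each attribute of the root element as a child element."""
    root = ET.fromstring(xml_text)
    for name, value in list(root.attrib.items()):
        ET.SubElement(root, name).text = value
        del root.attrib[name]
    return ET.tostring(root, encoding="unicode")


def extract_elements(items, start, end):
    """Keep items[start:end], mirroring the idea of trimming a list."""
    return items[start:end]


# Hypothetical fragment: the real service responses are not reproduced here.
fragment = '<concept conceptName="Gene"/>'
print(attributes_to_elements(fragment))
print(extract_elements(["c1", "c2", "c3", "c4"], 0, 2))
```

Once attributes have been turned into child elements, the data can be handled by ordinary XML splitters further down the workflow.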
The execution trace.
For source code: access the Taverna SVN by:
The caGrid 1.3 release stream provides access to the official source code repository for caGrid 1.3. On Windows systems, we recommend the following third-party tool as a GUI front-end to Subversion for checking out a caGrid release: http://tortoisesvn.tigris.org. The command-line version of Subversion can be obtained from http://subversion.apache.org/source-code.html
Although source code access is not required for general users, the SVN repository contains not only the source code but also other useful materials, including example workflow files, an up-to-date tutorial, and a poster to be presented at the caBIG annual meeting 2008.
If you have any questions/concerns regarding the workflow tool we have built, please do not hesitate to contact any of us:
Wei Tan: firstname.lastname@example.org
Ravi Madduri: email@example.com