Anthony J. Bonner Adel Shrufi Steve Rozen
University of Toronto, Department of Computer Science, Toronto, ON, Canada.
Whitehead/MIT Center for Genome Research, Cambridge, MA, USA.
A critical requirement for a large genome laboratory is software to control laboratory workflow while managing the data produced in the laboratory. This software knits together a complex web of manual and automated laboratory activities, including experiment scheduling and setup, robot control, raw-data capture, multiple stages of preliminary analysis and quality control, and release of finished results. Appropriate software is necessary to make the coordination of these activities both intellectually manageable and operationally efficient. A key component of this software is a database management system (DBMS) for controlling and tracking workflow activity. This DBMS maintains a record of what happened in the laboratory. This record constitutes an audit trail, or event history, and has much the same function as a laboratory notebook: it records what was done, when it was done, who did it, and what the results were. It can also be used to analyze workflows, to find rate-limiting steps or to investigate anomalous results. This paper discusses the requirements of this DBMS.
The discussion centers on LabFlow-1 [3, 2], a recently developed database benchmark for high-throughput workflow management systems (WFMSs), i.e., systems for managing high-volume, mission-critical workflows. LabFlow-1 is based on the data and workflow management needs of a large genome laboratory, and reflects its real-world experience. An overview of the benchmark can be found in [3], and a detailed description in [2]. Benchmark software is available at ftp://db.toronto.edu/pub/bonner/papers/workflow/software/.
Although it is based on genome-laboratory workflow, we believe that LabFlow-1 captures the database requirements of a common class of workflow management applications: those that require a production workflow system [16]. In production workflow, activities are organized into a kind of production line, involving a mix of human and computer activities. Examples in business include insurance-claim or loan-application processing. Production workflow systems are typically complex, high-volume, and central to the organizations that rely on them; certainly these characteristics apply to the laboratory WFMSs used in high-throughput genome laboratories. Many production workflows are organized around central materials of some kind, which the workflow activities operate on. Examples of central materials include insurance claims, loan applications, and laboratory samples. As a central material is processed, workflow activities gather information about it.
Production workflow systems include the class of Laboratory Information Management Systems, or LIMS [21, 1, 19]. LIMS are found in analytical laboratories in a wide range of industries, including pharmaceuticals, health care, environmental monitoring, food and drug testing, and water and soil management. In all cases, the laboratory receives a continual stream of samples, each of which is subjected to a battery of tests and analyses. Workflow management is needed to maintain throughput and control quality [18].
Much of the research on workflow management in computer science has focused on developing extended transaction models, especially in a heterogeneous environment [10, 20, 6, 14, 13, 8, 22]. However, the performance of workflow management systems has so far received little attention. The need to study performance arises because commercial products cannot support applications with high-throughput workflows [10].
High-throughput workflows are characteristic of large genome laboratories, such as those operated at the Whitehead/MIT Center for Genome Research (hereafter called ``the Genome Center''). Workflow management is needed to support the Genome Center's large-scale genome-mapping projects [4, 15, 5]. Because of automation in sample handling and testing, instrumentation, data capture and workflow management, transaction rates at the Genome Center have increased dramatically in the last three years, from under 1,000 queries and updates per day in 1992 [11], to over 15,000 on many days in 1995. Of course, peak rates can be much higher, with a rate of 22.5 updates and queries per second recently observed over a 5-minute period. These rates are expected to increase by another order of magnitude in the near future if the Genome Center begins large-scale sequencing of the human genome [4]. Moreover, unlike the simple banking debit/credit transactions of some TPC benchmarks [23], these transactions involve complex queries, plus updates to complex objects, such as arrays, sequences, and nested sets.
The LabFlow-1 benchmark is a first step towards measuring the performance of WFMSs. It does not account for all the components that affect performance, such as networks, hardware platforms, and operating systems. Instead, it focuses on one dimension of the problem: the DBMS that controls and tracks workflow. Like other components, the DBMS can become a workflow bottleneck, especially in high-throughput applications. Certainly, this is often the case at the Genome Center.
Workflow management at the Genome Center imposes numerous requirements on the DBMS. First, it requires standard database features, such as concurrency control, crash recovery, consistency maintenance, a high-level query language, and query optimization. It also requires complex data types. The DBMS must provide this functionality on a mixed workload of queries and updates. In addition, it must provide the following two features, which are typical of many WFMSs:
It is worth observing that because the database is historical and the schema is evolving, data at different points in the history will be stored under different schemas. Thus, an historical query or view may access data with many different schemas. This presents a challenge both to database design and to the formulation of queries and views. For instance, an application program may have to query an object's schema, as well as its value.
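As a small illustration (our own sketch in Python, not code from the benchmark; the schema names and the unit conventions are invented), a historical query over step records stored under different schema versions may have to inspect each record's schema before it can interpret the recorded value:

```python
# Step records stored under two hypothetical schema versions: the older one
# reports length in kilobases, the newer one in base pairs.  A historical
# query must examine each record's schema as well as its value.
records = [
    {"schema": "size_v1", "length_kb": 5.2},
    {"schema": "size_v2", "length_bp": 5350},
]

def length_bp(record):
    """Normalize the observed length to base pairs, per schema version."""
    if record["schema"] == "size_v1":
        return int(record["length_kb"] * 1000)
    elif record["schema"] == "size_v2":
        return record["length_bp"]
    raise ValueError(f"unknown schema {record['schema']!r}")

print([length_bp(r) for r in records])   # [5200, 5350]
```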
In sum, the database requirements of the Genome Center are typical of applications with the following characteristics: (i) high-volume, mission-critical workflows, (ii) frequent workflow change and process re-engineering, (iii) an audit trail of workflow activity, and (iv) complex-structured data. The LabFlow-1 benchmark is intended for such applications.
This section provides an overview of data and workflow management in LabFlow-1 from the perspective of the DBMS. To keep the discussion concrete, we frame it in terms of laboratory workflow. Additional details can be found in [2].
The database contains two main kinds of object: materials and steps. In object-oriented terms, the database has a material class and a step class, with subclasses representing different kinds of materials and steps. Step and material instances (objects) are created in two distinct situations. (i) As the laboratory receives or creates new materials, new material instances are created in the database to represent them. (ii) Each time a material instance is processed by a workflow step, a new step instance is created in the database to represent the activity. The step instance records the results of the workflow step (e.g., measurements of length and sequence) and the conditions under which the step was carried out (e.g., temperature and salinity).
As a material is processed by a workflow, more and more step instances are created in the database, all associated with the same material. These steps constitute the material's event history. Workflow steps are not always successful, however, and earlier steps sometimes have to be repeated. Thus, the event history of a material may contain several different (though similar) versions of a step. Commonly, the most recent version is of greatest interest to scientists, since it represents the most up-to-date results.
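To make this data model concrete, the following sketch (our own illustration in Python, with invented class and attribute names; it is not code from the benchmark) represents materials and steps as objects and shows how repeated executions of a step accumulate in a material's event history.

```python
import datetime

class Material:
    """A laboratory material (e.g., a DNA sample); its event history is the
    list of step instances that have processed it."""
    def __init__(self, material_id):
        self.material_id = material_id
        self.history = []            # step instances, oldest first

class Step:
    """One execution of a workflow step on one material.  It records the
    results of the step and the conditions under which it was carried out."""
    def __init__(self, material, operator, **observations):
        self.material = material
        self.operator = operator
        self.timestamp = datetime.datetime.now()
        self.observations = observations
        material.history.append(self)

class SizeMeasurement(Step):
    """One kind of workflow step; each kind is a subclass of Step."""

# A failed step is repeated, so the event history holds two versions of it;
# the most recent one carries the most up-to-date results.
sample = Material("dna-0001")
SizeMeasurement(sample, operator="alice", length_bp=5200, temperature_c=23.0)
SizeMeasurement(sample, operator="bob", length_bp=5350, temperature_c=22.5)

latest = [s for s in sample.history if isinstance(s, SizeMeasurement)][-1]
print(latest.observations["length_bp"])   # 5350
```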
As described above, the database schema depends on the workflow. For each kind of workflow step, there is a class in the database, and for each measurement made by the step, the class has an attribute. Consequently, workflow changes are reflected in the database as schema changes. In particular, as laboratory steps change, the corresponding classes also change. For instance, if scientists decide that a particular step should measure more properties, then attributes must be added to the corresponding class in the database. Likewise, if scientists add a new step to the workflow, then a new class is added to the database schema. If scientists split a complex step into a combination of simpler steps, then new classes are introduced into the schema to represent the simpler steps.
Classes are never removed from the database schema, even if the corresponding steps are no longer carried out in the laboratory. This is because the database is historical: old classes are needed to store and interpret data gathered under earlier workflows. Thus, the database schema expands (as classes are created and attributes are added to them), but never shrinks. In effect, as a step evolves, new versions of the step are created. Each step object is associated forever with the same version of a step class, so schema changes do not require data reorganization. A similar approach to schema evolution can be found in [24].
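A minimal sketch of this expand-only schema, under the same illustrative naming assumptions as above (not the benchmark's actual mechanism): each change to a step introduces a new class version, old versions remain in the schema to interpret old data, and every step instance stays bound to the version under which it was recorded.

```python
class SizeMeasurementV1:
    """Original version of the step: records length only."""
    def __init__(self, length_bp):
        self.length_bp = length_bp

class SizeMeasurementV2(SizeMeasurementV1):
    """Later version: scientists decided to also record migration time, so a
    new class version with the extra attribute was added.  V1 stays in the
    schema so that data gathered under the earlier workflow remains usable."""
    def __init__(self, length_bp, migration_time_s):
        super().__init__(length_bp)
        self.migration_time_s = migration_time_s

# Instances stay bound to the class version under which they were recorded,
# so growing the schema never forces a reorganization of stored data.
history = [SizeMeasurementV1(5200), SizeMeasurementV2(5350, 41.0)]
for step in history:
    print(type(step).__name__, vars(step))
```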
The data representation described above is event oriented. That is, information about a step is kept in one place, but information about a material is scattered among different steps. This provides a straightforward record of laboratory activity, but an awkward representation of materials. In particular, retrieving information about a material requires a detailed knowledge of laboratory workflow. For each property of a material, one must know what step(s) measured its value. Moreover, because workflows change constantly, a detailed knowledge of workflow changes and workflow history is also needed. This is a common occurrence in laboratory notebooks. When the notebook is stored in a database, the problem is compounded, since application programs may have to be changed each time the workflow changes.
To alleviate these problems, the database provides a view that is material oriented, i.e., in which information is associated with materials, not steps. In the database, observed values (e.g., sequence, length, and mass) are attributes of steps, while in the view they are attributes of materials. Using the view, an application programmer can retrieve the length of a DNA segment without knowing what step measured the length. In this way, the view isolates the application programmer from the details of workflow and workflow change.
Defining this view is not a straightforward matter. For one thing, the view definition depends on the workflow and its history. For another, different instances of the same kind of material can have different attributes in the view. For instance, segments of DNA may or may not have a length attribute, depending on whether or not they were processed by a step that measured their length. Thus, the attributes (or type) of a material depend on the history of the material, as well as its class. This rather dynamic feature reflects the flexibility demanded by workflow management.
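The following sketch illustrates one way such a material-oriented view can be derived (again, our own illustration rather than the benchmark's actual view mechanism): the view collects the most recent observed value of each attribute from a material's event history, so a program can ask for a length without knowing which step measured it, and a material that was never measured simply lacks that attribute.

```python
def material_view(history):
    """Derive a material-oriented view from an event-oriented history.

    `history` is the material's list of step records (dicts of observed
    values, oldest first).  Later steps override earlier ones, so the view
    exposes the most recent value of each attribute; attributes never
    measured for this material are simply absent from the view."""
    view = {}
    for step in history:
        view.update(step)
    return view

# Two materials of the same kind, with different histories and hence
# different attributes in the view.
measured   = [{"length_bp": 5200}, {"length_bp": 5350, "sequence": "ACGT"}]
unmeasured = [{"sequence": "TTGA"}]

print(material_view(measured))     # {'length_bp': 5350, 'sequence': 'ACGT'}
print(material_view(unmeasured))   # {'sequence': 'TTGA'} -- no length attribute
```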
In addition to views, the database must support historical queries. Most of these queries can be divided into four categories:
Detailed examples of queries and updates are given in [2].
The functions described above come largely under the heading of workflow tracking. In addition, a workflow management system must provide a means of workflow modeling [10]. A workflow model specifies the dependencies among workflow steps. The Genome Center represents the most important dependencies as a workflow graph [25, 2]. Workflow graphs are based on the idea that each material has a workflow state, and as the material is processed, it moves from one state to another. The workflow graph largely determines the workload for the DBMS. A sample graph is given in [2], one that forms the basis of the workload for the LabFlow-1 benchmark.
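To illustrate the idea in miniature, the toy workflow graph below (our own sketch with invented state names; the graph actually used by the benchmark appears in [2]) records, for each state, the states a material may move to when its next step completes, including transitions back to earlier states after failed steps.

```python
# A toy workflow graph: each state lists the states a material may move to
# when its next step completes (including repeats after a failed step).
WORKFLOW_GRAPH = {
    "received": ["prepared"],
    "prepared": ["assayed"],
    "assayed":  ["analyzed", "prepared"],   # a failed assay sends the material back
    "analyzed": ["released"],
    "released": [],
}

def advance(state, next_state):
    """Move a material to `next_state`, checking that the transition is allowed."""
    if next_state not in WORKFLOW_GRAPH[state]:
        raise ValueError(f"illegal transition {state!r} -> {next_state!r}")
    return next_state

state = "received"
for nxt in ("prepared", "assayed", "prepared", "assayed", "analyzed", "released"):
    state = advance(state, nxt)
    print(state)
```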
It is worth noting that the idea of assigning states to materials contrasts with transactional workflow, in which states are assigned to long-running activities (e.g., [22, 26, 9]). This difference might be resolved if each material were associated with a single long-running activity. This activity would exist as long as the material is being processed, and would correspond to the sequence of workflow steps that process the material. In this case, material state and activity state might be identified.
This work was supported in part by a research grant from the Natural Sciences and Engineering Research Council of Canada (NSERC), and by funds from the U.S. National Institutes of Health, National Center for Human Genome Research, grant number P50 HG00098, and from the U.S. Department of Energy under contract DE-FG02-95ER62101.
Anthony J. Bonner  (bonner@db.toronto.edu)
Adel Shrufi  (shrufi@db.toronto.edu)
Steve Rozen  (steve@genome.wi.mit.edu)
University of Toronto,
Department of Computer Science,
10 King's College Rd,
Toronto, Ont,
Canada   M5S 3G4
University of Toronto,
Department of Computer Science,
10 King's College Rd,
Toronto, Ont,
Canada   M5S 3G4
Whitehead/MIT Center for Genome Research,
Whitehead Institute for Biomedical Research,
One Kendall Square, Building 300, Floor 5,
Cambridge MA 02139,
USA
References