Using parallel Pipeline execution

QIIME 2 provides formal support for parallel execution of Pipelines through Parsl.[1] Assuming the necessary compute resources are available, this speeds up Pipeline runs by ensuring that Pipeline steps that can run simultaneously do run simultaneously.
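
If you're new to Parsl, a configuration is simply a Python object describing the compute resources available for execution. As a point of reference, here is a minimal illustrative parsl.Config (a sketch only, not the default configuration QIIME 2 builds for you; see Parallel Pipeline configuration for details on defaults):

# A minimal, illustrative Parsl configuration (not QIIME 2's default):
# a single thread-pool executor that runs up to four steps at once.
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

config = Config(executors=[ThreadPoolExecutor(label='default', max_threads=4)])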

Parallel Pipeline execution is accessible in different ways depending on which interface you’re using. Here we illustrate how to run Pipelines in parallel using q2cli and QIIME 2’s Python 3 API.

Reminder

These examples assume that you have a QIIME 2 deployment that includes the q2-dwq2 educational plugin. If you'd like to follow along, set one up using the instructions in Getting started. Before continuing, be sure to activate your conda environment as follows:

conda activate using-qiime2

q2cli

Review the help text for a QIIME 2 Pipeline. Pay special attention to the usage examples at the bottom of the help text.

qiime dwq2 search-and-summarize --help

Have QIIME 2 generate example data that can be used to run the usage example.

qiime dwq2 search-and-summarize --example-data ss-usage

This will create a new directory for search-and-summarize usage example data. Change into that new directory by running:

cd ss-usage/Serial
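
If you'd like to confirm that the example data was written, list the directory contents. The filenames in the comment below are the inputs the following commands expect:

ls
# expect to see (at least):
#   query-seqs.qza  reference-seqs.qza  reference-metadata.tsv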

Run the usage example serially first. Note that in the following command the output filenames are adapted from the usage example to prefix each with serial-.

Note

The following command may take several minutes to run. On my Apple MacBook Pro (M3) it ran for approximately 6 minutes.

qiime dwq2 search-and-summarize \
    --i-query-seqs query-seqs.qza \
    --i-reference-seqs reference-seqs.qza \
    --m-reference-metadata-file reference-metadata.tsv \
    --p-split-size 1 \
    --o-hits serial-hits.qza \
    --o-hits-table serial-hits-table.qzv

To re-run this Pipeline in parallel, append the --parallel flag. This runs the command in parallel using your default parallel configuration (learn more about this in Parallel Pipeline configuration). Note that this time the output filenames are prefixed with parallel-.

qiime dwq2 search-and-summarize \
    --i-query-seqs query-seqs.qza \
    --i-reference-seqs reference-seqs.qza \
    --m-reference-metadata-file reference-metadata.tsv \
    --p-split-size 1 \
    --o-hits parallel-hits.qza \
    --o-hits-table parallel-hits-table.qzv \
    --parallel

If you're using a system with parallel computing capabilities (e.g., at least six cores), the parallel execution of this command should have run faster than the serial execution.
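
If you'd like to quantify the difference on your machine, one option is to prefix an invocation with the standard Unix time command. For example (a sketch; the timed- output names are arbitrary, and the reported times will vary by system):

time qiime dwq2 search-and-summarize \
    --i-query-seqs query-seqs.qza \
    --i-reference-seqs reference-seqs.qza \
    --m-reference-metadata-file reference-metadata.tsv \
    --p-split-size 1 \
    --o-hits timed-hits.qza \
    --o-hits-table timed-hits-table.qzv \
    --parallel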

Python 3 API

Parallel Pipeline execution through the Python 3 API is done using a ParallelConfig object as a context manager. A ParallelConfig takes a parsl.Config object and, optionally, a dictionary mapping action names to executor names. If no config is provided, your default configuration will be used (see QIIME 2 configuration file precedence).

from qiime2.sdk.parallel_config import ParallelConfig
from qiime2.plugins.dwq2.pipelines import search_and_summarize
from qiime2 import Artifact, Metadata

query_seqs = Artifact.load('query-seqs.qza')
reference_seqs = Artifact.load('reference-seqs.qza')
reference_metadata = Metadata.load('reference-metadata.tsv')

with ParallelConfig():
    future = search_and_summarize.parallel(query_seqs=query_seqs,
                                           reference_seqs=reference_seqs,
                                           reference_metadata=reference_metadata,
                                           split_size=1)
    # call future._result() inside the context manager
    result = future._result()
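
The object returned by future._result() has one attribute per Pipeline output, and each output can be written to disk with its save method. A short sketch, assuming the attribute names mirror the --o-hits and --o-hits-table options used in the q2cli example above:

# Write the Pipeline's outputs to disk; the attribute names are assumed
# to mirror the --o-hits / --o-hits-table options from the CLI example.
result.hits.save('hits.qza')
result.hits_table.save('hits-table.qzv')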

To use a specific configuration, you can create it directly or load one from a file. For example:

from qiime2.sdk.parallel_config import ParallelConfig, get_config_from_file
from qiime2.plugins.dwq2.pipelines import search_and_summarize
from qiime2 import Artifact, Metadata

query_seqs = Artifact.load('query-seqs.qza')
reference_seqs = Artifact.load('reference-seqs.qza')
reference_metadata = Metadata.load('reference-metadata.tsv')

path_to_config_file = ...  # set this to the path to the file you'd like to load

c, m = get_config_from_file(path_to_config_file)

with ParallelConfig(parallel_config=c, action_executor_mapping=m):
    future = search_and_summarize.parallel(query_seqs=query_seqs,
                                           reference_seqs=reference_seqs,
                                           reference_metadata=reference_metadata,
                                           split_size=1)
    # call future._result() inside the context manager
    result = future._result()
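
Here, m is the optional action-to-executor mapping mentioned above: a dictionary whose keys are action names and whose values are executor labels defined in the Parsl configuration. If you'd rather build one yourself, a hypothetical sketch (the action and executor names here are illustrative, not real names from q2-dwq2 or your config):

# Hypothetical: send an action named 'split' to an executor labeled
# 'tpool'; actions that aren't mapped run on the default executor.
mapping = {'split': 'tpool'}

with ParallelConfig(parallel_config=c, action_executor_mapping=mapping):
    ...  # run search_and_summarize.parallel(...) as above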

Parsl configuration

To learn how to configure Parsl for your own usage, refer to Parallel Pipeline configuration.