Using an Artifact Cache

An artifact cache allows users to have finer control over where and how QIIME 2 Results are stored on disk.

Artifact caches serve two primary purposes:

Providing the user with control over where QIIME 2 stores its working (temporary) files.
Avoiding the overhead of unzipping QIIME 2 Results every time they’re used.

Users can create and interact with artifact caches via both the Python API and the CLI. This tutorial will provide instructions for both.

An artifact cache is created in a specific location on your file system. After an artifact cache is created, it can be used to store Results as unzipped directories, as opposed to as .qza or .qzv files. Results in a cache are referred to by the path to the cache followed by a user-defined key.

A use case for the artifact cache

Consider a use case where you have a very large artifact, say an 80 gigabyte reference sequence database, and you are regularly using this database as an input to QIIME 2 Actions on a cluster. It would be ideal to avoid repeatedly unzipping this large database from its .qza file every time it is used. It may also be ideal to store this artifact in a location that all users and all worker nodes on the cluster have access to, to avoid multiple copies of this file being stored on the system (e.g., under different users’ home directories).

These issues can be resolved by putting the reference data artifact in a cache in a location on the cluster’s file system that is globally accessible by the users and worker nodes. By nature of being stored in a cache, the artifact will be stored unzipped, so the action will not need to unzip the artifact before using it. As long as the cache is stored in a location that all worker nodes can access, it will not need to be moved around the filesystem before the action can execute.

Tutorial

Reminder

These examples assume that you have a QIIME 2 deployment that includes the q2-dwq2 educational plugin. Follow the instructions in Getting started if you’d like to follow along with this tutorial. If you’ve already followed those instructions, before following this tutorial be sure to activate your conda environment as follows:

conda activate using-qiime2

The following steps will illustrate how to create a new cache, add an artifact to it, use an artifact from it, and more. Note that the steps in the tutorial may require that some or all of the preceding steps have been run.

First, have QIIME 2 generate some data that we can use.

qiime dwq2 search-and-summarize --example-data ss-usage

Then, change to the directory containing the data.

cd ss-usage/Serial

Creating a cache

Now, let’s create a new artifact cache in the current working directory:

Command line interface

qiime tools cache-create --cache my-cache

Python 3 API

The following will create a cache at the given path if one does not already exist. If a cache does exist at that path, the same call returns an object (cache in this case) that refers to the existing cache.

from qiime2 import Cache

cache = Cache('./my-cache')

Loading entries in a cache

Next, let’s simulate the example described above where you have a reference data set that you’d like to store in the cache, so it doesn’t need to be unzipped every time it’s used.

The following command stores an artifact in the specified cache (my-cache) under the specified key (my-reference).

Command line interface

qiime tools cache-store \
   --cache my-cache \
   --artifact-path ./reference-seqs.qza \
   --key my-reference
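
In the Python 3 API, the same step could look roughly like the sketch below. It assumes that Cache.save(artifact, key) and Cache.get_keys() behave as in the QIIME 2 Cache API; the helper name store_artifact is hypothetical and exists only for illustration.

```python
def store_artifact(cache, artifact, key):
    """Save `artifact` in `cache` under `key` and confirm the key exists.

    `store_artifact` is a hypothetical helper, not part of QIIME 2.
    `cache` is expected to be a qiime2.Cache and `artifact` a qiime2.Artifact.
    """
    cache.save(artifact, key)
    return key in cache.get_keys()


# In a QIIME 2 environment this could be used as follows (not run here):
#   from qiime2 import Artifact, Cache
#   cache = Cache('./my-cache')
#   reference = Artifact.load('./reference-seqs.qza')
#   store_artifact(cache, reference, 'my-reference')
```

Passing the cache in as an argument keeps the helper independent of where the cache lives on disk.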

Reviewing entries in a cache

You can confirm that the Artifact was added to the cache as follows.

Command line interface

qiime tools cache-status --cache my-cache

Python 3 API

The cache-status command in q2cli (i.e., the command line interface equivalent of this command) is currently the easiest way to get information about what’s in a cache. You can access similar information through the Cache API.

cache.get_keys()
cache.load('my-reference')

Using caches with Actions

Now, you can reference the artifact from the cache when calling Actions. Artifacts in a cache are referenced as path-to-cache:key. So, as long as you’re working in the same directory as where your cache is stored, the following command should run.

You can also use your cache to store all temporary files created by your QIIME 2 action. This is done using the --use-cache parameter on the CLI and by using your Cache object as a context manager (with cache:) in the Python API. It is not mandatory to do this; however, it is recommended any time you are loading data from or saving data to a cache. If you are loading data from or saving data to multiple caches in one action, prefer the one that has the larger amount of data moving into or out of it. This will prevent unnecessary copying of data and can save a large amount of time.

(This command does take a couple of minutes to run. If you want it to go faster, you can run it in parallel.)

Command line interface

qiime dwq2 search-and-summarize \
    --i-query-seqs query-seqs.qza \
    --i-reference-seqs my-cache:my-reference \
    --m-reference-metadata-file reference-metadata.tsv \
    --p-split-size 1 \
    --o-hits hits.qza \
    --o-hits-table hits-table.qzv \
    --use-cache my-cache

Python 3 API

from qiime2 import Artifact, Metadata
from qiime2.plugins.dwq2.actions import search_and_summarize

# Load the query sequences from disk and the reference from the cache.
query_seqs = Artifact.load('./query-seqs.qza')
reference_from_cache = cache.load('my-reference')
reference_metadata = Metadata.load('./reference-metadata.tsv')

# Using the cache as a context manager stores temporary files in it.
with cache:
    hits, hits_table = search_and_summarize(
        query_seqs=query_seqs,
        reference_seqs=reference_from_cache,
        reference_metadata=reference_metadata,
        split_size=1)

Outputting Results directly to a cache

It’s also possible to write outputs from QIIME 2 Actions to a cache. To do that, you’d reference the cache and key in the same way as for an input. In this example, the hits output artifact is being written to the cache.

Command line interface

qiime dwq2 search-and-summarize \
    --i-query-seqs query-seqs.qza \
    --i-reference-seqs my-cache:my-reference \
    --m-reference-metadata-file reference-metadata.tsv \
    --p-split-size 1 \
    --o-hits my-cache:my-hits \
    --o-hits-table hits-table.qzv \
    --parallel \
    --use-cache my-cache
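
A rough Python 3 API counterpart is sketched below. It assumes that an Action's outputs can be saved to a cache with Cache.save(result, key) after the Action returns, and that the returned Results can be indexed like a tuple; the helper name run_and_cache is hypothetical.

```python
def run_and_cache(cache, action, key, **action_kwargs):
    """Run `action` with `cache` active, then save its first output under `key`.

    `run_and_cache` is a hypothetical helper, not part of QIIME 2.
    """
    with cache:  # temporary files are written to this cache
        results = action(**action_kwargs)
        first_output = results[0]  # e.g. `hits` for search_and_summarize
        cache.save(first_output, key)
    return results


# In a QIIME 2 environment this could be used as follows (not run here):
#   hits, hits_table = run_and_cache(
#       cache, search_and_summarize, 'my-hits',
#       query_seqs=query_seqs,
#       reference_seqs=cache.load('my-reference'),
#       reference_metadata=reference_metadata,
#       split_size=1)
```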

After that completes, how can you check that the resulting Artifact was written to the cache? (Hint: we used this command above.)

Remove a Result from the cache

If there’s an item in your cache that you no longer need, you can remove it. For example, we can remove our reference data.

Command line interface

qiime tools cache-remove \
   --cache my-cache \
   --key my-reference
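
In the Python 3 API, removal could be sketched as follows, assuming Cache.remove(key) and Cache.get_keys() behave as in the QIIME 2 Cache API. The helper name remove_from_cache is hypothetical; it only adds a presence check around the removal.

```python
def remove_from_cache(cache, key):
    """Remove `key` from `cache` if present; return True if it was removed.

    `remove_from_cache` is a hypothetical helper, not part of QIIME 2.
    """
    if key not in cache.get_keys():
        return False
    cache.remove(key)
    return True


# In a QIIME 2 environment this could be used as follows (not run here):
#   from qiime2 import Cache
#   cache = Cache('./my-cache')
#   remove_from_cache(cache, 'my-reference')
```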

After removing the artifact, check the status of your cache to confirm that it was removed.

Other cache-related functionality

There are some other command line tools and APIs accessible to help you interact with your cache(s). You can learn about these as follows.

Command line interface

qiime tools --help

See the tools that begin with cache-.

Remove a cache

If you no longer need your cache, you can just remove the directory from disk as follows.

rm -r my-cache
