Metadata file format

QIIME 2 metadata is most commonly[1] stored in a TSV (i.e. tab-separated values) file. These files typically have a .tsv or .txt file extension, though it doesn’t matter to QIIME 2 what file extension is used. TSV files are simple text files used to store tabular data, and the format is supported by many types of software. TSV files can be imported to, edited in, and exported from most spreadsheet programs and databases. Thus it’s usually straightforward to manipulate QIIME 2 metadata using the software of your choosing. If in doubt, we recommend using a spreadsheet program such as Google Sheets to edit and export your metadata files.

Because metadata files contain tabular data, we describe their formatting in terms of rows and columns. The commonality across QIIME 2 metadata files is that the first non-comment, non-empty row of the file defines the column headers, and the first column contains a unique identifier for each metadata entry. The following sections describe the formatting requirements for QIIME 2 metadata files.

There is no universal standard for TSV files, so it is important to adhere to the requirements described in this document and to understand how QIIME 2 will interpret your metadata file’s contents.

Comments, comment directives, and empty rows

Rows whose first cell begins with the pound sign (#) are interpreted as comments and may appear anywhere in the file. Comment rows are ignored by QIIME 2 and are for informational purposes only. Inline comments (i.e., comments that begin part-way through a row or at the end of a row) are not supported.

Rows whose first cell begins with #q2: are interpreted as comment directives. This prefix is reserved: do not begin a row with #q2: unless the row is a defined comment directive (e.g., #q2:types or #q2:missing). We discuss use cases for these below. We reserve the right to add new comment directives in the future, beyond those that are already defined.

Empty rows (e.g. blank lines or rows consisting solely of empty cells) may appear anywhere in the file and are ignored.
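
The row-level rules above can be sketched with Python's standard csv module. This is only an illustration, not the QIIME 2 parser: the function name split_rows and the example file contents (including the sample-id header) are hypothetical, and the sketch treats every row whose first cell begins with # as a comment without special-casing any header names.

```python
import csv
import io


def split_rows(text):
    """Return (directives, rows), dropping comment rows and empty rows."""
    directives, rows = [], []
    reader = csv.reader(io.StringIO(text, newline=""), dialect="excel-tab")
    for row in reader:
        # Blank lines arrive as [], rows of empty cells as ['', ''].
        if not any(cell.strip() for cell in row):
            continue
        first = row[0].strip()
        if first.startswith("#q2:"):
            directives.append(row)  # comment directive, e.g. #q2:types
        elif first.startswith("#"):
            continue  # ordinary comment row: ignored
        else:
            rows.append(row)  # header row or data row
    return directives, rows


# Hypothetical file contents used only for illustration.
example = (
    "sample-id\tbody-site\n"
    "# this row is a comment\n"
    "\n"
    "#q2:types\tcategorical\n"
    "S1\tgut\n"
    "\t\n"
    "S2\ttongue\n"
)
directives, rows = split_rows(example)
```

Here rows holds the header row followed by the two data rows, and directives holds the single #q2:types row.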

Identifier column

The first column in the metadata file is the identifier (ID) column. This column defines the sample or feature IDs described by your metadata. It is not recommended to mix sample and feature IDs in a single metadata file; keep sample and feature metadata stored in separate files.

The ID column name (also referred to as the ID column header) must be one of the following values. The values listed below are reserved for use as ID column names and may not be used as IDs or names of other columns in the metadata file.

Case-insensitive (i.e., uppercase or lowercase, or a mixing of the two, is allowed):

Case-sensitive (i.e., these must appear exactly as presented here):

The following rules apply to IDs:

Metadata columns

The ID column is the first column in the metadata file, and can optionally be followed by additional columns defining metadata associated with each sample or feature ID. Metadata files are not required to have additional metadata columns, so a file containing only an ID column is a valid QIIME 2 metadata file.

The following rules apply to column names:

The metadata file line containing the ID column name and any other column names is referred to as the header row.

Metadata values

The contents of a metadata file following the ID column and header row (excluding comments and empty lines) are referred to as the metadata values. A single metadata value, defined by an (ID, column) pair, is referred to as a cell.

The following rules apply to metadata values and cells:

Leading and trailing whitespace characters

If any cell in the metadata contains leading or trailing whitespace characters (e.g. spaces, tabs), those characters will be ignored when the file is loaded. Thus, leading and trailing whitespace characters are not significant, so cells containing the values 'gut' and ' gut ' are equivalent. This rule is applied before any other rules described in this section.
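
As a small illustration of this rule, the sketch below strips each cell before anything else is done with it. The helper name normalize_cell is hypothetical, and it assumes that Python's str.strip() notion of whitespace is close enough to what the metadata parser ignores.

```python
def normalize_cell(cell):
    """Drop leading and trailing whitespace; interior whitespace is kept."""
    return cell.strip()


# Illustrative cells: all three 'gut' variants normalize to the same value.
raw_cells = [" S1", "gut ", " gut ", "\tgut", "left palm"]
cleaned = [normalize_cell(cell) for cell in raw_cells]
```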

Recommendations for identifiers

Our goal with QIIME 2 is to support arbitrary Unicode characters in all cells of metadata files. However, because QIIME 2 plugins and interfaces can be developed by anyone, we can’t guarantee that arbitrary Unicode characters will work with all of them. We therefore make recommendations to users about characters that should be safe to use in identifiers, and we are preparing resources for plugin and interface developers to help them make their software as robust as possible.

Sample and feature identifiers with problematic characters tend to cause the most issues for our users. Based on our experiences we recommend the following attributes for identifiers:

An important point to remember is that sometimes values in your sample metadata can become identifiers. For example, taxonomy annotations can become feature identifiers following qiime taxa collapse, and sample or feature metadata values can become identifiers after applying qiime feature-table group. If you plan to apply these or similar methods where metadata values can become identifiers, you will be less likely to encounter problems if the values adhere to these identifier recommendations as well.

Column types

QIIME 2 currently supports categorical and numeric metadata columns. By default, QIIME 2 will attempt to infer the type of each metadata column: if the column consists only of numbers or missing data, the column is inferred to be numeric. Otherwise, if the column contains any non-numeric values, the column is inferred to be categorical. Missing data (i.e. empty cells) are supported in categorical columns as well as numeric columns.
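
The inference rule described above can be approximated as follows. This is a rough sketch rather than QIIME 2's implementation: the function name infer_column_type is hypothetical, and Python's float() is used as a stand-in number check even though it accepts some strings (such as "nan" or "inf") that a stricter number format may reject.

```python
def infer_column_type(cells):
    """Return 'numeric' if every non-empty cell parses as a number."""
    for cell in cells:
        value = cell.strip()
        if value == "":
            continue  # missing data is allowed in either column type
        try:
            float(value)  # assumption: float() stands in for the real rules
        except ValueError:
            return "categorical"  # one non-numeric value decides the column
    return "numeric"


subject_type = infer_column_type(["1", "2", "3"])
age_type = infer_column_type(["32", "", " 41.5 "])
site_type = infer_column_type(["gut", "", "7"])
```

Note that the Subject-style column of 1, 2, 3 is inferred as numeric here, which is exactly the situation where an explicit type declaration is useful.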

QIIME 2 supports an optional comment directive to allow users to explicitly state a column’s type. This bypasses the column type inference described above. This can be useful if there is a column that appears to be numeric, but should actually be treated as categorical metadata (e.g. a Subject column where subjects are labeled 1, 2, 3, etc.). Explicitly declaring a column’s type also makes your metadata file more descriptive because the intended column type is included with the metadata, instead of relying on software to infer the type (which isn’t always transparent).

You can add a comment directive to declare column types in your metadata file manually or through the q2cli command line utilities (call qiime tools).

For manual specifications within your metadata file(s), comment directive line(s) must appear directly below the header. The row’s first cell must be #q2:types to indicate the row is a comment directive. Subsequent cells may contain the values categorical or numeric (both case-insensitive). The empty cell is also supported if you do not wish to assign a type to a column (the type will be inferred in that case). Thus, it is easy to include this comment directive without having to declare types for every column in your metadata.
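
To make the layout concrete, the sketch below reads a small metadata file whose second row is a #q2:types directive and maps each column name to its declared type. The file contents, column names, and the helper declared_types are hypothetical; None marks a column whose type was left empty and would therefore be inferred.

```python
import csv
import io

VALID_TYPES = {"categorical", "numeric"}


def declared_types(header, directive_row):
    """Map each non-ID column name to its declared type, or None if empty."""
    if directive_row[0].strip() != "#q2:types":
        raise ValueError("not a #q2:types directive row")
    columns = header[1:]
    # Values are case-insensitive, so compare in lowercase.
    cells = [cell.strip().lower() for cell in directive_row[1:]]
    cells += [""] * (len(columns) - len(cells))  # pad a short directive row
    result = {}
    for name, cell in zip(columns, cells):
        if cell == "":
            result[name] = None  # no declaration: type will be inferred
        elif cell in VALID_TYPES:
            result[name] = cell
        else:
            raise ValueError(f"unsupported column type: {cell!r}")
    return result


# Hypothetical metadata: 'subject' looks numeric but is declared categorical.
text = (
    "sample-id\tsubject\tage\n"
    "#q2:types\tCategorical\t\n"
    "S1\t1\t32\n"
    "S2\t2\t41\n"
)
parsed = list(csv.reader(io.StringIO(text, newline=""), dialect="excel-tab"))
types = declared_types(parsed[0], parsed[1])
```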

This functionality is also available directly through q2cli by calling qiime tools cast-metadata. This utility sets the types of one or more columns in your metadata file(s) to either categorical or numeric. It relies on the comment directive described above but does not require editing the file by hand, and it can be called from a custom script to automate column type assignment, which can be more robust than manual file manipulation.

Number formatting

If a column is to be interpreted as a numeric metadata column (either through column type inference or by using the #q2:types comment directive), numbers in the column must be formatted following these rules:

Advanced missing metadata value encoding

Missing metadata values may be encoded in one of the following schemes:

  1. blank: The default, which treats empty cells as the only valid missing values.

  2. no-missing: Indicates there are no missing values, and that any empty cells should be considered an error. If a scheme other than ‘blank’ is used by default, this scheme can be provided to preserve strings as categorical terms.

  3. INSDC:missing: The INSDC vocabulary for missing values. The current implementation supports only lower-case terms which match exactly: ‘not applicable’, ‘missing’, ‘not provided’, ‘not collected’, and ‘restricted access’.

The encoding used for each column can be specified on a per-column basis using the #q2:missing comment directive. For manual specifications within your metadata file(s), comment directive line(s) must appear directly below the header. The row’s first cell must be #q2:missing to indicate the row is a comment directive. Subsequent cells may contain the values blank, no-missing, or INSDC:missing (all case-sensitive). The empty cell is also supported if you do not wish to assign a missing value encoding to a column, in which case it will default to blank.
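
The three schemes can be sketched as a single predicate. This is an illustration only: the function name is_missing is hypothetical, and treating empty cells as missing under INSDC:missing is an assumption that the text above does not state.

```python
INSDC_MISSING = frozenset({
    "not applicable",
    "missing",
    "not provided",
    "not collected",
    "restricted access",
})


def is_missing(value, scheme="blank"):
    """Decide whether a cell counts as missing under the given scheme."""
    value = value.strip()
    if scheme == "blank":
        return value == ""
    if scheme == "no-missing":
        if value == "":
            raise ValueError("empty cell in a column declared no-missing")
        return False  # every string is kept as an ordinary value
    if scheme == "INSDC:missing":
        # Assumption: empty cells still count as missing under this scheme.
        # Terms must match exactly and in lower case.
        return value == "" or value in INSDC_MISSING
    raise ValueError(f"unknown missing-value scheme: {scheme!r}")
```

For example, is_missing("not collected", "INSDC:missing") is true, while the same string under the blank scheme is an ordinary value.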

Advanced metadata formatting

If you’re creating TSV files manually (e.g. in a text editor) or writing your own software to consume or produce QIIME 2 metadata files, this section provides additional formatting details. If you’re creating and exporting QIIME 2 metadata files using a spreadsheet program (e.g. Microsoft Excel, Google Sheets), you can skip this content.

TSV dialect and parser

QIIME 2 attempts to interoperate with TSV files exported from Microsoft Excel, as this is the most common TSV “dialect” we have seen in use. The QIIME 2 metadata parser (i.e. reader) uses the Python csv module excel-tab dialect for parsing TSV metadata files. This dialect supports wrapping fields in double quote characters (") to allow for tab, newline, and carriage return characters within a field. To include a literal double quote character in a field, the double quote character must be immediately preceded by another double quote character. See the Python csv module for complete documentation on the excel-tab dialect.
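
The quoting behavior of the excel-tab dialect can be seen directly with Python's csv module. The field contents below are made up for illustration: one quoted field contains a tab, and another contains literal double quotes written as doubled quote characters.

```python
import csv
import io

# Illustrative TSV text with quoted fields.
raw = (
    "sample-id\tnote\n"
    'S1\t"contains a\ttab"\n'
    'S2\t"said ""hello"""\n'
)

rows = list(csv.reader(io.StringIO(raw, newline=""), dialect="excel-tab"))

tab_field = rows[1][1]    # the tab inside the quotes stays in the field
quote_field = rows[2][1]  # each doubled quote becomes one literal quote
```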

Encoding and line endings

Metadata files must be encoded as UTF-8, which is backwards-compatible with ASCII encoding.

Unix line endings (\n), Windows/DOS line endings (\r\n), and “classic Mac OS” line endings (\r) are all supported by the metadata parser for interoperability. When metadata files are written to disk in QIIME 2, the line endings will always be \r\n (Windows/DOS line endings).
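
The sketch below illustrates both directions with Python's csv module, on the assumption that files are read and written with the excel-tab dialect: the same two rows are recovered from all three line-ending styles, and the dialect's default line terminator on writing is \r\n. The row contents are hypothetical.

```python
import csv
import io


def read_rows(text):
    """Parse TSV text; newline='' lets the csv module see raw line endings."""
    return list(csv.reader(io.StringIO(text, newline=""), dialect="excel-tab"))


unix_rows = read_rows("sample-id\tbody-site\nS1\tgut\n")
windows_rows = read_rows("sample-id\tbody-site\r\nS1\tgut\r\n")
classic_mac_rows = read_rows("sample-id\tbody-site\rS1\tgut\r")

# Writing with the same dialect emits \r\n after every row.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, dialect="excel-tab")
writer.writerow(["sample-id", "body-site"])
writer.writerow(["S1", "gut"])
written = buffer.getvalue()
```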

Trailing empty cells and jagged data

The metadata parser ignores any trailing empty cells that occur past the fields declared by the header. This is mainly for interoperability with files exported from some spreadsheet programs. These trailing cells/columns may be jagged (or not); they will be ignored either way when the file is read.

If a row doesn’t contain as many fields as declared by the header, empty cells will be padded to match the header length (again, this is mainly for interoperability with exported spreadsheets).
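
Both behaviors can be sketched in one helper that fits a row to the header length. The name fit_row_to_header is hypothetical, and raising an error for a non-empty cell past the header is an assumption of this sketch, since the text above only covers trailing cells that are empty.

```python
def fit_row_to_header(row, n_columns):
    """Drop trailing empty cells past the header and pad short rows."""
    extra = row[n_columns:]
    if any(cell.strip() for cell in extra):
        # Assumption: data beyond the declared columns is treated as an error.
        raise ValueError("non-empty cell found past the last header column")
    kept = row[:n_columns]
    return kept + [""] * (n_columns - len(kept))


header = ["sample-id", "body-site", "age"]
too_long = fit_row_to_header(["S1", "gut", "32", "", ""], len(header))
too_short = fit_row_to_header(["S2", "tongue"], len(header))
```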

Footnotes
  1. In addition to TSV files, some QIIME 2 Artifacts (i.e. .qza files) can also be used as metadata. See How to use Artifacts as Metadata for details.

  2. The length recommended here (36 characters or less) is designed to be as short as possible while still supporting version 4 UUIDs formatted with dashes.

References
  1. Ziemann, M., Eren, Y., & El-Osta, A. (2016). Gene name errors are widespread in the scientific literature. Genome Biology, 17(1). 10.1186/s13059-016-1044-7
  2. Chase, J. H., Bolyen, E., Rideout, J. R., & Caporaso, J. G. (2016). cual-id: Globally Unique, Correctable, and Human-Friendly Sample Identifiers for Comparative Omics Studies. mSystems, 1(1). 10.1128/msystems.00010-15