pdp command-line interface

The pdp command-line interface is intended to split individual stages of the complete primer/metabarcoding marker design process into logically-sensible “chunks”. Each stage is invoked using the formula:

pdp <SUBCOMMAND> <OPTIONS> <ARGUMENTS>

The stages are described and summarised below, in the usual order of implementation for design of qPCR primers or metabarcoding markers.

0. Create configuration file

The pdp software takes as initial input a configuration file. This is a plain text, tab-separated file containing columns describing the following data:

  • The name of the sequence (for use during analysis)
  • A comma-separated list of labels describing classes/groups to which the sequence belongs
  • The path to the sequence’s FASTA file
  • (Optionally) the path to a set of features (e.g. gene annotations); if no features are supplied, this can be replaced with a hyphen: -.

Tip

Any line in the configuration file that starts with a hash/octothorpe (#) is ignored. This is useful for commenting and/or ignoring specific sequences.

An example configuration file is included at the location tests/walkthrough/pectoconf.tab. The contents are shown below (note the extensive commenting, with lines preceded by a hash/octothorpe).

$ cat tests/walkthrough/pectoconf.tab
# Pectobacterium genomes downloaded from GenBank/NCBI; genomovars inferred from ANIm
# Annotated Pba: genomovar 1
Pba_SCRI1043        Pectobacterium,atrosepticum_NCBI,gv1    tests/walkthrough/sequences/GCF_000011605.1.fasta       -
# Annotated Pwa: genomovars 2, 3
Pwa_CFBP_3304       Pectobacterium,wasabiae_NCBI,gv2        tests/walkthrough/sequences/GCF_000291725.1.fasta       -
# Annotated Pb      : genomovar 7
Pbe_NCPPB_2795      Pectobacterium,betavasculorum_NCBI,gv7  tests/walkthrough/sequences/GCF_000749845.1.fasta       -

The first line describing an input sequence tells us that its name is Pba_SCRI1043, that it belongs to classes/groups Pectobacterium, atrosepticum_NCBI, and gv1, and that the sequence’s FASTA file can be found at tests/walkthrough/sequences/GCF_000011605.1.fasta. There are no features associated with the sequence.

Tip

It can be easiest to prepare your configuration files in a spreadsheet program such as Microsoft Excel, then convert the file to .json format with pdp config.

1. pdp config

The pdp config command provides functions for validation and handling of pdp configuration files. The key functions provided by this command are:

validation
pdp config can check the formatting and content of a configuration file, to confirm that the appropriate fields are all populated for each input sequence, and that the specified files exist on the filesystem. The program will warn if input sequences require some repair.
$ pdp config --validate tests/walkthrough/pectoconf.tab
WARNING: Validation problems
    Pbe_NCPPB_2795 requires stitch (tests/walkthrough/sequences/GCF_000749845.1.fasta)
    Pwa_CFBP_3304 requires stitch (tests/walkthrough/sequences/GCF_000291725.1.fasta)
    Pwa_CFBP_3304 has non-N ambiguities (tests/walkthrough/sequences/GCF_000291725.1.fasta)
sequence repair
Sequences used as input to pdp may not conform to requirements of some of the third party tools, or the pipeline’s other assumptions. Some stages in the pipeline cannot accommodate genomes presented as multiple sequence files, and others cannot cope with nucleotide ambiguity symbols other than N. pdp config can repair input genome sequences by stitching them and replacing ambiguity codons, writing the “fixed” genome to a new file. Instead of modifying the input sequence directly, which would change the source data (a problem for reproducibility), pdp will “fix” these sequences by writing new, “stitched” and “cleaned” versions of the input sequences, then automatically update the configuration file to point to the modified files. This is done with the pdp config --fix_sequences command:
pdp config --fix_sequences tests/walkthrough/fixed.json tests/walkthrough/pectoconf.tab

Attention

The --fix_sequences option takes as its argument the location to write the output configuration file; the final positional argument is the path to the input configuration file.

format conversion
The pdp config subcommand can convert configuration files between .tab and JSON format. The .tab format is easier to read and manipulate in spreadsheet software, but all the tools in the pdp pipeline require input in .json format.
pdp config --to_json myconfig.json myconfig.tab
pdp config --to_tab myconfig.tab myconfig.json

pdp config output files are (except when converting to .tab format) written as JSON files. This is a machine-readable version of the configuration file, and all the other pdp tools accept JSON format configuration files, but not .tab files. As an example the fixed.conf configuration file produced in the example above is shown below. Here, the Pba_SCRI1043 line is unmodified, the Pbe_NCPPB_2795 is stitched (hence concat appears in the filename), and the Pwa_CFBP_3304 is both stitched and has ambiguity symbols replaced (so has concat_noambig in the filename).

[
    {
        "features": null,
        "filestem": "GCF_000011605.1",
        "filtered_seqfile": null,
        "groups": [
            "Pectobacterium",
            "atrosepticum_NCBI",
            "gv1"
        ],
        "name": "Pba_SCRI1043",
        "primers": null,
        "primersearch": null,
        "seqfile": "tests/walkthrough/sequences/GCF_000011605.1.fasta"
    },
    {
        "features": null,
        "filestem": "GCF_000749845.1_concat",
        "filtered_seqfile": null,
        "groups": [
            "Pectobacterium",
            "betavasculorum_NCBI",
            "gv7"
        ],
        "name": "Pbe_NCPPB_2795",
        "primers": null,
        "primersearch": null,
        "seqfile": "tests/walkthrough/sequences/GCF_000749845.1_concat.fas"
    },
    {
        "features": null,
        "filestem": "GCF_000291725.1_concat_noambig",
        "filtered_seqfile": null,
        "groups": [
            "Pectobacterium",
            "gv2",
            "wasabiae_NCBI"
        ],
        "name": "Pwa_CFBP_3304",
        "primers": null,
        "primersearch": null,
        "seqfile": "tests/walkthrough/sequences/GCF_000291725.1_concat_noambig.fas"
    }
]

As can be seen from this file, the modified sequences are written to the tests/walkthrough/sequences subdirectory, alongside the original:

$ tree tests/walkthrough/sequences/
tests/walkthrough/sequences/
├── GCF_000011605.1.fasta
├── GCF_000291725.1.fasta
├── GCF_000291725.1_concat.fas
├── GCF_000291725.1_concat_noambig.fas
├── GCF_000749845.1.fasta
└── GCF_000749845.1_concat.fas

2. pdp filter

As an optional step in the primer/marker design pipeline, input genomes may be filtered such that primer sets are only designed to restricted sections of the input. The pdp filter subcommand has options to automate this process for the following genomic regions:

predicted coding sequences
The pdp filter --prodigal option carries out a priori CDS prediction on each input genome using the Prodigal software tool. It generates a .gff file, and includes that in the output configuration file. This is suggested as an approach to maximise conserved genomic regions as sources for robust qPCR diagnostic primer design, as coding sequences are expected to be relatively conserved.
pdp filter --prodigal myconfig.json cds_filtered.json
predicted intergenic regions
Using the pdp filter --prodigaligr option will conduct CDS prediction with Prodigal, but generate a .gff file describing regions between predicted CDS, plus a short flanking regions that extends into neighbouring predicted CDS. This is included in the output configuration file. The size of the flanking region can be controlled with the --flanklen option. This is suggested as an approach to maximise amplicon sequence variation when designing metabarcoding markers, as intergenic regions are expected to be less well-conserved.
pdp filter --prodigaligr myconfig.json igr_filtered.json
consserved variable regions
With the pdp filter --alnvar <GROUP> option, pdp will pairwise align all genomes annotated in the configuration file with group <GROUP>, to identify regions conserved across all members of that group. If the --min_sim_error_rate <VALUE> option is used, then only conserved regions that have an error rate (SNP or indel) of at least <VALUE>``% (constrained in the range [0, 1) in *every* genome of ``<GROUP> will be retained for primer design. This is suggested as an approach to maximising amplicon sequence variatino for metabarcoding marker design, as a minimum level of source sequence variation is assured.
pdp filter --alnvar group01 --min_sim_error_rate 0.1 myconfig.json igr_filtered.json

Tip

The pdp filter subcommand can be used with options for multiprocessing/SGE-like parallelisation (see below)

3. pdp eprimer3

Warning

The EMBOSS ePrimer3 package uses the PRIMER3 primer design software, but will only work with an old version of that software: v1.1.4

The pdp eprimer3 command takes a .json format configuration file, and uses the EMBOSS ePrimer3 package to design primers to each input genome sequence, writing the output files to a location specified with the --outdir <OUTDIR> option. A new output .json configuration file is produced which associates primer information with the appropriate input sequence.

pdp eprimer3 --outdir primers/ myconfig.json genomes_with_primers.json

In order to use the “filtered” genomes specified in an input configuration file, the pdp eprimer3 subcommand must be run with the --filter option.

pdp eprimer3 --filter --outdir primers/ myconfig.json genomes_with_primers.json

Tip

The pdp eprimer3 subcommand can be used with options for multiprocessing/SGE-like parallelisation (see below)

4. pdp dedupe

The pdp eprimer3 primer design step treats each genome independently which, especially when several related genomes are used as input, can result in design of many identical primer sets. The pdp dedupe subcommand identifies these redundant primer sets and discards them from the design process, generating a new .json configuration file that points only to non-redundant primer sets. The location of deduplicated/non-redundant primer sets generated in this process is given with the --dedupedir <DIRNAME> option, and BLASTN+ output written to a location specified by the --outdir <OUTPUTDIR> option.

Tip

This step is optional, but it is typical that most primer sets in the primer design process are redundant. As some of the subsequent pipeline stages do not scale well with an increasing number of primer sets, deduplication is highly recommended.

pdp dedupe --dedupedir deduped/ myconfig.json deduped.json

5. pdp blastscreen

As a fast, preliminary screen of designed primers against a set of off-target sequences, the pdp blastscreen subcommand performs BLASTN+ searches of designed primers against a pre-prepared nucleotide sequence database specified by the --db <DBPATH> option, where <DBPATH> is the path to the BLAST+ database.

pdp blastscreen --db blastdb/offtargets --outdir blastn_outputmyconfig.json screened.json

Tip

The composition of the screening database should be appropriate to your analysis/design goals. For example, if you are interested in designing primers diagnostic to all species in a particular bacterial genus, then a database comprising available genomes from sister genera may be appropriate. Alternatively, a subsampling of complete genomes from the bacterial group containing your genus of interest may be useful. However the screening database is constructed, it should represent a good range of off-target sequences that could reasonably be detected as false positives, to eliminate non-specific primer sets.

Tip

This step is optional, but it is typical that many primer sets in the primer design process have off-target matches and are not useful as diagnostic tools. As some of the subsequent pipeline stages do not scale well with an increasing number of primer sets, screening against a relevant database is highly recommended.

Tip

The pdp blastscreen subcommand can be used with options for multiprocessing/SGE-like parallelisation (see below)

6. pdp primersearch

pdp primersearch carries out the most critical step in the design process: predicting which genomes produce amplicons for each of the designed primer sets. Each candidate primer set is tested in turn against all the input sequences to determine whether it has the potential to amplify that sequence. This is the most computationally-demanding step of the analysis.

The pdp primersearch command uses the EMBOSS tool primersearch to carry out in silico hybridisation of each of the candidate primer sets against each input (unfiltered) genome. The output directory into which primersearch result files are written can be specified with --outdir:

pdp primersearch --outdir primersearch_results myconfig.json primersearch.json

Tip

It is strongly recommended that primers are deduplicated, and an off-target pre-screen is performed using pdp blastscreen or pdp diamondscreen before carrying out this step.

Tip

The pdp primersearch subcommand can be used with options for multiprocessing/SGE-like parallelisation (see below)

7. pdp classify

Each primer set can be assessed for potential specificity using the pdp classify subcommand to determine whether it is predicted to amplify specifically genomes only from one of the groups specified in the configuration file.

pdp classify myconfig.json classified_primers/

No new configuration file is produced, but new .json and .ePrimer3 format files are written to the specified output directory for each of the groups specified in the configuration file. These contain accounts of the primer sets determined to be specific to that group.

$ tree tests/walkthrough/classify/
tests/walkthrough/classify/
├── Pectobacterium_primers.ePrimer3
├── Pectobacterium_primers.json
├── atrosepticum_NCBI_primers.ePrimer3
├── atrosepticum_NCBI_primers.json
├── betavasculorum_NCBI_primers.ePrimer3
├── betavasculorum_NCBI_primers.json
├── gv1_primers.ePrimer3
├── gv1_primers.json
├── gv2_primers.ePrimer3
├── gv2_primers.json
├── gv7_primers.ePrimer3
├── gv7_primers.json
├── results.json
├── summary.tab
├── wasabiae_NCBI_primers.ePrimer3
└── wasabiae_NCBI_primers.json

Two summary files are also written: results.json and summary.tab. The summary.tab file is a tab-separated plain text file that describes how many primer sets were predicted to be diagnostic for each input class, and describes a path to the .json file describing their results:

$ cat tests/walkthrough/classify/summary.tab
Group   NumPrimers      Primers
Pectobacterium  4       tests/walkthrough/classify/Pectobacterium_primers.json
atrosepticum_NCBI       1       tests/walkthrough/classify/atrosepticum_NCBI_primers.json
betavasculorum_NCBI     2       tests/walkthrough/classify/betavasculorum_NCBI_primers.json
gv1     1       tests/walkthrough/classify/gv1_primers.json
gv2     2       tests/walkthrough/classify/gv2_primers.json
gv7     2       tests/walkthrough/classify/gv7_primers.json
wasabiae_NCBI   2       tests/walkthrough/classify/wasabiae_NCBI_primers.json

8. pdp extract

The pdp extract subcommand extracts the amplicon sequences for each of the primers in a specified pdp classify output primer file (specific for a particular genome group), aligns those sequences using MAFFT and then produces summary statistics about the sequence diversity of the amplicons.

pdp extract requires the configuration file used for pdp classify, the path to the .json file describing the primers specific to a particular genome group, and the path to an output directory for result files. Given the output path <OUTDIR> and the group group01, the extracted sequences and summary files will be written to <OUTDIR>/group01.

pdp extract myconfig.json classified_primers/group01.json extracted/

The output directory will contain, for each primer set:

  • a FASTA format file describing all the amplicon sequences (ending in .fasta)
  • a MAFFT-aligned output file (ending in .aln) describing the aligned sequences
  • a summary tab-separated plain text file called distances_summary.tab

The distances_summary.tab file is a table with one row per primer set, (and one row for headers), describing:

  • primer set identifier
  • mean pairwise distance between aligned sequences
  • standard deviation of pairwise distance between aligned sequences
  • minimum pairwise distance between aligned sequences
  • maximum pairwise distance between aligned sequences
  • the count of unique amplicons
  • the number of non-unique amplicons
  • the Shannon Index of sequence diversity (larger is more diverse)
  • the Shannon Evenness of sequence diversity ([0, 1]: closer to 1 is more even)

Tip

The pdp extract subcommand can be used with options for multiprocessing/SGE-like parallelisation (see below)

Multiprocessing/SGE-like parallelisation

Attention

Several stages in the pdp pipeline (principally those that call third-party software tools) can take advantage of multicore systems or clusters using an SGE-like scheduler. The syntax for doing this is the same for each of the stages.

multiprocessing
By default, subcommands that can use parallelism will attempt to distribute jobs to local cores using Python’s built-in multiprocessing module. This can be explicitly enabled with the -s multiprocessing option, and the number of workers controlled with the -w <N> option, to limit the total number of workers to a maximum of <N>.
pdp eprimer3 --outdir primers -s multiprocessing -w 4 myconfig.json eprimer3.json
SGE-like schedulers
The -s SGE option can be provided to use an SGE-like scheduler (one you can invoke with qsub). To cause minimal problems with queues, individual jobs are batched into job arrays, with a default array size of 10000 jobs (this can be controlled with the --SGEgroupsize <N> option). If you need to pass further arguments to SGE, this can be done with the --SGEargs <ARGUMENTS> option.
pdp eprimer3 --outdir primers -s SGE --SGEgroupsize 5000 --SGEargs "-M me@domain.org -m bes" myconfig.json eprimer3.json