Can I use QIIME or MOTHUR with RTL Genomics Data?

QIIME [1] and MOTHUR [2] are open-source bioinformatics pipelines for performing microbiome analysis from raw DNA sequencing data.  RTL Genomics provides clients who receive our analysis with a copy of their data that has already been quality checked and is ready for use in both QIIME and MOTHUR.  You can find the sequence data for QIIME and MOTHUR in the FASTA zip archive discussed in the section “FASTA Archive File Descriptions” on page 23.

You can find documentation for or download QIIME at http://qiime.org/.

You can find documentation for or download MOTHUR at http://www.mothur.org.

How do I use QIIME with RTL Genomics Data?

QIIME’s split_libraries.py workflow demultiplexes sequence data and formats it for the downstream analysis tools.  This workflow requires three files: a FASTA formatted sequence file, a FASTA formatted quality file and a mapping file that tells QIIME each sample’s name, barcode and primer.  We strongly recommend the following options when running split_libraries.py on our FASTA data:

split_libraries.py -m <Mapping TXT file> -f <Fasta FNA File> -q <Qual File> -o <Output Folder> -H 1000 -p

The -m, -f, -q and -o options are all required and tell the workflow where the mapping file, FASTA sequence file, FASTA quality file and output folder are, respectively.  We also recommend adding the following options:

·         -p

  • The -p option instructs QIIME to ignore the primer; otherwise QIIME requires the primer to be present for a sequence to be considered valid.
  • While the 454 and PGM platforms will sequence the primer, the Illumina platform does not sequence it.  It might be present in the raw data if we performed a double method PCR.
  • Our pipeline removes the primer from each sequence during our QC steps, so even if the primer exists on the raw data it will not exist in the data that has been prepared for QIIME.

·         -H 1000

  • The -H option sets the maximum homopolymer length (the longest run of identical consecutive bases) that a sequence can contain and still be considered valid.
  • We recommend using this option with Illumina data only; do not use it with 454 or PGM data.
  • 454 and PGM sequencers have trouble with homopolymers longer than 4-6 base pairs, whereas the Illumina platform does not (at least not to the extent that the other platforms do).  As such, you can set the -H option to a very high number to keep QIIME from throwing away valid sequences.  This is a bigger issue in fungal samples than in bacterial samples.
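The platform-dependent recommendations above can be collected into a small helper.  The following Python sketch only builds the argument list and does not run QIIME.  The function name, the platform labels and the file names are hypothetical; only the -m, -f, -q, -o, -p and -H options come from the text above.

```python
def build_split_libraries_args(mapping, fasta, qual, out_dir, platform):
    """Return a split_libraries.py argument list following the advice above.

    `platform` is a hypothetical label such as "illumina", "454" or "pgm".
    """
    args = [
        "split_libraries.py",
        "-m", mapping,   # mapping TXT file
        "-f", fasta,     # FASTA sequence (FNA) file
        "-q", qual,      # FASTA quality file
        "-o", out_dir,   # output folder
        "-p",            # primers were already removed during QC
    ]
    if platform.lower() == "illumina":
        # Only raise the homopolymer limit for Illumina data.
        args += ["-H", "1000"]
    return args


# Hypothetical file names, shown only to illustrate the resulting command line.
print(" ".join(build_split_libraries_args(
    "mapping.txt", "seqs.fna", "seqs.qual", "split_out", "illumina")))
```

The returned list can be handed to subprocess.run on a machine where QIIME is installed.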

  

How do I use MOTHUR with RTL Genomics Data?

MOTHUR’s trim.seqs command demultiplexes sequence data and formats it for the downstream analysis commands.  This command requires two files: a FASTA formatted sequence file and an oligos mapping file.  If you provide this command with the FASTA sequence file and oligos file provided by RTL Genomics, then you should be able to use this data in all downstream analysis.
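As a rough illustration, the Python sketch below assembles the trim.seqs call as a string using MOTHUR's fasta and oligos parameters.  The helper name and the file names are hypothetical, and any additional trim.seqs parameters you may want are not covered here.

```python
def build_trim_seqs_command(fasta, oligos):
    """Return a MOTHUR trim.seqs command string for the two required files."""
    return "trim.seqs(fasta={}, oligos={})".format(fasta, oligos)


# Hypothetical file names; substitute the files from your FASTA archive.
command = build_trim_seqs_command("seqs.fasta", "seqs.oligos")
print(command)
# To run it non-interactively, MOTHUR accepts the command prefixed with '#':
print('mothur "#{}"'.format(command))
```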

What exactly does No Hit mean in my analysis?

The RTL Genomics analysis pipeline generates a number of files as discussed in the section “Analysis Archive File Descriptions” on page 25.  The “FullTaxa” and “TrimmedTaxa” files usually contain the assigned taxonomic information, but in some cases they contain the “No Hit” keyword at every taxonomic level.  The “No Hit” keyword simply means that no match in our database meets the minimum criteria to be considered likely.  As such, we cannot make any confident call regarding the taxonomic classification of the OTU, even at the Kingdom/Domain level.  There are a number of reasons why this can occur, which we describe below:

1.       The organism’s sequence is missing from our database

a.       New sequences and organisms are added to NCBI/EMBL/DDBJ on a daily basis and, while RTL Genomics makes every effort to keep our databases current, it may take a few months before new sequences are in our database. 
b.      Many sequences in NCBI/EMBL/DDBJ are too short or contain no taxonomic information and are excluded from being added to our database.

2.       The organism’s sequence data is not yet in NCBI/EMBL/DDBJ.

a.       NCBI/EMBL/DDBJ contain a vast amount of data, however they require researchers to have already sequenced an organism before they have sequence data to provide.  If no one has sequenced the organism and submitted it to those repositories, then we will not yet have the sequence.
b.      While this is not often the case, it is possible that your sequence data contains an organism not yet known to science, i.e. a novel species.

3.       Low quality sequence

a.       While RTL Genomics does perform quality and chimera checking on your data, these algorithms are not fool-proof and low quality or chimeric sequences may have managed to make it to the taxonomic analysis stage.  These sequences will often fail to identify as any organism due to their low quality or chimeric nature, causing them to be marked as “No Hit”.

What is the difference between Unknown and Unclassified in my analysis files?

The RTL Genomics analysis pipeline generates a number of files as discussed in the section “Analysis Archive File Descriptions” on page 25.  The “FullTaxa” and “TrimmedTaxa” files usually contain the assigned taxonomic information, but in some cases they contain the “Unknown” and “Unclassified” keywords, which can cause some confusion.  These keywords are described as follows:

Unknown

  • Our algorithm was unable to make a confident determination regarding the taxonomic classification at a certain level.
  • See section “USEARCH Global Search (Default)” on page 19 for more information on how we determine confidence.

Unclassified

  • The taxonomic information retrieved from NCBI contains missing information at this level.
  • For instance, if the best match in our database is classified in NCBI down to the Family level then our database will mark the Genus and Species as “Unclassified”.
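To see how these keywords affect a single taxonomy string, the Python sketch below finds the deepest level that still carries a real taxon name.  It assumes a string of seven levels separated by semicolons, as in the worked example later in this FAQ; the function and constant names are hypothetical, and the actual layout of the analysis files may differ.

```python
# Assumed level order, kingdom through species.
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
# Keywords that stand in for a missing or low-confidence taxon.
KEYWORDS = {"No Hit", "Unknown", "Unclassified"}


def deepest_assigned_level(taxa_string):
    """Return (level, name) for the deepest level holding a real taxon, or None."""
    names = [name.strip() for name in taxa_string.split(";")]
    deepest = None
    for level, name in zip(LEVELS, names):
        if name not in KEYWORDS:
            deepest = (level, name)
    return deepest


trimmed = ("Bacteria ; Proteobacteria ; Alphaproteobacteria ; Rhizobiales ; "
           "Bradyrhizobiaceae ; Bradyrhizobium ; Unknown")
print(deepest_assigned_level(trimmed))
```

An OTU marked “No Hit” at every level returns None, since no level holds a usable name.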

How are confidence values determined?

The RTL Genomics analysis pipeline generates a number of files as discussed in the section “Analysis Archive File Descriptions” on page 25.  Once each OTU has been aligned to our database, our algorithm selects the top/best six matches for the OTU and attempts to assign a confidence value to each taxonomic level.  The top match is compared against the other five to determine the number of matches that agree with the top match at each taxonomic level.  The number of agreements is then converted into a confidence value using the following equation:

confidence = (agreements + 1) / 6

where agreements is the number of the other five matches that agree with the top match at that taxonomic level, and the added 1 counts the top match itself.

For instance, if OTU ID #13 has 6 top matches with the following taxa:

1)      Bacteria ; Proteobacteria ; Alphaproteobacteria ; Rhizobiales ; Bradyrhizobiaceae ; Bradyrhizobium ; Bradyrhizobium sp

2)      Bacteria ; Proteobacteria ; Alphaproteobacteria ; Rhizobiales ; Bradyrhizobiaceae ; Bradyrhizobium ; Bradyrhizobium sp

3)      Bacteria ; Proteobacteria ; Alphaproteobacteria ; Rhizobiales ; Bradyrhizobiaceae ; Bradyrhizobium ; Bradyrhizobium sp

4)      Bacteria ; Proteobacteria ; Alphaproteobacteria ; Rhizobiales ; Bradyrhizobiaceae ; Bradyrhizobium ; Bradyrhizobium japonicum

5)      Bacteria ; Proteobacteria ; Alphaproteobacteria ; Rhizobiales ; Bradyrhizobiaceae ; Bradyrhizobium ; Bradyrhizobium japonicum

6)      Bacteria ; Proteobacteria ; Alphaproteobacteria ; Rhizobiales ; Bradyrhizobiaceae ; Rhodopseudomonas ; Rhodopseudomonas palustris

 

Then each taxonomic level would receive the following confidence:

 

[Table 1: confidence value assigned to each taxonomic level for the example OTU]

 

Using this example, we would assign a confidence of 1 (100%) to the kingdom, phylum, class, order and family taxa. We would then assign a confidence of .83 (83%) to the genus taxon and a confidence of .5 (50%) to the species.  These confidence values are then used when the “TrimmedTaxa” files are generated.
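The worked example can be reproduced with a few lines of Python.  This is only a sketch: it assumes the confidence at a level is the fraction of the six top matches, including the top match itself, whose name at that level equals the top match's name, which is consistent with the values quoted above.  The function name is hypothetical and this is not the pipeline's actual code.

```python
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

PREFIX = "Bacteria ; Proteobacteria ; Alphaproteobacteria ; Rhizobiales ; Bradyrhizobiaceae ; "
# The six top matches for the example OTU, best match first.
MATCHES = (
    [PREFIX + "Bradyrhizobium ; Bradyrhizobium sp"] * 3
    + [PREFIX + "Bradyrhizobium ; Bradyrhizobium japonicum"] * 2
    + [PREFIX + "Rhodopseudomonas ; Rhodopseudomonas palustris"]
)


def confidence_by_level(matches):
    """Return {level: confidence}, comparing every match to the first (top) match."""
    parsed = [[name.strip() for name in match.split(";")] for match in matches]
    top = parsed[0]
    confidences = {}
    for index, level in enumerate(LEVELS):
        # Count matches (the top match included) that share the top match's name.
        agreeing = sum(1 for names in parsed if names[index] == top[index])
        confidences[level] = agreeing / len(parsed)
    return confidences


for level, value in confidence_by_level(MATCHES).items():
    print("{:8s} {:.2f}".format(level, value))
```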

 

What is the difference between FullTaxa and TrimmedTaxa Files?

The RTL Genomics analysis pipeline generates a number of files as discussed in the section “Analysis Archive File Descriptions” on page 25.  In our analysis data we provide most of our analysis in duplicate files, one containing the analysis using the “FullTaxa” and the other containing the analysis using the “TrimmedTaxa”.  Please see “How are confidence values determined?” found on page 8 as this discussion will assume you have an idea of what confidence values are and how we assign them.

The FullTaxa files are generated under the assumption that the best match is the correct one.  As such, each taxonomic level is assigned using that match without taking the confidence values into account.  We provide this file for two reasons: 1) this method for assigning taxa is similar to the method we used many years ago, and we continue to provide these files for legacy purposes, and 2) it lets you see what the original best match was before confidence trimming occurred.  We believe these files are a powerful tool for seeing what our algorithm originally thought the data contained before confidence values were taken into account, which can help you get a better feel for the data.  However, we advise against using only the FullTaxa data to perform your analysis, because without confidence trimming the data is considerably less accurate.

The TrimmedTaxa files are generated using the FullTaxa data after the confidence values have been taken into account.  As such each taxonomic level is assigned only if the confidence value is greater than or equal to .51 (51%).  If a taxon falls below .51, it is replaced with the “Unknown” keyword.  Using the example data provided in the section “How are confidence values determined?” on page 8, the FullTaxa would read Bacteria ; Proteobacteria ; Alphaproteobacteria ; Rhizobiales ; Bradyrhizobiaceae ; Bradyrhizobium ; Bradyrhizobium sp and the TrimmedTaxa would read Bacteria ; Proteobacteria ; Alphaproteobacteria ; Rhizobiales ; Bradyrhizobiaceae ; Bradyrhizobium ; Unknown, where the species is now assigned “Unknown” due to the low confidence value.
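The trimming rule described above can be sketched in Python as follows.  The function name and the list-based inputs are hypothetical, chosen only for illustration; the 0.51 threshold and the “Unknown” keyword come from the text.

```python
THRESHOLD = 0.51  # minimum confidence for a taxon to be kept


def trim_taxa(full_taxa, confidences, threshold=THRESHOLD):
    """Replace every taxon whose confidence is below the threshold with "Unknown"."""
    return [
        name if confidence >= threshold else "Unknown"
        for name, confidence in zip(full_taxa, confidences)
    ]


# FullTaxa and confidence values for the example OTU, kingdom through species.
full_taxa = ["Bacteria", "Proteobacteria", "Alphaproteobacteria", "Rhizobiales",
             "Bradyrhizobiaceae", "Bradyrhizobium", "Bradyrhizobium sp"]
confidences = [1.0, 1.0, 1.0, 1.0, 1.0, 5 / 6, 0.5]

print(" ; ".join(trim_taxa(full_taxa, confidences)))
```

Only the species falls below the threshold in this example, so it is the only level replaced.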

Brief Synopsis

FullTaxa

o   Generated under the assumption that the top/best match is completely accurate.

o   Does not take confidence values into account.

o   Provided primarily for legacy purposes and not recommended to be used for detailed analysis.

TrimmedTaxa

o   Takes the confidence value into account at each taxonomic level.

o   Replaces low confidence taxa with the “Unknown” keyword.

Do my sequences contain the primer, barcodes or adapters?

Upon the completion of an order at RTL Genomics, clients receive two zip archives containing their sequence data.  The files are described in the section “File Descriptions and Formatting” starting on page 20.  The raw data archive contains your sequence data directly from the sequencer with no post-processing done on our end.  This data is packaged as one SFF file or one pair of FASTQ files per sample.  The FASTA data archive contains your sequence data after we have performed denoising and some basic quality checking on the data.  This data is provided as a single FASTA formatted sequence file and quality file that contain all of your sequences multiplexed together.

 

Please refer to the following graph in order to determine whether the primer, barcodes or adapters are on your sequences: