Deriving accurate microbiota profiles from human samples with low bacterial content through post-sequencing processing of Illumina MiSeq data

Jake Jervis-Brady, Lex E X Leong, Shashikanth Marri, Renee J Smith, Jocelyn M Choo, Heidi Smith-Vaughan, Elizabeth Nosworthy, Peter Morris, Stephen O'Leary, Geraint Rogers, Robyn Marsh

    Research output: Contribution to journal; Article; Research; peer-review

    Abstract

    Background: The rapid expansion of 16S rRNA gene sequencing in challenging clinical contexts has resulted in a growing body of literature of variable quality. To a large extent, this is due to a failure to address spurious signal that is characteristic of samples with low levels of bacteria and high levels of non-bacterial DNA. We have developed a workflow based on the paired-end read Illumina MiSeq-based approach, which enables significant improvement in data quality, post-sequencing. We demonstrate the efficacy of this methodology through its application to paediatric upper-respiratory samples from several anatomical sites.

    Results: A workflow for processing sequence data was developed based on commonly available tools. Data generated from different sample types showed a marked variation in levels of non-bacterial signal and ‘contaminant’ bacterial reads. Significant differences in the ability of reference databases to accurately assign identity to operational taxonomic units (OTU) were observed. Three OTU-picking strategies were trialled as follows: de novo, open-reference and closed-reference, with open-reference performing substantially better. Relative abundance of OTUs identified as potential reagent contamination showed a strong inverse correlation with amplicon concentration allowing their objective removal. The removal of the spurious signal showed the greatest improvement in sample types typically containing low levels of bacteria and high levels of human DNA. A substantial impact of pre-filtering data and spurious signal removal was demonstrated by principal coordinate and co-occurrence analysis. For example, analysis of taxon co-occurrence in adenoid swab and middle ear fluid samples indicated that failure to remove the spurious signal resulted in the inclusion of six out of eleven bacterial genera that accounted for 80% of similarity between the sample types.

    Conclusions: The application of the presented workflow to a set of challenging clinical samples demonstrates its utility in removing the spurious signal from the dataset, allowing clinical insight to be derived from what would otherwise be highly misleading output. While other approaches could potentially achieve similar improvements, the methodology employed here represents an accessible means to exclude the signal from contamination and other artefacts.
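
The contaminant-removal step described in the Results (OTUs whose relative abundance rises as amplicon concentration falls) can be illustrated with a minimal sketch. The abstract does not state which correlation statistic or cut-off was used, so the Spearman rank correlation, the `threshold=-0.7` value, the function names and the toy data below are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def flag_contaminants(rel_abundance, amplicon_conc, threshold=-0.7):
    """Flag OTUs whose abundance is strongly inversely correlated with
    amplicon concentration. The threshold is a hypothetical cut-off."""
    flagged = []
    for otu, abundance in rel_abundance.items():
        if len(abundance) != len(amplicon_conc):
            raise ValueError(f"{otu}: expected one abundance value per sample")
        if spearman(abundance, amplicon_conc) <= threshold:
            flagged.append(otu)
    return sorted(flagged)


# Toy data (hypothetical): amplicon concentration per sample, and the
# relative abundance of three OTUs across the same six samples.
amplicon_conc = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0]
rel_abundance = {
    "OTU_a": [0.60, 0.40, 0.30, 0.10, 0.05, 0.01],  # falls as concentration rises
    "OTU_b": [0.10, 0.20, 0.30, 0.50, 0.60, 0.70],  # rises with concentration
    "OTU_c": [0.30, 0.40, 0.40, 0.40, 0.35, 0.29],  # no clear trend
}
suspects = flag_contaminants(rel_abundance, amplicon_conc)
```

In this toy example only `OTU_a` is flagged. A rank-based statistic is chosen here because amplicon concentrations often span orders of magnitude; a real analysis would also need to consider significance testing and negative-control samples.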
    Original language: English
    Pages (from-to): 1-11
    Number of pages: 11
    Journal: Microbiome
    Volume: 3
    Issue number: 19
    DOI: 10.1186/s40168-015-0083-8
    Publication status: Published - 2015

    Fingerprint

    Workflow
    Microbiota
    Bacteria
    Adenoids
    DNA
    Middle Ear
    rRNA Genes
    Artifacts
    Databases
    Pediatrics

    Cite this

    Jervis-Brady, Jake ; Leong, Lex E X ; Marri, Shashikanth ; Smith, Renee J ; Choo, Jocelyn M ; Smith-Vaughan, Heidi ; Nosworthy, Elizabeth ; Morris, Peter ; O'Leary, Stephen ; Rogers, Geraint ; Marsh, Robyn. / Deriving accurate microbiota profiles from human samples with low bacterial content through post-sequencing processing of Illumina MiSeq data. In: Microbiome. 2015 ; Vol. 3, No. 19. pp. 1-11.
    @article{295091378f0c42b69a8adbe1d49a3830,
    title = "Deriving accurate microbiota profiles from human samples with low bacterial content through post-sequencing processing of Illumina MiSeq data",
    author = "Jake Jervis-Brady and Leong, {Lex E X} and Shashikanth Marri and Smith, {Renee J} and Choo, {Jocelyn M} and Heidi Smith-Vaughan and Elizabeth Nosworthy and Peter Morris and Stephen O'Leary and Geraint Rogers and Robyn Marsh",
    year = "2015",
    doi = "10.1186/s40168-015-0083-8",
    language = "English",
    volume = "3",
    pages = "1--11",
    journal = "Microbiome",
    issn = "2049-2618",
    publisher = "BioMed Central",
    number = "19",

    }

    TY - JOUR

    T1 - Deriving accurate microbiota profiles from human samples with low bacterial content through post-sequencing processing of Illumina MiSeq data

    AU - Jervis-Brady, Jake

    AU - Leong, Lex E X

    AU - Marri, Shashikanth

    AU - Smith, Renee J

    AU - Choo, Jocelyn M

    AU - Smith-Vaughan, Heidi

    AU - Nosworthy, Elizabeth

    AU - Morris, Peter

    AU - O'Leary, Stephen

    AU - Rogers, Geraint

    AU - Marsh, Robyn

    PY - 2015

    Y1 - 2015

    U2 - 10.1186/s40168-015-0083-8

    DO - 10.1186/s40168-015-0083-8

    M3 - Article

    VL - 3

    SP - 1

    EP - 11

    JO - Microbiome

    JF - Microbiome

    SN - 2049-2618

    IS - 19

    ER -