This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Citation: Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT (2012) Direct Comparisons of Illumina vs. Roche 454 Sequencing Technologies on the Same Microbial Community DNA Sample. e30087. To provide new insights into these issues, we evaluated the two most frequently used platforms for microbial community metagenomic analysis, the Roche 454 FLX Titanium and the Illumina GA II, by comparing and contrasting reads and assemblies obtained from the same community DNA sample. We compared the reads from the Lanier.Illumina dataset against the Lanier.454 dataset to identify the fraction of reads shared between the two datasets. Hence, the majority of non-homopolymer-associated errors remain challenging to model and thus to correct. For example, Roche 454 sequencing may be advantageous for resolving sequences with repetitive structures or palindromes, or for metagenomic analyses based on unassembled reads, given its substantially longer read length. Although the use of the TIGR reference assembly resulted in a slightly higher number of sequence errors for both Illumina and Roche 454 data, Illumina consistently showed a smaller number of sequencing errors, and the relative error rate between the two platforms was similar to that based on the JGI genome data alone, independent of the reference genome used. Second, we directly assessed homopolymer error rate against reference genomes from GenBank that represented close relatives (average amino acid identity >70%) of the microorganisms sampled in the Lanier metagenome. The sponsors of this research had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Between 10 and 15 replicate datasets for each genome and each sequencing platform were analyzed; the exact number depended on the amount of total data available for each genome. We assessed homopolymer error rate in metagenomic data using two different strategies. Reciprocal best matches (RBMs) that overlapped by at least 500 bp and showed higher than 95% nucleotide identity were identified and re-aligned using ClustalW2 [31]. Gene sequences from assembled contigs were extracted and ClustalW2 [31] was used to align the sequences against their orthologs from the reference assembly. These findings suggest that both NGS technologies are reliable for quantitatively assessing genetic diversity within natural communities. The results from metagenomic samples were further validated against DNA samples of eighteen isolate genomes, which showed a range of genome sizes and G+C% content. The same cut-off was used to map raw reads on contigs. (B) Error rate (as a percentage of the total genes evaluated, y-axis) increases as homopolymer length increases (x-axis). It should be noted, however, that most of the previous error estimates and sequencing biases have been determined based on relatively simple DNA samples (e.g., a single viral genome) and thus their relevance for complex community DNA samples remains to be evaluated. PCC6803 (Cyanobacteria). Nine Illumina and eight Roche 454 assemblies from independent replicate datasets of the Fibrobacter succinogenes subsp. The resulting datasets were 502 Mbp (Lanier.454) and 2,460 Mbp (Lanier.Illumina) in size; all our bioinformatic analyses and comparisons were based on these trimmed datasets. 2B, inset) and this was primarily attributable to a higher sequencing error rate associated with A- and T-rich homopolymers (Fig. 2A, inset).
Assemblies were obtained for each possible combination, and the base call error and gap opening error of the resulting assemblies were determined as described for individual reads above. No additional external funding was received for this study. This resulted in a set of 500-bp-long sequence fragments, which were subsequently mapped onto the reference assembly using Blastn.
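
The reciprocal-best-match criterion described above (matches overlapping by at least 500 bp with more than 95% nucleotide identity) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the `Hit` record, its field names, and the choice of bit score to rank hits are hypothetical, while the two thresholds are the ones given in the text.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical record for one pairwise match (e.g. parsed from tabular Blastn output).
    query: str
    subject: str
    identity: float  # percent nucleotide identity
    aln_len: int     # length of the aligned (overlapping) region, in bp
    bitscore: float


def best_hits(hits):
    """Map each query to its highest-scoring hit (assumption: ranked by bit score)."""
    best = {}
    for h in hits:
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best


def reciprocal_best_matches(a_vs_b, b_vs_a, min_overlap=500, min_identity=95.0):
    """Return sorted (a, b) pairs that are each other's best hit and pass both thresholds."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = []
    for query, hit in best_ab.items():
        back = best_ba.get(hit.subject)
        if back is None or back.subject != query:
            continue  # not reciprocal
        if hit.aln_len >= min_overlap and hit.identity > min_identity:
            pairs.append((query, hit.subject))
    return sorted(pairs)
```

Keeping the reciprocity check separate from the length and identity filter makes it easy to see which pairs are discarded for which reason.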

We also estimated the abundance of each contig shared between the two assemblies by counting the number of reads composing the contig, which can be taken as a proxy of the abundance of the corresponding DNA sequence in the sample [19]. Affiliation: School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America. Illumina does not appear to share these limitations but it has its own systematic base calling biases [13]. For instance, it has been established that Roche 454 has a high error rate in homopolymer regions (i.e., three or more consecutive identical DNA bases) caused by accumulated light intensity variance [5], [11], and up to 15% of the resulting sequences are often products of artificial (in vitro) amplification [12]. Illumina GA II sequencing quality is evaluated in panels E and F, which show: (E) base call error rate of individual reads plotted against the G+C% of the genome; and (F) gap opening error rate of individual reads plotted against the G+C% of the genome. Correction: Direct Comparisons of Illumina vs. Roche 454 Sequencing Technologies on the Same Microbial Community DNA Sample. The two platforms agreed on over 90% of the assembled contigs and 89% of the unassembled reads, as well as on the estimated gene and genome abundance in the sample. These errors were not observed in the Illumina data, presumably due to both the high sequence coverage, which greatly facilitated the resolution of homopolymer ambiguities, and the less pronounced sequencing biases of Illumina. (B) Graph shows the comparison of the contig length of three assemblies plotted against the N statistic of the assembly [for instance, N40 (x-axis) is equal to about 1 Kbp (y-axis), which means that (100−40=60)% of the entire assembly is contained in contigs no shorter than 1 Kbp].
All 2D plots (panels B, D, E, and F) represent the arithmetic average of the medians of each dataset for the same genome; Illumina medians were identical among replicate datasets; therefore, only one value is shown in panel E. The results show that Illumina sequence quality was affected less than that of Roche 454 by the G+C% content of the sequenced DNA (note the lower r-squared value and the slope in E). Even though read lengths increase as the technologies advance, they are still far shorter than the desirable length (e.g., the average bacterial gene length is 950 bp) or the read length obtained from traditional Sanger sequencing (1000 bp). PLOS ONE 7(3): 10.1371/annotation/64ba358f-a483-46c2-b224-eaa5b9a33939. Shared reads were defined as those that mapped on reads of the other dataset using Bowtie with default settings [25]. A similar strategy based on reference genome sequences was used to identify and count non-homopolymer-related, single-base errors.
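
The homopolymer definition used in this study (three or more consecutive identical DNA bases) is easy to make concrete. The sketch below only locates such runs in a sequence; it does not reproduce the error-rate estimation itself, and the function names are hypothetical.

```python
import re

# A homopolymer is three or more consecutive identical bases, as defined in the text.
_HOMOPOLYMER = re.compile(r"([ACGT])\1{2,}")


def homopolymer_runs(seq):
    """Return (start, base, length) for every homopolymer run in seq."""
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in _HOMOPOLYMER.finditer(seq.upper())
    ]


def runs_by_base(seq):
    """Count homopolymer runs per base, e.g. to compare A/T runs with C/G runs."""
    counts = {}
    for _, base, _ in homopolymer_runs(seq):
        counts[base] = counts.get(base, 0) + 1
    return counts
```

Tallying runs per base in this way is one simple starting point for checking whether A- and T-rich homopolymers dominate in a given set of sequences.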

The average G+C% content of the metagenome was 47.4%; thus, our results are not simply attributable to higher abundance of A's and T's in the metagenome. Venn diagram showing the extent of overlapping and platform-specific sequences of assembled contigs longer than 500 bp. Graphs show the calculated base call error rate (A) and gap open error rate (B) for each comparison (figure key). The quality of the resulting contigs was examined in terms of base call error (A) and gap opening error (B), which revealed that the combination of the parameters of the assembly did not have a dramatic effect on the quality of the contigs (see projected contours on x-z and y-z space). These results revealed that, in general, the two platforms sampled the same fraction of the total diversity in the sample. For instance, derived assemblies overlapped in 90% of their total sequences, and in situ abundances of genes and genotypes (estimated based on sequence coverage) correlated highly between the two platforms (R²>0.9). Although low coverage contigs (e.g., 1× to 5×) are likely to contain a higher fraction of chimeric sequences than 0.2% according to our previous study [18], such contigs were rare in the results reported here, which included only contigs longer than 500 bp with average coverage of 10× or higher (only about 3% of the contigs showed less than 5× coverage). More importantly, most of our findings from metagenomic data were reproducible in data from isolate genomes, which were sequenced by both sequencing platforms and showed a range of G+C% content. Although Illumina generally provided assemblies equivalent to those of Roche 454, there may be cases where Illumina might be inferior to Roche 454. One aliquot was sequenced with the Roche 454 FLX Titanium sequencer (average read length, 450 bp) and the other one with the Illumina GA II (100×100 bp paired-end reads) at Emory University Genomics Facility.
3), low G+C% genomes sequenced with this platform may have 20% or more genes with frameshift errors, whereas the Illumina platform is not affected as much by the G+C% of the sequenced DNA. The resulting contigs were merged into one dataset, and Newbler was used to assemble this dataset into longer contigs, using the same parameters as in the assembly of Lanier.454 data. We applied widely used protocols to assemble both sets of reads (see Materials and Methods for details), which substantially collapsed the Lanier.Illumina dataset into 57 Mbp of total unique sequences and the Lanier.454 dataset into 46 Mbp. Note that Illumina assemblies recovered a significantly larger fraction of the reference genome than Roche 454 assemblies (two-tailed Mann-Whitney U test p-value=0.014), which is consistent with the results from the metagenomes. For convenience, we called the two sequence data sets Lanier.454 and Lanier.Illumina, respectively. The slightly higher single-base accuracy of Roche 454 metagenomic reads relative to that of the isolate genome reads is presumably due to the use of the latest, optimized Roche 454 protocol in the former and slight differences in the performance of the sequencers used. For each genome, we varied the amount of sequences input to the assembly and the primary parameters of assembly (K-mer for SOAPdenovo and Velvet, and minimal alignment length for Newbler).

We used the isolate genome data to evaluate the effect of the parameters of the assembly on the quality of the contigs as follows: a series of assemblies were obtained for genomes of low (Arcobacter nitrofigilis, 28%), medium (Fibrobacter succinogenes, 48%), and high (Cellulomonas flavigena, 74%) G+C% content. Abundance was determined based on the number and coverage of the contigs, as described elsewhere [17]. Samples were collected from Lake Lanier, Atlanta, GA, below the Browns Bridge in August 2009 and community DNA was extracted as described previously [17]. Funding: This research was supported, in part, by the U.S. Department of Energy (award DE-SC0004601). We aligned the assembled contigs from 9 Illumina and 8 Roche 454 assemblies from JGI data for the same genome against the TIGR reference assembly and calculated base call error rate and gap open error rate as described above for JGI genomes. Roche 454 sequencing quality is evaluated in panels A through D, which show: (A) base call error rate of individual reads (x-axis) for each genome evaluated (y-axis); (B) base call error rate (y-axis) plotted against the G+C% of the genome; (C) gap opening error rate of individual reads (x-axis) for each genome evaluated (y-axis); (D) gap opening error rate (y-axis) plotted against the G+C% of the genome. We extracted the predicted gene sequences from the reads and the corresponding amino acid sequences were searched against the genes of the reference assembly of the same dataset using BLAT [28]. For Lanier.Illumina, the SOAPdenovo [23] and Velvet [24] de novo assemblers were used to pre-assemble short reads into contigs using different K-mers. Given that the single-base error of individual reads was comparable between Lanier.454 and Lanier.Illumina (0.5% per base), our results reveal that the lower single-base error rate of Lanier.Illumina contigs (3% vs. 4.5% for Roche 454, counting homopolymer- and non-homopolymer-associated errors) is primarily due to the higher coverage obtained.
For instance, we noted that homopolymer-associated, single-base errors affected 1% of the protein sequences recovered in Illumina contigs of 10× coverage and 50% G+C; this frequency increased to 3% when non-homopolymer errors were also considered. We also quantitatively assessed the errors in the consensus sequences of the derived assemblies. (A) Venn diagram showing the extent of overlapping and platform-specific raw reads between the Lanier.454 and Lanier.Illumina datasets (without assembly).

Six genomes that represented abundant genera in the lake metagenome were identified this way. Next-generation sequencing (NGS) is commonly used in metagenomic studies of complex microbial communities, but whether or not different NGS platforms recover the same diversity from a sample and whether their assembled sequences are of comparable quality remains unclear. https://doi.org/10.1371/journal.pone.0030087. Velvet was used to assemble each of these Illumina datasets with K-mer set at 31. 4, which is based on isolate genome data). It is possible that the remaining 10% of the contig sequences might have been different because of imperfect or uneven splitting of the original DNA sample into the two aliquots sequenced and the fact that the diversity in the sample was not saturated by sequencing (estimates based on rarefaction curves using raw reads indicated that we sampled about 80–85% of the total diversity in the Illumina data). Assembly parameters (primary and secondary x-axes) were evaluated for low (Arcobacter nitrofigilis, 28%; left), medium (Fibrobacter succinogenes, 48%; middle), and high (Cellulomonas flavigena, 74%; right) G+C% genomes. For this, Blastn [30] was employed to search all gene sequences annotated in the Lanier.454 assembly against those in the Lanier.Illumina assembly. Our work also provides a methodology for evaluating and comparing metagenomic data from NGS platforms. Panels A and C represent the variation observed in reads from different (replicate) datasets of the same genome; red bars represent the median, the upper and lower box boundaries represent the upper and lower quartiles, and the upper and lower whiskers represent the largest and smallest observations.
The amount of Illumina and Roche 454 input sequence data was chosen so that the ratio of the two was similar to the ratio in the metagenomic analysis (2.5 Gb Illumina reads versus 500 Mbp Roche 454 reads, or 5:1). To validate our findings from metagenomics, we performed similar comparative analyses based on eighteen isolate genomes that were sequenced by both Illumina and Roche 454 and showed a range of genome sizes and G+C% content (Table 1). Lanier.454 and Lanier.Illumina reads were trimmed at both the 5′ and 3′ ends using a Phred quality score cutoff of 20. We found a strong linear correlation (r²>0.99) between the Roche 454 and Illumina data in this respect. Total unique sequences in this case included only contigs longer than 500 bp because shorter contigs were usually characterized by low coverage and thus were error-prone. We found that about 90% of the Roche 454 unique contig sequences overlapped with Illumina contig sequences. DT acknowledges the support of the Onassis Scholarship Foundation.
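
Trimming reads at both ends with a Phred quality cutoff of 20 can be illustrated with a short sketch. The text does not say which trimming algorithm or tool was used, so the simple strategy below (strip bases from each end until the first base with quality at or above the cutoff) is an assumption, as is the function name.

```python
def trim_read(seq, quals, cutoff=20):
    """Trim low-quality bases from both ends of a read.

    Assumption: bases are removed from each end until the first base whose
    Phred score is >= cutoff; low-quality bases inside the read are kept.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    start, end = 0, len(seq)
    while start < end and quals[start] < cutoff:
        start += 1
    while end > start and quals[end - 1] < cutoff:
        end -= 1
    return seq[start:end], quals[start:end]
```

A read whose bases all fall below the cutoff comes back empty, so callers would typically discard reads that end up shorter than some minimum length.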

In the former approach, we examined protein-coding sequences recovered in contigs longer than 500 bp that were shared between the Lanier.454 and Lanier.Illumina assemblies. Similarly, the reference assembly sequence was cut into 500-bp-long fragments and mapped onto assembled contigs longer than 500 bp; the unmapped regions of these contigs were identified as chimeric sequences and their total length (as a fraction of the total length of the contigs) represented the degree of chimerism for each dataset. Finally, we calculated the average single-base call error rate and gap opening error rate of individual reads of each dataset as follows: raw reads were trimmed using the same standards as described above and subsequently mapped onto the corresponding reference assembly from RefSeq. Finally, in all genomes analyzed, Illumina assemblies consistently recovered a larger percentage of the reference genome than Roche 454 assemblies (two-tailed Mann-Whitney U test p-value=0.014). We also found that the systematic single-base errors associated with GGC-motifs in Illumina data reported recently [16] represented only a minor fraction of the non-homopolymer-associated errors (0.015% of the total bases analyzed, consistent with the frequency reported in the original study).
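
The degree of chimerism defined above (total unmapped length as a fraction of total contig length, for contigs longer than 500 bp) can be computed as in the sketch below. The input layout is an assumption made for illustration: contig lengths in a dictionary, and mapped regions as half-open `(start, end)` intervals per contig, however they were obtained from the mapping step.

```python
def covered_length(intervals):
    """Total bp covered by half-open (start, end) intervals, merging overlaps."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def chimerism_fraction(contig_lengths, mapped, min_len=500):
    """Unmapped bp divided by total bp, over contigs longer than min_len."""
    total = unmapped = 0
    for name, length in contig_lengths.items():
        if length <= min_len:
            continue  # only contigs longer than 500 bp are considered
        total += length
        unmapped += length - covered_length(mapped.get(name, []))
    return unmapped / total if total else 0.0
```

Merging overlapping intervals first avoids counting a region twice when several reference fragments map to the same stretch of a contig.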

As noted above, similar gap opening errors were observed for the metagenomic reads from the two platforms and single-base accuracy was comparable between the two platforms (99.34% vs. 99.46% for the Lanier.454 and Lanier.Illumina metagenomic reads, respectively). (A) A's and T's contribute significantly more homopolymer errors than C's and G's. For comparing gene calling accuracy on unassembled reads, we employed FragGeneScan [27] to predict genes on Lanier.454 and Lanier.Illumina reads using the 454 1% error rate model and the Illumina 0.5% error model, respectively. Lanier.Illumina contigs were generally longer than Lanier.454 contigs, i.e., the assembly N50 (the contig length for which 50% of the entire assembly is contained in contigs no shorter than this length) was 1.6 Kbp versus 1.2 Kbp, respectively. Thus, Roche 454 is advantageous with respect to gene calling when working with unassembled reads.
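
The N50 definition given above (the contig length for which 50% of the entire assembly is contained in contigs no shorter than this length) translates directly into a few lines of code. The function name and the generalization to an arbitrary fraction are illustrative choices, not part of the original analysis.

```python
def nx(lengths, fraction=0.5):
    """Contig length L such that `fraction` of the assembly lies in contigs >= L.

    With fraction=0.5 this is the N50 as defined in the text.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    target = fraction * sum(lengths)
    running = 0
    # Walk from the longest contig down until the running total reaches the target.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
```

For contigs of 2, 3, 4, 5, and 6 Kbp (20 Kbp in total), the two longest contigs already hold 11 Kbp, so the N50 is 5 Kbp.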
