Preprocessing of Metagenomic Samples

Page last modified on: 2025-November-27

Introduction

Metagenomic samples are inherently complex because they contain mixtures of DNA sequences from multiple organisms, and sometimes also from environmental sources and contaminants, such as genetic sequences from the host (human, animal, or plant).

Preprocessing these samples, specifically removing the contaminants, is a critical step before conducting further analyses. The following sections outline the reasons for and steps in preprocessing your metagenomic samples prior to assembling them into contigs.

Improving Data Quality

Quality Filtering:

Next-generation sequencing (NGS) is a technology commonly used for metagenomic sequencing. This technology can generate sequencing reads of varying quality across multiple runs. Poor-quality reads, which may contain errors such as miscalled bases or ambiguous nucleotides, can lead to incorrect assemblies. Filtering out these reads helps ensure that downstream analyses use accurate sequences.
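
As an illustration of the idea, the following is a minimal sketch of a read-level quality filter. It assumes reads are already parsed into (name, sequence, quality) tuples with Phred+33 quality encoding. The function names and the thresholds (mean quality of 20, at most 5% ambiguous bases) are hypothetical choices for this example, not recommended values, and averaging Phred scores is a common simplification rather than the only way to score a read.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def passes_quality(sequence, quality_line, min_mean_q=20, max_n_fraction=0.05):
    """Return True if the read has few ambiguous bases and a high enough mean quality."""
    if not sequence:
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    scores = phred_scores(quality_line)
    mean_q = sum(scores) / len(scores)
    return n_fraction <= max_n_fraction and mean_q >= min_mean_q


def filter_reads(records, **thresholds):
    """Keep only the (name, sequence, quality) records that pass the filter."""
    return [rec for rec in records if passes_quality(rec[1], rec[2], **thresholds)]


# Toy input: three reads with made-up sequences and quality strings.
reads = [
    ("read1", "ACGTACGTAC", "IIIIIIIIII"),  # 'I' encodes Q40
    ("read2", "ACGTNNNNAC", "IIIIIIIIII"),  # 40% ambiguous bases
    ("read3", "ACGTACGTAC", "##########"),  # '#' encodes Q2
]
kept = filter_reads(reads)
print([name for name, _, _ in kept])
```

In this toy input only read1 is retained: read2 exceeds the ambiguous-base limit and read3 falls below the mean-quality threshold.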

Adapter and Primer Removal:

During metagenomic library preparation, various adapters and primers (depending on the library preparation kit) are ligated to the DNA fragments. If these sequences are not removed, they can interfere with the assembly process or taxonomic identification.
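
To make the trimming step concrete, here is a simplified sketch of 3' adapter removal using exact string matching. The function name, the `min_overlap` parameter, and the `ADAPTER` value are assumptions for this example only. The correct adapter sequence depends on the library preparation kit, and dedicated trimming tools also tolerate mismatches, which this sketch does not.

```python
def trim_adapter(sequence, quality_line, adapter, min_overlap=3):
    """Trim a 3' adapter (full or partial, exact match only) from a read."""
    # Case 1: the full adapter occurs in the read; cut at its first position.
    pos = sequence.find(adapter)
    if pos != -1:
        return sequence[:pos], quality_line[:pos]
    # Case 2: only the start of the adapter hangs off the 3' end of the read.
    longest = min(len(adapter) - 1, len(sequence))
    for overlap in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adapter[:overlap]):
            return sequence[:-overlap], quality_line[:-overlap]
    # Case 3: no adapter found; return the read unchanged.
    return sequence, quality_line


# Example adapter value only; use the sequence documented for your kit.
ADAPTER = "AGATCGGAAGAGC"

insert = "ACGTACGTTT"
full = insert + ADAPTER + "GG"      # read runs through the whole adapter
partial = insert + ADAPTER[:5]      # read ends inside the adapter

for read in (full, partial, insert):
    trimmed_seq, trimmed_qual = trim_adapter(read, "I" * len(read), ADAPTER)
    print(trimmed_seq, len(trimmed_qual))
```

The quality string is trimmed together with the sequence so that the two stay the same length, which downstream tools generally require.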

Removal of Contaminants

Host and Environmental Contamination:

When gathering metagenomic samples from a host (human, animal, or plant), the sample can include host cells as well as microbial cells. Host DNA may be present during library preparation and subsequent sequencing, potentially interfering with further analyses. Removing these contaminants helps to focus the analysis on the microbial community of interest rather than on the host.

Laboratory Contaminants:

In some cases, contaminants may arise from the laboratory where the library was prepared, introducing DNA sequences that were not originally present in the sampled community. As with host DNA, removing these contaminants helps to focus the analysis on the microbial community instead of the laboratory microbiome.
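
The screening idea behind both kinds of contaminant removal can be sketched with a toy k-mer lookup: reads that share most of their k-mers with a reference (a host genome or a set of known laboratory contaminant sequences) are flagged for removal. This is a deliberately simplified stand-in. In practice this step is usually done by aligning reads to the reference with a dedicated read aligner and discarding the reads that map. All names, the value of `k`, the 50% threshold, and the sequences below are made up for illustration.

```python
def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return sequence.upper().translate(complement)[::-1]


def build_index(reference_sequences, k):
    """Collect k-mers from both strands of every reference sequence."""
    index = set()
    for ref in reference_sequences:
        index |= kmers(ref, k)
        index |= kmers(reverse_complement(ref), k)
    return index


def is_contaminant(read, index, k, min_fraction=0.5):
    """Flag a read if at least min_fraction of its k-mers occur in the index."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return False  # read shorter than k: not enough evidence to flag it
    shared = sum(1 for kmer in read_kmers if kmer in index)
    return shared / len(read_kmers) >= min_fraction


K = 5  # toy value; real references need much longer k-mers
HOST = "ATGGCGTACGTTAGCCGATAGGCTTAACG"  # made-up "host" reference
index = build_index([HOST], K)

reads = {
    "host_forward": HOST[3:18],
    "host_reverse": reverse_complement(HOST[5:20]),
    "microbial": "TTTTTCCCCCGGGGGAAAAA",
}
clean = {name: seq for name, seq in reads.items()
         if not is_contaminant(seq, index, K)}
print(sorted(clean))
```

Indexing both strands matters because a sequenced read may originate from either strand of the contaminating DNA.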

The steps mentioned above (data quality checks and contaminant removal) provide several benefits for the downstream analysis of samples:

Enhanced MAG Recovery and Taxonomic Classification:

High-quality, filtered reads improve the performance of assembly and genome binning algorithms. Cleaner sequences increase the likelihood of correctly reconstructing genomes, known as metagenome-assembled genomes (MAGs), from fragmented reads. Similarly, removing low-quality reads or contaminated sequences reduces noise and helps prevent taxonomic misclassification and overestimation of abundance.

Reducing Computational Burden:

Due to their nature, metagenomic datasets can be enormous. Preprocessing steps like removing duplicates and filtering out non-informative sequences reduce the total number of sequence reads or fragments that need to be analyzed, thereby making computational analysis more efficient and less resource-intensive.

Minimizing Bias and Error:

During the amplification steps of library preparation, PCR duplicates or chimeric sequences can be generated. These artifacts can skew abundance estimates if not properly identified and removed. Some preprocessing workflows include error-correction procedures to statistically correct for systematic errors introduced during the sequencing process, further improving data reliability.
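
As a small illustration of duplicate removal, the sketch below collapses reads with identical sequences and keeps the first occurrence of each. It assumes single-end reads held in memory as (name, sequence, quality) tuples. The function name is hypothetical, and the sketch does not handle near-identical reads, reverse-complement duplicates, or paired-end data, which dedicated tools address.

```python
def remove_exact_duplicates(records):
    """Keep the first occurrence of each distinct read sequence, preserving order."""
    seen = set()
    unique = []
    for name, sequence, quality in records:
        key = sequence.upper()  # compare sequences case-insensitively
        if key in seen:
            continue  # an identical sequence was already kept
        seen.add(key)
        unique.append((name, sequence, quality))
    return unique


# Toy input with made-up reads; read2 and read4 repeat read1's sequence.
reads = [
    ("read1", "ACGTACGTAC", "IIIIIIIIII"),
    ("read2", "ACGTACGTAC", "IIIIHHHHII"),
    ("read3", "TTGACCGTAA", "IIIIIIIIII"),
    ("read4", "acgtacgtac", "IIIIIIIIII"),
]
deduplicated = remove_exact_duplicates(reads)
print([name for name, _, _ in deduplicated])
print(f"removed {len(reads) - len(deduplicated)} of {len(reads)} reads")
```

Because only the sequence is used as the key, two reads with the same bases but different quality strings are treated as duplicates, and the first one encountered is the one retained.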

In essence, by preprocessing metagenomic sequences before analyses, researchers can eliminate low-quality reads, contaminants, and artifacts arising from sampling and library preparation. This results in more reliable assembly and genome binning, more accurate taxonomic classification, and more efficient use of computational resources.

This cleaning process is a standard practice in metagenomics and underscores the importance of data quality in complex biological analyses. Collectively, these steps ensure that the biological conclusions drawn from metagenomic data reflect the true underlying microbial community rather than artifacts of sample preparation or sequencing processes.