In microbial diversity or metagenomic analysis, several terms appear frequently, such as OTU, community structure, alpha diversity, and beta diversity. This article explains these terms.
- OTU classification
OTU stands for Operational Taxonomic Unit. It is an artificially defined taxonomic unit: in microbial diversity analysis, sequences are generally clustered at 97% similarity, and each resulting cluster is treated as one OTU.
Microbial research often focuses on the community structure of a habitat. For example, for human intestinal samples the intestinal environment can be regarded as a habitat, and for soil sampled in a certain area the regional soil can be regarded as a habitat. Communities in similar habitats tend to have similar compositions.
Therefore, diversity research proceeds as follows. First, cluster the valid tags of all samples at 97% similarity to define OTUs (tags here are the sequences obtained by merging paired-end reads). For example, 90,000 tags may cluster into 2,000 OTUs. Then select the longest or the most abundant sequence in each OTU as its representative sequence. Finally, align these 2,000 representative sequences against a reference database and annotate them.
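The clustering step can be illustrated with a minimal sketch. It assumes a greedy, abundance-sorted scheme and a simplified identity measure (fraction of matching positions between equal-length tags); real pipelines compute identity from alignments and may cluster differently. The names `identity` and `cluster_otus` and the toy tags are hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length tags.

    Assumption: tags are already aligned and of equal length; real pipelines
    compute identity from a pairwise alignment instead.
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(tags, threshold=0.97):
    """Greedily group tags into OTUs at the given similarity threshold."""
    otus = []  # each OTU: {"rep": representative sequence, "size": tag count}
    # Visit unique tags from most to least abundant, so the representative
    # of each OTU is its most abundant sequence.
    for seq, n in Counter(tags).most_common():
        for otu in otus:
            if identity(seq, otu["rep"]) >= threshold:
                otu["size"] += n
                break
        else:
            # No existing OTU is similar enough: start a new one.
            otus.append({"rep": seq, "size": n})
    return otus


# Toy data: a 100-nt tag, a variant with 2 mismatches, one with 8 mismatches.
base = "ACGT" * 25
close_variant = "TT" + base[2:]
distant_variant = "T" * 10 + base[10:]
tags = [base] * 5 + [close_variant] * 2 + [distant_variant] * 3

for otu in cluster_otus(tags):
    print(otu["size"], otu["rep"][:12])
```

Here the representative is the most abundant sequence; choosing the longest sequence instead would only change how the representative is picked within each cluster.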
OTU-level analyses can be presented in several ways:
1). OTU-based Venn diagram and petal diagram: These count the unique and shared OTUs between different samples or groups.
2). Phylogenetic tree based on OTU representative sequences: OTUs with higher abundance can be selected, a phylogenetic tree of these OTUs constructed, and the tree displayed alongside a heatmap, so that OTUs with relatively high and low abundance in different samples or groups can be seen clearly.
3). OTU-based heatmap: It visually displays the differences in OTU abundance between samples or groups.
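The counts behind a Venn or petal diagram are plain set operations. The sketch below assumes each sample or group has already been reduced to the set of OTU identifiers detected in it; the function name `shared_and_unique` and the OTU ids are hypothetical.

```python
def shared_and_unique(otu_sets):
    """Return (OTUs shared by all groups, {group: OTUs found only there}).

    otu_sets: dict mapping a sample or group name to a set of OTU ids.
    """
    shared = set.intersection(*otu_sets.values())
    unique = {}
    for name, otus in otu_sets.items():
        # Union of the OTUs seen in every other group.
        others = set().union(*(s for n, s in otu_sets.items() if n != name))
        unique[name] = otus - others
    return shared, unique


# Hypothetical example with three groups.
groups = {
    "A": {"OTU1", "OTU2", "OTU3"},
    "B": {"OTU2", "OTU3", "OTU4"},
    "C": {"OTU3", "OTU5"},
}
shared, unique = shared_and_unique(groups)
print(sorted(shared))
print({name: sorted(otus) for name, otus in unique.items()})
```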
- Community structure
The microorganisms in a habitat can be regarded as one large ecological community, composed of dominant bacterial genera and low-abundance bacterial genera. Microbial abundances differ between habitats, and the composition of a habitat in terms of these taxa and their abundances can be understood as its community structure.
In general, community structure can be analyzed from several angles:
1). Bar chart (histogram) of community structure distribution: It shows the overall community composition of different samples or groups, and the differences between those compositions.
2). Heatmap of community structure distribution: It visually displays the abundance of species at the phylum, family, and genus levels.
3). Ternary plot of community structure distribution: A ternary plot uses an equilateral triangle to describe the proportions of three variables. In this analysis, the species or functional composition of three samples or three groups of samples is compared according to species classification or functional information, and the triangle shows the proportion of each species or function across them. The ternary plot therefore mainly shows how species are distributed among three different samples or groups.
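All three plots start from the same table: the relative abundance of each taxon in each sample at a chosen rank. A minimal sketch of that step follows, assuming OTU counts and an OTU-to-taxon annotation are available as dictionaries; `relative_abundance`, the OTU ids, and the genus assignments are illustrative only.

```python
from collections import defaultdict


def relative_abundance(otu_counts, otu_taxon):
    """Sum OTU counts per taxon and convert them to proportions.

    otu_counts: dict of OTU id -> tag count in one sample.
    otu_taxon:  dict of OTU id -> taxon name at the chosen rank.
    OTUs without an annotation are grouped as "Unclassified".
    """
    totals = defaultdict(int)
    for otu, count in otu_counts.items():
        totals[otu_taxon.get(otu, "Unclassified")] += count
    depth = sum(totals.values())
    if depth == 0:
        return {}
    return {taxon: n / depth for taxon, n in totals.items()}


# Hypothetical sample annotated at genus level.
counts = {"OTU1": 60, "OTU2": 30, "OTU3": 10}
genus = {"OTU1": "Bacteroides", "OTU2": "Bacteroides", "OTU3": "Lactobacillus"}
print(relative_abundance(counts, genus))
```

Running the same function with phylum-level or family-level annotations gives the tables for the other ranks.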
The α-, β-, and γ-diversity of community ecology
- Alpha diversity
Both alpha and beta diversity come from ecology and can be understood as diversity at two different spatial scales. Alpha diversity refers to the diversity of species within a habitat: it involves no comparison and only evaluates how diverse that habitat is. Beta diversity compares the diversity of different habitats.
There are many indexes for evaluating alpha diversity: observed species (the number of OTUs observed), the Shannon index, the Simpson index, the Chao index, the ACE index, and so on.
Different indexes have different focuses and different calculation formulas. In general: observed species is the number of OTUs detected; the Shannon index reflects both the species richness and the evenness of the community; the Chao index estimates the number of species actually present in the community from the number of OTUs detected only once and only twice. The Chao index is therefore relatively sensitive to trace bacteria (low-abundance species).
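A few of these indexes are short enough to sketch directly. The code below uses the textbook definitions and makes some assumptions that vary between tools: the Shannon index uses the natural logarithm, the Simpson index is given in its 1 - sum(p^2) form, and Chao1 falls back to the bias-corrected formula when no OTU is seen exactly twice. ACE is omitted.

```python
import math


def observed_species(counts):
    """Number of OTUs with at least one tag."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p); natural log is an assumption."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson diversity in the 1 - sum(p^2) form."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def chao1(counts):
    """Chao1 richness estimate from singletons (f1) and doubletons (f2)."""
    s_obs = observed_species(counts)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form used when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2


# Hypothetical OTU counts for one sample.
sample = [5, 3, 1, 1, 2]
print(observed_species(sample), shannon(sample), simpson(sample), chao1(sample))
```

Because Chao1 depends only on the observed richness and on the OTUs seen once or twice, adding or removing a few rare OTUs changes it noticeably, which is the sensitivity to low-abundance species described above.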
Alpha diversity can be presented from the following angles:
1). The value of each index can be calculated for every sample and collected in an index table.
With such an index table, we can assess the diversity of each sample. To compare diversity or evenness between samples from the index values, first randomly subsample the sequences in each sample to the same depth, then compare the indexes between samples at that equal sequencing depth.
2). Sequencing saturation can be evaluated by plotting a diversity index against sequencing depth: when the curve flattens, additional sequencing yields few new OTUs.
3). The diversity indexes of different treatment groups can be compared to test whether they differ significantly between groups.
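The random sampling step and the saturation check can be sketched together. This is a minimal illustration that assumes subsampling without replacement and uses the number of observed OTUs as the index; the names `rarefy` and `rarefaction_curve` and the counts are hypothetical.

```python
import random
from collections import Counter


def rarefy(otu_counts, depth, seed=None):
    """Randomly draw `depth` tags from a sample without replacement."""
    # Expand the counts into one entry per tag, then subsample.
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of sequences in the sample")
    rng = random.Random(seed)
    return Counter(rng.sample(pool, depth))


def rarefaction_curve(otu_counts, depths, seed=None):
    """Observed OTUs at each subsampling depth, from a single draw per depth."""
    return [(d, len(rarefy(otu_counts, d, seed))) for d in depths]


# Hypothetical sample with 100 tags.
sample = {"OTU1": 50, "OTU2": 30, "OTU3": 15, "OTU4": 5}
print(rarefy(sample, 40, seed=1))
print(rarefaction_curve(sample, [10, 25, 50, 100], seed=1))
```

In practice each depth is usually drawn several times and the results averaged; a single draw keeps the sketch short.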
- Beta diversity analysis
Beta diversity focuses on the comparison of community composition in different habitats. Analysis methods commonly used to show beta diversity are:
- PCA (principal component analysis). PCA is a linear method applied directly to the abundance table; it does not rely on a distance matrix algorithm.
- PCoA and NMDS, which are based on a distance matrix. Unlike PCA, PCoA and NMDS can use different distance measures (unweighted UniFrac, weighted UniFrac, Bray-Curtis, binary Jaccard, Euclidean, etc.) to compare the similarity between samples.
- RDA/CCA, namely redundancy analysis (RDA) and canonical correspondence analysis (CCA). Environmental factors are introduced as variables and the community structure data is fitted against them, so that the relationships between samples, species, and environmental factors, pairwise or among all three, can be explored and tested with a permutation test.
But how should we choose among so many distance measures for beta diversity comparison? For microbial diversity research it is generally recommended to take the experimental design into account, try several distance measures, and select the most suitable one. Take the weighted and unweighted UniFrac distances as an example: the unweighted method considers only the presence or absence of species, that is, the difference in species membership between communities, while the weighted method considers both presence or absence and species abundance. Some treatment factors mainly change the abundance of microbial species; in that case, the weighted method may be more suitable.
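The difference between presence/absence and abundance-aware measures can be seen with two of the simpler distances listed above. UniFrac needs a phylogenetic tree, so the sketch below substitutes binary Jaccard (presence/absence only) and Bray-Curtis (abundance-aware) as stand-ins for the same contrast; the sample counts are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples given as OTU -> count."""
    otus = set(a) | set(b)
    num = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    den = sum(a.get(o, 0) + b.get(o, 0) for o in otus)
    return num / den if den else 0.0


def binary_jaccard(a, b):
    """Jaccard distance on presence/absence: 1 - |shared| / |union|."""
    present_a = {o for o, n in a.items() if n > 0}
    present_b = {o for o, n in b.items() if n > 0}
    union = present_a | present_b
    return 1 - len(present_a & present_b) / len(union) if union else 0.0


# Hypothetical treatment that shifts abundances but not membership.
control = {"OTU1": 50, "OTU2": 50}
treated = {"OTU1": 90, "OTU2": 10}
print(binary_jaccard(control, treated))  # same OTUs present in both samples
print(bray_curtis(control, treated))     # abundances differ
```

The two samples contain the same OTUs, so the presence/absence distance cannot separate them, while the abundance-aware distance can.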
- Statistical analysis (difference statistics or classification)
Multivariate statistical analysis looks for species that differ in abundance between groups, or for biomarkers of particular treatment groups. Some methods work on species abundance (ANOVA, G-test, Metastats, etc.), and others work on a distance matrix (Adonis, ANOSIM, etc.). The methods can also be divided into parametric and non-parametric tests.
There are also statistics for evaluating classification, such as ROC curve analysis, and other methods such as random forest classification and LEfSe analysis.
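As an illustration of the non-parametric idea, the sketch below runs a generic two-group permutation test on the abundance of a single species. It is not an implementation of any of the named methods (Metastats, Adonis, ANOSIM, LEfSe); the function name, the default parameters, and the data are assumptions chosen for the example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_perm=9999, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns the fraction of label shufflings whose mean difference is at
    least as large as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical relative abundances (%) of one genus in two treatment groups.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [11.0, 12.0, 13.0, 14.0, 15.0]
print(permutation_test(control, treated, n_perm=999))
```

No distribution is assumed for the abundances, which is why tests of this kind are often preferred for skewed microbial count data.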
CD Genomics’ integrated and novel microbial diversity detection platform delivers high-throughput, accurate 16S/18S/ITS analysis by combining advanced sequencing technology with quantitative PCR (q-PCR). The platform supports both short-read 16S/18S/ITS sequencing by next-generation sequencing and full-length 16S/18S/ITS sequencing by PacBio SMRT sequencing or Nanopore sequencing. The results are validated by Sanger sequencing or by q-PCR with genus- and species-specific primers. Quantitative PCR can also be used to quantify microbial species.