Prokaryotes viruses

2022.01.19 01:59

Several computational approaches have been introduced to predict the hosts of viruses from viral genomic sequences. They fall into two groups according to their dependence on alignment: alignment-based methods and alignment-free methods. Alignment-based methods search for sequences shared between virus and host genomes. Such sequences may come from spacer sequences used in CRISPR systems, integration sites used by proviruses, or horizontal gene transfer. BLAST is the most widely used tool for this purpose and predicts viral hosts with relatively high accuracy [16, 17] based on the similarity between virus and host genomes.

However, for newly identified viruses that are divergent from known ones, the applicability of the BLAST-based method may be limited. In addition, because CRISPR spacers record only the infection history of an individual prokaryotic cell, a precise phage-bacteria sequence match requires the unlikely preservation of those spacers.

The alignment-free methods predict the host of a virus based on k-mers that co-occur with those of other phages with known hosts [18], or on the similarity in sequence composition between viruses and their hosts.

WIsH predicts viral hosts by training a homogeneous Markov model for each potential host genome, then calculating the likelihood of a contig under each of the trained Markov models, and finally predicting the host whose model yields the highest likelihood.
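The three steps described above (train one Markov model per host, compute the likelihood of a contig under each model, pick the best) can be illustrated with a minimal sketch. This is not the WIsH implementation: the model order, the pseudocount, the function names, and the toy sequences are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov_model(genome, order=2, pseudocount=1.0):
    """Estimate log P(base | previous `order` bases) from a single genome."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(genome)):
        context, base = genome[i - order:i], genome[i]
        # Skip windows containing ambiguous bases such as N.
        if base in ALPHABET and all(c in ALPHABET for c in context):
            counts[context][base] += 1.0
    model = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values()) + pseudocount * len(ALPHABET)
        model[context] = {
            b: math.log((base_counts.get(b, 0.0) + pseudocount) / total)
            for b in ALPHABET
        }
    return model

def log_likelihood(contig, model, order=2):
    """Sum of log transition probabilities of the contig under one model."""
    uniform = math.log(1.0 / len(ALPHABET))  # fallback for unseen contexts
    total = 0.0
    for i in range(order, len(contig)):
        context, base = contig[i - order:i], contig[i]
        if base in ALPHABET:
            total += model.get(context, {}).get(base, uniform)
    return total

def predict_host(contig, models, order=2):
    """Return the host whose model gives the contig the highest likelihood."""
    return max(models, key=lambda host: log_likelihood(contig, models[host], order))

# Toy, made-up host genomes (hypothetical names and sequences).
hosts = {"host_a": "ACGT" * 50, "host_b": "AAAT" * 50}
models = {name: train_markov_model(seq) for name, seq in hosts.items()}
predicted = predict_host("ACGTACGTACGT", models)
```

Because every host is scored on the same contig, the raw log-likelihoods can be compared directly without length normalization.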

In this study, a Gaussian model (GM) for predicting the hosts of prokaryotic viruses was developed based on the differences in k-mer frequencies between viral and host genomic sequences.

A standalone tool and a web-based tool were implemented to run the GM. The GM not only outperformed previous alignment-free methods, but also complemented the alignment-based methods in predicting hosts for prokaryotic viruses.

The GM should facilitate the prediction of virus hosts in metagenomic studies. Prokaryotic viruses are referred to as viruses hereinafter unless otherwise specified. Two datasets were used to build and test computational models for predicting virus hosts. One pair of virus-host interaction Tetraselmis viridis virus - Tetraselmis sp. The updated VHM dataset contains a total of pairs of virus-host interactions, virus genomes, and 31, prokaryotic genomes.

It is available to the public on GitHub [22]. The second dataset was the test dataset, used to assess the computational models for predicting virus hosts.

In contrast to the VHM dataset, which contains virus-host interactions compiled from the NCBI RefSeq database [23] before May 5th, , the test dataset contains those submitted between May 6th, , and February 26th, The virus-host interactions that share both the viral species and the host genus with those in the VHM dataset were removed. The test dataset contains a total of pairs of virus-host interactions, virus genomes, and 60, prokaryotic genomes obtained from the NCBI genome database [24] on February 21st, The taxonomic distribution of both the viruses and the hosts in the test dataset and the VHM dataset was analyzed and is shown in Additional file 1: Figure S1.

When compared to the VHM dataset, the test dataset includes , 97, and 2 new viral species, genera, and families, respectively, and 37, 11, and 8 new host species, genera, and families, respectively. The Gaussian mixture model is a probabilistic model that uses a finite number of Gaussian distributions to fit data points and estimate their probability density [25].

Here, the Gaussian mixture model with only one component was found to perform best in predicting virus hosts (Additional file 1: Figure S2); therefore, the Gaussian mixture model was simplified to a Gaussian model (GM). The GM takes the differences in k-mer frequencies between virus and prokaryotic genomic sequences as features, and outputs a score (the logarithm of the probability of being the viral host) for the prokaryote.

The k-mers of 4 nucleotides were selected (Additional file 1: Figure S2), which resulted in features. The GM was built using the GaussianMixture class in scikit-learn [25, 26]. For each virus, the GM calculated a score (the logarithm of the probability of being the viral host) for all prokaryotic genomes available in the dataset. For example, in the test dataset, each of the 60, prokaryotic genomes would be assigned such a score by the GM.
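A minimal sketch of how a one-component model could be fitted and used for scoring with scikit-learn's GaussianMixture is given below. It is an illustration under stated assumptions, not the published code: the training data are synthetic, the feature dimension is reduced to 8 for brevity, the sign convention (virus minus host) is assumed, and the helper name score_hosts is hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_features = 8  # toy dimension; the real features are k-mer frequency differences

# Synthetic stand-in for the feature vectors of known virus-host pairs:
# differences that cluster tightly around zero.
train_diffs = rng.normal(0.0, 0.01, size=(200, n_features))

# A Gaussian mixture with a single component is an ordinary multivariate Gaussian.
gm = GaussianMixture(n_components=1, covariance_type="full", random_state=0)
gm.fit(train_diffs)

def score_hosts(model, virus_freq, host_freqs):
    """Log-density of (virus - host) frequency differences for each candidate host."""
    diffs = virus_freq - host_freqs  # broadcasts over the rows of host_freqs
    return model.score_samples(diffs)

# Five made-up candidate hosts; host 3 is constructed to resemble the virus.
virus_freq = rng.random(n_features)
host_freqs = virus_freq + rng.normal(0.0, 0.2, size=(5, n_features))
host_freqs[3] = virus_freq + 0.005

scores = score_hosts(gm, virus_freq, host_freqs)
best_host = int(np.argmax(scores))  # index of the highest-scoring candidate
```

Here score_samples returns one log-likelihood per row, so the candidate with the largest value plays the role of the predicted host.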

The prokaryotic species with the highest score was taken as the predicted host of the virus. The predicted host was compared to the actual host at different taxonomic levels. If the predicted host belonged to the same taxonomic unit (such as genus) as the actual host, the prediction was considered correct at that level.
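The comparison at different taxonomic levels can be sketched as follows, taking accuracy at a level to be the fraction of viruses whose predicted and actual hosts share the taxon at that level. The lineage dictionaries, rank list, and taxon labels are hypothetical placeholders, not data from the study.

```python
RANKS = ("genus", "family", "order", "class", "phylum")

def is_correct(predicted_lineage, actual_lineage, rank):
    """True if predicted and actual hosts share the same taxon at `rank`."""
    taxon = predicted_lineage.get(rank)
    return taxon is not None and taxon == actual_lineage.get(rank)

def accuracy_by_rank(predicted, actual, ranks=RANKS):
    """Fraction of viruses predicted correctly at each taxonomic rank."""
    n = len(predicted)
    return {
        rank: sum(is_correct(p, a, rank) for p, a in zip(predicted, actual)) / n
        for rank in ranks
    }

# Two made-up viruses with placeholder taxon labels.
predicted = [
    {"genus": "G1", "family": "F1", "order": "O1", "class": "C1", "phylum": "P1"},
    {"genus": "G2", "family": "F1", "order": "O1", "class": "C1", "phylum": "P1"},
]
actual = [
    {"genus": "G1", "family": "F1", "order": "O1", "class": "C1", "phylum": "P1"},
    {"genus": "G3", "family": "F1", "order": "O1", "class": "C1", "phylum": "P1"},
]
accuracies = accuracy_by_rank(predicted, actual)
```

In this toy case the second prediction misses the genus but matches the family, so accuracy rises from genus to family.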

The accuracy of virus host prediction at a given taxonomic level was defined as the proportion of viruses whose host was correctly predicted at that level. VHM and WIsH were the best alignment-free methods for predicting phage hosts according to previous studies [16, 19].

For comparison, they were tested with default parameters on the test dataset mentioned above. They were run with the code available on GitHub [27, 28]. Previous studies by Edwards et al.

We compared these methods with the GM in predicting virus hosts on the test dataset. To predict the virus host based on BLAST, the genome sequence of each virus was queried against the prokaryotic genomes in the test dataset using blastn version 2.

Then, the genome sequence of each virus was queried against the prokaryotic CRISPR spacer sequences using blastn version 2. The hits, i. The prokaryotic species with perfect hits to the virus genome was considered the potential host of the virus. Metagenomic assembly often yields partial genomes, so predicting the virus host from contigs of varying lengths is important.

Viruses and their hosts often share similar oligonucleotide frequency patterns in their genomes, yet the prediction of virus-host interactions based on this similarity remains challenging. To evaluate the ability of the GM to predict virus hosts, a strict testing strategy was adopted Fig.

First, a feature vector characterizing the differences in k-mer frequencies between viral and host genomic sequences was calculated for each virus-host interaction pair in the VHM dataset.
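One way such a feature vector could be computed is sketched below. The details are assumptions rather than the authors' procedure: k-mers are counted on the given strand only, windows containing ambiguous bases are skipped, and the difference is taken as virus minus host. The function names are hypothetical.

```python
from itertools import product

import numpy as np

def kmer_frequencies(seq, k=4):
    """Relative frequency of every possible k-mer over A, C, G, T (4**k entries)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # None for windows with N or other symbols
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def pair_features(virus_seq, host_seq, k=4):
    """Feature vector for one virus-host pair: difference of k-mer frequencies."""
    return kmer_frequencies(virus_seq, k) - kmer_frequencies(host_seq, k)

# Toy sequences (made up) with k=2 to keep the vector short.
features = pair_features("ACGTACGTAC", "AAAAACCCCC", k=2)
```

A pair with identical composition yields an all-zero vector, which is the intuition behind scoring small differences as likely virus-host pairs.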

The K-means algorithm was then used to separate the virus-host interactions in the VHM dataset into ten clusters based on the feature vectors. Finally, ten-fold cross-validation was conducted as follows: nine clusters of virus-host interactions were used to train the GM, and the resulting GM was then used to predict the virus-host interactions in the remaining cluster. For each virus, scores were assigned to all the prokaryotic host species in the VHM dataset, and the prokaryotic species with the highest score was predicted to be the host of the virus.

The above process was repeated for each cluster. The overall prediction accuracy of the GM was calculated as the proportion of correctly predicted viruses among all viruses in the dataset. The testing strategy described above was also used to determine parameter values for the GM. Two important parameters for the GM were the length of k-mers and the number of components i. The virus host prediction accuracy of the GM increased as the length of k-mers increased from 1 to 5, and then decreased with k-mers of six nucleotides (Additional file 1: Figure S2A).
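The cluster-based cross-validation used for both evaluation and parameter selection can be sketched with scikit-learn as below. This is a simplified illustration on synthetic features: the function name, the random seed, and the data are assumptions, and the step of scoring every candidate host for each held-out virus is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_cross_validate(pair_features, n_clusters=10, n_components=1, seed=0):
    """Yield (held-out cluster, held-out row indices, GM fitted on the other clusters)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(pair_features)
    for held_out in range(n_clusters):
        train = pair_features[labels != held_out]
        test_idx = np.flatnonzero(labels == held_out)
        gm = GaussianMixture(n_components=n_components,
                             random_state=seed).fit(train)
        yield held_out, test_idx, gm

# Synthetic feature vectors standing in for known virus-host pairs.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 16))
folds = list(cluster_cross_validate(X))
```

Splitting by cluster rather than at random keeps similar interactions in the same fold, which makes the held-out pairs less like the training pairs than a random split would. Rerunning the loop with different feature sets or n_components values gives the kind of parameter comparison described above.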

The k-mers with four nucleotides which have a total of kinds of k-mers were selected to balance the model complexity and prediction accuracy since the number of samples used in training the GM is only When selecting the number of components used in the GM, interestingly, we found that the GM with one component outperformed that with multiple components (Additional file 1: Figure S2B). Therefore, the GM with one component and with k-mers of four nucleotides was used in the further analysis.

The optimized GM had a virus host prediction accuracy of 0. The GM was also compared to other common machine-learning algorithms in predicting virus hosts, including the random forest, logistic regression, naive Bayes, decision tree, k-nearest neighbor, and multi-layer perceptron algorithms.

The GM was found to substantially outperform these machine-learning algorithms in the ten-fold cross-validation on the K-means clustering of the VHM dataset (Additional file 1: Figure S3). The predictive accuracy of the GM increased as the taxonomic level rose from genus to phylum Fig. Notably, the GM achieved accuracies of 0. Our GM achieved much higher accuracies than these two methods at all taxonomic levels. For example, the prediction accuracy of the GM at the genus level was 0.

The host prediction accuracy (a) and the recall rate (b) of the GM and their comparison to other computational methods on the test dataset. Viruses and hosts shared between the test dataset and the training dataset (the VHM dataset) may lead to an overestimate of the performance of the GM on the test dataset. Therefore, when predicting a target virus-host interaction in the test dataset, the GM was rebuilt on the training dataset after removing the virus-host interactions that shared the same viral and host genus with the target virus-host interaction.

This resulted in reduced performance of the GM on the test dataset see red bars in Fig. For example, the predictive accuracies of the GM were 0. The similarity between the virus and host genomic sequences often indicates virus-host relations. These two methods were tested on the test dataset and were compared to the GM Fig.

The CRISPR-spacer-based method predicted virus hosts with the highest accuracies at all taxonomic levels, ranging from 0. We further investigated the performance of the alignment-free methods on the test dataset in predicting hosts for viruses whose hosts could not be predicted by the alignment-based methods.

For example, the GM had a prediction accuracy of 0. When 30 predicted hosts were considered, the prediction accuracy at the genus level improved significantly using this consensus approach for GM and VHM, achieving an accuracy of 0.
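A consensus over the top-ranked hosts could be formed in several ways; the text does not spell out the rule, so the sketch below assumes a simple majority vote over the taxa of the n highest-scoring hosts. The function name, the tie-breaking behavior, and the toy data are hypothetical.

```python
from collections import Counter

def consensus_host(scored_hosts, taxon_of, top_n=30):
    """Most frequent taxon among the `top_n` highest-scoring candidate hosts.

    scored_hosts: iterable of (host_id, score) pairs.
    taxon_of: mapping from host_id to its taxon at the level of interest.
    Ties go to the taxon that appears first in the score ranking.
    """
    ranked = sorted(scored_hosts, key=lambda item: item[1], reverse=True)[:top_n]
    votes = Counter(taxon_of[host] for host, _ in ranked)
    return votes.most_common(1)[0][0]

# Made-up scores and genus labels for four candidate hosts.
scored = [("h1", -1.0), ("h2", -2.0), ("h3", -3.0), ("h4", -10.0)]
genus = {"h1": "A", "h2": "B", "h3": "B", "h4": "A"}

top1 = consensus_host(scored, genus, top_n=1)
top3 = consensus_host(scored, genus, top_n=3)
```

In this toy case the single best host belongs to genus A, while two of the top three belong to genus B, so the consensus differs from the top-1 prediction.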

Similar variation patterns were observed at other taxonomic levels for all three alignment-free methods (Additional file 1: Table S1). The host prediction accuracies of the GM were obtained from the ten-fold cross-validation on the K-means clustering of the VHM dataset. Only the prediction accuracies at the genus level are shown in the figure for all methods. The host prediction accuracies at higher taxonomic levels are shown in Additional file 1: Tables S1 and S2.

Host prediction accuracy of GM (solid line) and WIsH (dashed line) at all taxonomic levels based on viral contigs of varying lengths. The host prediction accuracies of the GM were obtained in the ten-fold cross-validation on the K-means clustering of the VHM dataset. We also tested applying a score threshold when making host predictions as Ahlgren et al. Predictions were only made when the score was larger than a given threshold.
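The effect of such a threshold can be sketched as follows. The definition of recall used here, the fraction of viruses for which a prediction is made at all, is an assumption, as are the function name and the toy scores.

```python
import math

def accuracy_and_recall(best_scores, is_correct, threshold):
    """Accuracy among retained predictions and the fraction of viruses retained.

    best_scores: highest host score for each virus.
    is_correct: whether the top prediction for each virus was correct.
    A prediction is made only when the score is larger than `threshold`.
    """
    kept = [ok for score, ok in zip(best_scores, is_correct) if score > threshold]
    recall = len(kept) / len(best_scores)
    accuracy = sum(kept) / len(kept) if kept else math.nan
    return accuracy, recall

# Made-up scores for four viruses; higher-scoring predictions happen to be correct.
scores = [5.0, 3.0, 1.0, -2.0]
correct = [True, True, False, False]

loose = accuracy_and_recall(scores, correct, threshold=0.0)
strict = accuracy_and_recall(scores, correct, threshold=2.0)
```

Raising the threshold discards low-scoring predictions, which trades recall for accuracy whenever the score is informative.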

As is shown in Fig. Further analysis of the prediction accuracies at higher taxonomic levels versus the recall rate found that the GM substantially outperformed VHM and WIsH at both the family and order levels when the recall rate ranged from 1 to 0. Metagenomic assembly often yields partial genomes. So far, WIsH has been reported as the most accurate method for predicting phage hosts based on short contigs [19].

PHP is freely available either as a standalone version [22] or in the form of a web server [31]. While the standalone version of PHP is suitable for host prediction of a large number of viruses, the PHP web server is suitable for host prediction of fewer than viruses. The web server of PHP is intuitive and user-friendly. It takes one or multiple virus genomic sequences as input.

After submission, a waiting page appears and lasts from several minutes to several hours depending on the number and size of the viral genomic sequences. The output shows the name, the score, and the taxa from species to phylum of the predicted host for each given virus. Both the top prediction and the consensus of the top 5 predicted hosts are shown, since considering the consensus of the top 5 predictions improves the performance of the GM. When tested on a server with 40 threads (see Additional file 1: Table S3 for details), the time consumption of both PHP and WIsH was greatly reduced compared to that of the tests on the laptop.

Finally, we tested the ability of PHP to predict virus hosts using pairs of known phage-host interactions which were determined by the single-cell viral tagging method [32]. Requiring a minimal score for making predictions (thresholding) and taking the consensus of the top 30 predictions further improved the host prediction accuracy of PHP.

This work will facilitate the rapid identification of hosts for newly identified prokaryotic viruses in metagenomic studies. Abstract Background: Viruses are ubiquitous biological entities, estimated to be the largest reservoirs of unexplored genetic diversity on Earth. Phages can also affect host diversity, e. Moreover, they mediate gene transfer between prokaryotes, but this remains largely unknown in the environment. Genomics or proteomics are providing us now with powerful tools in phage ecology, but final testing will have to be performed in the environment.

Abstract The finding that total viral abundance is higher than total prokaryotic abundance and that a significant fraction of the prokaryotic community is infected with phages in aquatic systems has stimulated research on the ecology of prokaryotic viruses and their role in ecosystems.
