Phylogenetic Comparative Methods Demonstrate The Non-Directional Evolution Of Genomic Features In The Streptococcus Genus

Streptococcus, a genus of Gram-positive bacteria, encompasses many species with diverse ecological roles, from commensals in the human microbiota to pathogens causing a spectrum of infections. Understanding the biology and evolution of Streptococcus is crucial for understanding its significance in health and disease. In this paper, I study the trait evolution of 4 genomic features in the Streptococcus genus: genome size, genomic GC content, genomic repeat fraction, and number of coding genes. Using phylogenetic generalized least squares, I find a strong positive correlation between genome size and the number of coding genes, while the other features are unrelated.


Introduction
The Streptococcus genus is a prominent and diverse group of bacteria characterized by their spherical or coccus-shaped cells (B. Spellberg, 2015) [1]. These Gram-positive bacteria are widely distributed across various environments, including soil, water, and the human body. One of their distinctive features is their tendency to form chains or pairs during cell division, which can be observed under a microscope (B. Lara et al., 2005) [2]. The genus encompasses various species, each with unique characteristics and adaptations.
Streptococci exhibit significant variability in their ecological roles and interactions (JO Mundt, 1982) [3]. While some species are commensals (Kreth et al., 2017 [4]; Salvadori et al., 2019 [5]), coexisting harmlessly with their host organisms, others are opportunistic pathogens, causing various infections in humans and other animals (Cunningham, 2000) [6]. The adaptability and versatility of Streptococcus make it an intriguing subject of study in microbiology, as researchers explore the factors determining whether these bacteria become beneficial residents or disease-causing agents within a host. Understanding the diverse roles and behaviors of Streptococcus bacteria contributes to our knowledge of microbiology, evolutionary biology, and the complex relationships between microorganisms and their hosts.
An organism's genome provides a detailed record of its evolutionary history and can give insights into its phenotype, metabolome, and proteome (Ellegren, 2014) [7]. Genomic features are studied at two levels: global features, such as genome size, genomic GC content, number of genes, non-coding regions, and 3D genome organization, and local genomic contexts, such as local mechanical and shape properties. I restrict myself to the study of whole-genome features in the Streptococcus genus. To compare these features between closely related species, we need to account for the phylogenetic non-independence of the data points. I shall use trait evolution models and phylogenetically corrected regressions to understand the pattern of genomic features in Streptococci.
Trait evolution models are essential tools in evolutionary biology and provide a framework for understanding how traits change over time in populations and species (Munkemuller et al., 2012) [8]. These models account for the phylogenetic relatedness of species, which may itself produce trait similarity, and can then be used to explore questions about the adaptation and diversification of life forms. Several trait evolution models exist in the literature; popular ones include Brownian motion (Felsenstein, 1985) [9], the OU model (Butler and King, 2005) [10], the EB model (Clavel et al., 2019) [11], the lambda model (Ho et al., 2014) [12], and the white noise model (Cooper et al., 2010) [13]. The Brownian motion model, the most basic model of trait evolution, assumes that trait evolution occurs as a random walk, with trait values changing continuously along the branches of a phylogenetic tree. Under this model, traits evolve without any specific directionality, and the variance of trait change is proportional to the time elapsed. The Ornstein-Uhlenbeck (OU) model, which incorporates stabilizing selection, posits that traits are subject to a restoring force that pulls them toward an optimal value, providing a mechanism for trait convergence. The OU model is especially relevant when examining traits adapted to specific ecological niches. The EB (Early Burst) model proposes that trait evolution is fastest early in a lineage's history, producing a rapid burst of trait diversification, and that the rate then slows down over time. The lambda model assumes that trait evolution follows a Brownian motion process but rescales the covariance between species that is expected from their shared branches by a factor λ. A λ value of 1 recovers Brownian motion, a value of 0 means that trait values are independent of the phylogeny, and intermediate values indicate a weaker phylogenetic signal than expected under Brownian motion. The white noise model, unlike models that assume a specific evolutionary process such as Brownian motion or Ornstein-Uhlenbeck, treats trait values as independent draws from a single distribution, with no correlation between the traits of closely related species.
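The difference between Brownian motion and the OU model described above can be illustrated with a minimal simulation sketch. This is not the analysis used in this study: the function names, the parameter names (sigma2 for the rate, alpha for the strength of the restoring force, theta for the optimum), and all numeric values are assumptions chosen for illustration, and the simulation follows independent lineages rather than a phylogenetic tree.

```python
import random
import statistics


def simulate_trait(n_steps, dt, sigma2, alpha=0.0, theta=0.0, x0=0.0, rng=None):
    """Simulate one lineage's trait value with simple Euler steps.

    alpha = 0 gives Brownian motion; alpha > 0 gives an OU process
    that is pulled toward the optimum theta.
    """
    rng = rng or random.Random()
    x = x0
    for _ in range(n_steps):
        pull = alpha * (theta - x) * dt                # restoring force (zero under BM)
        noise = rng.gauss(0.0, (sigma2 * dt) ** 0.5)   # random-walk increment
        x += pull + noise
    return x


def variance_across_lineages(n_lineages, total_time, sigma2, alpha, dt=0.02, seed=1):
    """Variance of the final trait value over independent lineages."""
    rng = random.Random(seed)
    n_steps = int(round(total_time / dt))
    finals = [
        simulate_trait(n_steps, dt, sigma2, alpha=alpha, rng=rng)
        for _ in range(n_lineages)
    ]
    return statistics.pvariance(finals)


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    print("BM variance:", variance_across_lineages(1000, 4.0, sigma2=1.0, alpha=0.0))
    print("OU variance:", variance_across_lineages(1000, 4.0, sigma2=1.0, alpha=1.0))
```

With these illustrative settings, the Brownian motion variance should be close to sigma2 × time = 4, because variance grows in proportion to elapsed time, while the OU variance should level off near sigma2 / (2 × alpha) = 0.5 because of the restoring force.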
This study investigates the evolution of 4 genomic features in the Streptococcus genus: genome size, genomic GC content, number of coding genes, and genomic repeat fraction. We explore the relationships between them using phylogenetic regression, study their evolution using the trait evolution models described above, and examine which model is most appropriate for each trait.
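To make the idea of a phylogenetic regression concrete, the sketch below fits a straight line by generalized least squares, where the error covariance between species is supplied as a matrix. It is a minimal illustration under stated assumptions, not the implementation used in this study: the helper names solve and pgls are hypothetical, the covariance matrix and data in the usage example are invented, and only the point estimates (no standard errors or P values) are computed.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    a is an n x n matrix and b an n x m matrix, both as lists of rows.
    """
    n = len(a)
    m = len(b[0])
    aug = [list(a[i]) + list(b[i]) for i in range(n)]   # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [[aug[i][n + j] / aug[i][i] for j in range(m)] for i in range(n)]


def pgls(x, y, cov):
    """Fit y = intercept + slope * x with error covariance proportional to cov.

    Returns (intercept, slope) = (X' C^-1 X)^-1 X' C^-1 y.
    """
    n = len(y)
    design = [[1.0, float(x[i])] for i in range(n)]
    # One solve applies C^-1 to both design columns and to y.
    weighted = solve(cov, [design[i] + [float(y[i])] for i in range(n)])
    xtwx = [[sum(design[i][r] * weighted[i][c] for i in range(n)) for c in range(2)]
            for r in range(2)]
    xtwy = [[sum(design[i][r] * weighted[i][2] for i in range(n))] for r in range(2)]
    beta = solve(xtwx, xtwy)
    return beta[0][0], beta[1][0]


if __name__ == "__main__":
    # Invented covariance for 4 species forming two closely related pairs.
    cov = [[3.0, 2.0, 0.0, 0.0],
           [2.0, 3.0, 0.0, 0.0],
           [0.0, 0.0, 3.0, 1.0],
           [0.0, 0.0, 1.0, 3.0]]
    print(pgls([1.0, 2.0, 3.0, 4.0], [3.1, 4.8, 7.2, 9.0], cov))
```

When the covariance matrix is the identity, this reduces to ordinary least squares; the off-diagonal entries are what down-weight the information contributed by closely related species.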

Plotting the phylogenetic tree to visualize the phylogenetic relationships
The genomes of 48 Streptococcus species were downloaded from the NCBI database (Federhen et al., 2012) [14]; their accessions are given in Table 1. The phylogenetic tree for the Streptococcus genus was constructed using the TYGS server (Kolthoff et al., 2019) [15], and the 16S-based tree was used. The phylogenetic tree is shown in Figure 1. The tree was also exported in Newick format for further analysis.
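As a rough sketch of how a Newick tree feeds into comparative analyses, the code below parses a small Newick string and computes the shared branch length between every pair of tips, which is the covariance structure expected under Brownian motion. It also shows the lambda rescaling of the off-diagonal entries. The function names and the three-tip example tree are hypothetical, and the parser assumes a simple Newick string with branch lengths and no whitespace, quoted labels, or comments; it is not a replacement for the R packages used in this study.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, branch_length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def read_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                                  # consume "("
            children.append(read_node())
            while text[pos] == ",":
                pos += 1                              # consume ","
                children.append(read_node())
            pos += 1                                  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = 0.0                                  # missing lengths default to 0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    return read_node()


def bm_covariance(tree):
    """Return (tips, matrix); matrix[i][j] is the branch length shared by
    tips i and j on their paths from the root."""
    tips = []
    shared = {}

    def walk(node, depth):
        name, length, children = node
        depth += length
        if not children:
            tips.append(name)
            shared[(name, name)] = depth              # root-to-tip distance
            return [name]
        groups = [walk(child, depth) for child in children]
        # Tips in different child groups last shared an ancestor at this node.
        for i, first in enumerate(groups):
            for second in groups[i + 1:]:
                for a in first:
                    for b in second:
                        shared[(a, b)] = shared[(b, a)] = depth
        return [tip for group in groups for tip in group]

    walk(tree, 0.0)
    return tips, [[shared[(a, b)] for b in tips] for a in tips]


def pagel_lambda(matrix, lam):
    """Scale off-diagonal covariances by lam; tip variances are unchanged."""
    n = len(matrix)
    return [[matrix[i][j] if i == j else lam * matrix[i][j] for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    tips, cov = bm_covariance(parse_newick("((A:1,B:1):2,C:3);"))  # hypothetical tree
    print(tips, cov)
    print(pagel_lambda(cov, 0.5))
```

Setting lam to 1 leaves the Brownian motion covariance unchanged, and setting it to 0 leaves only the diagonal, which corresponds to species that are statistically independent of one another.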

Computation of genomic features
The genomic features genome size, GC content, and number of coding genes were obtained from NCBI. Simple sequence repeats (SSRs) in each genome were detected using the repeat finder plugin of Geneious Prime 2023. The repeat finder uses a k-mer approach to detect repeats and is database-independent, making it suitable for species for which repeat databases are unavailable (Benson, 1999) [16]. The repeat lengths were summed and divided by the total genome size to obtain the genomic repeat fraction. The genomic features for each species are given in Table 1.
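The repeat fraction calculation can be sketched as follows. The function name and the interval format are hypothetical, and the example coordinates are invented. The sketch also merges overlapping repeat intervals before summing so that no base is counted twice; this is an added assumption, since the text above only states that the repeat lengths were summed.

```python
def repeat_fraction(repeat_intervals, genome_size):
    """Percentage of the genome covered by repeats.

    repeat_intervals: (start, end) pairs, assumed 0-based and end-exclusive.
    Overlapping or touching intervals are merged before their lengths are summed.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(repeat_intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start   # close the previous block
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)          # extend the current block
    if current_end is not None:
        covered += current_end - current_start
    return 100.0 * covered / genome_size


if __name__ == "__main__":
    # Invented repeat coordinates on a 1,000 bp toy genome.
    print(repeat_fraction([(0, 100), (50, 150), (300, 400)], 1000))  # prints 25.0
```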

Phylogenetic modeling of trait evolution
For statistical analysis, we used the programming language R v4.0.1 (R Core Team, 2000) [17]. Pairwise correlations were tested with the Pearson correlation test using the cor.test() function of the stats package. Phylogenetic generalized least squares regression was performed using the ape (Paradis et al., 2019) [18] and caper (Orme et al., 2013) [19] packages in R. Phylogenetic comparative modeling was performed using the geiger package in R: the Brownian motion, OU, EB, lambda, and white noise models were fitted, and the fits were compared using their AIC values.
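The AIC-based comparison of fitted models can be sketched as below, using AIC = 2k - 2 ln L, where k is the number of free parameters and ln L is the maximized log-likelihood. The function names are hypothetical, and the log-likelihoods and parameter counts in the usage example are invented placeholders, not results from this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def rank_models(fits):
    """Rank models by AIC.

    fits maps a model name to (log_likelihood, n_params).
    Returns (name, aic, delta_aic, akaike_weight) tuples, best model first.
    """
    scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
    best = min(scores.values())
    relative = {name: math.exp(-0.5 * (score - best)) for name, score in scores.items()}
    total = sum(relative.values())
    rows = [(name, score, score - best, relative[name] / total)
            for name, score in scores.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    # Invented log-likelihoods and assumed parameter counts, for illustration only.
    example_fits = {
        "BM": (-12.0, 2),
        "OU": (-11.8, 3),
        "EB": (-12.0, 3),
        "lambda": (-11.5, 3),
        "white": (-20.0, 2),
    }
    for name, score, delta, weight in rank_models(example_fits):
        print(f"{name:7s} AIC={score:6.2f} dAIC={delta:5.2f} weight={weight:.3f}")
```

Because each extra parameter adds 2 to the AIC, a more complex model such as OU is preferred over Brownian motion only if it improves the log-likelihood by more than 1 unit per added parameter.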

Results
Table 1 shows the species used in the analyses, their NCBI accession numbers, and their genomic features. The 95% confidence intervals are 1.959-2.083 Mb for genome size, 38.636-39.946% for genomic GC content, 1831-1941 for the number of coding genes, and 16.053-20.933% for genomic repeat fraction. The values of genomic repeat fraction are spread more widely than those of genomic GC content, which has a narrower range.
The Pearson correlations between the genomic features were computed and plotted as a correlation heatmap. We see significant correlations between most pairs of genomic parameters. Genome size is positively correlated with genomic GC content (R = 0.341, P = 0.009, N = 48), genomic repeat fraction (R = 0.318, P = 0.029, N = 48), and number of coding genes (R = 0.865, P = 2.2E-16, N = 48). Genomic GC content is positively correlated with genomic repeat fraction (R = 0.317, P = 0.030, N = 48) and number of coding genes (R = 0.386, P = 0.003, N = 48). Genomic repeat fraction is only marginally correlated with the number of coding genes (R = 0.272, P = 0.065, N = 48).
Some of these correlations can be understood in the context of the existing literature on trends in genomic features. Organismal complexity is known to be correlated with genome size in organisms with smaller genomes (Gregory, 2000). Since the number of protein-coding genes can be considered a proxy for organismal complexity, we find genome size and the number of coding genes to be positively correlated. These trends are, however, not seen in larger multicellular organisms, which tend to have much larger genomes without a similar increase in genomic complexity. Since streptococci are prokaryotes, the positive scaling between genome size and the number of coding genes is the expected behavior.
From our study, we conclude that there is no directional trend in the evolution of the 4 genomic features in streptococci, and that these 4 features, except for genome size and complexity (number of coding genes), are independent of each other.