User:Klioseth/sandbox

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Klioseth at 19:21, 19 April 2022.

SimPlot[1] is a Windows application that lets users produce high-quality sequence similarity plots.

SimPlot++[2] is a freely available reinterpretation of SimPlot. It is an open-source, multi-platform application developed at the Department of Computer Science of the Université du Québec à Montréal. SimPlot++ can be used to produce publication-quality sequence similarity plots using 63 nucleotide and 20 amino acid distance models, to detect intergenic and intragenic recombination events using Phi, χ2, NSS or proportion tests, and to generate and analyze interactive sequence similarity networks. SimPlot++ supports multicore data processing and provides distance calculability diagnostics. It thus improves on the original tools offered by SimPlot (SimPlot, BootScan and FindSites) and provides users with a new similarity network feature[2].

SimPlot++ is available on GitHub as source code for Windows, macOS and Linux, and as an executable for Windows.

General Use

SimPlot++ requires a multiple sequence alignment input file containing either DNA or amino acid sequences. The input data must be in one of the following formats: FASTA, Nexus, PIR, PHYLIP, Stockholm or Clustal[2].

Once loaded into SimPlot++, the input sequences must be manually separated into groups by the user, based on their evolutionary proximity, so that a consensus sequence can be generated for each group. These consensus sequences are used in every analysis available in SimPlot++. At least two groups must be created to have full access to the application features[2].
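The grouping step can be illustrated with a minimal sketch. It assumes a simple majority-rule consensus over the aligned sequences of one group; the source does not state which consensus rule SimPlot++ applies, and the function name `consensus` and the group labels are hypothetical.

```python
from collections import Counter


def consensus(sequences):
    """Return a majority-rule consensus of equal-length aligned sequences.

    Assumption: the most frequent character in each column wins; ties go to
    the character seen first in that column. SimPlot++ may use another rule.
    """
    if not sequences:
        raise ValueError("a group needs at least one sequence")
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("sequences in a group must be aligned to equal length")
    result = []
    for column in zip(*sequences):
        # most_common(1) returns [(character, count)] for the top character
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


# Hypothetical user-defined groups, each reduced to one consensus sequence
groups = {
    "group_1": ["ACGT", "ACGA", "TCGA"],
    "group_2": ["TTGT", "TTGT"],
}
consensus_by_group = {name: consensus(seqs) for name, seqs in groups.items()}
```

Each group is reduced to a single sequence here, which is why at least two groups are needed before any comparison is possible.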

SimPlot analysis

Typical SimPlot output provided by SimPlot++.

A SimPlot analysis slides a window of a specified size over the multiple sequence alignment (MSA), advancing it by a specified step[1]. Every sub-MSA covered by the window is extracted, and a distance matrix is generated from it using the selected distance model. This distance matrix is then used to produce a similarity plot for every consensus sequence against the reference sequence chosen by the user. The variations in similarity between the reference sequence and the consensus sequences can be used, for example, to detect potential recombination events[1].
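The sliding-window procedure described above can be sketched as follows. This is an illustration only: it assumes the simplest possible measure (the fraction of identical sites, ignoring gapped columns) rather than any of the distance models shipped with SimPlot++, and the function names are hypothetical.

```python
def window_similarity(ref, qry):
    """Fraction of identical sites, skipping columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(ref, qry) if a != "-" and b != "-"]
    if not pairs:
        return float("nan")  # nothing comparable in this window
    return sum(a == b for a, b in pairs) / len(pairs)


def similarity_profile(ref, qry, window=200, step=20):
    """Slide a window over two aligned sequences.

    Returns a list of (window midpoint, similarity) pairs, one per window.
    """
    if len(ref) != len(qry):
        raise ValueError("sequences must be aligned to equal length")
    profile = []
    for start in range(0, len(ref) - window + 1, step):
        end = start + window
        midpoint = start + window // 2
        profile.append((midpoint, window_similarity(ref[start:end], qry[start:end])))
    return profile


# The query matches the reference except in the last four columns
profile = similarity_profile("ACGTACGTACGT", "ACGTACGTTTTT", window=4, step=4)
# profile == [(2, 1.0), (6, 1.0), (10, 0.25)]
```

Plotting the second element of each pair against the first gives one curve of a similarity plot; a sharp drop such as the one in the last window is the kind of variation that may point to a recombination event.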

The SimPlot analysis offers many features[2]:

  • 43 DNA and 20 amino acid distance models are available
  • Multiprocessing functionality is available
  • Matplotlib-based plots with a toolbar to easily customize and save the outputs in multiple formats[3]
  • A new quality control window opens an interactive HTML page that gives additional information from the distance calculability diagnostic

BootScan analysis

Typical BootScan output provided by SimPlot++.

Bootscanning[4] is a pipeline consisting of four main steps, all performed within a sliding window (as in the SimPlot analysis).

Steps [4]

  1. The subsequences extracted from the consensus groups are bootstrapped N times.
  2. For each of the N bootstrapped sub-MSAs, a distance matrix is generated.
  3. A phylogenetic tree is inferred for each distance matrix (with either neighbor joining or UPGMA).
  4. The conflicting phylogenetic signals are quantified and expressed as the percentage of trees where each sequence is the nearest neighbor of the reference sequence.
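The steps above can be sketched for a single window as follows. This is a deliberately simplified illustration, not the SimPlot++ implementation: step 3 (tree inference with neighbor joining or UPGMA) is replaced by picking the sequence with the smallest p-distance to the reference in each bootstrap replicate. The function names are hypothetical.

```python
import random


def p_distance(a, b):
    """Proportion of differing sites between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def bootscan_window(ref, others, replicates=100, seed=0):
    """Bootstrap support (in percent) for each sequence being closest to ref.

    Simplification: the closest sequence is the one with the smallest
    p-distance, standing in for the nearest neighbor in an inferred tree.
    Ties go to the sequence listed first in `others`.
    """
    rng = random.Random(seed)
    length = len(ref)
    wins = {name: 0 for name in others}
    for _ in range(replicates):
        # Step 1: resample alignment columns with replacement
        cols = [rng.randrange(length) for _ in range(length)]
        ref_bs = [ref[c] for c in cols]
        # Step 2: distances from the resampled reference to every other sequence
        dists = {
            name: p_distance(ref_bs, [seq[c] for c in cols])
            for name, seq in others.items()
        }
        # Steps 3 and 4 (simplified): record which sequence is closest
        wins[min(dists, key=dists.get)] += 1
    return {name: 100.0 * count / replicates for name, count in wins.items()}


support = bootscan_window("ACGTACGT", {"A": "ACGTACGT", "B": "TTTTTTTT"}, replicates=50, seed=1)
```

Running this for every window position and plotting the percentages against the window midpoint would give a bootscan-style plot; a change in which sequence receives the highest support suggests a breakpoint.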

This analysis offers the following features[2]:

  1. 43 DNA distance models are available for generating the distance matrices
  2. Multiprocessing functionality is available
  3. Matplotlib-based plots with a toolbar to easily customize and save the outputs in multiple formats[3]

FindSites

The FindSites scan is used to locate possible regions of recombination by identifying informative sites[5]. The first step of the analysis is to select a sequence assumed to have originated from a recombination event, two sequences of interest (one from each of the two possible parental evolutionary lines), and a fourth sequence as an outgroup. Informative sites are those positions where two of the four sequences share one nucleotide and the other two sequences share another (different) nucleotide[5].
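The definition of an informative site can be sketched in a few lines. The function name, the argument names and the labels attached to each site are hypothetical, and columns containing a gap are skipped as an assumption.

```python
def find_informative_sites(query, parent_a, parent_b, outgroup):
    """List (position, label) for every informative site in four aligned sequences.

    A site is informative when the four sequences split two against two.
    The label says which sequence the putative recombinant (query) pairs with.
    """
    sites = []
    columns = zip(query, parent_a, parent_b, outgroup)
    for pos, (q, a, b, o) in enumerate(columns):
        if "-" in (q, a, b, o):
            continue  # assumption: gapped columns are not considered
        if q == a and b == o and q != b:
            sites.append((pos, "parent_a"))
        elif q == b and a == o and q != a:
            sites.append((pos, "parent_b"))
        elif q == o and a == b and q != a:
            sites.append((pos, "outgroup"))
    return sites


sites = find_informative_sites(
    query="AAGGT",
    parent_a="AACCT",
    parent_b="CCGGT",
    outgroup="CCCCA",
)
# sites == [(0, "parent_a"), (1, "parent_a"), (2, "parent_b"), (3, "parent_b")]
```

In this toy alignment the query groups with one parent on the left and with the other on the right, the pattern expected around a recombination breakpoint. The last column is not informative because three sequences share the same nucleotide.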

Similarity Network

Typical Similarity Network output provided by SimPlot++. The network nodes represent different sequences, and the edges represent both the global (in black) and local sequence similarity (in red).

The sequence similarity network analysis is an interactive representation of a SimPlot analysis, displayed in a window in which every group (including the reference group) is represented by a network node. Nodes are connected by edges according to their calculated global (over the whole sequence) or local (over sub-sequences of a selected length) similarity[2].

By adjusting the minimum similarity threshold required to show each of the edge types (global and local), it is possible to gain better insight into the relationships between the groups. Furthermore, the network similarity representation can be limited to a specific range of the full MSA (in order to analyze a gene or region of interest)[2].
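The effect of the two thresholds can be sketched as follows. This is one possible reading, not the SimPlot++ implementation: similarity is taken as the fraction of identical sites, and a local edge is drawn when the best window between two groups reaches the local threshold. All names and default values are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical sites between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_edges(groups, global_min=0.8, local_min=0.95, window=4, step=2):
    """Return (node, node, edge type, similarity) tuples that pass the thresholds.

    `groups` maps a group name to its aligned consensus sequence.
    """
    edges = []
    for (name1, seq1), (name2, seq2) in combinations(groups.items(), 2):
        # Global edge: similarity over the whole sequence
        global_sim = identity(seq1, seq2)
        if global_sim >= global_min:
            edges.append((name1, name2, "global", global_sim))
        # Local edge (assumed rule): best similarity over any single window
        best_local = max(
            identity(seq1[i:i + window], seq2[i:i + window])
            for i in range(0, len(seq1) - window + 1, step)
        )
        if best_local >= local_min:
            edges.append((name1, name2, "local", best_local))
    return edges


groups = {"G1": "ACGTACGT", "G2": "ACGTTTTT", "G3": "TTTTTTTT"}
edges = build_edges(groups)
# edges == [("G1", "G2", "local", 1.0), ("G2", "G3", "local", 1.0)]
```

With the default thresholds no pair is similar enough overall to get a global edge, yet two pairs share a perfectly matching window. Lowering `global_min` to 0.6 would add global edges for the same two pairs. To restrict the network to a region of interest, the sequences could be sliced before calling the function.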

The graph data and visualization can be saved in an HTML file. The graph itself can be saved as either a .png or .svg file directly from the toolbox in the HTML file[2].

Recombination analysis

Statistical tests for detecting recombination events from PhiPack[6] have been implemented in SimPlot++.

The Phi[6], Phi-profile[6], Max χ2[7] and NSS[8] tests are available for both the ungrouped (raw sequences) and grouped consensus sequences.

Moreover, a new, simple Proportion test has been designed as a complement to the traditional SimPlot analysis, to quickly identify the most likely mosaic regions (i.e. possible recombination events) in the grouped sequences. This test is based on the proportion of genetic distances extracted from the SimPlot distance matrices. The Proportion score is an indicator of signal strength, but it should not always be interpreted as a recombination signal[2].

References

  1. Lole, Kavita S.; Bollinger, Robert C.; Paranjape, Ramesh S.; Gadkari, Deepak; Kulkarni, Smita S.; Novak, Nicole G.; Ingersoll, Roxann; Sheppard, Haynes W.; Ray, Stuart C. (January 1999). "Full-Length Human Immunodeficiency Virus Type 1 Genomes from Subtype C-Infected Seroconverters in India, with Evidence of Intersubtype Recombination". Journal of Virology. 73 (1): 152–160. doi:10.1128/jvi.73.1.152-160.1999. ISSN 0022-538X.
  2. Samson, Stéphane; Lord, Étienne; Makarenkov, Vladimir (17 December 2021). "SimPlot++: a Python application for representing sequence similarity and detecting recombination". arXiv:2112.09755 [cs, math, q-bio]. doi:10.48550/arxiv.2112.09755.
  3. Hunter, John D. (2007). "Matplotlib: A 2D Graphics Environment". Computing in Science & Engineering. 9 (3): 90–95. doi:10.1109/MCSE.2007.55.
  4. Salminen, Mika O.; Carr, Jean K.; Burke, Donald S.; McCutchan, Francine E. (1 November 1995). "Identification of Breakpoints in Intergenotypic Recombinants of HIV Type 1 by Bootscanning". AIDS Research and Human Retroviruses. 11 (11): 1423–1425. doi:10.1089/aid.1995.11.1423. ISSN 0889-2229.
  5. Robertson, David L.; Hahn, Beatrice H.; Sharp, Paul M. (1 March 1995). "Recombination in AIDS viruses". Journal of Molecular Evolution. 40 (3): 249–259. doi:10.1007/BF00163230. ISSN 1432-1432.
  6. Bruen, Trevor C.; Philippe, Hervé; Bryant, David (1 April 2006). "A Simple and Robust Statistical Test for Detecting the Presence of Recombination". Genetics. 172 (4): 2665–2681. doi:10.1534/genetics.105.048975.
  7. Smith, John Maynard (February 1992). "Analyzing the mosaic structure of genes". Journal of Molecular Evolution. 34 (2). doi:10.1007/BF00182389.
  8. Jakobsen, Ingrid B.; Easteal, Simon (1996). "A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences". Bioinformatics. 12 (4): 291–295. doi:10.1093/bioinformatics/12.4.291.
