﻿--FAQ FOR COMPARATIVE DATABASE--
 
1) ABOUT THE PIPELINE

	-What is Comparative_db?
The Comparative_db is a customizable comparative genomics database which enables the user to compare and analyse genomes. Comparative_db integrates multiple tools for comparative genomic analyses and facilitates the access to comprehensive annotations from multiple databases including UniProt (curated and automated protein annotations), KEGG (annotation of pathways) and COG (orthology). Next to this, phylogenetic trees and phylogenetic relationships of genes, conservation of gene neighbourhood and taxonomic profiles can be visualized using dynamically generated graphs, available for download.

	-How does Comparative_db work?
Comparative_db  consists of two parts:
1. generation of a SQL database through a Nextflow–based pipeline applied on two or more input .gbk files. (see question ‘-Which analysis does the pipeline include?’)
2. launch of an overview and interacting webpage that displays the results retrieved in step 1.

	-What is the input format file?
Standard Genbank file that contains genome sequences and annotations.
Only the extention “.gbk” is used, consequently Genbank files downloaded from NCBI, with the extention “.gbff” must be renamed as “.gbk”.
You can annotate a genome and generate the corresponding “.gbk” file using Prokka (see Prokka documentation – put citation).

	-Which input files do I need to provide? 
    1. STEP 1 – generation of SQL database:
       -Input tabel (.csv/.tsv ):
	it contains an identifier for each input file and the local or absolute path of the gbk files. 
	Download an example of input table here ( link to an example table)
	For a detailed explanation go here (link to the documentation).
	-Config file
	it let you control which analyses among those classified as ‘optional’ in question X (How does Comparative_db work?) will be run.
	Download an example of Config file here (link).
	Rely on the Documentation to set up this file in the proper way. (link Documentation)
    2. STEP 2 – launch of the web app
	settings file (not sure if the user will need to modify it or it will be automatically linked to the newly generated db)

	-Which tools do I need to install to run the pipeline?
You need to install Nextflow (see Nextflow documentation – install – put link) in your environment for running the annotation analysis.
For launching the webapp you need to create an environment based on the chlamdb.yml file provided.
Please see the documentation
---- What about providing a single environment with Nextflow and everything? ----
	
	-How many genomes can I compare?
There is no limits on the number of genomes given as input

	-How long does it take to run the pipeline?
Test with x number of genomes, size of genomes, cpu to run it, etc (to do)

	-Which analyses does the pipeline include?
The pipeline performs the following analyses:
-Orthologous proteins identified with OrthoFinder (optional)
-Detailed annotation of proteins based on 
	KEGG (Kegg orthologs) - (optional)
	COG (clusters of orthologous groups) - (optional)
	SwissProt (manually annotated proteins) (NOT SURE).
-Precomputed homology searches with RefSeq and SwissProt databases for each protein 
-Precomputed phylogenetic reconstructions of orthologous groups
-COG phylogenetic profiles
-Generation of customized databases for blast search (optional)
For a detailed description link (Bastian’s documentation)

	-Can I customize the analyses I want to perform?
The user can select which analyses she/he wants to perform turning on or off some of the parameters of the configuration file. See the list of ‘optional’ parameters in Q-‘Which analyses does the pipeline include?’and refer to documentation (link)

2) HOW TO NAVIGATE INTO THE COMPARATIVE DATABASE 
	-How do I identify which proteins are part of a specific orthogroup?
In the 'analyses menu', in the section called 'Comparison', you can select  'orthologies' and discovering the orthogroups identified in a specific 		genome or those shared by several genomes of interest. Clicking on one of the orthogroup names (example: group_50) it lists the genes that belongs to it, 	the COG and KEG annotations,the organization of Pfam domains and info regarding transmembrane domains.
	-How do I use the search bar? –
The search bar can be used to search any term into the datbase. For example it is possible to access a cog, a kegg or an accession term directly typing it 	   in the searchbar.  
	-How to extract metabolic information?
	--Be sure that this type of analysis was turned on when you run the pipeline --
In the 'analyses menu', in the section called 'Metabolism', you can select 'kegg modules' and investigate a specific KEGG category of interest (example: Energy metabolism, lipid metabolism, etc). Within the category,for each gneome you can visualize the KEGG modules and the ko entries that belong to each module. Besides, you can list the proteins, annotated with ko identifiers further linked to the Ko database, that belong to a selected Kegg module and navigate to the protein locus. This analysis let you navigate from a global kegg characterizations of your genomes to a more detailed description of each kegg-described protein.  
If you select 'kegg maps' you can visualize whch genes of each metabolic map are carried by a selected genome of interest.
-Other analysis ...
A phylogenetic tree annotated with the presence/absence and quantification of the genes which belong to each KO entry is available (for both a single entry with an addiitonal comparison with the associated orthogroup, and all the entries together) (MAPS BOTH FOR CATEOGIRES AND SUBCATEGOIES - I DON'T GET THE DIFFERENCE ABOUT IT - IT DOES NOT SEEM TO DERIVE FROM THE KEGG DATABASE).
In addition, you can compare the strains and have an easy-readable table that reports all the ko modules the number of proteins that belongs to each one. THERE IS ALSO nKO 	nKO+ THAT I DO NOT GET)
Differently, you can use the Kegg Orthologs anlyses in the 'analyses menu', section called 'Comparison', where you can compare two or more genomes and idenfity which KO entries have in common (you can also exclude those which belong to other genome(s)). For each Ko entry you can investigate which pathway and module it is part of, how many proteins are associated with it and of which orthogroup(s) they are part. Discrepencies between orthogroup clustering and ko prediction are higlighted in the phylogentic tree. Similar comparison can be done focusing on COGs and orthologs. 
3) BIOLOGICAL ASPECTS 
	-Are plasmids identified?
Genes belonging to plasmids are reported only if this feature is annotated in the input file, otherwise no distinction between chromosome and plasmids is done. 
	-How have orthogroups been identified?
Orthogroups (or orthologous groups) were identified with OrthoFinder (https://github.com/davidemms/OrthoFinder). This tools identify orthogroups based on BLASTp (parameters: -evalue 0.001) results using the MCL (https://micans.org/mcl/) clustering software.
	
      ﻿--FAQ FOR COMPARATIVE DATABASE--
 
1) ABOUT THE PIPELINE

	-What is Comparative_db?
The Comparative_db is a customizable comparative genomics database which enables the user to compare and analyse genomes. Comparative_db integrates multiple tools for comparative genomic analyses and facilitates the access to comprehensive annotations from multiple databases including UniProt (curated and automated protein annotations), KEGG (annotation of pathways) and COG (orthology). Next to this, phylogenetic trees and phylogenetic relationships of genes, conservation of gene neighbourhood and taxonomic profiles can be visualized using dynamically generated graphs, available for download.

	-How does Comparative_db work?
Comparative_db  consists of two parts:
1. generation of a SQL database through a Nextflow–based pipeline applied on two or more input .gbk files. (see question ‘-Which analysis does the pipeline include?’)
2. launch of an overview and interacting webpage that displays the results retrieved in step 1.

	-What is the input format file?
Standard Genbank file that contains genome sequences and annotations.
Only the extention “.gbk” is used, consequently Genbank files downloaded from NCBI, with the extention “.gbff” must be renamed as “.gbk”.
You can annotate a genome and generate the corresponding “.gbk” file using Prokka (see Prokka documentation – put citation).

	-Which input files do I need to provide? 
    1. STEP 1 – generation of SQL database:
       -Input tabel (.csv/.tsv ):
	it contains an identifier for each input file and the local or absolute path of the gbk files. 
	Download an example of input table here ( link to an example table)
	For a detailed explanation go here (link to the documentation).
	-Config file
	it let you control which analyses among those classified as ‘optional’ in question X (How does Comparative_db work?) will be run.
	Download an example of Config file here (link).
	Rely on the Documentation to set up this file in the proper way. (link Documentation)
    2. STEP 2 – launch of the web app
	settings file (not sure if the user will need to modify it or it will be automatically linked to the newly generated db)

	-Which tools do I need to install to run the pipeline?
You need to install Nextflow (see Nextflow documentation – install – put link) in your environment for running the annotation analysis.
For launching the webapp you need to create an environment based on the chlamdb.yml file provided.
Please see the documentation
---- What about providing a single environment with Nextflow and everything? ----
	
	-How many genomes can I compare?
There is no limits on the number of genomes given as input

	-How long does it take to run the pipeline?
Test with x number of genomes, size of genomes, cpu to run it, etc (to do)

	-Which analyses does the pipeline include?
The pipeline performs the following analyses:
-Orthologous proteins identified with OrthoFinder (optional)
-Detailed annotation of proteins based on 
	KEGG (Kegg orthologs) - (optional)
	COG (clusters of orthologous groups) - (optional)
	SwissProt (manually annotated proteins) (NOT SURE).
-Precomputed homology searches with RefSeq and SwissProt databases for each protein 
-Precomputed phylogenetic reconstructions of orthologous groups
-COG phylogenetic profiles
-Generation of customized databases for blast search (optional)
For a detailed description link (Bastian’s documentation)

	-Can I customize the analyses I want to perform?
The user can select which analyses she/he wants to perform turning on or off some of the parameters of the configuration file. See the list of ‘optional’ parameters in Q-‘Which analyses does the pipeline include?’and refer to documentation (link)

2) HOW TO NAVIGATE INTO THE COMPARATIVE DATABASE 
	-How do I identify which proteins are part of a specific orthogroup?
In the 'analyses menu', in the section called 'Comparison', you can select  'orthologies' and discovering the orthogroups identified in a specific 		genome or those shared by several genomes of interest. Clicking on one of the orthogroup names (example: group_50) it lists the genes that belongs to it, 	the COG and KEG annotations,the organization of Pfam domains and info regarding transmembrane domains.
	-How do I use the search bar? –
The search bar can be used to search any term into the datbase. For example it is possible to access a cog, a kegg or an accession term directly typing it 	   in the searchbar.  
	-How to extract metabolic information?
	--Be sure that this type of analysis was turned on when you run the pipeline --
In the 'analyses menu', in the section called 'Metabolism', you can select 'kegg modules' and investigate a specific KEGG category of interest (example: Energy metabolism, lipid metabolism, etc). Within the category,for each gneome you can visualize the KEGG modules and the ko entries that belong to each module. Besides, you can list the proteins, annotated with ko identifiers further linked to the Ko database, that belong to a selected Kegg module and navigate to the protein locus. This analysis let you navigate from a global kegg characterizations of your genomes to a more detailed description of each kegg-described protein.  
If you select 'kegg maps' you can visualize whch genes of each metabolic map are carried by a selected genome of interest.
-Other analysis ...
A phylogenetic tree annotated with the presence/absence and quantification of the genes which belong to each KO entry is available (for both a single entry with an addiitonal comparison with the associated orthogroup, and all the entries together) (MAPS BOTH FOR CATEOGIRES AND SUBCATEGOIES - I DON'T GET THE DIFFERENCE ABOUT IT - IT DOES NOT SEEM TO DERIVE FROM THE KEGG DATABASE).
In addition, you can compare the strains and have an easy-readable table that reports all the ko modules the number of proteins that belongs to each one. THERE IS ALSO nKO 	nKO+ THAT I DO NOT GET)
Differently, you can use the Kegg Orthologs anlyses in the 'analyses menu', section called 'Comparison', where you can compare two or more genomes and idenfity which KO entries have in common (you can also exclude those which belong to other genome(s)). For each Ko entry you can investigate which pathway and module it is part of, how many proteins are associated with it and of which orthogroup(s) they are part. Discrepencies between orthogroup clustering and ko prediction are higlighted in the phylogentic tree. Similar comparison can be done focusing on COGs and orthologs. 
3) BIOLOGICAL ASPECTS 
	-Are plasmids identified?
Genes belonging to plasmids are reported only if this feature is annotated in the input file, otherwise no distinction between chromosome and plasmids is done. 
	-How have orthogroups been identified?
Orthogroups (or orthologous groups) were identified with OrthoFinder (https://github.com/davidemms/OrthoFinder). This tools identify orthogroups based on BLASTp (parameters: -evalue 0.001) results using the MCL (https://micans.org/mcl/) clustering software.
	
      

