diff --git a/docs.it4i/anselm-cluster-documentation/software/omics-master/overview.md b/docs.it4i/anselm-cluster-documentation/software/omics-master/overview.md index f2ff944602aafc35692c7f0369f38b6841b3b847..904ec0b073ed4dceb6f9f079a532dfebdd0b01ff 100644 --- a/docs.it4i/anselm-cluster-documentation/software/omics-master/overview.md +++ b/docs.it4i/anselm-cluster-documentation/software/omics-master/overview.md @@ -11,7 +11,7 @@ The pipeline inputs the raw data produced by the sequencing machines and undergo  -**Figure 1.** OMICS MASTER solution overview. Data is produced in the external labs and comes to IT4I (represented by the blue dashed line). The data pre-processor converts raw data into a list of variants and annotations for each sequenced patient. These lists files together with primary and secondary (alignment) data files are stored in IT4I sequence DB and uploaded to the discovery (candidate prioritization) or diagnostic component where they can be analyzed directly by the user that produced them, depending of the experimental design carried out. +Figure 1. OMICS MASTER solution overview. Data is produced in the external labs and comes to IT4I (represented by the blue dashed line). The data pre-processor converts raw data into a list of variants and annotations for each sequenced patient. These lists files together with primary and secondary (alignment) data files are stored in IT4I sequence DB and uploaded to the discovery (candidate prioritization) or diagnostic component where they can be analyzed directly by the user that produced them, depending of the experimental design carried out. Typical genomics pipelines are composed by several components that need to be launched manually. The advantage of OMICS MASTER pipeline is that all these components are invoked sequentially in an automated way. @@ -35,24 +35,24 @@ FastQC& FastQC. 
These steps are carried out over the original FASTQ file with optimized scripts and includes the following steps: sequence cleansing, estimation of base quality scores, elimination of duplicates and statistics. -Input: **FASTQ file**. +Input: FASTQ file. -Output: **FASTQ file plus an HTML file containing statistics on the data**. +Output: FASTQ file plus an HTML file containing statistics on the data. FASTQ format It represents the nucleotide sequence and its corresponding quality scores.  -**Figure 2**.FASTQ file. +Figure 2.FASTQ file. #### Mapping -Component: **Hpg-aligner**. +Component: Hpg-aligner. Sequence reads are mapped over the human reference genome. SOLiD reads are not covered by this solution; they should be mapped with specific software (among the few available options, SHRiMP seems to be the best one). For the rest of NGS machine outputs we use HPG Aligner. HPG-Aligner is an innovative solution, based on a combination of mapping with BWT and local alignment with Smith-Waterman (SW), that drastically increases mapping accuracy (97% versus 62-70% by current mappers, in the most common scenarios). This proposal provides a simple and fast solution that maps almost all the reads, even those containing a high number of mismatches or indels. -Input: **FASTQ file**. +Input: FASTQ file. -Output: **Aligned file in BAM format**. +Output: Aligned file in BAM format. #### Sequence Alignment/Map (SAM) @@ -60,10 +60,10 @@ It is a human readable tab-delimited format in which each read and its alignment The SAM format (1) consists of one header section and one alignment section. The lines in the header section start with character ‘@’, and lines in the alignment section do not. All lines are TAB delimited. -In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields are briefly described in Table 1. 
They must be present but their value can be a ‘\*’ or a zero (depending on the field) if the +In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields are briefly described in Table 1. They must be present but their value can be a ‘\’ or a zero (depending on the field) if the corresponding information is unavailable. -| ** No. ** | ** Name ** | ** Description ** | +| No. | Name | Description | | --------- | ---------- | ----------------------------------------------------- | | 1 | QNAME | Query NAME of the read or the read pai | | 2 | FLAG | Bitwise FLAG (pairing,strand,mate strand,etc.) | @@ -77,47 +77,47 @@ corresponding information is unavailable. | 10 | SEQ | <p>Query SEQuence on the same strand as the reference | | 11 | QUAL | <p>Query QUALity (ASCII-33=Phred base quality) | -** Table 1 **. Mandatory fields in the SAM format. + Table 1 . Mandatory fields in the SAM format. The standard CIGAR description of pairwise alignment defines three operations: ‘M’ for match/mismatch, ‘I’ for insertion compared with the reference and ‘D’ for deletion. The extended CIGAR proposed in SAM added four more operations: ‘N’ for skipped bases on the reference, ‘S’ for soft clipping, ‘H’ for hard clipping and ‘P’ for padding. These support splicing, clipping, multi-part and padded alignments. Figure 3 shows examples of CIGAR strings for different types of alignments.  -** Figure 3 **. SAM format file. The ‘@SQ’ line in the header section gives the order of reference sequences. Notably, r001 is the name of a read pair. According to FLAG 163 (=1+2+32+128), the read mapped to position 7 is the second read in the pair (128) and regarded as properly paired (1 + 2); its mate is mapped to 37 on the reverse strand (32). Read r002 has three soft-clipped (unaligned) bases. The coordinate shown in SAM is the position of the first aligned base. 
The CIGAR string for this alignment contains a P (padding) operation which correctly aligns the inserted sequences. Padding operations can be absent when an aligner does not support multiple sequence alignment. The last six bases of read r003 map to position 9, and the first five to position 29 on the reverse strand. The hard clipping operation H indicates that the clipped sequence is not present in the sequence field. The NM tag gives the number of mismatches. Read r004 is aligned across an intron, indicated by the N operation. + Figure 3 . SAM format file. The â€@SQ’ line in the header section gives the order of reference sequences. Notably, r001 is the name of a read pair. According to FLAG 163 (=1+2+32+128), the read mapped to position 7 is the second read in the pair (128) and regarded as properly paired (1 + 2); its mate is mapped to 37 on the reverse strand (32). Read r002 has three soft-clipped (unaligned) bases. The coordinate shown in SAM is the position of the first aligned base. The CIGAR string for this alignment contains a P (padding) operation which correctly aligns the inserted sequences. Padding operations can be absent when an aligner does not support multiple sequence alignment. The last six bases of read r003 map to position 9, and the first five to position 29 on the reverse strand. The hard clipping operation H indicates that the clipped sequence is not present in the sequence field. The NM tag gives the number of mismatches. Read r004 is aligned across an intron, indicated by the N operation. -** Binary Alignment/Map (BAM) ** + Binary Alignment/Map (BAM) BAM is the binary representation of SAM and keeps exactly the same information as SAM. BAM uses lossless compression to reduce the size of the data by about 75% and provides an indexing system that allows reads that overlap a region of the genome to be retrieved and rapidly traversed. #### Quality Control, Preprocessing and Statistics for BAM -** Component **: Hpg-Fastq & FastQC. 
+Component: Hpg-Fastq & FastQC. Some features -* Quality control - * reads with N errors - * reads with multiple mappings - * strand bias - * paired-end insert -* Filtering: by number of errors, number of hits - * Comparator: stats, intersection, ... + Quality control + reads with N errors + reads with multiple mappings + strand bias + paired-end insert + Filtering: by number of errors, number of hits + Comparator: stats, intersection, ... -** Input: ** BAM file. +Input: BAM file. -** Output: ** BAM file plus an HTML file containing statistics. +Output: BAM file plus an HTML file containing statistics. #### Variant Calling -Component: ** GATK **. +Component: GATK. Identification of single nucleotide variants and indels on the alignments is performed using the Genome Analysis Toolkit (GATK). GATK (2) is a software package developed at the Broad Institute to analyze high-throughput sequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. -** Input: ** BAM +Input: BAM -** Output: ** VCF +Output:VCF -** Variant Call Format (VCF) ** +Variant Call Format (VCF) VCF (3) is a standardized format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations. The format was developed with the primary intention to represent human genetic variation, but its use is not restricted >to diploid genomes and can be used in different contexts as well. Its flexibility and user extensibility allows representation of a wide variety of genomic variation with respect to a single reference sequence. @@ -127,42 +127,42 @@ A VCF file consists of a header section and a data section. The header contains this list; the reference haplotype is designated as 0. For multiploid data, the separator indicates whether the data are phased (|) or unphased (/). 
Thus, the two alleles C and G at the positions 2 and 5 in this figure occur on the same chromosome in SAMPLE1. The first data line shows an example of a deletion (present in SAMPLE1) and a replacement of two bases by another base (SAMPLE2); the second line shows a SNP and an insertion; the third a SNP; the fourth a large structural variant described by the annotation in the INFO column, the coordinate is that of the base before the variant. (b–f ) Alignments and VCF representations of different sequence variants: SNP, insertion, deletion, replacement, and a large deletion. The REF columns shows the reference bases replaced by the haplotype in the ALT column. The coordinate refers to the first reference base. (g) Users are advised to use simplest representation possible and lowest coordinate in cases where the position is ambiguous.](../../../img/fig4.png) -** Figure 4 **. (a) Example of valid VCF. The header lines ##fileformat and #CHROM are mandatory, the rest is optional but strongly recommended. Each line of the body describes variants present in the sampled population at one genomic position or region. All alternate alleles are listed in the ALT column and referenced from the genotype fields as 1-based indexes to this list; the reference haplotype is designated as 0. For multiploid data, the separator indicates whether the data are phased (|) or unphased (/). Thus, the two alleles C and G at the positions 2 and 5 in this figure occur on the same chromosome in SAMPLE1. The first data line shows an example of a deletion (present in SAMPLE1) and a replacement of two bases by another base (SAMPLE2); the second line shows a SNP and an insertion; the third a SNP; the fourth a large structural variant described by the annotation in the INFO column, the coordinate is that of the base before the variant. (b–f ) Alignments and VCF representations of different sequence variants: SNP, insertion, deletion, replacement, and a large deletion. 
The REF columns shows the reference bases replaced by the haplotype in the ALT column. The coordinate refers to the first reference base. (g) Users are advised to use simplest representation possible and lowest coordinate in cases where the position is ambiguous. + Figure 4 . (a) Example of valid VCF. The header lines ##fileformat and #CHROM are mandatory, the rest is optional but strongly recommended. Each line of the body describes variants present in the sampled population at one genomic position or region. All alternate alleles are listed in the ALT column and referenced from the genotype fields as 1-based indexes to this list; the reference haplotype is designated as 0. For multiploid data, the separator indicates whether the data are phased (|) or unphased (/). Thus, the two alleles C and G at the positions 2 and 5 in this figure occur on the same chromosome in SAMPLE1. The first data line shows an example of a deletion (present in SAMPLE1) and a replacement of two bases by another base (SAMPLE2); the second line shows a SNP and an insertion; the third a SNP; the fourth a large structural variant described by the annotation in the INFO column, the coordinate is that of the base before the variant. (b–f ) Alignments and VCF representations of different sequence variants: SNP, insertion, deletion, replacement, and a large deletion. The REF columns shows the reference bases replaced by the haplotype in the ALT column. The coordinate refers to the first reference base. (g) Users are advised to use simplest representation possible and lowest coordinate in cases where the position is ambiguous. ### Annotating -** Component: ** HPG-Variant + Component: HPG-Variant The functional consequences of every variant found are then annotated using the HPG-Variant software, which extracts from CellBase, the Knowledge database, all the information relevant on the predicted pathologic effect of the variants. 
VARIANT (VARIant Analysis Tool) (4) reports information on the variants found that include consequence type and annotations taken from different databases and repositories (SNPs and variants from dbSNP and 1000 genomes, and disease-related variants from the Genome-Wide Association Study (GWAS) catalog, Online Mendelian Inheritance in Man (OMIM), Catalog of Somatic Mutations in Cancer (COSMIC) mutations, etc. VARIANT also produces a rich variety of annotations that include information on the regulatory (transcription factor or miRNAbinding sites, etc.) or structural roles, or on the selective pressures on the sites affected by the variation. This information allows extending the conventional reports beyond the coding regions and expands the knowledge on the contribution of non-coding or synonymous variants to the phenotype studied. -** Input: ** VCF + Input: VCF -** Output: ** The output of this step is the Variant Calling Format (VCF) file, which contains changes with respect to the reference genome with the corresponding QC and functional annotations. + Output: The output of this step is the Variant Calling Format (VCF) file, which contains changes with respect to the reference genome with the corresponding QC and functional annotations. #### CellBase CellBase(5) is a relational database integrates biological information from different sources and includes: -** Core features: ** + Core features: We took genome sequences, genes, transcripts, exons, cytobands or cross references (xrefs) identifiers (IDs) from Ensembl (6). Protein information including sequences, xrefs or protein features (natural variants, mutagenesis sites, post-translational modifications, etc.) were imported from UniProt (7). -** Regulatory: ** + Regulatory: CellBase imports miRNA from miRBase (8); curated and non-curated miRNA targets from miRecords (9), miRTarBase (10), TargetScan(11) and microRNA.org (12) and CpG islands and conserved regions from the UCSC database (13). 
-** Functional annotation ** + Functional annotation OBO Foundry (14) develops many biomedical ontologies that are implemented in OBO format. We designed a SQL schema to store these OBO ontologies and 30 ontologies were imported. OBO ontology term annotations were taken from Ensembl (6). InterPro (15) annotations were also imported. -** Variation ** + Variation CellBase includes SNPs from dbSNP (16)^; SNP population frequencies from HapMap (17), 1000 genomes project (18) and Ensembl (6); phenotypically annotated SNPs were imported from NHRI GWAS Catalog (19),HGMD (20), Open Access GWAS Database (21), UniProt (7) and OMIM (22); mutations from COSMIC (23) and structural variations from Ensembl (6). -** Systems biology ** + Systems biology We also import systems biology information like interactome information from IntAct (24). Reactome (25) stores pathway and interaction information in BioPAX (26) format. BioPAX data exchange format enables the integration of diverse pathway resources. We successfully solved the problem of storing data released in BioPAX format into a SQL relational schema, which allowed us importing Reactome in CellBase. @@ -212,26 +212,26 @@ If we launch ngsPipeline with ‘-h’, we will get the usage help: Let us see a brief description of the arguments: ```bash - *-h --help*. Show the help. + -h --help. Show the help. - *-i, --input.* The input data directory. This directory must to have a special structure. We have to create one folder per sample (with the same name). These folders will host the fastq files. These fastq files must have the following pattern “sampleName” + “_” + “1 or 2” + “.fq”. 1 for the first pair (in paired-end sequences), and 2 for the + -i, --input. The input data directory. This directory must to have a special structure. We have to create one folder per sample (with the same name). These folders will host the fastq files. These fastq files must have the following pattern “sampleName” + “_” + “1 or 2” + “.fq”. 
1 for the first pair (in paired-end sequences), and 2 for the second one. - *-o , --output.* The output folder. This folder will contain all the intermediate and final folders. When the pipeline will be executed completely, we could remove the intermediate folders and keep only the final one (with the VCF file containing all the variants) + -o , --output. The output folder. This folder will contain all the intermediate and final folders. When the pipeline will be executed completely, we could remove the intermediate folders and keep only the final one (with the VCF file containing all the variants) - *-p , --ped*. The ped file with the pedigree. This file contains all the sample names. These names must coincide with the names of the input folders. If our input folder contains more samples than the .ped file, the pipeline will use only the samples from the .ped file. + -p , --ped. The ped file with the pedigree. This file contains all the sample names. These names must coincide with the names of the input folders. If our input folder contains more samples than the .ped file, the pipeline will use only the samples from the .ped file. - *--email.* Email for PBS notifications. + --email. Email for PBS notifications. - *--prefix.* Prefix for PBS Job names. + --prefix. Prefix for PBS Job names. - *-s, --start & -e, --end.* Initial and final stage. If we want to launch the pipeline in a specific stage we must use -s. If we want to end the pipeline in a specific stage we must use -e. + -s, --start & -e, --end. Initial and final stage. If we want to launch the pipeline in a specific stage we must use -s. If we want to end the pipeline in a specific stage we must use -e. - *--log*. Using log argument NGSpipeline will prompt all the logs to this file. + --log. Using log argument NGSpipeline will prompt all the logs to this file. - *--project*>. Project ID of your supercomputer allocation. + --project>. Project ID of your supercomputer allocation. - *--queue*. 
[Queue](../../resource-allocation-and-job-execution/introduction.html) to run the jobs in. + --queue. [Queue](../../resource-allocation-and-job-execution/introduction.html) to run the jobs in. ``` Input, output and ped arguments are mandatory. If the output folder does not exist, the pipeline will create it. @@ -290,47 +290,47 @@ If we want to re-launch the pipeline from stage 4 until stage 20 we should use t The pipeline calls the following tools -* [fastqc](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), quality control tool for high throughput sequence data. -* [gatk](https://www.broadinstitute.org/gatk/), The Genome Analysis Toolkit or GATK is a software package developed at + [fastqc](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), quality control tool for high throughput sequence data. + [gatk](https://www.broadinstitute.org/gatk/), The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyze high-throughput sequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size. -* [hpg-aligner](https://github.com/opencb-hpg/hpg-aligner), HPG Aligner has been designed to align short and long reads with high sensitivity, therefore any number of mismatches or indels are allowed. HPG Aligner implements and combines two well known algorithms: _Burrows-Wheeler Transform_ (BWT) to speed-up mapping high-quality reads, and _Smith-Waterman_> (SW) to increase sensitivity when reads cannot be mapped using BWT. -* [hpg-fastq](http://docs.bioinfo.cipf.es/projects/fastqhpc/wiki), a quality control tool for high throughput sequence data. 
-* [hpg-variant](http://docs.bioinfo.cipf.es/projects/hpg-variant/wiki), The HPG Variant suite is an ambitious project aimed to provide a complete suite of tools to work with genomic variation data, from VCF tools to variant profiling or genomic statistics. It is being implemented using High Performance Computing technologies to provide the best performance possible. -* [picard](http://picard.sourceforge.net/), Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (HTSJDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported. -* [samtools](http://samtools.sourceforge.net/samtools-c.shtml), SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. -* [snpEff](http://snpeff.sourceforge.net/), Genetic variant annotation and effect prediction toolbox. + [hpg-aligner](https://github.com/opencb-hpg/hpg-aligner), HPG Aligner has been designed to align short and long reads with high sensitivity, therefore any number of mismatches or indels are allowed. HPG Aligner implements and combines two well known algorithms: _Burrows-Wheeler Transform_ (BWT) to speed-up mapping high-quality reads, and _Smith-Waterman_> (SW) to increase sensitivity when reads cannot be mapped using BWT. + [hpg-fastq](http://docs.bioinfo.cipf.es/projects/fastqhpc/wiki), a quality control tool for high throughput sequence data. + [hpg-variant](http://docs.bioinfo.cipf.es/projects/hpg-variant/wiki), The HPG Variant suite is an ambitious project aimed to provide a complete suite of tools to work with genomic variation data, from VCF tools to variant profiling or genomic statistics. It is being implemented using High Performance Computing technologies to provide the best performance possible. 
+ [picard](http://picard.sourceforge.net/), Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (HTSJDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported. + [samtools](http://samtools.sourceforge.net/samtools-c.shtml), SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. + [snpEff](http://snpeff.sourceforge.net/), Genetic variant annotation and effect prediction toolbox. This listing show which tools are used in each step of the pipeline -* stage-00: fastqc -* stage-01: hpg_fastq -* stage-02: fastqc -* stage-03: hpg_aligner and samtools -* stage-04: samtools -* stage-05: samtools -* stage-06: fastqc -* stage-07: picard -* stage-08: fastqc -* stage-09: picard -* stage-10: gatk -* stage-11: gatk -* stage-12: gatk -* stage-13: gatk -* stage-14: gatk -* stage-15: gatk -* stage-16: samtools -* stage-17: samtools -* stage-18: fastqc -* stage-19: gatk -* stage-20: gatk -* stage-21: gatk -* stage-22: gatk -* stage-23: gatk -* stage-24: hpg-variant -* stage-25: hpg-variant -* stage-26: snpEff -* stage-27: snpEff -* stage-28: hpg-variant + stage-00: fastqc + stage-01: hpg_fastq + stage-02: fastqc + stage-03: hpg_aligner and samtools + stage-04: samtools + stage-05: samtools + stage-06: fastqc + stage-07: picard + stage-08: fastqc + stage-09: picard + stage-10: gatk + stage-11: gatk + stage-12: gatk + stage-13: gatk + stage-14: gatk + stage-15: gatk + stage-16: samtools + stage-17: samtools + stage-18: fastqc + stage-19: gatk + stage-20: gatk + stage-21: gatk + stage-22: gatk + stage-23: gatk + stage-24: hpg-variant + stage-25: hpg-variant + stage-26: snpEff + stage-27: snpEff + stage-28: hpg-variant ## Interpretation @@ -338,25 +338,25 @@ The output folder contains all the subfolders with the intermediate data. This f ![TEAM upload panel. 
Once the file has been uploaded, a panel must be chosen from the Panel list. Then, pressing the Run button the diagnostic process starts.]\((../../../img/fig7.png) -** Figure 7. ** _TEAM upload panel._ _Once the file has been uploaded, a panel must be chosen from the Panel_ list. Then, pressing the Run button the diagnostic process starts. + Figure 7. _TEAM upload panel._ _Once the file has been uploaded, a panel must be chosen from the Panel_ list. Then, pressing the Run button the diagnostic process starts. Once the file has been uploaded, a panel must be chosen from the Panel list. Then, pressing the Run button the diagnostic process starts. TEAM searches first for known diagnostic mutation(s) taken from four databases: HGMD-public (20), [HUMSAVAR](http://www.uniprot.org/docs/humsavar), ClinVar (29) and COSMIC (23).  -** Figure 7. ** The panel manager. The elements used to define a panel are (** A **) disease terms, (** B **) diagnostic mutations and (** C **) genes. Arrows represent actions that can be taken in the panel manager. Panels can be defined by using the known mutations and genes of a particular disease. This can be done by dragging them to the ** Primary Diagnostic ** box (action ** D **). This action, in addition to defining the diseases in the ** Primary Diagnostic ** box, automatically adds the corresponding genes to the ** Genes ** box. The panels can be customized by adding new genes (action ** F **) or removing undesired genes (action **G**). New disease mutations can be added independently or associated to an already existing disease term (action ** E **). Disease terms can be removed by simply dragging them back (action ** H **). + Figure 7. The panel manager. The elements used to define a panel are ( A ) disease terms, ( B ) diagnostic mutations and ( C ) genes. Arrows represent actions that can be taken in the panel manager. Panels can be defined by using the known mutations and genes of a particular disease. 
This can be done by dragging them to the Primary Diagnostic box (action D ). This action, in addition to defining the diseases in the Primary Diagnostic box, automatically adds the corresponding genes to the Genes box. The panels can be customized by adding new genes (action F ) or removing undesired genes (action G). New disease mutations can be added independently or associated to an already existing disease term (action E ). Disease terms can be removed by simply dragging them back (action H ). For variant discovering/filtering we should upload the VCF file into BierApp by using the following form: -\*\* +\\ -** Figure 8 **. \*BierApp VCF upload panel. It is recommended to choose a name for the job as well as a description \*\*. + Figure 8 . \BierApp VCF upload panel. It is recommended to choose a name for the job as well as a description \\. Each prioritization (‘job’) has three associated screens that facilitate the filtering steps. The first one, the ‘Summary’ tab, displays a statistic of the data set analyzed, containing the samples analyzed, the number and types of variants found and its distribution according to consequence types. The second screen, in the ‘Variants and effect’ tab, is the actual filtering tool, and the third one, the ‘Genome view’ tab, offers a representation of the selected variants within the genomic context provided by an embedded version of the Genome Maps Tool (30).  -** Figure 9 **. This picture shows all the information associated to the variants. If a variant has an associated phenotype we could see it in the last column. In this case, the variant 7:132481242 CT is associated to the phenotype: large intestine tumor. + Figure 9 . This picture shows all the information associated to the variants. If a variant has an associated phenotype we could see it in the last column. In this case, the variant 7:132481242 CT is associated to the phenotype: large intestine tumor. ## References