From 09133402d12907c78229cf95293470a6cbf7b2aa Mon Sep 17 00:00:00 2001 From: Jan Siwiec <jan.siwiec@vsb.cz> Date: Thu, 12 Mar 2020 13:24:39 +0100 Subject: [PATCH] Update overview.md --- .../software/bio/omics-master/overview.md | 40 +++++++++---------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/docs.it4i/software/bio/omics-master/overview.md b/docs.it4i/software/bio/omics-master/overview.md index 4a3ef9901..f673c899b 100644 --- a/docs.it4i/software/bio/omics-master/overview.md +++ b/docs.it4i/software/bio/omics-master/overview.md @@ -1,12 +1,12 @@ # Overview -The human NGS data processing solution +A human NGS data processing solution. ## Introduction -The scope of this OMICS MASTER solution is restricted to human genomics research (disease causing gene discovery in whole human genome or exome) or diagnosis (panel sequencing), although it could be extended in the future to other usages. +The scope of this OMICS MASTER solution is restricted to human genomics research (disease causing gene discovery in the whole human genome or exome) or diagnosis (panel sequencing), although it could be extended to other usages in the future. -The pipeline inputs the raw data produced by the sequencing machines and undergoes a processing procedure that consists on a quality control, the mapping and variant calling steps that result in a file containing the set of variants in the sample. From this point, the prioritization component or the diagnostic component can be launched. +The pipeline inputs the raw data produced by the sequencing machines and undergoes a processing procedure that consists of a quality control, the mapping and variant calling steps that result in a file containing the set of variants in the sample. From this point, the prioritization component or the diagnostic component can be launched.  @@ -17,7 +17,7 @@ Typical genomics pipelines are composed by several components that need to be la OMICS MASTER pipeline inputs a FASTQ file and outputs an enriched VCF file. This pipeline is able to queue all the jobs to PBS by only launching a process taking all the necessary input files and creates the intermediate and final folders -Let’s see each of the OMICS MASTER solution components: +Let us see each of the OMICS MASTER solution components: ## Components @@ -25,21 +25,21 @@ Let’s see each of the OMICS MASTER solution components: This component is composed by a set of programs that carry out quality controls, alignment, realignment, variant calling and variant annotation. It turns raw data from the sequencing machine into files containing lists of variants (VCF) that once annotated, can be used by the following components (discovery and diagnosis). -We distinguish three types of sequencing instruments: bench sequencers (MySeq, IonTorrent, and Roche Junior, although this last one is about being discontinued), which produce relatively Genomes in the clinic +We distinguish three types of sequencing instruments: bench sequencers (MySeq, IonTorrent, and Roche Junior, although the last one is being discontinued), which produce relatively Genomes in the clinic. -low throughput (tens of million reads), and high end sequencers, which produce high throughput (hundreds of million reads) among which we have Illumina HiSeq 2000 (and new models) and SOLiD. All of them but SOLiD produce data in sequence format. SOLiD produces data in a special format called colour space that require of specific software for the mapping process. Once the mapping has been done, the rest of the pipeline is identical. Anyway, SOLiD is a technology which is also about being discontinued by the manufacturer so, this type of data will be scarce in the future. +Low throughput (tens of million reads), and high-end sequencers, which produce high throughput (hundreds of million reads) among which we have Illumina HiSeq 2000 (and new models) and SOLiD. All of them but SOLiD produce data in sequence format. SOLiD produces data in a special format called color space that require specific software for the mapping process. Once the mapping has been done, the rest of the pipeline is identical. SOLiD is a technology which is also being discontinued by the manufacturer, so this type of data will be scarce in the future. #### Quality Control, Preprocessing and Statistics for FASTQ FastQC& FastQC. -These steps are carried out over the original FASTQ file with optimized scripts and includes the following steps: sequence cleansing, estimation of base quality scores, elimination of duplicates and statistics. +These steps are carried out over the original FASTQ file with optimized scripts and includes the following steps: sequence cleansing, estimation of base quality scores, elimination of duplicates, and statistics. Input: FASTQ file. Output: FASTQ file plus an HTML file containing statistics on the data. -FASTQ format It represents the nucleotide sequence and its corresponding quality scores. +FASTQ format represents the nucleotide sequence and its corresponding quality scores.  Figure 2.FASTQ file. @@ -58,7 +58,7 @@ Output: Aligned file in BAM format. It is a human readable tab-delimited format in which each read and its alignment is represented on a single line. The format can represent unmapped reads, reads that are mapped to unique locations, and reads that are mapped to multiple locations. -The SAM format (1) consists of one header section and one alignment section. The lines in the header section start with character ‘@’, and lines in the alignment section do not. All lines are TAB delimited. +The SAM format (1) consists of one header section and one alignment section. The lines in the header section start with the ‘@’ character, and lines in the alignment section do not. All lines are TAB delimited. In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields are briefly described in Table 1. They must be present but their value can be a ‘\’ or a zero (depending on the field) if the corresponding information is unavailable. @@ -77,13 +77,13 @@ corresponding information is unavailable. | 10 | SEQ | Query SEQuence on the same strand as the reference | | 11 | QUAL | Query QUALity (ASCII-33=Phred base quality) | - Table 1 . Mandatory fields in the SAM format. + Table 1. Mandatory fields in the SAM format. The standard CIGAR description of pairwise alignment defines three operations: ‘M’ for match/mismatch, ‘I’ for insertion compared with the reference and ‘D’ for deletion. The extended CIGAR proposed in SAM added four more operations: ‘N’ for skipped bases on the reference, ‘S’ for soft clipping, ‘H’ for hard clipping and ‘P’ for padding. These support splicing, clipping, multi-part and padded alignments. Figure 3 shows examples of CIGAR strings for different types of alignments.  - Figure 3 . SAM format file. The ‘@SQ’ line in the header section gives the order of reference sequences. Notably, r001 is the name of a read pair. According to FLAG 163 (=1+2+32+128), the read mapped to position 7 is the second read in the pair (128) and regarded as properly paired (1 + 2); its mate is mapped to 37 on the reverse strand (32). Read r002 has three soft-clipped (unaligned) bases. The coordinate shown in SAM is the position of the first aligned base. The CIGAR string for this alignment contains a P (padding) operation which correctly aligns the inserted sequences. Padding operations can be absent when an aligner does not support multiple sequence alignment. The last six bases of read r003 map to position 9, and the first five to position 29 on the reverse strand. The hard clipping operation H indicates that the clipped sequence is not present in the sequence field. The NM tag gives the number of mismatches. Read r004 is aligned across an intron, indicated by the N operation. + Figure 3. SAM format file. The ‘@SQ’ line in the header section gives the order of reference sequences. Notably, r001 is the name of a read pair. According to FLAG 163 (=1+2+32+128), the read mapped to position 7 is the second read in the pair (128) and regarded as properly paired (1 + 2); its mate is mapped to 37 on the reverse strand (32). Read r002 has three soft-clipped (unaligned) bases. The coordinate shown in SAM is the position of the first aligned base. The CIGAR string for this alignment contains a P (padding) operation, which correctly aligns the inserted sequences. Padding operations can be absent when an aligner does not support multiple sequence alignment. The last six bases of read r003 map to position 9, and the first five to position 29 on the reverse strand. The hard clipping operation H indicates that the clipped sequence is not present in the sequence field. The NM tag gives the number of mismatches. Read r004 is aligned across an intron, indicated by the N operation. ##### Binary Alignment/Map (BAM) @@ -101,7 +101,7 @@ Some features strand bias paired-end insert Filtering: by number of errors, number of hits - Comparator: stats, intersection, ... + Comparator: stats, intersection, etc. Input: BAM file. @@ -175,7 +175,7 @@ More detail [here][2]. ## Usage -First of all, we should load ngsPipeline module: +First, we should load ngsPipeline module: ```console $ ml ngsPipeline @@ -258,7 +258,7 @@ We have a folder with the following structure in └── sample2_2.fq ``` -The ped file ( file.ped) contains the following info: +The ped file (file.ped) contains the following info: ```console #family_ID sample_ID parental_ID maternal_ID sex phenotype @@ -266,7 +266,7 @@ The ped file ( file.ped) contains the following info: FAM sample_B 0 0 2 2 ``` -Now, lets load the NGSPipeline module and copy the sample data to a [scratch directory][4]: +Now, let us load the NGSPipeline module and copy the sample data to a [scratch directory][4]: ```console $ ml ngsPipeline @@ -282,7 +282,7 @@ $ ngsPipeline -i /scratch/$USER/omics/sample_data/data -o /scratch/$USER/omics/r This command submits the processing [jobs to the queue][5]. -If we want to re-launch the pipeline from stage 4 until stage 20 we should use the next command: +If we want to re-launch the pipeline from stage 4 until stage 20, we should use the next command: ```console $ ngsPipeline -i /scratch/$USER/omics/sample_data/data -o /scratch/$USER/omics/results -p /scratch/$USER/omics/sample_data/data/file.ped -s 4 -e 20 --project OPEN-0-0 --queue qprod @@ -343,21 +343,21 @@ The output folder contains all the subfolders with the intermediate data. This f Once the file has been uploaded, a panel must be chosen from the Panel list. Then, pressing the Run button the diagnostic process starts. TEAM searches first for known diagnostic mutation(s) taken from four databases: HGMD-public (20), [HUMSAVAR][i], ClinVar (29) and COSMIC (23). - + - Figure 7. The panel manager. The elements used to define a panel are ( A ) disease terms, ( B ) diagnostic mutations and ( C ) genes. Arrows represent actions that can be taken in the panel manager. Panels can be defined by using the known mutations and genes of a particular disease. This can be done by dragging them to the Primary Diagnostic box (action D ). This action, in addition to defining the diseases in the Primary Diagnostic box, automatically adds the corresponding genes to the Genes box. The panels can be customized by adding new genes (action F ) or removing undesired genes (action G). New disease mutations can be added independently or associated to an already existing disease term (action E ). Disease terms can be removed by simply dragging them back (action H ). + Figure 7. The panel manager. The elements used to define a panel are ( A ) disease terms, ( B ) diagnostic mutations and ( C ) genes. Arrows represent actions that can be taken in the panel manager. Panels can be defined by using the known mutations and genes of a particular disease. This can be done by dragging them to the Primary Diagnostic box (action D). This action, in addition to defining the diseases in the Primary Diagnostic box, automatically adds the corresponding genes to the Genes box. The panels can be customized by adding new genes (action F) or removing undesired genes (action G). New disease mutations can be added independently or associated to an already existing disease term (action E). Disease terms can be removed by simply dragging them back (action H). For variant discovering/filtering we should upload the VCF file into BierApp by using the following form: \ - Figure 8 . \BierApp VCF upload panel. It is recommended to choose a name for the job as well as a description \\. + Figure 8. \BierApp VCF upload panel. It is recommended to choose a name for the job as well as a description \\. Each prioritization (‘job’) has three associated screens that facilitate the filtering steps. The first one, the ‘Summary’ tab, displays a statistic of the data set analyzed, containing the samples analyzed, the number and types of variants found and its distribution according to consequence types. The second screen, in the ‘Variants and effect’ tab, is the actual filtering tool, and the third one, the ‘Genome view’ tab, offers a representation of the selected variants within the genomic context provided by an embedded version of the Genome Maps Tool (30).  - Figure 9 . This picture shows all the information associated to the variants. If a variant has an associated phenotype we could see it in the last column. In this case, the variant 7:132481242 CT is associated to the phenotype: large intestine tumor. + Figure 9. This picture shows all the information associated to the variants. If a variant has an associated phenotype we could see it in the last column. In this case, the variant 7:132481242 CT is associated to the phenotype: large intestine tumor. ## References -- GitLab