diff --git a/docs.it4i/anselm-cluster-documentation/software/omics-master/overview.md b/docs.it4i/anselm-cluster-documentation/software/omics-master/overview.md index ac67ad86bdfb87050c2116dbacbf1f0c51a79386..bc366f79e2aba3f752befae5627815a2d15325a6 100644 --- a/docs.it4i/anselm-cluster-documentation/software/omics-master/overview.md +++ b/docs.it4i/anselm-cluster-documentation/software/omics-master/overview.md @@ -85,7 +85,7 @@ The standard CIGAR description of pairwise alignment defines three operations: Figure 3 . SAM format file. The â€@SQ’ line in the header section gives the order of reference sequences. Notably, r001 is the name of a read pair. According to FLAG 163 (=1+2+32+128), the read mapped to position 7 is the second read in the pair (128) and regarded as properly paired (1 + 2); its mate is mapped to 37 on the reverse strand (32). Read r002 has three soft-clipped (unaligned) bases. The coordinate shown in SAM is the position of the first aligned base. The CIGAR string for this alignment contains a P (padding) operation which correctly aligns the inserted sequences. Padding operations can be absent when an aligner does not support multiple sequence alignment. The last six bases of read r003 map to position 9, and the first five to position 29 on the reverse strand. The hard clipping operation H indicates that the clipped sequence is not present in the sequence field. The NM tag gives the number of mismatches. Read r004 is aligned across an intron, indicated by the N operation. -##### Binary Alignment/Map (BAM) +##### Binary Alignment/Map (BAM) BAM is the binary representation of SAM and keeps exactly the same information as SAM. BAM uses lossless compression to reduce the size of the data by about 75% and provides an indexing system that allows reads that overlap a region of the genome to be retrieved and rapidly traversed. @@ -131,7 +131,7 @@ two bases by another base (SAMPLE2); the second line shows a SNP and an insertio ### Annotating - Component: HPG-Variant +Component: HPG-Variant The functional consequences of every variant found are then annotated using the HPG-Variant software, which extracts from CellBase, the Knowledge database, all the information relevant on the predicted pathologic effect of the variants. @@ -145,24 +145,24 @@ VARIANT (VARIant Analysis Tool) (4) reports information on the variants found th CellBase(5) is a relational database integrates biological information from different sources and includes: -Core features: +Core features We took genome sequences, genes, transcripts, exons, cytobands or cross references (xrefs) identifiers (IDs) from Ensembl (6). Protein information including sequences, xrefs or protein features (natural variants, mutagenesis sites, post-translational modifications, etc.) were imported from UniProt (7). -Regulatory: +Regulatory CellBase imports miRNA from miRBase (8); curated and non-curated miRNA targets from miRecords (9), miRTarBase (10), TargetScan(11) and microRNA.org (12) and CpG islands and conserved regions from the UCSC database (13). -Functional annotation +Functional annotation OBO Foundry (14) develops many biomedical ontologies that are implemented in OBO format. We designed a SQL schema to store these OBO ontologies and 30 ontologies were imported. OBO ontology term annotations were taken from Ensembl (6). InterPro (15) annotations were also imported. -Variation +Variation CellBase includes SNPs from dbSNP (16)^; SNP population frequencies from HapMap (17), 1000 genomes project (18) and Ensembl (6); phenotypically annotated SNPs were imported from NHRI GWAS Catalog (19),HGMD (20), Open Access GWAS Database (21), UniProt (7) and OMIM (22); mutations from COSMIC (23) and structural variations from Ensembl (6). -Systems biology +Systems biology We also import systems biology information like interactome information from IntAct (24). Reactome (25) stores pathway and interaction information in BioPAX (26) format. BioPAX data exchange format enables the integration of diverse pathway resources. We successfully solved the problem of storing data released in BioPAX format into a SQL relational schema, which allowed us importing Reactome in CellBase.