Get Started

Introduction

To evaluate the functional potential of sequence variants in candidate regions, it is important to determine whether they lie in coding or non-coding transcripts, or in regulatory regions such as conserved non-coding elements. This tool is meant to assist you in evaluating the overlap between conserved elements and the vast amount of data generated in large re-sequencing projects. We have put emphasis on locating Single Nucleotide Polymorphisms (SNPs) and small indels in regions conserved across species, since such variants are more likely to have a function and to play a role in phenotypic variation. Be aware, however, that variants at positions that are not conserved may still have a unique function in the species of interest. It is also important to consider which species are represented in the alignment, depending on the trait under investigation. You can choose among several species alignments and scoring methods. Larger insertions, copy number variation and deletions can also be of great importance, and our approach to locating them is described below.

Scoring

The purpose of this module is to score the identified variants by conservation. We recommend that you always start with the scoring module. The scored files it generates are then used by the other modules, for example to display the results in the UCSC browser or to run a case-control analysis. This tool can read SNP and indel (insertion/deletion) files generated by variant calling tools, e.g. MAQ, GATK and SAMtools. When you submit your file, every variant is scored by conservation according to the alignment chosen. If a variant is not located within a conserved element, you will get the distance in base pairs to the closest conserved element.
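The distance reported for variants outside a conserved element can be illustrated with a minimal Python sketch. It is not the tool's own code: the function name, the (start, end) tuple layout and the 1-based inclusive coordinates are assumptions made for the example.

```python
from bisect import bisect_left


def distance_to_nearest_element(pos, elements):
    """Return 0 if pos lies inside a conserved element, otherwise the
    distance in bp to the closest one.

    `elements` is a list of (start, end) tuples, sorted by start,
    non-overlapping, 1-based inclusive (assumptions of this sketch).
    """
    if not elements:
        raise ValueError("no conserved elements supplied")
    starts = [start for start, _ in elements]
    i = bisect_left(starts, pos)
    candidates = []
    if i < len(elements):
        candidates.append(elements[i])      # first element starting at or after pos
    if i > 0:
        candidates.append(elements[i - 1])  # last element starting before pos
    best = None
    for start, end in candidates:
        if start <= pos <= end:
            return 0
        distance = start - pos if pos < start else pos - end
        best = distance if best is None else min(best, distance)
    return best


# Hypothetical conserved elements on one chromosome.
elements = [(100, 200), (500, 600)]
print(distance_to_nearest_element(150, elements))  # inside the first element
print(distance_to_nearest_element(250, elements))  # 50 bp past its end
```

Because the elements are assumed sorted and non-overlapping, only the two neighbours of the variant position need to be checked.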

If you have multiple files to score, you can upload all of them compressed in a zip-archive. All files must be placed in the archive root. The results will be returned as a zip-archive containing your files.
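If you build the archive with a script, the requirement that all files sit in the archive root can be met as in the sketch below. It uses Python's standard zipfile module; the function name and the file names in the usage comment are hypothetical.

```python
import os
import zipfile


def zip_at_root(paths, archive_path):
    """Write the given files into a zip archive without any directory
    structure, so that every file ends up in the archive root."""
    names = [os.path.basename(path) for path in paths]
    if len(set(names)) != len(names):
        raise ValueError("file names must be unique once directories are dropped")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, name in zip(paths, names):
            # arcname strips the directory part of the path.
            archive.write(path, arcname=name)
    return names


# Hypothetical usage:
# zip_at_root(["calls/dog1.snp", "calls/dog2.snp"], "snps.zip")
```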

To ensure that no one else can access your data, all submitted data will be hash coded and then automatically erased from our database within one hour. Make sure to save all results to your own computer.

Please check that you mapped your sequence data to a reference that corresponds to one of the assemblies shown under the heading Resources, or the scoring will give you faulty results.

Merge & Show

The purpose of this module is to merge all variations into one file and display the results in the UCSC browser, for easier comparison and interpretation. Scored files should be used as input data for this module. When using scored SNP files you will get two output files: one with all the files merged, which is easy to use for further analyses (for instance in Excel), and another for display in the UCSC genome browser. If a SNP is detected in any of the individuals, individuals that do not have a SNP at that position are assumed to have the same genotype as the reference at that position. On the results page you will also find a link that takes you directly to the genome browser and displays your data. This option is available and can be useful even if you have just one sequenced individual that you would like to compare to the reference.

The SNPs will be color coded in the browser. Homozygous SNPs in conserved elements ±5bp will be colored red. Heterozygous SNPs in conserved elements ±5bp will be colored pink. Homozygous SNPs deviating from the reference will be colored blue and heterozygous SNPs will be green. Homozygous SNPs equal to the reference will be colored yellow if another individual shows a different genotype at that position. Indels overlapping with a conserved element ±5bp will be colored red, insertions elsewhere grey and deletions elsewhere black.

You can upload several files compressed in a zip-archive. All files must be placed in the archive root. In that case samples are automatically named after their filenames; otherwise you need to give every sample a unique name.
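The rule that individuals without a SNP call at a position are given the reference genotype can be sketched as follows. This is only an illustration of the merging idea, not the module's implementation; the dictionary layouts, two-letter genotype strings and sample names are assumptions.

```python
def merge_snps(samples, reference):
    """Merge per-sample SNP calls into one table.

    samples:   {sample_name: {position: genotype}}, genotype as a
               two-letter string such as "AG" (assumed layout).
    reference: {position: reference_base}.
    Samples without a call at a position are filled in with the
    homozygous reference genotype.
    """
    positions = sorted(set().union(*(calls.keys() for calls in samples.values())))
    merged = {}
    for pos in positions:
        reference_genotype = reference[pos] * 2
        merged[pos] = {
            name: calls.get(pos, reference_genotype)
            for name, calls in samples.items()
        }
    return merged


# Hypothetical input: two individuals with SNP calls at different positions.
samples = {"dog1": {1200: "AG"}, "dog2": {1200: "GG", 3400: "CT"}}
reference = {1200: "A", 3400: "C"}
for pos, genotypes in merge_snps(samples, reference).items():
    print(pos, genotypes)
```

In this example dog1 has no call at position 3400, so it is reported there as "CC", the homozygous reference genotype.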

Case & control

The purpose of this module is to identify candidate mutations that segregate or are associated with the phenotype, which is done by comparing cases and controls. The sample status can be assigned based on phenotype or risk genotype (typically an associated haplotype identified in a previous association study). Scored SNP files should be uploaded. The individuals must be given unique names for all options to work correctly. For each sample, select the case or control button. It is also possible to upload several files compressed in a zip-archive. All files must be placed in the archive root. In that case samples are automatically named after their filenames. Cases and controls must be in separate zip-archives.

You can choose to investigate the pattern for conserved SNPs only, by clicking the compare conserved SNPs button. The pattern will be scored according to your choice of most likely inheritance: recessive or dominant/complex. It is recommended that you save the output to your computer for later analyses. The SNPs with the highest scores are the ones that show the biggest difference between cases and controls and are among the most likely to be the causative mutation for the trait in question. For a recessive trait extra points are given to homozygous cases, and for dominant/complex we consider it more important that the controls are homozygous for the wildtype. The program simply examines all pair-wise combinations of individuals, adding +1 if two cases are homozygous and agree (for a recessive trait), +1 if two controls are homozygous and agree (for a dominant/complex trait), and +1 when a case and a control disagree in genotype. For each SNP the values are summed over all pair-wise comparisons into a total score for that SNP.
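The pair-wise scoring described above can be sketched for a single SNP as follows. This is one reading of the description, not the program's actual code: the function names, the two-letter genotype strings and the "case"/"control" labels are assumptions.

```python
from itertools import combinations


def is_homozygous(genotype):
    """True for two-letter genotypes with identical alleles, e.g. "AA"."""
    return len(genotype) == 2 and genotype[0] == genotype[1]


def score_snp(genotypes, status, model="recessive"):
    """Sum the pair-wise score for one SNP.

    genotypes: {sample: genotype}, e.g. {"dog1": "AA"} (assumed layout).
    status:    {sample: "case" or "control"}.
    model:     "recessive" or "dominant" (dominant/complex).
    """
    score = 0
    for a, b in combinations(sorted(genotypes), 2):
        genotype_a, genotype_b = genotypes[a], genotypes[b]
        if status[a] != status[b]:
            # A case and a control that disagree in genotype.
            if sorted(genotype_a) != sorted(genotype_b):
                score += 1
        elif genotype_a == genotype_b and is_homozygous(genotype_a):
            # Two cases (recessive) or two controls (dominant/complex)
            # that are homozygous and agree.
            if model == "recessive" and status[a] == "case":
                score += 1
            elif model == "dominant" and status[a] == "control":
                score += 1
    return score


genotypes = {"c1": "AA", "c2": "AA", "k1": "GG", "k2": "AG"}
status = {"c1": "case", "c2": "case", "k1": "control", "k2": "control"}
print(score_snp(genotypes, status, "recessive"))  # 5
print(score_snp(genotypes, status, "dominant"))   # 4
```

In the example, all four case-control pairs disagree (+4); under the recessive model the two homozygous, matching cases add one more point, while under the dominant/complex model the two controls do not agree and add nothing.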

Another option is to look for regions where "cases" differ from "controls". Set the desired size of the region to be analyzed and then click the compare genomic region button. Here the highest scores go to regions that are as homozygous as possible in "cases" and differ as much as possible from the controls. If only "cases" are submitted, the most homozygous region is identified. All SNPs, conserved and not conserved, are taken into consideration. If the trait or disease mutation examined is identical by descent in the individuals under investigation, a shared haplotype should be present. Note, though, that compare genomic region will most likely find regions that are homozygous for the risk allele, and is therefore most applicable to recessive traits and traits under selection.

A third option (run PLINK) translates the IUPAC codes into alleles and creates files appropriate for traditional association studies using the program PLINK. Most re-sequencing experiments have too few samples to give enough power for significant results. However, it could be worth trying if a relatively large sample size is used. Fisher's exact allelic test will be calculated, and the -log10 transformed p-values can be displayed in the UCSC browser. PLINK format .ped and .map files can be downloaded for running PLINK outside SEQscoring. For instructions about how to use PLINK see: http://pngu.mgh.harvard.edu/~purcell/plink/ We recommend that variants of interest are then assayed by genotyping in a larger cohort.
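The translation of IUPAC codes into alleles can be illustrated with the short sketch below. It assumes that each one-letter code stands for a diploid genotype and writes one row in the standard six-column .ped layout (family ID, individual ID, father, mother, sex, phenotype, then allele pairs). The helper names are hypothetical and this is not SEQscoring's own conversion code.

```python
# Standard IUPAC nucleotide codes for one or two bases.
IUPAC_TO_ALLELES = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}
MISSING = ("0", "0")  # PLINK's default code for a missing genotype


def iupac_to_alleles(code):
    """Translate a one-letter IUPAC genotype code into two alleles.
    Codes outside the table (e.g. N) are treated as missing."""
    return IUPAC_TO_ALLELES.get(code.upper(), MISSING)


def ped_line(sample, is_case, codes):
    """Build one .ped row. The sample name is reused as family ID;
    parents and sex are set to unknown (0) in this sketch."""
    phenotype = "2" if is_case else "1"  # PLINK: 2 = affected, 1 = unaffected
    fields = [sample, sample, "0", "0", "0", phenotype]
    for code in codes:
        fields.extend(iupac_to_alleles(code))
    return " ".join(fields)


print(ped_line("dog1", True, ["R", "A", "N"]))
# dog1 dog1 0 0 0 2 A G A A 0 0
```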

Relative coverage assessment

The best way to find large chromosomal aberrations is to use data from paired-end reads. When such data is not available, this module can be used to assess coverage differences between samples. The coverage information is useful for identifying the putative locations of larger insertions, copy number variation or deletions that differ between cases and controls. SEQscoring uses pileup files created by MAQ, SAMtools or Mosaik to assess coverage. Pileup files are often very large, so for performance reasons we have limited the number of positions checked to 150,000. This means that if you submit a file that covers 1.5 Mbp, every 10th position will be used. You can also set the "check interval" yourself, but it will be overruled if the maximum of 150,000 positions would be exceeded. The program also accepts .zip files to speed up the file transfer.
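The relation between region size, the 150,000 position limit and the check interval can be written out as a small calculation. The function below is a sketch of that arithmetic under the assumption that a requested interval is simply raised to the smallest value that respects the limit; the function name is hypothetical.

```python
import math

MAX_POSITIONS = 150_000  # limit stated for this module


def effective_check_interval(region_length, requested=1):
    """Return the check interval actually used: the requested one,
    unless it would lead to more than MAX_POSITIONS positions."""
    minimum = max(1, math.ceil(region_length / MAX_POSITIONS))
    return max(requested, minimum)


print(effective_check_interval(1_500_000))      # 10: every 10th position
print(effective_check_interval(1_500_000, 5))   # 10: request of 5 is overruled
print(effective_check_interval(1_500_000, 20))  # 20: request is kept
```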

To reduce random variation there is an option to average the coverage over a window of a specific size, using the window size option. For example, if the check interval option is set to ten and the window size option is also set to ten, then each window actually spans 100 bp. The average coverage is calculated for non-overlapping windows. To adjust for big differences in overall coverage between individuals, the adjusted coverage option scales all values so that all individuals get equal mean coverage. If you choose adjusted coverage, the adjusted values will be displayed for each individual in the UCSC browser. The adjusted coverage is then also used to calculate the ratio between cases and controls. The log2 value of the ratio is calculated, which gives a value of 0 for equal coverage, positive values when cases have an excess of coverage and negative values when the cases have less coverage than controls. You can choose whether to display the log2 values in the browser, or the z values, which are the normalized standard deviations of the log2 values.

Note! To be able to detect differences also when cases or controls have no coverage, the program changes all windows with a coverage of zero to a value of one! Note also that larger insertions are the hardest to detect, unless you have paired-end reads. This is because the novel sequence is not present in the reference sequence, which complicates the mapping of reads.
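The windowing, coverage adjustment, log2 ratio and z value steps described above can be sketched in a few lines of Python. Several details are assumptions of this sketch and not confirmed by the description: cases and controls are each summarized by their mean per window, the zero-to-one replacement is applied to those window means, and the z values use the population standard deviation.

```python
import math
from statistics import mean, pstdev


def window_means(coverage, window_size):
    """Average coverage over consecutive, non-overlapping windows.
    A shorter trailing window is averaged over the values it has."""
    return [mean(coverage[i:i + window_size])
            for i in range(0, len(coverage), window_size)]


def adjust_coverage(samples):
    """Scale every sample so that all samples get the same mean coverage
    (the mean of the per-sample means). Samples with zero mean are left as is."""
    sample_means = {name: mean(values) for name, values in samples.items()}
    target = mean(list(sample_means.values()))
    return {
        name: [v * target / sample_means[name] for v in values]
        if sample_means[name] else list(values)
        for name, values in samples.items()
    }


def log2_ratios(case_windows, control_windows):
    """log2(case / control) per window; zero coverage is replaced by one."""
    result = []
    for i in range(len(case_windows[0])):
        case = mean([sample[i] for sample in case_windows])
        control = mean([sample[i] for sample in control_windows])
        case = case if case > 0 else 1
        control = control if control > 0 else 1
        result.append(math.log2(case / control))
    return result


def z_values(values):
    """Standardize values to zero mean and unit standard deviation."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


# Hypothetical coverage at four checked positions, window size 2.
cases = [window_means([10, 10, 20, 20], 2)]
controls = [window_means([10, 10, 10, 10], 2)]
ratios = log2_ratios(cases, controls)
print(ratios)            # [0.0, 1.0]
print(z_values(ratios))  # [-1.0, 1.0]
```

In the example the second window has twice the coverage in the case, which gives a log2 value of 1, while the first window has equal coverage and a log2 value of 0.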

In addition to a .wig file for display in the browser, a text file will be created. At the top of the file you get, for each individual, the number of positions with coverage of at least 5X and of at least 10X. The columns that follow contain: window position, the ratio before log2 transformation, the log2 value, the z value, and one column per individual with the actual coverage before any transformation.

File formats

You can submit final SNP and indel files in various file formats to get them scored by conservation. We recommend that the files are first filtered to contain only SNPs and indels of good quality. For data that was mapped to a target region, the first column must contain at least four fields holding the following information: "name:chromosome:startposition:endposition", delimited by ":" or "," or "-" or "_".
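To check that a first-column value follows this layout before uploading, a small parser along the following lines can be used. It is a sketch with a hypothetical function name; note that it splits on every listed delimiter, so a name or chromosome that itself contains "_" or "-" would be split as well.

```python
import re

# The four delimiters allowed between fields: ":" "," "-" "_"
_DELIMITERS = re.compile(r"[:,_-]")


def parse_target_field(field):
    """Split a first-column value into (name, chromosome, start, end)."""
    parts = _DELIMITERS.split(field.strip())
    if len(parts) < 4:
        raise ValueError(f"expected at least four fields, got {len(parts)}: {field!r}")
    name, chromosome, start, end = parts[:4]
    return name, chromosome, int(start), int(end)


print(parse_target_field("canFam2_chr8:68689318-68786516"))
# ('canFam2', 'chr8', 68689318, 68786516)
```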

All options under the headings Merge & Show and Case & control require scored SNP or indel files as input data.

The Relative coverage assessment module takes pileup files from Mosaik, SAMtools or MAQ as input data. If the data was mapped only to a target region, information about the target region is needed in the file. Mosaik pileup files must contain this information in the first row of the file, e.g.

      # MOSAIK coverage file for canFam2_chr8:68689318-68786516.

MAQ and SAMtools pileup files have this information in the first column of each row.
  
Please compare coverage between individuals for only one chromosome at a time, i.e. each file should contain data for a single chromosome; otherwise this option will not work.

For MAQ files the coverage is expected in the 4th column and for SAMtools files in the 8th column. SAMtools pileup files have to be sorted.
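To inspect the coverage values in your own pileup files before uploading, the column positions above can be used as in this sketch. The function name, the whitespace splitting and the example rows are assumptions; only the column numbers come from the text.

```python
# Zero-based index of the coverage column per pileup format
# (4th column for MAQ, 8th column for SAMtools).
COVERAGE_COLUMN = {"maq": 3, "samtools": 7}


def read_coverage(lines, file_format):
    """Yield the coverage value from each pileup row.
    Empty lines and lines starting with '#' are skipped."""
    column = COVERAGE_COLUMN[file_format]
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        yield int(fields[column])


# Made-up rows for illustration only.
maq_rows = ["chr8\t68689318\tA\t12\t@,,..", "chr8\t68689319\tC\t9\t@,.,"]
print(list(read_coverage(maq_rows, "maq")))  # [12, 9]
```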