BDGP - Berkeley Drosophila Genome Group

Splice: Help

Instructions for Human and Drosophila melanogaster Splice Site Prediction using Neural Networks

To use the splice site predictor, paste your DNA sequence into the box. Your sequence should consist of one-letter nucleotides (A, C, G, T). Characters that do not uniquely determine a base (e.g. R or N) are treated as unknown bases. The sequence should be in plain or FASTA format. FASTA format looks like this:

>test1 Human 5' and 3' splice site Test Sequence
aataatagctgtttctctgttgtttaaaggcactacaaatactgtggcag
catataatttcccaggtggccggcgcttcaggtgagtggcaccagcccct
ggaagcccgg
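
The input rules above can be sketched as a small reader. This is only an illustration of the stated conventions, not the predictor's own parser: the function name `read_sequence` is hypothetical, and mapping every ambiguous letter to `N` and skipping digits and whitespace are assumptions.

```python
def read_sequence(text):
    """Return (header, sequence) from plain or FASTA-formatted text.

    Only the first FASTA record is read. Letters other than A, C, G, T
    are stored as 'N' (unknown base); non-letters are skipped. Both
    choices are assumptions made for this sketch.
    """
    header = None
    bases = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None or bases:
                break  # a second record starts here; stop reading
            header = line[1:].strip()
            continue
        for ch in line.upper():
            if ch in "ACGT":
                bases.append(ch)
            elif ch.isalpha():
                bases.append("N")  # e.g. R or N: treated as unknown
    return header, "".join(bases)
```

Plain input (no `>` line) yields a header of `None` and the cleaned sequence.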

Select whether you want to use the neural network version for Human or for Drosophila melanogaster sequences. You can choose whether to show predictions for the reverse strand as well as the forward strand.
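
For readers unfamiliar with the reverse strand, it is the reverse complement of the submitted sequence. The sketch below shows that standard operation; how the predictor itself computes or reports reverse-strand positions is not described on this page, and the helper name `reverse_complement` is hypothetical.

```python
# Translation table for the four bases plus 'N' (unknown base).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence.

    Characters outside A, C, G, T, N are passed through unchanged,
    which is an assumption of this sketch.
    """
    return seq.upper().translate(_COMPLEMENT)[::-1]
```

A 5' splice site on the reverse strand appears in the submitted sequence as the reverse complement of "GT", that is, "AC".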

The output of the neural networks is a list of the regions that the network judges most likely to be splice sites: 15-base regions for 5' splice sites and 41-base regions for 3' splice sites.

Scores

You may also set the score cutoff. It should be between 0 and 1. The default is 0.4, but please keep in mind that the score cutoff means different things for 5' and 3' splice sites. For example, at the 0.4 cutoff:
                              % splice sites     % false
                                recognized       positives
Human 5' Splice Sites              93.2%            5.2%
Human 3' Splice Sites              83.8%            3.1%
Drosophila 5' Splice Sites         91.4%            3.0%
Drosophila 3' Splice Sites         90.5%            6.5%

Tables below, under "Estimated accuracy of prediction", show the percent splice sites recognized and the percent false positives for different cutoff scores for 5' versus 3' splice site prediction. You may want to set the cutoff for prediction yourself after looking at the tables.
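
One way to choose a cutoff from the tables is to pick the highest threshold that still meets a target. The sketch below does this for the Human 5' table; the rows are copied from that table, while the names `HUMAN_5P` and `pick_cutoff` and the selection rule itself are assumptions made for illustration.

```python
# (threshold, % sites recognized, % false positive sites, CC)
# Rows copied from the Human 5' Splice Site table on this page.
HUMAN_5P = [
    (0.99, 26.0, 0.1, 0.46), (0.95, 50.4, 0.7, 0.65),
    (0.90, 64.1, 1.1, 0.73), (0.85, 72.7, 1.4, 0.78),
    (0.80, 74.4, 1.9, 0.78), (0.75, 77.8, 1.9, 0.81),
    (0.70, 81.6, 2.7, 0.81), (0.65, 85.0, 3.2, 0.83),
    (0.60, 88.0, 3.5, 0.84), (0.55, 89.3, 3.7, 0.84),
    (0.50, 91.5, 4.2, 0.85), (0.45, 93.2, 4.7, 0.85),
    (0.40, 93.2, 5.2, 0.84), (0.35, 93.6, 5.3, 0.84),
    (0.30, 94.9, 5.8, 0.84), (0.25, 95.3, 6.2, 0.84),
    (0.20, 96.2, 6.7, 0.83), (0.15, 96.6, 8.2, 0.81),
    (0.10, 97.9, 9.1, 0.81), (0.05, 98.3, 11.1, 0.78),
]


def pick_cutoff(rows, min_recognized=0.0, max_false_positive=100.0):
    """Return the highest threshold meeting both targets, or None."""
    ok = [r for r in rows
          if r[1] >= min_recognized and r[2] <= max_false_positive]
    if not ok:
        return None
    return max(ok, key=lambda r: r[0])[0]


# At least 90% of sites recognized: the highest qualifying cutoff.
print(pick_cutoff(HUMAN_5P, min_recognized=90.0))  # 0.5
```

Targets that no row satisfies (for example 95% recognized with at most 5% false positives) return `None`, which signals that one of the two targets has to be relaxed.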

About the neural network method

Splice sites are the key signal sequences that determine the boundaries of exons. A method for splice site detection should ideally be based on a thorough understanding of the complex eukaryotic splicing process. We trained a backpropagation feedforward neural network with one layer of hidden units to recognize 5' and 3' splice sites, using a representative data set (Drosophila melanogaster data set). We only consider genes that have constrained consensus splice sites, i.e., "GT" for the 5' and "AG" for the 3' splice site. The output of the network is a score between 0 and 1 for a potential splice site.
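
To make the shape of such a network concrete, here is a minimal forward-pass sketch for a feedforward network with one hidden layer and a single output between 0 and 1. It is not the trained predictor: the weights are random, and the one-hot input encoding, the sigmoid units, the hidden-layer size, and all function names are assumptions chosen for illustration.

```python
import math
import random

BASES = "ACGT"


def one_hot(window):
    """Encode each base as 4 inputs; unknown bases become all zeros."""
    vec = []
    for base in window.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_network(n_inputs, n_hidden, seed=0):
    """Build an UNTRAINED network with small random weights."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    b_out = 0.0
    return w_hidden, b_hidden, w_out, b_out


def score(window, network):
    """Forward pass: inputs -> hidden layer -> one output in (0, 1)."""
    w_hidden, b_hidden, w_out, b_out = network
    x = one_hot(window)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# A 15-base window gives 60 inputs; 10 hidden units is an arbitrary choice.
network = make_network(n_inputs=15 * 4, n_hidden=10)
print(score("GCTTCAGGTGAGTGG", network))
```

Training by backpropagation would adjust the weights so that true splice site windows score near 1 and other windows near 0; that step is omitted here.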

The neural network method is described in detail in References and Abstract.

Estimated accuracy of prediction

Human

A carefully and randomly chosen independent test set of 43 human genes (/sequence/human-datasets.html), containing no sequences related to the training set, gave the following results.

Human 5' Splice Site prediction:

  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |   26.0%   |      0.1%      |    0.46    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   50.4%   |      0.7%      |    0.65    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   64.1%   |      1.1%      |    0.73    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   72.7%   |      1.4%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   74.4%   |      1.9%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   77.8%   |      1.9%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   81.6%   |      2.7%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   85.0%   |      3.2%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   88.0%   |      3.5%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   89.3%   |      3.7%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   91.5%   |      4.2%      |    0.85    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   93.2%   |      4.7%      |    0.85    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   93.2%   |      5.2%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   93.6%   |      5.3%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   94.9%   |      5.8%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   95.3%   |      6.2%      |    0.84    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   96.2%   |      6.7%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   96.6%   |      8.2%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   97.9%   |      9.1%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   98.3%   |     11.1%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
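These percentages are defined by:
                            observed sites predicted as sites (TP)
sites recognized =        ------------------------------------------
                                all observed sites (TP+FN)


                            observed non-sites predicted as sites (FP)
false positive sites =    ----------------------------------------------
                                all observed non-sites (FP+TN)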

                                          (TPxTN)-(FNxFP)
correlation coefficient (CC) =  ------------------------------------
                                  ________________________________
                                 V (TP+FN)x(TN+FP)x(TP+FP)x(TN+FN)

TP = true positive = sites recognized
TN = true negative = non-sites recognized
FP = false positive = observed non-sites predicted as sites
FN = false negative = observed sites predicted as non-sites

Human 3' Splice Site prediction:
  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |    7.3%   |      0.0%      |    0.25    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   33.3%   |      0.4%      |    0.52    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   47.9%   |      0.5%      |    0.64    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   57.7%   |      0.6%      |    0.70    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   61.2%   |      0.9%      |    0.72    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   65.4%   |      1.1%      |    0.74    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   69.7%   |      1.3%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   73.5%   |      1.5%      |    0.79    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   76.5%   |      1.8%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   79.1%   |      2.0%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   80.8%   |      2.4%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   82.5%   |      2.9%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   83.8%   |      3.1%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   86.8%   |      3.7%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   88.5%   |      4.0%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   88.5%   |      4.5%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   90.2%   |      4.8%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   91.0%   |      6.0%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   92.3%   |      7.9%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   94.9%   |     10.4%      |    0.74    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
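
The three measures reported in the tables above follow directly from the four counts TP, TN, FP and FN defined earlier. The sketch below computes them; the function name `accuracy_measures` is hypothetical, and returning `None` for an undefined CC is an assumption of this sketch.

```python
import math


def accuracy_measures(tp, tn, fp, fn):
    """Return (sites recognized, false positive sites, CC).

    The first two values are fractions (multiply by 100 for percent).
    Assumes at least one observed site and one observed non-site.
    CC is None when its denominator is zero.
    """
    recognized = tp / (tp + fn)
    false_positive = fp / (fp + tn)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denominator == 0:
        cc = None
    else:
        cc = (tp * tn - fn * fp) / denominator
    return recognized, false_positive, cc
```

When nothing is predicted as a site at a given threshold, TP+FP is zero and CC is undefined. This is presumably why a table row with 0.0% sites recognized and 0.0% false positives shows "-" in the CC column.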

Neural network based "consensus" sequences

Extensive analysis of the perceptron neural network weight matrices has revealed the following "refined" 5' and 3' splice site consensus and non-consensus sequences:

5' Splice Site: 

              -7  6  5  4  3   2  -1     +1  2  3   4  5  6  7 +8
consensus:     a  a  a  A C|a  A   G   /  G  T  A   A  G  T  -  c      

non-consensus: g  g  g  G G|T G|T A|T     -  - C|t g|t -  -  t  -     


3' Splice Site: 

               -21 -20 19 18  17 16  15  14  13  12  11  10   9   8   7   6   5  4   3   2 -1
consensus:       -   T  T T|c  T T|C T|C T|c T|c T|c T|c T|c T|c T|C T|c T|C T|c A  T|C  A  G  
non-consensus:                                                                       G        

               +1  2  3  4  5  6  7  8  9  10  11  12 13 14 15 16  17 18 19 +20
consensus:      G  T  c  -  -  -  g  g  -   g  g|a  c  g  a  a a|c  a  g  -   -
non-consensus: c|t       t    g|t

Capital letters indicate strong weights and lower case letters weaker weights.
"|" means "or".
"-" means no significant weight.
"non-consensus" indicates bases that are very unlikely to appear at this position.
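
The position numbering above suggests how candidate windows line up with the sequence: a 5' window spans positions -7 to +8 with "GT" at +1 and +2 (15 bases), and a 3' window spans -21 to +20 with "AG" at -2 and -1 (41 bases). The sketch below extracts such windows around every "GT" and "AG". It is an illustration under that reading of the numbering, not the predictor's code; all function names are hypothetical, and skipping windows that run past the sequence ends is an assumption.

```python
def candidate_windows(seq, motif, motif_offset, length):
    """Return (index, window) pairs for every occurrence of motif.

    index is the 0-based position of the motif's first base;
    motif_offset is where the motif sits inside the window.
    Windows that would extend past either end are skipped.
    """
    seq = seq.upper()
    found = []
    start = seq.find(motif)
    while start != -1:
        lo = start - motif_offset
        hi = lo + length
        if lo >= 0 and hi <= len(seq):
            found.append((start, seq[lo:hi]))
        start = seq.find(motif, start + 1)
    return found


def donor_windows(seq):
    """5' candidates: 7 bases upstream, then GT plus 6 more (15 total)."""
    return candidate_windows(seq, "GT", motif_offset=7, length=15)


def acceptor_windows(seq):
    """3' candidates: 19 bases, then AG, then 20 downstream (41 total)."""
    return candidate_windows(seq, "AG", motif_offset=19, length=41)
```

Applied to the FASTA example near the top of this page, the "GT" in "caggtgagt" yields the window GCTTCAGGTGAGTGG, which matches the strong 5' consensus positions (C A G / G T) shown above.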

Drosophila melanogaster

A carefully and randomly chosen independent test set of 41 genes (Drosophila melanogaster gene set), containing no sequences related to the training set, gave the following results.

Drosophila melanogaster 5' Splice Site prediction:

  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |    0.0%   |      0.0%      |     -      |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   22.9%   |      0.0%      |    0.44    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   53.3%   |      0.0%      |    0.69    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   61.9%   |      0.0%      |    0.75    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   66.7%   |      0.0%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   69.5%   |      0.8%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   77.1%   |      0.8%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   78.1%   |      1.0%      |    0.83    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   81.9%   |      1.0%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   82.9%   |      1.0%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   88.6%   |      1.8%      |    0.88    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   90.5%   |      2.5%      |    0.88    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   91.4%   |      3.0%      |    0.88    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   91.4%   |      4.0%      |    0.85    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   94.3%   |      4.8%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   96.2%   |      5.3%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   97.1%   |      5.8%      |    0.86    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   97.1%   |      8.0%      |    0.82    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   99.1%   |     10.3%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   99.1%   |     15.1%      |    0.73    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+

Drosophila melanogaster 3' Splice Site prediction:

  +------------+-----------+----------------+------------+
  | threshold  |    %      |     %          | correlation|
  |            | sites     | false positive | coefficient|
  |            | recognized| sites          |    (CC)    |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.99    |    1.9%   |      0.0%      |    0.12    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.95    |   11.4%   |      0.0%      |    0.30    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.90    |   28.6%   |      0.6%      |    0.46    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.85    |   44.8%   |      0.6%      |    0.60    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.80    |   53.3%   |      1.1%      |    0.65    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.75    |   60.1%   |      2.0%      |    0.69    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.70    |   69.5%   |      2.3%      |    0.74    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.65    |   73.3%   |      2.5%      |    0.76    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.60    |   76.2%   |      3.1%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.55    |   79.0%   |      4.2%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.50    |   83.8%   |      5.4%      |    0.78    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.45    |   87.6%   |      5.9%      |    0.80    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.40    |   90.5%   |      6.5%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.35    |   92.4%   |      7.0%      |    0.81    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.30    |   94.3%   |      9.0%      |    0.79    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.25    |   94.3%   |     10.7%      |    0.77    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.20    |   96.2%   |     13.0%      |    0.75    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.15    |   96.2%   |     14.7%      |    0.73    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.10    |   96.2%   |     17.5%      |    0.69    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+
  |            |           |                |            |
  |    0.05    |   97.1%   |     30.7%      |    0.56    |
  |            |           |                |            |
  +------------+-----------+----------------+------------+


Splice site prediction code by Martin G. Reese; web interface by Nomi Harris.