Splice: Help
Instructions for Human and Drosophila melanogaster Splice Site Prediction using Neural Networks
To use the splice site predictor, paste your DNA sequence into the box.
Your sequence should consist of one-letter nucleotides (A, C, G, T).
Characters that do not uniquely determine a base (e.g. R or N) are
treated as unknown bases.
The sequence should be in
plain or FASTA format. FASTA format looks like this:
>test1 Human 5' and 3' splice site Test Sequence
aataatagctgtttctctgttgtttaaaggcactacaaatactgtggcag
catataatttcccaggtggccggcgcttcaggtgagtggcaccagcccct
ggaagcccgg
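The input rules above (plain or FASTA text, with any character other than A, C, G, T treated as unknown) can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the server's own parser. The function name read_sequence and the choice of "N" as the marker for unknown bases are assumptions, and the sketch handles a single sequence per input.

```python
def read_sequence(text):
    """Return (header, sequence) from plain or FASTA text.

    Assumption: the input holds one sequence. Any character other than
    A, C, G or T (for example R or N) is replaced by "N" to mark it as
    an unknown base.
    """
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # FASTA header line; keep the first one, without the ">".
            if header is None:
                header = line[1:].strip()
            continue
        # Drop any whitespace inside the line and normalise to upper case.
        chunks.append("".join(line.split()).upper())
    sequence = "".join(chunks)
    cleaned = "".join(base if base in "ACGT" else "N" for base in sequence)
    return header, cleaned


if __name__ == "__main__":
    example = ">test1 Human 5' and 3' splice site Test Sequence\naataatagct\nggaagcccgg\n"
    print(read_sequence(example))
```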
Select whether you want to use
the neural network version for Human or for Drosophila melanogaster
sequences.
You can choose
whether to show predictions for the reverse strand as well as the
forward strand.
The output of the neural networks is a list of the 15-base regions that
the network judges most likely to be 5' splice sites and the 41-base
regions that it judges most likely to be 3' splice sites.
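To make the 15-base and 41-base regions concrete, the following Python sketch lists the candidate windows around every GT and AG on the forward strand and, optionally, on the reverse strand. It is a sketch under assumptions, not the predictor's code: the window placement (positions -7 to +8 around GT, positions -21 to +20 around AG) is taken from the consensus listings further down this page, and the names reverse_complement and candidate_windows are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T/N sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def candidate_windows(seq, reverse=False):
    """Yield (strand, kind, start, window) for each candidate site.

    Assumed layout: a 5' window is 7 bases, then GT at +1/+2, then 6 more
    bases (15 in total); a 3' window is 19 bases, then AG at -2/-1, then
    20 more bases (41 in total). For the "-" strand, start is an offset
    into the reverse-complemented sequence.
    """
    strands = [("+", seq)]
    if reverse:
        strands.append(("-", reverse_complement(seq)))
    for strand, s in strands:
        for i in range(len(s) - 1):
            dinucleotide = s[i:i + 2]
            if dinucleotide == "GT" and i >= 7 and i + 8 <= len(s):
                yield strand, "5'", i - 7, s[i - 7:i + 8]
            if dinucleotide == "AG" and i >= 19 and i + 22 <= len(s):
                yield strand, "3'", i - 19, s[i - 19:i + 22]


if __name__ == "__main__":
    demo = "A" * 7 + "GT" + "C" * 6
    for hit in candidate_windows(demo, reverse=True):
        print(hit)
```

Candidates too close to either end of the sequence are skipped here because a full window cannot be cut out; how the server treats such positions is not stated on this page.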
Scores
You may also set the score cutoff. It should be between 0 and 1.
The default is 0.4, but please
keep in mind that the score cutoff means different things for 5' and 3' splice
sites. For example, at the 0.4 cutoff:
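+----------------------------+----------------+-----------+
|                            | % splice sites | % false   |
|                            | recognized     | positives |
+----------------------------+----------------+-----------+
| Human 5' Splice Sites      | 93.2%          | 5.2%      |
| Human 3' Splice Sites      | 83.8%          | 3.1%      |
| Drosophila 5' Splice Sites | 91.4%          | 3.0%      |
| Drosophila 3' Splice Sites | 90.5%          | 6.5%      |
+----------------------------+----------------+-----------+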
Tables below, under "Estimated accuracy of prediction", show the percent
splice sites recognized and the percent false positives for different
cutoff scores for 5' versus 3' splice site prediction. You may want to
set the cutoff for prediction yourself after looking at the tables.
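One way to pick a cutoff from the accuracy tables is to decide how many true sites you need to recover and then take the strictest cutoff that still reaches that level. The Python sketch below does this for a handful of rows copied from the Human 5' table; the function name strictest_cutoff and the selection rule itself are illustrative assumptions, not a feature of the server.

```python
# (cutoff, % sites recognized, % false positives) for Human 5' splice
# sites, copied from a few rows of the table under
# "Estimated accuracy of prediction".
HUMAN_5P = [
    (0.90, 64.1, 1.1),
    (0.70, 81.6, 2.7),
    (0.50, 91.5, 4.2),
    (0.40, 93.2, 5.2),
    (0.20, 96.2, 6.7),
]


def strictest_cutoff(rows, min_recognized):
    """Return the highest cutoff whose % sites recognized is at least
    min_recognized, or None if no listed cutoff reaches it."""
    eligible = [cutoff for cutoff, recognized, _ in rows
                if recognized >= min_recognized]
    return max(eligible) if eligible else None


if __name__ == "__main__":
    print(strictest_cutoff(HUMAN_5P, 90.0))
```

Because the same cutoff gives different accuracy for 5' and 3' sites, the lookup would be repeated with the rows of the matching table for each site type.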
About the neural network method
Splice sites are the key signal sequences that determine the boundaries
of exons. A method for splice site detection should ideally be based on a thorough
understanding of the complex eukaryotic splicing process.
We trained a backpropagation feedforward neural
network with one layer of hidden units to recognize 5' and 3' splice
sites, using a representative data set (Drosophila melanogaster data set).
We only consider genes that have
the constrained consensus splice sites, i.e., "GT" for the 5' and "AG"
for the 3' splice site.
The output of the network is a score between 0 and 1 for a potential splice site.
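As a rough picture of what "a feedforward network with one layer of hidden units" computes, the Python sketch below runs a forward pass over a 15-base window and returns a score between 0 and 1. Everything specific in it is an assumption made for illustration: the one-hot input encoding, the sigmoid units, the number of hidden units, and the random (untrained) weights. It does not reproduce the trained networks behind this server.

```python
import math
import random

BASES = "ACGT"


def one_hot(window):
    """Encode each base as four inputs; unknown bases become all zeros."""
    values = []
    for base in window.upper():
        values.extend(1.0 if base == b else 0.0 for b in BASES)
    return values


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer of sigmoid units followed by one sigmoid output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


def random_network(n_inputs, n_hidden, seed=0):
    """Hypothetical untrained weights, only so the sketch can be run."""
    rng = random.Random(seed)
    hidden_weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                      for _ in range(n_hidden)]
    hidden_biases = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    output_weights = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    output_bias = rng.uniform(-0.5, 0.5)
    return hidden_weights, hidden_biases, output_weights, output_bias


if __name__ == "__main__":
    window = "AAAACAGGTAAGTAC"  # an arbitrary 15-base example
    network = random_network(n_inputs=4 * len(window), n_hidden=5)
    print(forward(one_hot(window), *network))
```

In a trained network the weights would come from backpropagation on known splice sites and non-sites, and the score would then be compared with the chosen cutoff.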
The neural network method is described in detail in "References and Abstract".
Estimated accuracy of prediction
Human
A carefully chosen, random, independent test set of 43 human genes
(/sequence/human-datasets.html),
containing no sequences related to the training set,
gave the following results:
Human 5' Splice Site prediction:
+-----------+------------+----------------+-------------+
| threshold | % sites    | % false        | correlation |
|           | recognized | positive sites | coefficient |
|           |            |                | (CC)        |
+-----------+------------+----------------+-------------+
| 0.99      | 26.0%      | 0.1%           | 0.46        |
| 0.95      | 50.4%      | 0.7%           | 0.65        |
| 0.90      | 64.1%      | 1.1%           | 0.73        |
| 0.85      | 72.7%      | 1.4%           | 0.78        |
| 0.80      | 74.4%      | 1.9%           | 0.78        |
| 0.75      | 77.8%      | 1.9%           | 0.81        |
| 0.70      | 81.6%      | 2.7%           | 0.81        |
| 0.65      | 85.0%      | 3.2%           | 0.83        |
| 0.60      | 88.0%      | 3.5%           | 0.84        |
| 0.55      | 89.3%      | 3.7%           | 0.84        |
| 0.50      | 91.5%      | 4.2%           | 0.85        |
| 0.45      | 93.2%      | 4.7%           | 0.85        |
| 0.40      | 93.2%      | 5.2%           | 0.84        |
| 0.35      | 93.6%      | 5.3%           | 0.84        |
| 0.30      | 94.9%      | 5.8%           | 0.84        |
| 0.25      | 95.3%      | 6.2%           | 0.84        |
| 0.20      | 96.2%      | 6.7%           | 0.83        |
| 0.15      | 96.6%      | 8.2%           | 0.81        |
| 0.10      | 97.9%      | 9.1%           | 0.81        |
| 0.05      | 98.3%      | 11.1%          | 0.78        |
+-----------+------------+----------------+-------------+
These percentages are defined by:
sites recognized = TP / (TP + FN)
(correctly predicted sites divided by all observed sites)
false positive sites = FP / (FP + TN)
(non-sites predicted as sites divided by all observed non-sites)
correlation coefficient (CC) = ((TP x TN) - (FN x FP)) / sqrt((TP+FN) x (TN+FP) x (TP+FP) x (TN+FN))
TP = true positives = observed sites predicted as sites
TN = true negatives = observed non-sites predicted as non-sites
FP = false positives = observed non-sites predicted as sites
FN = false negatives = observed sites predicted as non-sites
Human 3' Splice Site prediction:
+-----------+------------+----------------+-------------+
| threshold | % sites    | % false        | correlation |
|           | recognized | positive sites | coefficient |
|           |            |                | (CC)        |
+-----------+------------+----------------+-------------+
| 0.99      | 7.3%       | 0.0%           | 0.25        |
| 0.95      | 33.3%      | 0.4%           | 0.52        |
| 0.90      | 47.9%      | 0.5%           | 0.64        |
| 0.85      | 57.7%      | 0.6%           | 0.70        |
| 0.80      | 61.2%      | 0.9%           | 0.72        |
| 0.75      | 65.4%      | 1.1%           | 0.74        |
| 0.70      | 69.7%      | 1.3%           | 0.77        |
| 0.65      | 73.5%      | 1.5%           | 0.79        |
| 0.60      | 76.5%      | 1.8%           | 0.80        |
| 0.55      | 79.1%      | 2.0%           | 0.81        |
| 0.50      | 80.8%      | 2.4%           | 0.81        |
| 0.45      | 82.5%      | 2.9%           | 0.81        |
| 0.40      | 83.8%      | 3.1%           | 0.81        |
| 0.35      | 86.8%      | 3.7%           | 0.82        |
| 0.30      | 88.5%      | 4.0%           | 0.82        |
| 0.25      | 88.5%      | 4.5%           | 0.81        |
| 0.20      | 90.2%      | 4.8%           | 0.82        |
| 0.15      | 91.0%      | 6.0%           | 0.80        |
| 0.10      | 92.3%      | 7.9%           | 0.77        |
| 0.05      | 94.9%      | 10.4%          | 0.74        |
+-----------+------------+----------------+-------------+
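The three accuracy measures used in these tables (sites recognized, false positive sites, and the correlation coefficient CC) follow directly from the TP, TN, FP and FN counts defined above. The Python sketch below computes them from raw counts; the function name accuracy_measures is hypothetical, and the example counts are invented for illustration and are not taken from the test sets.

```python
import math


def accuracy_measures(tp, tn, fp, fn):
    """Return (sites recognized, false positive sites, CC) as fractions.

    Assumes at least one observed site and one observed non-site.
    CC is None when its denominator is zero, for example when nothing
    at all is predicted as a site.
    """
    recognized = tp / (tp + fn)
    false_positive = fp / (fp + tn)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = ((tp * tn) - (fn * fp)) / denominator if denominator else None
    return recognized, false_positive, cc


if __name__ == "__main__":
    # Invented counts: 100 observed sites, 1000 observed non-sites.
    print(accuracy_measures(tp=90, tn=950, fp=50, fn=10))
```

Multiply the first two values by 100 to compare them with the percentage columns of the tables.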
Neural network based consensus sequences:
Extensive analysis of the perceptron neural network weight matrices has revealed the following "refined"
5' and 3' splice site consensus and non-consensus sequences:
5' Splice Site:
position:      -7   -6   -5   -4   -3   -2   -1   /  +1   +2   +3   +4   +5   +6   +7   +8
consensus:     a    a    a    A    C|a  A    G    /  G    T    A    A    G    T    -    c
non-consensus: g    g    g    G    G|T  G|T  A|T  /  -    -    C|t  g|t  -    -    t    -
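As a small illustration of how the 5' consensus row can be read, the Python sketch below counts how many positions of a 15-base window agree with it. This is a plain position-by-position comparison and is much cruder than the neural network score: it ignores the difference between strong (capital) and weak (lower case) weights and does not use the non-consensus row. The names CONSENSUS_5P and consensus_matches are hypothetical.

```python
# 5' consensus for positions -7..-1 and +1..+8, transcribed from the
# consensus row above. Each entry lists the bases given for that
# position; an empty string stands for "-" (no significant weight).
CONSENSUS_5P = ["A", "A", "A", "A", "CA", "A", "G",
                "G", "T", "A", "A", "G", "T", "", "C"]


def consensus_matches(window, consensus=CONSENSUS_5P):
    """Return (matches, weighted_positions) for one window.

    Positions with no significant weight are skipped, so the second
    value is the number of positions that were actually compared.
    """
    if len(window) != len(consensus):
        raise ValueError("window must have %d bases" % len(consensus))
    compared = [(base, allowed)
                for base, allowed in zip(window.upper(), consensus)
                if allowed]
    matches = sum(1 for base, allowed in compared if base in allowed)
    return matches, len(compared)


if __name__ == "__main__":
    print(consensus_matches("AAAACAGGTAAGTAC"))
```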
3' Splice Site:
-21 -20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 -1
consensus: - T T T|c T T|C T|C T|c T|c T|c T|c T|c T|c T|C T|c T|C T|c A T|C A G
non-consensus: G
+1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 +20
consensus: G T c - - - g g - g g|a c g a a a|c a g - -
non-consensus: c|t t g|t
Capital letters indicate strong weights and lower case letters weaker weights.
"|" means "or"
"-" no significant weight
"non-consensus" indicates bases that are very unlikely to appear at this position.
Drosophila melanogaster
A carefully chosen, random, independent test set of 41 genes
(Drosophila melanogaster gene set),
containing no sequences related to the training set,
gave the following results:
Drosophila melanogaster 5' Splice Site prediction:
+-----------+------------+----------------+-------------+
| threshold | % sites    | % false        | correlation |
|           | recognized | positive sites | coefficient |
|           |            |                | (CC)        |
+-----------+------------+----------------+-------------+
| 0.99      | 0.0%       | 0.0%           | -           |
| 0.95      | 22.9%      | 0.0%           | 0.44        |
| 0.90      | 53.3%      | 0.0%           | 0.69        |
| 0.85      | 61.9%      | 0.0%           | 0.75        |
| 0.80      | 66.7%      | 0.0%           | 0.78        |
| 0.75      | 69.5%      | 0.8%           | 0.78        |
| 0.70      | 77.1%      | 0.8%           | 0.83        |
| 0.65      | 78.1%      | 1.0%           | 0.83        |
| 0.60      | 81.9%      | 1.0%           | 0.86        |
| 0.55      | 82.9%      | 1.0%           | 0.86        |
| 0.50      | 88.6%      | 1.8%           | 0.88        |
| 0.45      | 90.5%      | 2.5%           | 0.88        |
| 0.40      | 91.4%      | 3.0%           | 0.88        |
| 0.35      | 91.4%      | 4.0%           | 0.85        |
| 0.30      | 94.3%      | 4.8%           | 0.86        |
| 0.25      | 96.2%      | 5.3%           | 0.86        |
| 0.20      | 97.1%      | 5.8%           | 0.86        |
| 0.15      | 97.1%      | 8.0%           | 0.82        |
| 0.10      | 99.1%      | 10.3%          | 0.80        |
| 0.05      | 99.1%      | 15.1%          | 0.73        |
+-----------+------------+----------------+-------------+
Drosophila melanogaster 3' Splice Site prediction:
+-----------+------------+----------------+-------------+
| threshold | % sites    | % false        | correlation |
|           | recognized | positive sites | coefficient |
|           |            |                | (CC)        |
+-----------+------------+----------------+-------------+
| 0.99      | 1.9%       | 0.0%           | 0.12        |
| 0.95      | 11.4%      | 0.0%           | 0.30        |
| 0.90      | 28.6%      | 0.6%           | 0.46        |
| 0.85      | 44.8%      | 0.6%           | 0.60        |
| 0.80      | 53.3%      | 1.1%           | 0.65        |
| 0.75      | 60.1%      | 2.0%           | 0.69        |
| 0.70      | 69.5%      | 2.3%           | 0.74        |
| 0.65      | 73.3%      | 2.5%           | 0.76        |
| 0.60      | 76.2%      | 3.1%           | 0.77        |
| 0.55      | 79.0%      | 4.2%           | 0.77        |
| 0.50      | 83.8%      | 5.4%           | 0.78        |
| 0.45      | 87.6%      | 5.9%           | 0.80        |
| 0.40      | 90.5%      | 6.5%           | 0.81        |
| 0.35      | 92.4%      | 7.0%           | 0.81        |
| 0.30      | 94.3%      | 9.0%           | 0.79        |
| 0.25      | 94.3%      | 10.7%          | 0.77        |
| 0.20      | 96.2%      | 13.0%          | 0.75        |
| 0.15      | 96.2%      | 14.7%          | 0.73        |
| 0.10      | 96.2%      | 17.5%          | 0.69        |
| 0.05      | 97.1%      | 30.7%          | 0.56        |
+-----------+------------+----------------+-------------+
Splice site prediction code by Martin G. Reese; web interface by Nomi Harris.