Neural networks in non-Euclidean metric spaces.
Włodzisław Duch and Rafał Adamczak,
Department of Computer Methods, Nicholas Copernicus University,
Grudziądzka 5, 87-100 Toruń, Poland.
E-mail: duch,raad@phys.uni.torun.pl
Abstract. Multilayer Perceptrons (MLPs) use scalar products to compute the weighted activation of neurons, providing decision borders built from combinations of soft hyperplanes. The weighted fan-in activation function corresponds to Euclidean distance functions used to compute similarities between input and weight vectors. Replacing the fan-in activation function by a non-Euclidean distance function offers a natural generalization of the standard MLP model, providing more flexible decision borders. An alternative way leading to similar results is based on renormalization of the input vectors using non-Euclidean norms in extended feature spaces. Both approaches influence the shapes of decision borders dramatically, allowing the complexity of MLP networks to be reduced.
I. INTRODUCTION
NEURAL networks of the most popular multi-layer perceptron (MLP) type perform discrimination of the input feature space using hyperplanes. Many other transfer functions have been proposed to increase the flexibility of contours used for estimation of decision borders [1] (for a recent review see [2]). Perhaps the best known alternative to sigmoidal functions are localized Gaussian functions and other radial basis functions. Viewing the problem of learning from a geometrical point of view, the functions performed by neural nodes should enable tessellation of the input space in the most flexible way using a small number of adaptive parameters. Although neural networks with a single hidden layer using sigmoidal or Gaussian functions can approximate an arbitrary continuous function on a compact domain with arbitrary precision given a sufficient number of neurons [3], i.e. they are universal approximators, for some datasets a large (and hard to train) network using sigmoidal or Gaussian functions may be needed for tasks that could be solved with a small (and easy to train) network using other transfer functions, providing more flexible borders.

Improvements in learning and architectures of neural networks may not be able to overcome the inherent drawbacks of models that provide wrong decision borders for a given problem. Consider a simple classification problem in N dimensions, with a spherical distribution of vectors belonging to class C1, to be distinguished from vectors belonging to class C2, lying outside the unit sphere. A single neuron performing a multivariate Gaussian function with 2N adaptive parameters (specifying the center and dispersions in each dimension) is sufficient for this job and the training process is quite simple. Many hyperplanes provided by sigmoidal functions are needed to approximate spherical decision borders. The simplest approximation, using the standard multilayer perceptron (MLP) architecture, that captures any bound region of N-dimensional space requires construction of a simplex using N sigmoids and an additional neuron to smooth the output of a combination of these neurons. Thus at least N² + N parameters are needed, compared with 2N parameters for localized functions. On the other hand, if data vectors belonging to the first class are taken from the corner of the coordinate system bound by the (1,1,...,1) plane, a single sigmoidal function with N + 1 parameters is sufficient for perfect classification, while Gaussian approximation will be quite difficult. A poor approximation may use one Gaussian in the center of the region and N + 1 Gaussians in the corners, using 2N(N + 2) adaptive parameters.

In the first example the complexity of the training process is O(N²) for MLP and O(N) for RBF, and in the second example the situation is reversed. One may easily create more complicated examples with more classes between concentric spheres of growing radii or with series of hyperplanes passing through (m,m,...,m) points. Improved learning algorithms or network architectures will not change the relative complexity of solutions as long as the decision borders provided by the transfer functions remain spherical (as in the first example) or planar (as in the second example). Artificial examples that are favorable for other types of functions are also easy to construct.

MLPs are similar to statistical discriminant techniques, although soft sigmoids allow for representation of more complex, nonlinear decision borders. This is usually considered to be a strength of the MLP model, although in cases when sharp decision borders are needed it may also become its weakness. For example, classification borders conforming to a simple logical rule x1 > 1 ∧ x2 > 1 are easily represented by two hyperplanes, but there is no way to represent them accurately using soft sigmoidal functions. Increasing the slopes of sigmoidal functions to improve representation of such decision borders around the (1,1) point leads to problems with learning by backpropagation, or by any other gradient-based method, since the volume of the input space in which sigmoids change rapidly (and thus gradients are non-zero) is rapidly shrinking. In the limit sigmoidal functions become step functions, but gradient techniques like backpropagation cannot be used to make this transition. As a result, for some datasets no change in the learning rule or network architecture will improve the accuracy of neural solutions. A good real-world example is the hypothyroid dataset, for which the best optimized MLPs still give about 1.5% error [4] while logical rules reduce it to 0.64% (since 3428 cases are provided for testing this is a significant improvement). Another example may be provided by the NASA Shuttle benchmark data [5], where MLP makes about 3.5% errors on the test set, RBF makes about 1.4% error, while logical rules achieve 0.01% of errors (one or two vectors out of 14500 in the test set).

Most research on neural networks is concentrated on architectures and learning rules, but the selection of neural transfer functions may be crucial to network performance [1]. Networks providing the most flexible decision borders with the lowest number of adaptive parameters may have an advantage over larger and more complex networks. There are two ways to create such networks: first, the flexibility of transfer functions may be increased, and second, inputs may be transformed in a non-linear way. Transfer functions o(I(X)) are compositions of the activation I(X) and the output function o(I). Output functions are usually either sigmoidal (S-shaped) squashing functions, preventing unbounded growth of signals in the network, or localized, bell-shaped functions. Activation functions are almost always either fan-in weighted scalar products W · X, used with sigmoidal functions, or Euclidean distance functions, used with bell-shaped functions. In this paper we will show how to change the activation functions to obtain more flexible decision borders. Flexibility of transfer functions is strongly correlated with the number of functions (and thus with the number of adaptive parameters available for training) necessary to model complex shapes of decision borders.

We will place our considerations in the general framework for similarity-based classification methods (SBMs) presented recently [6], [7]. Investigation of connections between neural networks and similarity-based methods leads to a number of new neural network models. In particular the distance-based MLP (D-MLP) networks are obtained by replacing the weighted activation with a square of the Euclidean distance [8]. Such networks improve upon the traditional approach by providing more flexible decision borders and by enabling a prototype-based interpretation of the results. Since the use of distance functions (instead of weighted activation) in neural network models is a novel idea it is described in the next section. In the third section a transformation of the input data to an extended feature space is proposed, enabling the use of non-Euclidean distance functions in standard MLP backpropagation programs without the need for coding the new transfer functions and their derivatives. The fourth section shows how to determine the architecture and parameters of such networks, including the slopes for each neuron. An illustration of this method on the Iris data is presented for pedagogical purposes in the fifth section. The paper is finished with a short discussion.
II. DISTANCE FUNCTIONS IN NEURAL NETWORKS
In the similarity-based framework the classification problem is solved using a set of class-labeled training vectors {Rj, C(Rj)}, j = 1..Nt, where C(Rj) is the class of Rj. The probability p(Ci|X; M) that a given vector X belongs to class Ci is calculated using the information provided in the similarity measure D(X, Rj). M stands here for the classification model used, i.e. the values of all parameters and the procedures employed. A general similarity-based model [6] of an adaptive system used for classification may include various types of parameters and procedures, such as: the set {Rj} of reference vectors created from the set of training vectors {Xi} by a selection and/or optimization procedure; a similarity function D(·) (frequently a distance function or an activation function in neural networks) parameterized in various ways, or a table used to compute similarities for nominal attributes; a weighting function G(D(X,R)), estimating the contribution of the reference vector R to the classification probability depending on its similarity to the vector X; and the total cost function E[·] optimized during training. The cost function may include regularization terms and hints [3]; it may depend upon a kernel function K(·), scaling the influence of the error for a given training example on the total cost; it may use a risk matrix R(Ci|Cj) of assigning wrong classes or a matrix S(Ci|Cj) measuring the similarity of output classes. In some models the number k of reference vectors taken into account in the neighborhood of X is specified. An adaptive system may include several such models Ml and an interpolation procedure to select between different models or average the results of a committee of models.
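As a minimal sketch of this framework (not code from the paper), the classification probability can be computed by summing the weights G(D(X, Rj)) over the reference vectors of each class and normalizing; the Euclidean distance and Gaussian weighting used below are only one possible choice of the model M.

```python
import numpy as np

def classify(x, references, labels,
             D=lambda a, b: np.linalg.norm(a - b),   # similarity (distance) function
             G=lambda d: np.exp(-d ** 2)):           # weighting function G(D)
    """Similarity-based classification: p(C_i|x; M) is taken proportional to the
    summed weights G(D(x, R_j)) over reference vectors R_j belonging to class C_i."""
    classes = np.unique(labels)
    scores = np.array([sum(G(D(x, r)) for r, c in zip(references, labels) if c == ci)
                       for ci in classes])
    return dict(zip(classes, scores / scores.sum()))
```

Selection or optimization of the reference set {Rj}, of the parameters of D and G, and of the cost function then defines a particular member of the SBM family.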
In RBF networks Euclidean distance functions D(X,Rj) = ||X − Rj|| are assumed, and radial weighting functions are used, for example Gaussian G(D) = exp(−D²) or G(D) = 1/(1 + D²). Essentially RBF is a minimal-distance, softly weighted method with no restrictions on the number of neighbors: reference vectors Rj that are close to X influence the probabilities of classification more than those that are far. The SBM framework suggests that there is nothing special about this choice of the distance function and the weighting function. Any distance function D(X,R) and weighting function G(D) may be used to create a neural network.

In the Gaussian classifier [9] or in the original RBF network only one parameter, the dispersion, was optimized [10]. Optimization of the positions of the reference centers Rj leads to the LVQ method [11], in which the training set vectors are used to define the initial prototypes and the minimal distance rule is used to assign the classes. The Restricted Coulomb Energy (RCE) classifier [12] uses a hard-sphere weighting function. The Feature Space Mapping model (FSM) is based on separable, rather than radial, weighting functions [13]. Recently a method to create oblique probability distributions in N-dimensional space using only N parameters has been described [1].
MLPs and other networks using discriminant functions are also special cases of the general SBM framework. Threshold neurons compute distances in a natural way. If the input signals X and the weights W are (±1, ..., ±1) vectors, a neuron with N inputs and threshold θ realizes the following function:

\Theta\left(\sum_{i=1}^{N} W_i X_i - \theta\right) =
\begin{cases}
0 & \text{if } \|W - X\| > (N - \theta)/2 \\
1 & \text{if } \|W - X\| \le (N - \theta)/2
\end{cases}
\qquad (1)

where the norm ||·|| is defined by the Hamming distance. One can interpret the weights of the neurons in the first hidden layer as addresses of the reference vectors in the input space, and the activity of a threshold neuron as activation by inputs falling into a hard sphere of radius θ centered at W. The Hamming neural network [14] is actually a neural realization of the nearest neighbor method for a single neighbor and binary inputs. Changing binary into real values and threshold into sigmoidal neurons, for inputs normalized to ||X|| = ||W|| = 1, leads to soft activation of neurons by input vectors close to W on the unit sphere. In general the activation of a neuron may be written as:

W \cdot X = \frac{1}{2}\left(\|W\|^2 + \|X\|^2 - \|W - X\|^2\right)
\qquad (2)
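The equivalence stated in Eq. (1) is easy to verify numerically: for ±1 vectors the thresholded scalar product fires exactly for inputs inside a Hamming ball around W. The check below is a small illustration of ours, with arbitrary N and θ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, theta = 16, 8
W = rng.choice([-1, 1], size=N)

for _ in range(1000):
    X = rng.choice([-1, 1], size=N)
    fires = (W @ X - theta) >= 0                   # threshold neuron Theta(W.X - theta)
    hamming = int(np.sum(W != X))                  # ||W - X|| in the Hamming metric
    assert fires == (hamming <= (N - theta) / 2)   # condition of Eq. (1)
```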
For normalized input vectors, sigmoidal functions (or any other monotonically growing transfer functions) may therefore be written in the form:

\sigma(W \cdot X + \theta) = \sigma(d_0 - D(W, X))
\qquad (3)

where D(W,X) is the square of the Euclidean distance between W and X and the 1/2 factor is absorbed in the sigmoid's slope. This function evaluates the influence of the reference vectors W on the classification probability p(Ci|X;W). To avoid loss of information during normalization of the input vectors an additional component Xr is added, a new feature measuring the difference between the largest norm and the norm of the original input vector.
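The identity behind Eqs. (2)–(3) can also be checked directly: on the unit sphere the weighted activation differs from the halved squared Euclidean distance only by a constant, so a sigmoid of one can always be rewritten as a sigmoid of the other. The snippet below is an illustration under that reading; the value d0 = 2(1 + θ), before the 1/2 factor is absorbed into the slope, is derived here and is not a number given in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
unit = lambda v: v / np.linalg.norm(v)
theta = 0.3

for _ in range(1000):
    X, W = unit(rng.normal(size=5)), unit(rng.normal(size=5))
    D = np.sum((W - X) ** 2)                      # square of the Euclidean distance
    # Eq. (2) on the unit sphere: W.X = (1 + 1 - D) / 2
    assert np.isclose(W @ X, 1.0 - D / 2)
    # hence W.X + theta = (d0 - D) / 2 with d0 = 2 (1 + theta); the factor 1/2
    # is the one absorbed into the sigmoid's slope in Eq. (3)
    assert np.isclose(W @ X + theta, (2 * (1 + theta) - D) / 2)
```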
The transfer function f(D(W,X)) = σ(d0 − D(W,X)) decreases monotonically as a function of distance, with a flat plateau for small distances D, reaching the value of 0.5 for D(W,X) = d0 and approaching zero for larger distances. For normalized X but arbitrary W the sigmoid arguments belong to the [θ − |W|, θ + |W|] interval. A unipolar sigmoid has its maximum curvature around ±2.4, therefore smaller thresholds and norms of the weights mean that the network operates in an almost linear regime, smoothing the network approximation to the training data. This is one of the reasons why regularization methods, adding penalty terms to the error function to reduce the weights, improve generalization [3].

From the similarity-based point of view MLP networks use sigmoidal functions to estimate the influence of weight vectors according to the distance between these vectors and the training vectors, combining many such estimations to compute the final output. Changing the distance function in equation (3) from the square of the Euclidean distance to some other distance measure, new types of neural networks, called D-MLP networks [8], are defined. Another possibility is to write the weighted product in the form:

\sigma(W \cdot X) = \sigma\!\left(\frac{1}{4}\left(\|W + X\|^2 - \|W - X\|^2\right)\right)
\qquad (4)

Euclidean norms may be replaced by Minkovsky or other types of norms. The backpropagation procedure requires derivatives of the distance functions, but for Minkovsky and several other functions they are easily provided. The generalized Minkovsky distance with scaling factors is given by:

D(A, B; s)^\beta = \sum_{i=1}^{N} s_i \, d(A_i, B_i)^\alpha
\qquad (5)

where β = α is usually taken. The d(·) function is used to estimate similarity at the feature level and in the simplest case d(Ai,Bi) = |Ai − Bi|. For α = β = 2 the vectors ||A|| = 1 are on the unit sphere, for large α the sphere is changed into a soft cuboid, for α = 1 it has a pyramidal and for α < 1 a hypocycloidal shape.
where R e" maxX ||X||, amounts to a projection of the data or just use polar coordinates to shift some upper hemisphere
on a unit hemisphere with radius R. Vectors (X,Xr) may be vectors to the part of the lower hemisphere. Much simpler
renormalized ||(X,Xr)||D = 1 using the metric defined by the approach is to rescale all vectors to get their Euclidean norms
distance function D(X,R). d" 1, use the norm ||X|| mapping it to points on a cir-
and Δ
Δ
cle: sin (4 - 5||X||),cos (4 - 5||X||) . These points for
3 3
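The extension and renormalization described above can be written as a small preprocessing step. In the sketch below (an illustration, not the original code) the ||·||_D renormalization is taken to be a simple division by the Minkovsky norm of the extended vector, and the square root in Xr follows from the hemisphere interpretation.

```python
import numpy as np

def minkovsky_norm(v, alpha=2.0):
    # ||v|| in the Minkovsky metric with beta = alpha and unit scaling factors
    return np.sum(np.abs(v) ** alpha) ** (1.0 / alpha)

def extend_and_renormalize(X, alpha=2.0):
    """Append X_r = sqrt(R^2 - ||x||^2), with R the largest Euclidean norm in the
    data, then rescale every extended vector to unit Minkovsky norm."""
    norms = np.linalg.norm(X, axis=1)
    R = norms.max()
    Xr = np.sqrt(np.maximum(R ** 2 - norms ** 2, 0.0))
    ext = np.hstack([X, Xr[:, None]])
    D_norms = np.array([minkovsky_norm(v, alpha) for v in ext])
    return ext / D_norms[:, None]
```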
The distance function may be heterogeneous, using Minkovsky's metric for numerical features and probabilistic metric functions for symbolic features. In memory-based reasoning the Modified Value Difference Metric (MVDM) has gained popularity [15]. The distance between two N-dimensional vectors A, B with discrete (nominal, symbolic) elements, in a K-class problem, is computed using conditional probabilities:

D_V^{\alpha}(A, B) = \sum_{j=1}^{N} \sum_{i=1}^{K} \left| p(C_i|A_j) - p(C_i|B_j) \right|^{\alpha} = \sum_{j=1}^{N} d_V^{\alpha}(A_j, B_j)
\qquad (6)

where p(Ci|Aj) is estimated by calculating the number Ni(Aj) of times the value Aj of the feature j occurred in vectors belonging to class Ci, and dividing it by the number of times Aj occurred for any class. The value differences d_V^α(Aj,Bj) for each feature j allow DV(A,B) to be computed as a sum of additive factors over all features. Distance is defined here via a data-dependent matrix with the number of rows equal to the number of classes and the number of columns equal to the number of all attribute values. Generalization for continuous values requires a set of probability density functions pij(x), with i = 1..K, j = 1..N. Additional scaling factors may be introduced, as in Eq. (5).
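Eq. (6) needs only the conditional probability tables p(Ci|Aj), estimated by counting as explained above. The sketch below (an illustration returning D_V^α exactly as written in Eq. (6)) shows one way to build the tables and evaluate the metric for symbolic vectors.

```python
from collections import Counter, defaultdict

def vdm_tables(data, labels):
    """For each feature j and symbolic value a, estimate p(C_i | a) as the fraction
    of training vectors with that value that belong to class C_i."""
    tables = []
    for j in range(len(data[0])):
        counts, totals = defaultdict(Counter), Counter()
        for row, c in zip(data, labels):
            counts[row[j]][c] += 1
            totals[row[j]] += 1
        tables.append({a: {c: counts[a][c] / totals[a] for c in counts[a]}
                       for a in totals})
    return tables

def vdm(a, b, tables, classes, alpha=1.0):
    """Value Difference Metric: D_V^alpha(A, B) of Eq. (6)."""
    return sum(abs(tables[j].get(a[j], {}).get(c, 0.0)
                   - tables[j].get(b[j], {}).get(c, 0.0)) ** alpha
               for j in range(len(a)) for c in classes)
```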
Using VDM-type metrics leads to problems with the calculation of gradients, but replacing symbolic features by vectors of p(Ci|Aj) probabilities (with dimension equal to the number of classes times the number of different symbolic values the feature takes) allows MVDM distances to be reproduced using numerical values of vector components. Many other types of metric functions exist [15] and their performance should be empirically verified. Several alternative extensions of the input space may be considered, for example adding one or more features Xr = D(X,R) equal to the distance of a given vector X to some fixed vector R; in this case a parabolic projection is made.

It may be of some advantage to increase the separation of the clusters projected on the hypersphere. It is impossible to make such a projection on the whole hypersphere without violating topological constraints. In the one-dimensional case with X ∈ [−1,+1] the (X, Xr) vector should not make a full circle when X is changed from −1 to +1, because the two extreme vectors X = ±1 would then be identical. An optimal separation for 3 vectors with lengths ||X||, ||X||+Δ, ||X||+2Δ is to place them in the corners of an equilateral triangle, for example at angles 0, ±120°. One can search for the best input preprocessing treating it as a rigorous optimization problem, or just use polar coordinates to shift some upper-hemisphere vectors to the lower hemisphere. A much simpler approach is to rescale all vectors so that their Euclidean norms are ≤ 1 and use the norm ||X||, mapping it to points on a circle: (sin(π/3 (4 − 5||X||)), cos(π/3 (4 − 5||X||))). For 0 ≤ ||X|| ≤ 1 these points lie within the angles −π/3 and 4π/3. The first factor, sin(π/3 (4 − 5||X||)), is used to rescale all components of the vector X, while the second factor is taken as an extra Xr component. The extended vectors (Xj, Xrj) are renormalized, ||(Xj, Xrj)||_D = 1, using the metric function D(·), placing them on a unit sphere defined by this metric.
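Read literally, the circle mapping above multiplies every component of X by sin(π/3 (4 − 5||X||)) and appends cos(π/3 (4 − 5||X||)) as Xr; a final ||·||_D renormalization (as in the earlier sketch) then follows. The code below is one reading of that recipe and assumes the norms have already been rescaled to [0, 1].

```python
import numpy as np

def circle_map(X):
    """Map each vector to the extended space using the angle
    phi = pi/3 * (4 - 5 ||x||), which stays between -pi/3 and 4*pi/3."""
    norms = np.linalg.norm(X, axis=1)
    phi = np.pi / 3.0 * (4.0 - 5.0 * norms)
    return np.hstack([X * np.sin(phi)[:, None], np.cos(phi)[:, None]])
```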
IV. INITIALIZATION OF THE NETWORK
The network should be initialized by taking the centers of clusters in the extended space as W and taking d0 = D(W, Xb), where Xb is a vector at the border of the given cluster (we have tried [16] dendrograms and decision trees, but other clusterization methods may also be used for initialization [9]). Using the weighted activation, the contribution of a center of an input data cluster C lying on the unit sphere is W · C. The largest activation is obtained when the weights W point in the same direction as the center C. The logistic function σ(C · X − θ) = (1 + exp((−C · X + θ)/T))⁻¹, where T determines the slope, has the largest gradient in the direction of W = C. The value σ(0) = 0.5 is obtained at a distance θ from the origin of the coordinate system. Since the C vector is normalized, θ = 1 places the contour for the 0.5 value tangentially to the unit hypersphere. Contours for lower values σ(C · X − θ) < 0.5 cut segments of the hypersphere in which the value of σ(C · X − θ) is constant.
A parameter which is rarely changed in MLPs is the slope of the sigmoidal functions. It defines the area which has an influence on the performance of each node. If the slope is too high, the area in which the sigmoidal function is not approximately constant is small and only a few training vectors have a chance to influence the gradient-based learning procedures. If it is too low, all functions strongly overlap and there is no possibility to create sharp decision borders. Normalization of the weights W is equivalent to a local change of the slope:

(W \cdot X + \theta)/T = \left(\frac{W}{\|W\|} \cdot X + \frac{\theta}{\|W\|}\right)\|W\|/T = (W' \cdot X + \theta')/T' = (d_0 - D(W', X))/T'
\qquad (7)

Thus without loss of generality both X and W may be normalized. No special learning for the slopes is required since they are computed from the norms of the weights. A useful variability range of the sigmoid is between its maximum curvature points, which for T = 1 are at Δ(T) = ±2.4. If the variability range is assumed to be 1/10 of the size of the cluster, i.e. Δ(T) = ±d0/10, then setting T ≈ d0/24 will be appropriate (±2.4 T = ±d0/10 gives T = d0/24). After such initialization the training of the network is usually quite short.
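The whole initialization can be summarized in a few lines: take each cluster center in the extended space as a weight vector W, set d0 to the distance from W to a border vector of the cluster, and derive the slope from T ≈ d0/24. The sketch below is a paraphrase of that recipe, not the authors' code; using one cluster per class, the class mean as its center and the farthest member as the "border vector" are simplifications standing in for the dendrogram or decision-tree clusterization mentioned in the text.

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def init_dmlp_layer(X, y, distance=euclidean):
    """One distance-based neuron per cluster: W = normalized cluster center,
    d0 = D(W, X_b) for a border vector X_b of the cluster, T ~ d0 / 24."""
    neurons = []
    for c in np.unique(y):
        cluster = X[y == c]
        W = cluster.mean(axis=0)
        W = W / np.linalg.norm(W)                    # centers lie on the unit sphere
        d0 = max(distance(W, x) for x in cluster)    # farthest member ~ cluster border
        neurons.append({"class": c, "W": W, "d0": d0, "T": d0 / 24.0})
    return neurons

def neuron_output(x, n, distance=euclidean):
    # sigma((d0 - D(W, x)) / T), the transfer function of Eq. (3) with slope T
    z = (n["d0"] - distance(n["W"], x)) / n["T"]
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))
```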
In the XOR case the input vectors for class T are (0,1), (1,0) and for class F are (0,0), (1,1). The mean for each feature is 0.5, and after shifting and renormalizing the vectors are C1 = (−1,+1)/√2, C2 = (+1,−1)/√2 for class T and (−1,−1)/√2, (+1,+1)/√2 for class F. Selecting one of the classes for the output, for example class T, the initial weights for the first neuron are given by C1 and for the second neuron by C2, while the hidden-to-output layer weights are all +1. This is the correct and the simplest solution for the XOR problem, found without any optimization of the network!
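The claim is easy to check numerically. The snippet below builds exactly this network; the threshold θ = 0.5 and slope T = 0.2 are illustrative choices of ours, since the text fixes only the weights.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
s2 = np.sqrt(2.0)
data = {  # XOR inputs after shifting by the mean (0.5, 0.5) and renormalizing
    "T": [np.array([-1.0, 1.0]) / s2, np.array([1.0, -1.0]) / s2],
    "F": [np.array([-1.0, -1.0]) / s2, np.array([1.0, 1.0]) / s2],
}
C1, C2 = data["T"]              # class T cluster centers = initial hidden weights
theta, T = 0.5, 0.2             # illustrative threshold and slope (an assumption)

for label, vectors in data.items():
    for x in vectors:
        hidden = sigmoid((np.array([C1 @ x, C2 @ x]) - theta) / T)
        output = hidden.sum()   # hidden-to-output weights are all +1
        print(label, x.round(2), round(float(output), 3))
# Class T vectors give outputs well above 0.5, class F vectors well below it,
# so the initialized network already solves XOR without any training.
```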
For more complex examples of this type of initialization see [16]. Since the architecture of the MLP network in the extended space is completely determined by the initialization procedure (the clusterization method used determines all parameters), and the training should be rapid due to a good starting point, many distance functions may be tried on a given problem.

Fig. 1. Shapes of decision borders in the Iris case for the network without the hidden layer (3 output neurons, 3 inputs), with growing W3 weight.
V. PEDAGOGICAL ILLUSTRATION
The influence of input renormalization (using non-Euclidean distance functions) on the shapes of decision borders is illustrated below on the classical Iris flowers dataset, containing 150 cases divided into 3 classes [17]. The flowers are described by 4 measurements (petal and sepal width and length). Two classes, Iris virginica and Iris versicolor, overlap, and therefore a perfect partition of the input space into separate classes is not possible. An optimal solution (from the point of view of generalization) contains 3 errors [18] and may be obtained using only two of the four input features (x3 and x4); therefore the results are easy to display and only those two features have been left in the simulations described below. The data has been standardized and rescaled to fit inside a square with ±1 corners.

A standard MLP solution is obtained with 2 input, 4 hidden and 3 output neurons, with a total of 27 adaptive parameters. One discriminating plane per class is needed for the smallest and the largest flowers (setosa and virginica), and two planes are needed to separate the vectors of the versicolor class. To increase accuracy and speed up learning, in the final phase of learning only the vectors near the class borders were presented to the network. The selection algorithm loops over all vectors and for a given vector X finds the k (for example k = 10) nearest vectors belonging to a different class than X. These vectors are written to a new training file, providing a description of the border region. This method of training leads to sharper and more accurate decision borders, as seen in the first drawing of Fig. 2.
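The border-selection step is straightforward to reproduce. The sketch below is one reading of the description: for every training vector it keeps the k nearest vectors from other classes and returns the union as the reduced "border region" training set.

```python
import numpy as np

def border_vectors(X, y, k=10):
    """For each vector, find its k nearest neighbours that belong to a different
    class; the union of all such neighbours describes the border region."""
    selected = set()
    for i, x in enumerate(X):
        other = np.where(y != y[i])[0]                 # indices of other-class vectors
        dists = np.linalg.norm(X[other] - x, axis=1)
        selected.update(other[np.argsort(dists)[:k]].tolist())
    idx = sorted(selected)
    return X[idx], y[idx]

# The reduced set is then used in the final phase of training:
# X_border, y_border = border_vectors(X_train, y_train, k=10)
```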
An additional input feature has been added and the 3-dimensional vectors normalized using various Minkovsky distance measures. The network has been initialized taking the normalized weights equal to the centers of the three clusters. In the extended feature space the same accuracy is achieved using only 3 input and 3 output neurons without a hidden layer, for a total of 12 adaptive parameters. Using such a network architecture with the squared Euclidean metric and the weight W3 = 0 for the third X3 component, only two classes are separated. The transition process from planar to circular decision borders is shown in Fig. 1 (clockwise, from top left). In the learning process W3 grows, curving the borders for vectors near the center of the drawing.

In Fig. 2 dramatic changes in the final shapes of decision borders for the Minkovsky metric are shown. The Euclidean case corresponds to circular decision borders, the city block metric α = 1 gives sharp, rhomboidal shapes, for large α almost rectangular decision borders are obtained (an approximation using logical rules is in this case quite accurate), while for small α hypocycloidal shapes are created. For the Iris data the optimal solution (3 errors) has been recovered for all values of α ≥ 0.8. A smooth transition between these cases is made by changing α and retraining the network. For other datasets we have found significant improvements of accuracy for optimized α.

Fig. 2. Shapes of decision borders in the Iris case for standard MLP (2 inputs, 3 hidden and 3 output neurons, first drawing) and using an additional input dimension to renormalize data vectors with the Minkovsky metric, α = 2.0, 0.5, and 7.0.
VI. DISCUSSION
Distance-based neural networks increase the flexibility of decision borders by modifying transfer functions, either in a global way (if the same distance function is used for each node) or locally (if the distance functions are different at each node). Non-Euclidean transformations of the input vectors also lead to very flexible shapes of neural network decision borders without any change in the standard computer programs. The training times are short, since a good initialization procedure based on clusterization techniques determines the weights, thresholds and slopes of all neurons. The complexity of the network architecture defined in the extended space is usually smaller compared to the standard MLPs needed to obtain similar accuracy on a given dataset, as has been observed in the Iris example. Since the training is fast, many different metric functions may be tried before selecting (using cross-validation tests) the best model. Networks with activation given by Eq. (3) or (4) have not yet been implemented, but such models seem to be quite promising.

The change of the shapes of decision borders has been accomplished before by adding new types of units to neural networks (see the forthcoming review [2]). For example, Ridella et al. [19] used circular units in their Circular Backpropagation Networks. A different type of circular unit has been used by Kirby and Miranda [20]; in their implementation two sigmoidal units are coupled together and their output is restricted to lie on a unit circle. Dorffner [21] proposed conic section transfer functions as a unified framework for MLP and RBF networks. Straight lines and ellipses are special cases of conic sections. Non-Euclidean metrics have been used to characterize the manifold of all Boltzmann machines and EM algorithms by Amari [22], but not for increasing the flexibility of neural network decision borders. The input renormalization method may be treated as a generalization of the circular or conical unit method. It is not restricted to MLP neural networks, but can be used with any neural network and any classifier. An additional advantage of our approach is the understanding of what the network has really learned in terms of the prototypes (weights) and sigmoidally weighted distances (similarities) to these prototypes.

Acknowledgments

Support by the KBN, grant 8 T11F 014 14, is gratefully acknowledged.

REFERENCES

[1] W. Duch, N. Jankowski, New neural transfer functions, Applied Mathematics and Computer Science 7 (1997) 639-658
[2] W. Duch, N. Jankowski, Survey of neural transfer functions, Neural Computing Surveys (submitted, 1999)
[3] C. Bishop, Neural networks for pattern recognition, Clarendon Press, Oxford, 1995
[4] W. Schiffman, M. Joost, R. Werner, Comparison of optimized backpropagation algorithms, Proc. of ESANN'93, Brussels 1993, pp. 97-104
[5] D. Michie, D.J. Spiegelhalter, C.C. Taylor, Machine learning, neural and statistical classification, Ellis Horwood, London 1994
[6] W. Duch, Neural minimal distance methods, Proc. 3rd Conf. on Neural Networks and Their Applications, Kule, Poland, Oct. 14-18, 1997, pp. 183-188
[7] W. Duch, K. Grudzinski, G.H.F. Diercksen, Minimal distance neural methods, World Congress of Computational Intelligence, May 1998, Anchorage, Alaska, IJCNN'98 Proceedings, pp. 1299-1304
[8] W. Duch, R. Adamczak, G.H.F. Diercksen, Distance-based multilayer perceptrons, in: Computational Intelligence for Modelling, Control and Automation. Neural Networks and Advanced Control Strategies, ed. M. Mohammadian, IOS Press, Amsterdam, pp. 75-80
[9] P.R. Krishnaiah, L.N. Kanal, eds, Handbook of statistics 2: classification, pattern recognition and reduction of dimensionality, North Holland, Amsterdam 1982
[10] P.D. Wasserman, Advanced methods in neural networks, van Nostrand Reinhold 1993
[11] T. Kohonen, Self-organizing maps, Springer-Verlag, Berlin 1995
[12] D.L. Reilly, L.N. Cooper, C. Elbaum, A neural model for category learning, Biological Cybernetics 45 (1982) 35-41
[13] W. Duch, G.H.F. Diercksen, Feature Space Mapping as a universal adaptive system, Comp. Phys. Communic. 87 (1995) 341-371
[14] R.P. Lippmann, An introduction to computing with neural nets, IEEE Magazine on Acoustics, Signal and Speech Processing 4 (1987) 4-22; P. Floreen, The convergence of Hamming memory networks, IEEE Trans. Neural Networks 2 (1991) 449-457
[15] D.R. Wilson, T.R. Martinez, Improved heterogeneous distance functions, J. Artificial Intelligence Research 6 (1997) 1-34
[16] W. Duch, R. Adamczak, N. Jankowski, Initialization and optimization of multilayered perceptrons, 3rd Conf. on Neural Networks and Their Applications, Kule, Poland, October 1997, pp. 105-110
[17] C.J. Mertz, P.M. Murphy, UCI repository, http://www.ics.uci.edu/pub/machine-learning-databases
[18] W. Duch, R. Adamczak, K. Grąbczewski, G. Żal, Hybrid neural-global minimization method of logical rule extraction, Journal of Advanced Computational Intelligence (in print, 1999)
[19] S. Ridella, S. Rovetta, R. Zunino, Circular Backpropagation Networks for Classification, IEEE Trans. Neural Networks 8 (1997) 84-97
[20] M.J. Kirby, R. Miranda, Circular Nodes in Neural Networks, Neural Computation 8 (1996) 390-402
[21] G. Dorffner, A Unified Framework for MLPs and RBFNs: Introducing Conic Section Function Networks, Cybernetics & Systems 25 (1994) 511-554
[22] S.-i. Amari, Information geometry of the EM and em algorithms for neural networks, Neural Networks 8 (1995) 1379-1408