Fast Minimum Training Error Discretization

Tapio Elomaa (elomaa@cs.helsinki.fi)
Juho Rousu (rousu@cs.helsinki.fi)
Department of Computer Science, P.O. Box 26, FIN-00014 University of Helsinki, Finland
Abstract

The need to discretize a numerical range into class-coherent intervals is a frequently encountered problem. Training Set Error (TSE) is one of the commonly used impurity functions in this task. We show that in order to find a TSE-optimal discretization one only needs to examine a small subset of all cut points, called alternation points. On the other hand, we prove that failing to check an alternation point may lead to a suboptimal discretization. Alternation points can be identified efficiently once the data has been ordered. Our empirical evaluation demonstrates that the number of alternations in numerical ranges is typically much lower than the total number of cut points. In our experiments the discretization algorithm running on top of alternation points was significantly faster than the same algorithm on all cut points. Thus, the sublinear number of evaluated threshold candidates further reduces the practical time requirement of TSE optimization.
1. Introduction

Consider having a set of labeled data points (with duplicate values allowed) from some range on the real axis. In order to distill some general information on the relation between the data values and their labels, the numerical range is discretized into labeled intervals. An upper bound on the number of allowed intervals is usually given as an external parameter.
Morgan Kaufmann 2002; C. Sammut and A. Hoffmann (eds.): Proc. 19th International Conference on Machine Learning, ICML 2002 (Sydney, Australia), pp. 131-138.
As the goodness criterion, the evaluation function used to discriminate between the candidate discretizations, we use the Training Set Error (subsequently TSE for short): choose the discretization in which the interval labeling differs least often from that of the data points.
In this paper we examine efficient ways to obtain discretizations that are optimal with respect to TSE. TSE is arguably the most commonly used attribute evaluation function in machine learning algorithms (Auer, Holte & Maass, 1995; Brodley, 1995; Lubinsky, 1995; Kearns et al., 1997). Using TSE in growing the hypothesis is prone to overfit it to the training examples. On the other hand, the principle of empirical risk minimization is based on choosing the hypothesis that minimizes training error (Vapnik & Chervonenkis, 1971). Under certain conditions one can guarantee that the minimum-error hypothesis will also have a small generalization error.
A union of intervals of a numerical range has sometimes been used as a hypothesis class in theoretical frameworks (Maass, 1994; Kearns et al., 1997; Lozano, 2000), but more often the bounded-arity discretization problem is encountered as a subproblem in learning more complex hypotheses.
For example, in learning decision trees one needs to partition the domains of numerical attributes into a modest number of intervals (Breiman et al., 1984; Quinlan, 1986). In classifier induction, discretization of numerical domains is a potential time-consumption bottleneck, since in the general case the number of possible discretizations is exponential in the number of interval threshold candidates within the domain.
With respect to many commonly used evaluation functions, numerical ranges can be optimally discretized in time quadratic in the number of interval threshold candidates using a dynamic programming algorithm (Fulton, Kasif & Salzberg, 1995; Zighed, Rakotomalala & Feschet, 1997; Elomaa & Rousu, 1999), but only TSE is known to be optimizable in linear time (Fulton et al., 1995; Auer, 1997; Birkendorf, 1997). For quadratic-time algorithms, pruning cut point candidates in preprocessing has turned out to be a useful way to speed up the search (Fayyad & Irani, 1992; Elomaa & Rousu, 1999; 2000).
In this paper we explore the possibilities and limitations of further speed-up of minimum TSE discretization via preprocessing the data.
The remainder of this paper is organized as follows. In the next section we introduce the minimum training error discretization problem more exactly and review subquadratic-time search algorithms that have been put forward for finding TSE-optimal discretizations. Pruning threshold candidates without losing the possibility to recover an optimal discretization is considered in Section 3. In Section 4 we prove that when using TSE it is possible to prune more thresholds than when using other common evaluation functions. In Section 5 we report empirical experiments on the pruned set of threshold candidates. The final section presents the concluding remarks of this paper.
2. Minimum TSE Discretization

Let us first define the discretization problem more formally. A sample S = {(x_1, y_1), ..., (x_n, y_n)} consists of n labeled real values (given in the increasing order of x). For each (x, y), x ∈ ℝ and y is the label of x from the set of classes C = {1, ..., m}. A k-interval discretization of the sample is generated by picking k − 1 interval thresholds or cut points T_1 < T_2 < ... < T_{k−1}, T_j ∈ (x_min, x_max), where x_min = min{x | (x, y) ∈ S} and x_max = max{x | (x, y) ∈ S}. Moreover, empty intervals are not allowed. The set of k − 1 thresholds defines a partition ⊎_{i=1}^{k} S_i of the set S as follows:

    S_i = { (x, y) ∈ S | x ≤ T_1 }             if i = 1,
          { (x, y) ∈ S | T_{i−1} < x ≤ T_i }   if 1 < i < k,
          { (x, y) ∈ S | x > T_{k−1} }         if i = k.
In this paper we use Training Set Error to determine the goodness of a partition. Let δ_j(S) = |{(x, y) ∈ S | y ≠ j}| denote the error, or the number of disagreements, with respect to class j in the set S. That is, if all instances in S were predicted to belong to class j, we would make δ_j(S) errors on S.
Furthermore, let δ(S) = min_{j ∈ C} δ_j(S) denote the minimum error on S. A class j ∈ C is called a majority class of S if predicting class j leads to the minimum number of errors on S, that is, δ_j(S) = δ(S). Note that more than one class can qualify as a majority class. The majority classes of a set S are denoted by maj_C(S) = {j ∈ C | δ_j(S) = δ(S)}.
Given a k-interval partition ⊎_{i=1}^{k} S_i of S, where each interval is labeled by (one of) its majority classes, its Training Set Error is given by

    TSE(⊎_{i=1}^{k} S_i) = Σ_{i=1}^{k} δ(S_i).

Intuitively, TSE is the number of training instances falsely classified in the partition when each interval is labeled by one of its majority classes.
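To make the definitions concrete, the quantities δ_j(S), δ(S), and TSE can be computed directly from labeled (value, label) pairs. The following is a minimal illustrative sketch (not the authors' code), evaluated on the 27-point sample of Figure 1 partitioned at its two majority alternation points (after values 2 and 5):

```python
from collections import Counter

def delta_j(interval, j):
    """delta_j(S): disagreements if every point in S is predicted as class j."""
    return sum(1 for _, y in interval if y != j)

def delta(interval):
    """delta(S) = min_j delta_j(S): errors made by predicting a majority class."""
    counts = Counter(y for _, y in interval)
    return len(interval) - max(counts.values())

def tse(partition):
    """TSE: total error when each interval is labeled by a majority class."""
    return sum(delta(interval) for interval in partition)

# The 27 data points of Figure 1 (sorted by value).
sample = list(zip(
    [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8, 8, 9, 9, 9, 9],
    "XXXXYYYYYYXXXYYYYXXXXYXXXYY"))

# Three intervals cut after values 2 and 5.
partition = ([p for p in sample if p[0] <= 2],
             [p for p in sample if 2 < p[0] <= 5],
             [p for p in sample if p[0] > 5])

print(tse(partition))  # prints 7 (errors 0 + 4 + 3 per interval)
```

Labeling the whole sample with its majority class X would instead make δ(S) = 13 errors, so the three-interval partition clearly helps.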
The global minimum error discretization problem is to find a partition ⊎_{i=1}^{k} S_i of S that has the minimum TSE value over all partitions of S. The maximum number of intervals k may be given as a parameter. Then the problem is to find the TSE-optimal partition among those that have at most k intervals. This is called bounded-arity discretization.
In the following we review efficient (subquadratic) algorithms that have been proposed for minimum training error discretization of numerical ranges. All algorithms require an O(n log n) time sorting step prior to discretization.
Note that it is not possible to discretize an unordered sample of (x, y) ∈ ℝ × C pairs faster than sorting; given such a hypothetical discretization algorithm, one could sort arbitrary real numbers by assigning each data point a unique class label and asking the discretization algorithm to create a TSE-optimal discretization. The output would have to define the sequence of bins, and the sorted data could be recovered from it in linear time.
Maass (1994) was the first to devise a subquadratic algorithm for minimizing TSE in bounded-arity discretization. In his algorithm a balanced binary tree, to the leaves of which the data points are assigned, is constructed in O(n log n + nk²) time. Optimal interval assignment and labeling are then found in O(k²) time using this data structure.
Kearns et al. (1997) presented an algorithm requiring O(n log n) time. Their algorithm is based on the observation that, given an optimal partition of arity k in the two-class case, the optimal partition of arity k − 2 either has the labels of both its first and last interval flipped, or one of the other (internal) intervals j has its label flipped, in which case one interval is composed out of the three intervals j − 1, j, j + 1 of the original k-interval partition. Using dynamic programming to compose the partition from the best lower-arity partitions of subsets yields a linear-time algorithm in the two-class case. Such an algorithm was first presented by Fulton, Kasif, and Salzberg (1995).
In processing the sorted data from left to right, we only have to decide, at suitable threshold points, whether the uncommitted data points are combined into the last interval of the already constructed discretization or whether they start a new interval in the discretization. The decision, of course, is based on which of the alternatives yields a smaller training error. Using this approach and keeping track of the optimal discretization of the processed data for all arities 1, ..., k lets us, in the end, choose the optimal discretization of the numerical range with at most k intervals. The time complexity of this approach is clearly O(kn).

Figure 1. A sequence of data points sorted according to their numerical values, with the class labels (X and Y) of the data points, and the resulting sequence of data bins with the respective class distributions (X/Y). Interval thresholds can be set at the bin borders.

    value:  1 2 2 3 3 3 4 4 4 4 4 4 5 5 5 5 5 6 7 7 7 8 8 9 9 9 9
    label:  X X X X Y Y Y Y Y Y X X X Y Y Y Y X X X X Y X X X Y Y

    bin:     1    2    3    4    5    6    7    8    9
    X/Y:    1/0  2/0  1/2  2/4  1/4  1/0  3/0  1/1  2/2
How to generalize the dynamic programming computation to the case where there are m > 2 classes was shown by Auer (1997). The sufficient modification is to maintain separately for each class j (1 ≤ j ≤ m), at each threshold candidate, the optimal discretizations of arity 1, ..., k such that the last interval is labeled with j. Taking multiple classes into account raises the time requirement of the algorithm to O(kmn). Auer (1997) also showed how the space complexity of the algorithm can be kept low. Birkendorf (1997) has given a similar algorithm as a special case of a more general optimization problem. Below, we experiment with this Auer-Birkendorf algorithm.
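The multi-class dynamic program is easy to state over the sequence of bins. The sketch below is an illustrative reconstruction in the spirit of the algorithm described above, not the authors' implementation: for each arity a ≤ k and each class j labeling the last interval, it maintains the minimum error of discretizing the bins seen so far.

```python
from itertools import groupby

def optimal_tse(points, k):
    """Minimum TSE over all partitions with at most k intervals.

    points: (value, label) pairs sorted by value.
    An O(kmn)-style dynamic program: E[a][j] is the best error for the
    bins processed so far with arity a and last interval labeled j.
    """
    INF = float("inf")
    labels = sorted({y for _, y in points})
    # One class-count vector per distinct value (a bin).
    bins = []
    for _, grp in groupby(points, key=lambda p: p[0]):
        members = list(grp)
        bins.append({j: sum(1 for _, y in members if y == j) for j in labels})
    E = None
    for b in bins:
        size = sum(b.values())
        err = {j: size - b[j] for j in labels}   # cost of labeling this bin j
        if E is None:
            E = {1: dict(err)}                   # first bin starts interval 1
            continue
        newE = {}
        for a in range(1, k + 1):
            # Starting a new interval: best previous state of arity a - 1.
            best_new = min(E[a - 1].values()) if (a - 1) in E else INF
            for j in labels:
                # Extending the last interval: same arity, same label.
                extend = E[a][j] if a in E else INF
                cost = min(extend, best_new)
                if cost < INF:
                    newE.setdefault(a, {})[j] = cost + err[j]
        E = newE
    return min(v for d in E.values() for v in d.values())

sample = list(zip(
    [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8, 8, 9, 9, 9, 9],
    "XXXXYYYYYYXXXYYYYXXXXYXXXYY"))
print(optimal_tse(sample, 3))  # 7: the global minimum of the Figure 1 sample
```

On the Figure 1 sample the optimum drops from 13 (one interval) to 10 (two intervals) to 7 (three or more intervals), matching the partition at the two majority alternation points.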
The linear-time search algorithms reviewed above are asymptotically optimal in the number of examples. One cannot guarantee finding the optimal discretization without examining all of the cut point candidates, and they cannot be found without going through all of the data. For the dynamic programming scheme, a worst-case lower bound of Ω(kmn) can be derived from the general literature concerning dynamic programming optimization (Elomaa & Rousu, 2001).
However, in practice there are many potential thresholds that can be overlooked without the risk of losing the optimal discretization. Hence, the sublinear number of threshold candidates gives a possibility to reduce the time requirement of the search. In the following we review approaches to optima-preserving pruning of cut point candidates.
3. Pruning Threshold Candidates in Preprocessing

As noted before, the processing of a numerical value range starts with sorting of the data points. If one could make its own partition interval out of each data point in the sorted sequence, this discretization would have zero training error. However, one cannot, nor wants to, discern between all data points.
Only those that differ in their value can be separated from each other. Consider, for example, the data set shown in Figure 1. There are 27 integer-valued data points. They are instances of two classes, X and Y. Interval thresholds can only be set in between those points where the data point value changes. Therefore, we can preprocess the data into bins. There is one bin for each existing data point value. Within each bin we record the class distribution of the instances that belong to it. The class distribution information suffices to evaluate the goodness of the partition; the actual data set does not need to be maintained.
The sequence of bins has the minimal attainable misclassification rate. However, the same rate can usually be obtained with a smaller number of intervals.
The analysis of the entropy function by Fayyad and Irani (1992) has shown that cut points embedded into class-uniform intervals need not be taken into account; only the end points of such intervals, the boundary points, need to be considered to find the optimal discretization.
Elomaa and Rousu (1999) showed that the same is true for a large class of commonly used evaluation functions, including also TSE. The analysis can be used in preprocessing in a straightforward manner: we merge together adjacent class-uniform bins with the same class label to obtain example blocks (see Figure 2). The boundary points of the value range are the borders of its blocks. Block construction still leaves all bins with a mixed class distribution as their own blocks.
Subsequently, a more general property was also proved for TSE and some other evaluation functions (Elomaa & Rousu, 2000): segment borders, points that lie in between two adjacent bins with different relative class distributions, are the only points that need to be taken into account. It is easy to see that segment borders are a subset of boundary points.
Example segments are easily obtained from bins by comparing the relative class distributions of adjacent bins (see Figure 2).
Figure 2. The blocks (above) and segments (below) in the sample of Figure 1. Block borders are the boundary points of the numerical range and segment borders are a subset of them.

    blocks (X/Y):    3/0  1/2  2/4  1/4  4/0  1/1  2/2
    last value:       2    3    4    5    7    8    9

    segments (X/Y):  3/0  3/6  1/4  4/0  3/3
    last value:       2    4    5    7    9
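The binning and merging steps above can be sketched as a single pair of left-to-right passes. This is an illustrative reconstruction, not the authors' code; on the Figure 1 sample it reproduces the 9 bins, 7 blocks, and 5 segments of Figures 1 and 2:

```python
from fractions import Fraction

def make_bins(points):
    """One class-count dict per distinct value (points sorted by value)."""
    bins, cur_val = [], object()
    for x, y in points:
        if x != cur_val:
            cur_val = x
            bins.append({})
        bins[-1][y] = bins[-1].get(y, 0) + 1
    return bins

def merge(bins, same):
    """Left-to-right merge of adjacent bins satisfying predicate `same`."""
    out = [dict(bins[0])]
    for b in bins[1:]:
        if same(out[-1], b):
            for y, c in b.items():
                out[-1][y] = out[-1].get(y, 0) + c
        else:
            out.append(dict(b))
    return out

def uniform_same_class(a, b):
    """Blocks: both bins are class-uniform with the same label."""
    return len(a) == 1 and len(b) == 1 and set(a) == set(b)

def same_relative_distribution(a, b):
    """Segments: the relative class distributions are equal."""
    na, nb = sum(a.values()), sum(b.values())
    return all(Fraction(a.get(y, 0), na) == Fraction(b.get(y, 0), nb)
               for y in set(a) | set(b))

sample = list(zip(
    [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8, 8, 9, 9, 9, 9],
    "XXXXYYYYYYXXXYYYYXXXXYXXXYY"))
bins = make_bins(sample)
blocks = merge(bins, uniform_same_class)
segments = merge(bins, same_relative_distribution)
print(len(bins), len(blocks), len(segments))  # 9 7 5
```

Exact fractions are used for the distribution comparison so that, e.g., 1/2 X and 2/4 X compare equal without floating-point noise.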
4. Further Prepruning Opportunities for Training Set Error

An earlier proof shows that only segment borders need to be considered when trying to find TSE-optimal partitions (Elomaa & Rousu, 2000). In this section we prove that even some segment borders can safely be disregarded. Let a majority alternation point¹ be the border in between two consecutive bins (or, as well, blocks or segments) that have different majority classes.
More exactly, let S_1 = {(x, y) ∈ S | x = v_1} and S_2 = {(x, y) ∈ S | x = v_2}, v_1, v_2 ∈ ℝ, be two adjacent bins. There is a majority alternation point in between S_1 and S_2 if and only if maj_C(S_1) ∩ maj_C(S_2) = ∅; that is, the sets of majority classes in S_1 and S_2 are disjoint.
For example, in the dataset of Figure 2 there are only two majority alternations. Figure 3 shows the data organized into sequences between majority alternations, which help to find TSE-optimal partitions.
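Detecting majority alternation points is a single left-to-right scan testing maj_C of adjacent bins for disjointness. A minimal sketch (illustrative only, with bins as class-count dicts as in the earlier examples), run on the bins of Figure 1:

```python
def majority_classes(counts):
    """maj_C(S): the set of classes whose prediction yields minimum error."""
    top = max(counts.values())
    return {j for j, c in counts.items() if c == top}

def majority_alternation_points(bins):
    """Indices i with a majority alternation between bins[i] and bins[i+1]."""
    return [i for i in range(len(bins) - 1)
            if majority_classes(bins[i]).isdisjoint(majority_classes(bins[i + 1]))]

# The nine bins of Figure 1 (X/Y counts per distinct value 1..9).
bins = [{'X': 1}, {'X': 2}, {'X': 1, 'Y': 2}, {'X': 2, 'Y': 4}, {'X': 1, 'Y': 4},
        {'X': 1}, {'X': 3}, {'X': 1, 'Y': 1}, {'X': 2, 'Y': 2}]
print(majority_alternation_points(bins))  # [1, 4]: the borders after values 2 and 5
```

Note that the borders of the last two bins are not majority alternations: their majority-class sets {X} and {X, Y} overlap, so the scan correctly skips them.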
Theorem 1. The partition defined by all majority alternation points in the sample S has the minimum TSE value and has the minimal number of intervals.
Proof: Let us first show that all majority alternation points must be cut points in the minimal TSE-optimal partition. Assume that Δ = ⊎_{i=1}^{k} S_i is a partition of the sample S that does not contain all majority alternation points. Then in Δ there must exist an interval S_i = {(x, y) ∈ S | T_{i−1} < x ≤ T_i} that contains the majority alternation points a_1, ..., a_r, r ≥ 1, which satisfy T_{i−1} < a_1 < ... < a_r < T_i. Let ⊎_{h=1}^{r+1} Q_h denote the partition of S_i induced by the r alternation points. From the definition of a majority alternation point it follows that for any class j there is a subset Q_{h_j} within ⊎_{h=1}^{r+1} Q_h for which δ_j(Q_{h_j}) > δ(Q_{h_j}), that is, j is not a majority class in Q_{h_j}. Thus, δ_j(S_i) > Σ_{h=1}^{r+1} δ(Q_h), and since j is arbitrary, also δ(S_i) > Σ_{h=1}^{r+1} δ(Q_h). Therefore, partitioning S_i into r + 1 subintervals at these majority alternations reduces the number of misclassifications in S_i and, thus, gives a better partition than Δ. Hence, Δ cannot be TSE-optimal. This shows that any TSE-optimal partition has a cut point in each majority alternation point.

¹ Note that the term alternation has a different meaning here from, e.g., that in (Fulton et al., 1995; Kearns et al., 1997).

Figure 3. The majority alternations in the sample of Figure 1.

    sequence (X/Y):  3/0  4/10  7/3
    last value:       2     5    9

Let us now assume that a TSE-optimal partition Δ = ⊎_{i=1}^{k} S_i has also other cut points than majority alternations. Let the cut points in the partition be T_1, ..., T_{k−1} and let us, further, define T_0 = −∞ and T_k = ∞. Let T_i be one cut point in Δ that is not a majority alternation point. Now T_i separates two intervals in the partition, S_i = {(x, y) ∈ S | T_{i−1} < x ≤ T_i} and S_{i+1} = {(x, y) ∈ S | T_i < x ≤ T_{i+1}}. Since T_i is not a majority alternation, there must exist a class j for which δ_j(S_i) = δ(S_i) and δ_j(S_{i+1}) = δ(S_{i+1}). Thus, labeling both intervals S_i and S_{i+1} with the same label j results in a minimum error partition. Therefore, removing T_i will not affect the number of misclassifications. The same holds for all cut points that are not majority alternations. Hence, we have shown that for the globally minimal TSE value one needs to cut on each majority alternation point and, to keep the number of intervals at minimum, no other thresholds except majority alternation points can be used. □
The above theorem shows that, when the number of intervals in the partition is not bounded, the best partition is the one that cuts on each majority alternation point. However, when an optimal partition of bounded arity is required, examining majority alternation points alone does not suffice; one has to consider the frequency of the minority classes as well. In the following, we generalize the concept of a majority alternation point so that handling the bounded-arity case becomes possible.
Let P and Q be two adjacent bins with relative class frequency distributions D_P = {p_1, ..., p_m} and D_Q = {q_1, ..., q_m}, respectively. There is an alternation point in between P and Q if there is no index set I = {i_1, ..., i_m} such that p_{i_1} ≥ p_{i_2} ≥ ... ≥ p_{i_m} and q_{i_1} ≥ q_{i_2} ≥ ... ≥ q_{i_m}. In other words, an alternation occurs in between two bins if the ordering of the classes in descending order of frequency is different in them; that is, there exists a pair of classes h and j such that p_h > p_j and q_h < q_j.
Let us call the numerical range intervals induced by the set of alternation points alternating segments. Clearly, a majority alternation is a special case of an alternation. In the two-class setting the two concepts coincide. Thus the number of alternations is always at least as high as that of majority alternations.
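By the equivalence above, a pairwise check over classes suffices to detect an alternation between two adjacent bins. A sketch (illustrative only, comparing relative frequencies as exact fractions):

```python
from itertools import combinations
from fractions import Fraction

def is_alternation(P, Q):
    """True iff an alternation point lies between adjacent bins P and Q:
    some pair of classes h, j has p_h > p_j but q_h < q_j."""
    nP, nQ = sum(P.values()), sum(Q.values())
    classes = sorted(set(P) | set(Q))
    p = {j: Fraction(P.get(j, 0), nP) for j in classes}
    q = {j: Fraction(Q.get(j, 0), nQ) for j in classes}
    # Strictly opposite strict orderings give a negative product of differences.
    return any((p[h] - p[j]) * (q[h] - q[j]) < 0
               for h, j in combinations(classes, 2))

# The majority class stays 'a', but the frequency order of the two minority
# classes flips, so this border is an alternation without being a majority
# alternation:
print(is_alternation({'a': 3, 'b': 2, 'c': 1}, {'a': 3, 'b': 1, 'c': 2}))  # True
```

The example illustrates why, with more than two classes, alternations can outnumber majority alternations: only the minority-class ordering changes across the border.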
Next we show that only alternations need to be considered when searching for the TSE-optimizing partition.
Theorem 2. For any numerical range and each k ≥ 2 there is a TSE-optimal partition with at most k intervals that is defined on alternation points.
Proof: Let L, P, Q, and R form a sequence of adjacent subsets along the numerical range. Let the relative class distributions of sets L and R, D_L and D_R, be arbitrary, and let D_P = {p_1, ..., p_m} and D_Q = {q_1, ..., q_m} be such that there is no alternation point in between the sets P and Q. Let us now consider splitting the set into two intervals and labeling the left-hand side with class h and the right-hand side with class j. In such a situation, let δ_{h,j}(S ⊎ T) denote the error of a binary partition with S as the left-hand side and T as the right-hand side. The errors of the partitions are

    δ_{h,j}(L ⊎ (P ∪ Q ∪ R)) = δ_h(L) + Σ_{S=P,Q,R} δ_j(S),
    δ_{h,j}((L ∪ P) ⊎ (Q ∪ R)) = Σ_{S=L,P} δ_h(S) + Σ_{S=Q,R} δ_j(S),
    δ_{h,j}((L ∪ P ∪ Q) ⊎ R) = Σ_{S=L,P,Q} δ_h(S) + δ_j(R).

By assumption, the point in between P and Q is not an alternation. Therefore, from the definition of an alternation it follows that for any pair of classes h and j, 1 ≤ h, j ≤ m, either 1) p_h ≤ p_j and q_h ≤ q_j, or 2) p_h ≥ p_j and q_h ≥ q_j. In the first case δ_j(P) ≤ δ_h(P) and, consequently,

    δ_{h,j}(L ⊎ (P ∪ Q ∪ R)) ≤ δ_{h,j}((L ∪ P) ⊎ (Q ∪ R)).

In the second case δ_j(Q) ≥ δ_h(Q), and

    δ_{h,j}((L ∪ P) ⊎ (Q ∪ R)) ≥ δ_{h,j}((L ∪ P ∪ Q) ⊎ R).

Hence, the partition (L ∪ P) ⊎ (Q ∪ R) is always at most as good as one of the two other partitions. Since the classes h and j and the sets L and R are arbitrary, we have shown that in any partition a cut point that is not an alternation point can be replaced with another cut point without increasing the error. Thus, one can slide cut points to the left or to the right until an alternation point or the end of the interval is encountered. Hence, a TSE-optimal binary partition of an interval is defined on alternation points. Since in the TSE-optimal k-interval discretization the embedded binary partitions are TSE-optimal, the claim follows. □
The consequence of the above theorem is that only those cut points that are alternations need to be considered when looking for the bounded-arity optimal TSE partition. Hence the sample can be processed into alternating segments.
We show next that no alternations can be proven suboptimal without considering the context in which the cut point appears; that is, which other cut points are present in the k-interval discretization of the range.
Theorem 3. For each alternation point there is a context in which it is the TSE-optimal cut point.
Proof: Let L, P, Q, and R be adjacent subsets along a numerical range. Let there be a pair of classes h and j for which it holds that p_h > p_j and q_h < q_j. That is, there is an alternation point in between P and Q. Now, let us choose the class distribution for L so that h is the majority class of L ∪ P ∪ Q. Such a distribution is easily generated by choosing L to be large enough and to consist of instances of a single class h. Now, δ_j(P) > δ_h(P) and, consequently,

    δ_{h,j}(L ⊎ (P ∪ Q ∪ R)) > δ_{h,j}((L ∪ P) ⊎ (Q ∪ R)).

Similarly, the class distribution of R can be set so that j is the majority class of P ∪ Q ∪ R. Since δ_j(Q) < δ_h(Q), we get

    δ_{h,j}((L ∪ P) ⊎ (Q ∪ R)) < δ_{h,j}((L ∪ P ∪ Q) ⊎ R).

Thus, the optimal binary split with the left side labeled with class h and the right side labeled with class j has its cut point on the alternation point in between P and Q. Since h and j are majority classes of the left and the right sides of all binary splits of the subsets L, P, Q, and R, the split has the minimum TSE. Since h and j are arbitrary, the claim follows. □
The above theorem shows that the usefulness of an alternation point cannot be judged by examining the two adjacent bins alone. This shows that while a fast left-to-right scan over the data examining two bins at a time is sufficient to find alternation points, discovering an equally simple and fast algorithm for pruning out some of the alternation points is unlikely.
5. Empirical Evaluation

In this section we examine alternation points with real-world data. We first test, for 29 well-known data sets from the UCI data repository (Blake & Merz, 1998), what the relations are between the average numbers of bin borders, boundary points, segment borders, alternation points, and majority alternations per numerical attribute. Then the practical speed-up gained by examining alternations rather than bin borders, boundary points, or segment borders is inspected.
5.1 On Finding Alternation Points Efficiently

Let us first make a note about the preprocessing algorithms used in these tests. All preprocessing algorithms first extracted bins from the sorted example sequence. Then boundary points, segment borders, or alternation points were extracted. The preprocessing algorithms for boundary points and segment borders both run in time O(n + mV), where V is the number of bins. The trivial algorithm for alternation points is one which checks the frequencies of up to (m² − m)/2 pairs of classes in two adjacent bins until a disagreement in the class ordering is found. The processing of the entire range takes O(n + m²V) time using this approach.

A faster algorithm for finding alternation points is obtained by sorting (in a left-to-right pass) the class distribution histograms of the bins using bucket sort in amortized O(n) time and checking a bucket and the class frequency distribution of the adjacent interval to the right for a disagreement in class ordering. The bucket is merged into the interval if no alternation point is found. This O(n + mV) time alternation point extraction meets the asymptotic bound for extracting bins from the data. However, the coefficients in this approach are quite large. The UCI data sets mostly have few classes, which prohibits time savings for the linear-time approach. Therefore, we have used the trivial alternation point detection in our experiments.
5.2 Empirical Results

Figure 4 depicts the results of the first experiment. Over all 29 test domains the average reduction in cut point candidates when moving from segment borders to alternation points is approximately 40%, and close to 60% when compared to all cut points.

Figure 4. The average number of bin borders (the figures on the right) and the relative numbers of boundary points (black bars below), segment borders (white bars), alternation points (gray bars), and majority alternations (dark gray bars on top) per numerical attribute of the domain. The white figure on top of the dark gray bar is the number of classes.

    m   domain               avg. bin borders
    29  Abalone                       863.7
     2  Adult                       3,673.7
     5  Annealing                      27.5
     2  Australian                    188.2
     6  Auto insurance                 61.4
     2  Breast Wisconsin                9.9
     2  Census income              12,864.3
     2  Colic                          86.3
     2  Diabetes                      156.8
     2  Euthyroid                     123.8
     2  German                        145.9
     6  Glass                         115.3
     5  Heart C                        30.5
     2  Heart H                        25.1
     2  Hepatitis                      54.3
     2  Hypothyroid                   165.7
     3  Iris                           30.8
    26  Letter recognition             16.0
     2  Liver                          54.7
     5  Page blocks                   909.2
     6  Satellite                      76.3
     7  Segmentation                  137.7
     7  Shuttle                       123.2
     2  Sonar                         187.6
     4  Vehicle                        79.4
    11  Vowel                         623.5
     3  Waveform                      714.1
     3  Wine                           98.2
    10  Yeast                          51.5

The number of segment borders is only slightly smaller than that of boundary points. The number of alternations and majority alternations is the same on two-class domains (e.g., Adult, Australian, Breast Wisconsin, etc.), which is clear from their definition. For some two-class domains (e.g., Breast Wisconsin, Euthyroid, Heart H, and Hypothyroid) this number is significantly (ca. 75%) lower than that of segment borders.
On other multiclass domains (e.g., Heart C, Letter recognition, Satellite, and Yeast) there is a significant difference in the numbers of alternations and majority alternations.

Figure 5. The effects of the number of classes on the efficiency gain obtained by using alternations rather than bin borders. (The plot shows the ratio of the number of alternating segments to the number of bins, A/V, against the number of classes m.)
A striking result is that in many of those domains that have many classes (e.g., Abalone, Auto insurance, Letter recognition, Vowel, and Yeast) the number of alternations is not much smaller than that of segment borders. The number of majority alternations, though, is usually still somewhat smaller even for these domains. An explanation for this is that the more classes there are, the less common it is that two adjacent segments have the same frequency order for all of them. The majority class may still be the same in two adjacent segments even though the number of classes is high.
Figure 5 plots the relation between the number of classes and the amount of reduction obtained in the number of cut point candidates by using alternation points. Each numerical attribute within our test domains is represented by one dot. The largest reductions are obtained in domains with few classes. In the two data sets with over 25 classes the reduction stays below 25%.
The correlation between the ratio A/V, where A is the number of alternating segments, and the number of classes is quite clear. Hence, one may expect that in domains with many classes the time savings are not as great as in domains with fewer classes.
Now that we know that the number of alternation points is often even substantially lower than that of boundary points and segment borders, the question remains how much, if at all, we can benefit from using alternations rather than their alternatives. Figure 6 plots the total running time (preprocessing and search) of the Auer-Birkendorf algorithm on alternation points against the same algorithm operating on bin borders.

Figure 6. The time required in searching for the TSE-optimal discretization when operating on alternations contrasted to that on all cut points. The effects of the allowed maximum arity (k) and the relation A/V are also taken into account.
It can be clearly observed that as the relative number of alternations reduces or the number of intervals allowed in the discretization grows, the advantage gained by using alternation points increases significantly. The overhead associated with finding the alternation points dictates that if either of these parameters is not favorable, then discretization based on alternations is less efficient than that on easily-found bin borders. For k ≥ 8 an A/V ratio below 0.9 promises gains, while in binary discretization no gains are obtained irrespective of the A/V ratio.
In the experiments we also observed that the discretizations defined on majority alternations were always optimal, even for domains with more than two classes. This indicates that the situation in which a cut point must be set on an alternation point that is not a majority alternation is very rare in practice. So, if the optimal score is not required, one can safely choose the cut points from among the majority alternations.
6. Conclusion

Examining segment borders, a subset of boundary points, is necessary and sufficient in searching for the optimal partition of a value range with respect to a strictly convex evaluation function (Elomaa & Rousu, 2002). We showed that with TSE, which is not strictly convex, only a well-defined subset of segment borders needs to be examined: only majority alternations and alternation points need to be considered when searching for the global and the bounded-arity optimum, respectively.

On the other hand, we were able to show that in bounded-arity discretization no alternations can be ignored with TSE without considering the placement of adjacent cut points. The Auer-Birkendorf algorithm, incidentally, does exactly this kind of bookkeeping: it keeps track of the best context to the left for each class and each arity. Hence, improving the dynamic programming scheme on the conceptual level seems difficult.

On some real-world domains the number of alternation points was discovered to be significantly smaller than that of segment borders. Practical gains were obtained by using alternation points rather than segment borders.
References

Auer, P. (1997). Optimal splits of single attributes (Unpublished manuscript). Institute for Theoretical Computer Science, Graz University of Technology.

Auer, P., Holte, R. C., & Maass, W. (1995). Theory and application of agnostic PAC-learning with small decision trees. Proceedings of the Twelfth International Conference on Machine Learning (pp. 21-29). San Francisco: Morgan Kaufmann.

Birkendorf, A. (1997). On fast and simple algorithms for finding maximal subarrays and applications in learning theory. Computational Learning Theory, Proceedings of the Third European Conference (pp. 198-209). Berlin, Heidelberg: Springer-Verlag.

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Department of Information and Computer Science, University of California at Irvine.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Pacific Grove: Wadsworth.

Brodley, C. (1995). Automatic selection of split criterion during tree growing based on node location. Proceedings of the Twelfth International Conference on Machine Learning (pp. 73-80). San Francisco: Morgan Kaufmann.

Elomaa, T., & Rousu, J. (1999). General and efficient multisplitting of numerical attributes. Machine Learning, 36, 201-244.

Elomaa, T., & Rousu, J. (2000). Generalizing boundary points. Proceedings of the Seventeenth National Conference on Artificial Intelligence (pp. 570-576). Menlo Park: AAAI Press.

Elomaa, T., & Rousu, J. (2001). On the computational complexity of optimal multisplitting. Fundamenta Informaticae, 47, 35-52.

Elomaa, T., & Rousu, J. (2002). Linear-time preprocessing in optimal numerical range partitioning. Journal of Intelligent Information Systems, 18, 55-70.

Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8, 87-102.

Fulton, T., Kasif, S., & Salzberg, S. (1995). Efficient algorithms for finding multi-way splits for decision trees. Proceedings of the Twelfth International Conference on Machine Learning (pp. 244-251). San Francisco: Morgan Kaufmann.

Kearns, M., Mansour, Y., Ng, A. Y., & Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Machine Learning, 27, 1-50.

Lozano, F. (2000). Model selection using Rademacher penalization. Proceedings of the Second ICSC Symposium on Neural Networks. Berlin: ICSC Academic.

Lubinsky, D. J. (1995). Increasing the performance and consistency of classification trees by using the accuracy criterion at the leaves. Proceedings of the Twelfth International Conference on Machine Learning (pp. 371-377). San Francisco: Morgan Kaufmann.

Maass, W. (1994). Efficient agnostic PAC-learning with simple hypotheses. Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (pp. 67-75). New York: ACM Press.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 264-280.

Zighed, D., Rakotomalala, R., & Feschet, F. (1997). Optimal multiple intervals discretization of continuous attributes for supervised learning. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (pp. 295-298). Menlo Park: AAAI Press.