

To appear in the 17th International Conference on
Machine Learning (ICML-2000)
Why Discretization Works for Naive Bayesian Classifiers
Chun-Nan Hsu chunnan@iis.sinica.edu.tw
Institute of Information Science, Academia Sinica, Nankang 115, Taipei City, Taiwan
Hung-Ju Huang gis85563@cis.nctu.edu.tw
Department of Computer and Information Science, National Chiao-Tung University, Hsinchu City 300, Taiwan
Tzu-Tsung Wong tzu-tsun@iis.sinica.edu.tw
Lunghwa Institute of Technology, Kueishan 333, Taoyuan County, Taiwan
Abstract

This paper explains why well-known discretization methods, such as entropy-based and ten-bin, work well for naive Bayesian classifiers with continuous variables, regardless of their complexities. These methods usually assume that discretized variables have Dirichlet priors. Since perfect aggregation holds for Dirichlets, we can show that, generally, a wide variety of discretization methods can perform well with insignificant difference. We identify situations where discretization may cause performance degradation and show that they are unlikely to happen for well-known methods. We empirically test our explanation with synthesized and real data sets and obtain confirming results. Our analysis leads to a lazy discretization method that can simplify the training for naive Bayes. This new method can perform as well as well-known methods in our experiment.

1. Introduction

Learning a naive Bayesian classifier (a.k.a. naive Bayes) (Langley et al., 1992) from data is an important technique for data classification. In spite of its simplicity, naive Bayes constantly outperforms competing algorithms in the experiments reported in the literature. Remarkably, in KDD-CUP-97, two of the top three contestants were based on naive Bayes (Parsa, 1997). Meanwhile, Domingos and Pazzani (1997) provided an interesting analysis of the reason why naive Bayes can perform well even though the independence assumption is violated. They reported an experiment that compared naive Bayes with several classical learning algorithms such as C4.5 on a large ensemble of real data sets. The result again showed that naive Bayes can significantly outperform those algorithms. But a more interesting aspect of their result is that it was achieved by a version of naive Bayes that uses a very "naive" discretization method to handle continuous data. That method, usually referred to as "ten-bin," divides the domain of each continuous variable into ten equal-width bins. Such a simple method was considered to be inferior and many complex methods have been proposed, but surprisingly, simple methods perform well while not much improvement is found when complex methods are used. Previously, Dougherty et al. (1995) conducted an empirical study comparing the performance of four well-known discretization methods and concluded that all of the discretization methods can significantly outperform a version of naive Bayes that assumes a normal distribution for all continuous variables. Moreover, these discretization methods performed approximately the same, though an entropy-based method (Fayyad & Irani, 1993) performed slightly better for some data sets. Kohavi and Sahami (1996) compared the entropy-based method with an error-based method, which is not considered in (Dougherty et al., 1995). Again, both methods outperformed the normal version, whereas no significant difference between their performance was found.

Previous work provided some informal discussion, but none explained why these discretization methods can work well for naive Bayes regardless of their complexities. In this paper, we present an explanation of this phenomenon. The key of our explanation is the "perfect aggregation" property of the Dirichlet distributions. A discretized continuous variable is usually assumed to have a Dirichlet prior. Perfect aggregation of Dirichlets implies that we can estimate the class-conditional probabilities of discretized intervals with an arbitrary accuracy. This explains why naive Bayes with discretization can be effective in approximating a broad range of probability distributions of continuous data. We also found that to achieve optimal performance, the probability corresponding to an interval must simulate the role of the true density in the classification. This allows us to identify situations where discretization may cause performance degradation. We show that these situations are unlikely to happen under zero-one loss for a wide variety of discretization methods, including the well-known methods. Therefore, these methods will produce similar performance for many data sets regardless of their complexities. We also empirically test our explanation on synthesized and real data sets and obtain confirming results.

Our analysis leads to a "lazy" discretization method. Experimental results show that on average this new method performs equally well as the well-known methods. The remainder of this paper presents our explanation and the lazy discretization method.
2. Dirichlet Distributions

In a naive Bayesian classifier, discrete variables as well as discretized continuous variables are usually assumed to have a Dirichlet prior. This section reviews the important properties of Dirichlet distributions. A Dirichlet distribution is formally defined as follows:

Definition 1. Random vector $\Theta = (\theta_1, \theta_2, \ldots, \theta_k)$ has a k-variate Dirichlet distribution with parameters $\alpha_j > 0$ for $j = 1, 2, \ldots, k+1$ if it has density
\[
f(\Theta) = \frac{\Gamma\big(\sum_{j=1}^{k+1}\alpha_j\big)}{\prod_{j=1}^{k+1}\Gamma(\alpha_j)} \left(\prod_{j=1}^{k}\theta_j^{\alpha_j-1}\right) (1-\theta_1-\cdots-\theta_k)^{\alpha_{k+1}-1}
\]
for $\theta_1 + \theta_2 + \cdots + \theta_k \le 1$ and $\theta_j \ge 0$ for $j = 1, 2, \ldots, k$. This distribution will be denoted $D_k(\alpha_1, \alpha_2, \ldots, \alpha_k; \alpha_{k+1})$. A Beta distribution is a univariate Dirichlet distribution and is usually denoted $\mathrm{Beta}(\alpha_1, \alpha_2)$.

The following "closure properties" of Dirichlet distributions are critical to our analysis. These properties greatly simplify the computation of the moments of the Dirichlet distribution in Bayesian analysis. Suppose random vector $\Theta = (\theta_1, \theta_2, \ldots, \theta_k)$ has a Dirichlet distribution $D_k(\alpha_1, \alpha_2, \ldots, \alpha_k; \alpha_{k+1})$. Then by Wilks (1962), any subvector $(\theta_{n_1}, \theta_{n_2}, \ldots, \theta_{n_m})$ of $\Theta$ has an m-variate Dirichlet distribution $D_m(\alpha_{n_1}, \alpha_{n_2}, \ldots, \alpha_{n_m}; \alpha - \sum_{j=1}^{m}\alpha_{n_j})$. We call this the subvector lemma. Also by Wilks (1962), the sum of the variables in any subset $\Omega$ of $\{\theta_1, \theta_2, \ldots, \theta_k\}$ has a Beta distribution with parameters $\sum_{j\in\Omega}\alpha_j$ and $\alpha - \sum_{j\in\Omega}\alpha_j$, where $\alpha = \alpha_1 + \alpha_2 + \cdots + \alpha_{k+1}$. This is called the sum-of-subset lemma.

Another important property of Dirichlets is that a Dirichlet distribution is conjugate to multinomial sampling (Heckerman, 1998). This property basically states that the posterior distribution of a Dirichlet given our observation is also a Dirichlet. Formally, let $D = \{y_1, y_2, \ldots, y_{k+1}\}$ be a data set for the outcomes in n trials, where $y_j$ denotes the number of trials turning out to be outcome j. When the prior distribution $p(\Theta)$ is a Dirichlet distribution with parameters $\alpha_j$ for $j = 1, 2, \ldots, k+1$, and the likelihood function $L(D \mid \Theta)$ follows a multinomial distribution, then the posterior distribution $p(\Theta \mid D)$ is also a Dirichlet distribution with parameters $\alpha'_j = \alpha_j + y_j$ for $j = 1, 2, \ldots, k+1$. Similarly, the Beta distribution is conjugate to binomial sampling. The expected value of $\theta_j$ given D is $E[\theta_j \mid D] = (\alpha_j + y_j)/(\alpha + n)$, for $j = 1, 2, \ldots, k+1$.
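To make the conjugate update concrete, the following minimal sketch (our own illustration, not code from the paper; it assumes NumPy is available) updates a Dirichlet prior with observed multinomial counts and computes the posterior expectations $E[\theta_j \mid D]$ described above:

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Conjugate update: Dirichlet(alpha) prior + multinomial counts
    -> Dirichlet(alpha + counts) posterior."""
    return np.asarray(alpha, float) + np.asarray(counts, float)

def posterior_mean(alpha_post):
    """E[theta_j | D] = (alpha_j + y_j) / (alpha + n)."""
    return alpha_post / alpha_post.sum()

# Example: a six-sided die with a uniform prior (alpha_j = 1) and
# observed counts y = (3, 2, 4, 1, 5, 5) over n = 20 rolls.
alpha = np.ones(6)
y = np.array([3, 2, 4, 1, 5, 5])
print(posterior_mean(dirichlet_posterior(alpha, y)))
# first entry: (1 + 3) / (6 + 20) = 0.1538...
```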
3. Perfect Aggregation

Let $\Omega$ be a subset of $\{\theta_1, \theta_2, \ldots, \theta_k\}$, and let the probability of interest q be the sum of the variables in $\Omega$; i.e., $q = \sum_{j\in\Omega}\theta_j$. Suppose the prior distribution of $\Theta = (\theta_1, \theta_2, \ldots, \theta_k)$ is a Dirichlet distribution $D_k(\alpha_1, \alpha_2, \ldots, \alpha_k; \alpha_{k+1})$. A straightforward application of the Bayesian approach is to use the training data D to update the prior distribution to obtain the posterior distribution $f(\Theta \mid D)$, which is a Dirichlet distribution $D_k(\alpha_1 + y_1, \alpha_2 + y_2, \ldots, \alpha_k + y_k; \alpha_{k+1} + y_{k+1})$. Then the posterior distribution $f(q \mid D)$ is derived from $f(\Theta \mid D)$. By the subvector lemma and the sum-of-subset lemma, $f(q \mid D)$ is a Beta distribution:
\[
\mathrm{Beta}\Big(\sum_{j\in\Omega}(\alpha_j + y_j),\; \alpha + n - \sum_{j\in\Omega}(\alpha_j + y_j)\Big) \qquad (1)
\]
However, by the properties of the Dirichlet distribution, there exists a simpler alternative to compute $f(q \mid D)$. We can first derive the prior distribution of the probability of interest, $f(q)$, from the prior distribution $f(\Theta)$. Then we can convert the training data D into a new set of training data D' in terms of q by computing the sum of the observations of our interest in D. Now, we can use D' to update the prior distribution of $f(q)$ to obtain the posterior distribution $f(q \mid D')$.

We can show that in general it is always the case that $f(q \mid D) = f(q \mid D')$. Since the multinomial likelihood function $L(D \mid \Theta)$ implies that the trials for obtaining data set D are all independent, the likelihood function $L(D' \mid q)$ will follow a binomial distribution. By the sum-of-subset lemma, the prior distribution of q is a Beta distribution $\mathrm{Beta}(\sum_{j\in\Omega}\alpha_j, \alpha - \sum_{j\in\Omega}\alpha_j)$ when $\Theta$ has a Dirichlet distribution. Since the Beta distribution is conjugate to binomial sampling, the posterior distribution $f(q \mid D')$ will have a Beta distribution with parameters $\sum_{j\in\Omega}(\alpha_j + y_j)$ and $\alpha + n - \sum_{j\in\Omega}(\alpha_j + y_j)$. This is exactly the same as Equation (1). This property is always true for the Dirichlet distribution and was first derived by Azaiez (1993); it is called "perfect aggregation" (Iwasa et al., 1987; Wong, 1998).

For example, suppose that we are interested in the probability of showing an odd number in throwing a die. Let $\theta_j$ be the probability that the die shows number j in a trial, and let $y_j$ be the number of trials in which the die shows j in n trials. Then the probability of interest can be represented as $q = \theta_1 + \theta_3 + \theta_5$. In the straightforward approach, we derive the distribution $f(q \mid D)$ from the data $\{y_1, y_2, \ldots, y_6\}$, while in the alternative approach, we can use $D' = \{n, y_1 + y_3 + y_5\}$ instead and will obtain the same result. An implication of perfect aggregation is that with Dirichlet distributions, to estimate the posterior probability of a union of disjoint events, there is no need to estimate the probabilities of the individual events.
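As a concrete check of the die example, the following sketch (our own illustration, assuming NumPy) compares the two routes: updating the full Dirichlet and then aggregating, versus aggregating the counts first and updating a Beta. Both yield the same posterior parameters for q, which is exactly what perfect aggregation asserts:

```python
import numpy as np

alpha = np.ones(6)                 # Dirichlet prior over the six faces
y = np.array([4, 2, 3, 5, 1, 5])   # observed counts, n = 20 rolls
n = y.sum()
odd = [0, 2, 4]                    # indices of faces 1, 3, 5

# Route 1: full Dirichlet posterior, then aggregate (sum-of-subset lemma).
alpha_post = alpha + y
a1 = alpha_post[odd].sum()
b1 = alpha_post.sum() - a1

# Route 2: aggregate the data first (D' = {n, y1 + y3 + y5}), then update a Beta.
a0, b0 = alpha[odd].sum(), alpha.sum() - alpha[odd].sum()
k = y[odd].sum()
a2, b2 = a0 + k, b0 + (n - k)

assert a1 == a2 and b1 == b2       # same Beta posterior for q
print(a1 / (a1 + b1))              # E[q | D] = 11/26 here
```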
4. Discretization for Naive Bayes

Naive Bayes classifies a feature vector $\mathbf{x}$ by selecting the class c that maximizes the posterior probability
\[
p(c \mid \mathbf{x}) \propto p(c) \prod_{x\in\mathbf{x}} p(x \mid c), \qquad (2)
\]
where x is a variable in $\mathbf{x}$ and $p(x \mid c)$ is the class-conditional density of x given class c. Let $\Theta$ denote the vector whose elements are the parameters of the density $p(x \mid c)$. In a Bayesian learning framework, we assume that $\Theta$ is an uncertain variable (Heckerman, 1998) that can be learned from a training data set. This estimation is at the heart of training a naive Bayes.

Suppose x is a discrete variable with k+1 possible values. In principle, the class label c of the data vector $\mathbf{x}$ dictates the probability of the value of x. Thus the appropriate p.d.f. is a multinomial distribution and its parameters are a set of probabilities $\{\theta_1, \theta_2, \ldots, \theta_{k+1}\}$ such that for each possible value $X_j$, $p(x = X_j \mid c) = \theta_j$ and $\sum_j \theta_j = 1$. Now, let $\Theta = (\theta_1, \theta_2, \ldots, \theta_k)$. We choose a Dirichlet distribution with parameters $\alpha_1, \ldots, \alpha_{k+1}$ as the prior for $\Theta$. Given a training data set, we can update $p(x = X_j \mid c)$ by its expected value:
\[
\hat{p}(x = X_j \mid c) = \frac{\alpha_j + y_{cj}}{\alpha + n_c}, \qquad (3)
\]
where $n_c$ is the number of training examples belonging to class c and $y_{cj}$ is the number of class c examples whose $x = X_j$. Since a Dirichlet distribution is conjugate to multinomial sampling, after the training, the posterior distribution of $\Theta$ is still a Dirichlet, but with the updated parameters $\alpha_j + y_{cj}$ for all j. This property allows us to incrementally train a naive Bayes.

In practice, we usually choose the Jaynes prior (Almond, 1995), $\alpha_j = 0$ for all j, and have $\hat{p}(x \mid c) = y_{cj}/n_c$. However, when the training data set is too small, this often yields $\hat{p}(x \mid c) = 0$ and impedes the classification. To avoid this problem, another popular choice is $\alpha_j = 1$ for all j. This is known as smoothing or Laplace's estimate (Cestnik & Bratko, 1991).
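As a small illustration of Equation (3) (our own sketch, not code from the paper), the estimator below counts class-conditional frequencies of a discrete attribute and applies a symmetric Dirichlet prior: alpha=0 corresponds to the Jaynes prior and alpha=1 to Laplace's estimate.

```python
from collections import Counter

def conditional_estimates(values, labels, domain, alpha=1.0):
    """Estimate p(x = X_j | c) = (alpha_j + y_cj) / (alpha + n_c) for a
    discrete variable; alpha=0 is the Jaynes prior, alpha=1 Laplace smoothing."""
    estimates = {}
    for c in set(labels):
        xs = [v for v, l in zip(values, labels) if l == c]
        counts = Counter(xs)
        n_c = len(xs)
        total_alpha = alpha * len(domain)   # equivalent sample size
        estimates[c] = {X: (alpha + counts[X]) / (total_alpha + n_c)
                        for X in domain}
    return estimates

# Example: a three-valued attribute observed under two classes.
values = ['a', 'b', 'a', 'c', 'a', 'b']
labels = [ 0,   0,   0,   1,   1,   1 ]
print(conditional_estimates(values, labels, domain=['a', 'b', 'c']))
```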
If x is a continuous variable, a conventional approach is to assume that $p(x \mid c) = N(x; \mu_c, \sigma_c^2)$, where $N(x; \mu_c, \sigma_c^2)$ is the p.d.f. of a normal distribution. In this case, training involves learning the parameters $\mu_c$ and $\sigma_c$ from the training data. This approach has been shown to be less effective than discretization when $p(x \mid c)$ is not normal (John & Langley, 1995), and discretization is often used. Generally, discretization involves partitioning the domain of x into k+1 intervals as a pre-processing step. Then we can treat x as a discrete variable with k+1 possible values and conduct the training and classification.

More precisely, let $I_j$ be the j-th discretized interval. Training and classification in a naive Bayes with discretization use $\hat{p}(x \in I_j \mid c)$, as in Equation (3), as an estimate of $\hat{p}(x \mid c)$ for each continuous variable. This is equivalent to assuming that after discretization, the class-conditional density of x has a Dirichlet prior. We call this assumption the "Dirichlet discretization assumption". Apparently, this assumption holds for all well-known discretization methods, including ten-bin, entropy-based, etc. See (Dougherty et al., 1995) for a comprehensive survey.

The Dirichlet discretization assumption is reasonable because of another implicit assumption described below. Let $f(x \mid c)$ be the "true" probability density function of $p(x \mid c)$, and assume that $f(x \mid c)$ is integrable everywhere. Then for any discretized interval $I_j$, the "true" probability is $p(x \in I_j \mid c) = \int_{I_j} f(x \mid c)\,dx$. By choosing an equivalent sample size $\alpha$, the Dirichlet parameter corresponding to the random variable "$x \in I_j \mid c$" is $\alpha \int_{I_j} f(x \mid c)\,dx$. We call this assumption the "partition independence assumption." By the partition independence assumption, any discretization of x can have a Dirichlet prior. The partition independence assumption also implies that for any interval, the Dirichlet parameter corresponding to this interval depends only on the area below the curve of the p.d.f. $f(x \mid c)$, but is independent of the shape of the curve in the interval.
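The following sketch (our own illustration, assuming NumPy; it is not the authors' code) shows the ten-bin discretization used inside a naive Bayes: each continuous attribute is split into ten equal-width bins, $\hat{p}(x \in I_j \mid c)$ is estimated with Equation (3), and Equation (2) is evaluated in log space for classification.

```python
import numpy as np

class TenBinNaiveBayes:
    """Naive Bayes with equal-width ('ten-bin') discretization (a sketch)."""

    def __init__(self, n_bins=10, alpha=1.0):
        self.n_bins, self.alpha = n_bins, alpha

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        # Pre-processing step: equal-width cut points per attribute.
        self.edges_ = [np.linspace(col.min(), col.max(), self.n_bins + 1)[1:-1]
                       for col in X.T]
        B = self._discretize(X)
        self.log_prior_, self.log_cond_ = {}, {}
        for c in self.classes_:
            Bc = B[y == c]
            self.log_prior_[c] = np.log(len(Bc) / len(X))
            # Equation (3): (alpha_j + y_cj) / (alpha + n_c), per bin and attribute.
            self.log_cond_[c] = [
                np.log((np.bincount(Bc[:, d], minlength=self.n_bins) + self.alpha)
                       / (self.n_bins * self.alpha + len(Bc)))
                for d in range(X.shape[1])]
        return self

    def _discretize(self, X):
        return np.column_stack([np.digitize(col, e)
                                for col, e in zip(np.asarray(X, float).T, self.edges_)])

    def predict(self, X):
        preds = []
        for row in self._discretize(X):
            # Equation (2) in log space: log p(c) + sum_d log p(x_d in I | c).
            scores = {c: self.log_prior_[c]
                      + sum(self.log_cond_[c][d][b] for d, b in enumerate(row))
                      for c in self.classes_}
            preds.append(max(scores, key=scores.get))
        return np.array(preds)
```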
5. Performance of Discretization Methods

In this section, we discuss the factors that affect the performance of different discretization methods. When the density of a continuous variable x is known, the point density $p(x \mid c)$ will be used in Equation (2) for classification. But when we discretize the variable, the probability $\hat{p}(x \in I \mid c)$ of the discretized interval I will be used instead. This probability is assumed to have a Dirichlet prior due to the Dirichlet discretization assumption. Since perfect aggregation holds for Dirichlet distributions, $\hat{p}(x \in I \mid c)$ can be estimated arbitrarily close to the real $p(x \in I \mid c)$.

This can be proved as follows. Let $f(x \mid c)$ be the "true" density of $p(x \mid c)$; then
\[
p(x \in I \mid c) = \int_{I} f(x \mid c)\,dx \approx \sum_{\delta} \hat{p}(x \in \delta \mid c),
\]
where $\delta$ is a disjoint sub-interval in I. Arbitrary accuracy can be achieved by increasing the number of sub-intervals in I. However, by perfect aggregation, there is no need to estimate $\hat{p}(x \in \delta \mid c)$ for each $\delta$. Instead, we can estimate $\hat{p}(x \in I \mid c)$ directly and obtain the same result. By the partition independence assumption, this result is independent of the shape of the curve of $f(x \mid c)$. Therefore, discretization methods can more accurately approximate a distribution than a naive Bayes that assumes normal for continuous variables. This shows why discretization can be effective.

Suppose we have the "true" densities $p(x \mid c)$; it can be shown that Equation (2) (i.e., the Bayes rate) gives the optimal classification accuracy achievable by any classifier (Duda & Hart, 1973). With discretization, to achieve the optimal classification accuracy, $\hat{p}(x \in I \mid c)$ must simulate the role of $p(x \mid c)$ by distinguishing the class that gives x high density from the class that gives x low density. More precisely, if $p(x = X_j \mid c_i)$ is the largest among all classes $c_i$, then at least we should require that $\hat{p}(x \in I \mid c_i)$ is the largest among all classes, too. Otherwise, performance degradation may still occur.

Consider a simple case with only one continuous variable x with normally distributed classes. We can plot the curves of the conditional densities as the ones shown in Figure 1. The domain of x can be divided into regions by decision boundaries, which correspond to the intersection points of the curves where ties occur among the largest conditional densities. The optimal classification is to pick the class c such that $p(x \mid c)$ is the largest, and the pick of the class is different when x is on different sides of a decision boundary.

Suppose a discretization method creates an interval I that contains a decision boundary. For x values in the interval but on different sides of the decision boundary, the classifier should pick a different class for each value. But we will have the same $\hat{p}(x \in I \mid c)$ for these x values. That is, x values at one of the two sides of the decision boundary will be misclassified. Therefore, to minimize the distortion, the boundaries of intervals that contain decision boundaries should be close to the decision boundaries, or these intervals should be as narrow as possible.

On the other hand, the widths of intervals that contain no decision boundary can be as large as possible. Their widths do not affect the performance much, but when these intervals are too narrow, the number of examples of the training data set that fall in the intervals will be too small to allow accurate estimation of $\hat{p}(x \in I \mid c)$. In this case, discretization will cause performance degradation. A loose bound can be derived as follows. Let n be the size of the training data set. The number of examples that fall in an interval I has a binomial distribution with parameters n and $p(x \in I \mid c)$, and thus has expected value $n \cdot p(x \in I \mid c)$. If we want the estimate to be accurate to two digits, we will need $n > 100/p(x \in I \mid c)$. This bound is loose, but it is sufficient to illustrate our idea. The entropy-based method and 1RD (see (Dougherty et al., 1995)) are designed to avoid narrow intervals, and thus their performance can be slightly better.
naive Bayes that assumes normal for continuous vari-
In summary, to avoid performance degradation, a dis-
ables. This shows why discretization can be e ective.
cretization method should partition the domain of a
Suppose we have the \true" densities of p(xj c), it can
continuous variable into intervals such that their cut
be shown that Equation (2) (i.e., the Bayes rate) is
points are close to decision boundaries to minimize the
the optimal classi cation accuracy achievable by any
distortion due to the discretization, and their width
classi er (Duda & Hart, 1973). With discretization,
should be large enough to cover su cient examples.
to achieve the optimal classi cation accuracy, p(x 2
^
The above analysis can be extended to multivariate
Ijc) must simulate the role of p(xjc) by distinguishing
cases and the \right" choice of cut points can be widely
the class that gives x high density from the class that
spread. Since a naive Bayes takes all variables into
gives x low density. More precisely, if p(x = Xj j ci)
account simultaneously, the impact of a \wrong" dis-
is the largest among all classes ci, then at least we
cretization for one variable can be absorbed by other
should require that p(x 2 Ij ci) is the largest among all
^
variables under zero-one loss performance measure.
classes, too. Otherwise, performance degradation may
Also, due to the way that a naive Bayes learns, correct
still occur.
classi cation of a given example depends only on the
Consider a simple case with only one continuous vari-
interval containing the values of that example, and is
able x with normally distributed classes. We can plot
completely independent of how other regions are dis-
the curves of the conditional densities as the one shown
cretized. As a result, a wide variety of discretization
in Figure 1. The domain of x can be divided into
methods can have similar performance in general.
regions by decision boundaries, which correspond to
the intersection points of the curves where ties occur
6. Empirical Evidence
among the largest conditional densities. The optimal
classi cation is to pick the class c such that p(xj c) is
In this section, we report the empirical veri cation of
the largest, and the pick of the class is di erent when
our analysis on synthesized and real data sets.
x is on di erent sides of a decision boundary.
[Figure 1. Conditional distributions of the synthesized data: the class-conditional densities p(x|c1), p(x|c2), and p(x|c3) plotted over the domain of x, which is partitioned into the intervals S1, S2, S3, S4, and S5.]

6.1 Controlled Experiments

We synthesized an artificial data set composed of 1500 examples with three different classes. Each class has the same number of examples. Each example has two continuous attributes A1, A2 and a class label. The values of A1 and A2 are generated by normal distributions {N(10, 2), N(15, 2)} for class 1, {N(15, 2), N(20, 2)} for class 2, and {N(20, 2), N(10, 2)} for class 3. For each attribute, we divided its domain into five intervals as shown in Figure 1 such that S2 and S4 contain the decision boundaries of each variable, that is, the intersection points that are also the peak values of all classes.
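A sketch of how such a data set could be generated (our own illustration of the setup described above, assuming NumPy; the authors' exact sampling code is not given in the paper, and we read N(m, 2) as mean m with standard deviation 2):

```python
import numpy as np

def make_synthetic(n_per_class=500, seed=0):
    """Three classes, two continuous attributes A1 and A2; each attribute is
    normally distributed with the class-specific means from Section 6.1."""
    rng = np.random.default_rng(seed)
    means = {1: (10, 15), 2: (15, 20), 3: (20, 10)}
    X, y = [], []
    for label, (m1, m2) in means.items():
        a1 = rng.normal(m1, 2, n_per_class)   # assumption: 2 = std. deviation
        a2 = rng.normal(m2, 2, n_per_class)
        X.append(np.column_stack([a1, a2]))
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)

X, y = make_synthetic()   # 1500 examples, 500 per class
```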
In the first experiment, we investigated how the distance of the cut points of S2 and S4 from the intersection points affects the performance. S2 and S4 are initialized with a small width and then their boundaries are extended by a constant percentage of the initial width of S3. The intersection points within S2 and S4 were kept in the middle such that the regions on both sides of the intersection points within the intervals always have the same width. For each resulting discretization, we ran a five-fold cross-validation to obtain an average accuracy. Figure 2 plots the results. As we predicted, when S2 and S4 become wide, the accuracies drop rapidly.

[Figure 2. Accuracies drop when we increase the widths of the intervals that contain the intersection points: accuracy plotted against the width of S2 and S4.]

The second experiment is to see if there are other factors that may result in significant changes of the accuracies in addition to the width of S2 and S4. We first kept S2 and S4 unchanged and randomly partitioned S1, S3, and S5 into at most fifty intervals. We repeated the random partition thirty times and ran a five-fold cross-validation for each repetition. The variance of the resulting accuracies is 0.27. The average accuracy is 96.25%, with the maximum and minimum accuracies being 96.67% and 95.47%, respectively. We repeated the same experiment, but this time, the entire domains are randomly partitioned. By contrast, the variance of the resulting accuracies jumps to 3.84. The average accuracy is 93.79%, with the maximum and minimum accuracies being 96.80% and 82.73%, respectively. The results show that when the widths of S2 and S4 are fixed, the partitions of the other intervals have only a tiny impact on the classification accuracy.

The third experiment is to show that "equal-width bins" such as ten-bin can be effective in many cases. In fact, if the number of bins is not too "extreme," the performance will not be significantly different. Our analysis predicts that when the number of intervals is two, it is likely that these intervals will contain intersection points and the performance will be generally poor. As the number of bins increases, the accuracies will improve and reach a plateau. When the number of bins is extremely large, the widths of the bins will be too small for all intervals and the accuracies will drop gradually. That is, the curve of the accuracy as a function of the number of bins will appear like an upside-down "J". Moreover, how fast the curve drops will depend on the size of the training data set. A small data set will cause the accuracy to drop faster. We conducted the experiment on two data sets: our synthesized data set (with 1500 examples) and its subset (with 300 examples). The resulting curves are given in Figure 3, which confirms our prediction.

[Figure 3. Experimental results match our prediction; also, with fewer data, accuracies drop faster as the number of bins increases. Accuracy versus number of bins for the 1500-example data set and its 300-example subset.]

6.2 Experiments on Real Data Sets

We selected ten real data sets from the UCI ML repository (Blake & Merz, 1998) for our experiments. All attributes of those data sets are continuous. The first experiment is similar to the third experiment described in Section 6.1. In this experiment, we partitioned each continuous variable into two equal-width bins at first, and then increased the number of equal-width bins by one until 50. Figure 4 shows the results.

[Figure 4. Accuracy results by increasing the number of equal-width bins: accuracy versus number of bins, in one panel for german, glass, vehicle, waveform, and wine, and in another for breast-w, iris, liver, pima, and yeast.]

Just as we expected, the curves for most data sets match the pattern given in Figure 3, especially for large data sets. All accuracy figures are obtained by using five-fold cross-validation. The variances of the accuracies for most data sets are found to be surprisingly small. As long as the numbers of equal-width bins are not too large or too small, the accuracies are approximately the same. The curves of some data sets (glass, iris, liver, wine) are rougher than others, since the variances of the accuracy appear to depend on the size of the data sets. In general, small data sets (with less than 300 examples) have relatively large variances. As discussed earlier, this is because the examples are not sufficient for accurate probability estimation. Data set glass has a particularly zigzag curve because the number of examples in this data set is small. In addition, this data set has seven classes, which produce too many intersection points, and the accuracies become sensitive to discretization.

In the above experiment, we focused on domains that are entirely continuous. We also performed the same experiment for some data sets with mixed domains. The curves of those mixed-domain data sets again generally match the pattern. We examined the data sets that do not follow the pattern and found that the continuous variables in those exceptional data sets have almost no impact in classifying an example. When we remove the continuous variables from those exceptional data sets, their classification accuracies are almost as good as before. When a continuous variable in a data set is nearly dormant in classification, the distributions of different classes of that variable will be similar everywhere. Therefore, a naive Bayes with only two bins can still achieve a high accuracy.

However, the results presented above do not exclude the possibility that by carefully determining the cut points, the performance can be improved drastically. Therefore, we conducted a similar experiment with the entropy-based method. Since the entropy-based method always selects a cut point from the training data at a boundary where the classes switch (Fayyad & Irani, 1993), it is more likely for this method to select a cut point near an intersection point.

For each data set, we recursively called the entropy heuristic to determine the cut points and reported the five-fold cross-validation accuracy. In each recursion, cut points are selected in the intervals resulting from the previous calls, without applying the MDL stopping criterion. Figure 5 shows the resulting curves. The horizontal axis indicates how many levels of the recursion have been invoked. At the l-th level of the recursion, a continuous variable will have at most $2^l$ cut points. However, when all of the data contained in an interval have the same class, no cut point will be selected. Therefore, the number of levels of the recursion does not correspond to the number of intervals created, and each variable may be partitioned into a different number of intervals.
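A sketch of the recursive entropy splitting described above (our own illustration; it follows Fayyad & Irani's idea of cutting only at class boundaries but, as in the experiment, applies no MDL stopping criterion):

```python
import numpy as np

def entropy(labels):
    """Class entropy of a set of labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut(x, y):
    """Midpoint between adjacent class-boundary values that minimizes the
    weighted class entropy of the two sub-intervals, or None if no cut exists."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best, best_score = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1] or y[i] == y[i - 1]:   # only cut where classes switch
            continue
        left, right = y[:i], y[i:]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if score < best_score:
            best, best_score = (x[i - 1] + x[i]) / 2, score
    return best

def recursive_cuts(x, y, level):
    """Apply the entropy heuristic recursively for `level` levels,
    with no MDL stopping criterion."""
    if level == 0 or len(np.unique(y)) < 2:
        return []
    cut = best_cut(x, y)
    if cut is None:
        return []
    mask = x <= cut
    return (recursive_cuts(x[mask], y[mask], level - 1) + [cut]
            + recursive_cuts(x[~mask], y[~mask], level - 1))
```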
[Figure 5. Accuracy results by using entropy to determine cut points: accuracy versus the level of recursion, in one panel for german, glass, vehicle, waveform, and wine, and in another for breast-w, iris, liver, pima, and yeast.]

The resulting curves show that the accuracies with different levels of recursion are surprisingly similar for all data sets. Their shapes match the pattern of the curves in Figure 3. This again confirms that discretization methods can perform similarly even though they select radically different sets of cut points.

7. Lazy Discretization

The next question is whether any possible improvement can be made from this analysis. This section describes our preliminary investigation of this question. We propose a lazy discretization method based on perfect aggregation. This method waits until one or more test data are given to determine the cut points for each continuous variable. Moreover, this method produces only a pair of cut points surrounding the value of each test datum for each variable. That is, it creates only one interval (denoted as I) and leaves the other region untouched. From the training data, we can estimate $\hat{p}(x \in I \mid c)$ by Equation (3) and use this estimate to classify the test datum. This method can invoke different instantiations to determine the cut points. For example, we can select a pair of cut points such that the value of x is in the middle and the width of the interval is the same as the width of the intervals created by ten-bin. We will call this instantiation "lazy ten-bin." Similarly, we can have "lazy entropy," "lazy bin-log," etc.
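A sketch of lazy ten-bin as we read it from the description above (our own illustration, assuming NumPy; details such as the prior over the interval and its complement, and the handling of the domain ends, are our assumptions rather than the authors' specification):

```python
import numpy as np

def lazy_ten_bin_predict(X_train, y_train, x_test, alpha=1.0, n_bins=10):
    """Classify one test vector with 'lazy ten-bin': for each attribute,
    form a single interval of ten-bin width centred on the test value and
    estimate p(x in I | c) from training counts in the spirit of Equation (3)."""
    X_train = np.asarray(X_train, float)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    # Width of one ten-bin interval per attribute, taken from the training range.
    widths = (X_train.max(axis=0) - X_train.min(axis=0)) / n_bins
    scores = {}
    for c in classes:
        Xc = X_train[y_train == c]
        n_c = len(Xc)
        log_score = np.log(n_c / len(X_train))               # log p(c)
        for d, v in enumerate(np.asarray(x_test, float)):
            lo, hi = v - widths[d] / 2, v + widths[d] / 2     # interval I around v
            y_cI = np.sum((Xc[:, d] >= lo) & (Xc[:, d] <= hi))
            # Equation (3), treating I and its complement as two outcomes
            # with symmetric prior alpha (our smoothing assumption).
            log_score += np.log((alpha + y_cI) / (2 * alpha + n_c))
        scores[c] = log_score
    return max(scores, key=scores.get)
```

Only the counts of class c examples falling inside each interval I are needed here, which is exactly the training simplification that perfect aggregation licenses.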
This discretization method is derived from perfect aggregation and other properties of the Dirichlet distribution. Suppose that the partition independence assumption holds. Then by the sum-of-subset lemma of the Dirichlet distribution, "$x \mid c$" will have a Beta prior with parameters $\alpha\int_{I} p(x \mid c)\,dx$ and $\alpha\big(1 - \int_{I} p(x \mid c)\,dx\big)$, where c is a class and $\alpha$ is an equivalent sample size. By perfect aggregation, we can estimate $\hat{p}(x \in I \mid c)$ by counting how many class c examples have $x \in I$ and how many do not. In other words, there is no need to check the exact value of x for those examples whose value of x is not in I, and we may thus simplify the training effort.

We empirically compared lazy ten-bin, ten-bin, entropy-based, and a normal version of naive Bayes on the real data sets. We ran a five-fold cross-validation for each data set and reported the average and the standard deviation of the accuracies. Table 1 gives the results, which reveal that the lazy method can perform as well as the well-known discretization methods.

8. Conclusions and Future Work

In this paper, we presented an explanation of why discretization works for naive Bayes based on the perfect aggregation property of the Dirichlet distributions, and why well-known discretization methods perform about the same regardless of their complexities. The explanation is based on perfect aggregation of Dirichlets, which ensures that naive Bayes with discretization can effectively approximate the distribution of a continuous variable. We also presented a new lazy discretization method and showed that it works equally well. We note that our result may not be applicable to other learning algorithms such as decision trees.

In our future work, we plan to investigate whether our analysis can shed some light on handling continuous variables in general Bayesian networks.
Table 1. Accuracies of naive Bayes with different discretization methods.

Data set    Lazy Ten-bin    Ten Bins        Entropy         Continuous
breast-w    96.00 ± 3.08    94.37 ± 2.16    94.37 ± 1.93    92.79 ± 2.61
german      74.00 ± 2.98    73.80 ± 2.77    74.70 ± 2.84    70.00 ± 2.49
glass       64.43 ± 5.29    62.34 ± 6.03    67.12 ± 3.60    47.58 ± 4.96
iris        96.00 ± 2.49    94.67 ± 3.40    95.33 ± 1.63    96.00 ± 3.89
liver       64.64 ± 4.26    64.35 ± 5.84    65.22 ± 2.25    56.81 ± 5.09
pima        76.18 ± 4.84    75.01 ± 4.64    74.47 ± 2.92    74.34 ± 2.57
vehicle     61.23 ± 1.75    62.06 ± 1.39    62.29 ± 2.15    43.26 ± 3.82
waveform    81.75 ± 3.22    81.08 ± 2.87    82.17 ± 2.87    82.00 ± 2.92
wine        97.19 ± 1.81    97.71 ± 2.14    97.14 ± 1.81    97.14 ± 2.52
yeast       48.44 ± 2.77    47.30 ± 2.85    44.82 ± 3.76    39.39 ± 5.31
Average     75.99           75.27           75.76           69.93
Win:Loss    -               8:2             5:5             9:1
Acknowledgements

The research reported here was supported in part by the National Science Council in Taiwan under Grant No. NSC 89-2213-E-001-031.

References

Almond, R. (1995). Graphical belief modelling. New York: Chapman and Hall.

Azaiez, M. N. (1993). Perfect aggregation in reliability models with Bayesian updating. Doctoral dissertation, Department of Industrial Engineering, University of Wisconsin-Madison, Madison, WI.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases.

Cestnik, B., & Bratko, I. (1991). On estimating probabilities in tree pruning. In Machine learning: EWSL-91, European working session on learning, 138-150. Berlin, Germany: Springer-Verlag.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.

Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Machine Learning: Proceedings of the Twelfth International Conference. San Francisco, CA: Morgan Kaufmann.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley and Sons.

Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (pp. 1022-1027).

Heckerman, D. (1998). A tutorial on learning with Bayesian networks. In M. I. Jordan (Ed.), Learning in graphical models, 301-354. Boston: Kluwer Academic Publishers.

Iwasa, Y., Levin, S., & Andreasen, V. (1987). Aggregation in model ecosystems: Perfect aggregation. Ecological Modeling, 37, 287-302.

John, G., & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (pp. 338-345).

Kohavi, R., & Sahami, M. (1996). Error-based and entropy-based discretization of continuous features. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (pp. 114-119). Portland, OR.

Langley, P., Iba, W., & Thompson, K. (1992). An analysis of Bayesian classifiers. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 223-228). San Jose, CA: AAAI Press.

Parsa, I. (1997). KDD-CUP 1997 presentation.

Wilks, S. S. (1962). Mathematical statistics. New York: Wiley and Sons.

Wong, T.-T. (1998). Perfect aggregation in dependent Bernoulli systems with Bayesian updating. Doctoral dissertation, Department of Industrial Engineering, University of Wisconsin-Madison, Madison, WI.

