
Jacek Kluska

Department of Electrical and Computer Engineering

Rzeszow University of Technology, Poland

RBF-based ANN and SVM models in

the ovarian cancer data classification


Contents


Introduction

Theory of Support Vector Machines (SVM)

Radial-Basis Function Network (RBFN) approach

The ovarian cancer input data

Comparative study of SVM and RBFN on ovarian
cancer data classification

Conclusions


Introduction


The ovarian cancer database represents a population of women who suffered from ovarian cancer.

Ovarian cancer is the most common form of cancer of the female reproductive organs in Poland.

Its causes include gene mutations and the occurrence of the disease among the forebears of the patient.

The morbidity of ovarian cancer has increased.

The occurrence of ovarian cancer is estimated at a frequency of:

3/100000 registered women under the age of 37, and

40/100000 over the age of 37.


Ovarian cancer input data


The medical data set was provided as the result of cooperation between Rzeszów University of Technology and Prof. Andrzej Skret and Dr. Tomasz Lozinski from the Obstetrics and Gynecology Department, State Hospital Rzeszow, Poland.

The input data represent a population of 199 women who were treated for ovarian cancer, with 17 different medical parameters registered for each patient during the treatment process.

The details follow later on.


Motivation


What are the classification capabilities of SVM and RBFN on medical data?

Which model generalizes better under a Cross-Validation procedure?

Which model is more accurate in determining:

sensitivity,

specificity?


Theory of SVM

Find the GSH (generalized separating hyperplane) by solving the QP problem:

\[
\max_{\boldsymbol{\alpha}} \left( \sum_{i=1}^{l} \alpha_i \;-\; \frac{1}{2}\,\boldsymbol{\alpha}^{\mathsf T}\mathbf{H}\,\boldsymbol{\alpha} \right),
\qquad 0 \le \alpha_i \le C, \quad i = 1,\dots,l,
\qquad \boldsymbol{\alpha}^{\mathsf T}\mathbf{Y} = 0,
\]

where

\[
\mathbf{H} = \big[\, y_i\, y_j\, K(\mathbf{x}_i, \mathbf{x}_j) \,\big]_{l \times l}
\]

is the Hessian matrix and C is a design parameter.


SVM – problem solution

Let us assume the ERBF kernel

\[
K(\mathbf{u}, \mathbf{v}) = \exp\!\left( -\frac{\lVert \mathbf{u} - \mathbf{v} \rVert^2}{2\sigma^2} \right).
\]

Solution: the optimal classifier

\[
y(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i \in SV} \alpha_i^{*}\, y_i\, K(\mathbf{x}_i, \mathbf{x}) + b^{*} \right),
\]

where

\[
b^{*} = -\frac{1}{2} \sum_{i \in SV} \alpha_i^{*}\, y_i \big[ K(\mathbf{x}_i, \mathbf{x}_r) + K(\mathbf{x}_i, \mathbf{x}_s) \big],
\qquad r, s:\; y_r = 1,\; y_s = -1,
\qquad SV = \{\, i \mid \alpha_i > 0 \,\}.
\]
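As an illustration (a sketch in Python, not the original software; the function names are ours), the ERBF kernel and the resulting decision function can be written as:

```python
import numpy as np

def erbf_kernel(u, v, sigma):
    """ERBF kernel K(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def svm_decision(x, support_vectors, alphas, labels, b, sigma):
    """Optimal classifier: sign(sum_{i in SV} alpha_i* y_i K(x_i, x) + b*)."""
    s = sum(a * y * erbf_kernel(xi, x, sigma)
            for a, y, xi in zip(alphas, labels, support_vectors))
    return 1 if s + b >= 0 else -1
```

The multipliers alphas and the bias b come from the QP solution above (in the study, computed by SMO).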


RBF network architecture

solverb function (Matlab's Neural Networks Toolbox)

Chen, S., C. F. N. Cowan, and P. M. Grant, "Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks", IEEE Transactions on Neural Networks, Vol. 2, No. 2, March 1991, pp. 302–309.

[Figure: network with inputs x1, ..., x17, a hidden RBF layer, and a linear output layer]


RBF network (Matlab)


net = newrb(P,T,GOAL,SPREAD)

P – matrix of input vectors,

T – matrix of target vectors.

newrb iteratively creates a radial basis network
one neuron at a time.

Neurons are added to the network until the sum-
squared error falls beneath an error goal or a
maximum number of neurons has been reached.

The larger the input space (the number of inputs and the ranges over which those inputs vary), the more radbas neurons are required.
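The greedy strategy described above can be sketched as follows. This is an illustration only, not the Toolbox implementation (newrb/solverb uses the OLS algorithm of Chen et al.); the plain Gaussian activation with spread sc is a simplifying assumption, and all names are ours. At each step the candidate input vector that most reduces the sum-squared error becomes a new RBF center, and the linear output weights are refit by least squares:

```python
import numpy as np

def radbas_design(P, centers, spread):
    # Gaussian activations exp(-||p - c||^2 / (2 spread^2)) for every
    # (pattern, center) pair; P is (n, d), centers is (k, d).
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * spread ** 2))

def newrb_sketch(P, T, goal, spread, max_neurons=None):
    """Greedily add RBF neurons until SSE <= goal or the neuron cap is hit."""
    n = len(P)
    max_neurons = max_neurons or n
    centers_idx = []
    while len(centers_idx) < max_neurons:
        best = None
        for i in range(n):                      # try each remaining pattern
            if i in centers_idx:
                continue
            C = P[np.array(centers_idx + [i])]
            G = np.column_stack([radbas_design(P, C, spread), np.ones(n)])
            w, *_ = np.linalg.lstsq(G, T, rcond=None)   # refit linear layer
            sse = float(((G @ w - T) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, i, w)
        sse, i, w = best
        centers_idx.append(i)                   # keep the best new center
        if sse <= goal:
            break
    return P[np.array(centers_idx)], w, sse
```

The returned weight vector includes a bias term as its last entry, mirroring the linear output layer of the network in the figure.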


Disease parameters


The medical data represent a group of 199 women who were treated for ovarian cancer. The entire set of 199 cases is divided into two subgroups. This stems from the fact that there is a survival threshold of 60 months of treatment: if a patient survives beyond this period, she is considered to have recovered from the disease.

This allows the classification problem to be treated as a separation of the data into two classes.

17 parameters were registered for each patient. Of the 199 patients:

68 survived past 60 months,

131 died before this period.


Disease parameters in detail

1. FIGO staging of ovarian cancer ∈ {1,2,3,4}
2. observation-examination ∈ {0,1}; 60 months, complete (1); less than 60 months, incomplete (0)
3. hospital ∈ {0,1}; state clinical hospitals (1), others (0)
4. age of hospitalized women ∈ {22,25,...,81}
5. hysterectomy (removal of uterus) ∈ {0,1}; performed (1), not performed (0)
6. adnexectomy (complete removal of ovary and salpinx) ∈ {0,1}; performed (1), not performed (0)
7. full exploration of abdomen ∈ {0,1}; possible (0), performed (1)
8. type of surgery ∈ {1,2,3}; hysterectomy (1), adnexectomy (2), exploration only (3)
9. appendectomy (removal of appendix) ∈ {0,1}; performed (1), not performed (0)
10. removal of intestine ∈ {0,1}; performed (1), not performed (0)
11. degree of debulking ∈ {1,2,3}; entire (3), up to 2 cm (1), more than 2 cm (2)
12. mode of surgery ∈ {1,2,3}; intraperitoneal (1), extraperitoneal (2), trans-tumor resection (3)
13. histological type of tumor ∈ {1,2,3,4,5}
14. grading of tumor ∈ {1,2,3,4}; GI (1), GII (2), GIII (3), data unavailable (4)
15. type of chemotherapy ∈ {1,2,3}
16. radiotherapy ∈ {0,1}; performed (1), not performed (0)
17. "second look" surgery ∈ {0,1}; performed (1), not performed (0)


Comparative study of SVM and RBFN

on ovarian cancer data classification

Input space dimension: n = 17.

Cardinality of the training data set: l = 199.

Labels in SVM:

If a patient survived past 60 months, the label is "1".

If a patient died before 60 months, the label is "-1".

Labels in NN:

If a patient survived past 60 months, the RBFN output signal is "1".

If a patient died before 60 months, the RBFN output signal is "0".


Generalization performance

Software based on J. Platt's SMO was used to classify the data by means of SVM.

Matlab's Neural Networks Toolbox solverb function was used to classify the data by means of RBFN.

A 10-fold Cross-Validation procedure was applied to determine which model generalizes better to unknown data.


Generalization parameters

SVM parameter: the kernel width σ in

\[
K(\mathbf{u}, \mathbf{v}) = \exp\!\left( -\frac{\lVert \mathbf{u} - \mathbf{v} \rVert^2}{2\sigma^2} \right).
\]

RBFN parameter: the spread constant sc in

\[
y_j = \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{w}_j^{(1)} \rVert^2}{2\,(sc)^2} \right).
\]


Cross Validation procedure

1. All input patterns (l = 199) are divided into K = 10 subsets. The points for the subsets should be selected randomly.

2. The model is trained on all subsets except one.

3. The model is tested on the subset left out.

[Figure: trials 1, 2, ..., K, each holding out a different subset]


Cross Validation error

The Cross-Validation error is defined as

\[
E_{CV} = \frac{100\%}{K} \sum_{i=1}^{K} \frac{m_i}{L_i},
\]

where m_i is the number of misclassified examples within the i-th separation and L_i is the number of validating data in that separation.

The CV error should be as low as possible.
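The procedure and the error formula above can be sketched as follows (an illustration; `train_and_test` is a hypothetical callback, not part of the original software, that trains a model and returns m_i, the misclassification count on the held-out fold):

```python
import random

def cross_validation_error(l, train_and_test, K=10, seed=0):
    """E_CV = (100% / K) * sum_{i=1..K} m_i / L_i over K random folds.

    l is the number of input patterns (here l = 199), K the number of
    subsets (here 10).
    """
    idx = list(range(l))
    random.Random(seed).shuffle(idx)        # random assignment to subsets
    folds = [idx[i::K] for i in range(K)]   # K roughly equal folds
    total = sum(train_and_test([j for j in idx if j not in set(f)], f) / len(f)
                for f in folds)
    return 100.0 * total / K                # E_CV in percent
```

A model that misclassifies nothing yields E_CV = 0%; one that misclassifies every held-out example yields 100%.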


Data Normalization

Data normalization is the scaling of attribute values to a specific range, e.g. [0,1]. It may improve the quality of classification (decrease the CV error) and enhance the generalization ability of the classifiers. It is applied when the values of the feature parameters differ significantly from one another.

The ovarian cancer data is a very good example of a data set to which normalization should be applied. To justify this, it is instructive to analyze the values of the feature parameters.


Normalization of the input patterns

9 attributes take values from {0,1}: observation-examination, kind of hospital, hysterectomy, adnexectomy, exploration of abdomen, appendectomy, removal of intestine, radiotherapy, second-look surgery.

4 attributes take values from {1,2,3}: kind of surgery, degree of debulking, mode of surgery, kind of chemotherapy.

2 attributes take values from {1,2,3,4}: FIGO staging and grading of tumor.

Histological type of tumor takes values from {1,2,3,4,5}.

The greatest range of values belongs to the attribute age of patient: {22,25,...,81}.

There is thus a wide range of attribute values across the input vectors.


Normalization examples

Let x_j^{(i)} denote the j-th coordinate of the i-th data vector.

1. max normalization:

\[
norm\_x_j^{(i)} = \frac{x_j^{(i)}}{MAX_j} \in [0, 1]
\]

2. min-max normalization:

\[
norm\_x_j^{(i)} = \frac{x_j^{(i)} - MIN_j}{MAX_j - MIN_j} \in [0, 1]
\]

where

\[
MAX_j = \max_{i=1,\dots,l} x_j^{(i)},
\qquad
MIN_j = \min_{i=1,\dots,l} x_j^{(i)}.
\]
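Both normalizations can be sketched column-wise over the data matrix (an illustrative snippet; function names are ours):

```python
import numpy as np

def max_normalize(X):
    """Divide each column by its maximum: x_j / MAX_j."""
    X = np.asarray(X, dtype=float)
    return X / X.max(axis=0)

def min_max_normalize(X):
    """Scale each column to [0, 1]: (x_j - MIN_j) / (MAX_j - MIN_j)."""
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)
```

Note that max normalization maps into [0,1] only because all attribute values here are nonnegative; min-max normalization always spans the full [0,1] range per column.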


RBFN Cross Validation error

[Figure: E_CV [%] as a function of the number of hidden neurons S1 and the spread constant sc]


SVM Cross Validation error

[Figure: E_CV [%] versus the kernel width σ (0.1 to 10000, log scale) for C = 100, 1000, 10 000, 100 000, 1 000 000]


Generalization results

RBFN, the lowest percentage of misclassified examples:

max normalization: E_CV = 14.58% (S1 = k = 11, sc = 2.5),

min-max normalization: E_CV = 14.11% (S1 = k = 11, sc = 3.5),

no normalization: E_CV = 20.08% (S1 = k = 19, sc = 4).

SVM, the lowest percentage of misclassified examples:

max normalization: E_CV = 14.61% (C = 10³, σ = 55.7),

min-max normalization: E_CV = 15.13% (C = 10³, σ = 55.7),

no normalization: E_CV = 17.61% (C = 10⁵, σ = 290).


Generalization results – summary

                      SVM          RBFN
E_CV                  14.61 %      14.11 %
Optimal parameter     σ* = 4.7     sc* = 2.75


Other criterion: diagnostic accuracy

According to (Sboner et al. 2003), for the optimal parameters σ* and sc* we compute:

TP, correctly classified patients (True) who are sick (Pos),

FN, incorrectly classified patients (False) who are healthy (Neg),

TN, correctly classified patients (True) with a negative test result (Neg),

FP, incorrectly classified patients (False) who are ill (Pos).

\[
sensitivity = \frac{TruePos}{TruePos + FalseNeg} = \frac{TP}{TP + FN},
\qquad
specificity = \frac{TrueNeg}{TrueNeg + FalsePos} = \frac{TN}{TN + FP}.
\]


Diagnostic accuracy – simple abstract

example

Given 5 medical cases: 3 sick (marked with S) and 2 healthy (marked with H) patients, represented by the set:

[S S S H H]

On the basis of the positive and negative observations, and the numbers of correctly and incorrectly classified patients, calculate the diagnostic accuracy parameters:

sensitivity,

specificity,

for all 2⁵ = 32 possible combinations.
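The 32 combinations can be enumerated programmatically. The sketch below (illustrative Python, not part of the original study) counts TP, FN, TN, and FP following the slides' own convention, in which FN counts misclassified healthy patients and FP counts misclassified sick ones, and reports an infinite value when a ratio is 0/0:

```python
from itertools import product

TRUTH = ['S', 'S', 'S', 'H', 'H']  # 3 sick, 2 healthy

def rates(pred, truth=tuple(TRUTH)):
    # Counts per the slides' convention:
    # TP: sick classified as sick,     FN: healthy classified as sick,
    # TN: healthy classified as healthy, FP: sick classified as healthy.
    tp = sum(t == 'S' and p == 'S' for t, p in zip(truth, pred))
    fn = sum(t == 'H' and p == 'S' for t, p in zip(truth, pred))
    tn = sum(t == 'H' and p == 'H' for t, p in zip(truth, pred))
    fp = sum(t == 'S' and p == 'H' for t, p in zip(truth, pred))
    sens = tp / (tp + fn) if tp + fn else float('inf')
    spec = tn / (tn + fp) if tn + fp else float('inf')
    return sens, spec

for pred in product('SH', repeat=5):
    s, p = rates(pred)
    print(''.join(pred), round(s, 2), round(p, 2), round(s + p, 2))
```

For example, the perfect prediction [S S S H H] yields sensitivity 1 and specificity 1, while the all-healthy prediction [H H H H H] yields an undefined (infinite) sensitivity and specificity 0.4, matching the table that follows.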


Diagnostic accuracy – simple abstract

example – cont.

No.  combination  sensitivity  specificity  sum
 1.  H H H H H    inf          0.4          inf
 2.  S H H H H    1            0.5          1.5
 3.  H S H H H    1            0.5          1.5
 4.  S S H H H    1            0.67         1.67
 5.  H H S H H    1            0.5          1.5
 6.  S H S H H    1            0.67         1.67
 7.  H S S H H    1            0.67         1.67
 8.  S S S H H    1            1            2
 9.  H H H S H    0            0.25         0.25
10.  S H H S H    0.5          0.33         0.83
11.  H S H S H    0.5          0.33         0.83
12.  S S H S H    0.67         0.5          1.17
13.  H H S S H    0.5          0.33         0.83
14.  S H S S H    0.67         0.5          1.17
15.  H S S S H    0.67         0.5          1.17
16.  S S S S H    0.75         1            1.75
17.  H H H H S    0            0.25         0.25
18.  S H H H S    0.5          0.33         0.83
19.  H S H H S    0.5          0.33         0.83
20.  S S H H S    0.67         0.5          1.17
21.  H H S H S    0.5          0.33         0.83
22.  S H S H S    0.67         0.5          1.17
23.  H S S H S    0.67         0.5          1.17
24.  S S S H S    0.75         1            1.75
25.  H H H S S    0            0            0
26.  S H H S S    0.33         0            0.33
27.  H S H S S    0.33         0            0.33
28.  S S H S S    0.5          0            0.5
29.  H H S S S    0.33         0            0.33
30.  S H S S S    0.5          0            0.5
31.  H S S S S    0.5          0            0.5
32.  S S S S S    0.6          inf          inf


Diagnostic accuracy results – ovarian cancer data classification

                SVM (σ = 4.7)    RBFN (sc = 2.75)
sensitivity     0.88             0.91
specificity     0.78             0.78


Conclusions

Both RBFN and SVM achieve generalization ability and diagnostic accuracy at a comparably acceptable level.

Sometimes RBFN seems to outperform SVM.

The generalization performance of SVM and RBFN, and the medical measures (sensitivity and specificity), are low because of:

low cardinality of the data set,

complexity of the input space,

imbalance between the classes: similar input data with different labelling.


Conclusions – cont.

Unfortunately there is a disproportion in the number of instances: the entire set of 199 cases contains 68 negative samples and 131 positive samples.

It is recommended to use both models, RBFN and SVM, to predict the condition of a hospitalized patient, but the outcomes of the systems have to be appraised by a doctor before being taken into account.

Acknowledgements
The author is grateful to PhD student Maciej Kusy for his valuable help with data
preparation and calculations.

