2) Results
The algorithm was run on a set of 287198 address records. The data records are tuples defined as {street, location, zip code, county, state}.
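As an illustration only, one address record can be represented by a simple structure whose fields mirror the tuple above; the class below is a hypothetical sketch, not the schema used in the experiments.

from dataclasses import dataclass

@dataclass
class AddressRecord:
    # Fields mirror the tuple {street, location, zip code, county, state};
    # this is an illustrative sketch, not the authors' actual data layout.
    street: str
    location: str
    zip_code: str
    county: str
    state: str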
The rule-generation part of the algorithm was performed on the whole data set. The attribute correction part was performed on a random sample of the dataset consisting of 2851 records. During a review, 399 attribute values were identified as incorrect for this set of records.
To verify the performance of the algorithm, the measures defined in the previous section are used: p1, the percentage of correctly altered values; p2, the percentage of incorrectly altered values; and p0, the percentage of values not altered.
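A minimal sketch of how these measures can be computed is given below; it assumes that all three percentages are taken with respect to the number of attribute values identified as incorrect during the review (399 in this experiment), which is an interpretation of the definitions, not a quotation of them.

def correction_measures(n_correctly_altered, n_incorrectly_altered, n_not_altered):
    # Return (p1, p2, p0) as percentages.
    # Assumption: the denominator is the total number of reviewed incorrect
    # values, i.e. the three counts partition that set.
    total = n_correctly_altered + n_incorrectly_altered + n_not_altered
    p1 = 100.0 * n_correctly_altered / total
    p2 = 100.0 * n_incorrectly_altered / total
    p0 = 100.0 * n_not_altered / total
    return p1, p2, p0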
Table V and Fig. 2 show the relationship between the measures and the distThresh parameter. The minSup parameter was arbitrarily set to 10.
Table V. Dependency between the distThresh parameter and the measures for the context-dependent algorithm.
Fig. 2: The dependency between the distThresh parameter and the measures for the context-dependent algorithm.
The results show that the number of values marked as incorrect and altered grows as the distThresh parameter increases. Unlike in the case of context-independent cleaning, this number never reaches 100%. This shows that some attribute values that may at first glance seem incorrect are in fact correct in the context of the other attributes within the same record. The percentage of correctly marked entries reaches its peak for a distThresh value of 0.05. This result is better than in the case of context-independent cleaning, as for this value of the parameter the number of correctly altered values is equal to the total number of altered values. It also indicates that the context-dependent cleaning algorithm performs better at identifying incorrect entries. The number of incorrectly altered values grows as the parameter increases. However, a value of the distThresh parameter can be identified that gives optimal results, i.e. the number of correctly altered values is high and the number of incorrectly altered values is low. In this experiment that value is 0.15.
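Such a value can be found with a simple sweep over candidate distThresh values. The sketch below is illustrative only: run_correction stands for any callable wrapping the attribute-correction algorithm and returning the three counts used by the measures above, and scoring by p1 - p2 is just one possible trade-off criterion, not the one used in the experiments.

def pick_dist_thresh(run_correction, records,
                     candidates=(0.05, 0.10, 0.15, 0.20, 0.25), min_sup=10):
    # Sweep distThresh and keep the value with the best p1/p2 trade-off.
    # run_correction(records, dist_thresh, min_sup) is assumed to return
    # (n_correctly_altered, n_incorrectly_altered, n_not_altered).
    best_thresh, best_score = None, float("-inf")
    for thresh in candidates:
        n_ok, n_bad, n_skipped = run_correction(records, thresh, min_sup)
        total = n_ok + n_bad + n_skipped
        p1, p2 = 100.0 * n_ok / total, 100.0 * n_bad / total
        score = p1 - p2  # reward correct alterations, penalize incorrect ones
        if score > best_score:
            best_thresh, best_score = thresh, score
    return best_thresh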
Some areas of improvement for this method can be identified. One possible change to the algorithm would be to add one more parameter, minConf, for the generated rules. This parameter has the same meaning as the minConf parameter of the Apriori algorithm. It would enable pruning the "improbable" rules and thus limit the number of incorrectly altered values. Generating the rules from already cleaned data would also improve the algorithm's performance.
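A minimal sketch of the proposed confidence-based pruning is shown below. It assumes the generated rules are stored as (antecedent, consequent, rule support) tuples with absolute occurrence counts; this layout and the helper are illustrative rather than the paper's implementation, but the confidence formula is the classical Apriori one.

def prune_rules_by_confidence(rules, antecedent_support, min_conf=0.8):
    # Keep only rules whose confidence
    #   support(antecedent and consequent) / support(antecedent)
    # is at least min_conf, as minConf is used in the Apriori algorithm.
    kept = []
    for antecedent, consequent, rule_support in rules:
        base = antecedent_support.get(frozenset(antecedent), 0)
        if base == 0:
            continue  # antecedent never observed on its own; drop the rule
        if rule_support / base >= min_conf:
            kept.append((antecedent, consequent, rule_support))
    return kept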
III. Conclusion
The results of the experiments verifying the correctness of both algorithms for attribute correction prove that using data mining methods for data cleaning is an area that needs more attention and should be a subject of further research.
Data mining methodologies applied in the area of data cleaning may be useful in situations where no reference data is provided. In such cases this data can be inferred directly from the dataset.
Experimental results of both algorithms created by the author show that attribute correction is possible without external reference data and can give good results. However, all of the methods described here require more research in order to raise the ratio of correctly identified and cleaned values. As discovered in the experiments, the effectiveness of a method depends strongly on its parameters. The optimal parameters found here may give optimal results only for the data examined, and it is very likely that different data sets would need different parameter values to achieve a high ratio of correctly cleaned data.
The above experiments used only one string-matching distance function, the Levenshtein distance. It is possible that other functions would produce better output, and this should be explored in future experiments.
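For completeness, the distance used here is reproduced below as a plain textbook dynamic-programming implementation (not the code used in the experiments); exploring other functions would only mean replacing this routine, e.g. with a Jaro-Winkler or n-gram similarity.

def levenshtein(a: str, b: str) -> int:
    # Minimum number of single-character insertions, deletions and
    # substitutions needed to turn string a into string b.
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string in the outer loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]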
Moreover, further research on the application of other data mining techniques in the area of data cleaning is planned.