2) Results
The algorithm was run on a set of 287198 address records. The data records are tuples defined as {street, location, zip code, county, state}.
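As an illustration only, one address record can be represented by a simple structure whose fields mirror the tuple above; the class below is a hypothetical sketch, not the schema used in the experiments.

from dataclasses import dataclass

@dataclass
class AddressRecord:
    # Fields mirror the tuple {street, location, zip code, county, state};
    # this is an illustrative sketch, not the authors' actual data layout.
    street: str
    location: str
    zip_code: str
    county: str
    state: str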
The rule-generation part of the algorithm was performed on the whole data set. The attribute correction part was performed on a random sample of the dataset consisting of 2851 records. During a review, 399 attribute values were identified as incorrect for this set of records.
To verify the performance of the algorithm, the measures defined in the previous section are used: p1, the percentage of correctly altered values; p2, the percentage of incorrectly altered values; and p0, the percentage of values not altered.
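A minimal sketch of how these measures can be computed is given below; it assumes that all three percentages are taken with respect to the number of attribute values identified as incorrect during the review (399 in this experiment), which is an interpretation of the definitions, not a quotation of them.

def correction_measures(n_correctly_altered, n_incorrectly_altered, n_not_altered):
    # Return (p1, p2, p0) as percentages.
    # Assumption: the denominator is the total number of reviewed incorrect
    # values, i.e. the three counts partition that set.
    total = n_correctly_altered + n_incorrectly_altered + n_not_altered
    p1 = 100.0 * n_correctly_altered / total
    p2 = 100.0 * n_incorrectly_altered / total
    p0 = 100.0 * n_not_altered / total
    return p1, p2, p0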
Table V and Fig. 2 show the relationship between the measures and the distThresh parameter. The minSup parameter was arbitrarily set to 10.
Table V. Dependency between the distThresh parameter and the measures for the context-dependent algorithm.
Fig. 2: The dependency between the distThresh parameter and the measures for the context-dependent algorithm.
The results show that the number of values marked as incorrect and altered grows as the distThresh parameter increases. Unlike in the case of context-independent cleaning, this number never reaches 100%. This shows that some attribute values that may at first glance seem incorrect are in fact correct in the context of the other attributes within the same record. The percentage of correctly marked entries reaches its peak for a distThresh value of 0.05. This result is better than in the case of context-independent cleaning, as for this value of the parameter the number of correctly altered values is equal to the total number of altered values. It also indicates that the context-dependent cleaning algorithm performs better at identifying incorrect entries. The number of incorrectly altered values grows as the parameter increases. However, a value of the distThresh parameter can be identified that gives optimal results, i.e. the number of correctly altered values is high and the number of incorrectly altered values is low. In this experiment that value is 0.15.
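Such a value can be found with a simple sweep over candidate distThresh values. The sketch below is illustrative only: run_correction stands for any callable wrapping the attribute-correction algorithm and returning the three counts used by the measures above, and scoring by p1 - p2 is just one possible trade-off criterion, not the one used in the experiments.

def pick_dist_thresh(run_correction, records,
                     candidates=(0.05, 0.10, 0.15, 0.20, 0.25), min_sup=10):
    # Sweep distThresh and keep the value with the best p1/p2 trade-off.
    # run_correction(records, dist_thresh, min_sup) is assumed to return
    # (n_correctly_altered, n_incorrectly_altered, n_not_altered).
    best_thresh, best_score = None, float("-inf")
    for thresh in candidates:
        n_ok, n_bad, n_skipped = run_correction(records, thresh, min_sup)
        total = n_ok + n_bad + n_skipped
        p1, p2 = 100.0 * n_ok / total, 100.0 * n_bad / total
        score = p1 - p2  # reward correct alterations, penalize incorrect ones
        if score > best_score:
            best_thresh, best_score = thresh, score
    return best_thresh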
Some areas of improvement for this method can be identified. One possible change to the algorithm would be to add one more parameter, minConf, for the generated rules. This parameter has the same meaning as the minConf parameter of the Apriori algorithm. It would enable pruning the "improbable" rules and thus limit the number of incorrectly altered values. Generating the rules from already cleaned data would also improve the algorithm's performance.
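A minimal sketch of the proposed confidence-based pruning is shown below. It assumes the generated rules are stored as (antecedent, consequent, rule support) tuples with absolute occurrence counts; this layout and the helper are illustrative rather than the paper's implementation, but the confidence formula is the classical Apriori one.

def prune_rules_by_confidence(rules, antecedent_support, min_conf=0.8):
    # Keep only rules whose confidence
    #   support(antecedent and consequent) / support(antecedent)
    # is at least min_conf, as minConf is used in the Apriori algorithm.
    kept = []
    for antecedent, consequent, rule_support in rules:
        base = antecedent_support.get(frozenset(antecedent), 0)
        if base == 0:
            continue  # antecedent never observed on its own; drop the rule
        if rule_support / base >= min_conf:
            kept.append((antecedent, consequent, rule_support))
    return kept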
III. Conclusion
The results of the experiments verifying the correctness of both algorithms for attribute correction prove that using data mining methods for data cleaning is an area that needs more attention and should be a subject of further research.
Data mining methodologies applied in the area of data cleaning may be useful in situations where no reference data is provided. In such cases this data can be inferred directly from the dataset.
Experimental results of both algorithms created by the author show that attribute correction is possible without external reference data and can give good results. However, all of the methods described here require more research in order to raise the ratio of correctly identified and cleaned values. As discovered in the experiments, the effectiveness of a method depends strongly on its parameters. The optimal parameters found here may give optimal results only for the data examined, and it is very likely that different data sets would need different parameter values to achieve a high ratio of correctly cleaned data.
The above experiments used only one string-matching distance function, the Levenshtein distance. It is possible that other functions would produce better output, and this should be explored in future experiments.
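For completeness, the distance used here is reproduced below as a plain textbook dynamic-programming implementation (not the code used in the experiments); exploring other functions would only mean replacing this routine, e.g. with a Jaro-Winkler or n-gram similarity.

def levenshtein(a: str, b: str) -> int:
    # Minimum number of single-character insertions, deletions and
    # substitutions needed to turn string a into string b.
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string in the outer loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]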
Moreover, further research on the application of other data mining techniques in the area of data cleaning is planned.