
ŁUKASZ CISZAK: APPLICATION OF CLUSTERING AND ASSOCIATION METHODS IN DATA CLEANING

79.52

67.56

47.23

29.34

avoiding a large number of incorrect values. In the case of the data examined, the optimal value of the parameter was 0.2, at which 20% of incorrect entries were altered, 79.52% of them correctly. The optimal value of the occRel parameter is still subject to further experiments. The clustering method used here could also be replaced by other clustering methods, which could improve the number of correctly altered values but might have a negative influence on the cleaning execution time, as the method used here does not require comparison between all the values from the cleaned data set.

The algorithm performs better for long strings, as short strings would require a higher value of the parameter to discover a correct reference value. However, as noted in the previous paragraphs, high values of the distThresh parameter result in a larger number of incorrectly altered elements.

This method yields 92% correctly altered elements, which is an acceptable value. The range of applications of this method is limited to elements that can be standardized and for which reference data may exist. Conversely, using this method for cleaning last names could end in failure.

The major drawback of this method is that it may classify as incorrect a value that is correct in the context of the other attributes of its record but does not have enough occurrences within the cleaned data set.

Table IV.

Dependency between the measures and the distThresh parameter for the context-independent algorithm

B. Context-dependent attribute correction

The context-dependent attribute correction algorithm is the second of the algorithms designed by the author that utilizes data mining methods. Context-dependent means that an attribute value is corrected with regard not only to the reference data value it is most similar to but also to the values of other attributes within the given record. The idea of the algorithm is based on the assumption that within the data itself there are relationships and correlations that can be used as validation checks. The algorithm generates association rules from the dataset and uses them as a source of validity constraints and reference data.

1) Algorithm definition

The algorithm uses the association rules methodology to discover validation rules for the data set. To generate frequent itemsets, the Apriori [17] algorithm is utilized.
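The frequent-itemset step can be illustrated with a minimal level-wise (Apriori-style) sketch. It is not the author's implementation: the function name `frequent_itemsets`, the representation of records as dictionaries, and the sample city/zip data are assumptions made for illustration only. Single-item sets are kept because the larger sets are built from them.

```python
from itertools import combinations

def frequent_itemsets(records, min_sup, max_size=4):
    """Level-wise search for frequent sets of (attribute, value) items.

    records: list of dicts (hypothetical representation of the data set).
    min_sup: minimal support as a fraction of all records.
    """
    n = len(records)
    transactions = [frozenset(r.items()) for r in records]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_sup:
            current[frozenset([item])] = s
    frequent = dict(current)

    size = 2
    while current and size <= max_size:
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) != size:
                continue
            # A record holds one value per attribute, so skip sets
            # that contain two values of the same attribute.
            if len({attr for attr, _ in union}) != size:
                continue
            # Apriori pruning: every (size-1)-subset must be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, size - 1)):
                candidates.add(union)
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_sup:
                current[c] = s
        frequent.update(current)
        size += 1
    return frequent

# Hypothetical sample data.
records = [
    {"city": "Warsaw", "zip": "00-001"},
    {"city": "Warsaw", "zip": "00-001"},
    {"city": "Warsaw", "zip": "00-001"},
    {"city": "Warsav", "zip": "00-001"},
    {"city": "Krakow", "zip": "30-001"},
    {"city": "Krakow", "zip": "30-001"},
]
sets = frequent_itemsets(records, min_sup=0.3)
```

With this sample, the misspelled value "Warsav" occurs only once and therefore never enters a frequent set, which is what later allows the rules to act as reference data.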

The algorithm described in this chapter has two parameters: minSup and distThresh. The first parameter, minSup, is defined analogously to the parameter of the same name in the Apriori algorithm used here. The other parameter, distThresh, is the minimum distance between the value of the "suspicious" attribute and the proposed value being a successor of a rule it violates in order to make a correction.

Although association rules algorithms normally have two parameters, minSup and minConf (the minimal confidence for generated rules), the algorithm used here does not use the latter. However, the algorithm can be modified to complete missing attribute values. In such cases minConf can be used to determine the minimal confidence a rule must have in order to fill in the missing value.
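A minimal sketch of how minConf could gate the filling of missing values follows. The text does not specify this modification in detail, so the functions `confidence` and `fill_missing`, the rule representation as (predecessor dict, (attribute, value)) pairs, and the use of `None` for a missing value are all assumptions. Confidence is taken in its usual sense: the fraction of records matching the predecessor that also carry the successor.

```python
def confidence(records, predecessor, successor):
    """confidence(X -> Y) = count(X and Y) / count(X)."""
    matching = [r for r in records
                if all(r.get(a) == v for a, v in predecessor.items())]
    if not matching:
        return 0.0
    attr, value = successor
    return sum(r.get(attr) == value for r in matching) / len(matching)

def fill_missing(record, rules, records, min_conf):
    """Return a copy of record with missing (None) attributes filled in
    by rules whose confidence is at least min_conf."""
    filled = dict(record)
    for predecessor, (attr, value) in rules:
        if filled.get(attr) is not None:
            continue  # attribute already has a value
        if all(filled.get(a) == v for a, v in predecessor.items()):
            if confidence(records, predecessor, (attr, value)) >= min_conf:
                filled[attr] = value
    return filled

# Hypothetical sample data and a single hand-written rule.
records = [
    {"city": "Warsaw", "zip": "00-001"},
    {"city": "Warsaw", "zip": "00-001"},
    {"city": "Warsaw", "zip": "00-001"},
    {"city": "Warsav", "zip": "00-001"},
]
rules = [({"zip": "00-001"}, ("city", "Warsaw"))]
incomplete = {"city": None, "zip": "00-001"}
completed = fill_missing(incomplete, rules, records, min_conf=0.7)
```

In this sample the rule holds for three of the four records that match its predecessor, so it fills the gap at a threshold of 0.7 but not at 0.8.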

For calculating distances between textual attributes, the modified Levenshtein distance described in the previous section is used.
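The exact modification of the Levenshtein distance is defined in a section not reproduced here. As an illustration only, the sketch below assumes a relative distance obtained by dividing the classic edit distance by the length of the longer string; the names `levenshtein` and `relative_distance` are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def relative_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length
    (an assumed normalization, not necessarily the author's)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Under this normalization a single typo in a six-letter value gives a distance of 1/6, while the same typo in a three-letter value gives 1/3, which is one way to see why short strings need a higher threshold.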

The algorithm has the following steps:

1. Generate all the frequent sets: 2-sets, 3-sets and 4-sets.

2. Generate all the association rules from the sets generated in the previous step. The rules generated may have 1, 2, or 3 predecessors and only one successor. The association rules generated form the set of validation rules.

3. The algorithm discovers records whose attribute values match the predecessors of a generated rule but whose value of the successor attribute differs from the successor of that rule. These records are marked "suspicious".

4. The value of the attribute for a "suspicious" row is compared to all the successors of all the rules it violates. If the relative Levenshtein distance is lower than the distance threshold, the value may be corrected. If more than one value lies within the acceptable range of the parameter, the value most similar to the value of the record is chosen.
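The last two steps (marking "suspicious" records and correcting them) can be sketched as follows. This is an illustrative sketch, not the author's code: rules are assumed to be already generated and given as (predecessor dict, (attribute, value)) pairs, the relative distance is assumed to be the edit distance divided by the longer string's length, and all function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def relative_distance(a, b):
    """Assumed normalization: edit distance / longer length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def violated_successors(record, rules):
    """Successors of the rules whose predecessors the record matches
    but whose successor value it does not carry."""
    return [(attr, value) for predecessor, (attr, value) in rules
            if all(record.get(a) == v for a, v in predecessor.items())
            and record.get(attr) != value]

def correct_record(record, rules, dist_thresh):
    """Return a corrected copy of record. For each violated attribute,
    choose the closest successor whose distance is below dist_thresh."""
    best = {}  # attribute -> (distance, proposed value)
    for attr, value in violated_successors(record, rules):
        current = record.get(attr)
        if current is None:
            continue  # missing values are a separate case
        d = relative_distance(str(current), str(value))
        if d < dist_thresh and (attr not in best or d < best[attr][0]):
            best[attr] = (d, value)
    corrected = dict(record)
    for attr, (_, value) in best.items():
        corrected[attr] = value
    return corrected

# Hypothetical validation rules and records.
rules = [
    ({"zip": "00-001"}, ("city", "Warsaw")),
    ({"zip": "30-001"}, ("city", "Krakow")),
]
typo = {"city": "Warsav", "zip": "00-001"}
other = {"city": "Gdansk", "zip": "00-001"}
fixed = correct_record(typo, rules, dist_thresh=0.2)
untouched = correct_record(other, rules, dist_thresh=0.2)
```

In this sample both records violate the first rule and are therefore "suspicious", but only "Warsav" is close enough to the successor to be corrected; "Gdansk" is left unchanged because its distance exceeds the threshold.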
