ASSOCIATION METHODS IN DATA CLEANING
ŁUKASZ CISZAK: APPLICATION OF CLUSTERING.
corrcction is conccmed. dala cleaning Solutions are only as good as the reference data they use. Reference data is a set of yalues which are considcrcd to be valid for a givcn attribute. e.g. list of States, countries.car modcls. etc.
The aim of rcscarch led by the author of this article was to examine if data mining methods may be used to solve tasks mentioned in the above paragraph. As data inining is the process of discovcring previously unknown knowledge from data. it can be used to discover data validation niles and reference data directly from the data set. Table I con-tains possible application of data mining methods within the area of data cleaning..
The main focus of this article is attribute correction. Other data cleaning tasks are subject to fuither research. This article presents two applications of data mining techniques in the area of attribute correction: conte.\t-independent attribute correction implemented using clustering tecltniques and con-te\t-dependent attribute correction using associations. The next sections of this cliapter present two algorithms created by the author tliat are supposed to exainine the possibility of application of data mining tcchniqucs in data cleaning. Both algorithms are intcndcd for use in the application of text-based address and personnel data cleaning.
A. Context independent attribute correction
As mentioned in the prcvious section. attribute correction Solutions require reference data in order to provide satisfying results. The algorithm described in tliis section is used to ex-amine if a data set may be a source of reference data tliat could be used to identify incorrect entries and enable to correct them. The algorithm aims at correcting text attributes in address and personal data.
Context-independent attribute correction means that all the record attributes are examined and cleaned in isolation without regard to values of other attributes of a given record.
The key idea of the algorithm is based on an observation tliat in most data sets there is a certain number of values hav-ing a large number of occurrences within the data set and a very large number of attributes with a veiy Iow number of occurrences. Therefore, the most-representative values may be the source of reference data. The values with Iow number of occurrences are noise or misspelled instances of the reference data. Table D shows an excerpt from the distribution with values sortcd alphabctically and thcn dcsccnding ac-cording to the number of occurrences. The attribute being examined comes from the sample data set used in this experi-mcnt and is derived from an Internet survey. As it may be noticed. there are many (6160) rccords that have the value "Warszawa'' and veiy few that have a value tliat closely re-sembles “Warszawa”. Therefore, the value “Warszawa” is most likely the correct value of the attribute and should entcr the reference data set while the remaining values are misspelled and should be corrected to “Warszawa".
The aim of the algorithm is to create a reference data set and a set of substitution niles that could be used to correct misspelled values.
Table II.
Location attribute distribution
1) Algorithm definition
The algorithm utilizes two data mining tcchniqucs: clustering and classification using the “k-nearest neiglibours" method.
The algorithm lias two parameters: distance tlireshold -distThresh being the minimum distance between two values allowing them to be maiked as similar, and occurrence rela-tion - occRel, used to determine whether both compared val-ues belong to the reference data set. This parametcr was in-troduced because it is possible that there are two values tliat may seem similar terms of the distance but that may repre-sent two different objects of the real world that should create separate entries in the reference data set.
To measure the distance between two values a modified Levenshtein distance [5] was used. Levenshtein distance for
Association rulcs
Data standardization/attrihutc co
Duplicatc mat. Yalidation rulcs g.