100
PROCEEDDMGS OFTHEIMCSIT. YOLUME 3, 2008
1 lLev(sl s2)
NI
Lev(sx s2)
—nr—) o>
wheie Z,ev(.vi a) denotes a Levenshtein distance between strings Si and S2. The modified Levenshtein distance for two strings may be inteipreted as an average fraction of one string tliat has to be modified to be Iransformcd into the other. For instance. the Levenshtein distance between “Warszawa” and "Warzsawa” is 2, while tlić modified Lev-enshtcin distance for "Warszawa” and “Warzsawa” is 0.25.
two strings is thc number of text edit operalions (insertion, delelion. cxcliangc) needed to transform one string into an-other.
The algoritlim for attribute correction describcd licrc uti-lizes a modified Lcvcnshtein distance defincd a;
The modification was introduced to be independent of the string length during the comparison.
The algorithm consists of following steps:
1. Prcliminary cleaning - all attributes arc transfonned into upper case and all the non-alphanumeric cltar-actcrs are removed
2. The number of occurrences for all the valucs of the clcancd data set is calculated.
3. Each values is assigned to a separatc cluster. The eluster element liaving the highest number of occur-renccs is denoted as the cluster represcntative.
4. The cluster list is sorted descending according to thc number of occurrences for each cluster repre-sentative.
5. Starting with first cluster. each cluster is compared with the other clusters from the list in the order defincd by the number of cluster representative occurrences. The distance between two clusters is defined as the modified Levenshtein distance between cluster representatives
6. If the distance is lower than the distThresh parame-ter and the ratio of occurrences of cluster represen-tative is greater or equal the occRel parameter. the clusters are merged.
7. If all thc clustered pairs are compared. the clusters arc cxamined whether they contain values having distance between tliem and the cluster representa-tive abovc the threshold value. If so. they are re-moved from the cluster and added to thc cluster list as separate clusters.
8. Steps 4-7 are repeated until tliere are no changes in the cluster list i.e. no clusters are merged and no clusters are ercated.
9. The cluster representatives form thc reference data set. and the clusters dcfinc transformation niles -for a given cluster cluster elements valucs should be rcplaccd with the value of the cluster representative.
As far as tlie reference dictionaiy' is conccmcd. it may happen tliat it will contain values wherc thc number of occurrences is veiy smali. These values may be maiked as noise and trimmed in order to preserve the compactncss of tlie dic-tionaiy'.
2) Restilts
The algorithm was tested using a samplc data set derived from an Internet survey. The data record is definc as a tupie {First Narne, Last Name, Occupation, Location}. There were about 30057 records divided into 6 batchcs of 5 thou-sand records. Tlie attribute that was the source of thc data for cleaning was "Location". During review 1166 elements -3.88% of the whole set. were identified as incorrect and hence subject to altcration.
Table III contains example transfonnation niles discov-ered during the execution of the algorithm Table III.
Example transformation mles Original valuc Correct valuc
Waiszwa Warszawa
Waiszaw Warszawa
Warsawa Warszawa
Warzsawa Warszawa
In order to test the correctness of the algorithm. following ineasures are used:
■ P:~ percentage of correctly altered values
■ p, - percentage of incorrectly altered values
■ po - percentage of values marked during the re-view as incorrect. but not altered during thc clcan-ing proccss.
The measures arc defined as
p=~ 100
p =—100 (2)
na
where n, is thc number of correctly altered values. «, the number of incorrectly altered values, na the total number of altered valucs, n,,,, the number of elements initially marked as incorrect tliat were not altered during the cleaning process and iio the number of values identified as incorrect.
Fig 1 and Table IV show the dependency between the distThresh threshold parameter and defined measures.
It can be obsen ed that tlie percentage of values altered is growing with thc incrcase of the distThresh parameter. It is caused by dccreasing tlie restrieth eness of thc algorithm in assigning valucs as similar. The percentage of correctly altered valucs rcaclies its higliest values of 92.63% at 0.1, however. for tliat value of the parameter only about 7% of previously identified incorrect entries are corrccted. The percentage of incorrectly altered values is also growing due to the greater tolcrance for duplicate matching criteria. How-ever, it is possible to define a value of the parameter to achieve the optimal ratio of tlie correctly altered values