
PROCEEDINGS OF THE IMCSIT. VOLUME 3, 2008

we may have two records referring to separate real world objects bearing the same identifier. If a domain check constraint is missing, a column may contain values outside a specified range, e.g. more than two values describing sex. If a not-null constraint does not exist, a record may lack a value for a mandatory attribute.

If a source schema does not have a reference constraint, we may have records in one table referencing non-existent records in another table.
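The four schema-level gaps described above (missing uniqueness, domain check, not-null, and reference constraints) can be detected after the fact by scanning the data. The following is a minimal sketch, not a method from the paper; the table layout, the column names, and the `find_violations` helper are assumptions made for illustration.

```python
# Hypothetical sample tables; all column names are assumptions.
customers = [
    {"id": 1, "name": "Anna", "sex": "F"},
    {"id": 1, "name": "Jan", "sex": "M"},    # same identifier, different person
    {"id": 2, "name": None, "sex": "X"},     # missing name, sex outside domain
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 7},      # references a non-existent customer
]


def find_violations(customers, orders, sex_domain=frozenset({"M", "F"})):
    """List the problems that schema constraints would have prevented."""
    violations = []
    seen_ids = set()
    for row in customers:
        if row["id"] in seen_ids:            # missing uniqueness constraint
            violations.append(("unique", row["id"]))
        seen_ids.add(row["id"])
        if row["sex"] not in sex_domain:     # missing domain check constraint
            violations.append(("domain", row["id"]))
        if row["name"] is None:              # missing not-null constraint
            violations.append(("not_null", row["id"]))
    for row in orders:
        if row["customer_id"] not in seen_ids:   # missing reference constraint
            violations.append(("reference", row["order_id"]))
    return violations


print(find_violations(customers, orders))
```

Each tuple names the violated constraint type and the key of the offending row, so the result can feed a later correction step.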

2) One data source - record level issues

This type of issue is related to the errors that occur at the point of data entry into the source system. Typical errors involve:

■ misspellings - caused by OCR imperfections (e.g. '1' (digit one) may be replaced with 'l' (lower case L)), typos, or phonetic errors,

■ default values for mandatory attributes,

■ inconsistencies (e.g. incorrect ZIP code for a city),

■ misfielded values, i.e. correct values but placed in wrong attributes, e.g. country="Warsaw",

■ duplicate records: more than one record referring to the same real world object.
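Several of the record-level errors listed above can be flagged with simple rule-based checks. The sketch below is illustrative only: the placeholder list, the ZIP-prefix lookup table, and the `check_record` helper are hypothetical, and a real system would use complete reference data.

```python
# Hypothetical reference data; both tables are assumptions for illustration.
ZIP_PREFIX_TO_CITY = {"00": "Warsaw", "30": "Krakow"}
PLACEHOLDERS = {"", "N/A", "UNKNOWN", "1900-01-01"}
KNOWN_CITIES = set(ZIP_PREFIX_TO_CITY.values())


def check_record(record):
    """Return a list of suspected record-level data quality issues."""
    issues = []
    # Default values entered only to satisfy a mandatory attribute.
    for field, value in record.items():
        if value in PLACEHOLDERS:
            issues.append(f"default value in {field}")
    # Misfielded value: a known city name stored in the country attribute.
    if record.get("country") in KNOWN_CITIES:
        issues.append("misfielded: city name in country")
    # Inconsistency: the ZIP code prefix does not match the city.
    expected_city = ZIP_PREFIX_TO_CITY.get((record.get("zip") or "")[:2])
    if expected_city is not None and expected_city != record.get("city"):
        issues.append("inconsistent ZIP code and city")
    return issues


suspect = {"city": "Krakow", "zip": "00-950", "country": "Warsaw",
           "birth_date": "1900-01-01"}
print(check_record(suspect))
```

Misspellings and duplicate records are not covered here, since they generally need approximate string comparison rather than lookups.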

3) Multiple data sources - schema level issues

Multiple source schema level errors are caused by different data models in each of the source systems. For example, there may be homonyms or synonyms at the attribute level (attributes with the same name but different meaning, or attributes with different names but the same meaning), different levels of normalization, different constraints, or different data types.

4) Multiple data sources - record level issues

This kind of data quality issue is the cross-product of all the aforementioned issues. In addition, there may be other problems:

■ Different units of measure for the same attribute (meters vs. inches, $ vs. €, etc.),

■ Different constraints/domains: e.g. Sex = {M, F}; Sex = {0, 1}, etc.,

■ Different levels of aggregation: daily vs. weekly, monthly vs. annually,

■ Duplicate records - the same object from the real world may have different representations in different source systems.
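Differences in domains and units between sources, as listed above, are typically resolved with per-source mapping tables into one target representation. The following sketch assumes two hypothetical sources, `source_a` and `source_b`; which numeric code denotes which sex in `source_b` is an assumption, as are the field names.

```python
# Per-source mappings into the target domain {"M", "F"}.
# The meaning of codes 0 and 1 in source_b is assumed for illustration.
SEX_MAPPINGS = {
    "source_a": {"M": "M", "F": "F"},
    "source_b": {0: "M", 1: "F"},
}
# Conversion factors into the target unit (meters).
METERS_PER_UNIT = {"m": 1.0, "in": 0.0254}


def standardize(record, source):
    """Map one source record onto the target domain and unit."""
    # An unmapped source value becomes None so it can be reviewed later.
    sex = SEX_MAPPINGS[source].get(record["sex"])
    length_m = record["length"] * METERS_PER_UNIT[record["unit"]]
    return {"sex": sex, "length_m": round(length_m, 4)}


print(standardize({"sex": "M", "length": 2.0, "unit": "m"}, "source_a"))
print(standardize({"sex": 1, "length": 100, "unit": "in"}, "source_b"))
```

Currency conversion is left out on purpose, because exchange rates change over time and would require a dated rate table rather than a fixed factor.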

C. Data cleaning areas and related work

The following primary areas of data cleaning may be distinguished:

■ Duplicate matching: when integrating multiple sources, it may happen that one or more sources contain records denoting the same real world object. The records may have various degrees of data quality. Therefore, one of the tasks of the data cleaning solution is to identify duplicates and join them into a single record whose data quality would be high. This problem is known as the "merge/purge" or record linkage problem. Duplicate matching is also used to discover duplicates on a higher level, e.g. records of people that share the same address. This problem is known as household detection [14]. The research in this area (e.g., [3][14][15][18]) is focused on devising methods that are both effective, i.e. result in a high number of correct matches and a low number of incorrect matches, and efficient, i.e. performing within the time constraints defined in the system requirements.

■ Data standardization and correction: when different domains are used in different source systems, the data cleaning solution should transform all the values used in those systems into one correct set of values used in the target system. Moreover, if any incorrect values appear, the role of data cleaning is to identify those values and alter them. Works that concern this issue use various methods ranging from statistical data cleaning [8] to machine learning [4][12].

■ Schema translation: as source systems may utilize different data models, the task of the data cleaning solution is to provide a mapping from those data models to the target data model. This may require splitting free-form fields (e.g. "address line 1") into an atomic attribute set ("street", "home no", "zip code"). The research in this field focuses on devising methods that are capable of performing this process automatically [6].
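To make the duplicate matching area from the list above more concrete, the sketch below compares records approximately instead of exactly. It uses Levenshtein edit distance as the similarity measure, which is a choice made here for illustration and not necessarily the method of the cited works; the field names, the `is_probable_duplicate` helper, and the 0.8 threshold are likewise assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize edit distance to a score in [0, 1]; 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_probable_duplicate(r1, r2, threshold=0.8):
    """Hypothetical rule: similar names and equal dates of birth."""
    name_sim = similarity(r1["name"].lower(), r2["name"].lower())
    return name_sim >= threshold and r1["birth_date"] == r2["birth_date"]


a = {"name": "Kowalski", "birth_date": "1980-05-17"}
b = {"name": "Kowalsky", "birth_date": "1980-05-17"}
c = {"name": "Nowak", "birth_date": "1980-05-17"}
print(is_probable_duplicate(a, b), is_probable_duplicate(a, c))
```

Comparing every pair of records is quadratic in the number of records, so a practical solution would first restrict comparisons to candidate pairs, which is where the efficiency concern mentioned above comes in.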

II. Application of Data Mining Methods in Data Cleaning

All current data cleaning solutions are highly dependent on human input. The deliverable of the profiling phase, the first phase of a data quality assessment [13], is a set of metadata describing the source data, which is then used as an input for the creation of data validation and transformation rules. However, the validation rules have to be confirmed or designed by a business user who is an expert in the business area being assessed. It is not always easy or straightforward to create such a set of business rules. The situation is very similar where duplicate matching is concerned. Even if business rules for record matching are provided, e.g. "Equal SSNs and dates of birth", it may be impossible to match duplicate records, as any of the data quality issues may occur and thus prevent exact matching. For example, if incorrect SSNs or dates of birth stored in different positional systems occur, exact-matching business rules may not mark the records as duplicates. As far as attribute standardization and
