Version 1.4 A brief introduction to database theory Written by Keld Marple Page 1 Keld Marple, February 2010 Version 1.4 Table of Contents Introduction..........................................................................................................................................3 A database is a program...................................................................................................................3 A database is a concept....................................................................................................................3 Data..................................................................................................................................................4 What do you use them for?..............................................................................................................4 The relational database.........................................................................................................................5 Uniqueness...........................................................................................................................................5 The primary key..........................................................................................................................5 On choosing a good primary key................................................................................................6 What's in a database?............................................................................................................................6 Entity................................................................................................................................................6 Attribute...........................................................................................................................................6 Tuples...............................................................................................................................................7 Values..............................................................................................................................................7 Data types........................................................................................................................................7 Number datatypes.......................................................................................................................7 Integer.....................................................................................................................................7 Float and double.....................................................................................................................8 Text datatypes (strings)...............................................................................................................8 Char........................................................................................................................................8 Varchar...................................................................................................................................8 Text.........................................................................................................................................8 Date, time, year...........................................................................................................................8 NULL..........................................................................................................................................9 Relationships........................................................................................................................................9 Foreign keys.....................................................................................................................................9 The one-to-many relationship.....................................................................................................9 The many-to-many relationship................................................................................................10 Design issues......................................................................................................................................12 The data model ..............................................................................................................................12 E/R-diagram (logical)....................................................................................................................13 Cardinality.................................................................................................................................13 Physical E/R diagram.....................................................................................................................16 Normalisation.................................................................................................................................16 1. normal form...........................................................................................................................17 2. normal form...........................................................................................................................17 3. normal form...........................................................................................................................17 Redundancy...................................................................................................................................18 Data dictionary...............................................................................................................................18 Page 2 Keld Marple, February 2010 Version 1.4 Introduction A digital database is a system in which data can be stored and retrieved. In most cases where you want a system to remember something you will need a database. If you create a webpage with a member registration, you will need to store some information about the registered members. These information must be stored somewhere and the best solution by far is a database. There are different database systems (DBMS) available. One of the biggest systems is Oracle, which is an expensive high-end DBMS. Another one is MySQL, which is an open source DBMS and free of charge for the user. MySQL is very popular as the DBMS used by dynamic web pages. It is a very efficient database and it works well with the scripting language php. When a system which needs a database has to be created, it is crucial that the database model is well designed. The database model is the layout of the tables, and therefore all the tables should be included in the model. A model is a small part of the real world. The model describes that part sufficiently. The model describes nothing but the things relevant to one particular system: The system you're currently designing. If a system needs to store some information about a member, the table could have a username, an email and a password. In reality these three information surely does not descibe a person satisfactory, but in this particular system it suffices. The database model must be well structured for several reasons. The DBMS guarantees one thing: It will store your data and find them again - IF you follow the specified rules for database design. Another good reason is to avoid redundancy (= dual representation of data). A database is a program Just like anything else on a computer the database is software: A program that you install on your hard disk and that has to be running for you to use it. I doesn't matter if the database program has been installed on your own computer or another computer located far away as long as you have the appropriate network software to connect to the database. A database is a concept The idea of storing data is not a new one. Neither is the idea of a database. People have figured ways of storing items, information and even knowledge in many different ways throughout the human history. But storing information isn't enough. Not if there is a lot of it. If you need only to handle the names of your kids or your siblings, it is (for most people) sufficient to use the memory conveniently located in the brain, but when it comes to all the people in a country (and their phone numbers), we need some storing device, that we can use to find the information we want. Telephone books are used for that. But the printed telephone book has one big disadvantage: You cannot list all the entries alphabetically and by number at the same time. So the phone book is good if you know the name and want to find the phone number, but it is utterly useless if you want to find a name from a number. To solve this problem we could print another version of the same data, this time listed in numerical order. But what if you want to search for people with certain professions, people from the same street or the same part of town? And what if all you know is the first name of that person? This is where the digital database comes to the rescue. When using a digital database you are able Page 3 Keld Marple, February 2010 Version 1.4 to store your data in such a way that every piece of data is written once, but can be used in many different ways. This paper is only concerning the digital database. Data As suggested above, many things can be considered data. Any information you would like to store for later use can be data. If you think, that you might one day want to be able to make a log of your customers' whereabouts, you would have to store their actions somehow you would need to store some data. If you sell used cars, and you want to keep track of your business, you could create a database and type in everything you want to store about your cars and your business. These information are data. Knowledge can also be thought of as data, but it is harder to define what knowledge is. Often people prefer to think of knowledge as the thing you achive from combining information. In that respect, we only store information in our digital database. It is when we use it, we change it into knowledge. What do you use them for? Now for the big question: Is it really necessary if you are creating a small website? To be honest: Yes it is. If you're doing anything but an online business card, you will almost certainly enhance the flexibility and efficiency of the site and you will cut down the hours spend on updating. Here's a few examples: Website Without database With database An internet page about your hobby You could have a site showing The same as without, but now you could be stamps. pictures of all your stamps put into a could include different or even category like nationality. Whenever multiple categories (like year printed, you get a new stamp you'll upload the condition, motive). When you acquire picture and edit the html-page for that a new stamp, you upload the picture country to include the new image. and include it in your database. It instantly works on all pages. A weblog. You write your thoughts every day in You write your daily comments a text editor of your choice. Then you directly into a web form. People can upload that file using a predefined both browse and search your texts. file-name. Remember to either delete You can include search criteria like or rename any old files. The website headlines, subjects etc. You can also shows the newest one (e.i. The one see how many times your article have with the right filename). People can been read. browse your daily comments if you provide link to older pages. A small business website: A used cars You have big lumps of information on You can include user comments and salesman. the website including pictures of cars. auctions. And naturally uploading and New cars come in daily, and you maintaining the horde of information update by uploading pages and images gets much easier in new pages using templates. Page 4 Keld Marple, February 2010 Version 1.4 The relational database The most common type of database is the relational database. The database is called relational because data is ordered in relations (that is: tables). Each table is meant to describe one thing or category, like a car for instance. Everything you need to know in relation to a car will be part of the car-table. So the model, car-colour, number-of-doors, motor-volume etc. are all properties related to a car. Even though different cars can have different colours, each car will have one colour. Put in a schematic way, you could write it like this: MODEL: Opel Kadett MODEL: Ford Mondeo MODEL: Nissan Primera COLOUR: Red COLOUR: Blue COLOUR: Blue DOORS: 2 DOORS 5 DOORS: 4 VOLUME: 1400 cm3 VOLUME: 1600 cm3 VOLUME: 1400 cm3 As you see, we have 3 cars registered in this list. They are different in model and they all have the properties listed. The actual value of the property it could be the value blue of the property colour does not make sense on its own. None of the properties do. To use the information provided it is necessary to relate the values with the actual car. The relation between the actual car and the properties is what have given name to the relational database. NOTE: You might have heard of relations between tables and then you might think that it is these relations that gave name to the relational database. This is a common mistake. To be precise, we call this type of relations for relationships. We will discuss the relationships between tables later on. Uniqueness Now let's say we add one car to the garage. A blue Ford Mondeo with 5 doors and a 1600 cm3 motor. Then the list looks like this: MODEL: Opel Kadett MODEL: Ford Mondeo MODEL: Nissan Primera MODEL: Ford Mondeo COLOUR: Red COLOUR: Blue COLOUR: Blue COLOUR: Blue DOORS: 2 DOORS 5 DOORS: 4 DOORS 5 VOLUME: 1400 cm3 VOLUME: 1600 cm3 VOLUME: 1400 cm3 VOLUME: 1600 cm3 As you might have noticed, there are two identical cars in the list. It is no strange matter in the garage, but it's bad news in a database. One of the things to achieve when using databases is the uniqueness of every single item present. T h e p r i m a r y k e y This problem leads us to a very important concept: The primary key. The primary key is one information or attribute, that must be guaranteed unique. In this way every new entry will be unique - even if several of the values are similar to those of other entries. Whenever you are searching for a specific row, you are sure that you can always address it precisely using it's primary key. Page 5 Keld Marple, February 2010 Version 1.4 O n c h o o s i n g a g o o d p r i m a r y k e y There are at least three different approaches when choosing a primary key. The first and perhaps most obvious way, is to look through the attributes and find one which is some sort of unique value already. The personal registration number is such a value. No two persons can have the same personal registration number. The ISBN for books is also a unique value, that only applies to one title. So unless you are doing a registration of specific volumes, the ISBN will be a suitable primary key. Sometimes you don't have such a unique attribute at hand. A movie database can suffer from this, if all you have is a title, a plot description, production year and so on. In such cases an identifying attribute is created. The only purpose of this attribute is to act as primary key. One can even ask the database system to take care of the numbering of the primary key. Each time someone adds a row to the database, the database system will make sure that the next number in line will be added as the primary key of this row. The third way one can create a primary key is to choose two or more attributes that together create a guaranteed unique combination. When an attribute is defined as a primary key, the database will make sure, that nobody insert more rows with the same primary key value. What's in a database? To make sure that everybody knows what everybody's talking about, certain words have come into play during the years. To be completely honest, I might add, that not everybody uses all the words in exactly the same way, and they don't always use them consistently and unambiguously. But one has to start somewhere when you know your way around the database theory, you will not get confused because someone uses a different word. Here is some of the important words and their meaning. Entity The first thing we create, when we develop a database model is the entities. Entities are things or ideas from the physical world, that we need in our model. A database with members could choose person as an entity. Or even member . The entity is meant as an abstraction of what it represents. Is is naturally not the thing itself. It is only a representation of the thing. And it is only meant to represent the thing or idea in one particular model. Therefore we might include things like name , address and telephone number in our entity called member , even though another database designer doing another database might include login , password and email in his member-entity. Often you will see the words table and relation as more or less synonymous with the word entity. Developers use the word entity when designing the model and the word table when referring to the actual tables that have been created in the database. Attribute When we have found an entity for our database model, we must describe it in detail. This is done by specifying exactly what type of information the entity can hold. It has already been mentioned as name , address etc. and these are the attributes of the entity. The attributes are the properties of Page 6 Keld Marple, February 2010 Version 1.4 the entity. If an entity was a chest of drawers the attributes would be the drawers. All the attributes together describes the entity. (At least technically - we might still need a description in a humanoid language somewhere). Finding the right attributes for the entities (and the right entities for the model) is a major part of the database design process. We will get back to this issue later. When you move from the model design level to the implementation level, attribute becomes column or field . Tuples When the database is finished it needs to be populated e.g. filled with data. Whenever you enter a new member into the member entity, you will have to enter one specific value that applies to this particular member for each attribute in the entity. After you have done that, you have a collection of information about one particular member. Or you could say: You have one instance of the entity. This collection is called a tuple. Other words for typle that you might encounter are row or record . Values When we get to actually inserting data into the database, we talk about values. A value is something concrete in opposition to an attribute, which is an abstraction. Each new tuple in a table is to be thought of as one specific instance of what the entity represents. A physical representation. If it is a car-entity, one tuple would represent one physical car. This car is described in this particular tuple and nowhere else. It might use stuff from other entities and refer to this using a foreign key. For example a car could have airbags installed as extra equipment, but that equipment itself is described elsewhere (in the equipment-entity) and is as such not part of the description of the car. One value exists in one cell. Remember that each cell can contain one value no more. They are said to be atomic, which means they cannot be split up into several values. Data types When storing data in a database it is important to know what you are actually storing. If you store a name, you know by instinct, that it is not a number, and if you're storing a number, you know you can't use letters. But what about phone numbers, dates or even binary files? N u m b e r d a t a t y p e s Numbers are special in that the computer uses the bits directly as numbers, and uses mathematics to change it into decimal numbers. Numbers are quick to use one could say, that they are the native language of the computer. Integer One of the most common types of data is the integer. That is whole numbers like 1, 2, 3, 4 and so on. Integers can also have a negative value (unless you specify otherwise). The integer is usually a 32 bit number, which means that 32 bits are used to store that value. This gives you the ability to store 232 (= 4 294 967 296) different values. Page 7 Keld Marple, February 2010 Version 1.4 Float and double If you need to store a number's decimal fractions, you cannot use the integer datatype. You'll have to use one of the floating point types: Float which is 32 bit and double which is (surprisingly) 64 bit. T e x t d a t a t y p e s ( s t r i n g s ) Using the binary system to represent text is fundamentally different from the representation of numbers. While the base10 numbers can be calculated from their binary representation, letters and signs have to be matched with some bit combination from a list. That list could be the ascii character list in which the number 97 represents the letter a . To represent one letter 8 or 7 bits are usually needed. With 8 bits you are able to represent 256 different values and therefore you could have a list of 256 different characters. Char Char means characters . Chose the char datatype if your field contains a fixed length string. If you for example want to store a phone number using the +XX XX XX XX XX format, you cannot use a numeric datatype (as the + sign and the white spaces are not numbers), but you do know exactly how many characters you'll need as every phone number has the same format (assuming you're storing phone numbers from one country only). The number of characters needed in this example is 15: Ten numbers, one plus sign and four white spaces. One could also imagine that you'd want to store an encrypted password. If you've used the MD5 encryption algorithm, you'll know for sure, that your encrypted result contains 32 characters. You'd chose the char datatype, set to 32 characters. Varchar Another common type of text is the varchar datatype. The word varchar is an abbreviation of variable character , meaning a string of characters of variable length. The varchar is special in that the database will cut off unused characters. So if you define a field to be a varchar of 50 characters and you only store the word ice cream in it, the database will only use as much space, as the number of characters in the word plus a byte for the length of the field. Use that varchar field for names, emails, adresses and any other type of field, where you dont know the length of the values stored in it. Text The text datatype is another string-based datatype. It can hold very big amounts of characters. A text field can hold 65,535 characters, but like the varchar it will cut of empty spaces, though not as effectively as the varchar. D a t e , t i m e , y e a r These datatypes are actually string types as well. The difference lies in the ways they can be used. Most database systems offers quite a few date and time handling functionalities. Page 8 Keld Marple, February 2010 Version 1.4 N U L L One special kind of data is the value of nothing. How can you represent nothing in registers that can only contain ones or zeros? A memory register is never empty, so to represent emptiness you need a special value. This value can only be used for one purpose: The representation of nothing. NULL is that value. Relationships There are two fundamental organisational features of the relational database. One is, that you can order entities in relations (tables) with their attributes. This one has been dealt with already. The other main feature is, that you can create relationships between these entities. A relationship exists between to entities. Foreign keys When a relationship is established between two tables it is done using special values called foreign keys (FK). A foreign key is nothing but an extra attribute in one of the tables containing a value that is actually identical to the primary key of one particular row in the other table. It is by comparing these values that the relationship is realised. You can also think of the foreign key as a reference or a link. T h e o n e - t o - m a n y r e l a t i o n s h i p There are three different types of relationships: The one-to-one relationship, which is the type of relationship that exists between the entity and its attributes. The second type of relationship is the one-to-many relationship. This type of relationship is actually the one described above. It exists between to entities and you only need two tables to create it physically. Let's say we expand the car database commenced earlier. Instead of having the colours of the cars written out directly in the car-table, we'll create a separate table for colours. This way the value blue will only be represented one time in the entire database even if a thousand cars are blue. The colour table could look like this: COLOUR: (1)Blue COLOUR: (2)Red COLOUR: (3)White COLOUR: (4)Black COLOUR: (5)Green COLOUR: (6)Yellow So now, if you need to store a red car in the database, you could just refer to this table, stating that the car has colour number 2. For all the cars in the table you'd change the content of the colour attribute from the actual value ( blue , red etc.) to the number, that the colour has in the colour table. All the colour attributes of the blue cars will contain the number 1 , because this is the value of the blue colour. It is the primary key of blue. The colour attribute of the car-entity has changed into a foreign key. Page 9 Keld Marple, February 2010 Version 1.4 The car table from before, now looks like this: MODEL: Opel Kadett MODEL: Ford Mondeo MODEL: Nissan Primera MODEL: Ford Mondeo COLOUR: 2 COLOUR: 1 COLOUR: 1 COLOUR: 1 DOORS: 2 DOORS 5 DOORS: 4 DOORS 5 VOLUME: 1400 cm3 VOLUME: 1600 cm3 VOLUME: 1400 cm3 VOLUME: 1600 cm3 Note that the colours exists only as numbers references to the colour table: As foreign keys. Let's examine this relationship further to find out what it means, that it is a one-to-many relationship. Take one car: The Nissan for example. This particular car can only have one colour: Blue. The Ford Mondeo (both of them) can also (each of them) have one colour: Blue. There are three cars which are blue, but each of the cars only has one colour. One could say that: One car can have only one colour, but one colour can be applied to many cars. The one-to-many relationship. Note that one of the tables in a one-to-many relationship contains a foreign key. The other doesn't. The colour table doesn't have any keys that points to the car-table, and as a result one cannot say anything about cars by looking at the colour table. You can't even figure out how many cars are blue, because it is not part of the description of a colour to know how many cars exist in that colour. But it is part of the description of the car to say which colour it is. Therefore the foreign key exists in the car table. T h e m a n y - t o - m a n y r e l a t i o n s h i p The many-to-many relationship is a bit more complicated than the one-to-many. It turns out, that it is impossible to create a many-to-many relationship using only two tables. Even though we might insist, that a many-to-many relationship exists between two entities, we cannot implement it physically with just two tables. An extra table has to be created. The example here will be the relationship between a car and extra equipment. Here is a list of equipment: 1 Car radio dkr. 699,- 2 Battery charger dkr. 469,- 3 GPS-system dkr. 2499,- 4 Spoilers dkr. 789,- The people who make car radios want to sell as many radios as possible. Therefore they have made the radio to fit into most cars. On the other hand: Most cars can actually have more than one piece of extra equipment. You can install a radio in a car, even though you put on the spoilers only last week. The relationship between cars and extra equipment described is the many-to-many relationship. The question is now how to establish this relationship. Neither in the car table or the extra equipment table is there anything that indicates a connection between the two. This is not a coincidence. To make the relationship an extra table must be created. This new tables has one purpose of existence: To hold two foreign keys one for each table. Let's try it to see how it works: Page 10 Keld Marple, February 2010 Version 1.4 The car_extra_equipment_table Car Equipment Ford Mondeo (1) Car radio Ford Mondeo (1) GPS-system Nissan Primera Car radio Nissan Primera Spiolers Ford Mondeo (2) Car radio Opel Kadett Car radio Open Kadett Battery Charger Open Kadett Spoilers As you can see any car can be represented more than once in this table. The Opel Kadett for instance is represented three times. On the other hand, any piece of equipment can also be represented more than once. All the cars have a car radio, so that piece is represented four times. So any car and any equipment can exist as many times as is needed what happened to uniqueness of a row? Actually it exist in the combination of the two columns. The Ford Mondeo (1) only exist one time combined with the car radio. The car itself is represented two times, but with two different pieces of equipment. The car radio exists four times, but in combination with four different cars. With this table the many-to-many relationship comes to life: One car can have many pieces of equipment (and it will appear in the table exactly that many times), and any particular piece of equipment can be applied to many cars (and will appear in the table exactly as many times, as it is applied). This new table that is created only consists of two foreign keys. It is only there to make the many- to-many relationship between two other tables possible. To indicate, that it doesn't contain any attributes that are not keys, it is called a weak entity. The primary key of the weak entity could very well be the two foreign keys combined. This is called a composite primary key and is very common. When this table is implemented it will contain only numbers. Not words like the example above. The physical table in the database would look like this: car_equipment car equipment 2 1 2 3 3 1 3 4 4 1 1 1 1 2 1 4 Page 11 Keld Marple, February 2010 Version 1.4 Design issues Any database that's created, is created to serve a purpose. When you decide, that your project needs a database, it it because you have figured out that either the project cannot be done without a database or maybe that if you add the database to the project, you can make it much more efficient. Whatever the reason, you want your database to be reliable. You want to take advantage of the guaranties given by the DBMS, that if you follow the rules of database design for relational databases, the DBMS will guaranty that your data wont get messed up. Before you go on implementing your database, you must go through the database design phase. That is the phase where you figure out what you need to solve your problem. The data model The first thing you must do is to make a model of your system. That is a model of what you need to be able to store. In the car database used as the example of this paper, you would have to be able to store cars and extra equipment. You know this because you have tried to describe the system to yourself (and preferably others) in normal words. Something like: The application is a website, where the users can search for different used cars and what extra equipment fits the car. The user will also be able to sign up for a newsmail, whenever a new (used) car comes into stock. Reading this paragraph carefully you find several important words: users, cars, extra equipment, newsmail and stock. These words are interesting because they are words closely connected to the requirements of the system. If the system cannot do, what's in the description it is not the right system (but some other system), and the system development has failed. Any system is a little world in itself. It cannot deal with anything that it is not created to deal with, but it must be able to handle the things for which it has been created. Reading the description this system must be made to handle users, cars, extra equipment, newsmails and the stock. The rule of thumb says that the nouns of a description makes good candidates for entities. It might not be all of the nouns that actually end up as tables, and new ones might show up in the process, but we need a place to start, and this is a good one. We start therefore with the following entities: user car extra equipment newsmail stock Page 12 Keld Marple, February 2010 Version 1.4 E/R-diagram (logical) With the entities found it is possible to make the E/R-diagram. E/R stand for entity/relationship , so this is a diagram that shows us which entities we have in the system, and what relationships exist between them. We have already found one of them: The many-to-many relationship between the car-entity and the extra equipment-entity. When doing the E/R-diagram the entities are usually represented as boxed, and the relationships are represented with lines between the boxes. The lines are drawn with a diamond shaped box on top of it. Like this: car equipment C a r d i n a l i t y To make this diagram really useful one extra information is included in the diagram: The cardinality of the relationship. Between the car-entity and the equipment-entity a many-to-many relationship exist. So at each end of the relationship-line a number is written to signify many . We don't have any restriction on that number, so we just call it many. It could be one, two, fifty or even none at all. The symbol for many in database modelling is the letter n (lower case). Now we need a many symbol at each end of the relationship, but to prevent anyone from misunderstanding our model and assume that the two n's actually means the same number, we use the letter 'm' for the second part of the many-to-many relationship. This is how it looks with cardinalities: n m car equipment That part of the E/R-diagram is now finished. The only thing missing is the rest of the entities. Here is a suggestion of the five entities: stock 1 n n m car equipment n m user newsmail As you might have noticed, not all entities are currently connected. There is no rule saying that all tables must be connected. In fact in this model, there is one entity, which might end up totally Page 13 Keld Marple, February 2010 Version 1.4 isolated. Take the newsmail entity. When you think about it, no natural relationship between a user and the news exists. The newsmail entity may contain the actual newsmails that the company sends out, and the user might have an attribute stating whether they are interested in the news or not. But no relationship. When new news are written, they are stored in the newsmail table and the system will run through all the users to see who want a newsmail. It will then send the mail. The relationship between user and newsmail must be removed. On the other hand. If you're thinking ahead, it could very well be of some interest to the car salesman to be able to find out who bought which car six months ago. So maybe there should have been a relationship between car and user. The E/R-diagram changes into this: stock 1 n m n car equipment n newsmail m user Notice that it is often difficult to create the perfect E/R diagram in one attempt. But after you have made a diagram, you should go through it while thinking about all relations: Why are they there? What does it mean, that there's a relationship between these two entities. Sometimes it becomes clear, that the relationship should be removed or that the cardinality should be changed. And sometimes you'd find, that you have missed an important relationship between two entities. This work is on the logical level. We are at this time thinking about what our model represents. There's a logical relationship between car and user: A car can be bought by many users and a user can buy many cars. As you'll see shortly, this is not a precise physical representation of the database. For this reason, we call this level of the E/R diagram the Logical E/R diagram. Page 14 Keld Marple, February 2010 Version 1.4 You can add attributes to the E/R-diagram to get an idea of what your entities contains. model colour stock doors 1 volume type price n delivery m n car equipment n newsmail m user name email password want_news Doing this might reveal attributes, that are actually entities. Like the colour-attribute. This would be much better as a separate entity. The final E/R-diagram then looks like this: stock 1 n m n colour car equipment n 1 n newsmail m user Page 15 Keld Marple, February 2010 Version 1.4 Physical E/R diagram After finishing the logical E/R-diagram, it is time to move on to the physical E/R diagram. The physical E/R diagram is meant to be an exact resemblence of the database. Seven boxes in the diagram, means seven tables in the database. This is different from the logical E/R-diagram. The physical E/R diagram for the used cars company would look like this: stock colour car car_equip equipment car_user newsmail user The real big difference between the logical and the physical E/R-diagram is that you can't have many-to-many relationships in the physical E/R diagram. All many-to-many diagrams will have to be split up into two one-to-many relationships, because this is how a many-to-many relationship is actually (physically) build. So in the physical E/R diagram you'd add all the weak entities so that the diagram shows the exact amount of tables in the database. Normalisation It is time to get hold of the attributes of the entities. Entities can be written as a list like this: car colour stock user equipment car_user car_id colour_id stock_id user_id equip_id car_id model colour street name type user_id colour number street price buying_date doors house no. delivery year_of_car volume postal no. district To make sure a database is correctly constructed, a few rules have been developed which can help clean up inconsistent design. They are called the rules of normalisation and there are six of them. We only concern ourselves with the first three rules, as they are the ones with practical significance. The rules are called forms and we will now go through normal form 1, 2 and 3. Page 16 Keld Marple, February 2010 Version 1.4 1 . n o r m a l f o r m For a relational database to be in 1. normal form one must assure that all attributes are dependant on the primary key. That means that you can't have an attribute where the content does not describe the instance represented in the tuple, but actually could be described better on its own. The colour attribute in the car-entity is an example. Before it was separated into its own table, the colour was written directly in the car-entity, but we don't want to have to go to the car-entity to learn about a colour. We want to be able to find everything regarding the colour (like it's name) in a colour entity. Imagine that you had to find a certain car to find out what orange was like. Better to go to the tuple "orange" in the colour table. The 1. normal form also says, that the entities must be free of repeating fields. Repeating fields means numerous columns describing the same type of information. Imagine that the car salesman wanted to store a list of former owners of his car. One could then be tempted to include 3 or 4 extra columns in the car-table called owner 1 , owner 2 , owner 3 and owner 4 . But this is not good. If a car were to have more than 4 owners, there would not be enough place in the database. That's messy design, and it has severe structural limitations. One obvious limitation is, that a car can have a maximum of four owners. It is much better to have a table called user (as in the model above) and then construct a relationship between the two. 2 . n o r m a l f o r m The second normal form says that if you have tables with composite primary keys, then all other attributes must be equally dependant on both of them. A violation of the 2. normal form can be seen in the car_user table above. The attribute called year_of_car has been wrongly placed in this entity for several reasons. One of them is, that the attribute was thought of while creating the entity, and it seemed perfectly sound to include an attribute about the age of the car when it is bought. But this is not the right place to store the value. If the car is bought back by the used car company, and then sold again, the value will be stored again. And even worse: The age of the car can only be stored when a car is sold. This is what the 2. normal form is meant to prevent. The year-of-car attribute is more dependant on the car_id foreign key than the user_id foreign key, for which it is totally irrelevant. As the two foreign keys together form a composite primary key, the year_of_car attribute can not be said to be equally dependant on both part of the primary key. The attribute must be moved to the entity where it belong: The car entity. 3 . n o r m a l f o r m The 3. normal form says that no attribute must be more dependant on another attribute than the primary key. There is an example of that in the tables above as well. In the user-table the attribute called district is part of the users address and for that reason one might think that it should be placed inside the user-table like the rest of the address. But this is a mistake. Because the district is actually not directly bound to the user as much as it is bound to the postal number. There is never a case where the same postal number is combined with two different districts. The district follow the number in all cases. If a user moves the address changes. The district also changes, but it does so mainly because the postal code is changed. For this reason the postal number and the district make up their own table, where the postal number is the primary key and the district is the only attribute. That way potential errors are avoided. Page 17 Keld Marple, February 2010 Version 1.4 Redundancy Generally speaking the normalisation phase is about avoiding errors. The worst error in a database is repeating data. The technical term for this is redundancy. The reason why this has to be avoided, is that it is the biggest source of serious errors. If data which means the same (like a postal district) is repeated many times throughout the database, it can be difficult to guaranty consistence. If for some reason the district is spelled wrongly in one instance, that error will be very hard to find. Data dictionary When the database model is finished, you can move on to the data dictionary. The data dictionary is an important document for the programmer who's implementing the database. It is also an invaluable document when it comes to maintaining the database. If you for some reason want to change the database months after the launch of the application, you will need a document that tells you exactly what's what. Don't assume that you'll be able to remember anything in half a year. You wont. Databases quickly get very big, and they are structured in a way not necessarily suited for the human mind. Documentation is the answer, and the data dictionary is the most detailed document when it comes to attributes. Here is an example of a data dictionary for one of the entities in the used car data model: car Datatype Range Default NULL unique indexed Foreign Auto- value Key (FK) increment car_id int No Yes model varchar 30 No colour int No To colour doors tinyint 4 No volume int No year year Yes A documentation like this tells the technician that implements the database how to program the database. You should not expect the technicians to figure out for themselves the datatypes, range, default values etc. for the attributes. Not because technicians are less intelligent, but because nobody can maintain a clear picture of the whole system while typing at the keyboard. How should a technician know if you've decided your varchars to be 30 or 40 characters in length. How would you know, if you took over from somebody else? Page 18 Keld Marple, February 2010