Skip to main content

Designing a Normalized Database

Designing a Normalized Database from Microsoft

Tables representing propositions about entities of one type (that is, representing a single set) are fully normalized. Correct and complete mapping of a conceptual ORM model to a logical model yields fully normalized tables. Properly designed entities in an ER model lead to fully normalized tables as well. However, both ORM and ER modeling start with the business description of a problem; it is possible to miss some dependencies between entities and leave some tables denormalized. Of course, there could also be a bug in the tool that produces the DDL script from the ORM and ER models. However, any denormalization can lead to update anomalies. Data integrity and consistency are fundamental for databases. Remember that a database holds propositions, and propositions are facts. If propositions are not true, they are not facts; they are falsehoods. You need a logical method that yields a fully normalized database.
Normalization is the process of redesigning the model to unbundle any overlapping entities. The process involves decomposition; however, decomposition cannot yield a loss of information. You perform the decomposition by applying a linear progression of rules called normal forms. Normalization eliminates redundancy and incompleteness. Note the part that designers frequently overlook: normalization eliminates incompleteness, not just redundancy. Many normal forms (NFs) are defined; the first six are called first NF, second NF, third NF, Boyce-Codd NF, fourth NF, and fifth NF. If a database is in fifth NF, the database is fully normalized. Only the first three NFs are important; usually, if a database is in third NF, it is in fifth NF as well. You should understand the normalization form and use it to perform a final check of your database design, checking the model you created by using other methods.

First Normal Form

Imagine a table such as the one that Table 2-2 shows. The table holds information about sales. In this case, only the OrderId column is part of the primary key.
Cc505842.table_C02623422_2(en-us,TechNet.10).png
Table 2-2 Table Before First NF
With a design like this, you can have the following anomalies:
  • Insert How do you insert a customer without an order?
  • Update If item Bo is renamed, how do you perform an update?
  • Delete If order 3 is deleted, the data for customer 2 is lost.
  • Select How do you calculate the total quantity of bolts?
Note that only update and select anomalies deal with redundancy: they are problematic because the table contains redundant data. Insert and delete anomalies deal with incompleteness of the model. The rule for first NF is, “A table is in first normal form if all columns are atomic.” This means there can be no multi-valued columns—columns that would hold a collection such as an array or another table. First NF is somewhat redundant with the definition of a relational table or of a relation. A table is a relation if it fulfills the following conditions:
  • Values are atomic. The columns in a relational table are not a repeating group or arrays.
  • Columns are of the same kind. All values in a column come from the same domain.
  • Rows are unique. There is at least one column or set of columns, the values of which uniquely identify each row in the table.
  • The order of columns is insignificant. You can share the same table without worrying about table organization.
  • The sequence of rows is insignificant. A relational table can be retrieved in a different order and sequence.
  • Each column must have a unique name. This is required because the order of columns is not significant.
You can see in the example in Table 2-2 that the last column is multi-valued; it holds an array of items. Before starting with decomposition, let us briefly review the textual notation of a relational table. Remember the earlier example proposition, “Lubor Kollar was employed by Tail-spin Toys on March 19, 2004.” In a general form, you can write “Employee with (Name) was employed by (Company) on (EmploymentDate).” This generalized form of a proposition is a predicate. Terms in parentheses are value placeholders (entity attributes). A predicate defines the structure of a table. You can write the structure briefly as:
Employees(EmpId, EmployeeName, CompanyId, EmploymentDate) 
Underlined columns form the primary key. Actually, they form a candidate key, and a table can have multiple candidate keys. You could underline all candidate keys and double underline the primary key.
You decompose the table shown in Table 2-2 on the Items column. Every item leads to a new row, and every atomic piece of data of a single item (ProductId, ProductName, Quantity) leads to a new column. After the decomposition, you have multiple rows for a single order; therefore, you have to expand the primary key. You can compose the new primary key from the OrderId and ProductId columns. However, suppose you can allow multiple products on a single order, each time with a different discount, for example. Thus, you cannot use ProductId as part of the primary key. However, you can add the ItemId attribute and use it as a part of the new primary key. A decomposed table in first normal form would look like this:
Orders(OrderId, CustomerId, OrderDate, ItemId, ProductId, Quantity, 
ProductName)
Before moving to second NF, you have to understand a common misconception about first NF. You might have heard or read that you should not have a repeating group of columns. However, this advice is incorrect; repeating groups means you should not have a repeating group (that is, a collection) in a single column. For example, imagine this table:
Employees(EmployeeId, EmployeeName, Child1, Child2)
This table is perfectly in first NF. This design has a built-in constraint, allowing only employees who have two children. If you do not allow unknown (NULL) values for the Child1 and Child2 attributes, then you allow employees with exactly two children. This kind of constraint is not typical for business; nevertheless, it is a constraint built into the model, which is in first NF. Such constraints are rare, and a repeating group of columns typically represents a hidden collection. Take care not to decompose such groups automatically before checking whether this is a special constraint.

Second Normal Form

After achieving first NF, the decomposed table from Table 2-2 looks like Table 2-3.
Cc505842.table_C02623422_3(en-us,TechNet.10).png
Table 2-3 Table in First NF
You still have the following anomalies:
  • Insert How do you insert a customer without an order?
  • Update If customer 1 changes the order date for order 1, how do you perform the update? (In many places, possible inconsistencies could exist.)
  • Delete If you delete order 3, the data for customer 2 is lost.
To achieve second NF, a table must be in first NF, and every non-key column must be fully functionally dependent on the entire primary key. This means that no column can depend on part of the primary key only. In the example in Table 2-3, you know the customer and the order date if you know the value of the OrderId column; you do not need to know anything about ProductId, which is part of the primary key. The CustomerId and OrderDate columns depend on part of the primary key only—OrderId. To achieve second NF, you need to decompose the table into two tables:
Orders(OrderId, CustomerId, OrderDate)
OrderDetails(OrderId, ItemId, ProductId, Quantity, ProductName)
In the Orders table, you leave attributes that depend on OrderId only; then you introduce a new table, OrderDetails, to hold the other attributes. When achieving first NF, you are converting values from a multi-valued attribute to rows and changing the primary key; for second and all other NFs, you decompose tables into more tables. Second NF deals with relationships between columns that are part of a key and other columns.
After decomposing to multiple tables, you must have some common value that enables you to join the tables in queries; otherwise, you would lose some information. The decomposition has to be lossless. Of course, you need relationships between tables. A relationship is an association between two or more tables. Relationships are expressed in the data values of the primary and foreign keys. A primary key is a column or columns in a table whose values uniquely identify each row in the table. A foreign key is a column or columns whose values are the same as the primary key of another table—in other words, a copy of the primary key from another relational table. The relationship is made between two relational tables by matching the values of the foreign key with the values of the primary key.

Third Normal Form

After achieving second NF, the decomposed tables from Table 2-3 look like the tables in Table 2-4 and Table 2-5. Note that in the Orders table (Table 2-4), another attribute, CustomerName, is added to show that normalization violations can appear in any table.
Cc505842.Table_C02623422_4(en-us,TechNet.10).png
Table 2-4 Orders Table in Second NF
Cc505842.Table_C02623422_5(en-us,TechNet.10).png
Table 2-5 OrderDetails Table in Second NF
Second NF solves the update anomaly (if customer 1 changes the order date for order 1); however, you still have the following anomalies:
  • Insert How do you insert a customer without an order?
  • Delete If you delete order 3, the data for customer 2 is lost.
To achieve third NF, a table must be in second NF, and every non-key column must be non-transitively dependent on the primary key. For example, in Table 2-4, from OrderId, you can find CustomerId; then from CustomerId, you can get transitively to the CustomerName attribute value. Similarly, in Table 2-5, you can get transitively to ProductName through Pro-ductId from OrderId and ItemId. If you think of the rule for third NF from the non-key attributes point of view, it simply means you should have no functional dependencies between non-key columns. Non-key columns must depend on keys only. In the examples in Table 2-4 and Table 2-5, CustomerName depends on CustomerId, and ProductName depends on Pro-ductId. Thus, to achieve third NF, you must create new tables for dependencies between non-key columns:
Customers(CustomerId, CustomerName) Orders(OrderId, CustomerId, OrderDate) 
Products(ProductId, ProductName) 
OrderDetails(OrderId, ItemId, ProductId, Quantity)
This schema is free from all the update anomalies you had before normalization. However, it is not free from all update anomalies. For example, the schema itself cannot prevent you from inserting an unreasonable order date. (You will learn more about additional constraints in Chapter 3, “Designing a Physical Database.”) Note that this schema is also essentially the same (except for a couple of attributes omitted for the sake of brevity) as you received by using the ORM and ER approach. As mentioned earlier, use normalization for final checking and refining of your model.

Practice: Normalizing the Database

You are developing a database model that will support an application for managing projects (as in the Quick Check in Lesson 1, “Systematically Approaching Design Stages”). You collect the following information: each project has a single customer, each project can have many activities, and each project can have many employees assigned to it. You want to follow time spent (in hours) on projects by specific employee by activity for each day. Your initial design is:
Projects(ProjectId, ProjectName, CustomerId, CustomerName,
Activities(Activity1Id, Activity1Name, …, ActivityNId, 
ActivityNName), Employees(Employee1Id, Employee1Name, …, 
EmployeeNId, EmployeeNName), WorkDate, TimeSpent)

Exercise 1: Achieve the First Normal Form

In this exercise, you will bring this model to first NF. To achieve first NF, you need to eliminate all attributes that are collections.
  1. Check the Activities part of the table. Is this a collection?
  2. Check the Employees part of the table. Is this a collection? Your model should look like this:
Projects(ProjectId, ItemId, ProjectName, CustomerId, CustomerName, 
        ActivityId, ActivityName, EmployeeId, EmployeeName, WorkDate, 
        TimeSpent)

Exercise 2: Achieve the Second Normal Form

In this exercise, you will bring this model to second NF. To achieve second NF, you must make sure your model does not contain attributes that depend on only part of the primary key.
  1. The complete primary key in the table you created in Exercise 1, “Achieve the First Normal Form,” consists of ProjectId and ItemId.
  2. Do you really need both columns to find CustomerId and CustomerName associated with a project?
Your model should look like this:
Projects(ProjectId, ProjectName, CustomerId, CustomerName) 
        ProjectDetails(ProjectId, ItemId, ActivityId, ActivityName, 
        EmployeeId, EmployeeName, WorkDate, TimeSpent)

Exercise 3: Achieve the Third Normal Form

In this exercise, you will bring this model to third NF. To achieve third NF, you need to look at dependencies between non-key attributes.
  1. Is there any dependency between CustomerId and CustomerName?
  2. Is there any dependency between ActivityId and ActivityName?
  3. Is there any dependency between EmployeeId and EmployeeName? Your model should now look like this:
Projects(ProjectId, ProjectName, CustomerId)
       ProjectDetails(ProjectId, ItemId, ActivityId, EmployeeId, 
       WorkDate, TimeSpent) 
Customers(CustomerId, CustomerName) 
Activities(ActivityId, ActivityName)
Employees(EmployeeId, EmployeeName) 

Comments

Popular posts from this blog

Microstrategy Custom number formatting symbols

Custom number formatting symbols If none of the built-in number formats meet your needs, you can create your own custom format in the Number tab of the Format Cells dialog box. Select  Custom  as the Category and create the format using the number format symbols listed in the table below. Each custom format can have up to four optional sections, one each for: Positive numbers Negative numbers Zeros Text Each section is optional. Separate the sections by semicolons, as shown in the example below: #,###;(#,###);0;"Error: Entry must be numeric" For more examples, see  Custom number formatting examples . To jump to a section of the formatting symbol table, click one of the following: Numeric symbols Character/text symbols Date and time symbols Text color symbols Currency symbols Conditional symbols Numeric symbols For details on how numeric symbols apply to the Big Decimal data type, refer to the  Project Design Guide . ...

Reduce Intelligent Cube Size By Finding Intelligent Cube Objects Which Are Not In Use

Reduce Intelligent Cube Size By Finding Intelligent Cube Objects Which Are Not In Use If the i-cubes can potentially be reduced in size an audit can be performed on the cube objects to see which cube objects are not being used by any of the view reports, documents, or dossiers.   The below are examples for a few of the common metadata database platforms . NOTE: To perform this audit, queries are run against the MicroStrategy metadata database. Ensure a metadata backup is taken prior to performing the below actions. Steps: 1) Identify the object ID of the Intelligent cube to be audited by checking the objects Property window 2) Identify the object ID of the project this cube exists within by opening the Project Configuration Sample Cube ID =   CFAF1E9B4D53990698C42E87C7AF2EB5 Sample Project ID =  B7CA92F04B9FAE8D941C3E9B7E0CD754   3) Run the below SQL against the metadata database by replacing the Cube ID and Project ID within the respective ...

Compound key attribute

Compound key attribute A compound key attribute is an attribute whose primary key is made up by the combination of two or more columns. The multiple columns are joined with other attributes, typically in a many-to-many relationship . To create a compound key, users must create multiple attribute forms, group them together and set the form group as the key for the attribute. Use the same steps as specified in the help menu: Open attribute editor (right-click on attribute and select 'Edit') Select the forms that will make up the compound key From the 'Edit' menu, choose 'Group' NOTE: Modifying the key form will trigger required updates when saving. This may make related application objects (reports, filters and metrics) unusable. Click on 'Yes' to continue, when prompted, in a dialog box, to confirm this action Save the changes. Choose 'Update Schema' from the Schema menu

Algorithm to calculate Logical Table Size in Microstrategy

How are the fact tables determined using the logical table size for SQL generation in MicroStrategy The logical table size is an integer number that represents the granularity or level of aggregation of a particular table. It is called 'logical' because it is not related to the physical size of the tables (number of rows). It is calculated according to the attribute IDs that are present in the table and their level in the system hierarchy.   Even though, the number does not reveal the actual number of rows in the table, it is an accurate way of measuring a table size without having to access its contents.   IMPORTANT:   The system hierarchy is defined by the parent-child relationships between attributes of the same family (formerly known as a dimension), not by user-defined hierarchies (i.e., drilling hierarchies).   MicroStrategy Engine utilizes an algorithm based on attribute keys to calculate the Logical Table Size (LTS): Given the following tables: ...

Microstrategy Removing sections that do not have metric data

Removing sections that do not have metric data This is an interesting feature which might not be explored by many of us and it comes us handy. A  cross join between datasets can result in rows or Group Header/Footer sections that do not have metric data. For example, a document contains two datasets. Dataset 1 contains Year and Revenue, with data for three years (2007-2009). Dataset 2 contains Year and Profit, filtered to return data for only two years (2008 and 2009). If you place Year and Profit in the Details and execute the document, it displays three rows, although no profit data exists for 2007. This is a product of the cross join between the two datasets. You do not want to see the blank line for 2007 since it does not give you any data for profit. You can select the  Trim sections for which no metric value data is available  check box. This removes the row for 2007, since no metric data for Profit is available for 2007. The results are shown below: ...

Create a transaction services photo uploader

Create a transaction services photo uploader   1.  Create a new table "photo_upload" in Tutorial warehouse database (the default location: C:\Program Files\MicroStrategy\Tutorial Reporting\TUTORIAL_DATA_7200.mdb), as shown below:    2. The 'photo_upload' table has to be pre-populated with *exactly* 10 rows of data, the values for the 'ID' column should be 1-10 and the values for the 'uploaded' column should all be 0 3.  In MicroStrategy Desktop, create a freeform report "R1" based on the new table "photo_upload" in Tutorial data created at step 1, as shown below:   SELECT Location, Description, ID, uploaded, numbers FROM PHOTO_UPLOAD 4.  Create another table for transaction insert SQL. Make sure to create an 'autonumber' type ID as primary key for this table, or auto_increment ID for different DBs.                     5. Create...

Predictive modelling in Data Science using Microstrategy

Creating a predictive modelling in MicroStrategy MicroStrategy Data Mining Services has been evolving to include more data mining algorithms and functionality. One key feature is MicroStrategy Developer’s Training Metric Wizard. The Training Metric Wizard can be used to create several different types of predictive models including linear and exponential regression, logistic regression, decision tree, cluster, time series, and association rules. Linear and exponential regression The linear regression data mining technique should be familiar to you if you have ever tried to extrapolate or interpolate data, tried to find the line that best fits a series of data points, or used Microsoft Excel’s LINEST or LOGEST functions. Regression analyzes the relationship between several predictive inputs, or independent variables, and a dependent variable that is to be predicted. Regression finds the line that best fits the data, with a minimum of error. For example, you have a dataset ...

Scheduling a report or document to be sent to an FTP in MSTR

Scheduling a report or document to be sent to an FTP server You can have a report or document automatically delivered to a location on your FTP server on a specific schedule. To do so, you must subscribe to the report or document, as described in the steps below. You can customize your subscription by typing macros in the  File Name ,  Sub-folder , or  Zip File Name  fields. These macros are automatically replaced with the appropriate text when the report or document is delivered. For example, you create a subscription to a document. If you type  {&Project}  in the  File Name field, the name of the project in which the document is saved is displayed in the name of the document when it is delivered. • This procedure assumes that an administrator has already added your FTP server as a new device in Developer. Steps to do so are included in the  System Administrator Help . To send a report or document to an FTP server on a schedule ...

Settings for Outer Join between metrics in MicroStrategy

Settings for Outer Join between metrics in MicroStrategy MicroStrategy adopts multi-pass logic to determine the execution plan for a report. This means that every metric is evaluated in separate SQL passes. Outer Joins come into play when MicroStrategy Engine merges the results from all SQL passes into one report. For a multi-pass report, different Outer Join behaviors can give the user completely different results. In addition, report metrics can be of different types which can, in some cases, influence the result of the outer join. In MicroStrategy, there are two settings that users can access to control Outer Join behavior : Formula Join Type and Metric Join Type . Metric Join Type: VLDB Setting at Database Instance Level Report and Template Levels Report Editor > Data > Report Data Options Metric Level   Metric editor > Tools > Metric Join Type Control Join between Metrics Formula Join Type: Only at Compound/Split...