Database Design - 2nd Edition by Adrienne Watt and Nelson Eng is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.
Unless otherwise noted within this book, this book is released under a Creative Commons Attribution 4.0 International License also known as a CC-BY license. This means you are free to copy, redistribute, modify or adapt this book. Under this license, anyone who redistributes or modifies this textbook, in whole or in part, can do so for free providing they properly attribute the book.
Additionally, if you redistribute this textbook, in whole or in part, in either a print or digital format, then you must retain on every physical and/or electronic page the following attribution:
Download this book for free at http://open.bccampus.ca
Cover image: Spiral Stairs In Milano old building downtown by Michele Ursino used under a CC-BY-SA 2.0 license.
The primary purpose of this text is to provide an open source textbook that covers most introductory database courses. The material in the textbook was obtained from a variety of sources. All the sources are found at the end of each chapter. I expect, with time, the book will grow with more information and more examples.
I welcome any feedback that would improve the book. If you would like to add a section to the book, please let me know.
Adrienne Watt
Database Design – 2nd Edition is a remix and adaptation, based on Adrienne Watt’s book, Database Design. Works that are part of the remix for this book are listed at the end of each chapter. For information about what was changed in this adaptation, refer to the Copyright statement at the bottom of the home page.
This adaptation was a part of the B.C. Open Textbook project.
In October 2012, the B.C. Ministry of Advanced Education announced its support for the creation of open textbooks for the 40 highest-enrolled first and second year subject areas in the province’s public post-secondary system.
Open textbooks are open educational resources (OER); they are instructional resources created and shared in ways so that more people have access to them. This is a different model than traditionally copyrighted materials. OER are defined as “teaching, learning, and research resources that reside in the public domain or have been released under an intellectual property license that permits their free use and re-purposing by others” (Hewlett Foundation).
Our open textbooks are openly licensed using a Creative Commons license, and are offered in various e-book formats free of charge, or as printed books that are available at cost.
For more information about the origin of this project, please contact opentext@bccampus.ca.
If you are an instructor who is using this book for a course, please let eCampusOntario know through its adoption form.
This book has been a wonderful experience in the world of open textbooks. It’s amazing to see how much information is available to be shared. I would like to thank Nguyen Kim Anh of OpenStax College, for her contribution of database models and the relational design sections. I would also like to thank Dr. Gordon Russell for the section on normalization. His database resources were wonderful. Open Learning University in the UK provided me with a great ERD example. In addition, Tom Jewett provided some invaluable UML contributions.
I would also like to thank my many students over the years and a special instructor, Mitra Ramkay (BCIT). He is fondly remembered for the mentoring he provided when I first started teaching relational databases 25 years ago. Another person instrumental in getting me started in creating an open textbook is Terrie MacAloney. She was encouraging and taught me to think outside the box.
A special thanks goes to my family for the constant love and support I received throughout this project.
I would like to thank the many people who helped in this edition including my students at Douglas College and my colleague Ms. Adrienne Watt.
We would like to particularly thank Lauri Aesoph at BCcampus for her perseverance and hard work while editing the book. She did an amazing job.
If you adopt this book, as a core or supplemental resource, please report your adoption in order for us to celebrate your support of students’ savings. Report your commitment at www.openlibrary.ecampusontario.ca.
We invite you to further adapt this book to meet your and your students’ needs. Please let us know if you do! If you would like to use Pressbooks, the platform used to make this book, contact eCampusOntario for an account using open@ecampusontario.ca.
If this text does not meet your needs, please check out our full library at www.openlibrary.ecampusontario.ca. If you still cannot find what you are looking for, connect with colleagues and eCampusOntario to explore creating your own open education resource (OER).
eCampusOntario is a not-for-profit corporation funded by the Government of Ontario. It serves as a centre of excellence in online and technology-enabled learning for all publicly funded colleges and universities in Ontario and has embarked on a bold mission to widen access to post-secondary education and training in Ontario. This textbook is part of eCampusOntario’s open textbook library, which provides free learning resources in a wide range of subject areas. These open textbooks can be assigned by instructors for their classes and can be downloaded by learners to electronic devices or printed for a low cost by our printing partner, The University of Waterloo. These free and open educational resources are customizable to meet a wide range of learning needs, and we invite instructors to review and adopt the resources for use in their courses.
The way in which computers manage data has come a long way over the last few decades. Today’s users take for granted the many benefits found in a database system. However, it wasn’t that long ago that computers relied on a much less elegant and costly approach to data management called the file-based system.
One way to keep information on a computer is to store it in permanent files. A company system has a number of application programs; each of them is designed to manipulate data files. These application programs have been written at the request of the users in the organization. New applications are added to the system as the need arises. The system just described is called the file-based system.
Consider a traditional banking system that uses the file-based system to manage the organization’s data shown in Figure 1.1. As we can see, there are different departments in the bank. Each has its own applications that manage and manipulate different data files. For banking systems, the programs may be used to debit or credit an account, find the balance of an account, add a new mortgage loan and generate monthly statements.
Using the file-based system to keep organizational information has a number of disadvantages. Listed below are five examples.
Often, within an organization, files and applications are created by different programmers from various departments over long periods of time. This can lead to data redundancy, a situation that occurs in a database when a field needs to be updated in more than one table. This practice can lead to several problems such as:
Data isolation is a property that determines when and how changes made by one operation become visible to other concurrent users and systems. This issue occurs in a concurrency situation. This is a problem because:
Problems with data integrity are another disadvantage of using a file-based system. Data integrity refers to the maintenance and assurance that the data in a database are correct and consistent. Factors to consider when addressing this issue are:
Security can be a problem with a file-based approach because:
Concurrency is the ability of the database to allow multiple users access to the same record without adversely affecting transaction processing. A file-based system must manage, or prevent, concurrency by the application programs. Typically, in a file-based system, when an application opens a file, that file is locked. This means that no one else has access to the file at the same time.
In database systems, concurrency is managed thus allowing multiple users access to the same record. This is an important difference between database and file-based systems.
The difficulties that arise from using the file-based system have prompted the development of a new approach in managing large amounts of organizational information called the database approach.
Databases and database technology play an important role in most areas where computers are used, including business, education and medicine. To understand the fundamentals of database systems, we will start by introducing some basic concepts in this area.
Everybody uses a database in some way, even if it is just to store information about their friends and family. That data might be written down or stored in a computer by using a word-processing program or it could be saved in a spreadsheet. However, the best way to store data is by using database management software. This is a powerful software tool that allows you to store, manipulate and retrieve data in a variety of different ways.
Most companies keep track of customer information by storing it in a database. This data may include information about customers, employees, products, orders or anything else that assists the business with its operations.
Data are factual information such as measurements or statistics about objects and concepts. We use data for discussions or as part of a calculation. Data can be a person, a place, an event, an action or any one of a number of things. A single fact is an element of data, or a data element.
If data are information and information is what we are in the business of working with, you can start to see where you might be storing it. Data can be stored in:
All of these items store information, and so too does a database. Because of the mechanical nature of databases, they have terrific power to manage and process the information they hold. This can make the information they house much more useful for your work.
With this understanding of data, we can start to see how a tool that can store a collection of data, organize it, search it rapidly, and retrieve and process it might make a difference to how we use data. This book and the chapters that follow are all about managing information.
concurrency: the ability of the database to allow multiple users access to the same record without adversely affecting transaction processing
data element: a single fact or piece of information
data inconsistency: a situation where various copies of the same data are conflicting
data isolation: a property that determines when and how changes made by one operation become visible to other concurrent users and systems
data integrity: refers to the maintenance and assurance that the data in a database are correct and consistent
data redundancy: a situation that occurs in a database when a field needs to be updated in more than one table
database approach: allows the management of large amounts of organizational information
database management software: a powerful software tool that allows you to store, manipulate and retrieve data in a variety of ways
file-based system: an application program designed to manipulate data files
This chapter of Database Design (including its images, unless otherwise noted) is a derivative copy of Database System Concepts by Nguyen Kim Anh licensed under a Creative Commons Attribution 3.0 License.
The following material was written by Adrienne Watt:
A database is a shared collection of related data used to support the activities of a particular organization. A database can be viewed as a repository of data that is defined once and then accessed by various users as shown in Figure 2.1.
A database has the following properties:
A database can contain many tables. For example, a membership system may contain an address table and an individual member table as shown in Figure 2.2. Members of Science World are individuals, group homes, businesses and corporations who have an active membership to Science World. Memberships can be purchased for a one- or two-year period, and then renewed for another one- or two-year period.
In Figure 2.2, Minnie Mouse renewed the family membership with Science World. Everyone with membership ID#100755 lives at 8932 Rodent Lane. The individual members are Mickey Mouse, Minnie Mouse, Mighty Mouse, Door Mouse, Tom Mouse, King Rat, Man Mouse and Moose Mouse.
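To make this concrete, here is a minimal SQL sketch of what these two tables might look like; the table and column names are illustrative assumptions, not the actual Science World schema.

```sql
-- Hypothetical sketch of the membership tables in Figure 2.2.
-- Names and columns are assumptions for illustration only.
CREATE TABLE Address (
    MembershipID  INT PRIMARY KEY,      -- e.g., 100755
    StreetAddress VARCHAR(100),         -- e.g., '8932 Rodent Lane'
    ExpiryDate    DATE                  -- one- or two-year memberships
);

CREATE TABLE IndividualMember (
    MemberName   VARCHAR(50),           -- e.g., 'Minnie Mouse'
    MembershipID INT REFERENCES Address (MembershipID)  -- links each member to the shared membership
);
```

One Address row can be shared by many IndividualMember rows, which is why everyone with membership ID #100755 lives at the same address.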
A database management system (DBMS) is a collection of programs that enables users to create and maintain databases and control all access to them. The primary goal of a DBMS is to provide an environment that is both convenient and efficient for users to retrieve and store information.
With the database approach, we can have the traditional banking system as shown in Figure 2.3. In this bank example, a DBMS is used by the Personnel Department, the Account Department and the Loan Department to access the shared corporate database.
data elements: facts that represent real-world information
database: a shared collection of related data used to support the activities of a particular organization
database management system (DBMS): a collection of programs that enables users to create and maintain databases and control all access to them
table: a combination of fields
This chapter of Database Design (including images, except as otherwise noted) is a derivative copy of Database System Concepts by Nguyen Kim Anh licensed under a Creative Commons Attribution 3.0 License.
The following material was written by Nelson Eng:
The following material was written by Adrienne Watt:
Managing information means taking care of it so that it works for us and is useful for the tasks we perform. By using a DBMS, the information we collect and add to its database is no longer subject to accidental disorganization. It becomes more accessible and integrated with the rest of our work. Managing information using a database allows us to become strategic users of the data we have.
We often need to access and re-sort data for various uses. These may include:
The processing power of a database allows it to manipulate the data it houses, so it can:
Because of the versatility of databases, we find them powering all sorts of projects. A database can be linked to:
There are a number of characteristics that distinguish the database approach from the file-based system or approach. This chapter describes the benefits (and features) of the database system.
A database system is referred to as self-describing because it not only contains the database itself, but also metadata which defines and describes the data and relationships between tables in the database. This information is used by the DBMS software or database users if needed. This separation of data and information about the data makes a database system totally different from the traditional file-based system in which the data definition is part of the application programs.
In the file-based system, the structure of the data files is defined in the application programs so if a user wants to change the structure of a file, all the programs that access that file might need to be changed as well.
On the other hand, in the database approach, the data structure is stored in the system catalogue and not in the programs. Therefore, one change is all that is needed to change the structure of a file. This insulation between the programs and data is also called program-data independence.
A database supports multiple views of data. A view is a subset of the database, which is defined and dedicated for particular users of the system. Multiple users in the system might have different views of the system. Each view might contain only the data of interest to a user or group of users.
Current database systems are designed for multiple users. That is, they allow many users to access the same database at the same time. This access is achieved through features called concurrency control strategies. These strategies ensure that the data accessed are always correct and that data integrity is maintained.
The design of modern multiuser database systems is a great improvement from those in the past which restricted usage to one person at a time.
In the database approach, ideally, each data item is stored in only one place in the database. In some cases, data redundancy still exists to improve system performance, but such redundancy is controlled by application programming and kept to a minimum by introducing as little redundancy as possible when designing the database.
The integration of all the data, for an organization, within a database system has many advantages. First, it allows for data sharing among employees and others who have access to the system. Second, it gives users the ability to generate more information from a given amount of data than would be possible without the integration.
Database management systems must provide the ability to define and enforce certain constraints to ensure that users enter valid information and maintain data integrity. A database constraint is a restriction or rule that dictates what can be entered or edited in a table such as a postal code using a certain format or adding a valid city in the City field.
There are many types of database constraints. Data type, for example, determines the sort of data permitted in a field, for example numbers only. Data uniqueness such as the primary key ensures that no duplicates are entered. Constraints can be simple (field based) or complex (programming).
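As a hedged sketch of how simple, field-based constraints look in SQL (the table and columns here are invented for illustration; complex constraints are typically implemented with triggers or application code):

```sql
-- Simple (field-based) constraints declared directly on columns.
CREATE TABLE Member (
    MemberID   INT         PRIMARY KEY,          -- data uniqueness: no duplicate IDs
    City       VARCHAR(50) NOT NULL,             -- a value must be entered
    Age        INT         CHECK (Age >= 0),     -- data type plus a validity rule
    PostalCode CHAR(7)     CHECK (PostalCode LIKE '___ ___')  -- rough format check
);
```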
Not all users of a database system will have the same accessing privileges. For example, one user might have read-only access (i.e., the ability to read a file but not make changes), while another might have read and write privileges, which is the ability to both read and modify a file. For this reason, a database management system should provide a security subsystem to create and control different types of user accounts and restrict unauthorized access.
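In SQL, such account-level privileges are commonly managed with GRANT and REVOKE statements; the account names below are made up for illustration.

```sql
-- Read-only access: the account can query but not modify the table.
GRANT SELECT ON Employee TO reporting_user;

-- Read and write privileges: querying plus modification.
GRANT SELECT, INSERT, UPDATE ON Employee TO payroll_clerk;

-- Privileges can also be withdrawn.
REVOKE UPDATE ON Employee FROM payroll_clerk;
```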
Another advantage of a database management system is how it allows for data independence. In other words, the system data descriptions or data describing data (metadata) are separated from the application programs. This is possible because changes to the data structure are handled by the database management system and are not embedded in the program itself.
A database management system must include concurrency control subsystems. This feature ensures that data remains consistent and valid during transaction processing even if several users update the same information.
By its very nature, a DBMS permits many users to have access to its database either individually or simultaneously. It is not important for users to be aware of how and where the data they access is stored.
Backup and recovery are methods that allow you to protect your data from loss. The database system provides a separate process, from that of a network backup, for backing up and recovering data. If a hard drive fails and the database stored on the hard drive is not accessible, the only way to recover the database is from a backup.
If a computer system fails in the middle of a complex update process, the recovery subsystem is responsible for making sure that the database is restored to its original state. These are two more benefits of a database management system.
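As one product-specific illustration, Transact-SQL on Microsoft SQL Server exposes backup and restore as statements; the database name and file path below are assumptions.

```sql
-- Full backup of a hypothetical CompanyDB database.
BACKUP DATABASE CompanyDB
    TO DISK = 'D:\Backups\CompanyDB.bak';

-- Restore from that backup after a failure.
RESTORE DATABASE CompanyDB
    FROM DISK = 'D:\Backups\CompanyDB.bak';
```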
concurrency control strategies: features of a database that allow several users access to the same data item at the same time
data type: determines the sort of data permitted in a field, for example numbers only
data uniqueness: ensures that no duplicates are entered
database constraint: a restriction that determines what is allowed to be entered or edited in a table
metadata: defines and describes the data and relationships between tables in the database
read and write privileges: the ability to both read and modify a file
read-only access: the ability to read a file but not make changes
self-describing: a database system is referred to as self-describing because it not only contains the database itself, but also metadata which defines and describes the data and relationships between tables in the database
view: a subset of the database
This chapter of Database Design is a derivative copy of Database System Concepts by Nguyen Kim Anh licensed under a Creative Commons Attribution 3.0 License.
The following material was written by Adrienne Watt:
High-level conceptual data models provide concepts for presenting data in ways that are close to the way people perceive data. A typical example is the entity relationship model, which uses main concepts like entities, attributes and relationships. An entity represents a real-world object such as an employee or a project. The entity has attributes that represent properties such as an employee’s name, address and birthdate. A relationship represents an association among entities; for example, an employee works on many projects. A relationship exists between the employee and each project.
Record-based logical data models provide concepts users can understand but are not too far from the way data is stored in the computer. Three well-known data models of this type are relational data models, network data models and hierarchical data models.
hierarchical model: represents data as a hierarchical tree structure
instance: a record within a table
network model: represents data as record types
relation: another term for table
relational model: represents data as relations or tables
set type: a limited type of one to many relationship
This chapter of Database Design is a derivative copy of Database System Concepts by Nguyen Kim Anh licensed under a Creative Commons Attribution 3.0 License.
The following material was written by Adrienne Watt:
Data modelling is the first step in the process of database design. This step is sometimes considered to be a high-level and abstract design phase, also referred to as conceptual design. The aim of this phase is to describe:
In the second step, the data items, the relationships and the constraints are all expressed using the concepts provided by the high-level data model. Because these concepts do not include the implementation details, the result of the data modelling process is a (semi)formal representation of the database structure. This result is quite easy to understand, so it is used as a reference to make sure that all the user’s requirements are met.
The third step is database design. During this step, we might have two sub-steps: one called database logical design, which defines a database in a data model of a specific DBMS, and another called database physical design, which defines the internal database storage structure, file organization or indexing techniques. These two sub-steps are followed by the database implementation and operations/user interfaces building steps.
In the database design phases, data are represented using a certain data model. The data model is a collection of concepts or notations for describing data, data relationships, data semantics and data constraints. Most data models also include a set of basic operations for manipulating data in the database.
In this section we will look at the database design process in terms of specificity. Just as any design starts at a high level and proceeds to an ever-increasing level of detail, so does database design. For example, when building a home, you start with how many bedrooms and bathrooms the home will have, whether it will be on one level or multiple levels, etc. The next step is to get an architect to design the home from a more structured perspective. This level gets more detailed with respect to actual room sizes, how the home will be wired, where the plumbing fixtures will be placed, etc. The last step is to hire a contractor to build the home. That’s looking at the design from a high level of abstraction to an increasing level of detail.
The database design is very much like that. It starts with users identifying the business rules; then the database designers and analysts create the database design; and then the database administrator implements the design using a DBMS.
The following subsections summarize the models in order of decreasing level of abstraction.
The three best-known models of this kind are the relational data model, the network data model and the hierarchical data model. These internal models:
In a pictorial view, you can see how the different models work together. Let’s look at this from the highest level, the external model.
The external model is the end user’s view of the data. Typically a database is an enterprise system that serves the needs of multiple departments. However, one department is not interested in seeing other departments’ data (e.g., the human resources (HR) department does not care to view the sales department’s data). Therefore, one user view will differ from another.
The external model requires that the designer subdivide a set of requirements and constraints into functional modules that can be examined within the framework of their external models (e.g., human resources versus sales).
As a data designer, you need to understand all the data so that you can build an enterprise-wide database. Based on the needs of various departments, the conceptual model is the first model created.
At this stage, the conceptual model is independent of both software and hardware. It does not depend on the DBMS software used to implement the model. It does not depend on the hardware used in the implementation of the model. Changes in either hardware or DBMS software have no effect on the database design at the conceptual level.
Once a DBMS is selected, you can then implement it. This is the internal model. Here you create all the tables, constraints, keys, rules, etc. This is often referred to as the logical design.
The physical model is simply the way the data is stored on disk. Each database vendor has its own way of storing the data.
A schema is an overall description of a database, and it is usually represented by the entity relationship diagram (ERD). There are many subschemas that represent external models and thus display external views of the data. Below is a list of items to consider during the design process of a database.
Data independence refers to the immunity of user applications to changes made in the definition and organization of data. Data abstractions expose only those items that are important or pertinent to the user. Complexity is hidden from the database user.
Data independence and operation independence together form the feature of data abstraction. There are two types of data independence: logical and physical.
A logical schema is a conceptual design of the database done on paper or a whiteboard, much like architectural drawings for a house. The ability to change the logical schema, without changing the external schema or user view, is called logical data independence. For example, the addition or removal of new entities, attributes or relationships to this conceptual schema should be possible without having to change existing external schemas or rewrite existing application programs.
In other words, changes to the logical schema (e.g., alterations to the structure of the database like adding a column or other tables) should not affect the function of the application (external views).
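For example, sketched in SQL (the table, view and column names are assumed for illustration), an application that reads through a view is insulated from a change to the logical schema, such as a newly added column:

```sql
-- External view: exposes only the columns the application needs.
CREATE VIEW EmployeePhoneList AS
    SELECT EmpName, Phone
    FROM Employee;

-- A logical-schema change; EmployeePhoneList and its applications are unaffected.
ALTER TABLE Employee ADD Birthdate DATE;
```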
Physical data independence refers to the immunity of the internal model to changes in the physical model. The logical schema stays unchanged even though changes are made to file organization or storage structures, storage devices or indexing strategy.
Physical data independence deals with hiding the details of the storage structure from user applications. The applications should not be involved with these issues, since there is no difference in the operation carried out against the data.
conceptual model: the logical structure of the entire database
conceptual schema: another term for logical schema
data independence: the immunity of user applications to changes made in the definition and organization of data
data model: a collection of concepts or notations for describing data, data relationships, data semantics and data constraints
data modelling: the first step in the process of database design
database logical design: defines a database in a data model of a specific database management system
database physical design: defines the internal database storage structure, file organization or indexing techniques
entity relationship diagram (ERD): a data model describing the database showing tables, attributes and relationships
external model: represents the user’s view of the database
external schema: user view
internal model: a representation of the database as seen by the DBMS
logical data independence: the ability to change the logical schema without changing the external schema
logical design: where you create all the tables, constraints, keys, rules, etc.
logical schema: a conceptual design of the database done on paper or a whiteboard, much like architectural drawings for a house
operating system (OS): manages the physical level of the physical model
physical data independence: the immunity of the internal model to changes in the physical model
physical model: the physical representation of the database
schema: an overall description of a database
Also see Appendix A: University Registration Data Model Example
This chapter of Database Design is a derivative copy of Database System Concepts by Nguyen Kim Anh licensed under a Creative Commons Attribution 3.0 License.
The following material was written by Adrienne Watt:
Database management systems can be classified based on several criteria, such as the data model, user numbers and database distribution, all described below.
The most popular data model in use today is the relational data model. Well-known DBMSs like Oracle, MS SQL Server, DB2 and MySQL support this model. Other traditional models, such as hierarchical data models and network data models, are still used in industry mainly on mainframe platforms. However, they are not commonly used due to their complexity. These are all referred to as traditional models because they preceded the relational model.
More recently, object-oriented data models were introduced. In this model, information is represented in the form of objects, as used in object-oriented programming. Object-oriented databases are different from relational databases, which are table-oriented. Object-oriented database management systems (OODBMS) combine database capabilities with object-oriented programming language capabilities.
The object-oriented models have not caught on as expected so are not in widespread use. Some examples of object-oriented DBMSs are O2, ObjectStore and Jasmine.
A DBMS can be classified based on the number of users it supports. It can be a single-user database system, which supports one user at a time, or a multiuser database system, which supports multiple users concurrently.
There are four main distribution systems for database systems and these, in turn, can be used to classify the DBMS.
With a centralized database system, the DBMS and database are stored at a single site that is used by several other systems too. This is illustrated in Figure 6.1.
In the early 1980s, many Canadian libraries used the GEAC 8000 to convert their manual card catalogues to machine-readable centralized catalogue systems. Each book catalogue had a barcode field similar to those on supermarket products.
In a distributed database system, the actual database and the DBMS software are distributed across various sites that are connected by a computer network, as shown in Figure 6.2.
Homogeneous distributed database systems use the same DBMS software at multiple sites. Data exchange between these various sites can be handled easily. For example, library information systems by the same vendor, such as Geac Computer Corporation, use the same DBMS software which allows easy data exchange between the various Geac library sites.
In a heterogeneous distributed database system, different sites might use different DBMS software, but there is additional common software to support data exchange between these sites. For example, the various library database systems use the same machine-readable cataloguing (MARC) format to support library record data exchange.
centralized database system: the DBMS and database are stored at a single site that is used by several other systems too
distributed database system: the actual database and the DBMS software are distributed across various sites that are connected by a computer network
heterogeneous distributed database system: different sites might use different DBMS software, but there is additional common software to support data exchange between these sites
homogeneous distributed database systems: use the same DBMS software at multiple sites
multiuser database system: a database management system which supports multiple users concurrently
object-oriented data model: a database management system in which information is represented in the form of objects as used in object-oriented programming
single-user database system: a database management system which supports one user at a time
traditional models: data models that preceded the relational model
This chapter of Database Design (including images, except as otherwise noted) is a derivative copy of Database System Concepts by Nguyen Kim Anh licensed under a Creative Commons Attribution 3.0 License.
The following material was written by Adrienne Watt:
The relational data model was introduced by E. F. Codd in 1970. Currently, it is the most widely used data model.
The relational model has provided the basis for:
The relational data model describes the world as “a collection of inter-related relations (or tables).”
A relation, also known as a table or file, is a subset of the Cartesian product of a list of domains characterized by a name. Within a table, each row represents a group of related data values. A row, or record, is also known as a tuple. Each column in a table is a field, also referred to as an attribute. You can also think of it this way: an attribute is used to define the record, and a record contains a set of attributes.
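Stated formally, if a relation R is defined over the domains D1, D2, ..., Dn, then at any moment R is a set of n-tuples drawn from the Cartesian product of those domains:

$$ R \subseteq D_1 \times D_2 \times \cdots \times D_n $$

For example, a two-column relation over the domains {10, 20} and {A, B} can only ever contain rows drawn from the four possible pairs (10, A), (10, B), (20, A) and (20, B).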
The steps below outline the logic between a relation and its domains.
A database is composed of multiple tables and each table holds the data. Figure 7.1 shows a database that contains three tables.
A database stores pieces of information or facts in an organized way. Understanding how to use and get the most out of databases requires us to understand that method of organization.
The principal storage units are called columns or fields or attributes. These house the basic components of data into which your content can be broken down. When deciding which fields to create, you need to think generically about your information, for example, drawing out the common components of the information that you will store in the database and avoiding the specifics that distinguish one item from another.
Look at the example of an ID card in Figure 7.2 to see the relationship between fields and their data.
A domain is the original set of atomic values used to model data. By atomic value, we mean that each value in the domain is indivisible as far as the relational model is concerned. For example:
In summary, a domain is a set of acceptable values that a column is allowed to contain. This is based on various properties and the data type for the column. We will discuss data types in another chapter.
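As a hedged sketch, a column’s domain is usually approximated with a data type plus a CHECK constraint; some systems, such as PostgreSQL, also support the SQL standard’s CREATE DOMAIN for naming a reusable domain. The names below are invented for illustration.

```sql
-- A named, reusable domain of acceptable atomic values (e.g., PostgreSQL).
CREATE DOMAIN LetterGrade AS CHAR(1)
    CHECK (VALUE IN ('A', 'B', 'C', 'D', 'F'));

-- The same effect expressed as a column-level constraint.
CREATE TABLE Enrolment (
    StudentID INT,
    Grade     CHAR(1) CHECK (Grade IN ('A', 'B', 'C', 'D', 'F'))
);
```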
Just as the content of any one document or item needs to be broken down into its constituent bits of data for storage in the fields, the link between them also needs to be available so that they can be reconstituted into their whole form. Records allow us to do this. Records contain fields that are related, such as a customer or an employee. As noted earlier, a tuple is another term used for record.
Records and fields form the basis of all databases. A simple table gives us the clearest picture of how records and fields work together in a database storage project.
The simple table example in Figure 7.3 shows us how fields can hold a range of different sorts of data. This one has:
You can command the database to sift through its data and organize it in a particular way. For example, you can request that a selection of records be limited by date: 1. all before a given date, 2. all after a given date or 3. all between two given dates. Similarly, you can choose to have records sorted by date. Because the field containing the data is set up as a Date field, the database reads the information in the Date field not just as numbers separated by slashes, but rather, as dates that must be ordered according to a calendar system.
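In SQL, those three date selections and the sort look like this; the table and column names are assumed, and date-literal syntax varies slightly by product.

```sql
-- 1. All records before a given date.
SELECT * FROM Orders WHERE OrderDate < DATE '2020-01-01';

-- 2. All records after a given date.
SELECT * FROM Orders WHERE OrderDate > DATE '2020-01-01';

-- 3. All records between two given dates, then sorted by date.
SELECT * FROM Orders
WHERE OrderDate BETWEEN DATE '2020-01-01' AND DATE '2020-12-31'
ORDER BY OrderDate;
```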
The degree is the number of attributes in a table. In our example in Figure 7.3, the degree is 4.
atomic value: each value in the domain is indivisible as far as the relational model is concerned
attribute: principal storage unit in a database
column: see attribute
degree: number of attributes in a table
domain: the original set of atomic values used to model data; a set of acceptable values that a column is allowed to contain
field: see attribute
file: see relation
record: contains fields that are related; see tuple
relation: a subset of the Cartesian product of a list of domains characterized by a name; the technical term for table or file
row: see tuple
structured query language (SQL): the standard database access language
table: see relation
tuple: a technical term for row or record
Several of the terms used in this chapter are synonymous. In addition to the Key Terms above, please refer to Table 7.1 below. The terms in the Alternative 1 column are most commonly used.
Formal Terms (Codd) | Alternative 1 | Alternative 2 |
---|---|---|
Relation | Table | File |
Tuple | Row | Record |
Attribute | Column | Field |
Use Table 7.2 to answer questions 1-4.
EMPID | EMPLNAME | EMPINIT | EMPFNAME | EMPJOBCODE |
---|---|---|---|---|
123455 | Friedman | A. | Robert | 12 |
123456 | Olanski | D. | Delbert | 18 |
123457 | Fontein | G. | Juliette | 15 |
123458 | Cruazona | X. | Maria | 18 |
This chapter of Database Design (including images, except as otherwise noted) is a derivative copy of Relational Design Theory by Nguyen Kim Anh licensed under a Creative Commons Attribution 3.0 License.
The following material was written by Adrienne Watt:
The entity relationship (ER) data model has existed for over 35 years. It is well suited to data modelling for use with databases because it is fairly abstract and is easy to discuss and explain. ER models are readily translated to relations. An ER model, also called an ER schema, is represented by an ER diagram.
ER modelling is based on two concepts:
Here is an example of how these two concepts might be combined in an ER data model: Prof. Ba (entity) teaches (relationship) the Database Systems course (entity).
For the rest of this chapter, we will use a sample database called the COMPANY database to illustrate the concepts of the ER model. This database contains information about employees, departments and projects. Important points to note include:
An entity is an object in the real world with an independent existence that can be differentiated from other objects. An entity might be
Entities can be classified based on their strength. An entity is considered weak if its tables are existence dependent.
An entity is considered strong if it can exist apart from all of its related entities.
Another term to know is entity type which defines a collection of similar entities.
An entity set is a collection of entities of an entity type at a particular point of time. In an entity relationship diagram (ERD), an entity type is represented by a name in a box. For example, in Figure 8.1, the entity type is EMPLOYEE.
An entity’s existence is dependent on the existence of the related entity. It is existence-dependent if it has a mandatory foreign key (i.e., a foreign key attribute that cannot be null). For example, in the COMPANY database, a Spouse entity is existence-dependent on the Employee entity.
You should also be familiar with different kinds of entities including independent entities, dependent entities and characteristic entities. These are described below.
Independent entities, also referred to as kernels, are the backbone of the database. They are what other tables are based on. Kernels have the following characteristics:
If we refer back to our COMPANY database, examples of an independent entity include the Customer table, Employee table or Product table.
Dependent entities, also referred to as derived entities, depend on other tables for their meaning. These entities have the following characteristics:
Characteristic entities provide more information about another table. These entities have the following characteristics:
Each entity is described by a set of attributes (e.g., Employee = (Name, Address, Birthdate (Age), Salary)).
Each attribute has a name, and is associated with an entity and a domain of legal values. However, the information about attribute domain is not presented on the ERD.
In the entity relationship diagram, shown in Figure 8.2, each attribute is represented by an oval with a name inside.
There are a few types of attributes you need to be familiar with. Some of these are to be left as is, but some need to be adjusted to facilitate representation in the relational model. This first section will discuss the types of attributes. Later on we will discuss fixing the attributes to fit correctly into the relational model.
Simple attributes are those drawn from the atomic value domains; they are also called single-valued attributes. In the COMPANY database, an example of this would be: Name = {John} ; Age = {23}
Composite attributes are those that consist of a hierarchy of attributes. Using our database example, and shown in Figure 8.3, Address may consist of Number, Street and Suburb. So this would be written as → Address = {59 + ‘Meek Street’ + ‘Kingsford’}
Multivalued attributes are attributes that have a set of values for each entity. An example of a multivalued attribute from the COMPANY database, as seen in Figure 8.4, are the degrees of an employee: BSc, MIT, PhD.
Derived attributes are attributes that contain values calculated from other attributes. An example of this can be seen in Figure 8.5. Age can be derived from the attribute Birthdate. In this situation, Birthdate is called a stored attribute, which is physically saved to the database.
An important constraint on an entity is the key. The key is an attribute or a group of attributes whose values can be used to uniquely identify an individual entity in an entity set.
There are several types of keys. These are described below.
A candidate key is a simple or composite key that is unique and minimal. It is unique because no two rows in a table may have the same value at any time. It is minimal because every column is necessary in order to attain uniqueness.
From our COMPANY database example, if the entity is Employee(EID, First Name, Last Name, SIN, Address, Phone, BirthDate, Salary, DepartmentID), possible candidate keys are:
A composite key is composed of two or more attributes, but it must be minimal.
Using the example from the candidate key section, possible composite keys are:
The primary key is a candidate key that is selected by the database designer to be used as an identifying mechanism for the whole entity set. It must uniquely identify tuples in a table and not be null. The primary key is indicated in the ER model by underlining the attribute.
In the following example, EID is the primary key:
Employee(EID, First Name, Last Name, SIN, Address, Phone, BirthDate, Salary, DepartmentID)
A secondary key is an attribute used strictly for retrieval purposes (can be composite), for example: Phone and Last Name.
Alternate keys are all candidate keys not chosen as the primary key.
A foreign key (FK) is an attribute in a table that references the primary key in another table OR it can be null. Both foreign and primary keys must be of the same data type.
In the COMPANY database example below, DepartmentID is the foreign key:
Employee(EID, First Name, Last Name, SIN, Address, Phone, BirthDate, Salary, DepartmentID)
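Pulling the key types above together, here is a hedged SQL sketch of the Employee relation; the column types are assumptions, and a Department table is assumed to already exist.

```sql
CREATE TABLE Employee (
    EID          INT          PRIMARY KEY,    -- the chosen candidate key
    FirstName    VARCHAR(50),
    LastName     VARCHAR(50),
    SIN          CHAR(9)      UNIQUE,         -- alternate key: a candidate key not chosen as PK
    Address      VARCHAR(100),
    Phone        VARCHAR(15),
    BirthDate    DATE,
    Salary       DECIMAL(9,2),
    DepartmentID INT          REFERENCES Department (DepartmentID)  -- foreign key; may be null
);
```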
A null is a special symbol, independent of data type, which means either unknown or inapplicable. It does not mean zero or blank. Features of null include:
NOTE: The result of a comparison operation is null when either argument is null. The result of an arithmetic operation is null when either argument is null (except functions that ignore nulls).
Use the Salary table (Salary_tbl) in Figure 8.6 to follow an example of how null can be used.
Salary_tbl
emp# | jobName | salary | commission |
---|---|---|---|
E10 | Sales | 12500 | |
E11 | Null | 25000 | 8000 |
E12 | Sales | 44000 | 0 |
E13 | Sales | 44000 | Null |
To begin, find all employees (emp#) in Sales (under the jobName column) whose salary plus commission are greater than 30,000.
This result does not include E13 because of the null value in the commission column. To ensure that the row with the null value is included, we need to look at the individual fields. By adding commission and salary for employee E13, the result will be a null value. The solution is shown below.
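A direct test of salary + commission > 30000 silently excludes E13, because adding NULL to 44,000 yields NULL. A common fix, sketched below using the standard COALESCE function, substitutes 0 for a null commission before adding; the emp# column name may need quoting in some systems.

```sql
-- Naive version: E13 is dropped because 44000 + NULL evaluates to NULL.
SELECT emp#
FROM Salary_tbl
WHERE jobName = 'Sales'
  AND salary + commission > 30000;

-- Null-aware version: treats a missing commission as 0, so E13 is included.
SELECT emp#
FROM Salary_tbl
WHERE jobName = 'Sales'
  AND salary + COALESCE(commission, 0) > 30000;
```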
Relationships are the glue that holds the tables together. They are used to connect related information between tables.
Relationship strength is based on how the primary key of a related entity is defined. A weak, or non-identifying, relationship exists if the primary key of the related entity does not contain a primary key component of the parent entity. Company database examples include:
A strong, or identifying, relationship exists when the primary key of the related entity contains the primary key component of the parent entity. Examples include:
Below are descriptions of the various types of relationships.
A one to many (1:M) relationship should be the norm in any relational database design and is found in all relational database environments. For example, one department has many employees. Figure 8.7 shows the relationship of one of these employees to the department.
A one to one (1:1) relationship is the relationship of one entity to only one other entity, and vice versa. It should be rare in any relational database design. In fact, it could indicate that two entities actually belong in the same table.
An example from the COMPANY database is one employee is associated with one spouse, and one spouse is associated with one employee.
For a many to many relationship, consider the following points:
Figure 8.8 shows another aspect of the M:N relationship where an employee has different start dates for different projects. Therefore, we need a JOIN table that contains the EID, Code and StartDate.
Example of mapping an M:N binary relationship type
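A hedged SQL sketch of that join table, assuming Employee and Project tables keyed on EID and Code respectively (the table name WorksOn is an assumption):

```sql
-- Bridge table resolving the M:N relationship between Employee and Project.
CREATE TABLE WorksOn (
    EID       INT,
    Code      INT,
    StartDate DATE,                  -- start date for this employee on this project
    PRIMARY KEY (EID, Code),         -- composite key: one row per employee/project pair
    FOREIGN KEY (EID)  REFERENCES Employee (EID),
    FOREIGN KEY (Code) REFERENCES Project (Code)
);
```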
A unary relationship, also called recursive, is one in which a relationship exists between occurrences of the same entity set. In this relationship, the primary and foreign keys are the same, but they represent two entities with different roles. See Figure 8.9 for an example.
For some entities in a unary relationship, a separate column can be created that refers to the primary key of the same entity set.
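For instance, sketched in SQL as a fresh example (the ManagerEID column name is an assumption), an employee’s manager is simply another row in the same Employee table:

```sql
-- Unary (recursive) relationship: the FK points back to the same table's PK.
CREATE TABLE Employee (
    EID        INT PRIMARY KEY,
    EmpName    VARCHAR(50),
    ManagerEID INT REFERENCES Employee (EID)   -- NULL for the employee at the top
);
```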
A ternary relationship is a relationship type that involves many to many relationships between three tables.
Refer to Figure 8.10 for an example of mapping a ternary relationship type. Note n-ary means multiple tables in a relationship. (Remember, N = many.)
alternate key: all candidate keys not chosen as the primary key
candidate key: a simple or composite key that is unique (no two rows in a table may have the same value) and minimal (every column is necessary)
characteristic entities: entities that provide more information about another table
composite attributes: attributes that consist of a hierarchy of attributes
composite key: composed of two or more attributes, but it must be minimal
dependent entities: these entities depend on other tables for their meaning
derived attributes: attributes that contain values calculated from other attributes
derived entities: see dependent entities
EID: employee identification (ID)
entity: a thing or object in the real world with an independent existence that can be differentiated from other objects
entity relationship (ER) data model: also called an ER schema; represented by ER diagrams and well suited to data modelling for use with databases
entity relationship schema: see entity relationship data model
entity set: a collection of entities of an entity type at a point of time
entity type: a collection of similar entities
foreign key (FK): an attribute in a table that references the primary key in another table OR it can be null
independent entity: as the building blocks of a database, these entities are what other tables are based on
kernel: see independent entity
key: an attribute or group of attributes whose values can be used to uniquely identify an individual entity in an entity set
multivalued attributes: attributes that have a set of values for each entity
n-ary: multiple tables in a relationship
null: a special symbol, independent of data type, which means either unknown or inapplicable; it does not mean zero or blank
recursive relationship: see unary relationship
relationships: the associations or interactions between entities; used to connect related information between tables
relationship strength: based on how the primary key of a related entity is defined
secondary key: an attribute used strictly for retrieval purposes
simple attributes: drawn from the atomic value domains
SIN: social insurance number
single-valued attributes: see simple attributes
stored attribute: saved physically to the database
ternary relationship: a relationship type that involves many to many relationships between three tables
unary relationship: one in which a relationship exists between occurrences of the same entity set
Director
DIRNUM | DIRNAME | DIRDOB |
---|---|---|
100 | J. Broadway | 01/08/39 |
101 | J. Namath | 11/12/48 |
102 | W. Blake | 06/15/44 |
Play
PLAYNO | PLAYNAME | DIRNUM |
---|---|---|
1001 | Cat on a cold bare roof | 102 |
1002 | Hold the mayo, pass the bread | 101 |
1003 | I never promised you coffee | 102 |
1004 | Silly putty goes to Texas | 100 |
1005 | See no sound, hear no sight | 101 |
1006 | Starstruck in Biloxi | 102 |
1007 | Stranger in parrot ice | 101 |
Truck
TNUM | BASENUM | TYPENUM | TMILES | TBOUGHT | TSERIAL |
---|---|---|---|---|---|
1001 | 501 | 1 | 5900.2 | 11/08/90 | aa-125 |
1002 | 502 | 2 | 64523.9 | 11/08/90 | ac-213 |
1003 | 501 | 2 | 32116.0 | 09/29/91 | ac-215 |
1004 | | 2 | 3256.9 | 01/14/92 | ac-315 |
Base
BASENUM | BASECITY | BASESTATE | BASEPHONE | BASEMGR |
---|---|---|---|---|
501 | Dallas | TX | 893-9870 | J. Jones |
502 | New York | NY | 234-7689 | K. Lee |
Type
TYPENUM | TYPEDESC |
---|---|
1 | single box, double axle |
2 | tandem trailer, single axle |
Customer
CustID | CustName | AccntNo. |
---|---|---|
100 | Joe Smith | 010839 |
101 | Andy Blake | 111248 |
102 | Sue Brown | 061544 |
BookOrders
OrderID | Title | CustID | Price |
---|---|---|---|
1001 | The Dark Tower | 102 | 12.00 |
1002 | Incubus Dreams | 101 | 19.99 |
1003 | Song of Susannah | 102 | 23.00 |
1004 | The Time Traveler’s Wife | 100 | 21.00 |
1005 | The Dark Tower | 101 | 12.00 |
1006 | Tanequil | 102 | 15.00 |
1007 | Song of Susannah | 101 | 23.00 |
Figure 8.15. ERD of school database for questions 7-10, by A. Watt.
Also see Appendix B: Sample ERD Exercises
This chapter of Database Design (including images, except as otherwise noted) is a derivative copy of Data Modeling Using Entity-Relationship Model by Nguyen Kim Anh licensed under a Creative Commons Attribution 3.0 License.
The following material was written by Adrienne Watt:
Constraints are a very important feature in a relational model. In fact, the relational model supports the well-defined theory of constraints on attributes or tables. Constraints are useful because they allow a designer to specify the semantics of data in the database. Constraints are the rules that force DBMSs to check that data satisfies the semantics.
A domain constraint restricts the values of attributes in the relation and is a constraint of the relational model. However, there are real-world semantics for data that cannot be specified using domain constraints alone. We need more specific ways to state what data values are or are not allowed and which format is suitable for an attribute. For example, the Employee ID (EID) must be unique, or the employee Birthdate must be in the range [Jan 1, 1950, Jan 1, 2000]. Such information is provided in logical statements called integrity constraints.
There are several kinds of integrity constraints, described below.
To ensure entity integrity, it is required that every table have a primary key. Neither the PK nor any part of it can contain null values. This is because null values for the primary key mean we cannot identify some rows. For example, in the EMPLOYEE table, Phone cannot be a primary key since some people may not have a telephone.
Referential integrity requires that a foreign key must have a matching primary key or it must be null. This constraint is specified between two tables (parent and child); it maintains the correspondence between rows in these tables. It means the reference from a row in one table to another table must be valid.
Examples of referential integrity constraint in the Customer/Order database of the Company:
To ensure that there are no orphan records, we need to enforce referential integrity. An orphan record is one whose foreign key (FK) value is not found in the corresponding entity – the entity where the PK is located. Recall that a typical join is between a PK and FK.
The referential integrity constraint states that the customer ID (CustID) in the Order table must match a valid CustID in the Customer table. Most relational databases have declarative referential integrity. In other words, when the tables are created the referential integrity constraints are set up.
Here is another example from a Course/Class database:
The referential integrity constraint states that CrsCode in the Class table must match a valid CrsCode in the Course table. In this situation, it’s not enough that the CrsCode and Section in the Class table make up the PK, we must also enforce referential integrity.
When setting up referential integrity it is important that the PK and FK have the same data types and come from the same domain, otherwise the relational database management system (RDBMS) will not allow the join. RDBMS is a popular database system that is based on the relational model introduced by E. F. Codd of IBM’s San Jose Research Laboratory. Relational database systems are easier to use and understand than other database systems.
In Microsoft (MS) Access, referential integrity is set up by joining the PK in the Customer table to the CustID in the Order table. See Figure 9.1 for a view of how this is done on the Edit Relationships screen in MS Access.
When using Transact-SQL, referential integrity is set when creating the Order table with the FK. Below is a sketch of the statements showing the FK in the Order table referencing the PK in the Customer table.
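This is a hedged sketch of the typical form such statements take; the column lists are abbreviated, and Order is bracketed because ORDER is a reserved word in Transact-SQL.

```sql
-- Parent table with the primary key.
CREATE TABLE Customer (
    CustID   INT NOT NULL PRIMARY KEY,
    CustName VARCHAR(50)
);

-- Child table: the FK references the PK in Customer and may be NULL.
CREATE TABLE [Order] (
    OrderID INT NOT NULL PRIMARY KEY,
    CustID  INT NULL,
    FOREIGN KEY (CustID) REFERENCES Customer (CustID)
);
```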
Additional foreign key rules may be added when setting referential integrity, such as what to do with the child rows (in the Orders table) when the record with the PK, part of the parent (Customer), is deleted or changed (updated). For example, the Edit Relationships window in MS Access (see Figure 9.1) shows two additional options for FK rules: Cascade Update and Cascade Delete. If these are not selected, the system will prevent the deletion or update of PK values in the parent table (Customer table) if a child record exists. A child record is any record whose FK value matches the PK value of the parent.
In some databases, an additional option called Set to Null exists when selecting the Delete option. If this is chosen, the PK row is deleted, but the FK in the child table is set to NULL. Though this creates an orphan row, it is acceptable.
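In Transact-SQL terms, these rules are declared on the FK constraint itself. A sketch, assuming the plain FOREIGN KEY clause from the earlier listing has been dropped and the constraint name is invented:

```sql
-- Cascade Update plus Set to Null on delete (one possible combination).
ALTER TABLE [Order]
    ADD CONSTRAINT FK_Order_Customer
    FOREIGN KEY (CustID) REFERENCES Customer (CustID)
        ON DELETE SET NULL      -- deleting a customer leaves its orders with a NULL FK
        ON UPDATE CASCADE;      -- PK changes in Customer propagate to child rows
```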
Enterprise constraints – sometimes referred to as semantic constraints – are additional rules specified by users or database administrators and can be based on multiple tables.
Here are some examples.
Business rules are obtained from users when gathering requirements. The requirements-gathering process is very important, and its results should be verified by the user before the database design is built. If the business rules are incorrect, the design will be incorrect, and ultimately the application built will not function as expected by the users.
Some examples of business rules are:
Business rules are used to determine cardinality and connectivity. Cardinality describes the relationship between two data tables by expressing the minimum and maximum number of entity occurrences associated with one occurrence of a related entity. In Figure 9.2, you can see that cardinality is represented by the innermost markings on the relationship symbol. In this figure, the cardinality is 0 (zero) on the right and 1 (one) on the left.
The outermost symbol of the relationship symbol, on the other hand, represents the connectivity between the two tables. Connectivity is the relationship between two tables, e.g., one to one or one to many. When it comes to participation, there are three options for the relationship between these entities: 0 (zero), 1 (one) or many; it is zero only when the FK can be null. In Figure 9.2, for example, the connectivity is 1 (one) on the outer, left-hand side of this line and many on the outer, right-hand side.
Figure 9.3 shows the symbol that represents a one to many relationship.
In Figure 9.4, both inner (representing cardinality) and outer (representing connectivity) markers are shown. The left side of this symbol is read as minimum 1 and maximum 1. On the right side, it is read as: minimum 1 and maximum many.
The line that connects two tables, in an ERD, indicates the relationship type between the tables: either identifying or non-identifying. An identifying relationship will have a solid line (where the PK contains the FK). A non-identifying relationship is indicated by a broken line and does not contain the FK in the PK. See the section in Chapter 8 that discusses weak and strong relationships for more explanation.
In an optional relationship, the FK can be null or the parent table does not need to have a corresponding child table occurrence. The symbol, shown in Figure 9.6, illustrates one type with a zero and three prongs (indicating many) which is interpreted as zero OR many.
For example, if you look at the Order table on the right-hand side of Figure 9.7, you’ll notice that a customer doesn’t need to place an order to be a customer. In other words, the many side is optional.
The relationship symbol in Figure 9.7 can also be read as follows:
Figure 9.8 shows another type of optional relationship symbol with a zero and one, meaning zero OR one. The one side is optional.
Figure 9.9 gives an example of how a zero to one symbol might be used.
In a mandatory relationship, one entity occurrence requires a corresponding entity occurrence. The symbol for this relationship shows one and only one as shown in Figure 9.10. The one side is mandatory.
See Figure 9.11 for an example of how the one and only one mandatory symbol is used.
Figure 9.12 illustrates what a one to many relationship symbol looks like where the many side is mandatory.
Refer to Figure 9.13 for an example of how the one to many symbol may be used.
So far we have seen that the innermost side of a relationship symbol (on the left-side of the symbol in Figure 9.14) can have a 0 (zero) cardinality and a connectivity of many (shown on the right-side of the symbol in Figure 9.14), or one (not shown).
However, it cannot have a connectivity of 0 (zero), as displayed in Figure 9.15. The connectivity can only be 1.
The connectivity symbols show maximums. So if you think about it logically, if the connectivity symbol on the left side shows 0 (zero), then there would be no connection between the tables.
The way to read a relationship symbol, such as the one in Figure 9.16, is as follows.
business rules: obtained from users when gathering requirements and are used to determine cardinality
cardinality: expresses the minimum and maximum number of entity occurrences associated with one occurrence of a related entity
connectivity: the relationship between two tables, e.g., one to one or one to many
constraints: the rules that force DBMSs to check that data satisfies the semantics
entity integrity: requires that every table have a primary key; neither the primary key, nor any part of it, can contain null values
identifying relationship: where the primary key contains the foreign key; indicated in an ERD by a solid line
integrity constraints: logical statements that state what data values are or are not allowed and which format is suitable for an attribute
mandatory relationship: one entity occurrence requires a corresponding entity occurrence.
non-identifying relationship: does not contain the foreign key in the primary key; indicated in an ERD by a dotted line
optional relationship: the FK can be null or the parent table does not need to have a corresponding child table occurrence
orphan record: a record whose foreign key value is not found in the corresponding entity – the entity where the primary key is located
referential integrity: requires that a foreign key must have a matching primary key or it must be null
relational database management system (RDBMS): a popular database system based on the relational model introduced by E. F. Codd of IBM’s San Jose Research Laboratory
relationship type: the type of relationship between two tables in an ERD (either identifying or non-identifying); this relationship is indicated by a line drawn between the two tables.
Read the following description and then answer questions 1-5 at the end.
The swim club database in Figure 9.17 has been designed to hold information about students who are enrolled in swim classes. The following information is stored: students, enrollment, swim classes, pools where classes are held, instructors for the classes, and various levels of swim classes. Use Figure 9.17 to answer questions 1 to 5.
The primary keys are identified below. The following data types are defined in SQL Server.
tblLevels
Level – Identity PK
ClassName – text 20 – nulls are not allowed
tblPool
Pool – Identity PK
PoolName – text 20 – nulls are not allowed
Location – text 30
tblStaff
StaffID – Identity PK
FirstName – text 20
MiddleInitial – text 3
LastName – text 30
Suffix – text 3
Salaried – Bit
PayAmount – money
tblClasses
LessonIndex – Identity PK
Level – Integer FK
SectionID – Integer
Semester – TinyInt
Days – text 20
Time – datetime (formatted for time)
Pool – Integer FK
Instructor – Integer FK
Limit – TinyInt
Enrolled – TinyInt
Price – money
tblEnrollment
LessonIndex – Integer FK
SID – Integer FK; (LessonIndex, SID) together form the composite primary key
Status – text 30
Charged – bit
AmountPaid – money
DateEnrolled – datetime
tblStudents
SID – Identity PK
FirstName – text 20
MiddleInitial – text 3
LastName – text 30
Suffix – text 3
Birthday – datetime
LocalStreet – text 30
LocalCity – text 20
LocalPostalCode – text 6
LocalPhone – text 10
Implement this schema in SQL Server or Access (you will need to pick comparable data types). Submit a screenshot of your ERD in the database.
Figures 9.3, 9.4, 9.6, 9.8, 9.10, 9.12, 9.14 and 9.15 by A. Watt.
One important theory developed for the entity relationship (ER) model involves the notion of functional dependency (FD). The aim of studying this is to improve your understanding of relationships among data and to gain enough formalism to assist with practical database design.
Like constraints, FDs are drawn from the semantics of the application domain. Essentially, functional dependencies describe how individual attributes are related. FDs are a kind of constraint among attributes within a relation and contribute to a good relational schema design. In this chapter, we will look at:
Generally, a good relational database design must capture all of the necessary attributes and associations. The design should do this with a minimal amount of stored information and no redundant data.
In database design, redundancy is generally undesirable because it causes problems maintaining consistency after updates. However, redundancy can sometimes lead to performance improvements; for example, when redundancy can be used in place of a join to connect data. A join is used when you need to obtain information based on two related tables.
Consider Figure 10.1: customer 1313131 is displayed twice, once for account no. A-101 and again for account A-102. In this case, the customer number is not redundant, although there are deletion anomalies with the table. Having a separate customer table would solve this problem. However, if a branch address were to change, it would have to be updated in multiple places. If the branch information were left in the table as is, no branch table and no join would be required, and performance would improve.
An insertion anomaly occurs when you are inserting inconsistent information into a table. When we insert a new record, such as account no. A-306 in Figure 10.2, we need to check that the branch data is consistent with existing rows.
If a branch changes address, such as the Round Hill branch in Figure 10.3, we need to update all rows referring to that branch. Changing existing information incorrectly is called an update anomaly.
A deletion anomaly occurs when you delete a record that may contain attributes that shouldn’t be deleted. For instance, if we remove information about the last account at a branch, such as account A-101 at the Downtown branch in Figure 10.4, all of the branch information disappears.
The problem with deleting the A-101 row is we don’t know where the Downtown branch is located and we lose all information regarding customer 1313131. To avoid these kinds of update or deletion problems, we need to decompose the original table into several smaller tables where each table has minimal overlap with other tables.
Each table in the bank example must contain information about one entity only, such as the branch or the customer, as displayed in Figure 10.5.
Following this practice will ensure that when branch information is added or updated it will only affect one record. So, when customer information is added or deleted, the branch information will not be accidentally modified or incorrectly recorded.
Figure 10.6 shows an example of an employee project table. From this table, we can assume that:
Next, let’s look at some possible anomalies that might occur with this table during the following steps.
The best approach to creating tables without anomalies is to ensure that the tables are normalized, and that’s accomplished by understanding functional dependencies. FD analysis ensures that all attributes in a table belong to that table; in other words, it eliminates redundancies and anomalies.
By keeping data separate using individual Project and Employee tables:
deletion anomaly: occurs when you delete a record that may contain attributes that shouldn’t be deleted
functional dependency (FD): describes how individual attributes are related
insertion anomaly: occurs when you are inserting inconsistent information into a table
join: used when you need to obtain information based on two related tables
update anomaly: changing existing information incorrectly
Also see Appendix B: Sample ERD Exercises
This chapter of Database Design (including images, except as otherwise noted) is a derivative copy of Relational Design Theory by Nguyen Kim Anh licensed under Creative Commons Attribution License 3.0 license
The following material was written by Adrienne Watt:
A functional dependency (FD) is a relationship between two attributes, typically between the PK and other non-key attributes within a table. For any relation R, attribute Y is functionally dependent on attribute X (usually the PK), if for every valid instance of X, that value of X uniquely determines the value of Y. This relationship is indicated by the representation below:
X —> Y
The left side of the above FD diagram is called the determinant, and the right side is the dependent. Here are a few examples.
In the first example, below, SIN determines Name, Address and Birthdate. Given SIN, we can determine any of the other attributes within the table.
For the second example, SIN and Course determine the date completed (DateCompleted). This must also work for a composite PK.
The third example indicates that ISBN determines Title.
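Written in FD notation, these three examples are:

  SIN —> Name, Address, Birthdate
  SIN, Course —> DateCompleted
  ISBN —> Title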
Consider the following table of data r(R) of the relation schema R(ABCDE) shown in Table 11.1.
As you look at this table, ask yourself: What kind of dependencies can we observe among the attributes in Table R? Since the values of A are unique (a1, a2, a3, etc.), it follows from the FD definition that:
A → B, A → C, A → D, A → E
Since the values of E are always the same (all e1), it follows that:
A → E, B → E, C → E, D → E
However, we cannot summarize the above as ABCD → E. This is because ABCD → E is a weaker statement: it does not, by itself, imply that A → E, B → E or AB → E hold individually.
Other observations:
Looking at actual data can help clarify which attributes are dependent and which are determinants.
Armstrong’s axioms are a set of inference rules used to infer all the functional dependencies on a relational database. They were developed by William W. Armstrong. The following describes what will be used, in terms of notation, to explain these axioms.
Let R(U) be a relation scheme over the set of attributes U. We will use the letters X, Y and Z to represent any subset of U and, for short, write XY for the union of two sets of attributes X and Y, instead of the usual X ∪ Y.
This axiom says, if Y is a subset of X, then X determines Y (see Figure 11.1).
For example, suppose the value of PartNo, such as NT123, is composed of more than one piece of information: a part type (NT) and a part ID (123). Since each component is contained within PartNo, PartNo determines each of them by reflexivity.
The axiom of augmentation says that if X determines Y, then XZ determines YZ for any Z (see Figure 11.2).
Closely related is the notion of a partial dependency, which arises when a non-key attribute depends on only part of a composite PK; every non-key attribute must instead be fully dependent on the entire PK. In the example shown below, StudentName, Address, City, Prov, and PC (postal code) are dependent only on StudentNo, not on the full PK (StudentNo, Course).
StudentNo, Course —> StudentName, Address, City, Prov, PC, Grade, DateCompleted
This situation is not desirable because every non-key attribute has to be fully dependent on the PK. In this situation, student information is only partially dependent on the PK (StudentNo).
To fix this problem, we need to break the original table down into two as follows:
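One such decomposition, consistent with the dependencies above (key attributes shown first):

  Student (StudentNo, StudentName, Address, City, Prov, PC)
  StudentCourse (StudentNo, Course, Grade, DateCompleted)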
The axiom of transitivity says if X determines Y, and Y determines Z, then X must also determine Z (see Figure 11.3).
The table below has information not directly related to the student; for instance, ProgramID and ProgramName should be in a table of their own. ProgramName is not dependent on StudentNo; it’s dependent on ProgramID.
StudentNo —> StudentName, Address, City, Prov, PC, ProgramID, ProgramName
This situation is not desirable because a non-key attribute (ProgramName) depends on another non-key attribute (ProgramID).
To fix this problem, we need to break this table into two: one to hold information about the student and the other to hold information about the program.
However we still need to leave an FK in the student table so that we can identify which program the student is enrolled in.
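One such decomposition, with ProgramID kept in the student table as the FK:

  Student (StudentNo, StudentName, Address, City, Prov, PC, ProgramID)
  Program (ProgramID, ProgramName)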
This rule suggests that if two tables are separate, and the PK is the same, you may want to consider putting them together. It states that if X determines Y and X determines Z then X must also determine Y and Z (see Figure 11.4).
For example, if:
SIN —> EmpName
SIN —> SpouseName
You may want to join these two tables into one as follows:
SIN –> EmpName, SpouseName
Some database administrators (DBA) might choose to keep these tables separated for a couple of reasons. One, each table describes a different entity so the entities should be kept apart. Two, if SpouseName is to be left NULL most of the time, there is no need to include it in the same table as EmpName.
Decomposition is the reverse of the Union rule. If you have a table that appears to contain two entities that are determined by the same PK, consider breaking them up into two tables. This rule states that if X determines Y and Z, then X determines Y and X determines Z separately (see Figure 11.5).
A dependency diagram, shown in Figure 11.6, illustrates the various dependencies that might exist in a non-normalized table. A non-normalized table is one that has data redundancy in it.
The following dependencies are identified in this table:
Armstrong’s axioms: a set of inference rules used to infer all the functional dependencies on a relational database
DBA: database administrator
decomposition: a rule that suggests if you have a table that appears to contain two entities that are determined by the same PK, consider breaking them up into two tables
dependent: the right side of the functional dependency diagram
determinant: the left side of the functional dependency diagram
functional dependency (FD): a relationship between two attributes, typically between the PK and other non-key attributes within a table
non-normalized table: a table that has data redundancy in it
Union: a rule that suggests that if two tables are separate, and the PK is the same, consider putting them together
See Chapter 12.
This chapter of Database Design (including images, except as otherwise noted) is a derivative copy of Armstrong’s axioms by Wikipedia the Free Encyclopedia licensed under Creative Commons Attribution-ShareAlike 3.0 Unported
The following material was written by Adrienne Watt:
Normalization should be part of the database design process. However, it is difficult to separate the normalization process from the ER modelling process so the two techniques should be used concurrently.
Use an entity relation diagram (ERD) to provide the big picture, or macro view, of an organization’s data requirements and operations. This is created through an iterative process that involves identifying relevant entities, their attributes and their relationships.
The normalization procedure focuses on the characteristics of specific entities and represents the micro view of entities within the ERD.
Normalization is the branch of relational theory that provides design insights. It is the process of determining how much redundancy exists in a table. The goals of normalization are to:
Normalization theory draws heavily on the theory of functional dependencies. Normalization theory defines six normal forms (NF). Each normal form involves a set of dependency properties that a schema must satisfy and each normal form gives guarantees about the presence and/or absence of update anomalies. This means that higher normal forms have less redundancy, and as a result, fewer update problems.
All the tables in any database can be in one of the normal forms we will discuss next. Ideally we only want minimal redundancy for PK to FK. Everything else should be derived from other tables. There are six normal forms, but we will only look at the first four, which are:
first normal form (1NF)
second normal form (2NF)
third normal form (3NF)
Boyce-Codd normal form (BCNF)
BCNF is rarely used.
In the first normal form, only single values are permitted at the intersection of each row and column; hence, there are no repeating groups.
To normalize a relation that contains a repeating group, remove the repeating group and form two new relations.
The PK of the new relation is a combination of the PK of the original relation plus an attribute from the newly created relation for unique identification.
We will use the Student_Grade_Report table below, from a School database, as our example to explain the process for 1NF.
StudentCourse (StudentNo, CourseNo, CourseName, InstructorNo, InstructorName, InstructorLocation, Grade)
For the second normal form, the relation must first be in 1NF. The relation is automatically in 2NF if, and only if, the PK comprises a single attribute.
If the relation has a composite PK, then each non-key attribute must be fully dependent on the entire PK and not on a subset of the PK (i.e., there must be no partial dependencies).
To move to 2NF, a table must first be in 1NF.
To be in third normal form, the relation must be in second normal form. Also all transitive dependencies must be removed; a non-key attribute may not be functionally dependent on another non-key attribute.
At this stage, there should be no anomalies in third normal form. Let’s look at the dependency diagram (Figure 12.1) for this example. The first step is to remove repeating groups, as discussed above.
Student (StudentNo, StudentName, Major)
StudentCourse (StudentNo, CourseNo, CourseName, InstructorNo, InstructorName, InstructorLocation, Grade)
To recap the normalization process for the School database, review the dependencies shown in Figure 12.1.
The abbreviations used in Figure 12.1 are as follows:
When a table has more than one candidate key, anomalies may result even though the relation is in 3NF. Boyce-Codd normal form is a special case of 3NF. A relation is in BCNF if, and only if, every determinant is a candidate key.
Consider the following table (St_Maj_Adv).
Student_id | Major | Advisor |
111 | Physics | Smith |
111 | Music | Chan |
320 | Math | Dobbs |
671 | Physics | White |
803 | Physics | Smith |
The semantic rules (business rules applied to the database) for this table are:
The functional dependencies for this table are listed below. The first one is a candidate key; the second is not.
Student_id, Major —> Advisor
Advisor —> Major
Anomalies for this table include:
Note: No single attribute is a candidate key.
The PK can be (Student_id, Major) or (Student_id, Advisor).
To reduce the St_Maj_Adv relation to BCNF, you create two new tables:
St_Adv table
Student_id | Advisor |
111 | Smith |
111 | Chan |
320 | Dobbs |
671 | White |
803 | Smith |
Adv_Maj table
Advisor | Major |
Smith | Physics |
Chan | Music |
Dobbs | Math |
White | Physics |
Consider the following table (Client_Interview).
ClientNo | InterviewDate | InterviewTime | StaffNo | RoomNo |
CR76 | 13-May-02 | 10.30 | SG5 | G101 |
CR56 | 13-May-02 | 12.00 | SG5 | G101 |
CR74 | 13-May-02 | 12.00 | SG37 | G102 |
CR56 | 1-July-02 | 10.30 | SG5 | G102 |
FD1 – ClientNo, InterviewDate –> InterviewTime, StaffNo, RoomNo (PK)
FD2 – StaffNo, InterviewDate, InterviewTime –> ClientNo (candidate key: CK)
FD3 – RoomNo, InterviewDate, InterviewTime –> StaffNo, ClientNo (CK)
FD4 – StaffNo, InterviewDate –> RoomNo
A relation is in BCNF if, and only if, every determinant is a candidate key. We need to create a table that incorporates the first three FDs (Client_Interview2 table) and another table (StaffRoom table) for the fourth FD.
Client_Interview2 table
ClientNo | InterviewDate | InterViewTime | StaffNo |
CR76 | 13-May-02 | 10.30 | SG5 |
CR56 | 13-May-02 | 12.00 | SG5 |
CR74 | 13-May-02 | 12.00 | SG37 |
CR56 | 1-July-02 | 10.30 | SG5 |
StaffRoom table
StaffNo | InterviewDate | RoomNo |
SG5 | 13-May-02 | G101 |
SG37 | 13-May-02 | G102 |
SG5 | 1-July-02 | G102 |
During the normalization process of database design, make sure that proposed entities meet the required normal form before table structures are created. Many real-world databases have been improperly designed, or have become burdened with anomalies after being modified over time. You may be asked to redesign and modify existing databases; this can be a large undertaking if the tables are not properly normalized.
Boyce-Codd normal form (BCNF): a special case of 3rd NF
first normal form (1NF): only single values are permitted at the intersection of each row and column so there are no repeating groups
normalization: the process of determining how much redundancy exists in a table
second normal form (2NF): the relation must be in 1NF and every non-key attribute must be fully dependent on the entire PK (a relation with a single-attribute PK is automatically in 2NF)
semantic rules: business rules applied to the database
third normal form (3NF): the relation must be in 2NF and all transitive dependencies must be removed; a non-key attribute may not be functionally dependent on another non-key attribute
Complete chapters 11 and 12 before doing these exercises.
Also see Appendix B: Sample ERD Exercises
Nguyen Kim Anh, Relational Design Theory. OpenStax CNX. 8 Jul 2009. Retrieved July 2014 from http://cnx.org/contents/606cc532-0b1d-419d-a0ec-ac4e2e2d533b@1@1
Russell, Gordon. Chapter 4 – Normalisation. Database eLearning. N.d. Retrieved July 2014 from db.grussell.org/ch4.html
A core aspect of software engineering is the subdivision of the development process into a series of phases, or steps, each of which focuses on one aspect of the development. The collection of these steps is sometimes referred to as the software development life cycle (SDLC). The software product moves through this life cycle (sometimes repeatedly as it is refined or redeveloped) until it is finally retired from use. Ideally, each phase in the life cycle can be checked for correctness before moving on to the next phase.
Let us start with an overview of the waterfall model such as you will find in most software engineering textbooks. This waterfall figure, seen in Figure 13.1, illustrates a general waterfall model that could apply to any computer system development. It shows the process as a strict sequence of steps where the output of one step is the input to the next and all of one step has to be completed before moving onto the next.
We can use the waterfall process as a means of identifying the tasks that are required, together with the input and output for each activity. What is important is the scope of the activities, which can be summarized as follows:
We can use the waterfall cycle as the basis for a model of database development that incorporates three assumptions:
Using these assumptions and Figure 13.2, we can see that this diagram represents a model of the activities and their outputs for database development. It is applicable to any class of DBMS, not just a relational approach.
Database application development is the process of obtaining real-world requirements, analyzing requirements, designing the data and functions of the system, and then implementing the operations in the system.
The first step is requirements gathering. During this step, the database designers have to interview the customers (database users) to understand the proposed system and obtain and document the data and functional requirements. The result of this step is a document that includes the detailed requirements provided by the users.
Establishing requirements involves consultation with, and agreement among, all the users as to what persistent data they want to store along with an agreement as to the meaning and interpretation of the data elements. The data administrator plays a key role in this process as they oversee the business, legal and ethical issues within the organization that impact the data requirements.
The data requirements document is used to confirm the understanding of requirements with users. To make sure that it is easily understood, it should not be overly formal or highly encoded. The document should give a concise summary of all users’ requirements – not just a collection of individuals’ requirements – as the intention is to develop a single shared database.
The requirements should not describe how the data is to be processed, but rather what the data items are, what attributes they have, what constraints apply and the relationships that hold between the data items.
Data analysis begins with the statement of data requirements and then produces a conceptual data model. The aim of analysis is to obtain a detailed description of the data that will suit user requirements so that both high and low level properties of data and their use are dealt with. These include properties such as the possible range of values that can be permitted for attributes (e.g., in the school database example, the student course code, course title and credit points).
The conceptual data model provides a shared, formal representation of what is being communicated between clients and developers during database development – it is focused on the data in a database, irrespective of the eventual use of that data in user processes or implementation of the data in specific computer environments. Therefore, a conceptual data model is concerned with the meaning and structure of data, but not with the details affecting how they are implemented.
The conceptual data model then is a formal representation of what data a database should contain and the constraints the data must satisfy. This should be expressed in terms that are independent of how the model may be implemented. As a result, analysis focuses on the questions, “What is required?” not “How is it achieved?”
Database design starts with a conceptual data model and produces a specification of a logical schema; this will determine the specific type of database system (network, relational, object-oriented) that is required. The relational representation is still independent of any specific DBMS; it is another conceptual data model.
We can use a relational representation of the conceptual data model as input to the logical design process. The output of this stage is a detailed relational specification, the logical schema, of all the tables and constraints needed to satisfy the description of the data in the conceptual data model. It is during this design activity that choices are made as to which tables are most appropriate for representing the data in a database. These choices must take into account various design criteria including, for example, flexibility for change, control of duplication and how best to represent the constraints. It is the tables defined by the logical schema that determine what data are stored and how they may be manipulated in the database.
Database designers familiar with relational databases and SQL might be tempted to go directly to implementation after they have produced a conceptual data model. However, such a direct transformation of the relational representation to SQL tables does not necessarily result in a database that has all the desirable properties: completeness, integrity, flexibility, efficiency and usability. A good conceptual data model is an essential first step towards a database with these properties, but that does not mean that the direct transformation to SQL tables automatically produces a good database. This first step will accurately represent the tables and constraints needed to satisfy the conceptual data model description, and so will satisfy the completeness and integrity requirements, but it may be inflexible or offer poor usability. The first design is then flexed to improve the quality of the database design. Flexing is a term that is intended to capture the simultaneous ideas of bending something for a different purpose and weakening aspects of it as it is bent.
Figure 13.3 summarizes the iterative (repeated) steps involved in database design, based on the overview given. Its main purpose is to distinguish the general issue of what tables should be used from the detailed definition of the constituent parts of each table – these tables are considered one at a time, although they are not independent of each other. Each iteration that involves a revision of the tables would lead to a new design; collectively they are usually referred to as second-cut designs, even if the process iterates for more than a single loop.
First, for a given conceptual data model, it is not necessary that all the user requirements it represents be satisfied by a single database. There can be various reasons for the development of more than one database, such as the need for independent operation in different locations or departmental control over “their” data. However, if the collection of databases contains duplicated data and users need to access data in more than one database, then there may be good reasons to examine whether a single database could satisfy all requirements, or to address issues related to data replication and distribution.
Second, one of the assumptions about database development is that we can separate the development of a database from the development of user processes that make use of it. This is based on the expectation that, once a database has been implemented, all data required by currently identified user processes have been defined and can be accessed; but we also require flexibility to allow us to meet future requirements changes. In developing a database for some applications, it may be possible to predict the common requests that will be presented to the database and so we can optimize our design for the most common requests.
Third, at a detailed level, many aspects of database design and implementation depend on the particular DBMS being used. If the choice of DBMS is fixed or made prior to the design task, that choice can be used to determine design criteria rather than waiting until implementation. That is, it is possible to incorporate design decisions for a specific DBMS rather than produce a generic design and then tailor it to the DBMS during implementation.
It is not uncommon to find that a single design cannot simultaneously satisfy all the properties of a good database. So it is important that the designer has prioritized these properties (usually using information from the requirements specification); for example, to decide if integrity is more important than efficiency and whether usability is more important than flexibility in a given development.
At the end of our design stage, the logical schema will be specified by SQL data definition language (DDL) statements, which describe the database that needs to be implemented to meet the user requirements.
Implementation involves the construction of a database according to the specification of a logical schema. This will include the specification of an appropriate storage schema, security enforcement, external schema and so on. Implementation is heavily influenced by the choice of available DBMSs, database tools and operating environment. There are additional tasks beyond simply creating a database schema and implementing the constraints – data must be entered into the tables, issues relating to the users and user processes need to be addressed, and the management activities associated with wider aspects of corporate data management need to be supported. In keeping with the DBMS approach, we want as many of these concerns as possible to be addressed within the DBMS. We look at some of these concerns briefly now.
In practice, implementation of the logical schema in a given DBMS requires a very detailed knowledge of the specific features and facilities that the DBMS has to offer. In an ideal world, and in keeping with good software engineering practice, the first stage of implementation would involve matching the design requirements with the best available implementing tools and then using those tools for the implementation. In database terms, this might involve choosing vendor products with DBMS and SQL variants most suited to the database we need to implement. However, we don’t live in an ideal world and more often than not, hardware choice and decisions regarding the DBMS will have been made well in advance of consideration of the database design. Consequently, implementation can involve additional flexing of the design to overcome any software or hardware limitations.
After the logical design has been created, we need our database to be created according to the definitions we have produced. For an implementation with a relational DBMS, this will probably involve the use of SQL to create tables and constraints that satisfy the logical schema description and the choice of appropriate storage schema (if the DBMS permits that level of control).
One way to achieve this is to write the appropriate SQL DDL statements into a file that can be executed by a DBMS so that there is an independent record, a text file, of the SQL statements defining the database. Another method is to work interactively using a database tool like SQL Server Management Studio or Microsoft Access. Whatever mechanism is used to implement the logical schema, the result is that a database, with tables and constraints, is defined but will contain no data for the user processes.
After a database has been created, there are two ways of populating the tables – either from existing data or through the use of the user applications developed for the database.
For some tables, there may be existing data from another database or data files. For example, in establishing a database for a hospital, you would expect that there are already some records of all the staff that have to be included in the database. Data might also be brought in from an outside agency (address lists are frequently brought in from external companies) or produced during a large data entry task (converting hard-copy manual records into computer files can be done by a data entry agency). In such situations, the simplest approach to populate the database is to use the import and export facilities found in the DBMS.
Facilities to import and export data in various standard formats are usually available (these functions are also known in some systems as loading and unloading data). Importing enables a file of data to be copied directly into a table. When data are held in a file format that is not appropriate for using the import function, then it is necessary to prepare an application program that reads in the old data, transforms them as necessary and then inserts them into the database using SQL code specifically produced for that purpose. The transfer of large quantities of existing data into a database is referred to as a bulk load. Bulk loading may involve very large quantities of data being loaded one table at a time, so you may find that the DBMS provides facilities to postpone constraint checking until the end of the bulk load.
Note: These are general guidelines that will assist in developing a strong basis for the actual database design (the logical model).
analysis: starts by considering the statement of requirements and finishes by producing a system specification
bulk load: the transfer of large quantities of existing data into a database
data requirements document: used to confirm the understanding of requirements with the user
design: begins with a system specification, produces design documents and provides a detailed description of how a system should be constructed
establishing requirements: involves consultation with, and agreement among, stakeholders as to what they want from a system; expressed as a statement of requirements
flexing: a term that captures the simultaneous ideas of bending something for a different purpose and weakening aspects of it as it is bent
implementation: the construction of a computer system according to a given design document
maintenance: involves dealing with changes in the requirements or the implementation environment, bug fixing or porting of the system to new environments
requirements gathering: a process during which the database designer interviews the database user to understand the proposed system and obtain and document the data and functional requirements
second-cut designs: the collection of iterations that each involves a revision of the tables that lead to a new design
software development life cycle (SDLC): the series of steps involved in the database development process
testing: compares the implemented system against the design documents and requirements specification and produces an acceptance report
waterfall model: shows the database development process as a strict sequence of steps where the output of one step is the input to the next
waterfall process: a means of identifying the tasks required for database development, together with the input and output for each activity (see waterfall model)
This chapter of Database Design (including all images, except as otherwise noted) is a derivative copy of The Database Development Life Cycle by the Open University licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
The following material was written by Adrienne Watt:
End users are the people whose jobs require access to a database for querying, updating and generating reports.
The application user is someone who accesses an existing application program to perform daily tasks.
Sophisticated users are those who have their own way of accessing the database. This means they do not use the application program provided in the system. Instead, they might define their own application or describe their need directly by using query languages. These specialized users maintain their personal databases by using ready-made program packages that provide easy-to-use menu driven commands, such as MS Access.
Application programmers are users who implement specific application programs to access the stored data. They must be familiar with the DBMS to accomplish their task.
The database administrator (DBA) may be one person or a group of people in an organization responsible for authorizing access to the database, monitoring its use and managing all of the resources to support the use of the entire database system.
application programmer: user who implements specific application programs to access the stored data
application user: accesses an existing application program to perform daily tasks.
database administrator (DBA): responsible for authorizing access to the database, monitoring its use and managing all the resources to support the use of the entire database system
end user: people whose jobs require access to a database for querying, updating and generating reports
sophisticated user: those who use other methods, other than the application program, to access the database
There are no exercises provided for this chapter.
Structured Query Language (SQL) is a database language designed for managing data held in a relational database management system. SQL was initially developed by IBM in the early 1970s (Date 1986). The initial version, called SEQUEL (Structured English Query Language), was designed to manipulate and retrieve data stored in IBM’s quasi-relational database management system, System R. Then in the late 1970s, Relational Software Inc., which is now Oracle Corporation, introduced the first commercially available implementation of SQL, Oracle V2 for VAX computers.
Many of the currently available relational DBMSs, such as Oracle Database, Microsoft SQL Server (shown in Figure 15.1), MySQL, IBM DB2, IBM Informix and Microsoft Access, use SQL.
In a DBMS, the SQL database language is used to:
create the database and table structures
perform basic data management tasks (adding, deleting and modifying data)
perform complex queries to transform raw data into useful information
In this chapter, we will focus on using SQL to create the database and table structures, mainly using SQL as a data definition language (DDL). In Chapter 16, we will use SQL as a data manipulation language (DML) to insert, delete, select and update data within the database tables.
The major SQL DDL statements are CREATE DATABASE and CREATE/DROP/ALTER TABLE. The SQL statement CREATE is used to create the database and table structures.
Example: CREATE DATABASE SW
A new database named SW is created by the SQL statement CREATE DATABASE SW. Once the database is created, the next step is to create the database tables.
The general format for the CREATE TABLE command is:
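In outline (square brackets mark optional parts):

  CREATE TABLE TableName
  (
      ColumnName datatype [ColumnConstraints],
      ColumnName datatype [ColumnConstraints]
      -- additional columns and table constraints as needed
  )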
Tablename is the name of the database table, such as Employee. Each field in the CREATE TABLE statement has three parts (see above):
ColumnName
Data type
Optional column constraints
The ColumnName must be unique within the table. Some examples of ColumnNames are FirstName and LastName.
The data type, as described below, must be a system data type or a user-defined data type. Many of the data types have a size such as CHAR(35) or Numeric(8,2).
Bit – Integer data with either a 1 or 0 value
Int – Integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647)
Smallint – Integer data from -2^15 (-32,768) through 2^15 - 1 (32,767)
Tinyint – Integer data from 0 through 255
Decimal – Fixed precision and scale numeric data from -10^38 + 1 through 10^38 - 1
Numeric – A synonym for decimal
Timestamp – A database-wide unique number
Uniqueidentifier – A globally unique identifier (GUID)
Money – Monetary data values from -2^63 (-922,337,203,685,477.5808) through 2^63 - 1 (+922,337,203,685,477.5807), with accuracy to one ten-thousandth of a monetary unit
Smallmoney – Monetary data values from -214,748.3648 through +214,748.3647, with accuracy to one ten-thousandth of a monetary unit
Float – Floating precision number data from -1.79E+308 through 1.79E+308
Real – Floating precision number data from -3.40E+38 through 3.40E+38
Datetime – Date and time data from January 1, 1753, through December 31, 9999, with an accuracy of one three-hundredth of a second, or 3.33 milliseconds
Smalldatetime – Date and time data from January 1, 1900, through June 6, 2079, with an accuracy of one minute
Char – Fixed-length non-Unicode character data with a maximum length of 8,000 characters
Varchar – Variable-length non-Unicode data with a maximum of 8,000 characters
Text – Variable-length non-Unicode data with a maximum length of 2^31 - 1 (2,147,483,647) characters
Binary – Fixed-length binary data with a maximum length of 8,000 bytes
Varbinary – Variable-length binary data with a maximum length of 8,000 bytes
Image – Variable-length binary data with a maximum length of 2^31 - 1 (2,147,483,647) bytes
The optional column constraints are NULL, NOT NULL, UNIQUE, PRIMARY KEY and DEFAULT (the last of these supplies an initial value for a new record). The column constraint NULL indicates that null values are allowed, which means that a row can be created without a value for this column. The column constraint NOT NULL indicates that a value must be supplied when a new row is created.
To illustrate, we will use the SQL statement CREATE TABLE EMPLOYEES to create the employees table with 16 attributes or fields.
The first field is EmployeeNo with a field type of CHAR. For this field, the field length is 10 characters, and the user cannot leave this field empty (NOT NULL).
Similarly, the second field is DepartmentName with a field type CHAR of length 30. After all the table columns are defined, a table constraint, identified by the word CONSTRAINT, is used to create the primary key:
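A sketch of the opening of this statement (the remaining fields are omitted; the constraint name is illustrative):

  CREATE TABLE EMPLOYEES
  (
      EmployeeNo      CHAR(10) NOT NULL,
      DepartmentName  CHAR(30) NOT NULL,
      -- ... 14 further fields omitted ...
      CONSTRAINT EMPLOYEES_PK PRIMARY KEY (EmployeeNo)
  )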
We will discuss the constraint property further later in this chapter.
Likewise, we can create a Department table, a Project table and an Assignment table using the CREATE TABLE SQL DDL command as shown in the below example.
In this example, a project table is created with six fields: ProjectID, ProjectName, Department, MaxHours, StartDate, and EndDate.
In this last example, an assignment table is created with three fields: ProjectID, EmployeeNumber, and HoursWorked. The assignment table is used to record who (EmployeeNumber) worked on a particular project (ProjectID) and for how much time (HoursWorked).
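A sketch of the assignment table (constraint names and the Numeric size are illustrative; EmployeeNumber is typed to match the CHAR(10) EmployeeNo PK sketched above):

  CREATE TABLE ASSIGNMENT
  (
      ProjectID       Int          NOT NULL,
      EmployeeNumber  CHAR(10)     NOT NULL,
      HoursWorked     Numeric(6,2) NULL,
      CONSTRAINT ASSIGNMENT_PK PRIMARY KEY (ProjectID, EmployeeNumber),
      CONSTRAINT ASSIGN_PROJ_FK FOREIGN KEY (ProjectID) REFERENCES PROJECT (ProjectID),
      CONSTRAINT ASSIGN_EMP_FK FOREIGN KEY (EmployeeNumber) REFERENCES EMPLOYEES (EmployeeNo)
  )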
Table constraints are identified by the CONSTRAINT keyword and can be used to implement various constraints described below.
We can use the optional column constraint IDENTITY to provide a unique, incremental value for that column. Identity columns are often used with the PRIMARY KEY constraints to serve as the unique row identifier for the table. The IDENTITY property can be assigned to a column with a tinyint, smallint, int, decimal or numeric data type. This constraint:
For IDENTITY[(seed, increment)], the seed is the value used for the very first row loaded into the table, and the increment is the value added to the identity value of the previous row that was loaded.
We will use another database example to further illustrate the SQL DDL statements by creating the table tblHotel in this HOTEL database.
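A sketch (fields other than HotelNo, and their lengths, are illustrative):

  CREATE TABLE tblHotel
  (
      HotelNo  Int IDENTITY(1,1) NOT NULL,   -- unique, incremental row identifier
      Name     CHAR(50) NOT NULL,
      Address  CHAR(50) NULL,
      City     CHAR(25) NULL,
      CONSTRAINT tblHotel_PK PRIMARY KEY (HotelNo)
  )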
UNIQUE constraint
The UNIQUE constraint prevents duplicate values from being entered into a column.
This is the general syntax for the UNIQUE constraint:
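  [CONSTRAINT constraint_name] UNIQUE (col_name [, col_name ...])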
This is an example using the UNIQUE constraint.
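For instance:

  CREATE TABLE EMPLOYEES
  (
      EmployeeNo CHAR(10) NOT NULL UNIQUE
      -- remaining columns as defined earlier
  )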
The FOREIGN KEY (FK) constraint defines a column, or combination of columns, whose values match the PRIMARY KEY (PK) of another table.
This is the general syntax for the FOREIGN KEY constraint:
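  [CONSTRAINT constraint_name] FOREIGN KEY (col_name [, col_name ...])
      REFERENCES ref_table (ref_col [, ref_col ...])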
In this example, the field HotelNo in the tblRoom table is a FK to the field HotelNo in the tblHotel table shown previously.
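A sketch of tblRoom (the Type and Price columns anticipate the CHECK examples below; names and lengths are illustrative):

  CREATE TABLE tblRoom
  (
      HotelNo  Int      NOT NULL,
      RoomNo   Int      NOT NULL,
      Type     CHAR(50) NULL,
      Price    Money    NULL,
      CONSTRAINT tblRoom_PK PRIMARY KEY (HotelNo, RoomNo),
      CONSTRAINT tblRoom_FK FOREIGN KEY (HotelNo) REFERENCES tblHotel (HotelNo)
  )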
The CHECK constraint restricts values that can be entered into a table.
This is the general syntax for the CHECK constraint:
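  [CONSTRAINT constraint_name] CHECK (expression)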
In this example, the Type field is restricted to have only the types ‘Single’, ‘Double’, ‘Suite’ or ‘Executive’.
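For example, added to tblRoom as a table constraint:

  ALTER TABLE tblRoom
  ADD CONSTRAINT Type_CK CHECK (Type IN ('Single', 'Double', 'Suite', 'Executive'))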
In this second example, the employee hire date should be before January 1, 2004, or have a salary limit of $300,000.
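A sketch (HireDate and Salary are illustrative column names):

  ALTER TABLE EMPLOYEES
  ADD CONSTRAINT Hire_CK CHECK (HireDate < '2004-01-01' OR Salary <= 300000)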
The DEFAULT constraint is used to supply a value that is automatically added for a column if the user does not supply one.
The general syntax for the DEFAULT constraint is:
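  ColumnName datatype [CONSTRAINT constraint_name] DEFAULT default_value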
This example sets the default for the city field to ‘Vancouver’.
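Within a column definition, for example (tblGuest and its other columns are illustrative):

  CREATE TABLE tblGuest
  (
      GuestNo  Int NOT NULL PRIMARY KEY,
      City     CHAR(25) DEFAULT 'Vancouver'   -- used when no city is supplied
  )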
User-defined types are always based on a system-supplied data type. They can enforce data integrity and they allow nulls.
To create a user-defined data type in SQL Server, choose Types under “Programmability” in your database, then right-click and choose ‘New’ –> ‘User-defined data type’, or execute the sp_addtype system stored procedure. After this, type:
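  EXEC sp_addtype SIN, 'CHAR(9)', 'NOT NULL'   -- defines SIN as a nine-character type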
This will add a new user-defined data type called SIN with nine characters.
In this example, the field EmployeeSIN uses the user-defined data type SIN.
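A sketch (the table name and the EmployeeID column are illustrative):

  CREATE TABLE EMPLOYEE
  (
      EmployeeID  Int NOT NULL PRIMARY KEY,
      EmployeeSIN SIN
      -- other columns omitted
  )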
You can use ALTER TABLE statements to add and drop constraints.
In this example, we use the ALTER TABLE statement to add a new column with the IDENTITY property to an existing table, as sketched below. ALTER TABLE can also be used to add and drop constraints.
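Sketches of these operations (table, column and constraint names are illustrative):

  -- Add a new column with the IDENTITY property
  ALTER TABLE tblHotel
  ADD HotelID Int IDENTITY(1, 1)

  -- Add and then drop a constraint
  ALTER TABLE tblRoom
  ADD CONSTRAINT Price_CK CHECK (Price > 0)

  ALTER TABLE tblRoom
  DROP CONSTRAINT Price_CK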
The DROP TABLE statement will remove a table from the database. Make sure you have the correct database selected.
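For example:

  USE HOTEL
  GO
  DROP TABLE tblHotel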
Executing the above SQL DROP TABLE statement will remove the table tblHotel from the database.
DDL: abbreviation for data definition language
DML: abbreviation for data manipulation language
SEQUEL: acronym for Structured English Query Language; designed to manipulate and retrieve data stored in IBM’s quasi-relational database management system, System R
Structured Query Language (SQL): a database language designed for managing data held in a relational database management system
ATTRIBUTE (FIELD) NAME | DATA DECLARATION |
EMP_NUM | CHAR(3) |
EMP_LNAME | VARCHAR(15) |
EMP_FNAME | VARCHAR(15) |
EMP_INITIAL | CHAR(1) |
EMP_HIREDATE | DATE |
JOB_CODE | CHAR(3) |
Use Figure 15.2 to answer questions 4 to 10.
Also see Appendix C: SQL Lab with Solution
Date, C.J. Relational Database Selected Writings. Reading, Mass.: Addison-Wesley Publishing Company Inc., 1986, pp. 269-311.
The SQL data manipulation language (DML) is used to query and modify database data. In this chapter, we will describe how to use the SELECT, INSERT, UPDATE, and DELETE SQL DML command statements, defined below.
In the SQL DML statement:
The SELECT statement, or command, allows the user to extract data from tables, based on specific criteria. Its clauses are written in the following sequence:
SELECT DISTINCT item(s)
FROM table(s)
WHERE predicate
GROUP BY field(s)
ORDER BY field(s)
We can use the SELECT statement to generate an employee phone list from the Employees table as follows:
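One way to write this, assuming columns named LastName, FirstName and Phone:

  SELECT LastName, FirstName, Phone
  FROM Employees
  ORDER BY LastName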
This action will display the employee’s last name, first name, and phone number from the Employees table, as seen in Table 16.1.
Last Name | First Name | Phone Number |
Hagans | Jim | 604-232-3232 |
Wong | Bruce | 604-244-2322 |
Table 16.1. Employees table.
In this next example, we will use a Publishers table (Table 16.2). (You will notice that Canada is misspelled in the Publisher Country field for Example Publishing and ABC Publishing. To correct the misspelling, use the UPDATE statement to standardize the country field to Canada – see the UPDATE statement later in this chapter.)
Publisher Name | Publisher City | Publisher Province | Publisher Country |
Acme Publishing | Vancouver | BC | Canada |
Example Publishing | Edmonton | AB | Cnada |
ABC Publishing | Toronto | ON | Canda |
Table 16.2. Publishers table.
If you add the publisher’s name and city, you would use the SELECT statement followed by the fields name separated by a comma:
This action will display the publisher’s name and city from the Publishers table.
If you just want the publisher’s name displayed under the heading “city”, you would use the SELECT statement with no comma separating pub_name and city:
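  SELECT pub_name city
  FROM Publishers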
Performing this action will display only the pub_name from the Publishers table with a “city” heading. If you do not include the comma, SQL Server assumes you want a new column name for pub_name.
Sometimes you might want to focus on a portion of the Publishers table, such as only publishers that are in Vancouver. In this situation, you would use the SELECT statement with the WHERE criterion, i.e., WHERE city = ‘Vancouver’.
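For example:

  SELECT pub_name, city
  FROM Publishers
  WHERE city = 'Vancouver'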
These first two examples illustrate how to limit record selection with the WHERE criterion using BETWEEN. Each of these examples gives the same results for store items with between 20 and 50 items in stock.
Example #1 uses the quantity, qty BETWEEN 20 and 50.
Example #2, on the other hand, uses qty >= 20 and qty <= 50.
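Sketches of both, assuming a Sales table with a qty column (illustrative names):

  -- Example #1: BETWEEN
  SELECT *
  FROM Sales
  WHERE qty BETWEEN 20 AND 50

  -- Example #2: comparison operators
  SELECT *
  FROM Sales
  WHERE qty >= 20 AND qty <= 50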
Example #3 illustrates how to limit record selection with the WHERE criterion using NOT BETWEEN.
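  -- Example #3: NOT BETWEEN
  SELECT *
  FROM Sales
  WHERE qty NOT BETWEEN 20 AND 50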
The next two examples show two different ways to limit record selection with the WHERE criterion using IN, with each yielding the same results.
Example #4 shows how to select records using province= as part of the WHERE statement.
Example #5 selects records using province IN as part of the WHERE statement.
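Sketches of both statements:

  -- Example #4: multiple equality tests
  SELECT pub_name, province
  FROM Publishers
  WHERE province = 'BC' OR province = 'AB' OR province = 'ON'

  -- Example #5: the equivalent IN list
  SELECT pub_name, province
  FROM Publishers
  WHERE province IN ('BC', 'AB', 'ON')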
The final two examples illustrate how NULL and NOT NULL can be used to select records. For these examples, a Books table (not shown) would be used that contains fields called Title, Quantity, and Price (of book). Each publisher has a Books table that lists all of its books.
Example #6 uses NULL.
Example #7 uses NOT NULL.
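Sketches, using the Books table described above:

  -- Example #6: rows where no price has been entered
  SELECT Title, Quantity, Price
  FROM Books
  WHERE Price IS NULL

  -- Example #7: rows that do have a price
  SELECT Title, Quantity, Price
  FROM Books
  WHERE Price IS NOT NULL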
The LIKE keyword selects rows containing fields that match specified portions of character strings. LIKE is used with char, varchar, text, datetime and smalldatetime data. A wildcard allows the user to match fields that contain certain letters. For example, the wildcard province = ‘N%’ would give all provinces that start with the letter ‘N’. Table 16.3 shows four ways to specify wildcards in the SELECT statement in regular expression format.
% | Any string of zero or more characters |
_ | Any single character |
[ ] | Any single character within the specified range (e.g., [a-f]) or set (e.g., [abcdef]) |
[^] | Any single character not within the specified range (e.g., [^a – f]) or set (e.g., [^abcdef]) |
Table 16.3. How to specify wildcards in the SELECT statement.
In example #1, LIKE ‘Mc%’ searches for all last names that begin with the letters “Mc” (e.g., McBadden).
For example #2: LIKE ‘%inger’ searches for all last names that end with the letters “inger” (e.g., Ringer, Stringer).
In example #3, LIKE ‘%en%’ searches for all last names that have the letters “en” (e.g., Bennett, Green, McBadden).
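Written as a full statement, example #1 might look like this (assuming the Employees table):

  SELECT LastName, FirstName
  FROM Employees
  WHERE LastName LIKE 'Mc%'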
You use the ORDER BY clause to sort the records in the resulting list. Use ASC to sort the results in ascending order and DESC to sort the results in descending order.
For example, with ASC:
And with DESC:
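  -- Ascending
  SELECT pub_name, city
  FROM Publishers
  ORDER BY pub_name ASC

  -- Descending
  SELECT pub_name, city
  FROM Publishers
  ORDER BY pub_name DESC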
The GROUP BY clause is used to create one output row per each group and produces summary values for the selected columns, as shown below.
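In general form (illustrative names):

  SELECT column1, aggregate_function(column2)
  FROM table_name
  GROUP BY column1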
Here is an example using the above statement.
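One possibility, assuming the Books table also has a Type column (illustrative):

  SELECT Type, SUM(Price)
  FROM Books
  GROUP BY Type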
If the SELECT statement includes a WHERE criterion where price is not null,
then a statement with the GROUP BY clause would look like this:
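  SELECT Type, SUM(Price)
  FROM Books
  WHERE Price IS NOT NULL
  GROUP BY Type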
We can use COUNT to tally how many items are in a container. However, if we want to count different items into separate groups, such as marbles of varying colours, then we would use the COUNT function with the GROUP BY command.
The below SELECT statement illustrates how to count groups of data using the COUNT function with the GROUP BY clause.
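  SELECT Type, COUNT(Type)
  FROM Books
  GROUP BY Type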
We can use the AVG function to give us the average of any group, and SUM to give the total.
Example #1 uses the AVG FUNCTION with the GROUP BY type.
Example #2 uses the SUM function with the GROUP BY type.
Example #3 uses both the AVG and SUM functions with the GROUP BY type in the SELECT statement.
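Sketches of these three statements, using the same illustrative Books table:

  -- Example #1: AVG with GROUP BY
  SELECT Type, AVG(Price)
  FROM Books
  GROUP BY Type

  -- Example #2: SUM with GROUP BY
  SELECT Type, SUM(Price)
  FROM Books
  GROUP BY Type

  -- Example #3: AVG and SUM together
  SELECT Type, AVG(Price), SUM(Price)
  FROM Books
  GROUP BY Type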
The HAVING clause can be used to restrict rows. It is similar to the WHERE condition except that HAVING can include aggregate functions; WHERE cannot.
The HAVING clause behaves like the WHERE clause, but is applicable to groups. In this example, we use the HAVING clause to exclude the groups with the province ‘BC’.
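For example:

  SELECT province, COUNT(province)
  FROM Publishers
  GROUP BY province
  HAVING province <> 'BC'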
The INSERT statement adds rows to a table. In addition:
The syntax for the INSERT statement is:
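  INSERT [INTO] table_name [(column_list)]
  VALUES (value_list)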
When inserting rows with the INSERT statement, these rules apply:
When you specify values for only some of the columns in the column_list, one of three things can happen to the columns that have no values:
This example uses INSERT to add a record to the publisher’s Authors table.
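A sketch (the column names and values are illustrative):

  INSERT INTO Authors (AuthorID, LastName, FirstName)
  VALUES (101, 'Smith', 'Anne')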
This following example illustrates how to insert a partial row into the Publishers table with a column list. The country column had a default value of Canada so it does not require that you include it in your values.
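A sketch (the pub_id value is illustrative; country is omitted so the default of Canada applies):

  INSERT INTO Publishers (pub_id, pub_name, city, province)
  VALUES ('9910', 'Acme Publishing', 'Vancouver', 'BC')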
To insert rows into a table with an IDENTITY column, follow the below example. Do not supply the value for the IDENTITY nor the name of the column in the column list.
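For example, using the tblHotel table sketched in Chapter 15, where HotelNo is the IDENTITY column:

  INSERT INTO tblHotel (Name, City)
  VALUES ('Hotel Grand', 'Vancouver')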
By default, data cannot be inserted directly into an IDENTITY column; however, if a row is accidentally deleted, or there are gaps in the IDENTITY column values, you can insert a row and specify the IDENTITY column value.
To allow an insert with a specific identity value, the IDENTITY_INSERT option can be used as follows.
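For example:

  SET IDENTITY_INSERT tblHotel ON

  INSERT INTO tblHotel (HotelNo, Name, City)
  VALUES (5, 'Hotel Heritage', 'Victoria')

  SET IDENTITY_INSERT tblHotel OFF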
We can sometimes create a small temporary table from a large table. For this, we can insert rows with a SELECT statement. When using this command, there is no validation for uniqueness. Consequently, there may be many rows with the same pub_id in the example below.
This example creates a smaller temporary Publishers table using the CREATE TABLE statement. Then the INSERT with a SELECT statement is used to add records to this temporary Publishers table from the Publishers table.
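A sketch; the column names and types are assumed to match the adapted Publishers table:

CREATE TABLE dbo.tmpPublishers (
    pub_id char(4) NOT NULL,
    pub_name varchar(40) NULL,
    city varchar(20) NULL,
    province char(2) NULL,
    country varchar(30) NULL
)

INSERT INTO tmpPublishers
SELECT pub_id, pub_name, city, province, country
FROM publishers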
In this example, we’re copying a subset of data.
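For example, copying only the publisher IDs and names (a sketch):

INSERT INTO tmpPublishers (pub_id, pub_name)
SELECT pub_id, pub_name
FROM publishers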
In this example, the publishers’ data are copied to the tmpPublishers table and the country column is set to Canada.
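One way to express this (a sketch):

INSERT INTO tmpPublishers (pub_id, pub_name, city, province, country)
SELECT pub_id, pub_name, city, province, 'Canada'
FROM publishers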
The UPDATE statement changes data in existing rows either by adding new data or modifying existing data.
This example uses the UPDATE statement to standardize the country field to be Canada for all records in the Publishers table.
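UPDATE publishers
SET country = 'Canada'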
This example increases the royalty amount by 10% for those royalty amounts between 10 and 20.
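A sketch, assuming the pubs roysched table, whose royalty column stores the royalty percentage:

UPDATE roysched
SET royalty = royalty + (royalty * 0.10)
WHERE royalty BETWEEN 10 AND 20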
The employees from the Employees table who were hired by the publisher in 2010 are given a promotion to the highest job level for their job type. This is what the UPDATE statement would look like.
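A sketch, assuming the pubs employee and jobs tables; the hire-year filter is hypothetical:

UPDATE employee
SET job_lvl =
    (SELECT max_lvl FROM jobs
     WHERE employee.job_id = jobs.job_id)
WHERE DATEPART(year, employee.hire_date) = 2010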
The DELETE statement removes rows from a record set. DELETE names the table or view that holds the rows that will be deleted, and only one table or view may be listed at a time. WHERE is a standard WHERE clause that limits the deletion to selected records.
The DELETE syntax looks like this.
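A common form (a sketch):

DELETE [FROM] table_name
[WHERE condition]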
The rules for the DELETE statement are: if the WHERE clause is omitted, every row in the table is removed; and DELETE deletes entire rows, not individual column values (use UPDATE to change values within a row).
What follows are three different DELETE statements that can be used.
1. Deleting all rows from a table.
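-- sketch: empties the pubs discounts table
DELETE FROM discounts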
2. Deleting selected rows:
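-- sketch: removes sales for one store (store ID hypothetical)
DELETE FROM sales
WHERE stor_id = '6380'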
3. Deleting rows based on a value in a subquery:
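A sketch that removes title-author rows for one category of titles (pubs tables assumed):

DELETE FROM titleauthor
WHERE title_id IN
    (SELECT title_id FROM titles WHERE type = 'mod_cook')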
There are many built-in functions in SQL Server, such as aggregate, conversion, date, mathematical, string, system, and text and image functions.
Below you will find detailed descriptions and examples for the first four functions.
Aggregate functions perform a calculation on a set of values and return a single, or summary, value. Table 16.4 lists these functions.
FUNCTION | DESCRIPTION |
AVG | Returns the average of all the values, or only the DISTINCT values, in the expression. |
COUNT | Returns the number of non-null values in the expression. When DISTINCT is specified, COUNT finds the number of unique non-null values. |
COUNT(*) | Returns the number of rows. COUNT(*) takes no parameters and cannot be used with DISTINCT. |
MAX | Returns the maximum value in the expression. MAX can be used with numeric, character and datetime columns, but not with bit columns. With character columns, MAX finds the highest value in the collating sequence. MAX ignores any null values. |
MIN | Returns the minimum value in the expression. MIN can be used with numeric, character and datetime columns, but not with bit columns. With character columns, MIN finds the value that is lowest in the sort sequence. MIN ignores any null values. |
SUM | Returns the sum of all the values, or only the DISTINCT values, in the expression. SUM can be used with numeric columns only. |
Table 16.4 A list of aggregate functions and descriptions.
Below are examples of each of the aggregate functions listed in Table 16.4. Each is a sketch that assumes the titles and authors tables from the pubs sample database.
Example #1: AVG
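-- sketch: average price across all titles
SELECT AVG(price) FROM titles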
Example #2: COUNT
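-- sketch: counts the rows with a non-null price
SELECT COUNT(price) FROM titles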
Example #3: COUNT
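-- sketch: counts only unique non-null values (DISTINCT assumed for illustration)
SELECT COUNT(DISTINCT city) FROM authors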
Example #4: COUNT(*)
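-- sketch: counts all rows, including those with nulls
SELECT COUNT(*) FROM titles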
Example #5: MAX
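-- sketch: highest price
SELECT MAX(price) FROM titles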
Example #6: MIN
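-- sketch: lowest price
SELECT MIN(price) FROM titles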
Example #7: SUM
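-- sketch: total year-to-date sales across all titles
SELECT SUM(ytd_sales) FROM titles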
The conversion function transforms one data type to another.
In the first example below, a price that contains two 9s (e.g., 19.99) is converted into five characters.
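A sketch, assuming the pubs titles table; a price of 19.99 becomes the five-character string '19.99':

SELECT CONVERT(char(5), price) FROM titles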
In this second example, the conversion function changes data to a data type with a different size.
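Here, the current datetime value is converted to a 12-character string so it can be concatenated with text:

SELECT 'The date is ' + CONVERT(varchar(12), getdate())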
The date function produces a date by adding an interval to a specified date. The result is a datetime value equal to the date plus the number of date parts. If the date parameter is a smalldatetime value, the result is also a smalldatetime value.
The DATEADD function is used to add and increment date values. The syntax for this function is DATEADD(datepart, number, date).
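For example, adding three days to each employee's hire date (a sketch against the pubs employee table):

SELECT DATEADD(day, 3, hire_date) FROM employee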
In this example, the function DATEDIFF(datepart, date1, date2) is used.
This function returns the number of datepart “boundaries” crossed between two specified dates. The method of counting crossed boundaries makes the result given by DATEDIFF consistent across all dateparts, such as minutes, seconds, and milliseconds.
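For example, the number of days between each title's publication date and today (a sketch against the pubs titles table):

SELECT DATEDIFF(day, pubdate, getdate()) FROM titles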
For any particular date, we can examine any part of that date from the year to the millisecond.
The date parts (DATEPART) and abbreviations recognized by SQL Server, and the acceptable values are listed in Table 16.5.
DATE PART | ABBREVIATION | VALUES |
Year | yy | 1753-9999 |
Quarter | qq | 1-4 |
Month | mm | 1-12 |
Day of year | dy | 1-366 |
Day | dd | 1-31 |
Week | wk | 1-53 |
Weekday | dw | 1-7 (Sun.-Sat.) |
Hour | hh | 0-23 |
Minute | mi | 0-59 |
Second | ss | 0-59 |
Millisecond | ms | 0-999 |
Table 16.5. Date part abbreviations and values.
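For example, extracting the month from the current date (a sketch):

SELECT DATEPART(month, getdate())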
Mathematical functions perform operations on numeric data. The following example lists the current price for each book sold by the publisher and what they would be if all prices increased by 10%.
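A sketch against the pubs titles table:

SELECT price, (price * 1.1) AS 'new price'
FROM titles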
Joining two or more tables is the process of comparing the data in specified columns and using the comparison results to form a new table from the rows that qualify. A join statement specifies a column from each table to be joined, compares the values in those columns row by row, and combines rows with qualifying values into new rows.
Although the comparison is usually for equality – values that match exactly – other types of joins can also be specified. All the different joins such as inner, left (outer), right (outer), and cross join will be described below.
An inner join connects two tables on a column with the same data type. Only the rows where the column values match are returned; unmatched rows are discarded.
Example #1
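-- sketch: each title with its publisher (pubs tables assumed)
SELECT publishers.pub_name, titles.title
FROM publishers INNER JOIN titles
    ON publishers.pub_id = titles.pub_id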
Example #2
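-- sketch: a three-table inner join linking authors to their titles
SELECT a.au_lname, t.title
FROM authors a
INNER JOIN titleauthor ta ON a.au_id = ta.au_id
INNER JOIN titles t ON ta.title_id = t.title_id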
A left outer join specifies that all left outer rows be returned. All rows from the left table that did not meet the condition specified are included in the results set, and output columns from the other table are set to NULL.
This first example uses the new syntax for a left outer join.
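A sketch: all publishers are listed, with NULL for title where a publisher has no titles (pubs tables assumed):

SELECT p.pub_name, t.title
FROM publishers p LEFT OUTER JOIN titles t
    ON p.pub_id = t.pub_id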
This is an example of a left outer join using the old syntax.
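-- sketch: the legacy *= operator; supported only in older SQL Server versions
SELECT p.pub_name, t.title
FROM publishers p, titles t
WHERE p.pub_id *= t.pub_id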
A right outer join includes, in its result set, all rows from the right table that did not meet the condition specified. Output columns that correspond to the other table are set to NULL.
Below is an example using the new syntax for a right outer join.
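A sketch: every publisher appears, even those with no matching titles:

SELECT t.title, p.pub_name
FROM titles t RIGHT OUTER JOIN publishers p
    ON t.pub_id = p.pub_id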
This second example shows the old syntax used for a right outer join.
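-- sketch: the legacy =* operator; supported only in older SQL Server versions
SELECT t.title, p.pub_name
FROM titles t, publishers p
WHERE t.pub_id =* p.pub_id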
A full outer join specifies that if a row from either table does not match the selection criteria, the row is included in the result set, and its output columns that correspond to the other table are set to NULL.
Here is an example of a full outer join.
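A sketch (pubs tables assumed):

SELECT p.pub_name, t.title
FROM publishers p FULL OUTER JOIN titles t
    ON p.pub_id = t.pub_id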
A cross join is a product combining two tables. This join returns the same rows as if no WHERE clause were specified in an old-style join. For example:
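-- sketch: every author paired with every publisher (pubs tables assumed)
SELECT au_lname, pub_name
FROM authors CROSS JOIN publishers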
aggregate function: returns summary values
ASC: ascending order
conversion function: transforms one data type to another
cross join: a product combining two tables
date function: displays information about dates and times
DELETE statement: removes rows from a record set
DESC: descending order
full outer join: specifies that if a row from either table does not match the selection criteria, it is included in the result set, and its output columns that correspond to the other table are set to NULL
GROUP BY: used to create one output row per group, producing summary values for the selected columns
inner join: connects two tables on a column with the same data type
INSERT statement: adds rows to a table
left outer join: specifies that all left outer rows be returned
mathematical function: performs operations on numeric data
right outer join: includes all rows from the right table that did not meet the condition specified
SELECT statement: used to query data in the database
string function: performs operations on character strings, binary data or expressions
system function: returns a special piece of information from the database
text and image functions: perform operations on text and image data
UPDATE statement: changes data in existing rows either by adding new data or modifying existing data
wildcard: allows the user to match fields that contain certain letters.
For questions 1 to 18 use the PUBS sample database created by Microsoft. To download the script to generate this database please go to the following site: http://www.microsoft.com/en-ca/download/details.aspx?id=23654.
Here is a statement of the data requirements for a product to support the registration of and provide help to students of a fictitious e-learning university.
An e-learning university needs to keep details of its students and staff, the courses that it offers and the performance of the students who study its courses. The university is administered in four geographical regions (England, Scotland, Wales and Northern Ireland).
Information about each student should be initially recorded at registration. This includes the student’s identification number issued at the time, name, year of registration and the region in which the student is located. A student is not required to enroll in any courses at registration; enrollment in a course can happen at a later time.
Information recorded for each member of the tutorial and counseling staff must include the staff number, name and region in which he or she is located. Each staff member may act as a counselor to one or more students, and may act as a tutor to one or more students on one or more courses. It may be the case that, at any particular point in time, a member of staff may not be allocated any students to tutor or counsel.
Each student has one counselor, allocated at registration, who supports the student throughout his or her university career. A student is allocated a separate tutor for each course in which he or she is enrolled. A staff member may only counsel or tutor a student who is resident in the same region as that staff member.
Each course that is available for study must have a course code, a title and a value in terms of credit points. A course is either a 15-point course or a 30-point course. A course may have a quota for the number of students enrolled in it at any one presentation. A course need not have any students enrolled in it (such as a course that has just been written and offered for study).
Students are constrained in the number of courses they can be enrolled in at any one time. They may not take courses simultaneously if their combined points total exceeds 180 points.
For assessment purposes, a 15-point course may have up to three assignments per presentation and a 30-point course may have up to five assignments per presentation. The grade for an assignment on any course is recorded as a mark out of 100.
The university database below is one possible data model that describes the above set of requirements. The model has several parts, beginning with an ERD and followed by a written description of entity types, constraints, and assumptions.
See Figure A.1.
Student (StudentID, Name, Registered, Region, StaffNo)
Staff (StaffNo, Name, Region) – This table contains instructors and other staff members.
Course (CourseCode, Title, Credit, Quota, StaffNo)
Enrollment (StudentID, CourseCode, DateEnrolled, FinalGrade)
Assignment (StudentID, CourseCode, AssignmentNo, Grade)
Using Figure A.2, note that a student record is associated with (enrolled in) a minimum of one and a maximum of many courses.
Each enrollment must have a valid student.
Note: Since the StudentID is part of the PK, it can’t be null. Therefore, any StudentID entered must exist in the Student table a minimum of once and a maximum of once. This should be obvious, since the PK cannot have duplicates.
Refer to Figure A.3. A staff record (a tutor) is associated with a minimum of 0 students to a maximum of many students.
A student record may or may not have a tutor.
Note: The StaffNo field in the Student table allows null values – represented by the 0 on the left side. However, if a StaffNo exists in the student table it must exist in the Staff table maximum once – represented by the 1.
Refer to Figure A.4. A staff record (instructor) is associated with a minimum of 0 courses to a maximum of many courses.
A course may or may not be associated with an instructor.
Note: The StaffNo in the Course table is the FK, and it can be null. This represents the 0 on the left side of the relationship. If the StaffNo has data, it has to be in the Staff table a maximum of once. That is represented by the 1 on the left side of the relationship.
Refer to Figure A.5. A course must be offered (in enrollment) at least once to a maximum of many times.
The Enrollment table must contain at least 1 valid course to a maximum of many.
Refer to Figure A.6. An enrollment can have a minimum of 0 assignments or a maximum of many.
An assignment must be associated with exactly one enrollment (a minimum and a maximum of 1).
Note: Every record in the Assignment table must contain a valid enrollment record. One enrollment record can be associated with multiple assignments.
This is an adaptation, not a derivation, as the author wrote half of it. Source: http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397581&section=8.2
A manufacturing company produces products. The following product information is stored: product name, product ID and quantity on hand. These products are made up of many components. Each component can be supplied by one or more suppliers. The following component information is kept: component ID, name, description, suppliers who supply them, and products in which they are used. Use Figure B.1 for this exercise.
Create an ERD to show how you would track this information.
Show entity names, primary keys, attributes for each entity, relationships between the entities and cardinality.
Component(CompID, CompName, Description) PK=CompID
Product(ProdID, ProdName, QtyOnHand) PK=ProdID
Supplier(SuppID, SuppName) PK = SuppID
CompSupp(CompID, SuppID) PK = CompID, SuppID
Build(CompID, ProdID, QtyOfComp) PK= CompID, ProdID
Create an ERD for a car dealership. The dealership sells both new and used cars, and it operates a service facility (see Figure B.2). Base your design on the following business rules:
Download the following script: OrdersAndData.sql.
1. Show a list of customers and the orders they generated during 2014. Display customer ID, order ID, order date and date ordered.
2. Using the ALTER TABLE statement, add a new field (Active) in the tblcustomer. Default it to True.
3. Show all orders purchased before September 1, 2012. Display company name, date ordered and total amount of order (include freight).
4. Show all orders that have been shipped via Federal Shipping. Display OrderID, ShipName, ShipAddress and CustomerID.
5. Show all customers who have not made purchases in 2011.
6. Show all products that have never been ordered.
7. Show OrderIDs for customers who reside in London. Use a subquery. Display CustomerID, CustomerName and OrderID.
8. Show products supplied by Supplier A and Supplier B. Display product name and supplier name.
9. Show all products that come in boxes. Display product name and QuantityPerUnit.
1. Create an Employee table. The primary key should be EmployeeID (autonumber). Add the following fields: LastName, FirstName, Address, City, Province, Postalcode, Phone, Salary. Show the CREATE TABLE statement and the INSERT statements for the five employees. Join the employee table to the tblOrders. Show the script for creating the table, setting constraints and adding employees.
2. Add a field to tblOrders called TotalSales. Show DDL – ALTER TABLE statement.
3. Using the UPDATE statement, add the total sale for each order based on the order details table.