Smart Computing 123: Introduction to Data Management and Database Technology

cached from: http://www.comp.glam.ac.uk/pages/staff/tdhutchings/chapter5.html

Introduction to Data Management and Database Technology

Introduction
Database Approach

Database Technology

Introduction

The objectives of this section are to describe the development of the database approach, define the term data management, and distinguish between ‘Data Administration’ and ‘Database Administration.

As described in chapter 1 many large organisations have come to realise that their data needs to be professionally managed. This chapter describes some of the activities involved in Data Management and some of the problems likely to be encountered when trying to adopt the database approach.

Contents Page

Home Page

The Database Approach

Firstly some explanation of the ‘database approach’ and its development. Historically information systems development involved the development of stand alone systems for separate applications each with their own set of master files. This meant that data such as addresses could be duplicated in many separate systems, e.g. an employees address could be held once in the payroll system and again in the personnel system. This causes two problems; waste of space and inconsistency, waste because the address is held twice when it only really needs to be held once and inconsistency because it is possible for the address to be updated in one system but not in the other (there may be no easy way of knowing which is definitive). slides 3 - 5

The problems described above led to the development of the database approach. There is a common misconception that the database approach is determined by whether the organisation uses database technology or not. This is not the case, the database approach is an approach in which an organisation attempts to build an integrated corporate database to support all the data requirements of the organisation (in this context database means all the data held by the organisation, whether held in computerised database applications, file systems, spreadsheets or any other medium). slide 6

The real test as to whether an organisation is using the database approach is whether their systems have as a foundation data structures built solely for the immediate requirements of a particular system or on data structures built to support the overall requirements and objectives of the organisation. Of course the implementation of a corporate database won’t happen overnight and there will be compromises between short term requirements and long term aims and objectives. The important thing is that the organisation has a ‘data management’ strategy in place and is moving step by step towards an integrated corporate database.

The problems involved with the implementation of the database approach are concerned with people issues not technical issues. The challenge for the analyst is to get people from different backgrounds with different knowledge of the organisation to agree on common data definitions.

Contents Page

Home Page

What is Data Management?

Data Management is the process of planning, co-ordinating and controlling an organisations data resource.

Contents Page

Home Page

What is a Data Architecture?

Data Management principles can be applied at different levels. At the highest level some organisations (British Gas, British Telecom and the NHS) have built corporate data architectures which define at a very high level the data structures required to support the operation of the whole enterprise. These architectures are not really intended to be used directly in the development of application systems, rather they provide a framework within which lower level data models for particular applications can be validated. In this way it is possible to ensure that individual systems have a view of data which is consistent with the company-wide view. At the other extreme it is often useful to apply data management principles to simple Access applications.

Contents Page

Home Page

What is a Corporate Database?

Many Organisations depend on the accurate and timely presentation of information to employees, customers, suppliers, shareholders and the government for their survival. In many cases computer systems are the only viable mechanism for storing and processing the data necessary to provide this information, given the volume of data and the complexity of the processing involved.

The data can be stored in a variety of ways, e.g.

Simple Sequential Files.
Indexed Sequential & Random Access Files.
Hierarchic Database Management Systems (IMS).
Network Database Management Systems (IDMS).
Relational Database Management Systems (ORACLE)
End-User Database Management Systems (DBASE, PARADOX, ACCESS)
spreadsheets (EXCEL, QUATTRO PRO)

The different data storage techniques can exist on practically any level of machine from PC to Mainframe, in various architectures, e.g. Distributed or Client/Server. In practice large organisations that are mature computer users will use many hardware/software platforms appropriate to their needs.

The total Collection of data held by an organisation - ‘The Corporate Database’ can be its most valuable asset. The intrinsic value of data is hard to calculate, so many organisations use a replacement cost approach. Consider how much it would cost to replace a large organisation such as a bank (where the number of customers is measured in millions and the size of the database measured in gigabytes) to replace information on all their customers....names, addresses, telephone numbers, current balance, account history, overdraft limit etc.

A valuable resource such as a corporate database has to be managed effectively on a day to day basis and over the longer term.

At the day to day level the corporate database has to;

perform at acceptable levels in supporting the required processing,
make reasonably efficient use of disk space and be flexible enough to accommodate changes in storage mechanisms,
Cope with changing data structure and/or processing requirements.

At the abstract level in the longer term the corporate database has to;

accurately reflect the organisations view of the real world and
be flexible enough to change as business requirements change.

Looking to the future the corporate database environment is becoming more and more technically complex, in order to support end users who are increasingly sophisticated in their understanding of the technology and in terms of their requirements. Users expect, from their desktops, to be able to use a friendly (icon or object based) interface to access whatever corporate data they require from many different locations, on lots of different platforms, held in different database management systems or indeed spreadsheet data. They expect to be able to manipulate the data locally using their own particular tools and then transmit updates to other locations in a network, client/server or distributed environment.

The challenge for the database manager is to retain control over data supporting core business activities and to ensure that business rules are consistently applied when this core data is accessed and manipulated. The challenge obviously becomes more difficult in a rapidly changing technical environment.

Contents Page

Home Page

What Activities are Involved in Data Management?

Typically data management activities are divided into two areas, i.e. ‘data administration’ and ‘database administration’.

Data Administration

Data Administration is concerned with the management of an organisations meta-data (i.e. data about data and processes) rather than the physical information held on storage devices. In other words it deals with the logical or business view of data. Data administration involves the following activities:-

Offering consultancy on all aspects of meta-data
Promoting corporate awareness of the data available (most people have a very narrow view of data relating it only to the particular requirement of their job)
Identification of corporate data requirements
Data analysis and definition
Data dictionary control and monitoring
Defining data privacy policy (who has access to what?)
Defining security and recovery policy
Defining standards for physical data models (diagrams which show the physical structure of a database)

Database Administration

Database administration is primarily concerned with the technical integrity of the physical database environment, the day to day operations of an organisations databases and the policies governing their everyday use. The following activities are involved:-

Control of the database environment, including carrying running repairs in the event of failure
Database Design
Implementation of database, security, recovery and integrity checking mechanisms
Technical standards and guidelines
Impact assessment
Monitoring data usage
Training for the database environment

Contents Page

Home Page

Tutorial Sheet Data Management

Objectives

To ensure that you understand the terms ‘database approach’, ‘data management’, data administration’ and database administration’ and understand some of the problems involved in implementing them.

Question 1

Why is data so valuable, why does it need to be professionally managed and what steps can organisations take to protect their investments?

Question 2

How can an organisation using database technology not be adopting the database approach?

Question 3

How could adopting a data architecture affect the timescales of a project, can a data architecture approach be justified in an environment where everybody wants everything yesterday?

Question 4

Consider the following tasks and discuss whether they are data administration activities or database administration activities:-

Changing the definition of an entity in a data dictionary.
Copying a database onto tape and taking the tapes to an off-site safe.
Deciding how often to take the tapes off-site.
Restructuring a database.
Interviewing the director of marketing
Patching a database at 2 am.
Deciding whether the director of finance should be allowed to access individual salary fields.
Giving the finance director access to salary fields.
Deciding whether to build an index for performance reasons.

Contents Page

Home Page

Stable, Common Data Structures

The Data Structures which are required to support the operations of an enterprise are relatively stable over time. This philosophical belief developed by Martin (1976) underpins the Information Engineering methodology (Martin, 1989).

The Information Engineering Methodology (IE, Martin, 1989) differs from other structured methodologies in that its scope is very broad (the whole organisation and all its information systems) and deep (the methodology extends from strategic information systems planning, through to implementation and maintenance. IE is also unusual because of its emphasis on the importance of structural data requirements as opposed to processing requirements.

IE consists of 4 phases :-

Information Strategy Planning: This process is carried out at the corporate level and results in the production of a corporate information architecture, a cross mapping of the key corporate entity types and the business functions which access them.
Business Area Analysis: During this phase the business areas identified in the information strategy plan are analysed in detail using data and functional analysis.
System Planning and Design: This phase is divided into business system design and technical design.
Construction and Cutover: This phase covers the construction of software, implementation and maintenance.

In 1989 a large organisation (Organisation A) adopted the Information Strategy Planning phase of the IE Methodology and built a Corporate Data Architecture (CDA) to provide a framework for the development of corporate systems. The architecture consisted of a very high level Entity-Relationship Model. In 1991 a similar organisation (Organisation B) produced a CDA using the same approach. Superficially the CDAs appeared dissimilar. However on closer examination many of the data structures involved were common to both organisations and in many cases the apparent differences concerned the terminology not the underlying meaning.

The CDA for Organisation A contained an entity called ‘Person_Or_Organisation’. The definition of the entity was "A Person or Organisation who has a relationship with Organisation A (past/present/potential). This entity also includes Organisation A itself". The CDA for Organisation B contained an entity called ‘Party’ which had the following sub-types, Individual Party and Organisational Party . The definition of the entity was "An identifiable body that may be of interest to the company"

There are many other examples to be found within the architectures. Clearly, even though the organisations are currently undergoing fundamental change, both internally and with regard to their respective markets, the underlying data structures required to hold information about the people or organisations involved in the business are largely stable. This would appear to be the case both within and across organisations.

The architectures also include matrix analyses of business processes against entities, both sets of matricies indicated that the fundamental data structures are shared by many processes, from different functional areas of the businesses.

Contents Page

Home Page

Corporate Information Management

The paper has indicated that the information systems supporting core business activities such as Engineering, Marketing, Customer Billing, Payroll and Stock Control have largely stable, common data structures. Unfortunately many organisations in the past developed discrete systems to support these areas, each with their own sets of programs and files or databases. The result has been considerable overlap and duplication of data with all the resultant problems (Howe, 1984). A common example of corporate data concerns people and places, e.g. the names and addresses of customers, shareholders, employees.

Discrete Application Systems

Adopting this approach may cause problems in the not unlikely scenario of a single person being a customer, shareholder and employee of the organisation. Logically the person has one name and address. However the name and address is recorded in three places. Which is correct? What happens if the person changes their name or address? slide 1

Corporate Database

Adopting the corporate data approach enables a person’s name to be held in a single central location. The specific details of a person as a customer are held separately. slide 2

The example chosen is a gross simplification of a complex data modelling area, however it serves to illustrate the concept of ‘Corporate Information Management’. A Corporate Information Management Strategy (CIMS) would imply that the data structures required to support an information system be designed not simply to support the immediate functional requirements of the system in question but also to support the overall aims and objectives of the organisation.

Kent (1978) argues that the data model produced can only ever be a model not the model. However in my view the important thing is that the model is effective in supporting the organisation and that common understanding and agreement is reached.

The approach brings numerous long term benefits and not simply concerning the quality of the data. Adopting the first approach implies that three separate programs are required and that each of these programs will have a section of code to deal with names and addresses. Under the corporate data approach a single names and addresses module would be called from the customer, shareholder and employee programs. The sum of many seemingly small gains could result in a drastic reduction in system maintenance. (It may be noted that this approach dovetails neatly with the ‘gen-spec’ and ‘inheritance’ concepts from the field of object orientation, the ‘customer object’ could inherit name and address data from the ‘person’ object.)

Contents Page

Home Page

Database Technology

A Brief History

Machine Code

In the earliest days of computing programming was effectively carried out with a soldering iron. Problems in this area led to the development of the first programming ‘language’ - machine code (1st generation language). A machine code program consisted of a sequence of binary instructions which the computer would process. At this time there was no real division of labour, a small number of people with great expertise acted as analysts, designers, programmers and operators.

Assembler

The next step was the development of assembler (2nd generation language). An assembler program consisted of a series of mnemonic instructions which were translated/compiled into machine code by the computer. Assembler, though complex was obviously easier to learn and use than machine code, which reduced the reliance on technical experts. Assembler is still widely used in real time and safety critical systems and in games programming, however it would be unusual to come across assembler in a business information system.

3rd Generation Languages

The real leap forward in terms of programming languages was the advent of 3rd generation languages (3GL) the most common of which is COBOL (Common Business Oriented Language). The instructions in a Cobol program consist of English like statements such as ‘open customer-file’, ‘read customer-file’, ‘add bill-amount to outstanding-balance’. During the 70’s large organisations developed complex information systems for billing, payroll and stock control using the Cobol programming language and a combination of sequential, indexed and random access files and in fact there are some of these systems still in existence.

The Database Approach & 3rd Generation Database Systems

As described in section 6, historically the development of information systems involved the use of stand alone systems each with their own set of master files. To overcome the problems (waste of space, inconsistency, duplication of code and very importantly lost opportunities to share data) associated with application centred development approaches the hierarchic and network database management systems were developed.

The mid to late 70’s saw the development of the hierarchic and network database systems (3rd generation database systems). Through the 80’s many organisations converted their existing (file based) information systems to use either hierarchic or network database systems.

These systems allowed files to be linked and shared between applications, they provided code to maintain the integrity of data and common I/O routines. This approach took a considerable weight from the shoulders of programmers who no longer had to concern themselves with the intricate details of how data was physically stored only with its logical structure, the physical complexity was managed by the DBMS. The Hierarchic and Network systems continue to provide an established technology with which to maintain corporate data and are still in common use.

Relational Database Systems (The 4th Generation)

The Hierarchic and Network systems although more flexible than traditional file systems are still not flexible enough. The limitations of these so called 3rd generation database systems led to the development of 4th generation or Relational Database Management Systems. The mid to late 80’s through to the present day has seen the increasing realisation of large scale main-frame relational database systems (although the technology has been in existence for much longer). Relational Database Management Systems (RDBMS) are 4th generation database systems. The key difference between 4th and 3rd generation database systems is in terms of flexibility. It is relatively easy and quick to create new database structures and amend existing structures in relational systems, whereas to restructure a network database is a complex and time consuming task.

An integral part of a relational database system is the language SQL (Structured Query language). SQL is the 4th generation language (4GL). The key difference between 4th and 3rd generation languages is database navigation. In a language like COBOL the programmer has to specify in detail the database navigation required to produce the desired results. Using SQL the developer simply has to specify what information is required, e.g. ‘Select Customer_Name from Customers;’ The RDBMS itself then decides what navigation is required.

Summary

The current systems development environment, particularly in large organisation who have adopted the database approach but still have legacy systems to maintain and end-user applications to enable, is complex with a wide variety of different hardware and software platforms.

It can be seen that data/database administration in this environment is a challenging task, not simply because of the complex technical environment but also because of the inherent ‘people’ problems associated with the adoption of a corporate database approach.

The task is not likely to became any easier;

The technical environment may become more complex with the increasing use of Object Oriented Database Management Systems (OODBMS),

The current political climate advocates short term gain, whereas the corporate database approach is s strategy for long term benefit.

Contents Page

Home Page

The Relational Model

Introduction

The first ideas on using a ‘Relational’ model of data were developed by Ted Codd, while working for the IBM Research Laboratory in San Jose, California. Ted Codd published the seminal work on relational systems in an article entitled ‘A Relational Model of Data for Large Shared Data Banks’ in the Communications of the ACM (Volume 13 / Number 6 / June 1970). The article provides a mathematical underpinning for relational systems based on set theory. Since the publication of this article relational theory has been refined and extended by experts in the field.

The main elements of Relational Theory are the following;

Ted Codd’s 13 rules for relational systems
Relational Integrity
Normalisation

In this document I do not intend to describe the theoretical basis of the relational model, but to describe in practical terms how the body of theory affects data analysis and database design.

The three fundamentals of relational theory are

That all information is held in table structures and that unlike hierarchic and network systems all relationships among data are described using data values, i.e. foreign keys, rather than physical pointers.
That each value in a column in a table is atomic, i.e. does not contain a repeating group or any directly accessible subsidiary elements.
The use of SQL

With these basics in mind we now go on to discuss Ted Codd’s 12 rules for relational systems.

The Relational Rules

The following discussion is based on the findings of an experienced data manager at a large UK organisation. Please remember that the fact that an advertised system fails to comply with one of the following rules doesn’t mean it’s necessarily bad. In fact Ted Codd set a very difficult task which, to date, according to my knowledge, has not been met. There are more things to life (in the world of databases) than fidelity to these rules, such as performance.

Rule 0

‘For any system that is advertised as, or is claimed to be, a relational database management system, that system must be able to manage databases entirely through its relational capabilities.’

In practice this means that all data manipulation must be achievable through SQL. Early versions of Paradox, i.e. those prior to the addition of the SQL interface, though advertised as relational were not truly relational.

Rule 1 The Information Rule

‘All information in a relational database is represented explicitly at the logical level in exactly one way - values in tables.’

The table is the only available structure for storing data in a relational system.

Rule 2 Guaranteed Access Rule

‘Each and every datum (atomic value) in a relational database is guaranteed to be logically accessible by resorting to a combination of table-name, primary key value and column name’.

The common Customer/Order scenario could be implemented in Cobol, or indeed in a network DBMS as a single record type as follows

01 CUSTOMER

03 Customer-Number

03 Customer-Name

03 Customer-Address

05 Street-Address

05 District

05 Postal-Town

05 County

05 Postcode

03 ORDER Occurs 10 times

05 Order-Number

05 Order-Date

05 ORDER-LINE Occurs 15 Times

07 Quantity

07 Part-Number

07 Part-Description

07 Part-Price.

07 Supplier_Number

07 Supplier-Name

What are the problems with this approach regarding the following type of operation and how does the relational model overcome them ?

Insertion
Update
Deletion

The distinguishing features and benefits of the relational model stem from the first three rules;

Simplicity; The simple tabular representation of data allows easier understanding and communication.
Flexibility; Navigation using foreign keys does not have to be predefined.
Productivity; In general SQL is easier to learn and use than a 3GL such as Cobol, since it is possible to define what you want without specifying how to get it.

Rule 3 Systematic Treatment of Null Values

‘Null values are supported in fully relational systems for representing missing information and inapplicable information in a systematic way, independent of the data type.’

In simple terms the RDBMS should be able to distinguish between, 0 and null and, between ‘ ‘ and null. Furthermore the RDBMS should allow columns to defined as ‘Not Null’.

Rule 4 Dynamic On-Line Catalogue Based on the Relational Model

‘The Database Description is represented at the logical level in the same way as ordinary data, so that authorised users can apply the same relational language to its interrogation as they apply to regular data.’

This approach enables the use of SQL statements to access database descriptions, i.e. the ‘DESC table’ command enables the user to access tables containing the description of a table, in fact the RDBMS converts DESC into a series of SQL statements. This is one of the fundamental reasons why RDBMS are more flexible than network systems. In network systems the database description has to be predefined and compiled, in relational systems each time a command like ‘Select * From Customers’ is issued the RDBMS accesses the on-line catalogue to establish what ‘*’ means.

Rule 5 Comprehensive Data Sub-language Rule

‘A relational system may support several languages and various modes of use. However there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and that is comprehensive in supporting all of the following items;

Data Definition, e.g. Create Table,
View Definition, e.g. Create View CUSTORD as Select etc.,
Data Manipulation, e.g. Select,
Integrity Constraints (see next section)
Authorisation, e.g. Grant Select on Customers to TDHUTCHI.
Transaction Boundaries, (Commit and Rollback)’

This means that a single language (SQL) must be available for all the above purposes. This in effect is the same as Rule 0.

Rule 6 View Updating Rule

‘All views that are theoretically updatable are also updatable by the system’

This is best explained using an example;

Consider the following SQL;

Create Table Employees (Employee_Number Number(6) not null,

Employee_Name Char(30) not null,

Employee_Salary Number(10,2));

Create View Emp_Sal

as Select Employee_Name, Employee_Salary from Employees;

Create View Emp_Name

as Select Employee_Number, Employee_Name from Employees;

Which of the views, Emp_Sal or Emp_Name could be used to insert a new Employee row ?

Rule 7 High Level Insert, Update and Delete

‘The capability of handling a base table or a view using set at a time operation applies not only to the retrieval of data but also to the insertion, update and deletion of data.’

The Select statement provides set at a time operation, i.e. The command ‘Select * from Customers’ will retrieve all customers without the need to code an iteration, in other words a single statement can access a set of records. Rule 7 says that the facility for set at a time operation should also apply to insert, update and delete statements.

Rule 8 Physical Data Independence

‘Application programs should remain unimpaired whenever any changes are made in either storage representation or access methods.’

In other words physical changes should not affect the logical function of any programs. Thus a DBA should be able to move a physical data file from one disk to another or decide to index a column or drop an index without any coding changes being necessary.

ORACLE6 falls down somewhat in this respect. What would happen to a program used to create new customers if the unique index on Customer_Number was dropped ?

Rule 9 Logical Data Independence

‘Application programs should remain logically unimpaired when information preserving changes of any kind that theoretically permit unimpairment are made to base tables.’

Suppose that a single Employee table had to be split into two separate tables one for sales employees who earn commission and another for other employees. Consider the following SQL statements;

Create table Employees (Employee_Number Number(6) not null,

Employee_Name Char(30) not null,

Commission Number(8,2));

(*1) Select Employee_Number, Employee_Name From Employees;

Create Table Sales_Employees

as Select * From Employees where Commission is not null;

Create Table Other_Employees

as Select Employee_Number, Employee_Name from Employees where commission is null;

Drop Table Employees;

Create View Employees

Select Employee_Number, Employee_Name, Commission from Sales_Employees

UNION

Select Employee_Number, Employee_Name from Other_Employees;

(*2) Select Employee_Number, Employee_Name from Employees;

Should the select statements numbered *1 and *2 work in the same way ?

Rule 10 Integrity Independence

‘Integrity Constraints specific to a particular relational database must be definable in the relational data sub-language and storable in the catalogue, not in application programs.’

More on relational integrity in the next section.

Rule 11 Distribution Independence

‘Application programs should not be logically impaired when changes are made to the geographical location of data.’

This is simply an extension of rule 9 regarding physical data independence. Put simply your SQL statements should not be affected if your company moves the employee table from Cardiff to Newport.

Rule 12 The Non-Subversion Rule

‘If a relational system has a low-level (row at a time) language (ORACLE has a language called PL/SQL), the language should not be able to subvert or bypass any rules specified for the high level language.’

If a low level language such as PL/SQL allowed you to create tables with repeating groups then the system itself would not be truly relational.

Relational Integrity

The Relational Model has three integrity rules;

Entity Integrity; No attribute participating in the primary key of a table is allowed to accept null values. A primary key value that was null would be a contradiction in terms, in effect saying that some entity had no identity, i.e. it did not exist. In addition the primary key must be unique otherwise how could an individual row be accessed.
Referential Integrity; If a table contains a foreign key from another table, there are two possibilities for the values in this foreign key; if the member takes part in a mandatory relationship with the owner table then the foreign key must be equal to a primary key in the owner table; if the table takes part in an optional relationship with the owner then the foreign key can either take the value of a primary key in the owner table or the value null. (This emphasises the importance of the systematic treatment of null values.)
Other Integrity Rules; The Catalogue should enable the definition of business rules at table creation, e.g. default values and ranges.

ORACLE (at version 6) falls down with each of these rules ;

Entity Integrity is partially catered for by allowing the DBA to specify not null for primary keys, but uniqueness required the creation of an index.
Referential Integrity is partially catered for by allowing the DBA to specify not null for foreign keys.
Other Integrity isn’t catered for at all, default values and ranges have to be specified in programs.

ORACLE version 7 is much better so far as relational integrity is concerned in that it allows you to specify primary and foreign keys at table creation, e.g.

Create Table Order (Order_Number Number(6) Primary Key,

Customer_Number Number(6)

References CUSTOMERS(Customer_Number),

Order_Date Date);

Normalisation

Normalisation is defined briefly but accurately in the following statement;

‘The Key the Whole Key and Nothing but the Key’

Typically the literature on normalisation covers many levels of normalisation, 9 is not uncommon, but this seems to me to be a race amongst academics to identify as many levels as possible, in 99 cases out of 100, 3 levels of normalisation are all that is required.

1st Normal Form; converting an un-normalised data structure such as a report or an order form into 1st Normal Form (1NF) is commonly referred to as removing repeating groups but also may involve removing complex groups such as the Address Group described in rule 2. The aim is to ensure that each item is atomic.

2nd Normal Form; Converting a 1NF data structure into 2nd Normal Form (2NF) involves looking at each non-primary key attribute and ensuring that it depends on the whole of the key and not just part of it.

3rd Normal Form; Converting a 2NF data structure into 3rd Normal Form (3NF) involves looking at the interrelationships between non key attributes to see if any non key attributes depend only on each other.

This is all best described by looking at an example. Consider the following table which has been built up by an order entry clerk;

Cust#	Name	Ord#	Date	Part#	Desc	Qty	Price	Supp#	Name
1	Tim	123	20/3	1	AA	2	1.99	23	ABC
				2	BB	3	2.99	23	ABC
				3	CC	4	3.99	24	DEF
		456	21/3	4	DD	5	4.99	25	GHI
				5	EE	6	5.99	26	JKL
2	John	789	21/3	4	DD	7	3.99	25	GHI
				6	FF	8	6.99	27	MNO

This table structure could be implemented quite easily in Cobol or in a network DBMS as shown in the example under rule 2, with all the associated problems.

A common representation of this kind of table in text books is as follows;

CUSTOMERS(Customer_Number, Customer_Name, (Order_Number, Order_Date, (Part_Number, Part_Description, Part_Quantity, Part_Price, Supplier_Number, Supplier_Name))

The internal brackets are meant to represent repeating groups and the underline represents a primary key. This called an un-normalised or 0NF data structure.

There are two approaches converting this 0NF structure to 1NF the first involves replicating the values in the table as follows;

Cust#	Name	Ord#	Date	Part#	Desc	Qty	Price	Supp#	Name
1	Tim	123	20/3	1	AA	2	1.99	23	ABC
1	Tim	123	20/3	2	BB	3	2.99	23	ABC
1	Tim	123	20/3	3	CC	4	3.99	24	DEF
1	Tim	456	21/3	4	DD	5	4.99	25	GHI
1	Tim	456	21/3	5	EE	6	5.99	26	JKL
2	John	789	21/3	4	DD	7	3.99	25	GHI
2	John	789	21/3	6	FF	8	6.99	27	MNO

However this seems to be a clumsy approach and results in a three part key consisting of Cust#, Ord# and Part#. A simpler approach is to separate the repeating groups out into separate tables.

Step 1 remove the repeating group of orders

CUSTOMERS(Customer_Number, Customer_Name)

ORDERS(Order_Number, Customer_Number*, Order_Date, (Part_Number, Part_Description, Part_Quantity, Part_Price, Supplier_Number, Supplier_Name))

Step 2 remove the repeating group of parts

CUSTOMERS(Customer_Number, Customer_Name)

ORDERS(Order_Number, Customer_Number*, Order_Date)

ORDER_PARTS(Part_Number, Order_Number*, Part_Description, Part_Quantity, Part_Price, Supplier_Number, Supplier_Name)

The structure is now in 1NF since there are no repeating or complex group items (each item depends on the key). The next step is to convert the structure into 2NF, by examining each non primary key attribute to ensure that each depends on the whole of the key.

The CUSTOMERS and ORDERS tables each have a single column making up their primary key and are therefore by definition in 2NF. However looking at the ORDER_PARTS table it can be seen that Part_Description, Part_Price, Supplier_Number and Supplier Name only depend on Part_Number, i.e. their values are the same regardless of Order_Number. (Part_Quantity depends on the whole of the key since different quantities can appear on different orders.) To convert to 2NF a separate table is created for part descriptions, prices ,and supplier details

CUSTOMERS(Customer_Number, Customer_Name)

ORDERS(Order_Number, Customer_Number*, Order_Date)

ORDER_PARTS(Part_Number, Order_Number*, Part_Quantity)

PARTS(Part_Number, Part_Description, Part_Price, Supplier_Number, Supplier_Name)

The structures are now in 2NF since every non-primary key attribute depends on the whole of the key. The next step is to convert the structure into 3NF by ensuring that each non-primary key attribute depends on nothing but the key.

The CUSTOMERS table is patently in 3NF because there is no non-primary key attribute for Customer_Name to depend on. The ORDERS table is in 3NF because there is no dependency between Order_Date and Customer_Number (a customer can place different orders on different dates). The ORDER_PARTS table is in 3NF because the quantity ordered is dependent on both the order number and the part number. Looking however at the PARTS table it can be seen that the Supplier_Name attribute depends on the Supplier_Number and has nothing to do with the part number. To convert the structure into 3 NF a separate table is created containing supplier details.

CUSTOMERS(Customer_Number, Customer_Name)

ORDERS(Order_Number, Customer_Number*, Order_Date)

ORDER_PARTS(Part_Number, Order_Number*, Part_Quantity)

PARTS(Part_Number, Supplier_Number*, Part_Description, Part_Price)

SUPPLIERS(Supplier_Number, Supplier_Name)

Introduction to SQL

In this section we discuss what has become the de facto standard language for accessing relational database systems, i.e. SQL (Structured Query Language - pronounced SEQUEL or ESQL). It should be remembered that packages such as paradox and access have only recently had SQL interfaces added to them.

The SQL language can be broken down in to three types of statement :-

Environment Statements; These control the run time environment, things like pagesize;
DDL Statements; These Data Description Language statements enable the user to create and restructure tables. These statements are largely unnecessary in paradox since it provides a tool for creating tables;
DML Statements; the Data Manipulation Language statements allow the user to physically insert data into tables, to update this data, to delete data from tables and to run queries against these tables.

In this section we will concentrate on DML and in particular the ‘Select’ statement.

Consider the following tables :-

CUSTOMERS (Customer_Number, Customer_Name)

1 Tim

2 John

3 Anna

ORDERS (Order_Number, Customer_Number*, Order_Date)

1 1 17/10/95

2 1 18/10/95

3 2 19/10/95

ORDER_LINES (Part_Number, Order_Number*, Part_Quantity)

1 1 30

1 2 20

2 1 10

3 1 20

3 2 40

4 3 10

PARTS (Part_Number, Part_Description, Part_Price)

1 aaa 1.99

2 bbb 2.99

3 ccc 3.99

4 ddd 4.99

5 eee 5.99

The SQL language consists of 4 statements, i.e. Insert, Update, Delete & Select.

The insert statement is used to create new rows in tables;

INSERT INTO PARTS (PART_NUMBER,PART_DESCRIPTION,PART_PRICE)

VALUES (5,’fff’,6.99);

The update statement is used to change the information in existing rows in a table;

UPDATE PARTS

SET PART_DESCRIPTION = ‘xyz’

WHERE PART_NUMBER = 1;

The delete statement is used to remove information from tables;

DELETE FROM PARTS

WHERE PART_NUMBER = 5;

The select statement is used to display the information in tables;

SELECT PART_NUMBER,PART_DESCRIPTION

FROM PARTS

WHERE PART_NUMBER <> 1;

The following section describes the ‘Select’ statement in question and answer format;

How can I display all the columns in all the rows of the CUSTOMERS table ?	SELECT * FROM CUSTOMERS; (* could be replaced by a list of all the columns in the customers table)
How can I display all the columns in some of the rows in the customers table ?	SELECT * FROM CUSTOMERS WHERE (CUSTOMER_NUMBER < 3); (the where clause can use equality, inequality and other features like > and <)
How can I display some of the columns from all of the rows in the PARTS table ?	SELECT PART_NUMBER, PART_PRICE FROM PARTS; (the list of columns doesn’t have to include all the columns in the table)
How can I display some of the columns from some of the rows in the PARTS table and put the results into a particular order ?	SELECT PART_NUMBER,PART_PRICE FROM PARTS WHERE PART_PRICE > 2.99 ORDER BY PART_PRICE DESC; (the selection criteria doesn’t have to be based on the primary key)
What happens if I want to display information from two tables, e.g. Display a list of customers together with the dates on which they have placed orders.	SELECT CUSTOMERS.CUSTOMER_NUMBER, CUSTOMERS.CUSTOMER_NAME, ORDERS.ORDER_DATE FROM CUSTOMERS,ORDERS WHERE (CUSTOMERS.CUSTOMER_NUMBER = ORDERS.CUSTOMER_NUMBER) (In this example two tables have been joined together using the foreign key)
What happens if I want to display information from three tables, e.g. display a list of all orders together with their Order_Lines including the part description, placed on a particular date	SELECT ORDERS.ORDER_NUMBER, ORDERS.CUSTOMER_NUMBER ORDERS.ORDER_DATE, ORDER_LINES.PART_NUMBER PARTS.PART_DESCRIPTION FROM ORDERS,ORDER_LINES,PARTS WHERE ((ORDERS.ORDER_NUMBER = ORDER_LINES.ORDER_NUMBER) AND (ORDER_LINES.PART_NUMBER = PARTS.PART_NUMBER)) AND (ORDERS.ORDER_DATE = ‘17/10/95’); (In this example the ORDERS table has been JOINed to the ORDER_LINES table and the ORDER_LINES table has been JOINed to the PARTS table)
What happens if the results of one query depends on the results of another? e.g. display a list of all customers who haven’t placed orders?	SELECT * FROM CUSTOMERS WHERE CUSTOMER_NUMBER NOT IN (SELECT DISTINCT CUSTOMER_NUMBER FROM ORDERS); (In this example the results of the outer query depend on the results of the inner query, the other point to note is the use of the ‘distinct’ clause.
What happens if summary results are required, e.g. How many customers are in the database ?	SELECT COUNT(*) FROM CUSTOMERS; (Running this query would simply display the answer 3)
What happens if information has to be displayed in groups, e.g. which parts appear on more than one order?	SELECT PARTS.PART_NUMBER, PARTS.PART_DESCRIPTION FROM PARTS,ORDER_LINES WHERE (PARTS.PART_NUMBER = ORDER_LINES.PART_NUMBER) GROUP BY PART_NUMBER HAVING COUNT > 1 (This example joins the parts and order_lines tables, groups the result on the part_number and only displays those groups where there is more than one row)
Display the most expensive part	SELECT PART_NUMBER, PART_PRICE FROM PARTS WHERE PART_PRICE = (SELECT MAX(PART_PRICE) FROM PARTS);
What is the average part price?	SELECT AVG(PART_PRICE) FROM PARTS;
What if the results have to contain calculations, e.g. display each order_line together with its value.	SELECT ORDER_LINES.ORDER_NUMBER, ORDER_LINES.PART_NUMBER, ORDER_LINES.QUANTITY, ORDER_LINES.QUANTITY*PARTS.PART_PRICE WHERE (ORDER_LINES.PART_NUMBER = PARTS.PART_NUMBER);

Contents Page

Home Page

Hierarchic and Network Database Systems

In this module we will be discussing the Relational Model in some detail however a brief summary of the features of the Hierarchic and Network Systems follows.

Hierarchic Database Systems

Hierarchic Database Systems (HDS), e.g. IMS (Information Management System) produced by IBM, enable tree structured relationships to be implemented efficiently using indexes and pointers (these concepts will be discussed in detail in the next section on Network Database Systems since the underlying technology is the same). A brief definition of a tree structure is that each child node only has one parent node and at the top of the tree is a single parentless node.

A HDS could be used to implement a customers, orders and order lines structure, however the introduction of parts causes a problem, the model is now no longer a tree structure since the order lines entity has two owners. This common structure is difficult to implement efficiently using HDS and typically extra indexing or data duplication is required.

Network Database Systems

Network Database Systems (NDS) are more flexible and enable ‘network’ structures to be implemented efficiently. A brief definition of a network structure is that child nodes may have any number of parent nodes. We have already seen that LDS’ are networked structures and that these network structures can be implemented in a RDBMS’ such as ORACLE, using appropriate tables and foreign keys.

In the Relational Model the underlying implementation structure is the table. Each entity type in an LDS becomes a table in the RDBMS. The relationships between entities are represented implicitly in tables using a system of foreign keys.

In the Network Model the underlying implementation structures are Record Types and Set Types, typically each entity type becomes a record type and each relationship type becomes a set type.

Network Schemas

Prior to using a NDS each record type and set type has to be defined in a SCHEMA. A schema is a logical definition of the record and set types which comprise a NDS. The first step in the creation of a NDS is the production of a schema An example schema for the well known Customers, Orders, Order_Lines and Parts scenario follows.

Schema Name is ORDERS

Record Name is CUSTOMER

01 CUSTOMER

03 Customer-Number Pic 9(8)

03 Customer-Name Pic X(30)

03 Address

05 Street-Address Pic X(30)

05 District Pic X(20)

05 Postal-Town Pic X(20)

05 County Pic X(20)

05 Postcode Pic X(9)

03 Telephone Pic X(20)

03 Credit-Limit Pic 9(6)V9(2)

03 Outstanding-Balance Pic S9(6)V9(2)

Record Name is ORDER-HEADER

01 ORDER-HEADER

03 Order-Number Pic 9(8)

03 Order-Date

05 Day Pic 9(2)

05 Month Pic 9(2)

05 Year Pic 9(2)

Record Name is PART

01 PART

03 Part-Number Pic 9(8)

03 Part-Description Pic X(30)

03 Quantity-In-Stock Pic 9(6)

03 Price Pic 9(6)V9(2)

Record Name is ORDER-LINE

01 ORDER-LINE

03 Quantity Pic 9(3)

Set Name is PLACES

Owner is CUSTOMER

Member is ORDER_HEADER

Insertion is AUTOMATIC

Retention is MANDATORY

Set Name is CONSISTS-OF

Owner is ORDER-HEADER

Member is ORDER-LINE

Insertion is AUTOMATIC

Retention is MANDATORY

Set Name is APPEARS-ON

Owner is PART

member is ORDER-LINE

Insertion is AUTOMATIC

Retention is MANDATORY.

There are two points to note so far as the record definitions are concerned;

Firstly that group items such as address and date are allowed in NDS (they aren’t in Relational Systems where each item has to be atomic)

Secondly that in this example there are no foreign keys. In network systems foreign keys are not required, since, as will be seen later, the network ‘set’ mechanism physically links each customer to all their orders

In the set definition part of the schema the ‘owner is’ and ‘member is’ clauses are self explanatory, however some explanation of the ‘insertion’ and ‘retention’ clauses is required.

Insertion; The insertion clause defines what happens to the member record of a set when it is created. Insertion can be specified as Automatic in which case the member is automatically connected to the current owner record of the set. The alternative is Manual in which case it is the programmers responsibility to connect the member to the appropriate owner or possibly leave the record free standing.

Retention; The retention clause defines what happens to the member records of a set after initial storage. retention can be specified as Fixed in which case the member record must always remain connected to the same owner record. One alternative is Mandatory in which case the member must remain connected to some owner record. The other alternative is Optional in which case the member can be disconnected from one set occurrence and connected to another or again left free standing.

A schema has to prepared (c/f compiled) before it can be used to ensure that the record and set definitions are valid.

Network Storage Schema

The next step is to create a storage schema which defines how the record types and set types will be physically implemented. An example Storage Schema follows.

Storage Schema Name is SSORDERS

Schema name is ORDERS

Area Name is CUSTORD-AREA

Block Size is 11K

Page Range is 1 to 10,000

Area Name is PART-AREA

Block Size is 2K

Page Range is 10,001 to 12,000

Record Name is CUSTOMER

Area Name is CUSTORD-AREA

Page Range is 1 to 10,000

Location Mode

CALC Using CUSTOMER-NUMBER Duplicates Not Allowed

Set Pointers

Set Name is PLACES

FIRST in Position 1

LAST in Position 2

Record Name is ORDER-HEADER

Area Name is CUSTORD-AREA

Page Range is 1 to 10,000

Location Mode

VIA Set PLACES

Set Pointers

Set Name is PLACES

NEXT in Position 1

PRIOR in Position 2

OWNER in Position 3

Set Name is CONSISTS-OF

FIRST in Position 4

LAST in Position 5

Record Name is PART

Area Name is PART-AREA

Page Range is 10,001 to 12,000

Location Mode

CALC using PART-NUMBER Duplicates Not Allowed

Set Pointers

Set Name is APPEARS-ON

FIRST in Position 1

LAST in Position 2

Record Name is ORDER-LINE

Area Name is CUSTORD-AREA

Page Range is 1 to 10,000

Location Mode

VIA Set CONSISTS-OF

Set Pointers

Set Name is CONSISTS-OF

NEXT in Position 1

PRIOR in Position 2

Set Name is APPEARS-ON

NEXT in Position 3

PRIOR in Position 4

OWNER in Position 5

Set Name is PLACES

Order is SORTED

Key is ASCENDING ORDER-DATE

Set Pointers

NEXT, PRIOR, OWNER

Set Name is CONSISTS-OF

Order is NEXT

Set Pointers

NEXT, PRIOR

Set Name is APPEARS-ON

Order is NEXT

Set Pointers

NEXT, PRIOR, OWNER

The Storage Schema has to be prepared to check that the area, record and set definitions are correct.

There are a number of areas of storage schema definitions which require explanation;

Areas
Page Ranges
Location Mode
Pointers
Set Order

What is an Area?

An area is a physical division of a database which is used to hold records and sets of various types. Areas map onto physical files and can thus be used to spread a large database over many disks. An area is broken down into a number of pages, with each page mapping onto one block of a physical file.

How do Page Ranges Work?

As described above an area is used to store records and sets. When defining a record type in a storage schema it is necessary to specify which area and what page range within the area the records will be stored in.

How do Location Modes Work?

Two location modes appear in the example storage schema;

CALC; This is the equivalent of a random access file in Cobol. In outline when a CALC record is to be stored an algorithm is used to calculate where the record should be stored. The algorithm takes as input the field defined as the CALC key (e.g. Customer-Number or Part-Number) and the page range specified for the record and outputs a page number within that range. CALC provides fast access providing that the key is known. If the CALC key is provided to the NDS it passes through the same algorithm, the result is the page number which holds the record, which can be used to retrieve the correct block from disk in a single I/O operation.

VIA; Using VIA it is possible to store member records on the same page as their owner records which again provides rapid access, since accessing a particular customer will in this case also retrieve all that customers Order-Headers and Order-Lines.

What are Pointers

Some further explanation of how records are stored on pages is required before describing pointers. When a record is stored on a page it has a line number allocated to it. A records page number and line number is physically stored within the record itself (although this is not accessible by programmers). The concatenated page and line number is called a ‘database key’.

Owner records in sets have first and last pointers, pointing to the first and last members of the set. Member records must have a next pointer which points to the next member in the set and may have a prior pointer which points to the previous member in the set and/or an owner pointer which points to the owner of the set. In effect a set with next pointers is a singly linked list, a set with Next and Prior pointers is a doubly linked list and a set with Next, prior and Owner pointers is a doubly linked list with owner pointers.

We have still not answered the question what is a pointer. A pointer is a database key which provides the location of the next, prior or owner record of the set. It can be seen that the first member record in a set will have identical values in its prior and owner pointers and that the last member in a set will have identical values in its owner and next pointers.

Set Order

The set order clause defines the order in which member records are inserted into sets. There are 5 possibilities;

Order is Next; In this case the member record is inserted after the current record, whether this is a member or the owner.
Order is Prior; In this case the member record is inserted prior to the current record.
Order is First; In this case the member is placed as the first member of the set regardless of which is the current record of the set.
Order is Last; In this case the member is placed as the last member of the set.
Order is Sorted; In this case the member is inserted into a sorted list according to the value of the data item defined in the storage schema.

Physical Database Creation

The next step is the creation and mapping of physical data files. On an ICL mainframe system running IDMS/X (Integrated Database Management System / Extra) this would be done as follows.

CRF(CUSTORDFILE, BLOCKSIZE=11K, NUMBEROFBLOCKS=10000, DISK=DISK1);

CRF(PARTFILE,BLOCKSIZE=2K,NUMBEROFBLOCKS=2,000, DISK=DISK2);

SVF(CUSTORDFILE);

SVF(PARTFILE);

IDMSXFILES(STORAGESCHEMA=SSORDERS,

AREAS=CUSTORD-AREA & PART-AREA,

FILES=CUSTORDFILE&PARTFILE);

Network Subschemas

The final stage before the database can be used is the creation of subschemas, these are programmers views of the database. Suppose that in this environment various programs require read only access to the customer record and that the order processing program needs write access to all records. Two subschema would be required.

Subschema Name is CUSTREAD-SUB

Storage Schema Name is SSORDERS

Areas

CUSTORD-AREA

Records

CUSTOMER READ-ONLY

Sets.

Subschema Name is ALLWRITE-SUB

Storage Schema name is SSORDERS

Areas

CUSTORD-AREA

PART-AREA

Records

CUSTOMER

ORDER-HEADER

ORDER-LINE

PART

Sets

PLACES

CONSISTS-OF

APPEARS-ON.

These subschemas have to be compiled before they can be used by programs (typically written in COBOL). In this case the CUSTREAD subschema would be held in a public object code library, whereas the ALLWRITE-SUB would be held in a private object code library

Example (skeleton) COBOL Program

Identification Division

Program-ID. Program1.

Subschema name is CUSTREAD-SUB

Environment Division

Data Division

COPYIDMS CUSTOMER

Procedure Division

Ready CUSTORD-AREA

Obtain First CUSTOMER

Perform Until DB-END-OF-AREA

Display CUSTOMER-NAME

Obtain Next CUSTOMER

End-Perform

Finish CUSTORD-AREA

Stop Run.

What does this program do?

Data Manipulation

In the example Cobol Program in the previous section we saw examples of 3 Data Manipulation Language (DML) statements, READY (this prepares a database area for use by a program c/f ‘OPEN’,OBTAIN, (this in effect is a read statement) and FINISH (this closes a database area c/f ‘CLOSE’).

There are many other DML statements that can be embedded in Cobol programs, such as;

STORE; inserts a record into the database.
MODIFY; updates an existing record.
CONNECT; inserts a record into a set occurrence.
DISCONNECT; takes a record out of a set.

Contents Page

Home Page

Smart Computing 123

Pages

Sunday, September 18, 2011

Introduction to Data Management and Database Technology

Introduction to Data Management and Database Technology

No comments:

Labels