What is a Data
Warehouse?
In
user-friendly terms, a data warehouse is collection of snapshots of data taken
from transaction processing systems at given intervals. Data warehouses are
data bases used solely for reporting.
In
today’s world, business requires an integrated, company-wide view of
high-quality information. A data warehouse centralizes data that are scattered
through disparate operational systems and makes them readily available for
decision-support applications. A properly designed data warehouse adds value to
data by improving their quality and consistency.
A
large number of organizations have found that data warehouse systems are
valuable tools in today’s competitive, fast evolving world. Many people feel
that with competition mounting in every industry, data warehousing is the
latest must- have marketing weapon – a way to keep customers by learning more
about their needs.
A data warehouse is a:
collection
of data in support of management's decision making process. (Bill Inmon)
A data
warehouse is a:
repository
of information gathered generally from multiple sources, stored under a unified
schema usually at a single site.(Korth)
But I
like the combine definition of the above two:
A data
warehouse is a subject oriented, integrated, time –variant and non-volatile
repository of information gathered generally from multiple sources, stored
under a unified schema usually at a single site.
§
Subject Orientation
The first feature of the data
warehouse is that it is oriented around the major subjects of the enterprise.
The data driven, subject orientation is in contrast to the more classical
process/functional orientation of applications, which most older operational
systems are organized around.
The operational world is designed around applications and
functions such as loans, savings, bankcard, and trust for a financial
institution. The data warehouse world is organized around major subjects such
as customer, vendor, product, and activity. The alignment around subject areas
affects the design and implementation of the data found in the data warehouse.
§
Integration
Easily the most important aspect of the data warehouse environment
is that data found within the data warehouse is integrated. The very essence of
the data warehouse environment is that data contained within the boundaries of
the warehouse are integrated. The integration shows up in many different ways -
in consistent naming conventions, in consistent measurement of variables, in
consistent encoding structures, in consistent physical attributes of data, and
so forth. Contrast the integration found within the data warehouse with the
lack of integration found in the applications environment and the differences
are stark. Over the years the different applications designers have made their
own, many decisions as to how an application should be built. The style and the
individualized design decisions of the application designer show up in a
hundred ways. In differences in encoding, key structures, physical
characteristics, naming conventions, and so forth.
§
Time variance
All data in the data warehouse
is accurate as of some moment in time. This basic characteristic of data in the
warehouse is very different from data found in the operational environment. In
the operational environment data is accurate as of the moment of access. In
other words, in the operational environment when you access a unit of data, you
expect that it will reflect accurate values as of the moment of access. Because
data in the data warehouse is accurate as of some moment in time data found in
the warehouse is said to be "time variant".
Once
data is correctly recorded in the data warehouse, it cannot be updated. Data
warehouse data is, for all practical purposes, a long series of snapshots. If
the snapshot of data has been taken incorrectly, then snapshots can be changed.
But assuming that snapshots are made properly, they are not altered once made.
In some cases it may be unethical or even illegal for the snapshots in the data
warehouse to be altered. Operational data, being accurate as of the moment of
access, can be updated as the need arises.
§
Non-Volatile
The fourth defining characteristic of the data warehouse is that
it is non-volatile. Inserts, deletes and changes - are done regularly to the
operational environment on a record by record basis. But the basic manipulation
of data that occurs in the data warehouse is much simpler. There are only two
kinds of operations that occur in the data warehouse - the initial loading of
data, and the access of data. There is no update of data (in the general sense
of update) in the data warehouse as a normal part of processing.
(References
: W. H. Inmon Building the Data warehouse (third edition).
John Wiley & Sons. 2002.)
I did a data
warehouse implementation project in my last year of engineering. For through
literature survey, documentation and implementation details click here.
My Data
Warehouse Team (2003)
From Left Standing: Vijay, Myself, Rahul
& Pradeep
Sitting:
Dr. B.B. Meshram (Supervisor) & External Examiner