What is a Data Warehouse?
In user-friendly terms, a data warehouse is collection of snapshots of data taken from transaction processing systems at given intervals. Data warehouses are data bases used solely for reporting.
In today’s world, business requires an integrated, company-wide view of high-quality information. A data warehouse centralizes data that are scattered through disparate operational systems and makes them readily available for decision-support applications. A properly designed data warehouse adds value to data by improving their quality and consistency.
A large number of organizations have found that data warehouse systems are valuable tools in today’s competitive, fast evolving world. Many people feel that with competition mounting in every industry, data warehousing is the latest must- have marketing weapon – a way to keep customers by learning more about their needs.
A data warehouse is a:
collection of data in support of management's decision making process. (Bill Inmon)
A data warehouse is a:
repository of information gathered generally from multiple sources, stored under a unified schema usually at a single site.(Korth)
But I like the combine definition of the above two:
A data warehouse is a subject oriented, integrated, time –variant and non-volatile repository of information gathered generally from multiple sources, stored under a unified schema usually at a single site.
§ Subject Orientation
The first feature of the data warehouse is that it is oriented around the major subjects of the enterprise. The data driven, subject orientation is in contrast to the more classical process/functional orientation of applications, which most older operational systems are organized around.
The operational world is designed around applications and functions such as loans, savings, bankcard, and trust for a financial institution. The data warehouse world is organized around major subjects such as customer, vendor, product, and activity. The alignment around subject areas affects the design and implementation of the data found in the data warehouse.
Easily the most important aspect of the data warehouse environment is that data found within the data warehouse is integrated. The very essence of the data warehouse environment is that data contained within the boundaries of the warehouse are integrated. The integration shows up in many different ways - in consistent naming conventions, in consistent measurement of variables, in consistent encoding structures, in consistent physical attributes of data, and so forth. Contrast the integration found within the data warehouse with the lack of integration found in the applications environment and the differences are stark. Over the years the different applications designers have made their own, many decisions as to how an application should be built. The style and the individualized design decisions of the application designer show up in a hundred ways. In differences in encoding, key structures, physical characteristics, naming conventions, and so forth.
§ Time variance
All data in the data warehouse is accurate as of some moment in time. This basic characteristic of data in the warehouse is very different from data found in the operational environment. In the operational environment data is accurate as of the moment of access. In other words, in the operational environment when you access a unit of data, you expect that it will reflect accurate values as of the moment of access. Because data in the data warehouse is accurate as of some moment in time data found in the warehouse is said to be "time variant".
Once data is correctly recorded in the data warehouse, it cannot be updated. Data warehouse data is, for all practical purposes, a long series of snapshots. If the snapshot of data has been taken incorrectly, then snapshots can be changed. But assuming that snapshots are made properly, they are not altered once made. In some cases it may be unethical or even illegal for the snapshots in the data warehouse to be altered. Operational data, being accurate as of the moment of access, can be updated as the need arises.
The fourth defining characteristic of the data warehouse is that it is non-volatile. Inserts, deletes and changes - are done regularly to the operational environment on a record by record basis. But the basic manipulation of data that occurs in the data warehouse is much simpler. There are only two kinds of operations that occur in the data warehouse - the initial loading of data, and the access of data. There is no update of data (in the general sense of update) in the data warehouse as a normal part of processing.
(References : W. H. Inmon Building the Data warehouse (third edition). John Wiley & Sons. 2002.)
I did a data warehouse implementation project in my last year of engineering. For through literature survey, documentation and implementation details click here.
My Data Warehouse Team (2003)
From Left Standing: Vijay, Myself, Rahul & Pradeep
Sitting: Dr. B.B. Meshram (Supervisor) & External Examiner