Monday, July 25, 2011

Data Warehouse Basics

Data warehouse concepts


Definition: A data warehouse is a database that is specifically designed for analyzing the business, not for running the business.
OLTP databases store only 3 to 6 months of transactions in order to keep transaction processing performance high.
Data warehouse types:


Non-Volatile Data Warehouse:
Data entered into the data warehouse does not reflect changes that take place in the OLTP database; hence the data is static.

      Integrated Data Warehouse:
A data warehouse is a very large database that is an integration of multiple OLTP databases.

      Subject Oriented Data Warehouse:
A data warehouse is a subject-oriented database that supports business or department-specific decision makers.

Difference between OLTP & Data Warehouse:

OLTP                                                    Data Warehouse
Designed to support business transaction processing    Designed to support the decision-making process
Volatile data                                           Non-volatile data
Less history                                            More history
Normalized                                              De-normalized
Clerical access                                         Managerial access
Application-oriented data                               Subject-oriented data
Designed for running business operations                Designed for analyzing business operations
Detailed data                                           Summary data
More joins                                              Fewer joins

Join: A join is a relational concept that combines data from two or more tables. To join 'N' tables, 'N-1' join conditions are required.
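For example, a minimal SQL sketch joining three tables with two join conditions (N = 3, so N-1 = 2). The table and column names (customers, orders, products) are hypothetical.

-- Joining 3 tables requires 2 join conditions.
SELECT c.customer_name,
       p.product_name,
       o.order_date
FROM   customers c
JOIN   orders    o ON o.customer_id = c.customer_id   -- join 1
JOIN   products  p ON p.product_id  = o.product_id;   -- join 2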

Data Mart:

A data mart is a subset of an enterprise data warehouse.
A data mart is a subject-oriented database that supports the business needs of department-specific business users or decision makers.

A data mart is also known as a high performance query structure (HPQS).


Top down Data Warehouse approach:

According to some data warehouse professionals (Inmon), we first build an enterprise-specific database known as an enterprise data warehouse. From the enterprise data warehouse we then design subject-specific databases known as data marts.

A data mart is derived from the enterprise data warehouse.

<Fig>

Bottom-Up Data Warehouse Approach:

According to Kimball, we first build department-specific or subject-specific databases known as data marts, and then integrate these data marts into an enterprise data warehouse. This is known as the bottom-up data warehouse approach.

<Fig>

Types of Data Marts:

Dependent Data Mart: In the top-down approach, data mart development depends on an enterprise data warehouse.
Independent Data Mart: In the bottom-up approach, data mart development is independent of an enterprise data warehouse.

Difference between EDW and Data Mart:

EDW                                        Data Mart
Stores enterprise-specific information     Stores department-specific business information
Integration of multiple subjects           Defined for a single subject
Accessed by top management                 Accessed by department-specific decision makers



Active Data Warehouse: An active data warehouse is a database that provides data for business continuity up to the last minute or hour (real-time data).

Enterprise Data Warehouse: An enterprise data warehouse is a very large database that stores enterprise-specific business data for access by top management.

ETL Basics:


Types of ETL Tools:
Code based ETL Tool: An ETL application is developed using programming languages such as SQL and PL/SQL.
Ex: SAS Base, SAS Access, Teradata ETL utilities (BTEQ, FastLoad, MultiLoad, TPump)

GUI Based ETL Tool: An ETL application is developed through a graphical user interface using point-and-click techniques.
Ex: Informatica PowerCenter, DataStage, Data Junction, Ab Initio, Oracle Data Integrator (ODI), Data Services, Data Manager, SQL Server Integration Services.

Data Acquisition: It is the process of extracting relevant business information from multiple OLTP source systems, transforming the data into the required business format, and loading it into the target system.

Data acquisition consists of the following processes:

1.       Data Extraction
2.       Data Transformation
3.       Data Loading

Data Extraction: It is the process of reading the data from various types of source systems.
Ex: Mainframes, SAP, XML, text files (flat files), PeopleSoft, Siebel, JD Edwards, COBOL files, Excel, MS Access, etc.

Transformation: The process of converting and cleansing the data into the required business format.

The following processing activities take place:

Data Cleansing (Filtering): It is the process of removing unwanted information, which is also known as filtering.
It also corrects inconsistencies and inaccuracies in the data.
Ex: Eliminating duplicates, eliminating records which contain nulls.
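As a rough illustration, duplicates and records with nulls can be filtered with plain SQL. The staging table and column names below (stg_customers, customer_id, customer_name, city) are hypothetical.

-- Keep only distinct rows and drop records whose customer_id is null.
SELECT DISTINCT
       customer_id,
       customer_name,
       city
FROM   stg_customers
WHERE  customer_id IS NOT NULL;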

Note: ETL tools ship with built-in functions that are used to build the business or transformation logic to process the data.
These functions are categorized as follows:


1.       Numeric Functions
2.       String Functions
3.       Aggregate Functions
4.       Conversion Function
5.       Cleansing Functions
6.       Date Functions
7.       Scientific Functions
8.       Financial Functions





Data Scrubbing: It is the process of deriving new attributes or new definitions (derived data).
Data Aggregation: It is the process of calculating summaries for a group of records.
Use the following aggregate functions to calculate the aggregates: SUM, AVG, MAX, MIN, COUNT.
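A minimal sketch of data aggregation, assuming a hypothetical sales table with product_id and amount columns:

-- One summary row per product, calculated over its group of sales records.
SELECT product_id,
       SUM(amount) AS total_revenue,
       AVG(amount) AS avg_sale_amount,
       MAX(amount) AS max_sale_amount,
       MIN(amount) AS min_sale_amount,
       COUNT(*)    AS sale_count
FROM   sales
GROUP  BY product_id;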
Data Merging: It is the process of integrating data from multiple source systems.

Types of Merging:
Horizontal Merging: It is the process of combining records from multiple sources horizontally based on common column values. This is known as a join in SQL.
<Fig>
Vertical Merging: It is the process of combining records vertically from multiple sources that have the same data definition.
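A sketch of both merge types using hypothetical source tables: horizontal merging is a SQL join on a common key, while vertical merging is a UNION of sources that share the same data definition.

-- Horizontal merging: combine columns from two sources on a common column value.
SELECT c.customer_id,
       c.customer_name,
       a.city,
       a.country
FROM   src1_customers c
JOIN   src2_addresses a ON a.customer_id = c.customer_id;

-- Vertical merging: stack records from two sources with the same structure.
SELECT customer_id, customer_name, city FROM east_region_customers
UNION ALL
SELECT customer_id, customer_name, city FROM west_region_customers;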

Data Loading: It is the process of inserting data into the target database.
There are two types of data loading:
      1. Initial Load (Full Load)
      2. Incremental Load (Delta Load)
Initial Load: It is the process of inserting data into an empty target table. In the first load, all the required data from the source is inserted into the target table, hence it is also known as a full load.
Incremental Load: It is the process of inserting only the new records that appear after the initial load.
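A rough sketch of both load types, assuming hypothetical src_orders and tgt_orders tables and using order_date to detect new records:

-- Initial (full) load: the target table is empty, so all source rows are inserted.
INSERT INTO tgt_orders (order_id, customer_id, order_date, amount)
SELECT order_id, customer_id, order_date, amount
FROM   src_orders;

-- Incremental (delta) load: insert only records newer than the last loaded date.
INSERT INTO tgt_orders (order_id, customer_id, order_date, amount)
SELECT order_id, customer_id, order_date, amount
FROM   src_orders
WHERE  order_date > (SELECT MAX(order_date) FROM tgt_orders);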

Load Frequency: It defines the regular interval at which data gets loaded into the data warehouse. The load frequency is controlled using a scheduler. Ex: daily load, weekly load.


Data Warehouse Design:
Data Modeling: The process of designing the database is known as data modeling. A database architect or data modeler creates the database design using a GUI-based database design tool such as ERwin.
A data warehouse is designed with the following types of schemas:
      Star Schema
      Snowflake Schema
      Galaxy Schema (Hybrid, Constellation, Bus)
Star Schema: A star schema is a database design that contains a centrally located fact table surrounded by multiple dimension tables.
In a data warehouse, facts are numeric.
Every fact is numeric, but not every numeric value is a fact.
Facts are business measures.
Facts are used to evaluate or analyze the enterprise's business performance.
A dimension is descriptive data that is used to analyze the key performance measures known as facts.
The dimensions are organized in dimension tables.
Since the database design looks like a star, it is known as a 'star schema' design.
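A minimal star schema sketch with hypothetical table and column names: one fact table holding the numeric measures, with foreign keys to two dimension tables.

-- Dimension tables hold the descriptive data.
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category      VARCHAR(50)
);

CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name VARCHAR(100),
    city          VARCHAR(50)
);

-- The fact table holds the business measures and references the dimensions.
CREATE TABLE fact_sales (
    product_key   INTEGER REFERENCES dim_product (product_key),
    customer_key  INTEGER REFERENCES dim_customer (customer_key),
    quantity      INTEGER,
    revenue       DECIMAL(12,2)
);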


Load Order:
First load the dimension tables; if all dimension table loads succeed, then load the data into the fact table.


Snowflake Schema: A very large dimension table is split into one or more smaller dimension tables, which reduces quite a bit of table space and can improve query performance.
Disadvantage: As the number of tables increases, the number of joins increases, and as a result query performance may degrade.
Note: In a snowflake schema a dimension table may have a parent table, whereas in a star schema no dimension table has a parent.
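A sketch of the same hypothetical product dimension snowflaked: the category attributes move into their own parent table, and the product dimension references it.

-- The large product dimension is split; category becomes a parent dimension table.
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_key  INTEGER REFERENCES dim_category (category_key)  -- parent of dim_product
);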


Galaxy Schema:
Fact Constellation: The joining of two or more fact tables across multiple schemas.
Conformed Dimension: A dimension table that is shared by multiple fact tables is known as a conformed dimension.
Junk Dimension: A dimension made up of flags and Boolean indicators that are not key performance indicators is known as a junk dimension.
Ex: Gender_Flag, Product_Promotion_Flag
Dirty Dimension: In a dimension table, if a record exists more than once with a change in a non-key attribute, it is known as a dirty dimension (Type 2 dimension).
Slowly Changing Dimension: A dimension that changes over a period of time is known as a slowly changing dimension.


Dimension Types:
Type 1: Stores only the current data in the target. It does not store history.
Type 2: Stores the complete history in the target. For each update in the OLTP system it inserts a new record in the target. A surrogate key, a system-generated sequence number, is defined as the primary key (see the sketch after this list).
Type 3: Stores just the current and previous data in the target (partial history).
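A rough Type 2 sketch with hypothetical names, using an Oracle-style sequence for the surrogate key: the current version of the changed customer row is expired, then a new version is inserted.

-- Expire the current version of the changed record.
UPDATE dim_customer
SET    end_date   = CURRENT_DATE,
       is_current = 'N'
WHERE  customer_id = 101
AND    is_current  = 'Y';

-- Insert the new version with a fresh system-generated surrogate key.
INSERT INTO dim_customer
       (customer_key, customer_id, customer_name, city, start_date, end_date, is_current)
VALUES (dim_customer_seq.NEXTVAL, 101, 'John Smith', 'Chicago', CURRENT_DATE, NULL, 'Y');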


Types of Fact Tables:


Detail Fact Table: A fact table that contains the details of transactions is known as a detail fact table.
Summary Fact Table: A fact table that contains aggregated facts is known as a summary fact table.
Factless Fact Table: A fact table without any facts is known as a factless fact table.


Types of Facts:
Additive Facts: A fact that can be summarized across all the dimensions is known as an additive fact. Ex: quantity, revenue.
Semi-Additive Facts: A fact that can be summarized across a few dimensions, but not across all of them, is known as a semi-additive fact. Ex: current balance.
Non-Additive Facts: A fact that cannot be summarized across any of the dimensions is known as a non-additive fact. Ex: discount, percentages.
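A sketch of the difference using hypothetical fact tables: an additive fact such as quantity can be summed across any dimension, while a semi-additive fact such as current balance should not be summed across the date dimension.

-- Additive fact: quantity can be summed across every dimension.
SELECT product_key,
       SUM(quantity) AS total_quantity
FROM   fact_sales
GROUP  BY product_key;

-- Semi-additive fact: current_balance sums across accounts, but not across time;
-- take the balance on the latest date instead of summing all dates.
SELECT account_key,
       current_balance
FROM   fact_account_balance
WHERE  date_key = (SELECT MAX(date_key) FROM fact_account_balance);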
OLAP (Online Analytical Processing): OLAP is a set of specifications that allows decision makers or end users to query data from the database and present it for analysis in a template called a report.


ROLAP (Relational OLAP): An OLAP tool that queries data from a relational data source is known as ROLAP.


MOLAP (Multidimensional OLAP): An OLAP tool that queries data from a multi-dimensional database (cube) is known as MOLAP.


HOLAP (Hybrid OLAP): An OLAP tool that supports the combined properties of ROLAP and MOLAP is known as HOLAP.


DOLAP (Desktop OLAP): An OLAP tool that queries data from desktop data sources such as text files, XML files, Excel, etc.