Skip to main content

DataWare Housing

                           Data Warehousing

A data warehouse consists of a computer database responsible for the collection and storage of information for a specific organization. This collection of information is then used to manage information efficiently and analyze the collected data. Although data warehouses vary in overall design, majority of them are subject oriented, meaning that the stored information is connected to objects or events that occur in reality. The data provided by the data warehouse for analysis provides information on a specific subject, rather than the functions of the company and is collected from varying sources into one unit having time-variant.
Data warehousing professionals build and maintain critical warehouse infrastructure to support business and assist business executives in making smart business decisions. Warehouse ETL (Extraction, Transformation and Loading of data) is an essential part of data warehousing where the data warehousing professional populate data warehouse with information from production databases. Data warehousing professionals work with business analysts and make changes to warehouse ETL in order to maintain consistent and accurate reporting on warehouse table structures.
ETL has a prominent place in data warehousing and business intelligence architecture.The extract, transformation and loading process includes a number of steps:
Create your own diagrams that show the planned ETL architecture and the flow of data from source to target.
Selecting the right ETL Tools is critical to the success the data warehousing and business intelligence project. Should your company acquire a top of the line specialized ETL tool suite, use lower cost Open Source ETL, or use "Tools at Hand"? The article ETL Tool Selection for the Data Warehouse describes these options along with their pros and cons.Consider these performance improvement methods:Turn off database logging to avoid the overhead of log insertionsLoad using a bulk load utility which does not log Primary keys should be single integers Drop relational integrity (RI) / foreign keys - restore after load is complete.
Drop indexes and re-build after load Partition data leaving data loaded earlier unchanged
Load changed data only - use "delta" processing Avoid SQL Update with logging overhead - possibly drop rows and reload using bulk loader Do a small number of updates with SQL Update, then use bulk load for inserts Use Cyclic Redundancy Checksum (CRC) to detect changes in data rather than brute force method of comparing each column Divide SQL Updates into groups to avoid a big rollback log being create. Use an ETL tool that supports parallelism
Use an ETL tool that supports caching Use RAID technologies
Use fast disk and controllers - 15,000 RPM Dedicate servers and disk to business intelligence - do not share with other applications Use multiple servers to support BI such as: a database server, an analysis server and a reporting server Use a server with large main memory (16 GB +) - this increases data caching and reduces physical data access Use a server with multiple processors / cores to enable greater parallelism Data that Describes Your Data Warehouse.
Metadata is one of the important keys to the success of the data warehousing and business intelligence effort. Metadata management answers these questions:
What is Metadata?
How can Metadata be Managed?
Extracting Metadata from Legacy Systems
What is Metadata?
 • Metadata is your control panel to the data warehouse. It is data that describes the data warehousing and business intelligence system:
• Reports
• Cubes
• Tables (Records, Segments, Entities, etc.)
• Columns (Fields, Attributes, Data Elements, etc.)
• Keys
• Indexes
• Metadata is often used to control the handling of data and describes:
• Rules
• Transformations
• Aggregations
• Mappings
The power of metadata is that enables data warehousing personnel to develop and control the system without writing code in languages such as: Java, C# or Visual Basic. This saves time and money both in the initial set up and on going management.

Data Warehouse Metadata
Data warehousing has specific metadata requirements.

Metadata that describes tables typically includes:
·         Physical Name
·         Logical Name
·         Type: Fact, Dimension, Bridge
·         Role: Legacy, OLTP, Stage,
·         DBMS: DB2, Informix, MS SQL Server, Oracle, Sybase
·         Location
·         Definition
·         Notes

Metadata describes columns within tables:
·         Physical Name
·         Logical Name
·         Order in Table
·         Datatype
·         Length
·         Decimal Positions
·         Nullable/Required
·         Default Value
·         Edit Rules
·         Definition
·         Notes

How can Data Warehousing Metadata be Managed?
Data warehousing and business intelligence metadata is best managed through a combination of people, process and tools.
The people side requires that people be trained in the importance and use of metadata. They need to understand how and when to use tools as well as the benefits to be gained through metadata.
The process side incorporates metadata management into the data warehousing and business intelligence life cycle. As the life cycle progresses metadata is entered into the appropriate tool and stored in a metadata repository for further use.
Metadata can be managed through individual tools:
·         Metadata manager / repository
·         Metadata extract tools
·         Data modeling
·         ETL
·         BI Reporting
·         Metadata Manager / Repository
·         Metadata can be managed through a shared repository that combines information from multiple sources.
The metadata manager can be purchased as a software package or built as "home grown" system. Many organizations start with a spreadsheet containing data definitions and then grow to a more sophisticated approach.

Extracting Metadata from Input Sources
Metadata can be obtained through a manual process of keying in metadata or through automated processes. Scanners can extract metadata from text such as SQL DDL or COBOL programs. Other tools can directly access metadata through SQL catalogs and other metadata sources.
Picking the appropriate metadata extract tools is a key part of metadata management.
Many data modeling tools include a metadata extract capability - otherwise known as "reverse engineering". Through this tool, database information about tables and columns can be extracted. The information can then be exported from the data modeling tool to the metadata manager.

Understand OLAP and how it supports Business Intelligence
If you want to analyze data, then On-Line Analytical Processing (OLAP) is often the best way to organize the data. OLAP organization provides several benefits:
Fast Data is organized for rapid query and analysis.
The database structure uses efficiency multidimensional or tuned relational approaches.
Visual Tools enable the analyst to navigate and view results through graphics such as: bar charts, pie charts and tree structures.
Multidimensional Supports "slicing and dicing" along multiple dimensions such as product, customer and location.
In addition, supports "pivot" / "cross tabs" where the investigator changes the direction of the analysis.
Aggregation Supports both drill down to details as well as roll up.
Some aggregations may be pre-calculated to save analysis time. This pre-calculation is where MOLAP provides value. See below.
Time Series Supports trend analysis. Most data marts include a calendar dimension. This dimension supports time hierarchies: year, quarter, month, week, day of week, and day.
Ranking Find the top, bottom or quartile members of a group, such as the top 10 most profitable products or the 10 lease profitable sales territories.
Clusters and Outliers Gain an understanding of groupings of items with common characteristics (clusters) as well as item with unusual characteristics (outliers).
Complex Criteria Gain an understanding of root causes and patterns by using complex criteria to look at meaningful slices of data.

OLAP Cubes Support Business Intelligence
OLAP helps to visualize data as cube structures.
A cube is a multidimensional structure consisting of dimensions and measurements. Cells are the points where dimensions intersect and contain the measurements.
Dimensions provide the context for analysis which are used for labels on reports and selection criteria for queries. Dimensions answer questions like:
·         Who (customers, employees, partners, ...)
·         When (year, quarter, month, ...)
·         What (products, contracts, ...)
·         Where (state, zip code, territory, ...)
·         How (method, process, formula, ...)
·         Cells supply quantitative information. Cells answers questions like:
·         How many (customer count, inventory count, ...)
·         How much (revenue amount, budget amount, ...)
An atomic measurement is one that is stored at the lowest level, such as an individual sale or a single receipt of goods. The benefit of atomic data is that it supports detailed analysis and atomic data can be summed as needed. The drawbacks of atomic data are that it takes more space to store and it requires time to aggregate into totals for analysis.
Aggregated data is a summation of atomic data. For example, sales by quarter and rejects by month are aggregations. The benefit is that query and analysis time are reduced. The drawbacks are that analysis detail can be lost and it is difficult to predict which aggregations the analyst will want to use.

ROLAP Uses SQL for Business Intelligence
Relational OLAP (ROLAP) uses commonly available relational databases to support multidimensional analysis. It has the advantages of:
·         Use of widely available technology
·         Availability of skilled personnel
·         Unfortunately, ROLAP has some disadvantages. These include:
·         Lack of verbs to support cube operations
·         Inefficient handling of empty cells ("sparse" data)
·         Expense in pre-computing aggregates
Relational database vendors address some of these problems through the "Materialized View" which includes predefined joins and aggregations. This works well for cases where the joins and aggregations can be anticipated but falls down in cases where the need for aggregation can not be predicted. MOLAP was created to address those problems.

MOLAP Business Intelligence Benefits
Multidimensional OLAP (MOLAP) addresses problems of multiple aggregations, sparse data and effective cube handling.
It provides these benefits:
·         High scalability
·         Controlled aggregations
·         Data compression
·         Distributed calculations
·         Cubes that can be partitioned and distributed
·         The Multi-Dimensional eXpressions (MDX) language created by Microsoft has become a defacto OLAP standard.
An industry group, XMLA.org, promotes this approach. Examples of systems that implement MDX include:
Microsoft SQL Server Analysis Services (SSAS) Pentaho Business Intelligence Platform
Java OLAP (JOLAP) is an API that relies on the Common Warehouse Metamodel (CWM) standard. Oracle has implemented this standard at least in part.

HOLAP is Hybrid of ROLAP and MOLAP
The Hybrid OLAP is a combination of ROLAP and MOLAP. It seeks to create the "the best of both worlds".

Comments

Popular posts from this blog

Creating Pdf in multiple Languages in Java

Creating Pdf in multiple Languages in Java This code demonstrate how to create pdf in multiple languages in java by using ITEXT pdf creator. To use this sample code download itext.jar To display different locale on pdf you need to download font known as Arialuni.ttf and added to the project Most important thing is the encoding part set encoding to "Identity-H" (String encoding = "Identity-H";) as well as create font for the same to take effect in pdf create font (Font fontNormal = FontFactory.getFont(("c:/windows/fonts/arialuni.ttf"), encoding,BaseFont.EMBEDDED, 8, Font.NORMAL);) You can download the source code from here import java.io.FileOutputStream; import com.lowagie.text.Chunk; import com.lowagie.text.Document; import com.lowagie.text.Font; import com.lowagie.text.FontFactory; import com.lowagie.text.HeaderFooter; import com.lowagie.text.Paragraph; import com.lowagie.text.Phrase; import com.lowagie.text.pdf....

Basic Java Interview Questions

1. Why java does not support multiple inheritance ? Answer :- To say why java doesn't support inheritance directly like c++ should be the diamond problem faced by c++,(Diamond problem is an ambiguity that arises when two classes B and C inherit from A, and class D inherits from both B and C. If a method in D calls a method defined in A (and does not override it), and B and C have overridden this method differently, then via which class does it inherit: B, or C?) 2. JVM crash is an Error or Exception.? Answer:- JVM crash is an Error as a normal developer as we cannot handle it, but its exception for JVM developer as they can fix it. 3. Print “Hello World!” Without Main Method in Java? Answer:- public class Testing {    static    {        System.out.println("Hello World");    }    public static void main(String[] args)    {    } } ...

Collections in Java

Collections in Java Introduction to Collections Framework The Collections Framework provides a well defined set of interfaces and classes for storing and manipulating the groups of data into a single unit. Introduction to Collections Framework Collections Framework: The Collections Framework provides a well-designed set of interfaces and classes for storing and manipulating the groups of data into a single unit. The collections framework is a unified architecture which is used to represent and manipulate collections. The framework allows the collections to get manipulated independently, additionally it reduces the programming efforts and increases performance. It includes implementation of interfaces and algorithms. Basically it is a unified architecture that consists the following collections: Interfaces: These are the abstract data types that represent collections. With the help of interfaces we manipulate collections independently. A hierarchy is generally formed ...