DW Basics Bala

Bala Peddi TERADATA Master November 15, 2008 Introduction to Data Warehousing

Transcript of DW Basics Bala

Page 1: DW Basics Bala

Bala Peddi

TERADATA Master

November 15, 2008

Introduction to Data Warehousing

Page 2: DW Basics Bala

Bala Peddi, Principal DW Consultant

• Bala Peddi graduated with a BE in Computer Science from Karnataka University in 1993.

• 15+ years of industry experience in data warehousing and computer programming.

• Joined Satyam Computers in 1995 as a Unix and C programmer.

• Moved to the USA in 1996 and worked at various Fortune 500 organizations, such as Fidelity Investments in Boston, AT&T, NCR in San Francisco, and First Union Bank, Wachovia Bank, and Wells Fargo Bank in Charlotte, NC.

• Worked extensively with technologies such as data warehousing, Informatica, Ab Initio ETL, IBM DataStage ETL, Teradata RDBMS, Informix, UNIX, parallel processing, and multi-terabyte environments.

• Worked as a DBA for large production systems.

• Founded Simply Track, a stock market tracking product (www.vbssol.com).

• Provided senior-level consulting support for a number of high-profile data warehousing projects.

Page 3: DW Basics Bala

Bala's Data Warehousing Projects

• 1996 to 1997: Implemented the data warehouse for Fidelity Investments. Converted the warehouse from Informix to Teradata.

• 1998 to 2000: Implemented the Corporate Data Warehouse (CDW) for First Union Bank, Charlotte, NC.

• 2000 to 2002: Implemented an Operational Data Store (ODS) for Wachovia Bank, Charlotte, NC.

• 2003: Implemented an Anti-Money Laundering (AML) warehouse to monitor terrorist activities.

• 2004 to 2006: Worked at Wachovia Bank to convert a large data warehouse from Informix to Teradata. Implemented the Enterprise Data Warehouse.

• 2007 to 2008: Implemented a Corporate Risk Data Warehouse to support Basel II regulatory requirements.

• 2008 to 2010: Implemented a Profitability Data Warehouse for Customer Level Profitability Reporting (CLPR).

• All of the above data warehousing projects used technologies such as Teradata, Oracle, Informatica 8.x, IBM DataStage 8.x, Ab Initio ETL, and Unix.

Page 4: DW Basics Bala

Testimonials

Wow Bala, what a shock. I will miss you. I always enjoyed working with you, and I have and always will have tremendous respect for your knowledge and your wonderful attitude. I hope all goes as well as it can for you and your family, and I wish you the very best of everything.

Thanks,
Ken Weicholz
Systems Analyst/Solutions Delivery
Enterprise Information Services (EIS)

Page 5: DW Basics Bala

What is Raw Data?

Raw data has no use until it becomes information.

Page 6: DW Basics Bala

Information

Metadata

Record format: field names, column names, etc.

Adding data types such as decimal, char, and integer makes the data more useful.

Page 7: DW Basics Bala

Is information in files, folders, etc.?

What is data? What is information? Why can't we use files for everything?

How do you find Venkat's salary in hundreds of Excel files? You would need to open them one at a time and search for "venkat".

Page 8: DW Basics Bala

Organize information into tables, columns, and relations in an RDBMS.

NAME      City       Salary  Date of join
srinivas  Hyderabad  30,000  1/1/2009
raj       Hyderabad  30,001  1/2/2009
bala      Hyderabad  30,002  1/3/2009
santhosh  Hyderabad  30,003  1/4/2009
veera     Hyderabad  30,004  1/5/2009
ravi      Hyderabad  30,005  1/6/2009
subba     Hyderabad  30,006  1/7/2009
venkat    Hyderabad  30,007  1/8/2009
ramesh    Hyderabad  30,008  1/9/2009
kishore   Hyderabad  30,009  1/10/2009
kumar     Hyderabad  30,010  1/11/2009

SELECT salary FROM emp WHERE name = 'venkat';

Diagram labels: Table, Columns, Rows, RDBMS

Some examples of RDBMS software:
1. Oracle
2. SQL Server
3. DB2
4. MySQL, etc.
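As a minimal sketch of the lookup on this slide, the following uses Python's built-in sqlite3 module as a stand-in RDBMS. That choice is an assumption made only so the example runs anywhere; the slide does not tie the query to a product. The table name emp and the salary and name columns come from the slide's query; the other column names and the helper find_salary are illustrative.

```python
import sqlite3

# In-memory SQLite database used as a stand-in RDBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp ("
    "name TEXT PRIMARY KEY, city TEXT, salary INTEGER, date_of_join TEXT)"
)

# A few of the rows from the slide's table.
rows = [
    ("srinivas", "Hyderabad", 30000, "1/1/2009"),
    ("raj", "Hyderabad", 30001, "1/2/2009"),
    ("venkat", "Hyderabad", 30007, "1/8/2009"),
]
conn.executemany("INSERT INTO emp VALUES (?, ?, ?, ?)", rows)


def find_salary(name):
    """Return the salary for one employee, or None if the name is not found."""
    row = conn.execute(
        "SELECT salary FROM emp WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


print(find_salary("venkat"))  # 30007
```

One query replaces opening hundreds of files by hand, which is the point the slide makes about files versus tables.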

Page 9: DW Basics Bala

RDBMS

• In an RDBMS, tables are related to each other; these links are called relationships.

• Every table normally has a primary key that uniquely identifies each row.

Page 10: DW Basics Bala

SQL – Structured Query Language

• DDL – Data Definition Language

> CREATE, ALTER, DROP

• DML – Data Manipulation Language

> INSERT, UPDATE, DELETE

• DQL – Data Query Language

> SELECT <columns> FROM <tab> WHERE <condition>

> SELECT <columns> FROM <tab> GROUP BY <columns>

> SELECT <columns> FROM <tab> GROUP BY <columns> HAVING <condition>

> SELECT <columns> FROM <tab> ORDER BY <columns>

> SELECT * FROM emp WHERE depno = 20 AND job = 'manager';

> JOINS, UNION, MINUS, etc.
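The three statement families above can be tried end to end in a small sketch. It assumes Python's built-in sqlite3 module as the database and uses made-up rows; the emp table and the depno and job columns are taken from the slide's example query, while the salary column and all data values are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the table structure.
conn.execute(
    "CREATE TABLE emp (name TEXT, depno INTEGER, job TEXT, salary INTEGER)"
)

# DML: insert, update, and delete rows (sample data is made up).
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        ("raj", 10, "clerk", 20000),
        ("bala", 20, "manager", 50000),
        ("ravi", 20, "clerk", 25000),
        ("kumar", 20, "manager", 55000),
        ("temp", 30, "clerk", 10000),
    ],
)
conn.execute("UPDATE emp SET salary = 21000 WHERE name = 'raj'")
conn.execute("DELETE FROM emp WHERE name = 'temp'")

# DQL: filter rows with WHERE and sort them with ORDER BY.
managers = conn.execute(
    "SELECT name FROM emp WHERE depno = 20 AND job = 'manager' ORDER BY name"
).fetchall()

# DQL: summarize with GROUP BY, then filter the groups with HAVING.
big_departments = conn.execute(
    "SELECT depno, COUNT(*), SUM(salary) FROM emp "
    "GROUP BY depno HAVING COUNT(*) > 1 ORDER BY depno"
).fetchall()

print(managers)         # [('bala',), ('kumar',)]
print(big_departments)  # [(20, 3, 130000)]
```

WHERE filters individual rows before grouping, while HAVING filters the groups that GROUP BY produces, which is why department 10 (one employee) drops out of the second result.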

Page 11: DW Basics Bala

What is a Transaction?

• A unit of work in an RDBMS is called a transaction.

• During a transaction, the database inserts, updates, or deletes rows in tables.

• If anything fails, the database is put back the way it was (all or nothing).

• Examples of transactions:

> Withdraw money from an ATM – it's a transaction.

> Buy a book in a bookstore – it's a transaction.

> Buy a train ticket – it's a transaction.

> Close an account in a bank – it's a transaction.

> Open an account – it's a transaction.

> Buy stocks – it's a transaction, etc.
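The "all or nothing" behavior can be seen in a short sketch. It assumes Python's built-in sqlite3 module; the account table, the CHECK rule that forbids negative balances, and the transfer helper are all hypothetical names made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move money between two accounts as one transaction.

    Returns True if both updates were committed, False if they were undone.
    """
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM account"))


print(transfer(1, 2, 30), balances())   # True {1: 70, 2: 80}
print(transfer(1, 2, 500), balances())  # False {1: 70, 2: 80}
```

In the second call the credit to account 2 succeeds first, but the debit would make account 1 negative, so the whole transaction is rolled back and neither balance changes.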

Page 12: DW Basics Bala

What is OLTP?

• Online Transaction Processing (OLTP) system

• Most RDBMS systems are OLTP applications.

• The database contains day-to-day transactions.

• Mostly inserts, with few updates and deletes

• Optimized for a specific application or business

• Historical data is archived for performance reasons.

Page 13: DW Basics Bala

Example OLTP applications

• Walk into a Reliance store: you see OLTP

• Walk up to an ATM: you see an OLTP server

• Purchase a train ticket: OLTP

• Buy an LIC policy

• Purchase an air ticket

• Buy a TV in an electronics shop: OLTP

• Buy a book on Amazon.com: OLTP

• Buy stocks through a broker such as Karvy or E*TRADE: OLTP

Page 14: DW Basics Bala

Problems with OLTP

• Not designed for reporting

• Not designed for analysis

• Data must be deleted, archived, or backed up

• An OLTP system must be fast and cannot go down.

Page 15: DW Basics Bala

What is a Warehouse?

The picture below shows a warehouse for a business that makes shoes.

Page 16: DW Basics Bala

The Idea behind Data Warehousing

"Study the past if you would define the future."
Confucius, Chinese philosopher and reformer (551 BC - 479 BC)

Page 17: DW Basics Bala

What is a Data Warehouse?

• It is just an RDBMS, like an OLTP system.

• It stores historical information from various source systems so that the past can be analyzed and studied.

• Also called a DSS (Decision Support System)

• The database is optimized for SELECT statements and joins.

• Large volumes, measured in terabytes

[Diagram: OLTP SYS1, OLTP SYS2, and OLTP SYS3 feed the Data Warehouse.]

Page 18: DW Basics Bala

A simple OLTP transaction table

Trans_id  Time                Product  Quantity  Price  Total Amount
100       22/08/2010 8:00 AM  Soap     5         10     50
101       22/08/2010 8:10 AM  Soap     3         10     30
102       22/08/2010 8:20 AM  Soap     4         10     40
103       22/08/2010 8:30 AM  Soap     2         10     20

Page 19: DW Basics Bala

A simple DW table, aggregated view (August 22nd 2010)

OLTP

Trans_id  Time                Product  Quantity  Price  Total Amount
100       22/08/2010 8:00 AM  Soap     5         10     50
101       22/08/2010 8:10 AM  Soap     3         10     30
102       22/08/2010 8:20 AM  Soap     4         10     40
103       22/08/2010 8:30 AM  Soap     2         10     20

DW

Date        Product  Quantity  Price  Total Amount
22/08/2010  Soap     14        10     140
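The step from the four OLTP rows to the single DW row is a GROUP BY with sums. The sketch below reproduces it with Python's built-in sqlite3 module; the table names oltp_sales and dw_daily_sales, the column names, and the load_day helper are assumptions made for this illustration, and the daily price is taken as the average of the transaction prices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE oltp_sales (trans_id INTEGER, sale_date TEXT, sale_time TEXT, "
    "product TEXT, quantity INTEGER, price INTEGER, total_amount INTEGER)"
)
conn.execute(
    "CREATE TABLE dw_daily_sales (sale_date TEXT, product TEXT, "
    "quantity INTEGER, price REAL, total_amount INTEGER)"
)

# The four transactions shown on the slide for 22/08/2010.
conn.executemany(
    "INSERT INTO oltp_sales VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (100, "22/08/2010", "8:00 AM", "Soap", 5, 10, 50),
        (101, "22/08/2010", "8:10 AM", "Soap", 3, 10, 30),
        (102, "22/08/2010", "8:20 AM", "Soap", 4, 10, 40),
        (103, "22/08/2010", "8:30 AM", "Soap", 2, 10, 20),
    ],
)


def load_day(sale_date):
    """Summarize one day's OLTP transactions into one DW row per product."""
    conn.execute(
        "INSERT INTO dw_daily_sales "
        "SELECT sale_date, product, SUM(quantity), AVG(price), SUM(total_amount) "
        "FROM oltp_sales WHERE sale_date = ? GROUP BY sale_date, product",
        (sale_date,),
    )


load_day("22/08/2010")
dw_rows = conn.execute("SELECT * FROM dw_daily_sales").fetchall()
print(dw_rows)  # [('22/08/2010', 'Soap', 14, 10.0, 140)]
```

Running load_day once per day appends one summary row per product, so the DW table grows into a daily history while the OLTP table can be archived.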

Page 20: DW Basics Bala

A simple DW table, aggregated view (August 23rd 2010)

OLTP

Trans_id  Time                Product  Quantity  Price  Total Amount
100       23/08/2010 8:00 AM  Soap     10        10     100
101       23/08/2010 8:10 AM  Soap     3         10     30
102       23/08/2010 8:20 AM  Soap     4         10     40
103       23/08/2010 8:30 AM  Soap     2         10     20

DW

Date        Product  Quantity  Price  Total Amount
22/08/2010  Soap     14        10     140
23/08/2010  Soap     19        10     190

Page 21: DW Basics Bala

History is more important than aggregation

The following table has daily history.

Date        Product  Quantity  Avg Price  Total Amount
22/08/2009  Soap     20        8          160
23/08/2009  Soap     10        8          80
24/08/2009  Soap     15        8          120
...         (and so on, until 24/08/2010)
25/08/2010  Soap     14        10         140
26/08/2010  Soap     14        10         140
27/08/2010  Soap     14        10         140

This table has two important properties:
1. History
2. Aggregation (summary) by day

Page 22: DW Basics Bala

Example of why we need history

• To make decisions we need lots of data from the past.

• You make better decisions when you have accurate history for the last 5+ years.

• The next slides use a simple example to show why we need to know history to make decisions.

Page 23: DW Basics Bala

Marks List (1st Quarter) – Real-world DW example

Subject  Min  Max  Score  Result
Math     35   100  90     Very Good
Science  35   100  30     Fail
Social   35   100  87     Very Good
English  35   100  65     Good

Page 24: DW Basics Bala

Marks List (Half Yearly) – Real-world DW example

Subject  Min  Max  Score  Result
Math     35   100  95     Very Good
Science  35   100  27     Fail
Social   35   100  84     Very Good
English  35   100  72     Good

Page 25: DW Basics Bala

Marks List (Three Dimensional) – Real-world DW example

Half Yearly

Subject  Min  Max  Score  Result
Math     35   100  95     Very Good
Science  35   100  27     Fail
Social   35   100  84     Very Good
English  35   100  72     Good

1st Quarter

Subject  Min  Max  Score  Result
Math     35   100  90     Very Good
Science  35   100  30     Fail
Social   35   100  87     Very Good
English  35   100  65     Good

Mom has history, so now she can make decisions (a Decision Support System):
1. Change teacher
2. Change school
3. Tuition
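The comparison Mom makes across the two marks lists can be written down as a tiny sketch in plain Python. The scores and the pass mark of 35 are copied from the slides; the function names and the "failed in every term" rule are illustrative assumptions, not part of the deck.

```python
# Scores by term, copied from the two marks lists on the slides.
history = {
    "1st Quarter": {"Math": 90, "Science": 30, "Social": 87, "English": 65},
    "Half Yearly": {"Math": 95, "Science": 27, "Social": 84, "English": 72},
}
PASS_MARK = 35  # the "Min" column on the slides


def trend(subject):
    """Change in score from the 1st quarter to the half-yearly exam."""
    return history["Half Yearly"][subject] - history["1st Quarter"][subject]


def needs_attention():
    """Subjects failed in every term on record (an illustrative rule)."""
    subjects = history["1st Quarter"].keys()
    return [
        s for s in subjects
        if all(term[s] < PASS_MARK for term in history.values())
    ]


print({s: trend(s) for s in history["1st Quarter"]})
# {'Math': 5, 'Science': -3, 'Social': -3, 'English': 7}
print(needs_attention())  # ['Science']
```

A single marks list only says Science was failed once; the history shows it was failed twice and is getting worse, which is what justifies a decision.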

Page 26: DW Basics Bala

In Summary: Data Warehouse Definition

• A data warehouse stores historical information in an RDBMS for analysis. The historical information is copied from operational systems, also called OLTP systems.

• In most cases the data is aggregated while it is copied from the OLTP systems.

• It is also called a DSS (Decision Support System).

• In short:

> History of your business

> Summary of your business

Page 27: DW Basics Bala

Who needs a Data Warehouse?

• Senior management, such as CEOs, to look at overall business trends.

• Middle managers, to look at regional business.

• Lower-level managers, to look at their own store or branch.

• Marketing teams, for cross-selling.

• Businesses that want to make more money and stay competitive need a DW.

• To retain customers, you need a DW.

• To track campaigns or advertisements, you need a DW.

• To find suspicious customer behavior in the financial industry, you need a data warehouse (AML).

• To meet government regulatory requirements such as KYC and Basel II, you need a DW.

Page 28: DW Basics Bala

It is important to use the information

[Diagram: John and Mary both use the Data Warehouse.]

Both are district managers. Both ran the following report:

"Show me, for all my stores, a breakdown of second-quarter sales compared to first-quarter sales, each store's second-quarter sales from a year earlier, and the sales of all competitors within two square miles of each store's location."

Mary calls the store managers whose sales are down or flat and asks them to run a promotion. Without the DW she could not make this decision. Some store managers complained about inventory issues, so she took care of those as well.

Page 29: DW Basics Bala

Why DW? Why can't we use OLTP for everything?

• OLTP systems cannot easily be integrated with each other, and a company's customer information is sometimes spread across many OLTP systems.

• Reporting from a DW does not affect the performance of the online systems.

• OLTP systems are not designed for analytical queries.

• Faster queries need a new database and a new way of designing tables.

• Reporting tools work best with DW models.

• Industry standards

• Data warehousing needs 2 to 10 years of history, which is not practical to keep in an OLTP system.

Page 30: DW Basics Bala

Data Warehouse: book definition

• A data warehouse is a relational database used for query, analysis, and reporting. By definition, a data warehouse is integrated, non-volatile, time-variant, and subject-oriented.

Integrated: Data collected from multiple sources is integrated into a single, user-readable format.

Non-volatile: Historical data is maintained.

Time-variant: Data can be viewed by week, month, or year.

Subject-oriented: The data warehouse is organized around particular subjects.

Page 31: DW Basics Bala

Integrated

Page 32: DW Basics Bala

Non volatile

Page 33: DW Basics Bala

Time variant

Page 34: DW Basics Bala

Subject oriented

Page 35: DW Basics Bala

Data Marts

• Data marts are considered subsets of the data warehouse. Each data mart is designed for a particular department and is optimized for the analysis needs of that department.

• Two types

> Dependent Data Mart

> Independent Data Mart

[Diagram, Dependent Data Mart. Labels: Data Warehouse; Marketing Mart, Sales Mart, Accounting Mart.]

[Diagram, Independent Data Mart. Labels: Source; Marketing Mart, Sales Mart, Accounting Mart; Data Warehouse.]

Page 36: DW Basics Bala

Top-down and bottom-up approaches

Practitioner
> Top-Down: Bill Inmon
> Bottom-Up: Ralph Kimball

Emphasis
> Top-Down: Data warehouse
> Bottom-Up: Data marts

Design
> Top-Down: Enterprise-based normalized model; marts use a subject-oriented dimensional model
> Bottom-Up: Dimensional model of the data mart, consisting of star schemas

Architecture
> Top-Down: Multi-tier, comprising a staging area and dependent data marts
> Bottom-Up: Staging area and data marts

Data set
> Top-Down: The DW holds atomic-level data; the marts hold summary data
> Bottom-Up: Contains both atomic and summary data

Page 37: DW Basics Bala

Operational Data Store (ODS)

• An ODS is usually designed to contain low-level data (such as transactions and prices) with limited history, captured in "real time" or "near real time". A data warehouse, by contrast, stores much greater volumes of data and is generally loaded less frequently. ODS systems are mainly used for the following applications:

> Call centers

> Product support

> On-demand marketing

• An ODS sees more inserts and deletes than a DW.

Page 38: DW Basics Bala

OLTP vs. Data Warehousing

OLTP           Attribute             DATA WAREHOUSING
Transactional  Business Need         Analytical
Simple         Query                 Complex
Point-in-Time  Timeframe             Historical
Known          Business Question     Unknown
Static         Business Environment  Dynamic

Page 39: DW Basics Bala

Real-world banking DW example: what is an EDW?

[Diagram: four OLTP systems (Credit Cards, Deposits & Withdrawals, Loans, Investments) feed the Enterprise Data Warehouse on Teradata through Informatica/DataStage ETL.]

Page 40: DW Basics Bala

Real-world uses of DW in banking

• Cross-selling: if a customer opens a savings account, send an offer for a credit card, and vice versa.

• If a customer applies for a car loan, call to see whether he or she would open a current or savings account.

• Campaign management: run a TV ad in New York City in Jan 2010, then run a DW report to see whether sales in NY increased. If yes, run the ad across the country; if no, drop the ad.

• Customer retention: act immediately if a customer leaves.

• Profitability: calculate profit at the customer level. Treat profitable customers with benefits.

• Financial forecast: we made $2 billion in profit in the last 3 months; how much can we expect if the trend continues?

• Keeping track of the bank's overall capital requirements, etc.

Page 41: DW Basics Bala

Data warehousing usage in Anti-Money Laundering (AML)

• After the September 11th attacks, the American government strengthened its Anti-Money Laundering (AML) laws.

• An AML data warehouse tracks customer behavior over a period of time.

• An AML data warehouse tracks the following activities:

> If a customer receives deposits in large amounts

> If a customer receives deposits from different countries (rogue countries)

> If a customer receives too many deposits from various people in a short time

> If a customer withdraws a lot of cash
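Rules like these only work when the warehouse keeps each customer's history. The sketch below shows three of them (large deposits, deposits from watched countries, many depositors in a short time) in plain Python; the cash-withdrawal rule is left out for brevity. Every threshold, the placeholder country code "XX", the sample transactions, and the flag_customers function are hypothetical and chosen only for illustration; real AML rules are defined by regulators and by each bank.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical thresholds; real values come from regulators and bank policy.
LARGE_DEPOSIT = 10_000
MAX_DEPOSITORS = 3
WINDOW = timedelta(days=7)
WATCHED_COUNTRIES = {"XX"}  # placeholder country code

# (customer, date, kind, amount, counterparty, country): made-up history.
history = [
    ("c1", date(2010, 1, 1), "deposit", 500, "p1", "US"),
    ("c1", date(2010, 1, 2), "deposit", 400, "p2", "US"),
    ("c1", date(2010, 1, 3), "deposit", 300, "p3", "US"),
    ("c1", date(2010, 1, 4), "deposit", 200, "p4", "US"),
    ("c2", date(2010, 1, 5), "deposit", 25_000, "p5", "US"),
    ("c3", date(2010, 1, 6), "deposit", 100, "p6", "XX"),
    ("c4", date(2010, 1, 7), "deposit", 100, "p7", "US"),
]


def flag_customers(transactions):
    """Return {customer: set of reasons} for deposits matching the rules."""
    flags = defaultdict(set)
    deposits = defaultdict(list)
    for cust, day, kind, amount, party, country in transactions:
        if kind != "deposit":
            continue
        if amount >= LARGE_DEPOSIT:
            flags[cust].add("large deposit")
        if country in WATCHED_COUNTRIES:
            flags[cust].add("watched country")
        deposits[cust].append((day, party))
    for cust, items in deposits.items():
        for day, _ in items:
            # Distinct depositors in the window that starts on this day.
            parties = {p for d, p in items if day <= d < day + WINDOW}
            if len(parties) > MAX_DEPOSITORS:
                flags[cust].add("many depositors")
    return dict(flags)


flags = flag_customers(history)
print(flags)
```

Customer c1 is flagged only because four different people deposited within one week, something no single transaction reveals; that is why the rules need history rather than the current OLTP row.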

Page 42: DW Basics Bala

Size of a Data Warehouse

· 1 Bit = Binary Digit
· 8 Bits = 1 Byte
· 1000 Bytes = 1 Kilobyte
· 1000 Kilobytes = 1 Megabyte
· 1000 Megabytes = 1 Gigabyte
· 1000 Gigabytes = 1 Terabyte
· 1000 Terabytes = 1 Petabyte
· 1000 Petabytes = 1 Exabyte
· 1000 Exabytes = 1 Zettabyte
· 1000 Zettabytes = 1 Yottabyte
· 1000 Yottabytes = 1 Brontobyte (informal name, not a standard prefix)
· 1000 Brontobytes = 1 Geopbyte (informal name, not a standard prefix)

The top 500 companies in the world have data warehouses, and they are larger than 50 terabytes.
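To get a feel for these units, the conversions can be worked out in a few lines of plain Python. The sketch uses the decimal factor of 1000 from the list above and stops at Yottabytes; the function names to_bytes and human_size are made up for this example.

```python
# Units from the list above, each 1000 times larger than the one before.
UNITS = [
    "Bytes", "Kilobytes", "Megabytes", "Gigabytes", "Terabytes",
    "Petabytes", "Exabytes", "Zettabytes", "Yottabytes",
]


def to_bytes(value, unit):
    """Convert a value in the given unit to a number of bytes."""
    return value * 1000 ** UNITS.index(unit)


def human_size(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(to_bytes(50, "Terabytes"))        # 50000000000000
print(human_size(50_000_000_000_000))   # 50 Terabytes
```

So a 50-terabyte warehouse holds fifty trillion bytes, or 50,000 gigabytes.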

Page 43: DW Basics Bala

What do we offer?

• Data Warehouse concepts

> 5 sessions

• Teradata programming

> 5 sessions, covering Teradata development tasks and concepts

• Informatica / DataStage / Ab Initio ETL tool

> 35 sessions, including practical examples and a project

> You can choose any ETL tool; we recommend DataStage or Informatica

• IBM Cognos

> 30 sessions, including real-world reports

• A real-world project with a Teradata back end and an ETL tool

> 5 sessions, including real-world reports

Page 44: DW Basics Bala

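Differences between ETL tools

Company
> Informatica: Informatica, founded in 1993
> DataStage: IBM, founded in 1911
> Ab Initio: Ab Initio, a US-based company founded in 1995

Learning
> Informatica: Very easy to learn
> DataStage: Easy to learn
> Ab Initio: A little complex to learn

Jobs
> Informatica: More jobs (1479 jobs in the last few days according to a job search)
> DataStage: Fewer than Informatica (1000 jobs in the search)
> Ab Initio: Fewer than DataStage (700 jobs in the search)

Resources
> Informatica: More people know Informatica, so people are easy to find and there is more competition
> DataStage: Very few people know it, so you would be one of very few
> Ab Initio: Extremely few people know it; you would be one of them

Who uses it
> Informatica: Small to medium companies
> DataStage: Medium to large companies
> Ab Initio: Large, Fortune 1000 companies

Cost
> Informatica: Cheap
> DataStage: Reasonable
> Ab Initio: Very expensive

Speed
> Informatica: Slow
> DataStage: Better
> Ab Initio: Very fast

Parallel processing
> Informatica: Limited
> DataStage: Yes
> Ab Initio: Best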

Page 45: DW Basics Bala

Your Goal

• On your resume you should be able to present the following skills:

> DW, Unix, data modeling, Informatica ETL, DataStage, Cognos, etc.

• Jobs that you can apply for:

> Data warehousing programmer

> ETL developer

> Informatica developer, DataStage developer, Teradata developer

> Cognos report developer