DW & OLAP - Hector Garcia-Molina.ppt

Data Warehousing and OLAP Hector Garcia-Molina Stanford University


Data Warehousing Overview: Issues, Terminology, Products and Research
Warehousing
Range from desktop to huge:
Walmart: 900-CPU, 2,700 disk, 23TB Teradata system
Notes12
Outline
Why a warehouse?
What is a Warehouse?
Collection of diverse data
often a copy of operational data
with value-added data (e.g., summaries, history)
integrated
time-varying
non-volatile
What is a Warehouse?
Warehouse Architecture
Why a Warehouse?
Query-Driven Approach
Advantages of Warehousing
High query performance
Can query data not stored in a DBMS
Extra information at warehouse
Modify, summarize (store aggregates)
Advantages of Query-Driven
Less storage
More up-to-date data
Only query interface needed at sources
May be less draining on sources
OLTP vs. OLAP
Describes processing at warehouse
OLTP vs. OLAP
Data Marts
Smaller warehouses
Warehouse Models & Operators
Star
Star Schema
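A star schema places one fact table (here, sale) at the center and joins it to dimension tables by key. The sketch below is only an illustration using Python's sqlite3 module: the customer columns follow the customer table that appears in this transcript, while the product table, the column types, and every row other than custId 53 / name joe are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT);
-- Fact table: one row per sale, with a foreign key into each dimension.
CREATE TABLE sale (
    prodId TEXT    REFERENCES product(prodId),
    custId INTEGER REFERENCES customer(custId),
    date   INTEGER,
    amt    INTEGER
);
""")

# Hypothetical dimension and fact rows (illustrative values only).
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (53, "joe", "10 main", "sfo"),
    (81, "fred", "12 main", "sfo"),
    (111, "sally", "80 willow", "la"),
])
conn.executemany("INSERT INTO product VALUES (?, ?)", [("p1", "bolt"), ("p2", "nut")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", 53, 1, 12), ("p2", 53, 1, 11), ("p1", 111, 1, 50), ("p2", 81, 1, 8),
])

# A typical star join: aggregate the fact table by a dimension attribute.
amt_by_city = conn.execute("""
    SELECT c.city, SUM(s.amt)
    FROM sale s JOIN customer c ON s.custId = c.custId
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
print(amt_by_city)
```

The fact table stays narrow (keys plus the measure), and descriptive attributes such as city live only in the dimension tables.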
Terms
Dimension Hierarchies
Cube
3-D Cube
dimensions = 3
Multi-dimensional cube:
ROLAP vs. MOLAP
Aggregates
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
Aggregates
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
Another Example
In SQL: SELECT prodId, date, sum(amt) FROM SALE
GROUP BY date, prodId
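The aggregate queries on these slides can be tried against a small in-memory table. This is a minimal sketch with sqlite3; the six SALE rows are assumed sample data (chosen so the per-product, per-customer totals agree with the summary table in this transcript), not something the slides list here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, custId TEXT, date INTEGER, amt INTEGER)")
# Assumed sample rows: (prodId, custId, date, amt)
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Total amount for a single day.
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# One total per day.
by_date = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# One total per (day, product); prodId is selected so each row names its group.
by_date_prod = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(day1_total, by_date, by_date_prod)
```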
Aggregates
“Having” clause
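A HAVING clause filters groups after aggregation, whereas WHERE filters rows before it. The sketch below uses sqlite3 with assumed sample rows; the threshold of 50 is a hypothetical value picked only to show one group being dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, custId TEXT, date INTEGER, amt INTEGER)")
# Assumed sample rows: (prodId, custId, date, amt)
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Keep only the days whose total exceeds the (hypothetical) threshold.
busy_days = conn.execute(
    "SELECT date, sum(amt) FROM sale "
    "GROUP BY date HAVING sum(amt) > 50 ORDER BY date"
).fetchall()
print(busy_days)
```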
Cube Aggregation
[Figure: cube slices for day 2 and day 1]
Cube Operators
[Figure: cube slices for day 2 and day 1]
Extended Cube
[Figure: cube slices for day 2 and day 1]
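One way to picture the extended cube is as the base cells plus every cell in which one or more dimensions are replaced by "*" (meaning "all values"). The following is a rough sketch in plain Python; the sales rows are assumed sample data and the dictionary layout is just one convenient representation, not how any particular server stores a cube.

```python
from collections import defaultdict
from itertools import product as cartesian

# Assumed sample facts: (prodId, custId, date, amt)
sales = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# Each fact contributes to 2**3 = 8 cells: every combination of
# "keep the value" or "replace it by *" across the three dimensions.
cube = defaultdict(int)
for prod, cust, date, amt in sales:
    for cell in cartesian((prod, "*"), (cust, "*"), (date, "*")):
        cube[cell] += amt

print(cube[("p1", "*", "*")])  # all sales of p1
print(cube[("*", "c1", "*")])  # all sales to c1
print(cube[("*", "*", 1)])     # all sales on day 1
print(cube[("*", "*", "*")])   # grand total
```

Looking up a cell such as ("p1", "*", "*") then answers an aggregate query without rescanning the facts.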
Aggregation Using Hierarchies
(… customers c2, c3 in Region B)
[Figure: cube slices for day 2 and day 1]
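Aggregating along a hierarchy amounts to mapping each dimension value to its parent and summing. In this sketch the placement of c2 and c3 in Region B comes from the slide text; putting c1 in Region A is an assumption, and the sales rows are assumed sample data.

```python
from collections import defaultdict

# Customer -> region. c2 and c3 in Region B per the slide; c1 in A is assumed.
region_of = {"c1": "A", "c2": "B", "c3": "B"}

# Assumed sample facts: (prodId, custId, date, amt)
sales = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# Roll the customer dimension up one level, summing over all dates.
by_region = defaultdict(int)
for prod, cust, _date, amt in sales:
    by_region[(region_of[cust], prod)] += amt

for key in sorted(by_region):
    print(key, by_region[key])
```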
Pivoting
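Pivoting can be read as turning fact-table rows into a two-dimensional table keyed by two chosen dimensions. The helper below is a small illustrative sketch; the function name pivot, the dict-based row format, and the sample rows are all assumptions introduced for this example.

```python
from collections import defaultdict

def pivot(rows, row_dim, col_dim, measure="amt"):
    """Turn fact rows into {row_value: {col_value: summed measure}}."""
    table = defaultdict(dict)
    for r in rows:
        cells = table[r[row_dim]]
        cells[r[col_dim]] = cells.get(r[col_dim], 0) + r[measure]
    return {k: dict(v) for k, v in table.items()}

# Assumed sample facts for day 1, in fact-table form.
day1 = [
    {"prodId": "p1", "custId": "c1", "amt": 12},
    {"prodId": "p2", "custId": "c1", "amt": 11},
    {"prodId": "p1", "custId": "c3", "amt": 50},
    {"prodId": "p2", "custId": "c2", "amt": 8},
]

by_prod_cust = pivot(day1, "prodId", "custId")   # products as rows
by_cust_prod = pivot(day1, "custId", "prodId")   # same data, axes swapped
print(by_prod_cust)
print(by_cust_prod)
```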
Implementing a Warehouse
Integrating: Loading, cleansing,...
Monitoring
Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …
Incremental vs. Refresh
Monitoring Techniques
Periodic snapshots
Database triggers
Log shipping
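Of these techniques, periodic snapshots are the easiest to sketch: the monitor compares the previous and current dump of a source and reports what changed. The code below assumes each snapshot is a dict keyed by primary key; the function name and the sample records are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots ({key: record}) and return the changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserted, deleted, updated

# Hypothetical customer snapshots taken at two points in time.
yesterday = {53: ("joe", "sfo"), 81: ("fred", "sfo"), 111: ("sally", "la")}
today = {53: ("joe", "la"), 111: ("sally", "la"), 120: ("ann", "nyc")}

inserted, deleted, updated = diff_snapshots(yesterday, today)
print(inserted, deleted, updated)
```

Only the three change sets need to be shipped to the warehouse, at the cost of scanning both snapshots in full.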
Monitoring Issues
Data transformation
remove & add fields (e.g., add date to get history)
Standards (e.g., ODBC)
Integration
Data Cleaning
Scrubbing: use domain-specific knowledge (e.g., social security numbers)
Fusion (e.g., mail list, customer merging)
Auditing: discover rules & relationships
Loading Data
Parallel/Partitioned load
Derived Data
Incremental vs. refresh
Materialized Views
does not exist at any source
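A materialized view of this kind can be kept current either by recomputing it from scratch (refresh) or by applying only the new changes (incremental). The sketch below contrasts the two for a hypothetical "total amount per date" view; the function names and the sales rows are assumptions for illustration.

```python
from collections import defaultdict

def refresh(sales):
    """Recompute the view (date -> total amt) from all base rows."""
    view = defaultdict(int)
    for _prod, _cust, date, amt in sales:
        view[date] += amt
    return dict(view)

def apply_insert(view, row):
    """Incrementally fold one newly inserted base row into the view."""
    _prod, _cust, date, amt = row
    view[date] = view.get(date, 0) + amt

# Assumed sample facts: (prodId, custId, date, amt)
day1_rows = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11),
             ("p1", "c3", 1, 50), ("p2", "c2", 1, 8)]
day2_rows = [("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

view = refresh(day1_rows)      # initial load
for row in day2_rows:          # later changes arrive one at a time
    apply_insert(view, row)

print(view)
print(view == refresh(day1_rows + day2_rows))  # both strategies agree
```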
Processing
Index Structures
ROLAP Server
MOLAP Server
Index Structures
Popular in Warehouses
Inverted Lists
Using Inverted Lists
List for name = “fred”: r18, r52
Answer is intersection: r18
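The lookup can be sketched as an intersection of record-id lists. The list for name = "fred" is taken from the slide; the list for age = 20 is not shown in this transcript, so the one used here is an assumed example that happens to contain r18.

```python
# Inverted lists: (attribute, value) -> record ids holding that value.
inverted = {
    ("name", "fred"): ["r18", "r52"],          # from the slide
    ("age", 20): ["r4", "r18", "r34", "r35"],  # assumed for illustration
}

def lookup(*conditions):
    """Return record ids satisfying every (attribute, value) condition."""
    lists = [set(inverted.get(cond, [])) for cond in conditions]
    return sorted(set.intersection(*lists)) if lists else []

answer = lookup(("age", 20), ("name", "fred"))
print(answer)
```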
Bit Maps
Using Bit Maps
List for age = 20: 1101100000
List for name = “fred”: 0100000001
Answer is intersection: 0100000000
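With bit maps the intersection becomes a bitwise AND. The sketch below uses the two 10-bit lists from the slide, stored as Python integers; treating the leftmost bit as record position 0 is an assumption made for the example.

```python
WIDTH = 10  # number of records covered by each bit map

age_20 = int("1101100000", 2)     # bit map for age = 20 (from the slide)
name_fred = int("0100000001", 2)  # bit map for name = "fred" (from the slide)

# A record satisfies both conditions only where both bits are 1.
both = age_20 & name_fred
bits = format(both, f"0{WIDTH}b")

# Positions of matching records, counting from the left starting at 0.
matches = [i for i, b in enumerate(bits) if b == "1"]
print(bits, matches)
```

The AND touches one machine word per group of records, which is why bit maps suit low-cardinality attributes.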
Join
Join Indexes
join index
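A join index can be thought of as a precomputed map from each dimension key to the fact rows that join with it, so the join need not be recomputed at query time. This is a minimal sketch; the use of list positions as row ids, the helper name, and the sample rows are assumptions.

```python
from collections import defaultdict

# Assumed sample facts: (prodId, custId, date, amt); list position = row id.
sales = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# Build the join index once: product key -> ids of matching sale rows.
join_index = defaultdict(list)
for rid, (prod, _cust, _date, _amt) in enumerate(sales):
    join_index[prod].append(rid)

def sales_for(prod_id):
    """Fetch the sale rows for one product via the index, without a scan."""
    return [sales[rid] for rid in join_index.get(prod_id, [])]

print(join_index["p2"])
print(sales_for("p2"))
```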
What to Materialize?
Example:
Materialization Factors
Cube Aggregates Lattice
city, product, date
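The lattice of cube aggregates can be enumerated mechanically: every subset of the grouping dimensions is a node, and an edge runs from a node to each node obtained by dropping one dimension. The sketch below does this for (city, product, date); it only illustrates the structure and says nothing about which nodes are worth materializing.

```python
from itertools import combinations

dims = ("city", "product", "date")

# Nodes: every subset of the dimensions, from the finest grouping
# (all three) down to the empty grouping (the grand total).
nodes = [
    frozenset(combo)
    for k in range(len(dims), -1, -1)
    for combo in combinations(dims, k)
]

# Edges: a child can be computed from a parent by dropping one dimension.
edges = [
    (parent, child)
    for parent in nodes
    for child in nodes
    if child < parent and len(parent) - len(child) == 1
]

for node in nodes:
    print(", ".join(d for d in dims if d in node) or "all")
print(len(nodes), "nodes,", len(edges), "edges")
```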
Dimension Hierarchies
city, product
Interesting Hierarchy
Design
How to clean data?
What to summarize?
What to materialize?
What to index?
Tools
Development
Planning & Analysis
Warehouse Management
System & Network Management
Workflow Management
Current State of Industry
Everything copied at warehouse
Query optimization aimed at OLTP
High throughput instead of fast response
Process whole query before displaying anything
        c1    c2    c3     *
p1      56     4    50   110
p2      11     8          19
*       67    12    50   129
customer
custId  name  address  city
53      joe