DW & Informatica basics


Data Warehousing Basics

    What is Data Warehousing?

A data warehouse is a collection of data designed to support management decision making. In other words, it is a repository of integrated information, available for querying and analyzing. According to Inmon, the author of several data warehouse books, "A data warehouse is a subject oriented, integrated, time variant, non volatile collection of data in support of management's decision making process."

Who needs data warehousing?

It is needed by knowledge workers, e.g. managers, analysts, executives and any authorized person who needs information from a large-scale database.

    Types of Systems

    There are two types of systems.

1. OLTP
2. DSS (OLAP)

Feature | OLTP | OLAP
Characteristic | Operational processing | Informational processing
Orientation | Transactional | Analysis
User | Clerk, DBA | Knowledge worker
Function | Day-to-day operations | Long-term informational requirements
DB design | ER-based, application oriented | Star/snowflake, subject oriented
View | Detailed, flat relational | Summarized, multidimensional
Access | Read/write | Mostly read
DB size | 10 MB to 100 MB | 100 MB to TB

    Data Warehouse Life Cycle

The data warehouse life cycle comprises various phases.

Phase 1: Business Requirements Collection

A business analyst is responsible for gathering requirements from the end users for the following example domains:

1. Telecom
2. Insurance
3. Manufacturing
4. Sales & Retail

Phase 2: Data Modeling
It is the process of designing the database, done by a database architect using a tool such as ERWIN.

Phase 3: ETL Development
An application developer designs an ETL application by following the ETL specification, using GUI-based tools such as INFORMATICA or DATASTAGE.


Phase 4: ETL Testing
This phase is completed by the ETL tester as well as the application developer. The following tests are carried out in the test environment:

1. ETL unit testing
2. System testing
3. Performance testing
4. UAT (User Acceptance Testing)

Phase 5: Report Development
Design the reports by fulfilling the report requirement templates using the following tools:

Cognos
BO

Phase 6: Deployment
It is the process of migrating the ETL and report development applications to the production environment.

Phase 7: Maintenance
Maintain the data warehouse in a 24*7 environment with the help of the production support team.

Data Warehouse Database Design
A data warehouse is designed with the following types of schemas:

1. Star Schema
2. Snowflake Schema
3. Galaxy Schema

1. Star Schema: a database design which contains a centrally located fact table surrounded by dimension tables. Since the database design looks like a star, it is called a star schema database design.

A fact table contains facts. Facts are numeric measures. Not every numeric measure is a fact; only numeric measures that change over time and act as key performance indicators are known as facts.

A dimension is descriptive data which describes the key performance indicators known as facts.

Dimension tables are de-normalized. A dimension provides answers to the following questions:

Who, What, When, Where
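As a rough SQL sketch of a star schema (the table and column names below are illustrative assumptions, not taken from this document):

-- Dimension tables: de-normalized, descriptive data (who, what, when, where)
CREATE TABLE customer_dim (
  cust_id   NUMBER(10) PRIMARY KEY,
  cust_name VARCHAR2(50),
  city      VARCHAR2(30),
  state     VARCHAR2(30)
);

CREATE TABLE product_dim (
  product_id   NUMBER(10) PRIMARY KEY,
  product_name VARCHAR2(50),
  category     VARCHAR2(30)
);

CREATE TABLE date_dim (
  date_id   NUMBER(10) PRIMARY KEY,
  cal_date  DATE,
  cal_month NUMBER(2),
  cal_year  NUMBER(4)
);

-- Centrally located fact table: numeric measures (facts) plus foreign keys to the dimensions
CREATE TABLE sales_fact (
  sale_id    NUMBER(10) PRIMARY KEY,
  cust_id    NUMBER(10) REFERENCES customer_dim (cust_id),
  product_id NUMBER(10) REFERENCES product_dim (product_id),
  date_id    NUMBER(10) REFERENCES date_dim (date_id),
  qty_sold   NUMBER(10),
  sales_amt  NUMBER(12,2)
);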


    Star Schema

    2. Snowflake Schema

The snowflake schema is a variant of the star schema, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.

Advantage:
Space can be minimized by splitting data into normalized tables.

Disadvantage:
It can hamper query performance due to the larger number of joins.

    Snowflake Schema

    3. Galaxy Schema (Fact Constellation Schema)

Sophisticated applications may require multiple fact tables to share dimension tables. This type of schema can be viewed as a combination of stars, hence it is called a galaxy schema or fact constellation schema.

[Figure: galaxy schema showing two fact tables (e.g. Sales Fact with Sale_id (pk), Cust_id (fk), Store_id (fk), Product_id (fk), Date_id (fk)) that share dimension tables such as Store, Time, Customer, Product, Item and City.]


Galaxy Schema

Dimensions

Dimension tables are sometimes called lookup or reference tables.

1. Conformed Dimension: A dimension table which can be shared by multiple fact tables is known as a conformed dimension.

2. Junk Dimension: A dimension with descriptive, flag or Boolean attributes which are not used to describe the key performance indicators known as facts; such dimensions are called junk dimensions. Example: product description, address, phone number etc.

3. Slowly Changing Dimension: Dimensions that change over time are called slowly changing dimensions. For instance, a product price changes over time; people change their names for some reason; country and state names may change over time. These are a few examples of slowly changing dimensions, since changes happen to them over a period of time. Slowly changing dimensions are often categorized into three types, namely Type 1, Type 2 and Type 3. The following section deals with how to capture and handle these changes over time.

Type 1: Overwriting the old values.
In the year 2005, if the price of the product changes to $250, then the old values of the columns "Year" and "Product Price" have to be updated and replaced with the new values. With Type 1 there is no way to find out the old value of the product "Product1" in year 2004, since the table now contains only the new price and year information.

Type 2: Creating an additional record.
With Type 2 the old values are not replaced; instead a new row containing the new values is added to the product table. So at any point of time the difference between the old values and the new values can be retrieved and easily compared. This is very useful for reporting purposes.

Type 3: Creating new fields.
With Type 3 the latest update to the changed values can be seen. The example below illustrates how to add new columns and keep track of the changes. From that, we are able to see the current price and the previous price of the product, Product1.
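The three approaches can be sketched in SQL as follows (the product_dim table, its columns and the product_key_seq sequence are assumed for illustration):

-- Type 1: overwrite the old values (history is lost)
UPDATE product_dim
SET    product_price = 250,
       price_year    = 2005
WHERE  product_name  = 'Product1';

-- Type 2: keep the old row and add a new row with a new surrogate key
INSERT INTO product_dim (product_key, product_name, product_price, price_year)
VALUES (product_key_seq.NEXTVAL, 'Product1', 250, 2005);

-- Type 3: add new columns that hold the previous values
ALTER TABLE product_dim ADD (prev_price NUMBER(10,2));
UPDATE product_dim
SET    prev_price    = product_price,
       product_price = 250
WHERE  product_name  = 'Product1';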


Data Modeling
A data model is a conceptual representation of the data structures (tables) required for a database and is very powerful in expressing and communicating the business requirements. A data model visually represents the nature of the data, the business rules governing the data, and how it will be organized in the database. Data modeling consists of three phases to design the database.

    1. Conceptual Modeling

    Understand the business requirements

    Identify the entities (tables)

    Identify the columns (attributes)

Identify the relationships

2. Logical Modeling

    Design the tables with the required attributes.

    3. Physical Modeling

Implement the logical tables so that they physically exist in the database.

    Data modeling tools

There are a number of data modeling tools to transform business requirements into a logical data model, and a logical data model into a physical data model. From the physical data model, these tools can be instructed to generate the SQL code for creating the database.


INFORMATICA

Introduction

It is a GUI-based ETL product from Informatica Corporation.

It is a client-server technology.

It is developed using the JAVA language.

It is an integrated tool set (to design, to run, to monitor).

Versions:

1. 5.0
2. 6.0
3. 7.1.1
4. 8.1.1
5. 8.5
6. 8.6

    Meta Data

Metadata is data about data, i.e. data that describes data and other structures such as objects, business rules, and processes. Example: table structure (column name, data type, precision, scale and keys), description.

    Mapping

A mapping is a GUI representation of the data flow from source to target. In other words, it is the definition of the relationship and data flow between source and target objects.

    Requirements for mappings

    a) Source Metadata

    b) Business logic

    c) Target Metadata

    Repository

    Central Database or Metadata Storage place

[Figure: Informatica sits between the source database and the target data warehouse; the repository is the working place of Informatica.]


    Staging Area

    A place where data is processed before entering the warehouse.

    Source System

A database, application, file, or other storage facility from which the data in a data warehouse is derived.

    Target System

A database, application, file, or other storage facility to which the "transformed source data" is loaded in a data warehouse.

    Cleansing

The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.

    Transformation

The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.

    Transportation

    The process of moving copied or transformed data from a source to a data warehouse.

    Working Professional Divisions

Two Flavors

1. Informatica PowerCenter: for large-scale industries.

2. Informatica PowerMart: for small-scale industries.

    Components of Informatica

Client Components

1. Designer
2. Workflow Manager
3. Workflow Monitor
4. Repository Manager
5. Admin Console

Designation | Roles
ETL Architect | Designing schema, ETL specification
Developer | Developing ETL application
Administrator | Installation, configuration, managing, monitoring



Roles for Designer
1. Use Mapping
2. Source Analyzer
3. Connect to the source database with ODBC
4. Target Designer
5. Mapping Designer
6. Mapplet Designer
7. Transformation Developer

    Roles for Workflow Manager

    1. Task Developer

2. Workflow Designer
3. Worklet Designer

[Figure: working flow of the client components. Designer: import the source definition, import the target metadata, design the mapping; the mapping (M_xyz) is saved to the repository. Workflow Manager: create a session (S_xyz) for the mapping, create a workflow, and start it; execution happens on the Informatica server, where the Integration Service is responsible for execution. Workflow Monitor: monitoring the mapping/session. Admin Console: for administrative purposes.]

  • 7/27/2019 DW & Informatica basics

    10/49

Working flow of the client components in Informatica

Note: running a mapping in Informatica is done by creating a session for it.


How is the mapping done?

Source table: Customer
CID number(4) pk
Cfname varchar2(5)
Clname varchar2(5)
Gender number(1)

Transformations: Concat(Cfname, Clname); Decode(Gender, 0, F, 1, M)

Note: at the time of planning (mapping) you work with metadata only. At the time of execution you work with data records.
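The transformation logic of this mapping corresponds roughly to the following SQL (a sketch using the Customer and Dim_Customer structures shown in this example):

INSERT INTO dim_customer (cid, cname, gender)
SELECT cid,
       cfname || clname               AS cname,   -- Concat(Cfname, Clname)
       DECODE(gender, 0, 'F', 1, 'M') AS gender   -- Decode(Gender, 0, F, 1, M)
FROM   customer;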

Power Center Components
When we install the Power Center enterprise solution, the following components get installed:

1. Power Center Clients
2. Power Center Repository
3. Repository Services
4. Integration Services
5. Web Service Hub
6. Power Center Domain
7. Power Center Administration Console

[Figure: the client (GUI) mapping reads the source table Customer (CID, Cfname, Clname, Gender) from the source database over ODBC, applies CONCAT() to derive Cname and DECODE() to derive Gender, and loads the target Dim_Customer (CID number(4) pk, Cname varchar2(10), Gender varchar(1)) in the data warehouse (target database) over ODBC.]


Designer | Workflow Manager | Workflow Monitor | Repository Manager
Create source definition | Create session for each mapping | View workflow & session status | Create, edit & delete folders
Create target definition | Create workflow | Get session log | Create users, groups, assign permissions
Define T/R rule | Execute workflow | |
Design mapping | Schedule workflow | |

    Informatica PowerCenter Client Architecture

Note: one workflow can contain more than one session, but one session will contain only one mapping. The workflow is the upper layer of the development, the session is the middle layer and the mapping is the inner layer.

[Figure: Informatica PowerCenter architecture. External clients reach the system through the Web Services Hub; the Repository Service manages the repository, which stores mappings, source definitions, target definitions, T/R rules, sessions, workflows, session logs and schedule info; the Integration Service performs E, T and L, moving data from the source DB through the staging area to the target DB.]


    Power Center Clients

The following Power Center clients get installed:

1. Designer

It is a GUI-based client component which allows you to design the plan of the ETL process, called a mapping.

The following types of metadata objects can be created using the Designer client:
a) Create source definition
b) Create target definition
c) Design mapping, with or without a transformation rule

2. Workflow Manager

It is a GUI-based client component which allows you to carry out the following tasks:

a) Create a session for each mapping
b) Create workflow
c) Execute workflow
d) Schedule workflow

3. Workflow Monitor
It is a GUI-based client component which provides the following information:

a) Gives the workflow and session status (Succeeded or Failed)
b) Gets the session log from the repository
c) Starts and stops sessions and workflows

4. Repository Manager

The Repository Manager is a GUI-based administrative client which allows you to create the following objects:

a) Create, edit and delete folders, which are required to organize the metadata in the repository.

b) Create users and user groups, assign permissions and privileges.

5. Power Center Repository

The Power Center Repository is a relational database (system database) which contains the instructions required to extract, transform and load data.

The Power Center client applications access the repository database through the Repository Service.

The repository consists of metadata which describes the different types of objects such as source definitions, target definitions, mappings etc.

The Integration Service uses repository objects to perform extraction, transformation and loading of data.

The repository also stores administrative information such as usernames, passwords, permissions and privileges.

The Integration Service also creates metadata such as session logs, workflow and session status, and start and finish times of the session, and stores it in the repository through the Repository Service.


    6. Repository Service

The Repository Service manages connections to the Power Center repository from client applications.

The Repository Service is a multithreaded process that inserts, retrieves, deletes and updates metadata in the repository.

The Repository Service ensures the consistency of the metadata in the repository.

The following Power Center applications can access the Repository Service:
a) Power Center Client
b) Integration Service
c) Web Service Hub
d) Command line programs (for backup and recovery, for administrative purposes)

7. Integration Service

The Integration Service reads mapping and session information from the repository. It extracts the data from the mapping sources and stores it in memory (staging area), where it applies the transformation rules that you configure in the mapping.

The Integration Service loads the transformed data into the mapping targets.

The Integration Service connects to the repository through the Repository Service to fetch the metadata.

8. Web Service Hub

The Web Service Hub is a web service gateway for external clients. Web service clients (e.g. Internet Explorer, Mozilla) access the Integration Service and the Repository Service through the Web Service Hub. It is used to run and monitor web-enabled workflows.

    Definitions

Session: A session is a set of instructions which performs extraction, transformation and loading. A session is created to make the mapping available for execution.

Workflow: A workflow is a start task which contains a set of instructions to execute the other tasks, such as sessions. The workflow is the top object in the Power Center development hierarchy.

Schedule Workflow: A workflow schedule is an administrative task which specifies the date and time to run the workflow.

** The following client components communicate with the Integration Service:

1. Workflow Manager
2. Workflow Monitor


    Transformation

    A transformation is an object used to define business logic for processing the data.

Transformations can be categorized in two ways:
1. Based upon the number of rows processed
2. Based upon connection

Based upon the number of rows processed, there are two types of transformation:
1. Active Transformation
2. Passive Transformation

    Active Transformation:

A transformation which can affect the number of rows while data is moving from source to target is known as an active transformation. The following active transformations are used for processing data:

1. Source Qualifier Transformation
2. Filter Transformation
3. Aggregator Transformation
4. Joiner Transformation
5. Router Transformation
6. Rank Transformation
7. Sorter Transformation
8. Update Strategy Transformation
9. Transaction Control Transformation
10. Union Transformation
11. Normalizer Transformation
12. XML Source Qualifier
13. Java Transformation
14. SQL Transformation

    Passive Transformation:

A transformation which does not affect the number of rows when the data is moving from source to target is known as a passive transformation. The following passive transformations are used for processing data:

1. Expression Transformation
2. Sequence Generator Transformation
3. Stored Procedure Transformation
4. Lookup Transformation

Example of Active Transformation:

DFD: Emp -> SQ_Emp -> Filter Transformation (SAL > 3000) -> T_Emp; 14 rows read, 14 rows into the filter (I), 6 rows out (O).


Example of Passive Transformation:

DFD: Emp -> SQ_Emp -> Expression Transformation (Tax = Sal * 0.10) -> T_Emp; 14 rows in (I), 14 rows out (O).

Based on connection there are two types of transformation:
1. Connected
2. Unconnected

Connected: A transformation which participates in the mapping data flow (connected to the source and the target) is known as a connected transformation.
-- All active and passive transformations can be used as connected transformations.
-- A connected transformation can receive multiple inputs and can provide multiple outputs.

[Figure: a connected transformation placed between source (S) and target (T), with input/output ports such as Tax and Annual Sal.]

Unconnected: A transformation which does not participate in the mapping data flow (connected neither to the source nor to the target) is known as an unconnected transformation.
-- An unconnected transformation can receive multiple inputs but provides a single output.
-- The following transformations can be used as unconnected transformations:
1. Stored Procedure Transformation
2. Lookup Transformation


  • 7/27/2019 DW & Informatica basics

    17/49

    Port & Types of Port

A port represents a column of a database table or file.

    The following are the types of port.

    1. Input Port

2. Output Port

Input Port: A port which can receive data is known as an input port, represented as I.

Output Port: A port which can provide data is known as an output port, represented as O.

    ETL Specification Document (Mapping Specification Document)

A mapping specification document is an Excel sheet or Word document which contains information about the following objects:

    1. Source

    2. Target

    3. Business Logic (Transformation Rule)

Source | Target | Transformation Rule
Source type | Target type | Calculate Tax (sal*0.10) for the top 3 employees based on the salary in dept 30
Source table | Target table |
Source column | Target column |
Format type (P, S) | Format type (P, S) |
Description | Description |

DFD

Emp -> SQ_EMP (14 rows) -> Filter (Dept = 30): 14 in, 6 out -> Rank (Top 3): 6 in, 3 out -> Expression (Tax = Sal * 0.10): 3 in, 3 out -> T_Emp

    1. Filter Transformation

This is an active transformation which allows you to filter the data based on a given condition.
-- A condition is created with three elements:

1. Port
2. Operator
3. Operand



The Integration Service evaluates the filter condition against each input record and returns TRUE or FALSE.
-- The Integration Service returns TRUE when the record satisfies the condition; those records are passed on for further processing or for loading into the target.
-- The Integration Service returns FALSE when the input record does not satisfy the condition; those records are rejected by the filter transformation.
-- The filter transformation does not support the IN operator.
-- The filter transformation supports sending the data to a single target.
-- Use the filter transformation to perform data cleansing activities.
-- The filter transformation functions as a WHERE clause in terms of SQL.
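For example, the filter condition SAL > 3000 used in the DFD above behaves like the following SQL (a sketch on the EMP table):

SELECT *
FROM   emp
WHERE  sal > 3000;   -- rows for which the condition is FALSE are rejected, as in the filter transformation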

    2. Rank Transformation

This is an active transformation which allows you to identify the TOP and BOTTOM performers.

-- The rank transformation can be created with the following types of ports:

1. Input Port
2. Output Port
3. Rank Port (R)
4. Variable Port (V)

Rank Port: the port based on which the rank is determined is known as the rank port.

Variable Port: a port which can store data temporarily is known as a variable port.

The following properties need to be set for calculating the ranks:
1. Top/Bottom
2. Number of ranks

The rank transformation is created by default with an output port called Rank Index.

Dense Ranking: it is the process of calculating the ranks for each group.

Sampling: it is the process of reading data of a specified size (number of records) for testing.
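A SQL sketch of what a rank transformation set to Top/3 on salary in dept 30 computes (the RANK() analytic function plays the role of the Rank Index output port; EMP is the sample table used elsewhere in this document):

SELECT empno, ename, sal, rnk AS rank_index
FROM (
      SELECT empno, ename, sal,
             RANK() OVER (ORDER BY sal DESC) AS rnk   -- add PARTITION BY deptno to rank per group
      FROM   emp
      WHERE  deptno = 30
     )
WHERE  rnk <= 3;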

    3. Expression Transformation

This is a passive transformation which allows you to calculate an expression for each record.

The expression can be calculated only in the output ports.

Use the expression transformation to perform data cleansing and data scrubbing activities.

Expression transformations are defined only on the output ports.

    4. Sorter Transformation:

This is an active transformation which sorts the data in ascending or descending order.
-- The port on which sorting takes place is marked as a key.
-- Use the sorter transformation for eliminating duplicates.


    5. Aggregator Transformation

This is an active transformation which allows you to calculate a summary for a group of records.

The aggregator transformation is created with the following four components:

1. Group By: it defines the group on a port for which summaries are calculated, e.g. Deptno.

2. Aggregate Expression: aggregate expressions can be developed only in the output ports, using aggregate functions such as:
-- sum()
-- max()
-- avg()

3. Sorted Input: the aggregator transformation receives sorted data as input to improve the performance of the summary calculations. The ports on which the group is defined need to be sorted using a sorter transformation (only the group-by ports need to be sorted by the sorter transformation).

4. Aggregate Cache: the Integration Service creates the cache memory the first time the session executes.
-- The aggregate cache is stored on the server hard drive.
-- Incremental aggregation uses the aggregate cache to improve the performance of the session.
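A SQL analogue of what the aggregator computes when grouping the EMP table on Deptno (a sketch only; in the mapping this happens inside the aggregator transformation):

SELECT   deptno,                -- group-by port
         SUM(sal) AS sum_sal,   -- aggregate expressions developed in output ports
         MAX(sal) AS max_sal,
         AVG(sal) AS avg_sal
FROM     emp
GROUP BY deptno
ORDER BY deptno;                -- sorted input corresponds to rows arriving already ordered by the group-by port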

Incremental Aggregation
It is the process of calculating the summary only for the new records which pass through the mapping, using the historical cache.

Note: sorted input and incremental aggregation cannot both be used in the same application to achieve greater performance (the session fails because the ROWIDs will not match).

    6. Lookup Transformation

This is a passive transformation which allows you to perform a lookup on relational tables, flat files, synonyms and views.
-- When the mapping contains a lookup transformation, the Integration Service queries the lookup data and compares it with the transformation port values (e.g. EMP.DEPTNO = DEPT.DEPTNO).

-- A lookup transformation can be created with the following types of ports:
1. Input Port (I)
2. Output Port (O)
3. Lookup Port (L)
4. Return Port (R)

-- There are two types of lookup:
1. Connected
2. Unconnected


-- Use the lookup transformation to perform the following tasks:
1. Get a related value
2. Update slowly changing dimensions

    Difference between Expression and Aggregator Transformation

Expression Transformation | Aggregator Transformation
Passive transformation | Active transformation
Expressions are calculated for each record | Expressions are calculated for a group of records
Non-aggregate functions are used | Aggregate functions are used

    7. Joiner Transformation

This is an active transformation which allows you to combine the data from multiple sources into a single output based on a given join condition.
-- The joiner transformation is created with the following types of ports:

1. Input Port
2. Output Port
3. Master Port (M)

The source which is defined with the lesser number of records is designated as the master source. The master source is created with the master ports. The joiner transformation can be created with the following types of join:
1. Normal join (equi join)
2. Master outer join
3. Detail outer join
4. Full outer join
The default join type of the joiner transformation is the normal join (equi join).

1. Normal join: keeps only the rows matching the condition.

2. Master outer join: keeps all rows from the detail source and the matching rows from the master.

3. Detail outer join: keeps all rows from the master source and the matching rows from the detail.

4. Full outer join: keeps all rows from both master and detail.

The joiner transformation does not support non-equi joins.
Use the joiner transformation to merge data records horizontally.
Use the joiner transformation to perform joins on the following types of sources:

1. Table + Table
2. Flat file + Flat file
3. XML file + XML file
4. Table + Flat file
5. Table + XML file
6. Flat file + XML file
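The four join types correspond roughly to the following SQL, assuming DEPT is the master source (fewer rows) and EMP the detail source:

-- Normal join (equi join): only matching rows
SELECT e.ename, d.dname FROM emp e JOIN dept d ON e.deptno = d.deptno;

-- Master outer join: all detail (EMP) rows plus matching master (DEPT) rows
SELECT e.ename, d.dname FROM emp e LEFT OUTER JOIN dept d ON e.deptno = d.deptno;

-- Detail outer join: all master (DEPT) rows plus matching detail (EMP) rows
SELECT e.ename, d.dname FROM emp e RIGHT OUTER JOIN dept d ON e.deptno = d.deptno;

-- Full outer join: all rows from both master and detail
SELECT e.ename, d.dname FROM emp e FULL OUTER JOIN dept d ON e.deptno = d.deptno;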


Merging

-- Horizontal merging: Joiner Transformation (equi join, master outer join, detail outer join, full outer join).
-- Vertical merging: Union Transformation.

    8. Router Transformation

The router transformation is an active transformation which allows you to apply multiple conditions in order to load multiple target tables.
-- It is created with two types of groups:
1. Input group: receives the data from the source.
2. Output groups: send the data to the targets.
Output groups are also of two types:

1. User-defined groups, which allow you to apply conditions.

2. The default group, which captures the rejected records.
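In SQL terms the router behaves like Oracle's conditional multi-table insert (a sketch; the SALES source columns and the state-wise target tables are assumptions following the DFD below):

INSERT FIRST
  WHEN state = 'HR' THEN INTO sales_hr      (sale_id, state, amount) VALUES (sale_id, state, amount)
  WHEN state = 'DL' THEN INTO sales_dl      (sale_id, state, amount) VALUES (sale_id, state, amount)
  WHEN state = 'KA' THEN INTO sales_ka      (sale_id, state, amount) VALUES (sale_id, state, amount)
  ELSE                   INTO sales_default (sale_id, state, amount) VALUES (sale_id, state, amount)
SELECT sale_id, state, amount FROM sales;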


    Difference between Filter & Router transformation.

Filter | Router
Single condition | Multiple conditions
Single target | Multiple targets
Cannot capture rejects | Captures the rejects

DFD: Sales -> Sales_SQ -> Router Transformation (input group; user-defined output groups State = HR, State = DL, State = KA; default group) -> multiple targets.

    9. Union Transformation

The union transformation combines multiple input flows into a single output flow. It supports homogeneous as well as heterogeneous sources. It is created with two groups:

1. Input groups: receive the information.
2. Output group: sends the information to the target.

The union transformation works like UNION ALL in Oracle.

Note: all the sources should have the same structure.
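For example (a sketch, assuming two source tables with identical structure):

SELECT sale_id, state, amount FROM sales_north
UNION ALL                      -- like the union transformation, duplicates are kept
SELECT sale_id, state, amount FROM sales_south;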

    10. Stored Procedure Transformation

This is a passive transformation which is used to call a stored procedure in the database. A stored procedure is a set of pre-compiled SQL statements which receives input and provides output.

There are two types of stored procedure transformation:
1. Connected stored procedure
2. Unconnected stored procedure

The following properties need to be set for the stored procedure transformation:
i. Normal
ii. Source Pre Load
iii. Source Post Load
iv. Target Pre Load
v. Target Post Load



Use the Normal property when the stored procedure performs a calculation.

    11. Source Qualifier Transformation

This is an active transformation which allows you to read the data from databases and flat files (text files).

SQL Override
It is the process of changing the default SQL using a source filter, user-defined joins, sorted input data and elimination of duplicates (Distinct). The source qualifier transformation supports SQL override when the source is a database. The overridden logic is processed on the database server, so the business logic processing is shared between the Integration Service and the database server. This improves the performance of data acquisition.

User-Defined Joins: if the two sources belong to the same database user account or the same ODBC connection, then apply the join in the source qualifier rather than using a joiner transformation.
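A sketch of an overridden source qualifier query that pushes the user-defined join, source filter, DISTINCT and sorting down to the database (EMP and DEPT as used elsewhere in this document):

SELECT DISTINCT e.empno, e.ename, e.sal, d.dname   -- eliminate duplicates
FROM   emp e, dept d
WHERE  e.deptno = d.deptno                          -- user-defined join
AND    e.sal > 3000                                 -- source filter
ORDER  BY e.empno;                                  -- sorted input data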

    Mapplet & Types of Mapplet

A mapplet is a reusable metadata object created with business logic using a set of transformations. A mapplet is created using the Mapplet Designer tool. There are two types of mapplet:

1. Active mapplet: it is created with a set of active transformations.
2. Passive mapplet: it is created with a set of passive transformations.

A mapplet can be reused in multiple mappings, with the following restrictions:
1. When you want to use a stored procedure transformation, you should use the stored procedure transformation with the type Normal.
2. When you want to use a sequence generator transformation, you should use a reusable sequence generator transformation.
3. The following objects cannot be used to create a mapplet:
i. Normalizer Transformation
ii. XML Source Qualifier Transformation
iii. Pre/Post-session Stored Procedure Transformation
iv. Mapplets (nested mapplets)

Note:
Reusable Transformation: contains a single transformation.
Mapplet: contains a set of transformations.

    Reusable Transformation

A reusable transformation is a reusable metadata object which contains business logic created with a single transformation.


It is created in two different ways:
i. Using the Transformation Developer
ii. Converting a non-reusable transformation into a reusable transformation

Restriction
The source qualifier transformation cannot be created as a reusable transformation.

    Constraints Based Load Ordering (CBL)

A CBL specifies the load order into multiple targets based on primary key and foreign key relationships. A CBL is specified when you want to load data into snowflake schema dimensions, which have primary and foreign key relationships.

[Figure: a single source Emp_Dept (Empno, Ename, Job, Sal, Deptno, Dname, Loc) feeds a source qualifier that loads two targets, Emp (Empno, Ename, Job, Sal, Deptno) and Dept (Deptno, Dname, Location); CBL controls the order in which the two targets are loaded.]

    Scheduling Workflow

A schedule specifies the date and time to run the workflow. There are two types of schedule:

1. Reusable schedule: a schedule which can be attached to multiple workflows is known as a reusable schedule.

2. Non-reusable schedule: a schedule which is created at the time of creating the workflow is known as a non-reusable schedule.

A non-reusable schedule can be converted into a reusable schedule.

    Target Load Plan

A target load plan specifies the order in which data is extracted from the source qualifier transformations.

    Flat Files

A flat file is an ASCII text file saved with an extension such as .txt or .csv. There are two types of flat files:

1. Delimited flat files: in this type of file each field or column is separated by some special character, such as a comma, tab, space, semicolon etc.

2. Fixed-width flat files: a record of continuous length is split into multiple fields by position.



Note:

-- Relational Reader: reads the data from relational sources.

-- File Reader: reads the data from flat files.

-- XML Reader: reads the data from XML sources.

-- Relational Writer: writes the data to relational targets.

-- File Writer: writes the data to flat file targets.

-- XML Writer: writes the data to XML file targets.

-- DTM (Data Transformation Manager): processes the business logic defined in the mapping.

The above readers, writers and the DTM are known as Integration Service components.

    File List

A file list is a list of flat files with the same data definition which need to be merged; the files are read with the source file type set to Indirect.

    XML Source Qualifier Transformation

This transformation is used to read the data from XML files (just like the source qualifier).

Every XML source definition is by default associated with an XML source qualifier transformation.

XML is a case-sensitive markup language, saved with the extension .xml.

Note: XML files are case-sensitive files.

    XML File Example:

Emp.xml contains employee records with the values (100, PRAKASH, DEVELOPER, 17000, 20) and (200, JITESH, MANAGER, 77000, 20).


    Normalizer Transformation

This is an active transformation which reads data from COBOL file sources. It is used to read files from COBOL sources. Every COBOL source definition is by default associated with a normalizer transformation.

The normalizer transformation functions like a source qualifier when reading data from COBOL sources.

Use the normalizer transformation to convert a single input record from the source into multiple output data records. This process is known as data pivoting.

    Example:

    File name: Account.txt

    Year Account Month1 Month2 Month3

    2008 Salary 25000 30000 28000

    2008 Others 5000 6000 4000

    Output

    Year Account Month Amount

    2008 Salary 1 25000

    2008 Salary 2 30000

2008 Salary 3 28000
2008 Others 1 5000

    2008 Others 2 6000

    2008 Others 3 4000
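The same data pivoting can be expressed in SQL (a sketch, assuming the Account data is available as a relational table account_src with columns year, account, month1, month2 and month3):

SELECT year, account, 1 AS month, month1 AS amount FROM account_src
UNION ALL
SELECT year, account, 2 AS month, month2 AS amount FROM account_src
UNION ALL
SELECT year, account, 3 AS month, month3 AS amount FROM account_src;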

    Transaction Control Transformation

This is an active transformation which allows you to control the transaction through a set of commit and rollback conditions.

If you want to control transactions at the mapping level, use the transaction control transformation. The control expression can be defined using the following predefined variables:

    TC_ROLLBACK_BEFORE

    TC_ROLLBACK_AFTER

    TC_COMMIT_BEFORE

    TC_COMMIT_AFTER

    TC_CONTINUE_TRANSACTION (Default)


A transaction can also be controlled at the session level by using the Commit Interval property.

    Sequence Generator Transformation

This is a passive transformation which allows you to generate sequence numbers to be used as primary keys.
-- A surrogate key is a system-generated sequence number used as a primary key to maintain history in dimension tables.

-- A surrogate key is also known as a dimensional key, artificial key or synthetic key.

-- A sequence generator transformation is created with two default output ports:

i. Nextval

ii. Currval

-- This transformation does not allow you to create new ports or edit the existing output ports.

This transformation is used in implementing Type 2 slowly changing dimensions, to maintain the history in the Type 2 SCD.

The following properties need to be set to generate the sequence numbers:

    1. Start Value

    2. Current Value

    3. Increment by

    Update Strategy Transformation

This is an active transformation which flags the source records for insert, update, delete and reject data-driven operations.

This transformation functions like the DML commands in terms of SQL.

There are two different ways to implement an update strategy:

i. Using an update strategy transformation at the mapping level.

ii. Using the target table options at the session level.

Conditional update strategy expressions can be developed using the following constants:

DD_INSERT 0

DD_UPDATE 1

DD_DELETE 2
DD_REJECT 3

-- DD stands for Data Driven.

Ex: IIF(SAL > 3000, DD_INSERT, DD_REJECT)

The above expression can be implemented using an update strategy transformation at the mapping level.


The default update strategy expression is DD_INSERT.

The update strategy transformation works on the target definition table.

The target table should contain a primary key.

Use the following target table options at the session level to implement an update strategy:

i. Insert: inserts the records into the target.
ii. Update as Update: updates the records in the target.

iii. Delete: deletes the records from the target.

iv. Update as Insert: for each update, a new record is inserted into the target.

v. Update else Insert: updates the record if it exists, else inserts a new record into the target.

    Use an update strategy transformation to update SCD.
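The Update else Insert option corresponds roughly to a SQL MERGE on the target (a sketch, using a hypothetical t_emp target keyed on empno):

MERGE INTO t_emp t
USING (SELECT empno, ename, sal FROM emp) s
ON    (t.empno = s.empno)
WHEN MATCHED THEN
  UPDATE SET t.ename = s.ename, t.sal = s.sal                    -- update if the key already exists
WHEN NOT MATCHED THEN
  INSERT (empno, ename, sal) VALUES (s.empno, s.ename, s.sal);   -- else insert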

    CACHE

    JOINER CACHE

    How it Works

There are two types of cache memory: the index cache and the data cache.

All rows from the master source are loaded into cache memory.

The index cache contains all port values from the master source where the port is specified in the join condition.

The data cache contains all master port values not specified in the join condition.

After the cache is loaded, the detail source is compared row by row to the values in the index cache.

Upon a match, the rows from the data cache are included in the stream.

Key Points

If there is not enough memory specified in the index and data cache properties, the overflow is written out to disk.

Performance considerations

The master source should be the source that takes up the least amount of space in the cache.

Another performance consideration is the sorting of data prior to the joiner transformation (sorted input).

Note: the index cache is saved with the extension .idx and the data cache is saved with the extension .dat.

The cache is stored on the server hard drive.


[Figure: joiner cache. The index cache holds the join-condition port values (e.g. Deptno: 10, 20, 30, 40); the data cache holds the remaining ports (e.g. Dname and Location: HR/HYBD, IT/NDLS, MKT/KA, SALE/CHE).]

    LOOKUP CACHE

    How it works

There are two types of cache memory: the index cache and the data cache.

All port values from the lookup table where the port is part of the lookup condition are loaded into the index cache.

The index cache contains all port values from the lookup table where the port is specified in the lookup condition.

The data cache contains all port values from the lookup table that are not in the lookup condition and are specified as output ports.

After the cache is loaded, values from the lookup input ports that are part of the lookup condition are compared to the index cache.

Upon a match, the rows from the cache are included in the stream.

    Types of Lookup Cache

When the mapping contains a lookup transformation, the Integration Service queries the lookup data and stores it in the lookup cache.

The following are the types of cache created by the Integration Service:

1. Static Lookup Cache
This is the default lookup cache created by the Integration Service. It is a read-only cache and cannot be updated.

2. Dynamic Lookup Cache

The cache can be updated during the session run; it is particularly used when you perform a lookup on the target table when implementing slowly changing dimensions.



If the lookup table is the target, the cache is changed dynamically as the target load rows are processed.

New rows to be inserted or updated in the target are also written to the cache.

    Business Purpose

In a data warehouse, dimension tables are frequently updated, and changes to new row data must be captured within the load cycle.

NewLookupRow

0: The Integration Service does not update or insert the row in the cache.

1: The Integration Service inserts the row into the cache.

2: The Integration Service updates the row in the cache.

    Key Points

1. The lookup transformation's Associated Port matches a lookup input port with the corresponding port in the lookup cache.

2. Ignore Null Inputs for Updates should be checked for ports where null data in the input stream might otherwise overwrite the corresponding field in the lookup cache.

3. Ignore in Comparison should be checked for any port that is not to be compared.

4. The NewLookupRow flag indicates the type of row manipulation of the cache. If an input row creates an insert in the lookup cache, the flag is set to 1. If an input row creates an update of the lookup cache, the flag is set to 2. If no change is detected, the flag is set to 0. A filter or router transformation can be used with an update strategy transformation to set the proper row tag to update a target table.

[Figure: dynamic lookup cache. The lookup transformation sends a lookup request to the dynamic lookup cache and receives a lookup response; rows are written both to the cache and to the target table.]


    Performance Consideration

A large lookup table may require more memory resources than are available; a SQL override in the lookup transformation can be used.

Persistent Lookup Cache
The cache can be reused across multiple session runs. It improves the performance of the session.

    AGGREGATE CACHE

    How it Works

When the session executes for the first time on the Integration Service, the Integration Service creates an aggregate cache, which is made up of the index cache and the data cache.

The Integration Service uses the aggregate cache to perform incremental aggregation.

This improves the performance of the session.

There are two types of cache memory: the index cache and the data cache.

All rows are loaded into the cache before any aggregation takes place.

The index cache contains the group-by port values.

The data cache contains all port values for variable and connected output ports:

-- non-group-by input ports used in non-aggregate output expressions,

-- non-group-by input/output ports,

-- local variable ports,

-- ports containing aggregate functions (multiplied by three).

One output row is produced for each unique occurrence of the group-by ports. When you perform incremental aggregation, the Integration Service reads each record from the source and checks the index cache for the existence of the group value.

If the group value exists, it performs the aggregate calculation incrementally using the historical cache.

If it does not find the group in the index cache, it creates the group and performs the aggregation.

    Performance Consideration

Sorted Input: aggregator performance can be increased when you sort the input in the same order as the aggregator group-by ports prior to doing the aggregation. The aggregator's Sorted Input property needs to be checked.

Relational source data can be sorted using an ORDER BY clause in the source qualifier override.

Flat file source data can be sorted using an external sort application or the sorter transformation. Cache size is also important in assuring optimal performance in the aggregator. Make sure that your cache size settings are large enough to accommodate all of the data. If they are not, the system will cache out to disk, causing a slowdown in performance.


Perform incremental aggregation using the aggregate cache. Group on numeric ports rather than character ports.

[Figure: aggregate cache. The index cache holds the group-by values (e.g. Deptno: 10, 20, 30, 40); the data cache holds the aggregated values (e.g. Sum(Sal): 8000, 12000, 6000, 99000).]

    SORTER CACHE

    How it Works

If the cache size specified in the properties exceeds the available amount of memory on the Integration Service machine, then the Integration Service fails the session.

All of the incoming data is passed into cache memory before the sort operation is performed. If the amount of incoming data is greater than the cache size specified, then PowerCenter will temporarily store the data in the sorter transformation's work directory.

    Key Points

The Integration Service requires disk space of at least twice the amount of incoming data when storing data in the work directory.

    Performance Consideration

Using a sorter transformation may improve performance over an ORDER BY clause in a SQL override in an aggregator session when the source is a database, because the source database may not be tuned with the buffer size needed for a database sort.

    Performance Consideration in Various Transformations

    Filter Transformation

Keep the filter transformation as close to the source qualifier as possible, to filter the data early in the data flow.

If possible, move the same condition to the source qualifier transformation.


  • 7/27/2019 DW & Informatica basics

    33/49

    Router Transformation

When splitting row data based on field values, a router transformation has a performance advantage over multiple filter transformations, because a row is read once into the input group but evaluated multiple times based on the number of groups, whereas using multiple filter transformations requires the same row data to be duplicated for each filter transformation.

    Update Strategy Transformation

The update strategy transformation's performance can vary depending on the number of updates and inserts. In some cases there may be a performance benefit in splitting a mapping with updates and inserts into two mappings and sessions: one mapping with the inserts and the other with the updates.

Expression Transformation
Use operators instead of functions.

Ex: instead of using the CONCAT function, use the || operator to concatenate two string fields.

Simplify complex expressions by defining variable ports.

Try to avoid the usage of aggregate functions.

    TASK and TYPES OF TASK

A task is defined as a set of instructions. There are two types of task:

i. Reusable task: a task which can be used in multiple workflows is known as a reusable task. A reusable task is created using the Task Developer tool. Ex: Session, Command, Email.

ii. Non-reusable task: a task which is created and defined at the time of creating the workflow is known as a non-reusable task. Ex: Session, Command, Email, Decision task, Control task, Timer task, Event Wait task, Event Raise task, Worklet.

Note: a non-reusable task can be converted into a reusable task.

    Types of Batch Processing

    There are two types of batch processing.

i. Parallel batch processing: in parallel batch processing all the sessions start executing at the same point of time; the sessions execute concurrently.

[Figure: workflow WKF. Sessions S-10, S-20 and S-30 branch from the Start task and run in parallel.]


ii. Sequential batch processing: the sessions execute one after another.

[Figure: workflow WKF. Sessions S-10 -> S-20 -> S-30 linked one after another.]

The above pictorial representation is defined as follows: if S-10 is finished (succeeded or failed), then S-20 starts, and so on.

    Link Condition

In sequential batch processing the sessions are executed sequentially and conditionally using link conditions. Define the link conditions using a predefined variable called PrevTaskStatus.

[Figure: workflow WKF. S-10 -> S-20 -> S-30, with link conditions $S-10: PrevTaskStatus = SUCCEEDED and $S-20: PrevTaskStatus = SUCCEEDED.]

The above pictorial representation is defined as follows: if S-10 succeeded, then S-20 will execute, and so on.

    WORKLET and TYPE OF WORKLET

A worklet is defined as a group of tasks. There are two types of worklet:
i. Reusable worklet: a worklet which can be used in multiple workflows is known as a reusable worklet.

A reusable worklet is created using the Worklet Designer tool in the Workflow Manager.

A worklet is executed using a start task, i.e. from within a workflow.

ii. Non-reusable worklet: a worklet which is created at the time of creating the workflow is known as a non-reusable worklet.

A non-reusable worklet can be converted into a reusable worklet.

COMMAND TASK
You can specify one or more shell commands to run during the workflow with a command task.

You specify the shell commands in the command task to delete a reject file, copy a file etc.

Use the command task in the following ways:

1. Stand-alone command task: use a command task anywhere in the workflow or worklet to run shell commands.



2. Pre-/post-session shell command: you can call the command task as the pre- or post-session shell command for a session task.

You can use any valid UNIX command for UNIX servers and any valid DOS command for Windows servers.

    Event Task

You can define events in the workflow to specify the sequence of task execution.

An event is triggered based on the completion of a sequence of tasks.

Use the following tasks to define events in the workflow:

i. Event Raise Task: the event raise task represents a user-defined event. When the Integration Service runs the event raise task, the event raise task triggers the event. Use the event raise task together with an event wait task to define the events.

ii. Event Wait Task: the event wait task waits for an event to occur. Once the event is triggered, the Integration Service continues executing the rest of the workflow. You may specify the following types of events for the event wait and event raise tasks:

a) Pre-defined event: a predefined event is the file-watch event. For predefined events, use the event wait task to instruct the Integration Service to wait for a specified indicator file to appear before continuing with the rest of the workflow. When the Integration Service locates the indicator file, it starts the next task in the workflow.

b) User-defined event: a user-defined event is a sequence of tasks in the workflow. Use an event raise task to specify the location of the user-defined event in the workflow.

    Decision Task

You can enter a condition that determines the execution of the workflow with the decision task, similar to a link condition. The decision task has a predefined variable called $decision_task_name.condition that represents the result of the decision condition.

The Integration Service evaluates the condition in the decision task and sets the predefined condition variable to True or False.

Use a decision task instead of multiple link conditions in the workflow.

[Figure: workflow WKF. Session S-10 followed by a stand-alone command task, e.g. Copy C:\test.txt D:\New Test.]


    Timer Task

You can specify a period of time to wait before the Integration Service runs the next task in the workflow with the timer task.

The timer task has two types of settings:
i. Absolute time: you specify the time at which the Integration Service starts running the next task in the workflow.

ii. Relative time: you instruct the Integration Service to wait for a specified period of time after the timer task.

Ex: a workflow contains two sessions. You want the Integration Service to wait 10 minutes after the first session completes before it runs the second session.

Use the timer task after the first session; in the relative time setting of the timer task, specify 10 minutes for the start time of the timer task.

    Assignment Task

You can assign a value to a user-defined workflow variable with the assignment task.

To use an assignment task in the workflow, first create and add an assignment task to the workflow, then configure the assignment task to assign a value or expression to the user-defined variable.

    Email Task

The email task is used to send an email within a workflow.

Note: emails can also be sent post-session in a session task.

-- It can be used within a link condition to notify success or failure of a prior task.

PMCMD Utility

pmcmd is a command line program utility which communicates with the Integration Service.

Using pmcmd, the following tasks can be performed:

    i. Start Workflow

    ii. Schedule Workflow

    iii. Get Service details

    iv. Ping Service

The following commands can be used with the pmcmd utility:

1. connect: connects the pmcmd program to the Integration Service.

2. disconnect: disconnects pmcmd from the Integration Service.

3. exit: disconnects pmcmd from the Integration Service and closes the pmcmd program.


4. pingservice: verifies whether the Integration Service is running.

5. help: returns the syntax for the command that you specify with help.

6. startworkflow: starts the workflow on the Integration Service.

7. scheduleworkflow: instructs the Integration Service to schedule a workflow.

Before working with these commands you have to set the environment variables for the command prompt.

    Set the Environment Variable

1. My Computer -> Right Click -> Properties -> Advanced -> Environment Variables

2. Click New (User variables for Administrator):

Variable: INFA_HOME
Value: C:\Program Files\Informatica\PowerCenter 8.6.0

3. From System Variables, select Path -> Edit -> append

C:\Program Files\Informatica\PowerCenter 8.6.0\Server\bin

Open the command prompt and type pmcmd.

i. Syntax for the connect command

connect -sv service_name -d domain_name -u username -p password

ii. Syntax for startworkflow

startworkflow -f folder_name workflow_name

** The rest of the commands and their syntax can be found in the Help menu of the Informatica client Designer window.

    PMREP Utility

pmrep is a command line program utility that provides communication with the Repository Service, to administer the repository and update the repository content.

The following commands can be used with the pmrep utility:

1. connect: connect -r repository_name -d domain_name -n username -x password

Ex: connect -r nipuna_rep -d domain_admin -n administrator -x administrator

2. backup: use this command to take a backup of the repository in .rep file format.

backup -o filename
backup -o C:\backup\batch7pm.rep

3. createfolder: creates a new folder in the repository.

createfolder -n folder_name

4. objectexport: exports an object to a .xml file.

objectexport -n object_name -o object_type -f folder_name -u xml_output_file

Ex: objectexport -n M40 -o mapping -f batch7pm -u test.xml


5. exit: exits pmrep from the command line.

    User Defined Function

It lets you create customized or user-specific functions to meet specific business tasks that are not possible with the built-in functions. User-defined functions can be private or public.

    Mapping Parameters

A mapping parameter represents a constant value that can be defined before the mapping runs.

A mapping parameter is created with a name, type, datatype, precision and scale.

A mapping parameter is defined in a parameter file, which is saved with the extension .prm.

A mapping can be reused for various business rules by parameterizing the mapping.

Mapping parameters are represented by $$.

Parameter file syntax:

[Folder_Name.WF:workflow_name.ST:session_name]
$$parameter=value

Example:

[Batch7pm.WF:wkf_mp.ST:S_mp]
$$deptno=30
$$tax=0.15

    Mapping Parameters are specific to the mapping and local to the mapping.
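For illustration, a mapping parameter such as $$deptno can then be referenced in a source qualifier source filter or SQL override (a sketch):

SELECT empno, ename, sal, deptno
FROM   emp
WHERE  deptno = $$deptno;   -- resolved from the parameter file at session run time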

    Mapping Variables

A mapping variable represents a value that can change during the mapping run.

A mapping variable is created with a name, type, datatype, precision, scale and aggregation.

Business Purpose

A mapping variable is defined to perform incremental extraction from the source.

Note: A mapping variable can also be used in the Source Qualifier transformation.

The variable value is stored in the repository. A mapping variable is typically created to perform value-based incremental extraction.
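A minimal sketch of value-based incremental extraction, assuming a Date/Time mapping variable $$LastExtractDate with Max aggregation and a source column LAST_UPDATED_DATE (names and format mask are illustrative):

Source Qualifier source filter:  LAST_UPDATED_DATE > TO_DATE('$$LastExtractDate', 'MM/DD/YYYY HH24:MI:SS')

Expression variable port:        SETMAXVARIABLE($$LastExtractDate, LAST_UPDATED_DATE)

When the session completes successfully, the Integration Service saves the maximum value back to the repository, so the next run extracts only rows updated after the previous run.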

    Session Parameter

A session parameter defines a connection to a database system or file system.

A session parameter is defined in a parameter file saved with the extension .prm.

A session parameter is represented by $.

Syntax

[folder_name.session_name]

$session_parameter=connection_name
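For example, a parameter file entry that supplies relational connections through session parameters (the folder, session, and connection names are placeholders; database connection parameters must start with $DBConnection):

[Batch7pm.s_m_load_emp]

$DBConnection_Source=Oracle_Scott

$DBConnection_Target=Oracle_DWH

In the session properties, the source and target connection values are then set to $DBConnection_Source and $DBConnection_Target instead of hard-coded connection names.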

Tracing Level

A tracing level determines the amount of information written to the session log.

    The following are the types of tracing levels.

    1. Normal

    2. Verbose

    3. Verbose Data

    4. Terse

    The default tracing level is Normal.

Tracing Level - Description

Normal - The Integration Service logs initialization and status information, errors encountered, and skipped rows due to transformation row errors. Summarizes session results, but not at the level of individual rows.

Terse - The Integration Service logs initialization information, error messages, and notification of rejected data.

Verbose Initialization - In addition to normal tracing, the Integration Service logs additional initialization details, names of index and data files used, and detailed transformation statistics.

Verbose Data - In addition to verbose initialization tracing, the Integration Service logs each row that passes into the mapping. It also notes where the Integration Service truncates string data to fit the precision of a column and provides detailed transformation statistics. It allows the Integration Service to write errors to both the session log and the error log when you enable row error logging. When you configure the tracing level to Verbose Data, the Integration Service writes row data for all rows in a block when it processes a transformation.

    Session Recovery

If you stop a session, or an error causes a session to stop, identify the reason for the failure and start the session again using one of the following methods.

1. Restart the session if the Integration Service has not issued at least one commit.

2. Perform session recovery if the Integration Service has issued at least one commit.

When you start the recovery session, the Integration Service reads the row ID of the last committed record from the OPB_SRVR_RECOVERY table.

The Integration Service reads all the source data and starts processing from the next row ID.

    DEBUGGER

    It is used to debug the mapping to check the business functionality.

    Metadata Extension

A metadata extension provides information about the developer who has created an object.

    Metadata extension includes the following information.

    1. Developer Name

2. Object Creation Date

3. Email ID

    4. Desk Phone etc

    Difference between Normal and Bulk Loading

Normal Loading: The Integration Service makes an entry for each data record in the database log before loading it into the target. The Integration Service therefore consumes more time to load the data into the target.

Bulk Loading: The Integration Service bypasses the database log and writes the data records directly into the target.

It improves the performance of data loading.

Note: In bulk loading you cannot perform session recovery.

    UNIT TESTING

Unit testing for the data warehouse is white-box testing. It should check the ETL procedures, mappings, and front-end developed reports.

Execute the following test cases.

    1. Data Availability

    Test Procedure


Connect to the source database with a valid username and password. Run an SQL query on the database to verify that the data is available in the table from which it needs to be extracted.

    Expected Behavior

    -- The login to the database should be successful.

    -- The table should contain relevant data.

    Actual Behavior

    -- As expected

    Test Result

    -- Pass or Fail
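For instance, a quick availability check against a hypothetical source table EMP might be (Oracle syntax assumed):

SELECT COUNT(*) FROM EMP;

SELECT * FROM EMP WHERE ROWNUM <= 10;

A non-zero count and sensible sample rows confirm that the data to be extracted is present.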

    2. Data Load/Insert

    Ensure that records are being inserted in the target.

    Test Procedure

i. Make sure that the target table does not have any records.

ii. Run the mapping and check that records are being inserted in the target table.

    Expected Behavior

    The target table should contain inserted record.

    Actual Behavior

    -- As expected

    Test Result

    -- Pass

    3. Data Load/Update

    Ensure that update is properly happening in the target.

    Test Procedure

    i. Make sure that some records are there in the target already.

ii. Update the value of some field in a source table record which has already been loaded into the target.

    iii. Run the mapping

    Expected Behavior

    The target table should contain updated record.

    Actual Behavior

    -- As expected

    Test Result

    -- Pass

    4. Incremental Data Load

Ensure that the data from the source is properly populated into the target incrementally and without any data loss.


    Test Procedure

i. Add a new record with new values in addition to the already existing records in the source.

    ii. Run the mapping

    Expected Behavior

Only the new record should be added to the target table.

Actual Behavior

    -- As expected

    Test Result

    -- Pass

    5. Data Accuracy

    The data from the source should be populated into the target accurately.

    Test Procedure

i. Add a new record with new values in addition to the already existing records in the source.

    ii. Run the mapping

    Expected Behavior

The column values in the target should be the same as the source data values.

    Actual Behavior

    -- As expected

    Test Result

    -- Pass

6. Verify Data Loss

Check the number of records in the source and target.

Test Procedure

i. Run the mapping and check the number of records inserted in the target and the number of records rejected.

Expected Behavior

The number of records in the source table should be equal to the number of records in the target table + rejected records.

Actual Behavior

-- As expected

Test Result

    -- Pass
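A sketch of the reconciliation, assuming a source table EMP, a target table EMP_DWH, and the rejected row count taken from the session log or .bad file:

SELECT COUNT(*) FROM EMP;        -- source count

SELECT COUNT(*) FROM EMP_DWH;    -- target count

Expected: source count = target count + rejected rows.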

    7. Verify Column Mapping

Verify that source columns are properly linked to the target columns.

    Test Procedure


i. Perform a manual check to confirm that source columns are properly linked to the target columns.

    Expected Behavior

    The data from the source columns should be placed in target table accurately.

Actual Behavior

-- As expected

    Test Result

    -- Pass

    8. Verify Naming Standard

Ensure that objects are created with industry-specific naming standards.

    Test Procedure

    i. A manual check can be performed to verify the naming standard.

    Expected Behavior

    Objects should be given appropriate naming standards.

    Actual Behavior

    -- As expected

    Test Result

    -- Pass

    9. SCD Type2 Mapping

Ensure that surrogate keys are properly generated for a dimensional change.

    Test Procedure

i. Insert a new record with new values in addition to the already existing records in the source.

ii. Change the value of some field in a source table record which has already been loaded into the target, and run the mapping.

iii. Verify the target for appropriate surrogate keys.

    Expected Behavior

    The target table should contain appropriate surrogate key for insert and update.

    Actual Behavior

    -- As expected

    Test Result

    -- Pass
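For example, assuming a target dimension DIM_CUSTOMER with surrogate key CUSTOMER_KEY and natural key CUSTOMER_ID (names are illustrative), the verification might be:

SELECT CUSTOMER_ID, COUNT(*) AS versions, MAX(CUSTOMER_KEY) AS latest_key
FROM DIM_CUSTOMER
GROUP BY CUSTOMER_ID
HAVING COUNT(*) > 1;

The changed record should appear as a new version with a new, higher surrogate key (and, if used, updated effective/expiry dates or version flags).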

    SYSTEM TESTING

System testing is also called data validation testing.

System and acceptance testing are usually separate phases. It might be more beneficial to combine the two phases in case of a tight timeline and budget constraints.


A simple technique is to count the number of records in the source table, which should tie up with the number of records in the target table + the number of records rejected.

Test the rejects for business logic.

The ETL system is tested with full functionality and is expected to function as in production.

In many cases the dimension tables exist as masters in the OLTP system and can be checked directly.

    Performance Testing and Optimization

The first step in performance tuning is to identify the performance bottleneck in the following order.

    1. Target

    2. Source

    3. Mapping

    4. Session

    5. System

The most common performance bottleneck occurs when the Integration Service writes the data to a target.

    1. Identifying Target Bottleneck

Test Procedure: A target bottleneck can be identified by configuring the session to write to a flat file target. If performance increases significantly when writing to the flat file, you have a target bottleneck.

    Optimization:

i. Use bulk loading instead of normal load.

    ii. Increase Commit Interval

iii. Drop the indexes of the target table before loading.

    2. Identifying Source Bottleneck

Test Procedure: A source bottleneck can be identified by removing all the transformations in a test mapping; if the performance is similar to the original, there is a source bottleneck.

Alternative Test Procedure: Add a Filter transformation after the Source Qualifier and set its condition to FALSE so that no data is processed past the filter transformation. If the time it takes to run the new session is the same as the original session, there is a source bottleneck.

    Optimization:

    i. Create Index

ii. Optimize the query using optimizer hints and WHERE clause conditions.
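For example, a Source Qualifier SQL override on a hypothetical Oracle source could push the filter to the database and supply an index hint (the table, column, and index names are illustrative):

SELECT /*+ INDEX(EMP EMP_HIREDATE_IDX) */ EMPNO, ENAME, SAL, DEPTNO
FROM EMP
WHERE HIREDATE >= TO_DATE('01-01-2010', 'DD-MM-YYYY')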

    3. Identifying Mapping Bottleneck

Test Procedure: Add a Filter transformation before each target definition and set the condition to FALSE so that no records are loaded into the target.

If the time it takes to run the new session is the same as the original session, then there is a mapping bottleneck.


    Optimization:

    i. Joiner Transformation

    1. Use Sorted Input

2. Designate the source that occupies the least amount of memory in the cache as the master source.

ii. Aggregator Transformation

    1. Use Sorted Input

    2. Incremental aggregation with aggregate cache.

    3. Group by simpler ports, preferably Numeric Ports.

iii. Lookup Transformation

1. Define an SQL override on the lookup table.

2. Use a persistent lookup cache.

iv. Expression Transformation

1. Use operators instead of functions.

2. Avoid unnecessary aggregate function calls.

3. Simplify the expression by creating variable ports (see the example after this list).

    v. Filter Transformation

1. Keep the Filter transformation as close to the Source Qualifier as possible to filter the data early in the data flow.
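A small sketch of the Expression transformation tips above (port names are illustrative): compute a reused value once in a variable port and prefer operators over nested function calls.

v_FULL_NAME (variable port) = FIRST_NAME || ' ' || LAST_NAME

o_FULL_NAME (output port)   = v_FULL_NAME

o_NAME_UPPER (output port)  = UPPER(v_FULL_NAME)

The || operator replaces nested CONCAT() calls, and the variable port avoids evaluating the same expression in every output port.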

    4. Identifying Session Bottleneck

Test Procedure: Use the Collect Performance Details option to identify a session bottleneck. Low (0-20%) Buffer Input Efficiency and Buffer Output Efficiency counter values indicate a session bottleneck.

    Optimization:

    Tune the following parameters in the session.

    1. DTM buffer size 6M to 128M

    2. Buffer block size 4K to 128K

    3. Data cache size 2M to 24M

    4. Index cache size 1M to 12M

Test Procedure: Double-click the session → Properties tab → select Collect Performance Data → click Apply → OK.

    Execute the Session.

The Integration Service creates a performance details file saved with the extension .perf.

The .perf file is located in the session log directory.

    5. Identifying System Bottleneck


If there is no target, source, mapping or session bottleneck, then there may be a system bottleneck. Use system tools to monitor CPU usage and memory usage.

On the Windows operating system use Task Manager; on UNIX operating systems use system tools such as iostat and sar.

    Optimization:

    Improve Network Speed

    Improve CPU Usage

    SQL Transformation

The SQL Transformation processes SQL queries in the pipeline. You can insert, delete, update and retrieve rows from the database. You can pass the database connection information to the SQL Transformation as input data at run time.

You can configure the SQL Transformation to run in the following modes.

1. Script Mode: An SQL Transformation running in script mode runs SQL scripts from text files.

You pass each script file name from the source to the SQL Transformation using the ScriptName port.

The script file name contains the complete path to the script file.

An SQL Transformation configured for script mode has the following default ports.

i. ScriptName - Input port

ii. ScriptResult - Output port (returns PASSED if the script execution succeeded, otherwise returns FAILED)

iii. ScriptError - Output port (returns error messages)

2. Query Mode: When an SQL Transformation runs in query mode, it executes an SQL query that you define in the transformation.

When you configure the SQL Transformation to run in query mode, you create an active transformation. The transformation can return multiple output rows for each input row.
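A minimal query-mode sketch (assuming an input port empno and a connection already configured for the transformation; the table and columns are illustrative). Ports referenced in the query are bound with the ?port? notation:

SELECT ENAME, SAL FROM EMP WHERE EMPNO = ?empno?

The selected columns are returned through output ports with matching names (ENAME, SAL), and any database error is available on the SQLError port.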

    Unconnected Stored Procedure

An unconnected Stored Procedure transformation is not a part of the data flow. It can be called from another transformation using the :SP identifier.

An unconnected stored procedure acts as a function that can be called from other transformations, such as an Expression transformation.

An unconnected stored procedure can receive multiple inputs but provides a single output.
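For example, an Expression transformation output port could call a hypothetical stored procedure GET_TAX and capture its return value through the PROC_RESULT variable:

o_TAX = :SP.GET_TAX(SAL, PROC_RESULT)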

    Difference between Connected and Unconnected Lookup Transformation

1. Connected: Part of the mapping data flow. Unconnected: Separate from the mapping data flow.

2. Connected: Returns multiple values (by linking output ports to another transformation). Unconnected: Returns one value, by checking the Return Port option for the output port that provides the return value.

3. Connected: Executes for every record passing through the transformation. Unconnected: Executes only when the lookup function is called.

4. Connected: More visible; shows where the lookup values are used. Unconnected: Less visible, as the lookup is called from an expression within another transformation.

5. Connected: Default values are used. Unconnected: Default values are ignored.

    Joins Versus Lookup

    Source Qualifier Join

    Advantage

    Can join any number of tables

Full functionality of standard SQL available; may reduce the volume of data on the network

    Disadvantage

    Can only join homogeneous relational tables

    Can affect performance on the source database.

    Joiner

    Advantage

    Can join Heterogeneous source

    Can join non-relational source

    Can join partially transformed data

    Disadvantage

Can only join two input data streams per Joiner

    Only supports equijoin

    Does not support OR condition

    Lookup

    Advantage

    Can reuse cache across session run

Can reuse cache within a mapping

    Can modify cache dynamically

    Can chose to cache or not to cache

    Can query relational table or flat file

Inequality comparisons are allowed


    SQL Override supported

    Can be unconnected and invoked as needed

    Disadvantage

Cannot output multiple matches

Unconnected lookup can only have one return value

Does not support OR condition

    Unconnected Lookup Transformation

An unconnected Lookup transformation is not a part of the data flow; it acts as a lookup that can be called from another transformation using the :LKP identifier.

It can improve the efficiency of the mapping.
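For example (a hypothetical unconnected lookup lkp_DEPT with DEPTNO as the lookup input port and DNAME flagged as the return port), an Expression port could call it conditionally:

o_DNAME = IIF(ISNULL(DEPTNO), 'UNKNOWN', :LKP.lkp_DEPT(DEPTNO))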

    High Level Design

    The following activities need to be identified.

    1. Identify the source system

    2. Identify the RDBMS

    3. Identify the hardware requirements

    4. Identify the ETL & OLAP software requirement.

    5. Identify the operating system requirements.


    ETL Development Life Cycle

    ETL Project Plan

    Business Requirements

    High Level Design

    Low Level Design

    ETL Development

    ETL Unit Testing

    System Testing

    Performance Testing

    ETL User Acceptance Testing

    Deployment

    Warranty, Stabilization Period

    Maintenance