Starschema Antares iDL™


A fully automated, compliant-by-design intelligence data lake architecture with real-time ingestion and best-of-breed standardization and audit features

TAMÁS FÖLDI

CTO/Partner, USA

DANIEL ISTVAN SZABÓ

Enterprise Data Lake Engineering team lead

CHRIS VON CSEFALVAY

Principal Data Scientist

WP-SDL-050619/P

Starschema, Inc. · 2221 South Clark St., Arlington, VA 22202 · +1 415 944-5455 · Starschema.com

Introduction

Businesses that attempt to embark on the journey towards a data-driven business model, especially if they are operating at the enterprise scale, often encounter a vexing problem: they have a staggering range of different systems, each responsible for parts of the business, holding data at different speeds, in various formats and with different costs of storage. This is an unsurprising result of functional areas of the enterprise each selecting solutions that are peak performers in their respective category. A system that aggregates live point-of-sale data for a retail business has to fulfill different expectations than the HR system in the same business, not only when it comes to the type and format of the data, but also its Vs: velocity, volume, variety and validity.

A result of this has been the gradual shift towards isolated data silos. Across the typical enterprise, data resides in hundreds, sometimes thousands, of separate systems representing a combination of open-source, proprietary and tailored solutions, each with its own data model, speed, availability, cost and other characteristics. How can this diversity, and the peak performance it delivers, be retained, while also allowing processes that require data from the various siloed systems to be joined together, such as data visualization, data science and ML/AI applications, to power the enterprise of the future?

Figure 1. The Starschema Antares iDL™ architectural concept combines CDC and batch ingestion (using HVR) with stream ingestion (using Apache Kafka). A multi-speed data lake with a strict standard model and a speed layer for specific high-velocity applications exposes this data to consumers through a mirror architecture driven by an array of speed-differentiated systems, ranging from inexpensive low-level storage (Hadoop, Spark or S3) through Warm Layer storage for intermediate applications to specialized Speed Layer applications that provide super-fast results over a limited, time-critical subset of the data. For the speed-differentiated hierarchy, please refer to Figure 2.

Starschema’s data lake design is driven by a fully standardized internal model and offers the flexibility of accommodating an enormous range of enterprise systems as data sources — we know of no other data lake that meets this important enterprise criterion. Designed for the always-on

[Figure 1 diagram: Source systems → Ingestion (CDC and batch; Streaming) → Data Lake (Mirror, Standard model, Speed layer) → Consumers (Data science, Data visualization, ML/AI)]


global enterprise and implemented in some of the world’s largest companies, Starschema Antares iDL™ has been successfully serving industries as diverse as energy, oil & gas, finance, power generation, manufacturing, renewables, aerospace/defense and healthcare.

In line with efforts to democratize data across the enterprise, Starschema Antares iDL™ is designed for the data-driven transformation of the entire business. Rather than merely serving a segment of the enterprise, such as its IT community, it makes the entire data wealth of the business available through a wide range of interfaces to all functional areas of the enterprise. Managers can access dashboards drawing on the same source, and with the same breadth of capture, as internal systems or data scientists. Designed with the future of data in mind, suitability for data science, machine learning and artificial intelligence applications has been one of the crucial drivers of Starschema Antares iDL™’s architectural design: it is the data lake of the future, available today. We invite you to explore what Starschema Antares iDL™ can do for you over the following pages.


Real-time ingestion

Over the last few years, data lakes have become central hubs of information exchange for the data-driven enterprise, and use cases have grown beyond the typical reactive reporting paradigm to powering apps, external customer tools and unstructured data processing, and to serving as a single point of access to data for power users like data scientists. With this, the requirements relating to ingestion have shifted dramatically away from the old extraction-based approach, which has proven to be a bottleneck for several reasons:

• Relying on full extracts has overloaded source systems, impairing their performance.

• Creating incremental loads on different ERPs and transaction systems requires business rules to be specified in advance, which consumes expensive analyst resources and takes time. In addition, incremental extracts still place load on source systems, adversely affecting their performance.

• New digitalization opportunities like personalization, fraud detection or business anomaly detection require shorter time-to-insight and action cycles. This demands near real-time ingestion.

Figure 2. The Starschema Antares iDL™ heat map. The three data lake layers provide differentiated access speed and data velocity for different purposes. This allows usage- and need-dependent resource utilization while delivering the required speed of access to all customers, creating a uniquely powerful tool for the democratization of data access within the enterprise.

[Figure 2 diagram: data lake layers arranged by speed. Hot: super-speed specialized data access (document store, mobile app, geospatial, real time, search) for apps and external customers. Warm: rapid data access for internal tools and workflows (reporting, data discovery, data science, archiving, ETL offload, unstructured data processing) for internal customers. Cold: low-level access for power users and developers (data science, prototyping and development).]


To address these issues, Starschema Antares iDL™ relies entirely on change data capture (CDC) technologies like HVR. The advantages of using CDC solutions in data lakes are manifold:

• Time to insight and action is dramatically reduced: upstream data pipelines can run continuously without waiting for ingestion steps to complete, delivering faster actionability and earlier availability of new analytical insights.

• Legacy and non-legacy data sources are handled with the same capture-based ingestion: HVR supports a variety of sources, including Oracle RDBMS, IBM DB2, MSSQL, MySQL, Postgres, Salesforce, Apache Kafka, SAP ERP, SAP HANA and web services. There is no need to define “last updated” fields in source systems: capture is based on the transaction journal, eliminating software logic and human errors. This provides better data quality and a way to compare bronze/mirror layers with source systems while taking “in-flight” records into consideration.

• Support for multi-target ingestion: the same source feed can be ingested in parallel to cold, warm and/or hot targets. The same transactions can be ingested to Spark, Snowflake and MemSQL simultaneously and processed differently for different use cases (see the sketch after this list). The same solution can also feed dual-ETL setups or a production-identical QA environment.

• Fault tolerance: HVR is a high-availability clustered solution with a spoke-hub architecture.

• Metadata-driven approach: CDC tools can create and maintain real-time replicas of entire databases.
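To make the multi-target pattern concrete, here is a minimal sketch of a CDC fan-out consumer, assuming change events arrive on a Kafka topic as JSON and using the kafka-python client; the topic name, broker address and the three writer stubs are illustrative assumptions, not part of the product:

```python
# Minimal sketch: consume CDC change events from Kafka and fan each
# transaction out to cold, warm and hot targets.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "erp.cdc.transactions",              # hypothetical CDC topic
    bootstrap_servers=["kafka:9092"],    # hypothetical broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

def write_cold(event):   # e.g. append raw files to S3/HDFS
    ...

def write_warm(event):   # e.g. merge into an MPP warehouse table
    ...

def write_hot(event):    # e.g. upsert into an in-memory store
    ...

# The same feed reaches every target; each target applies its own,
# use-case-specific processing downstream.
for message in consumer:
    event = message.value
    write_cold(event)
    write_warm(event)
    write_hot(event)
```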

To learn more about our CDC solutions in data lakes, please refer to our HVR Customer Story.


Persistence and computation

The purpose of the Data Lake is to eliminate data silos and provide a complete, comprehensive view of all corporate data in one simple location. However, data lakes have to support a variety of use cases, from real-time, highly responsive mobile applications to complex AI/ML use cases at petabyte scale. Cost is also an important factor: some outcomes might only have a positive return on investment (ROI) at a lower infrastructure cost. These divergences mean that not all use cases can be efficiently served from one single database engine; only a heterogeneous solution resting on a tightly integrated combination of multiple storage and computation engines can deliver the desired results. Starschema Antares iDL™ divides these engines into the following categories:

• Cold/Batch storage: provides low-cost, commodity-server-based, cloud or on-prem, batch-optimized storage for any type of data (structured and unstructured). This layer serves primarily as the basis of prototyping, ML/AI batch use cases, ETL offloading and archiving. Typical technologies include Cloudera Hadoop, AWS S3/EMR, Databricks Spark, and Azure BlobStore/HDInsight.

• Warm storage: serves as the central hub for structured, normalized datasets, providing scalable, ACID-compliant, transaction-safe, SQL-based access with massively parallel processing (MPP) capabilities. Typical use cases include standard and operational reporting (financial closing-related cross-ERP operational reports, compliance reports, etc.) and subject-organized data marts for self-service analytics (Tableau or PowerBI reporting). Typical technologies include Snowflake and Greenplum, which provide scale-out capabilities with strong consistency.

• Hot storage: provides consumption-specific, highly optimized storage and computation. Typical use cases are high-speed, low-latency OLAP, free-text search and spatial analytics. Technologies include Druid, Sparkline, MemSQL, ElasticSearch, Vertica or Kinetica.

Maintaining an outcome-based data duplication strategy, i.e. determining what goes to which platform, is essential. For all non-duplicated data sets, the system provides interoperability: different database engines can access data held in other database engines, albeit at a slightly slower pace, without the need for data duplication, delivering one consistent data lake view while reducing storage costs for infrequently accessed data.
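As an illustration of such a strategy, the sketch below routes datasets to storage tiers based on observed demand and latency requirements; the Dataset fields and thresholds are invented for the example and are not product configuration:

```python
# Sketch of an outcome-based duplication strategy: decide which storage
# tiers a dataset should be materialized in. Everything stays in cold
# storage; duplication to warm/hot tiers is driven by demand and SLAs.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    queries_per_day: int   # observed consumption demand
    latency_sla_ms: int    # required response time
    size_gb: float

def target_tiers(ds: Dataset) -> list[str]:
    tiers = ["cold"]               # cheap batch storage is the baseline
    if ds.queries_per_day > 100:
        tiers.append("warm")       # frequently consumed: duplicate to MPP SQL
    if ds.latency_sla_ms < 100:
        tiers.append("hot")        # time-critical: duplicate to a speed layer
    return tiers

print(target_tiers(Dataset("invoices", 2500, 50, 120.0)))
# -> ['cold', 'warm', 'hot']
```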


Framework

Built specifically for high-complexity, enterprise-grade requirements, Starschema Antares iDL™’s Generic ETL Framework (GEF) is a distributed, platform-agnostic data pipeline management tool. It provides services for ingesting hundreds of sources, pipelined into one up-to-date, easy-to-adjust (SOX, ITIL change management), coherent data model, and covers the entire data pipeline management lifecycle. The framework manages data across Hadoop- and non-Hadoop-based databases and takes care of batch processing jobs, stream replications and denormalizations, driven entirely by metadata.

Ingestion with the Generic ETL Framework

GEF ingests raw Copper/Bronze/Silver source-system data in real time via the Automated Superset-View technique, which identifies similarities between tables and facilitates the completely automated generation of superset views. All superset views are generated automatically, based on the common fields across source systems of the same type. Consumers can execute queries across source systems with different column sets without having to build these consolidated views manually. Generating thousands of consolidated superset views as needed, GEF can accommodate even the most highly customized ERPs, such as Oracle ERP or SAP ECC.
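The core of the superset-view idea can be sketched in a few lines: given several source tables of the same type with differing column sets, emit one consolidated view that pads missing columns with NULLs. The table and column names below are invented, and the real metadata-driven GEF implementation is considerably more involved:

```python
# Sketch: generate a superset view over tables of the same type that
# have different column sets, padding missing columns with NULL.
def superset_view(view_name: str, tables: dict[str, list[str]]) -> str:
    all_columns = sorted({c for cols in tables.values() for c in cols})
    selects = []
    for table, cols in tables.items():
        exprs = [c if c in cols else f"NULL AS {c}" for c in all_columns]
        selects.append(f"SELECT {', '.join(exprs)} FROM {table}")
    return f"CREATE VIEW {view_name} AS\n" + "\nUNION ALL\n".join(selects)

print(superset_view("all_invoices", {
    "erp_a.invoices": ["invoice_id", "amount", "currency"],
    "erp_b.invoices": ["invoice_id", "amount", "tax_code"],
}))
```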

Providing database integrity via metadata-driven constraints (even in MPP and Hadoop systems that lack primary key and foreign key constraints), Starschema Antares iDL™’s GEF is designed for data management and handling from the ground up. The highly optimized logic implemented by the GEF prevents processes from inserting any data that violates complex data requirements (duplicates, missing data, load time anomalies).

Designed to accommodate interaction with other databases and systems, Starschema Antares iDL™’s approach has proven to be both faster and more flexible in a massively parallel environment than using legacy database constraints and relying on proprietary algorithms.

Data integrity assurance

GEF also provides configurable data integrity business rules for consistent incremental loads using coherent ID collection. This feature is critical to ensuring the stability and coherence of streaming/lambda architectures in which the source data stream is ingested continuously. The GEF framework ensures a consistent view of source data, especially in streaming scenarios. By always managing business entities rather than individual tuples, the framework ensures that every child record of a parent entity is processed together with it, blocking processing until all dependent tuples have arrived in the data lake. For instance, in the case of an invoice, the system does not process the invoice entity further until all related records (accounting distributions, payment checks etc.) are fully loaded into


the related tables. This ensures database consistency downstream even in the case of incremental changes. Real-time performance is preserved even where some columns are denormalized or calculated.
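The gating logic can be illustrated with a minimal sketch; the record layout and the completeness rule (an expected child count carried on the parent record) are simplifying assumptions made for the example:

```python
# Sketch: hold an invoice back until all of its dependent child records
# have arrived, then release the whole business entity downstream.
from collections import defaultdict

pending = {}                  # invoice_id -> parent record
children = defaultdict(list)  # invoice_id -> child records seen so far

def process_entity(parent, kids):
    print(f"releasing {parent['invoice_id']} with {len(kids)} children")

def on_record(record: dict):
    key = record["invoice_id"]
    if record["type"] == "invoice":
        pending[key] = record
    else:                     # accounting distribution, payment check, ...
        children[key].append(record)
    parent = pending.get(key)
    # Block until every dependent tuple has arrived, then release.
    if parent and len(children[key]) == parent["expected_children"]:
        process_entity(parent, children.pop(key))
        del pending[key]
```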

Using GEF pipelines that map the data architecture to individual data lake deployment patches, Starschema Antares iDL™ maintains a standard, automated, metadata-driven software development environment in which all sources can be ingested and monitored in a fully parallel, reusable and human-error-free manner. The framework can generate ETL jobs on the fly based on metadata; the generated jobs are cached until a change in the metadata is detected. At a Fortune 500 client with 6,000 ETL jobs generated on the fly, this code caching saved 20 minutes every day.
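The caching behavior described above might look like the following minimal sketch, in which a job is regenerated only when a hash of its defining metadata changes; generate_job() and the metadata layout are illustrative assumptions:

```python
# Sketch: metadata-driven ETL job generation with code caching.
import hashlib
import json

_job_cache: dict[str, str] = {}   # metadata hash -> generated job code

def generate_job(metadata: dict) -> str:
    # A real framework would emit a complete ETL job; here we render
    # a trivial SQL statement from the metadata for illustration.
    return f"INSERT INTO {metadata['target']} SELECT * FROM {metadata['source']}"

def get_job(metadata: dict) -> str:
    key = hashlib.sha256(
        json.dumps(metadata, sort_keys=True).encode("utf-8")
    ).hexdigest()
    if key not in _job_cache:     # regenerate only when metadata changed
        _job_cache[key] = generate_job(metadata)
    return _job_cache[key]

print(get_job({"source": "mirror.invoices", "target": "standard.invoices"}))
```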

Subsequent data layers are handled in a standardized way, with code and data models normalized to the underlying engines. We use a single-layer maintenance technique to create and maintain all subsequent and technical staging layers, with performance tuned to demand statistics gathered from source-system metrics and consumption requests.

The Starschema Antares iDL™ Threading Engine

Knowledge of newly detected row-level data relationships between source systems is stored and managed, and denormalization is performed automatically, by the built-in Starschema Antares iDL™ Threading Engine. Where a source system generates data and interfaces with another source system, both passing their data to Starschema Antares iDL™, the interface data connections are captured, discovered and stored in metadata. Every time a new row is loaded into the Data Lake, the relevant stored metadata in the knowledge base is examined, and a column is added to point to the exact row in the other source system’s table. This way, users can track a single flow of transactions across systems. For instance, if a requisition is initiated, approved and converted into a purchase order in one system, but delivery is tracked in another and payment in a third, the Threading Engine stores primary key pointers for each of the hops, identifying the next step at source-system level.
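The following sketch shows the shape of the idea: interface metadata describes how systems hand off to one another, and each newly loaded row gets a pointer column resolved against the next system in the flow. The metadata layout, matching rule and lookup callback are illustrative assumptions:

```python
# Sketch: thread rows across source systems using interface metadata.
INTERFACE_METADATA = {
    # (source_system, table) -> (next_system, next_table, join_key)
    ("req_sys", "requisitions"): ("po_sys", "purchase_orders", "req_number"),
    ("po_sys", "purchase_orders"): ("ap_sys", "payments", "po_number"),
}

def thread_row(system: str, table: str, row: dict, lookup) -> dict:
    hop = INTERFACE_METADATA.get((system, table))
    if hop:
        next_system, next_table, key = hop
        # lookup() resolves the primary key of the related row, if any;
        # the pointer column lets users follow the transaction flow.
        row["thread_ptr"] = lookup(next_system, next_table, key, row.get(key))
    return row
```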

In addition, Starschema Antares iDL™’s Smart Data Transfer capability avoids transferring unnecessary waves of changed data when the hash of the entire column shows no change. Starschema Antares iDL™ also provides a built-in historization solution for OLTP data where needed (pure Type 6).
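The change-suppression check might be sketched as follows; the hash choice and in-memory state handling are assumptions made for the example:

```python
# Sketch: hash a column's content and suppress the downstream transfer
# when the hash is unchanged since the last load.
import hashlib

_last_hashes: dict[str, str] = {}

def column_changed(table: str, column: str, values: list) -> bool:
    digest = hashlib.sha256(
        "\x1f".join(map(str, values)).encode("utf-8")
    ).hexdigest()
    key = f"{table}.{column}"
    if _last_hashes.get(key) == digest:
        return False          # no change: skip this wave of data
    _last_hashes[key] = digest
    return True
```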

Starschema Antares iDL™ provides a solution for ingesting hundreds of divergent sources into a single, up-to-date, easily manageable, SOX- and ITIL-compliant data model using the Generic ETL Framework (GEF, see above) and the Data Lake Audit Framework (DAF). Together, GEF and DAF create a Gold Standard Data Model that facilitates projection and consumption using the Standard Data Feed and Standard Data Queue features. These are powered by a multi-tenant


subscription solution for HDFS or in-memory replication. Streaming and batch data alike are served dynamically through these channels to multiple platforms, from MPP and Big Data (Spark) to in-memory databases and many more.


Metadata management and data quality

For thousands of ingestions a day, we provide a predefined Metadata Model consumed by the Starschema Antares iDL™ Generic ETL Framework (GEF). This Metadata Model is essential for data consistency, performance, data cleansing, audit and model expansion.

Metadata management

The risk of metadata mismanagement is significantly reduced by Starschema Antares iDL™’s approach of storing metadata together with its history, held in an easy-to-configure way using our predefined processes and patch lifecycle management toolset:

• The Starschema Antares iDL™ Workbench is designed to make GEF metadata management easy from design through development to deployment. A standalone application, it takes care of code delta packaging, cross-validation, code freeze, code review, change management and release auto-deployment. Tightly integrated with your team’s collaboration and ITSM tools, it supports a range of SCM systems (Git and others) as well as a number of other management/ITSM tools like ServiceNow, Aha!, Rally, JIRA and others.

• The Starschema Antares iDL™ DevOps Portal presents change management and status information in an easily digestible format over a web application accessible anywhere. This provides a uniquely powerful tool to validate functionalities and determine whether a change process was run successfully.

In practice, these tools manage a range of major maintenance tasks automatically, saving time and labor costs while reducing the risk of errors. This includes

• automated patch creation, with support for validation, migration versioning and concurrency handling,

• patch list creation/approval via a dashboard, integrated with the ServiceNow API, and

• automated rollback, allowing the operations engineer to roll the entire data lake back to its previous state at the click of a single button.

These tools have supported enterprise data lake development teams with nearly 400 developers, attesting to the stability and reliability of Starschema Antares iDL™’s development tools in even the most demanding conditions for enterprise data lakes.

Data quality and auditing

To maintain data quality when moving data from one layer to another, the Data Lake Audit Framework (DAF) ensures consistent quality across layers. A dedicated engine with dedicated metadata


structures ensures that all data loads are audited and that any inconsistencies found are corrected or pinpointed in connected alerting systems. Whenever the data lake or any of its segments loses integrity, users are immediately notified that data consistency has been lost and that the data is not reliable until integrity is restored.

The DAF primarily focuses on safeguarding data integrity by tracking

• the number of unique tuples,

• aggregated business-defined sum values compared between source and target layers, and

• possible data incompleteness due to load time join failure.

The DAF engine is a three-level job structure: one job registers the new data entities to be audited, the second controls priorities and schedules execution, and the third executes audit actions sequentially. Starschema Antares iDL™ DAF includes audit-rule auto-creation features using advanced machine learning based on predefined metadata and actual data issues. Alerts are automatically transmitted to ticketing systems like ServiceNow and can be monitored using dashboard solutions like Tableau or PowerBI.
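An individual audit action of this kind might be sketched as follows, assuming DB-API-style connections whose execute() returns a cursor; the queries, table layout and alerting hook are illustrative assumptions:

```python
# Sketch of a DAF-style audit rule: compare the unique-tuple count and
# a business-defined sum between source and target, alerting on mismatch.
def audit_table(source_conn, target_conn, table: str, sum_column: str):
    checks = {
        "unique_tuples": f"SELECT COUNT(DISTINCT id) FROM {table}",
        "business_sum": f"SELECT COALESCE(SUM({sum_column}), 0) FROM {table}",
    }
    for name, query in checks.items():
        src = source_conn.execute(query).fetchone()[0]
        tgt = target_conn.execute(query).fetchone()[0]
        if src != tgt:
            raise_alert(table, name, src, tgt)

def raise_alert(table, check, src, tgt):
    # In production this would open a ticket (e.g. via the ServiceNow API).
    print(f"AUDIT FAILURE {table}/{check}: source={src} target={tgt}")
```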


Security and compliance by design

At Starschema, we believe that compliance should not be an afterthought, but an integral part of systems design and architecture. As such, the Starschema Antares iDL™ architecture has been designed ‘compliance first’, meeting the need to provide data scientists and BI analysts with high-quality aggregated data while also providing adequate safeguards to comply with general and industry-specific standards of data protection. With the entry into force of the General Data Protection Regulation (GDPR) within the EU and analogous legislation throughout the world following suit, the importance of consistent compliance cannot be overstated. Starschema Antares iDL™ is a natural choice for enterprises subject to strict compliance regimes (GDPR, SOX, PCI DSS, HIPAA etc.) that require effective and reliable compliance solutions while retaining the ability to unlock insights from their data. Whether on-premise, hybrid or fully in the cloud, there is a secure, reliable and compliant Starschema Antares iDL™ data lake for your enterprise.

Data anonymization

In order to comply with GDPR and analogous national provisions on data privacy, personally identifiable information (PII) and certain other categories of sensitive information must be treated confidentially. Starschema Antares iDL™ performs anonymization on the fly and over the entire data set; end users can configure their anonymization requirements through a web-based interface. Anonymization configurations are stored in DynamoDB, a NoSQL database.

Multiple algorithms are used for anonymization; resistance to rainbow table attacks is provided by adding a random salt (a randomly generated string added before hashing to defeat attacks that use pre-generated hashes of candidate values, referred to as a ‘rainbow table’). In certain scenarios, anonymity is provided through data aggregation (bucketing), removing individual attributes and returning group-level data.
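The salting idea is simple enough to sketch; the salt handling below (one random salt per run) and the choice of SHA-256 are assumptions for illustration, and a production system would manage salts, or keyed constructions such as HMAC, far more carefully:

```python
# Sketch: salted hashing of PII to resist rainbow table attacks.
import hashlib
import secrets

SALT = secrets.token_hex(16)   # random salt, generated once per run here

def anonymize(value: str) -> str:
    # Prepending the salt means pre-computed hash tables of common
    # values (rainbow tables) no longer match the stored digests.
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

print(anonymize("jane.doe@example.com"))  # stable only within this run
```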

Art. 5(1)(e) GDPR establishes the principle of ‘storage limitation’, which complements data minimization: when it comes to data on individuals, both volume and time of retention must be minimized, and personal data shall be kept for no longer than is necessary for the purposes for which it is processed. To facilitate this, an audit log is in place for raw data access, and mechanisms are in place for both scheduled and on-request deletion of data. To maintain consistency over time, aggregated data that no longer constitutes data on individuals subject to the requirement of Art. 5(1)(e) can be retained in its aggregated form, preserving the consistency and auditability of past analyses based on it.
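Scheduled deletion under such a retention regime might look like the following sketch, assuming a DB-API connection and a loaded_at timestamp column; the table names, retention periods and placeholder style are illustrative assumptions:

```python
# Sketch: scheduled retention enforcement; person-level rows past their
# retention period are deleted, while aggregated tables are retained.
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = {"customer_events": timedelta(days=365)}  # hypothetical policy

def enforce_retention(conn: sqlite3.Connection):
    now = datetime.now(timezone.utc)
    for table, keep_for in RETENTION.items():
        cutoff = (now - keep_for).isoformat()
        # In practice the deletion itself would also be written to the
        # audit log to preserve auditability of past analyses.
        conn.execute(f"DELETE FROM {table} WHERE loaded_at < ?", (cutoff,))
    conn.commit()
```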

Access controls

Similarly to data in motion, which is protected using SSL/TLS certificates, data at rest is also closely


guarded. AWS’s Key Management Service (KMS) is used to encrypt all storage, local and cloud alike, that holds data at rest. Access to keys, and thereby to raw data, is granted only to functional tools, never to individual users.

With the help of Data Stewards, data governance is achieved not only at the metadata level but also at the content level. Content is audited to determine the levels of access to be granted at the attribute/row level and to create roles. Role-based access control (RBAC) using access control lists (ACLs) governs the provisioning of user privileges in a consistent and compliant manner.

In special scenarios, network security may be augmented by client whitelists that restrict access to the Starschema Antares iDL™ deployment to authorized subnets and through specific ports.

DevSecOps

Security considerations are part of the DevOps journey every step of the way. The DevOps-driven nature of Starschema Antares iDL™ integrates strongly with tools that facilitate secure code, including security analysis of all code with SonarQube prior to deployment. A Starschema Antares iDL™ data lake is designed to be compliant and secure throughout its lifespan and over the course of any changes made to it.


Automation, DevOps and Application Lifecycle Management

Starschema Antares iDL™ provides a best-in-class, fully automated data lake experience. By leveraging DevOps best practices and building on decades of cumulative experience building data lakes and Big Data environments, Starschema Antares iDL™ rests on the solid foundations of platform as code and code as code. Built for teamwork, Starschema Antares iDL™ ships with a web-based Integrated Development Environment (IDE) based on AWS Cloud9, enabling teams to collaboratively code and test in real time.

Infrastructure automation: platform as code

Building on DevOps best practices and years of experience in creating and automating data lakes, Starschema Antares iDL™ provides a fully automated data lake implementation, configurable entirely as code and deployed automatically, whether in the cloud, hybrid or on premise:

• When deployed in the cloud, the data lake infrastructure is created from AWS CloudFormation templates (a minimal sketch follows this list). Cloud deployments also benefit from built-in scalability: to ensure performance in times of high demand (e.g. end-of-year or end-of-quarter closing), resources are assigned to autoscaling groups at creation time. This means that Starschema Antares iDL™ can dynamically respond to increased demand at any time by spinning up new instances, while also ensuring economical use of resources by winding resources down when demand decreases.

• Hybrid infrastructures are automated using AWS CodeDeploy and AWS OpsWorks, orchestrating the synchronous deployment and management of both the on-premise and cloud-deployed parts of the infrastructure.

• On-premise systems are also fully automated, with Chef deploy scripts managing the DevOps process and a CI/CD pipeline run by Jenkins.
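For the cloud case, the deployment step referenced above might be sketched with boto3; the template path, stack name and capabilities below are illustrative assumptions rather than the product’s actual deployment code:

```python
# Sketch: create the data lake infrastructure from a CloudFormation
# template, then wait for stack creation to complete.
import boto3

def deploy_stack(template_path: str, stack_name: str = "antares-idl"):
    with open(template_path) as f:
        template_body = f.read()
    cfn = boto3.client("cloudformation")
    cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_IAM"],  # the stack may create IAM roles
    )
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)

deploy_stack("templates/datalake.yaml")  # hypothetical template path
```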

Application Lifecycle Management: code as code

With agile processes built in, integrated CI/CD tools and a distributed Source Code Management (SCM) system storing configurations for all application building blocks, Starschema Antares iDL™ ensures consistency through efficient distributed metadata management and built-in developer tools that seamlessly integrate with existing ITSM/IDM systems:

• Starschema Antares iDL™ Team Collaboration provides data architects and developers with a user interface for agile data lake development, managing user stories, features, ideas and initiatives in a single and collaboratively maintained central location.

• Starschema Antares iDL™ Workbench drives code development by managing code delta packaging, validation, code review, change management and deploying releases.


• Finally, Starschema Antares iDL™ DevOps Portal is a one-stop shop for tracking releases, artifact deployments and log files, providing developers, architects and managers with a single overview of the data lake’s deployment status.


Case study: a structured, SOX-compliant, multi-layer data lake for a large global conglomerate

Every day, our client’s group of companies records tens of thousands of financial transactions. This poses a unique challenge to a data lake: not only does highly granular, usually transaction-level data have to be ingested in near real time (latency <1 hour) from over 100 source systems belonging to more than thirty different types, but due to compliance requirements, there is zero tolerance for data inconsistencies. When our client faced the challenge of implementing such a system over more than 200TB of compressed enterprise data comprising more than 25,000 data domains (tables) of over five hundred distinct types, while also allowing simultaneous data consumption and continuous ingestion, they turned to Starschema for a solution.

Our approach

Starschema implemented a state-of-the-art architecture by deploying the Starschema Antares iDL™ design, in which raw data would first be mirrored by ingestion into a Massively Parallel Processing (MPP) relational database using HVR and Talend. Subsequently, data would be replicated to in-memory and Hadoop (file system) based consumption layers for later use, including aggregation, data stores, and data science applications. Data consumption then takes place over a dynamic lambda architecture, providing streaming and batch processing layers.

To facilitate the operation of this multi-layer data lake, a standard data definition structure (standard model) was devised for identical domain types of raw data. In addition, the data lake automatically generated a continuously updated metadata knowledge base to store discovered constraints and relationships within the data, providing an understanding of the data lake’s underlying structure itself. This in turn drives the Generic ETL Framework (GEF) and Data Lake Audit Framework (DAF), which constantly maintain and audit the data layers. Operations, change management and development are supported by ITIL- and SOX-compliant DevOps CI/CD pipeline applications, which maintain compliance through a process-forcing design.

Results

Every day, the SOX-certified Antares iDL™ data lake implementation ingests approximately 200TB of data through 6,000 parallel ETL processes from a diverse range of source systems (Oracle, SAP, PeopleSoft, Hyperion, enterprise-developed systems, etc.) into a single Oracle EBS-type Standard Model. Data is ingested in near real time, allowing the enterprise to perform crucial finance functions, such as closing and reporting, account reconciliation and centralized tax calculation,


based on accurate, consistent and up-to-date data.

“Starschema has been our partner since the beginning of our big data journey. What truly differentiates them is their ability to operate and deliver with a strong product mindset though catering to a service industry. Starschema, through its partnership with some of the most innovative companies in the big data space, has uniquely created an informal ecosystem and community to share best practices. The talent pool in Starschema is committed and differentiated and the company does a great job in hiring and retaining some of the best talents.”

– Goel Divakar, VP, Global CDO, General Electric