Tuesday, February 2, 2016

Hadoop's support for ETL

What is ETL? 

Before we discuss Hadoop's support for ETL, let me briefly explain what ETL is. ETL is short for Extract, Transform & Load into a data warehouse. As the first step of ETL, data from various kinds of data sources is EXTRACTed. The extracted data is then TRANSFORMed into a proper format so that business analysis can be performed on it. Finally, the transformed data is LOADed into a target database (a data warehouse).
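To make the flow concrete, here is a minimal, self-contained sketch of the three steps in plain Java. The source records, the cleansing rules, and the class name EtlSketch are all invented for illustration; a real job would read from actual source systems and write to a real warehouse rather than to the console.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// A toy illustration of the three ETL steps. The "source" and "target"
// here are just an in-memory list and the console.
public class EtlSketch {

    // EXTRACT: read raw records from a source system (simulated here).
    static List<String> extract() {
        return Arrays.asList("  alice , maths ", "  bob , physics ");
    }

    // TRANSFORM: apply cleansing/formatting rules so the data is analysis-ready.
    static List<String> transform(List<String> raw) {
        return raw.stream()
                  .map(r -> r.trim().replaceAll("\\s*,\\s*", ","))
                  .map(String::toUpperCase)   // stand-in for real business rules
                  .collect(Collectors.toList());
    }

    // LOAD: write the transformed records to the target (printed here).
    static void load(List<String> cleaned) {
        cleaned.forEach(System.out::println);
    }

    public static void main(String[] args) {
        load(transform(extract()));
    }
}
```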

ETL software primarily compiles data from multiple applications (systems), typically developed and supported by different software vendors or hosted on separate computer hardware. The disparate systems containing the original data are frequently managed and operated by different employees. For example, a University Management System may combine data on courses, students, examinations, staffing, staff payrolls, accounting software, etc. And if a university is spread across various geographical locations or countries, all of this data needs to be integrated.

Stages of ETL:


The first part of an ETL process involves extracting the data from the source system(s). In many cases this is the most important aspect of ETL, since extracting the data correctly sets the stage for the success of the subsequent processes, and one has to spend quality time deciding which data sets need to be extracted. Brainstorming sessions with the relevant groups are of great use here.

In the data transformation stage, a series of rules or functions is applied to the extracted data in order to prepare it for loading into the end target. Some data does not require any transformation at all; such data is known as "direct move" or "pass through" data.
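For instance, a typical transformation rule might reformat a date field while an identifier field simply passes through unchanged. The snippet below is only an illustration; the field names and date formats are made up.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical rule: input records look like "studentId,enrolmentDate" with the
// date in dd/MM/yyyy; the output keeps the ID as-is and emits an ISO date.
public class TransformRule {

    private static final DateTimeFormatter SOURCE = DateTimeFormatter.ofPattern("dd/MM/yyyy");
    private static final DateTimeFormatter TARGET = DateTimeFormatter.ISO_LOCAL_DATE;

    static String apply(String record) {
        String[] fields = record.split(",");
        String id = fields[0];                                                // pass-through field
        String isoDate = LocalDate.parse(fields[1], SOURCE).format(TARGET);   // transformed field
        return id + "," + isoDate;
    }

    public static void main(String[] args) {
        System.out.println(apply("S1001,02/02/2016"));   // prints S1001,2016-02-02
    }
}
```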

The load phase loads the data into the end target, which may be anything from a simple delimited flat file to a data warehouse. Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating the extracted data is frequently done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in historical form at regular intervals, for example hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year: it overwrites any data older than a year with newer data. The final data, i.e., the data LOADed into the data warehouse, is what gets used for BI reports, management reports, and analytical reports, based on which an organization decides and plans its future business focus. So discussions with senior management, the marketing team, and other stakeholders are advisable in order to load the required data.

Increased data volume - Missing ETL timelines?

In the mid-to-late 1990s, data sizes started to grow beyond the ability of ETL tools to finish the transformation work within the allocated time (the ETL window); that is, they had difficulty executing the TRANSFORM tasks fast enough.

Most organizations with traditional data platforms—typically relational database management systems (RDBMS) coupled to enterprise data warehouses (EDW) using ETL tools—find that their legacy infrastructure is either technically incapable or financially impractical for storing and analyzing big data.

Matters were further complicated by human error. For example, if it is discovered that part of the ETL did not complete properly during the previous run, or that there are logical errors in any user-defined functions (UDFs), the transformations have to be redone. Such a problem can take weeks to correct, as the ETL tool has to redo all of that past work while continuing to process the new daily data as it arrives. In summary, ETL transformation engines became a significant bottleneck.

Apache Hadoop
Rapidly ingesting, storing, and processing big data requires a cost-effective infrastructure that can scale with the amount of data and the scope of analysis. As industries struggled to meet their SLAs, a new solution evolved: Apache Hadoop, a distributed data storage and processing system built from the ground up to be massively scalable (to thousands of servers) using industry-standard hardware. Furthermore, Hadoop was built to be very flexible in terms of accepting data of any type, regardless of structure.

Apache Hadoop is an open source distributed software platform for storing and processing data. Written in Java, it runs on a cluster of industry-standard servers configured with direct-attached storage.  Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster.

Central to the scalability of Apache Hadoop is the distributed processing framework known as MapReduce. MapReduce helps programmers solve data-parallel problems for which the data set can be sub-divided into small parts and processed independently.

MapReduce is an important advance because it allows ordinary developers, not just those skilled in high-performance computing, to use parallel programming constructs without worrying about the complex details of intra-cluster communication, task monitoring, and failure handling. MapReduce simplifies all that.

The system splits the input data-set into multiple chunks, each of which is assigned a map task that can process the data in parallel.  Each map task reads the input as a set of (key, value) pairs and produces a transformed set of (key, value) pairs as the output. The framework shuffles and sorts outputs of the map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group them into final results. MapReduce uses JobTracker and TaskTracker mechanisms to schedule tasks, monitor them, and restart any that fail.
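To give a feel for the programming model, here is essentially the canonical word-count job written against the standard Hadoop MapReduce Java API; the input and output paths are supplied on the command line when the jar is submitted to the cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: read each line and emit a (word, 1) pair per token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: after the shuffle/sort, sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```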

The Apache Hadoop platform also includes the Hadoop Distributed File System (HDFS), which is designed for scalability and fault tolerance.  HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers.  HDFS provides APIs for MapReduce applications to read and write data in parallel. Capacity and performance can be scaled by adding Data Nodes, and a single NameNode mechanism manages data placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.
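Application code talks to HDFS through the FileSystem API. Below is a minimal sketch of writing and reading a small file; the path /etl/sample.txt is just a placeholder, and the Configuration object is assumed to pick up the cluster settings from core-site.xml / hdfs-site.xml on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads the cluster configuration
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/etl/sample.txt");    // placeholder path

    // Write a small file; HDFS takes care of splitting large files into
    // blocks and replicating them across DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```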

Besides MapReduce and HDFS, Apache Hadoop has many ecosystem components, such as Apache Flume, Apache Sqoop, Pig, and Hive, some of which are useful for the ETL process.




Apache Flume is a distributed system for collecting, aggregating, and moving large amounts of data from multiple sources into HDFS or another central data store.

Apache Sqoop is a tool for transferring data between Hadoop and relational databases. You can use Sqoop to import data from a MySQL or Oracle database into HDFS, run MapReduce on the data, and then export it back into an RDBMS. Sqoop automates these processes, using MapReduce to import and export the data in parallel with fault tolerance.

Apache Hive and Apache Pig provide higher-level languages that simplify the development of applications employing the MapReduce framework. Hive's language, HiveQL, is a dialect of SQL and supports a subset of its syntax. Although historically batch-oriented and slow, Hive is being actively enhanced by the developer community to enable low-latency queries on Apache HBase and HDFS. Pig's language, Pig Latin, is a procedural programming language that provides high-level abstractions for MapReduce. You can extend it with User Defined Functions (UDFs) written in Java, Python, and other languages; a sketch of one such UDF follows.
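As a small illustration of that last point, here is a minimal Pig UDF in Java (the class name ToUpper is my own choice, not a built-in). Once compiled into a jar and REGISTERed in a Pig script, it can be called from Pig Latin like any other function.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A minimal Pig UDF that upper-cases its first input field.
public class ToUpper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;   // Pig treats null as "no result" for this record
    }
    return input.get(0).toString().toUpperCase();
  }
}
```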

Hadoop's support
Now that Hadoop has been commercially available in the market for years, hundreds of organizations are moving the TRANSFORM function from their databases to Hadoop. This move brings a number of key benefits:

1.    Hadoop can perform the TRANSFORM much more effectively than RDBMSs. Besides the performance benefits, it is also very fault tolerant and elastic. If you have a nine-hour transformation job running on 20 servers, and at the eighth hour four of those servers go down, the job will still finish; you will not need to rerun it from scratch. If you discover a human error in your ETL logic and need to rerun the TRANSFORM for the last three months, you can temporarily add a few nodes to the cluster to get extra processing speed, and then decommission those nodes after the ETL catch-up is done.
2.    When the TRANSFORM moves to Hadoop, the RDBMSes are free to focus on doing the Q (the query part) really well. It would be interesting to see how much faster these systems get once voluminous unstructured data is moved out of them.

Using Hadoop in this way, an organization gains the additional ability to store and access data that it "might" need, data that may never be loaded into the data warehouse. For example, data scientists might want to use the large amounts of source data from social media, web logs, or third-party stores kept on Hadoop to enhance new analytic models that drive research and discovery. They can store this data cost-effectively in Hadoop, and retrieve it as needed (using Hive or other analytic tools native to the platform), without affecting the EDW environment.

Please note that I have given some views on how Hadoop can be utilised for ETL / data warehousing related activities. I am not saying Hadoop is an out-and-out replacement for an ETL tool or an RDBMS, since each has its own strengths. It is up to clients to choose the best implementation based on their own analysis.


Thanks.