What is ETL?
Before we discuss Hadoop's support for ETL, let me briefly explain what ETL is.
ETL is short for Extract, Transform & Load into a data warehouse. As the first step of ETL, data is EXTRACTed from various forms of data sources. The extracted data is then TRANSFORMed into a proper format so that business analysis can be performed on it. Finally, the transformed data is LOADed into a target database (a data warehouse).
ETL software primarily compiles data from multiple applications (systems), typically developed and supported by different software vendors or hosted on separate computer hardware. The disparate systems
containing the original data are frequently managed and operated by different
employees. For example, a University Management System may combine data from courses,
students, examinations, staffing, staff payrolls, accounting software,
etc. And if a university is spread across various geographical locations or countries, then all of this data needs to be integrated.
Stages of ETL:
The first part of an ETL process involves
extracting the data from the source system(s). In many cases this represents
the most important aspect of ETL, since extracting data correctly sets the
stage for the success of the subsequent processes, and one has to spend quality time deciding which data sets need to be extracted. Brainstorming discussions with the relevant teams are of great use here.
In the data transformation stage, a series of
rules or functions are applied to the extracted data in order to prepare it for
loading into the end target. Some data does not require any transformation at
all; such data is known as "direct move" or "pass through"
data.
The load phase loads the data into the end target, which may be a simple delimited flat file or a data warehouse. Depending on the
requirements of the organization, this process varies widely. Some data
warehouses may overwrite existing information with cumulative information;
updating extracted data is frequently done on a daily, weekly, or monthly
basis. Other data warehouses (or even other parts of the same data warehouse)
may add new data in a historical form at regular intervals—for example, hourly.
To understand this, consider a data warehouse that is required to maintain
sales records of the last year. This data warehouse overwrites any data older
than a year with newer data. The final data, i.e., the data LOADed into the data warehouse, is what is used for BI reports, management reports, and analytical reports, based on which an organization can decide and plan its future business focus. So, discussions with senior management, the marketing team, and other stakeholders are advisable to determine the data that needs to be loaded.
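As a rough illustration of those two loading styles, the sketch below loads a sales fact either into a rolling one-year "current" table (purging anything older than a year, as in the example above) or into an append-only history table. The table and column names are invented for the example, and the JDBC Connection is assumed to have been opened to the warehouse beforehand (e.g. via DriverManager.getConnection).

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of two LOAD styles against a warehouse reached over JDBC.
public class LoadStyles {

    // Style 1: rolling one-year window - purge anything older than a year,
    // then insert the freshly transformed row into the "current" table.
    static void loadRollingWindow(Connection dw, Timestamp saleDate, double amount) throws Exception {
        try (PreparedStatement purge = dw.prepareStatement(
                "DELETE FROM sales_current WHERE sale_date < ?")) {
            purge.setTimestamp(1, Timestamp.from(Instant.now().minus(365, ChronoUnit.DAYS)));
            purge.executeUpdate();
        }
        try (PreparedStatement insert = dw.prepareStatement(
                "INSERT INTO sales_current (sale_date, amount) VALUES (?, ?)")) {
            insert.setTimestamp(1, saleDate);
            insert.setDouble(2, amount);
            insert.executeUpdate();
        }
    }

    // Style 2: append-only history - every load adds rows tagged with the load
    // time, so earlier snapshots are kept for trend analysis.
    static void loadHistorical(Connection dw, Timestamp saleDate, double amount) throws Exception {
        try (PreparedStatement insert = dw.prepareStatement(
                "INSERT INTO sales_history (loaded_at, sale_date, amount) VALUES (?, ?, ?)")) {
            insert.setTimestamp(1, Timestamp.from(Instant.now()));
            insert.setTimestamp(2, saleDate);
            insert.setDouble(3, amount);
            insert.executeUpdate();
        }
    }
}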
Increased data volume - Missing ETL timelines?
In the mid-to-late 1990s, data sizes started to grow beyond the ability of ETL tools to finish the transformation work within the allocated time (the ETL window); that is, they had difficulty executing the TRANSFORM tasks fast enough.
Most organizations with traditional data platforms—typically
relational database management systems (RDBMS) coupled to enterprise data
warehouses (EDW) using ETL tools—find that their legacy infrastructure is
either technically incapable or financially impractical for storing and
analyzing big data.
Matters were further complicated by human
error. For example, if it is discovered that part of the ETL didn't complete properly during the previous run, or if logical errors are observed in any user-defined functions (UDFs), the transformations have to be redone. This
problem often takes weeks to be corrected, as the ETL tool has to redo all that
past work while continuing to do the work for new daily data as it arrives. In
summary, ETL transformation engines became a significant bottleneck.
Apache Hadoop
Rapidly ingesting, storing, and processing big
data requires a cost-effective infrastructure that can scale with the amount of data and the scope of analysis. As industries struggled to meet their SLAs, a new
solution evolved: Apache Hadoop, a distributed data storage and processing
system built from the ground up to be massively scalable (thousands of servers)
using industry-standard hardware. Furthermore, Hadoop was built to be very
flexible in terms of accepting data of any type, regardless of structure.
Apache Hadoop is an open source distributed
software platform for storing and processing data. Written in Java, it runs on
a cluster of industry-standard servers configured with direct-attached
storage. Using Hadoop, you can store
petabytes of data reliably on tens of thousands of servers while scaling
performance cost-effectively by merely adding inexpensive nodes to the cluster.
Central to the scalability of Apache Hadoop is
the distributed processing framework known as MapReduce. MapReduce helps
programmers solve data-parallel problems for which the data set can be
sub-divided into small parts and processed independently.
MapReduce is an important advance because it
allows ordinary developers, not just those skilled in high-performance computing,
to use parallel programming constructs without worrying about the complex
details of intra-cluster communication, task monitoring, and failure handling.
MapReduce simplifies all that.
The system splits the input data-set into
multiple chunks, each of which is assigned a map task that can process the data
in parallel. Each map task reads the
input as a set of (key, value) pairs and produces a transformed set of (key,
value) pairs as the output. The framework shuffles and sorts outputs of the map
tasks, sending the intermediate (key, value) pairs to the reduce tasks, which
group them into final results. MapReduce uses JobTracker and TaskTracker mechanisms
to schedule tasks, monitor them, and restart any that fail.
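The classic word-count job, sketched below in Java, shows these pieces in action: the map task emits (word, 1) pairs, the framework shuffles and sorts them, and the reduce task sums the counts for each word. It assumes the standard Hadoop MapReduce client libraries are on the classpath; input and output paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: reads its chunk as (offset, line) pairs and emits (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: receives (word, [1, 1, ...]) after the shuffle/sort
    // and emits (word, total count).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}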
The Apache Hadoop platform also includes the
Hadoop Distributed File System (HDFS), which is designed for scalability and
fault tolerance. HDFS stores large files
by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks
on three or more servers. HDFS provides
APIs for MapReduce applications to read and write data in parallel. Capacity
and performance can be scaled by adding DataNodes, and a single NameNode
mechanism manages data placement and monitors server availability. HDFS
clusters in production use today reliably hold petabytes of data on thousands
of nodes.
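A short sketch of the HDFS Java API (org.apache.hadoop.fs.FileSystem) is shown below: it writes a file, checks the block size and replication factor chosen for it, and reads it back. The NameNode address and the paths are placeholders for a real cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/etl/staging/sample.txt");

        // Write: the client streams data, and HDFS splits it into blocks and
        // replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Inspect the block size and replication factor chosen for the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size = " + status.getBlockSize()
                + ", replication = " + status.getReplication());

        // Read the file back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}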
Besides MapReduce and HDFS, Apache Hadoop has many components such as Apache Flume, Apache Sqoop, Pig, Hive, etc., some of which are useful for the ETL process.
Apache Flume is a distributed system for collecting,
aggregating, and moving large amounts of data from multiple sources into HDFS
or another central data store.
Apache Sqoop is a tool for transferring data
between Hadoop and relational databases. You can use Sqoop to import data from
a MySQL or Oracle database into HDFS, run MapReduce on the data, and then
export the data back into an RDBMS. Sqoop automates these processes, using
MapReduce to import and export the data in parallel with fault-tolerance.
Apache Hive and Apache Pig provide high-level languages that simplify the development of applications employing the MapReduce framework. Hive's language, HiveQL, is a dialect of SQL and supports a subset of its syntax.
Although slow, Hive is being actively enhanced by the developer community to
enable low-latency queries on Apache HBase and HDFS. Pig Latin is a procedural
programming language that provides high-level abstractions for MapReduce. You
can extend it with User Defined Functions written in Java, Python, and other
languages.
Hadoop's support for ETL
Now
that Hadoop has been commercially available in the market for years, hundreds
of organizations are moving the TRANSFORM function from their databases to
Hadoop. This movement achieves a number of key benefits:
1. Hadoop can perform TRANSFORM much more effectively than RDBMSs. Besides the performance benefits, it is also very fault tolerant and elastic. If you have a nine-hour transformation
job running on 20 servers, and at the eighth hour four of these servers go
down, the job will still finish — you will not need to rerun it from scratch.
If you discover a human error in your ETL logic and need to rerun TRANSFORM for
the last three months, you can temporarily add a few nodes to the cluster to
get extra processing speed, and then decommission those nodes after the ETL
catch-up is done.
2. When the TRANSFORM moves to Hadoop, the RDBMSes are free to focus on what they do best: serving queries. It would be interesting to see
how much faster these systems get once voluminous unstructured data is moved
out of them.
Using Hadoop in this way, the organization
gains an additional ability to store and access data that they “might” need,
data that may never be loaded into the data warehouse. For example, data
scientists might want to use the large amounts of source data from social
media, web logs, or third-party sources stored in Hadoop to enhance new
analytic models that drive research and discovery. They can store this data
cost effectively in Hadoop, and retrieve it as needed (using Hive or other
analytic tools native to the platform), without affecting the EDW environment.
Please note that I have given some views on how Hadoop can be utilised for ETL and data warehousing related activities; I am not saying Hadoop is an out-and-out replacement for an ETL tool or an RDBMS, since each has its own strengths. It is the responsibility of the clients to choose the best implementation based on their own analysis.
Thanks.