
Hadoop: Next big thing in Big Data space

Krystal

BANGALORE, INDIA: In the recent past, Hadoop has become the next big thing in the big data space. Almost 80 percent of the world's data is unstructured, and most businesses do not even attempt to use this data to their advantage.


Earlier, it was difficult to save all the data that was generated by businesses for future analysis. Hadoop solves this problem. Also, since Hadoop places no time limits on the data it stores, many enterprises are embracing it, as it makes experimenting with and exploring different uses of large data sets easy.

ThoughtWorks is a global IT consultancy providing Agile-based systems development, consulting and transformation services to companies worldwide. ThoughtWorks believes that a solution for Big Data should address the challenges of the three Vs - Volume, Variety and Velocity - by employing existing assets such as data warehouses combined with newer approaches, one of which is Hadoop.

ThoughtWorks has been encouraging its clients to use Hadoop because it lowers the cost of innovation and lets them procure large-scale resources quickly, among other advantages.


Shyam Kurien works with the Advanced Analytics and Distributed Computing practice at ThoughtWorks India, where he is responsible for defining and executing the business strategy for the practice.

Here, he discusses how Apache Hadoop has become the next big thing in the big data space, how it helps save all the data that is generated by a business for future analysis, the technical nuances behind different SQL-on-Hadoop solutions and more.

CIOL: How has Apache Hadoop become the next big thing in the big data space?


Shyam Kurien: A number of factors have contributed to making Hadoop the central force in the big data analytics space. What started off as an adoption of the principles of the Google File System and MapReduce to improve the scalability and robustness of the Nutch search engine has evolved into a very extensive and highly capable ecosystem.

First of all, the research papers published by Google have been instrumental in setting the direction and philosophy of the Hadoop framework. In addition to HDFS and MapReduce being modeled on the above-mentioned papers from Google, many other components have been inspired by Google research papers: HBase is based on Google BigTable, Impala was inspired by Google Dremel, Pig is modeled on Google Sawzall, and so on.

Going by the publication of the Google F1 and Spanner papers, one should soon see the Hadoop ecosystem mature further towards distributed database capabilities.


The second key factor has been the Open Source nature of the Hadoop platform. Starting with Yahoo, which gave an official home to the fledgling Hadoop community, a number of organizations that process and analyze massive amounts of data have adopted Hadoop as a core piece of their data infrastructure.

A number of advancements in the ecosystem have been contributed back as open source by such players. For example, Hive, the data warehouse abstraction on Hadoop, came out of the Facebook stable. Similarly, UC Berkeley's AMPLab has built a number of extensions and released them as open source components such as Spark, Mesos and Shark.

CIOL: How does Hadoop help save all the data that is generated by a business for future analysis?


SK: To allow massive amounts of data to be stored, any (distributed) file system should have the following traits:

a) It should be horizontally scalable.

b) It should be highly fault tolerant.

c) It should be very flexible in what can be stored.

The core of Hadoop consists of a distributed file system called the Hadoop Distributed File System (HDFS) and a distributed processing framework (YARN / MapReduce), both of which have been built from the ground up with the objectives of horizontal scaling and fault tolerance.
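
To make the processing side concrete, below is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths passed on the command line are hypothetical, and a real job would be packaged into a jar and submitted to the cluster.

// Minimal word-count sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a raw input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}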


HDFS comprises a master node called the NameNode and any number of slave nodes called DataNodes, which are controlled and coordinated by the NameNode. Whenever a file (typically very large in size) is written to HDFS, it is broken into pieces called blocks, and these are distributed across a number of DataNodes.

Further, each block is also replicated on more than one DataNode, based on the cluster configuration. The entire metadata about the file system, including the mapping of files to blocks (and their replicas), is maintained by the NameNode as part of the file system namespace.

In case of failure of any DataNode, the NameNode ensures that the node is no longer used and that any blocks stored on it are re-replicated to other DataNodes, thus ensuring a high degree of fault tolerance.
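
The block and replication mechanics described above can be observed through the HDFS Java API. The sketch below, with a hypothetical NameNode address and file path, writes a small file and then asks the NameNode which DataNodes hold each of its blocks.

// Sketch: write a file to HDFS and inspect its blocks and replication factor.
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInspector {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode endpoint; in practice this usually comes from core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/events.log");  // hypothetical path
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("sample event data\n".getBytes(StandardCharsets.UTF_8));
    }

    // The NameNode holds the namespace metadata: file -> blocks -> DataNodes.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication factor: " + status.getReplication());

    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + " stored on DataNodes: " + String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}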


Both the NameNode and the DataNodes are designed to run on commodity hardware; hence it is very easy and inexpensive to scale storage horizontally by simply adding DataNodes to the cluster and registering them with the NameNode.

The file system does not put any restriction on the kind of data that can be stored in and retrieved from it. Any structure that needs to be imposed on the data can be defined at a later point, not necessarily at the time the data is first loaded into the file system. This is in contrast to a typical database, where the schema is defined upfront and only data that matches the schema is loaded into it.

This "schema on read" capability of the Hadoop framework also goes a long way in allowing enterprises to save all the data generated by the business without being burdened by the rigid structures of an enterprise data warehouse.

CIOL: What are the technical nuances behind different SQL-on-Hadoop solutions?

SK: A huge number of vendors have been working on SQL-on-Hadoop solutions, and in the recent past a number of them have been released or are in the process of being released. Impala from Cloudera, Drill from MapR, Lingual from Cascading, Hadapt, Polybase from Microsoft and HAWQ from Pivotal HD are but a few of them, the latest entrant being Presto from Facebook.

Each of these solutions has considerable or subtle differences in its technical approach. However, they can be grouped into four categories, based on the classification provided by Daniel Abadi (of Vertica fame):

a) Faster Hive / SQL converted to MapReduce. Such solutions convert the SQL query into a series of MapReduce jobs, which are then executed on the cluster by the job manager (a sketch of issuing such a query through Hive appears after this list). However, the high latency of Hive queries has been a big drawback of this approach.

Initiatives to improve the performance of Hive by the Qubole team (who built Hive in the first place at Facebook) and the Stinger initiative by Hortonworks are improving performance to levels that will support low-latency querying.

b) Distributed Query Processing engines. Most of the latest set of SQL-on-Hadoop solutions like Impala, Drill, and Lingual are of this nature.

c) Split Query Processing engines. Microsoft's Polybase and Hadapt belong to this category, where the approach is to split the processing so that some parts are executed as MapReduce jobs and others as native SQL operators.

d) DBMS / Hadoop connectors. Solutions like HAWQ from Greenplum and SQL-H use this connector approach, where the data is extracted out of HDFS at query time and then consolidated by the MPP execution engines.
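
As a concrete example of category (a), the sketch below issues a SQL query to data in Hadoop through HiveServer2's JDBC interface. The host name, credentials and the web_logs table are hypothetical, and the Hive JDBC driver is assumed to be on the classpath.

// Sketch: querying Hadoop data via HiveServer2 JDBC; Hive compiles the SQL
// into MapReduce (or Tez) jobs that run over files stored in HDFS.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; port 10000 is the conventional default.
    String url = "jdbc:hive2://hiveserver-host:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "analyst", "");
         Statement stmt = conn.createStatement()) {
      // Hypothetical table: raw web logs stored in HDFS, queried as SQL.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
      }
    }
  }
}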

CIOL: How has Hadoop been adopted across enterprises?

SK: Hadoop has been adopted across enterprises in many ways.

Platformisation: As organizations mature in their adoption of Hadoop and its ecosystem and build multiple analytics applications, they typically reach a phase where a number of the base features of those applications can be abstracted into a data processing platform.

At a basic level, this could be a data lifecycle management platform; at a higher level of sophistication, platformisation could also involve building a domain-specific data asset repository. Such abstractions allow the analytics applications to focus primarily on the business domain, with the boilerplate plumbing code provided by the platform.

Leveraging the elasticity of the cloud: Multiple reasons have been driving the adoption of the cloud as a platform for deploying Hadoop-based applications. They range from the need to support the variable resource requirements associated with typical batch processing jobs to running on the same platform as transactional systems already deployed on the cloud.

Another reason could be that organizational processes (rather than hardware or software capabilities) start becoming the limiting factor for scalability once the scale of operations of an in-house Hadoop cluster grows beyond an inflection point. Whatever the reason, organizations of various sizes and domains have started opting for cloud deployment, providing a fillip to a large number of infrastructure players offering their version of Hadoop on the cloud.

Real time analytics: While the batch mode of operation of conventional Hadoop applications meets most requirements of an organization, a number of organizations are exploring how they can achieve two flavours of real-time capability. The first is real-time in terms of the latency of querying the data store, and organizations have started experimenting with the various SQL-on-Hadoop solutions that promise near-real-time responses to queries.

The other limitation brought about by Hadoop's batch processing mode is that the analytics algorithms typically work on data that is slightly old; they do not include real-time data. One solution is a hybrid design referred to as the Lambda architecture, where a "Batch layer", typically implemented as a series of MapReduce jobs, is combined with a "Speed layer", which might be implemented on Storm.
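
The sketch below illustrates the serving side of such a Lambda architecture in simplified form: query answers merge a precomputed (slightly stale) batch view with the incremental counts kept by the speed layer. The class and all names in it are illustrative rather than any framework's API.

// Illustrative Lambda-architecture serving sketch: batch view + real-time view.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LambdaServingLayer {

  // Batch view: recomputed periodically (e.g. by MapReduce jobs) over all
  // historical data, so it is complete but slightly stale.
  private volatile Map<String, Long> batchView = Map.of();

  // Real-time view: updated per event by the speed layer (e.g. a Storm bolt),
  // covering only the data that arrived since the last batch run.
  private final Map<String, Long> realtimeView = new ConcurrentHashMap<>();

  // Called when a fresh batch computation finishes.
  public void swapInBatchView(Map<String, Long> freshBatchView) {
    this.batchView = Map.copyOf(freshBatchView);
    this.realtimeView.clear();  // speed layer restarts from the new cut-off
  }

  // Called by the speed layer for every incoming event.
  public void recordEvent(String key) {
    realtimeView.merge(key, 1L, Long::sum);
  }

  // Queries see batch + real-time, i.e. an approximately up-to-date answer.
  public long query(String key) {
    return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
  }

  public static void main(String[] args) {
    LambdaServingLayer counts = new LambdaServingLayer();
    counts.swapInBatchView(Map.of("/home", 1_000_000L));
    counts.recordEvent("/home");
    counts.recordEvent("/home");
    System.out.println(counts.query("/home"));  // prints 1000002
  }
}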