
Stop Talking About “Hadoop”

By Merv Adrian | March 04, 2020 | 8 Comments

Tags: open source, machine learning, Gartner, data lake, Apache YARN, Apache Submarine, Apache NiFi, Apache MapReduce, Apache Kafka, Apache HDFS, Apache Hadoop, Apache, Analytics and BI Solutions, Data and Analytics Leaders, Data and Analytics Strategies, Technology Innovation

As an analyst, I get many inquiries about Hadoop. Few of them bear much resemblance to the ones I took 5 or 6 years ago. The investor inquiries are similar – they still think Hadoop is a “market” and want to talk about how it’s doing. User inquiries are very different. They used to reflect confusion about what it was. Today they are characterized by a seeming certainty that it’s whatever they want it to be, as long as there is data involved. But “Hadoop” has long since moved beyond its original on-premises, disk-based, batch-processing roots. Available packages from vendors who used to be called “Hadoop distributors” now include Spark, HBase, Hive, Kafka, Flink, NiFi and many other components. Most are built for use cases the original MapReduce processing would never have been used for. The term has outlived its usefulness, even as a shorthand label.

The larger stack of Apache projects typically used has always been characterized by substitutability at any layer. Years of competition among distributors were based in part on their dueling projects. Many had their own version of this or that, still open source, and after a while the confusion was a problem, and a barrier to progress. What does the Apache project called Hadoop – the core – contain today, according to the Apache website? Two file systems (HDFS and Hadoop Ozone), a system for parallel processing of large data sets (Hadoop MapReduce), a job scheduling and cluster resource manager (Hadoop YARN), and, until recently, a machine learning engine (Submarine). (3/4/20 – Since first publication of this post, the latter has moved to Related Projects.)
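For readers who never touched the core itself, here is a minimal sketch of the kind of job Hadoop MapReduce was built for – a word count over large files – written for the Hadoop Streaming utility so the mapper and reducer are plain Python. The script name, paths and invocation are illustrative, not from any real deployment.

```python
#!/usr/bin/env python3
# wordcount_streaming.py -- one script that acts as mapper or reducer,
# depending on its first argument. A hypothetical invocation:
#
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/raw -output /logs/wordcount \
#     -mapper "wordcount_streaming.py map" \
#     -reducer "wordcount_streaming.py reduce" \
#     -file wordcount_streaming.py
import sys


def do_map():
    # Emit one "word<TAB>1" pair per word; the framework sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def do_reduce():
    # Input arrives sorted by key, so equal words are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    do_map() if sys.argv[1] == "map" else do_reduce()
```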

Standards? What Standards?

But storage is not standard. In an era where more and more data is moving to the cloud, the object stores from the cloud platform players are increasingly replacing HDFS. Apache Ozone is in part an attempt to create a modern open alternative, in keeping with the philosophy that has informed all the layers. Some now refer to Hadoop Compatible File Systems (HCFS). But note that

“The Apache Software Foundation can make no assertion that a third party filesystem is compatible with Apache Hadoop: these are claims by the vendors which the Apache Software Foundation do not validate.”
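That substitutability is visible right in the code: Hadoop-aware tools resolve the storage implementation from the URI scheme, so moving from HDFS to an object store or to Ozone is often a path change plus the right connector. A minimal PySpark sketch, with every host, bucket and volume name hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hcfs-demo").getOrCreate()

# The same read against three different stores. Only the URI scheme
# changes; each scheme needs its connector on the classpath
# (hadoop-aws for s3a://, the Ozone filesystem client for ofs://).
df_hdfs  = spark.read.text("hdfs://namenode:8020/logs/events.txt")
df_s3    = spark.read.text("s3a://example-bucket/logs/events.txt")
df_ozone = spark.read.text("ofs://ozone-om/vol1/bucket1/logs/events.txt")

print(df_hdfs.count(), df_s3.count(), df_ozone.count())
```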

Processing – even for the same “parse large files and get some useful info out” use cases – is not standard either. Some historians might argue that MapReduce was the whole point. But today, Spark has often replaced it, and some even argue it’s the standard processing piece we ought to be talking about. Others would say “let’s not wait for the data to come to rest when we can work with it in motion.” They make the same claim for Kafka, NiFi and Flink.
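To make that concrete, here is a hedged sketch of the same “scan the logs for errors” job done both ways in PySpark: once as a batch pass over files at rest, once with Structured Streaming over a Kafka topic (which needs the spark-sql-kafka package). The paths, broker address and topic name are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("rest-vs-motion").getOrCreate()

# Data at rest: a batch pass over files, the job MapReduce was born for.
error_count = (spark.read.text("hdfs://namenode:8020/logs/")
               .filter(col("value").contains("ERROR"))
               .count())
print(f"errors on disk so far: {error_count}")

# Data in motion: the identical filter, applied to a Kafka topic
# before the data ever comes to rest.
errors_in_motion = (spark.readStream
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker:9092")
                    .option("subscribe", "app-logs")
                    .load()
                    .selectExpr("CAST(value AS STRING) AS line")
                    .filter(col("line").contains("ERROR")))

query = errors_in_motion.writeStream.format("console").start()
query.awaitTermination()
```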

The Apache definition did include a machine learning component, but Submarine is new – in its second year. It has now been promoted to a top-level project. It’s fair to say that earlier ML attempts based on Mahout did not dominate the uses of “Hadoop.” Perhaps Submarine will have more success. Gartner research indicates that the market is moving to more full-featured, commercial products and away from open-source collections of algorithms aimed at engineers. Submarine is a more complete collection that nods in that direction, so it aspires to be a machine learning workbench more than a “Hadoop component.”

Let’s Get Specific

Any specific collection drawn from these pieces and all the others will have its own strengths, and one is likely to be more suitable for a given use case. At other layers, similar alternatives exist. But there is another key point – few use cases rely on only one layer. A very large percentage of significant enterprise-class requirements is likely to rely on three or more of them. Hence the notion of a “platform,” which can be thought of as another way to say “distribution” for our purposes here, though there are obvious differences. If you’re determined to talk about technology, be specific – talk about the components actually being used.

These days, owning a “platform” is everyone’s target – from traditional BI and analytics, Data Integration, DBMS, and Machine Learning vendors, to cloud platform providers. Even the last remaining “Hadoop distributors” (who no longer wish to be called that) talk “platform.” All of them substitute some pieces at some of the layers of some core stack. Some alternative components at specific layers talk to alternative components at other layers. For example, you may want to use Spark on AWS, with Kinesis instead of Kafka, to read log data from S3, because an application you are connecting to made that storage choice. The good news is that Spark can do that. (Spark’s third-party packages list had 474 entries in Feb 2020.) But not all projects have such a rich ecosystem. Similar scenarios occur when using Microsoft HDInsight or Google Cloud Dataproc. Wherever you deploy, there are “local favorite” options.
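A sketch of that AWS scenario, assuming Spark 2.x with the hadoop-aws and spark-streaming-kinesis-asl packages available; the bucket, stream, application and region names are all hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

spark = SparkSession.builder.appName("aws-local-favorites").getOrCreate()

# Log data at rest, read straight from the S3 bucket the upstream
# application already writes to.
history = spark.read.json("s3a://example-app-logs/2020/02/")
history.printSchema()

# Events in motion from Kinesis rather than Kafka -- the "local
# favorite" on this platform (DStream API from the Kinesis ASL package).
ssc = StreamingContext(spark.sparkContext, batchDuration=10)
events = KinesisUtils.createStream(
    ssc,
    kinesisAppName="example-consumer",
    streamName="example-log-stream",
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,
)
events.pprint()

ssc.start()
ssc.awaitTermination()
```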

The combinations can be numbing – and then there’s the “plumbing.” (Sorry for the rhyme – it’s the starting point for a talking blues.) The early Hadoop teams were typically fenced off on their own clusters. They didn’t have to worry much about connections to the rest of the fabric, governance, or security beyond access control. Today, many of the components in delivered commercial packages need to be instrumented for granular role-based security, metadata management, lineage, data quality, portability, the coordination of distributed applications, and more. The teams using the technology routinely coordinate with, exchange data with, and participate in policy enforcement with the rest of the business and technology units in their firms. And there will be a need for resource management, orchestration, governance and security work that none of those teams does on its own today.

Let’s Talk About What Matters

All this suggests the name “Hadoop” has outlived its usefulness as a way to identify what we are trying to do with the various technologies in our stack du jour. Perhaps it’s simply time to talk about the use case – data lake, machine learning, operational data management, “your favorite here.” The use case is more descriptive, and more useful as a basis for design, development, integration and operational planning.

The former “Hadoop” vendors made this transition a long time ago. It’s time for the rest of us to do the same. Let’s think – and communicate – in terms of use cases, functional activities, outcomes and audiences. “Our customer-facing system needs to integrate data from several new semi-structured data feeds to provide better analytics-based visibility of preferred offerings to our highest-value subscribers.” Might there be some Hadoop in there? Sure. But it won’t move the conversation forward much to start there. Let’s talk about customer systems, or digital transformation based on machine learning, or data enrichment to feed field operations…


8 Comments


  • Wangda Tan says:

    Hi Merv,
    Thanks for publishing the great and insightful article.
    Just one correction: Hadoop Submarine has already spun off into Apache Submarine (which is an Apache top-level project). It’s also a sign that the project is in a healthy state.

    Wangda (Apache Hadoop/Submarine PMC)

    • Merv Adrian says:

      Thanks for the info – this is what I love about blogs. I’ve made updates to the post to reflect this information.

  • Mick says:

    Nicely put Merv…