Hadoop Ecosystem

Seven years ago, the Hadoop ecosystem was under rapid development. Today, many of its projects are mature and ready for production deployment.

Credit to Mercy (Ponnupandy) Beckham

Here are my personal picks to get you started on your Hadoop journey.

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
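
The "simple programming models" in the quote above are easiest to see with the classic word count. Below is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not the Hadoop API; the function names are made up for illustration, and a real job would run the same logic spread across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Because each map call only sees one line and each reduce call only sees one key, both steps can be run in parallel on different machines, which is what makes the model scale.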

  • YARN (Distributed Resource Management): Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
  • Spark (Distributed Programming): Apache Spark™ is a fast and general engine for large-scale data processing. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez (Distributed Programming): Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.
  • Hive (SQL-On-Hadoop): The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
  • HBase (Column Data Model NoSQL): Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
  • Cassandra (Column Data Model NoSQL): The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
  • MongoDB (Document Data Model NoSQL): MongoDB is a document database with the scalability and flexibility that you want with the querying and indexing that you need. MongoDB stores data in flexible, JSON-like documents.
  • Redis (Key-Value Data Model NoSQL): Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes with radius queries. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
  • Flume (Data Ingestion): Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
  • Sqoop (Data Ingestion): Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Kafka (Data Ingestion): Kafka™ is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
  • Thrift (Service Programming): The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
  • ZooKeeper (Service Programming): A high-performance coordination service for distributed applications.
  • Mahout (Machine Learning): The Apache Mahout™ project’s goal is to build an environment for quickly creating scalable performant machine learning applications.
  • Oozie (Scheduling): Oozie is a workflow scheduler system to manage Apache Hadoop jobs.


  1. http://hadoopecosystemtable.github.io
  2. https://mydataexperiments.com/2017/04/11/hadoop-ecosystem-a-quick-glance/

Rule-based, machine learning, or a hybrid?

Fraud detection and prevention is always a battlefield. Fraudsters keep finding new ways to game the system. On the other hand, that is a big business opportunity. Let's see how crowded the market is:


This is a hand-selected list of some notable players in this market. There are a couple of interesting observations:

  • Machine learning (or AI) is a must-have feature for the product.
  • Big data is still a treasure chest with big potential.
  • Rule-based systems are still popular (are they old-fashioned?).
  • A hybrid system combining rules and AI looks promising.
  • Visualization can be helpful and user-friendly for customers.

Anyway, here are two nice summaries from seon.io and unfraud.com:

Screen Shot 2017-06-04 at 10.28.59 AM

Screen Shot 2017-06-05 at 10.30.06 AM

Fraud detection for eCommerce

If you work in e-commerce, besides making sure online payment is smooth, another critical task is dealing with fraud! Fraud shares many characteristics with (information) security. The good guys and the bad guys are locked in an endless fight, just like in Marvel comics: superheroes vs. supervillains.

Credit: Marvel Comics

Most companies either build an in-house solution or use a commercially available one. So what is the core of a fraud management system?

Before answering this question, let's go through a simple e-commerce checkout flow (in reality, it is much more complex):


In the last five years, I have worked on three fraud management systems. In plain terms, we gather all possible “evidence” against fraudsters and try to convict them of the “crime”. In technical terms: a payment transaction arrives at a rule engine, which runs a set of rules in real time and then outputs a decision such as reject, approve, or review. Based on the configuration and the business model (Merchant of Record or not, etc.), the payment system takes the corresponding action.
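
To make the flow concrete, here is a minimal sketch of such a rule engine in Python. It is not how Drools or any of the systems I worked on is implemented. The transaction fields, the rule names, and the thresholds are all hypothetical, and "the most severe decision wins" is just one possible way to combine rules.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List


class Decision(IntEnum):
    """Ordered by severity so the strictest outcome wins."""
    APPROVE = 0
    REVIEW = 1
    REJECT = 2


@dataclass
class Transaction:
    # Hypothetical fields; a real transaction carries far more evidence.
    amount: float
    billing_country: str
    ip_country: str
    attempts_last_hour: int


@dataclass
class Rule:
    name: str
    predicate: Callable[[Transaction], bool]
    outcome: Decision


# Hypothetical rules and thresholds, for illustration only.
RULES: List[Rule] = [
    Rule("country_mismatch", lambda t: t.billing_country != t.ip_country, Decision.REVIEW),
    Rule("high_amount", lambda t: t.amount > 1000, Decision.REVIEW),
    Rule("velocity", lambda t: t.attempts_last_hour > 5, Decision.REJECT),
]


def evaluate(tx: Transaction, rules: List[Rule] = RULES) -> Decision:
    """Run every rule and return the most severe outcome that fired."""
    fired = [rule.outcome for rule in rules if rule.predicate(tx)]
    return max(fired, default=Decision.APPROVE)
```

A transaction that triggers no rule is approved, and the payment system then acts on the returned decision. Keeping the rules as data rather than as hard-coded branches is what lets analysts add or tune rules without touching the engine.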

To build the rule engine, Drools is a popular choice. Of course, we can build a similar in-house version too. The key is the rules. Here is a list of some rules:

Any single item above could be a post of its own, but I hope you get the basic idea.

Some thoughts on microservices

I have been on the Horizon project for months now and have learned lots of new things, so it might be a good time to document some of them.

Switching from a monolithic service to microservices means a total mindset change. The first thing is to become a DevOps engineer: you need to write, debug, and test your code, build and deploy it locally and remotely, and monitor the service. This is very challenging in a mature company because these functions are split across different teams and organizations.

Here are some components involved in building microservices:

  • service registration/discovery
  • config management
  • (distributed) cache
  • logging/alerting/monitoring
  • (distributed) database
  • authentication for internal and external users

Some things need special handling:

  • how to handle exceptions: you can no longer throw an exception in one service and hope another service catches it the old-fashioned way.
  • how to pass data between services: what if several services need to process the same request at different stages, you need to merge the replies from a couple of services, or some services need the same intermediate data?
  • how to resolve service dependencies: the business logic can be synchronous or asynchronous.
  • how to handle failover: if a service is not available, will the caller retry, and how?
  • how to document your service: this affects how easily users can use the APIs.

More to add later…

Hypermedia API

What is the term? “The simultaneous presentation of information and controls such that the information becomes the affordance through which the user obtains choices and selects actions.” – Roy Fielding

To understand this, we first need to understand the benefit: decoupling the client and the server of a RESTful API. In current API development, the client is coded against the API specification, so any change on the server side may break the client code.

How do we solve this? Imagine that the client follows the actions offered by the server, and that its code is adaptive enough to handle those actions. Then the server can evolve independently.

HATEOAS, an abbreviation for Hypermedia as the Engine of Application State, is a constraint on the REST application architecture.
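
As a rough illustration of the idea, here is a small Python sketch in which the client never builds a URL itself and only follows the links found in each response. Everything in it is an assumption made for the example: the in-memory `FAKE_SERVER` stands in for real HTTP calls, the `_links` layout borrows the HAL convention, and the order URLs and relation names (`pay`, `cancel`) are made up.

```python
from typing import Any, Dict, List

# A hypothetical in-memory "server": URL -> response body. The `_links` key
# follows the HAL convention; the URLs and relation names are made up.
FAKE_SERVER: Dict[str, Dict[str, Any]] = {
    "/orders/42": {
        "status": "unpaid",
        "_links": {
            "self": {"href": "/orders/42"},
            "pay": {"href": "/orders/42/payment"},
            "cancel": {"href": "/orders/42/cancellation"},
        },
    },
    "/orders/42/payment": {
        "status": "paid",
        "_links": {"self": {"href": "/orders/42"}},
    },
}


def fetch(url: str) -> Dict[str, Any]:
    """Stand-in for an HTTP request against the server."""
    return FAKE_SERVER[url]


def available_actions(resource: Dict[str, Any]) -> List[str]:
    """The actions the server currently offers, minus the self link."""
    return sorted(rel for rel in resource.get("_links", {}) if rel != "self")


def follow(resource: Dict[str, Any], rel: str) -> Dict[str, Any]:
    """Follow a link by relation name; the client never builds the URL itself."""
    links = resource.get("_links", {})
    if rel not in links:
        raise LookupError(f"action {rel!r} is not offered in this state")
    return fetch(links[rel]["href"])
```

With this shape, the unpaid order offers `pay` and `cancel`, while the paid order offers neither. The server can rename `/orders/42/payment` or add new actions, and a client that only looks up relation names keeps working.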

Check out some examples: