Hadoop Ecosystem

7 years ago, the Hadoop ecosystem was under rapid development. Now lots of projects are mature enough and ready for production deployment.

Credit to Mercy (Ponnupandy) Beckham

Here is my personal pick for you to get start your Hadoop journey.

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

  • YARN (Distributed Resource Management): Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
  • Spark (Distributed Programming): Apache Spark™ is a fast and general engine for large-scale data processing. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez (Distributed Programming): Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.
  • Hive (SQL-On-Hadoop): The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
  • Hbase (Column Data Model NoSQL): Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
  • Cassandra (Column Data Model NoSQL): The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
  • MongoDB (Document Data Model NoSQL): MongoDB is a document database with the scalability and flexibility that you want with the querying and indexing that you need. MongoDB stores data in flexible, JSON-like documents.
  • Redis (Key-Value Data Model NoSQL): Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes with radius queries. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
  • Flume (Data Ingestion): Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
  • Sqoop (Data Ingestion): Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Kafka (Data Ingestion): Kafka™ is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
  • Thrift (Service Programming): The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
  • ZooKeeper (Service Programming): A high-performance coordination service for distributed applications.
  • Mahout (Machine Learning): The Apache Mahout™ project’s goal is to build an environment for quickly creating scalable performant machine learning applications.
  • Oozie (Scheduling): Oozie is a workflow scheduler system to manage Apache Hadoop jobs.


  1. http://hadoopecosystemtable.github.io
  2. https://mydataexperiments.com/2017/04/11/hadoop-ecosystem-a-quick-glance/

some thoughts on Microservice

Be in the Horzion project for months now, learn lots of new things and it might be a good time to document some learnings.

To switch from monolithic service to micro service means a total mightset change. The frist thing is to become a DevOps. You need to write/debug/test your code, build/deploy the code locally and remotely and monitor the service. This is very chanllenging in mature company because thses functions are isolated into different teams/organizations.

Here are some components in building microservice:

  • service register/discover
  • config management
  • (distributed) cache
  • log/alert/monitor
  • (distrobued) database
  • authentication for internal and external users

something need special handling:

  • how to handle exception: you can throws an exception in one service and hope another service catch it in old fasion.
  • how to pass data between services: if several services need to process the same request at different stage. if you need to merge the reply from couple services? or some service need the same intermedian data?
  • how to resolve service dependency: the business logic can be sync or async.
  • how to handle failover: if a service not available, will it retry and how?
  • how to document your service: this will impact how user can easily use the APIs

More to add later…

I love APIs


Here are some nice topics or “buzz words”.  The first one is from Amazon Micro-service dev-ops team. Actually, building micro service is a company culture not a practice. To achieve it, Amazon’s Jeff Bezos has a solution for this problem. He calls it the “two pizza rule”: Never have a meeting where two pizzas couldn’t feed the entire group. That is also true for a dev team. And also the team in charge of the whole life circle of the service: road map, technology, dev, QA, on-call, etc. In this way, the team is empowered and responsible for it.

Buzz words:

Another session is from Netflix, Daniel Jacobson. He describes how the micro servers overall picture in Netflix.

Buzz words:

Extra reading

Other buzz words:


Java Queue implementations

In general, Queue is “A collection designed for holding elements prior to processing. Besides basic Collection operations, queues provide additional insertion, extraction, and inspection operations.”

There are two main catalogs:


ConcurrentHashMap, Hashtable and Synchronized Map in Java

All these three class can be used in the multiple thread environment. Here is a summary of the difference:

  • Hashtable is since Java 1.0. The implementation lock the whole table.
  • Synchronized Map is since Java 1.2 under Collection framework. Please refer to the best practice. You also need to lock the whole map.
  • ConcurrentHashMap is since Java 1.5 and also a member of Collection framework. The map is partitioned into segments based on concurrencyLevel. So each segment can have a lock to improve the performance.

A detail comparison is at: http://javarevisited.blogspot.sg/2011/04/difference-between-concurrenthashmap.html


“This class implements a hash table, which maps keys to values. Any non-null object can be used as a key or as a value.”

Synchronized Map

“It is imperative that the user manually synchronize on the returned map when iterating over any of its collection views:”

Map m = Collections.synchronizedMap(new HashMap());
  Set s = m.keySet();  // Needn't be in synchronized block
  synchronized (m) {  // Synchronizing on m, not s!
      Iterator i = s.iterator(); // Must be in synchronized block
      while (i.hasNext())


“A hash table supporting full concurrency of retrievals and high expected concurrency for updates. This class obeys the same functional specification as Hashtable, and includes versions of methods corresponding to each method of Hashtable. However, even though all operations are thread-safe, retrieval operations do notentail locking, and there is not any support for locking the entire table in a way that prevents all access. This class is fully interoperable with Hashtable in programs that rely on its thread safety but not on its synchronization details.”

Always use putIfAbsent instead of following code to achieve atomic action and better performance:

if (!map.containsKey(key))
   return map.put(key, value);
   return map.get(key);


Servlet in nutshell

What is Servlet

A servlet is a Java program that runs in a Web server, as opposed to an applet that runs in a client browser. Typically, the servlet takes an HTTP request from a browser, generates dynamic content (such as by querying a database), and provides an HTTP response back to the browser. Alternatively, the servlet can be accessed directly from another application component or send its output to another component. Most servlets generate HTML text, but a servlet may instead generate XML to encapsulate data.

More specifically, a servlet runs in a J2EE application server. Servlets are one of the main application component types of a J2EE application, along with JavaServer Pages (JSP) and EJB modules, which are also server-side J2EE component types.

Prior to servlets, Common Gateway Interface (CGI) technology was used for dynamic content, with CGI programs being written in languages such as Perl and being called by a Web application through the Web server. CGI ultimately proved less than ideal, however, due to its architecture and scalability limitations.

Servlet containers, sometimes referred to as servlet engines, execute and manage servlets. The servlet container calls servlet methods and provides services that the servlet needs while executing. A servlet container is usually written in Java and is either part of a Web server (if the Web server is also written in Java) or is otherwise associated with and used by a Web server.

key class

  • javax.servlet.Servlet -> javax.servlet.GenericServlet -> javax.servlet.http.HttpServlet
  • javax.servlet.http.HttpSession
  • javax.servlet.ServletContext
  • HttpRequest
  • HttpResponse

What is filter

A Servlet filter is an object that can intercept HTTP requests targeted at your web application.

Description of Figure 3-1  follows

The order in which filters are invoked depends on the order in which they are configured in the web.xml file. The first filter in web.xml is the first one invoked during the request, and the last filter in web.xml is the first one invoked during the response. Note the reverse order during the response.

What is event listener

The servlet specification includes the capability to track key events in your Web applications through event listeners. This functionality allows more efficient resource management and automated processing based on event status.

The event classes are as follows:

  1. ServletRequestEvent
  2. ServletContextEvent
  3. ServletRequestAttributeEvent
  4. ServletContextAttributeEvent
  5. HttpSessionEvent
  6. HttpSessionBindingEvent
Event Listener Categories and Interfaces
Event Category Event Descriptions Java Interface
Servlet context lifecycle changes Servlet context creation, at which point the first request can be serviced

Imminent shutdown of the servlet context

javax.servlet. ServletContextListener
Servlet context attribute changes Addition of servlet context attributes

Removal of servlet context attributes

Replacement of servlet context attributes

javax.servlet. ServletContextAttributeListener
Session lifecycle changes Session creation

Session invalidation

Session timeout

javax.servlet.http. HttpSessionListener
Session attribute changes Addition of session attributes

Removal of session attributes

Replacement of session attributes

javax.servlet.http. HttpSessionAttributeListener