Hadoop Ecosystem

7 years ago, the Hadoop ecosystem was under rapid development. Now lots of projects are mature enough and ready for production deployment.

hadoop-ecosystem
Credit to Mercy (Ponnupandy) Beckham

Here is my personal pick for you to get start your Hadoop journey.

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

  • YARN (Distributed Resource Management): Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
  • Spark (Distributed Programming): Apache Spark™ is a fast and general engine for large-scale data processing. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez (Distributed Programming): Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.
  • Hive (SQL-On-Hadoop): The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
  • Hbase (Column Data Model NoSQL): Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
  • Cassandra (Column Data Model NoSQL): The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
  • MongoDB (Document Data Model NoSQL): MongoDB is a document database with the scalability and flexibility that you want with the querying and indexing that you need. MongoDB stores data in flexible, JSON-like documents.
  • Redis (Key-Value Data Model NoSQL): Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes with radius queries. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
  • Flume (Data Ingestion): Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
  • Sqoop (Data Ingestion): Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Kafka (Data Ingestion): Kafka™ is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
  • Thrift (Service Programming): The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
  • ZooKeeper (Service Programming): A high-performance coordination service for distributed applications.
  • Mahout (Machine Learning): The Apache Mahout™ project’s goal is to build an environment for quickly creating scalable performant machine learning applications.
  • Oozie (Scheduling): Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Reference:

  1. http://hadoopecosystemtable.github.io
  2. https://mydataexperiments.com/2017/04/11/hadoop-ecosystem-a-quick-glance/
Advertisements

Rule based, Machine learning or a hyper?

Fraud detection and prevention is always a batter field. Fraudsters will keep finding new ways to game the system. On the other hand, that is a big business opportunity. Let see how crowded the market is:

fraud

This is a hand selected list of some notable players in this markets. There are couple interesting observations:

  • Machine learning (or AI) is a must-have feature to the product.
  • Big data is still a treasure box with big potential.
  • Rule based system is still popular (it is old fashion?).
  • A hyper system combining rules and AI will be promising.
  • Visualization can be helpful and user friendly for customers.

Anyway, here are two nice summaries from seon.io and unfraud.com:

Screen Shot 2017-06-04 at 10.28.59 AM

Screen Shot 2017-06-05 at 10.30.06 AM

Fraud detection for eCommerce

If you work on e-commerce, beside making sure the online payment is smooth, another critical task is to deal with fraud! Fraud shares lots of common characteristics with (information) security. Good guys and bad guys are always fighting endlessly, just like Marvel comics: super heroes vs. super villains.

marvel
Credit: Marvel Comics

Most company either builds an in-house solution or use some market available solution. So what is the hard core of a fraud management system?

Before answering this question, maybe we can go through a simple e-commerce checkout flow (you know in reality, it will be much more complex):

checkout

In last 5 years, I have worked on three fraud management systems. In a plaintext, we gather all possible “evidences” of fraudsters and try to convict the “crime”. Translate the previous sentence to technical words: a payment transaction comes to a rule engine, it will run bunch of rules at real time,  then outputs a decision like reject, approve, review, etc. Based on the configuration and the business model (Merchant on Record or not, etc.) , the payment system will take corresponding action.

To build the rule engine, Drools is a popular choice. Of cause, we can build a similar in-house version too. The key is the rules. Here is a list of some rules:

Any single item above in details can be an individual post. But hope you get some basic ideas.

(Updated) Java Example Code using HBase Data Model Operations

This is the updated version to previous post on 2012: Java Example Code using HBase Data Model Operations

This time we will use latest stable Hbase (1.2.5), IntelliJ and Maven.

  • My local test environment: Virtual Box of Ubuntu 16 Desktop version.
  • Install HBase: https://hbase.apache.org/book.html#quickstart
    • Note: I use stable version 1.2.5 instead of 2.0.0
    • Open a terminal
      1. start HBase in terminal: bin/start-hbase.sh
      2. start HBase shell: bin/hbase shell
      3. create a table: create ‘myLittleHBaseTable’, ‘myLittleFamily’
  • To install JDK: sudo apt install openjdk-8-jre-headless
  • To install Maven: sudo apt install maven
  • Create a maven project (replace “autofei” to whatever you like): mvn archetype:generate -DgroupId=autofei -DartifactId=autofei -DarchetypeArtifactId=maven-archetype-quickstart
  • Download IntelliJ community version

Open the project in IntelliJ and add following dependencies to the pom.xml:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.7.3</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>1.2.5</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>1.2.5</version>
</dependency>

Enable auto-import function for IntelliJ to easy your life 🙂

The example code in the old post will still work but with some deprecation warning.

 

Logstash grok sample

Of cause, the best place is the official guide: https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html

Instead of reinventing the wheel, check out existing patterns: https://github.com/logstash-plugins/logstash-patterns-core/tree/master/patterns

You can also define your own pattern, here is the online tool for you to test it: http://grokdebug.herokuapp.com/

Here is an example:

  • log record: 09:33:45,416 (metrics-logger-reporter-1-thread-1) type=GAUGE, name=notifications.received, value=2
  • pattern: (?<logtime>%{HOUR}:%{MINUTE}:%{SECOND}) (?<logthread>[()a-zA-Z0-9-]+) type=(?<type>[A-Z]+), name=(?<name>[A-Za-z.]*), value=(?<value>[0-9]+)

More example: https://www.elastic.co/guide/en/logstash/current/config-examples.html

Have fun to play it!

Reference:

Visualize Geo location of log using Elasticsearch + Logstash + Kibana

Visualize Geo location of log using Elasticsearch + Logstash + Kibana

Here is a visualization of an access log based on the sample access log data.

So it looks pretty cool and if you have ELK stack in your local. It will take only a little time for you to achieve this.

Please first refer to this article: http://gro.solr.pl/elasticsearch-logstash-kibana-to-geo-identify-our-users/

If everything works fine for you, that is great! If the visualization doesn’t load, please continue your reading.

Here is the software version, just in case you want to know:

  • elasticsearch-5.1.1
  • kibana-5.1.1-darwin-x86_64
  • logstash-5.1.1

I guess you might get error:No Compatible Fields: The “logs_*” index pattern does not contain any of the following field types: geo_point

The reason is there is no template to match this index. But logstach load a default template to elasticsearch which actually contain the geo mapping.  In Kibana “Dev Tools”, inside Console, type: “GET /_template/” and you will see “logstach” contains “geoip” section. So make sure the output index has “logstash-” as the prefix.

Also, if you want to use the latest Geo IP data, instead of the preload one. You can download “GeoLite2-City.mmdb.gz” from here: http://dev.maxmind.com/geoip/geoip2/geolite2/

So finally, here is my logstach config file:

input {
  file {
    path => "path to your log, for example: ~/Downloads/Tools/log/apache/*.log"
    type => "apache"
    start_position => "beginning"
  }
}

filter {
    grok {
      match => {
        "message" => "%{COMBINEDAPACHELOG}"
      }
    }

   geoip {
    source => "clientip"
    database => "path to your Geo IP data file, for example: ~/Downloads/Tools/GeoLite2-City.mmdb"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-logs_%{+YYYY.MM.dd}"
    manage_template => true
  }
}

Update [2017-01-13]

Dig a little more on this issue and here are some new founding. In the Kibana UI, it is looking for  Geohash -> geoip.location in the buckets.  (if you know how to change the config, please let me know, thanks!)

So you have to have that field in the index. Otherwise, the tile map can’t find any record. This explains why it will work with index “ligstash-” prefix. In logstach log, you can find this template:

[2017-01-13T09:46:35,015][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>50001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"_all"=>{"enabled"=>true, "norms"=>false}, "dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword"}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date", "include_in_all"=>false}, "@version"=>{"type"=>"keyword", "include_in_all"=>false}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}

Reference: