Rule-based, machine learning, or a hybrid?

Fraud detection and prevention is a constant battlefield: fraudsters keep finding new ways to game the system. On the other hand, that makes it a big business opportunity. Let's see how crowded the market is:


This is a hand-selected list of some notable players in this market. A couple of interesting observations:

  • Machine learning (or AI) is a must-have feature of the product.
  • Big data is still a treasure box with big potential.
  • Rule-based systems are still popular (old-fashioned?).
  • A hybrid system combining rules and AI looks promising.
  • Visualization can be helpful and user-friendly for customers.
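To make the hybrid idea concrete, here is a toy sketch (entirely illustrative; the class name, field names, threshold, and amount limit are all made up, not from any vendor's product) of a scorer that applies a hard business rule first and falls back to a machine-learned score:

```java
import java.util.Map;

// Toy hybrid fraud scorer: a hard rule backed by a model score.
public class HybridScorer {
    // Hypothetical rule: flag any transaction above a hard amount limit.
    static final double AMOUNT_LIMIT = 10_000.0;

    // Stand-in for a trained model; here it just reads a precomputed score.
    static double modelScore(Map<String, Double> features) {
        return features.getOrDefault("riskScore", 0.0);
    }

    static boolean isFraud(Map<String, Double> features) {
        if (features.getOrDefault("amount", 0.0) > AMOUNT_LIMIT) {
            return true; // the rule fires first, no model needed
        }
        return modelScore(features) > 0.9; // otherwise defer to the model
    }

    public static void main(String[] args) {
        System.out.println(isFraud(Map.of("amount", 50000.0)));
        System.out.println(isFraud(Map.of("amount", 20.0, "riskScore", 0.2)));
    }
}
```

The appeal of the hybrid setup is that rules catch known patterns deterministically, while the model covers the long tail.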

Anyway, here are two nice summaries:

(screenshots of the two summaries)


Logstash grok sample

Of course, the best place to start is the official guide:

Instead of reinventing the wheel, check out existing patterns:

You can also define your own pattern; here is an online tool for testing it:

Here is an example:

  • log record: 09:33:45,416 (metrics-logger-reporter-1-thread-1) type=GAUGE, name=notifications.received, value=2
  • pattern: (?<logtime>%{HOUR}:%{MINUTE}:%{SECOND}) (?<logthread>[()a-zA-Z0-9-]+) type=(?<type>[A-Z]+), name=(?<name>[A-Za-z.]*), value=(?<value>[0-9]+)
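As a sketch (using the pattern and field names from the example above verbatim), the custom pattern could be dropped into a Logstash filter block like this:

```conf
filter {
  grok {
    match => {
      "message" => "(?<logtime>%{HOUR}:%{MINUTE}:%{SECOND}) (?<logthread>[()a-zA-Z0-9-]+) type=(?<type>[A-Z]+), name=(?<name>[A-Za-z.]*), value=(?<value>[0-9]+)"
    }
  }
}
```

After this filter runs, the matched record carries logtime, logthread, type, name, and value as separate fields.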

More examples:

Have fun playing with it!


Visualize Geo location of log using Elasticsearch + Logstash + Kibana


Here is a visualization of an access log based on the sample access log data.

It looks pretty cool, and if you have the ELK stack running locally, it will take only a little time to achieve this.

Please first refer to this article:

If everything works fine for you, that is great! If the visualization doesn’t load, please continue reading.

Here are the software versions, just in case you want to know:

  • elasticsearch-5.1.1
  • kibana-5.1.1-darwin-x86_64
  • logstash-5.1.1

I guess you might get this error: No Compatible Fields: The “logs_*” index pattern does not contain any of the following field types: geo_point

The reason is that no template matches this index. But Logstash loads a default template into Elasticsearch which actually contains the geo mapping. In Kibana “Dev Tools”, inside Console, type “GET /_template/” and you will see that the “logstash” template contains a “geoip” section. So make sure the output index has “logstash-” as the prefix.

Also, if you want to use the latest Geo IP data instead of the preloaded one, you can download “GeoLite2-City.mmdb.gz” from here:

So finally, here is my Logstash config file:

input {
  file {
    path => "path to your log, for example: ~/Downloads/Tools/log/apache/*.log"
    type => "apache"
    start_position => "beginning"
  }
}

filter {
  grok {
    match => {
      "message" => "%{COMBINEDAPACHELOG}"
    }
  }

  geoip {
    source => "clientip"
    database => "path to your Geo IP data file, for example: ~/Downloads/Tools/GeoLite2-City.mmdb"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-logs_%{+YYYY.MM.dd}"
    manage_template => true
  }
}

Update [2017-01-13]

Digging a little more into this issue, here are some new findings. In the Kibana UI, the tile map looks for Geohash -> geoip.location in the buckets. (If you know how to change this config, please let me know, thanks!)

So you have to have that field in the index; otherwise, the tile map can’t find any records. This explains why it works with the “logstash-” index prefix. In the Logstash log, you can find this template:

[2017-01-13T09:46:35,015][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>50001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"_all"=>{"enabled"=>true, "norms"=>false}, "dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword"}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date", "include_in_all"=>false}, "@version"=>{"type"=>"keyword", "include_in_all"=>false}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
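As a sketch (the template name logs_geo and the exact body below are my own, modeled on the default template above, not from the original post), you could also install your own template for a non-“logstash-” index prefix, so that geoip.location is mapped as geo_point there too:

```conf
PUT /_template/logs_geo
{
  "template": "logs_*",
  "mappings": {
    "_default_": {
      "properties": {
        "geoip": {
          "dynamic": true,
          "properties": {
            "location": { "type": "geo_point" }
          }
        }
      }
    }
  }
}
```

With that in place, new “logs_*” indices should get the geo_point mapping and show up in the tile map.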


Install and run Mahout on a single Linux box

This post shows you how to install and run Mahout on a stand-alone Linux box.

Prerequisites for Building Mahout

  • Java JDK >=1.6
  • Maven
  • SVN


  • svn co
  • change to the checked-out directory
  • mvn install
  • change to the core directory
  • mvn compile
  • mvn install
  • change to the examples directory
  • mvn compile
  • mvn install

Download the test data (the “MovieLens 1M” dataset) from:

Run test example

Note: replace the test data file path with yours.

  • mvn -e exec:java -Dexec.mainClass="" -Dexec.args="-i /home/hduser/trunk/examples/ml-1m/ratings.dat
    + Error stacktraces are turned on.
    [INFO] Scanning for projects...
    [INFO] Searching repository for plugin with prefix: 'exec'.
    [INFO] ------------------------------------------------------------------------
    [INFO] Building Mahout Examples
    [INFO]    task-segment: [exec:java]
    [INFO] ------------------------------------------------------------------------
    [INFO] Preparing exec:java
    [INFO] No goals needed for project - skipping
    [INFO] [exec:java {execution: default-cli}]
    12/03/28 14:08:33 INFO file.FileDataModel: Creating FileDataModel for file /tmp/ratings.txt
    12/03/28 14:08:33 INFO file.FileDataModel: Reading file info...
    12/03/28 14:08:34 INFO file.FileDataModel: Processed 1000000 lines
    12/03/28 14:08:34 INFO file.FileDataModel: Read lines: 1000209
    12/03/28 14:08:35 INFO model.GenericDataModel: Processed 6040 users
    12/03/28 14:08:35 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.9 of GroupLensDataModel
    12/03/28 14:08:35 INFO model.GenericDataModel: Processed 1753 users
    12/03/28 14:08:36 INFO slopeone.MemoryDiffStorage: Building average diffs...
    12/03/28 14:09:36 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation of 1719 users
    12/03/28 14:09:36 INFO eval.AbstractDifferenceRecommenderEvaluator: Starting timing of 1719 tasks in 1 threads
    12/03/28 14:09:36 INFO eval.StatsCallable: Average time per recommendation: 343ms
    12/03/28 14:09:36 INFO eval.StatsCallable: Approximate memory used: 448MB / 798MB
    12/03/28 14:09:36 INFO eval.StatsCallable: Unable to recommend in 0 cases
    12/03/28 14:09:43 INFO eval.StatsCallable: Average time per recommendation: 7ms
    12/03/28 14:09:43 INFO eval.StatsCallable: Approximate memory used: 510MB / 798MB
    12/03/28 14:09:43 INFO eval.StatsCallable: Unable to recommend in 13 cases
    12/03/28 14:09:52 INFO eval.AbstractDifferenceRecommenderEvaluator: Evaluation result: 0.7149488038906546
    12/03/28 14:09:52 INFO grouplens.GroupLensRecommenderEvaluatorRunner: 0.7149488038906546
    [INFO] ------------------------------------------------------------------------
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 1 minute 26 seconds
    [INFO] Finished at: Wed Mar 28 14:09:53 PDT 2012
    [INFO] Final Memory: 53M/761M
    [INFO] ------------------------------------------------------------------------

Creating a simple recommender

Create a Maven project

mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.autofei -DartifactId=mahoutrec

This creates an empty project called mahoutrec with the package namespace com.autofei. Now change to the mahoutrec directory. You can try out the new project by running:

mvn compile
mvn exec:java -Dexec.mainClass="com.autofei.App"

Set the project dependencies
Edit pom.xml; remember to change your Mahout version (in my case, it is 0.7-SNAPSHOT). An example file:

<?xml version="1.0"?>
<project xsi:schemaLocation="" xmlns=""
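The key part of the pom (a sketch; the artifact name is the one Mahout's Taste recommenders shipped in at the time, and the version should match yours) is the Mahout core dependency:

```xml
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.7-SNAPSHOT</version>
</dependency>
```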

Test data
Put this data into a file dummy-bool.csv under the datasets directory


Create a Java file under src/main/java/com/autofei/, named UnresystBoolRecommend.java:

package com.autofei;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

import org.apache.commons.cli2.OptionException;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class UnresystBoolRecommend {

    public static void main(String... args) throws FileNotFoundException, TasteException, IOException, OptionException {

        // create data source (model) - from the csv file
        File ratingsFile = new File("datasets/dummy-bool.csv");
        DataModel model = new FileDataModel(ratingsFile);

        // create a simple recommender on our data
        CachingRecommender cachingRecommender = new CachingRecommender(new SlopeOneRecommender(model));

        // for all users
        for (LongPrimitiveIterator it = model.getUserIDs(); it.hasNext();) {
            long userId = it.nextLong();

            // get the recommendations for the user
            List<RecommendedItem> recommendations = cachingRecommender.recommend(userId, 10);

            // if empty write something
            if (recommendations.size() == 0) {
                System.out.print("User ");
                System.out.print(userId);
                System.out.println(": no recommendations");
            }

            // print the list of recommendations for each
            for (RecommendedItem recommendedItem : recommendations) {
                System.out.print("User ");
                System.out.print(userId);
                System.out.print(": ");
                System.out.println(recommendedItem);
            }
        }
    }
}

Run the code

  • mvn compile
  • mvn exec:java -Dexec.mainClass="com.autofei.UnresystBoolRecommend"
    [INFO] Scanning for projects...
    [INFO] Searching repository for plugin with prefix: 'exec'.
    [INFO] ------------------------------------------------------------------------
    [INFO] Building mahoutrec
    [INFO]    task-segment: [exec:java]
    [INFO] ------------------------------------------------------------------------
    [INFO] Preparing exec:java
    [INFO] No goals needed for project - skipping
    [INFO] [exec:java {execution: default-cli}]
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See for further details.
    User 1: RecommendedItem[item:5, value:1.0]
    User 2: RecommendedItem[item:5, value:1.0]
    User 2: RecommendedItem[item:3, value:1.0]
    User 3: no recommendations
    User 4: no recommendations
    User 5: RecommendedItem[item:5, value:1.0]
    User 5: RecommendedItem[item:3, value:1.0]
    [INFO] ------------------------------------------------------------------------
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 3 seconds
    [INFO] Finished at: Wed Mar 28 16:18:31 PDT 2012
    [INFO] Final Memory: 14M/35M
    [INFO] ------------------------------------------------------------------------

From here, you can test other algorithms inside Mahout.


Real-time data analysis frameworks (or stream system)

Kafka: Kafka is a messaging system that was originally developed at LinkedIn to serve as the foundation for LinkedIn’s activity stream processing pipeline. Nice talk

S4: S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

Hedwig: Hedwig is a publish-subscribe system designed to carry large amounts of data across the internet in a guaranteed-delivery fashion from those who produce it (publishers) to those who are interested in it (subscribers).

Storm: Storm is a distributed, reliable, and fault-tolerant stream processing system. Its use cases are so broad that we consider it to be a fundamental new primitive for data processing. Introduction slide

Flume: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop’s HDFS.

Scribe: Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures.

Data analysis in Excel

Excel is a common tool that we use very frequently, but some powerful functions stay hidden until we mine them out. (I will keep this post updated.)

FREQUENCY function [1]

“This useful function can analyse a series of values and summarise them into a number of specified ranges.” Or simply put: a frequency distribution. Once you have the distribution, you can draw a histogram [2] to show it.
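To illustrate what FREQUENCY computes (this Java sketch is mine, not Excel's implementation), here is the same bucketing logic in code: for each bin bound, count the values at or below it that exceed the previous bound, with one extra overflow bucket at the end:

```java
import java.util.Arrays;

public class FrequencyDemo {
    // Mimics Excel's FREQUENCY(data_array, bins_array): slot i counts values v
    // with bins[i-1] < v <= bins[i]; the final slot counts v above the last bin.
    static int[] frequency(double[] data, double[] bins) {
        double[] sorted = bins.clone();
        Arrays.sort(sorted);
        int[] counts = new int[sorted.length + 1];
        for (double v : data) {
            int i = 0;
            while (i < sorted.length && v > sorted[i]) {
                i++; // walk past every bound this value exceeds
            }
            counts[i]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] data = {1, 3, 5, 7, 9};
        double[] bins = {2, 5, 8};
        System.out.println(Arrays.toString(frequency(data, bins)));
    }
}
```

For the sample above, the buckets are (≤2), (2, 5], (5, 8], and (>8), so the counts come out as 1, 2, 1, and 1.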