Saturday, November 22, 2014

Data Mining using Java

Overview

Data mining is the process of extracting useful insights from large volumes of data by identifying patterns and establishing relationships among them. It is a very general term that covers many business problems, such as determining the best selling price for a commodity, choosing its suppliers, finding customers' purchase patterns, understanding browsing trends, or generating recommendations for customers.

Mostly, the data mining process is performed on:

  1. Data warehouses and databases (transactional, sales, purchase).

  2. Clickstream and internet data (scraped or extracted) and logs.


In this tutorial, I will focus on one use case: extracting useful information from the web.

Web Mining

Web mining is the process of applying data mining algorithms to large volumes of data captured from the World Wide Web.

Use Case

Recently, I worked on a Java-based project in which we scraped data and then mined useful information from it. Java has numerous tools for scraping a web page, parsing it, and extracting information from it. Below are some of the approaches we tried in order to solve our business scenario.

DOM Parsing

We initially created an XML-based configuration file for each web page. The scraped data was parsed into a DOM tree, which was then matched against the configuration file to extract the useful information. This approach is very CPU- and memory-intensive. Also, every new web page needs its own XML configuration, which is a pain to maintain.
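To make the idea concrete, here is a minimal sketch of configuration-driven DOM extraction. It is not our project code. It assumes the page is already well-formed XHTML (real pages usually need an HTML-tolerant parser first) and that the per-page configuration boils down to a map from field names to XPath expressions. The class name `DomExtractor`, the field names, and the sample markup are all hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomExtractor {

    // Parses well-formed markup into a DOM tree and evaluates one XPath per configured field.
    // The field-to-XPath map stands in for the per-page XML configuration (an assumption).
    public static Map<String, String> extract(String xhtml, Map<String, String> fieldToXPath) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            Map<String, String> values = new LinkedHashMap<>();
            for (Map.Entry<String, String> field : fieldToXPath.entrySet()) {
                // evaluate(String, Object) returns the string value of the first matching node.
                values.put(field.getKey(), xpath.evaluate(field.getValue(), document).trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the page", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page and configuration, for illustration only.
        String page = "<html><body><div id=\"product\">"
                + "<span class=\"price\">19.99</span>"
                + "<span class=\"stock\">In Stock</span>"
                + "</div></body></html>";

        Map<String, String> config = new LinkedHashMap<>();
        config.put("price", "//span[@class='price']");
        config.put("availability", "//span[@class='stock']");

        System.out.println(extract(page, config));
    }
}
```

The whole page has to be held in memory as a tree before any field can be read, which is one reason this style of extraction gets expensive on large pages.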

Pattern Parsing

We gradually moved to a better solution. In this solution, we first extract the patterns from each web page at a point in time when we know the product's availability status, and store them in Elasticsearch, using the code below:

[code language="java"]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Collects the text surrounding every match of productSupply in htmlSource.
 * productSupply is compiled as a regular expression; each snippet spans up to
 * RANGE characters on either side of the match start and is followed by "^^^".
 */
public static String getPatterns(String productSupply, String htmlSource, int RANGE) {
    if (htmlSource == null) {
        return "";
    }

    StringBuilder matchPatterns = new StringBuilder();
    Matcher matcher = Pattern.compile(productSupply).matcher(htmlSource);

    while (matcher.find()) {
        int indexPrice = matcher.start();
        // Clamp the window to the bounds of the page source.
        int beginIndex = Math.max(0, indexPrice - RANGE);
        int endIndex = Math.min(htmlSource.length(), indexPrice + RANGE);
        matchPatterns.append(htmlSource, beginIndex, endIndex).append("^^^");
    }

    return matchPatterns.toString();
}
[/code]
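If the availability marker is a plain string rather than a regular expression, a variation along the following lines may be easier to work with. This is only a sketch: the class name `ContextWindows` and the method `around` are hypothetical. It makes two assumptions that differ from the code above: the marker is matched literally (via `Pattern.quote`), and the snippets are returned as a list instead of one delimiter-joined string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextWindows {

    // Returns the text within `range` characters before and after the start
    // of each literal occurrence of `marker` in `htmlSource`.
    public static List<String> around(String marker, String htmlSource, int range) {
        List<String> windows = new ArrayList<>();
        if (marker == null || marker.isEmpty() || htmlSource == null) {
            return windows;
        }

        // Pattern.quote makes characters such as '.', '(' or '$' match themselves.
        Matcher matcher = Pattern.compile(Pattern.quote(marker)).matcher(htmlSource);
        while (matcher.find()) {
            int begin = Math.max(0, matcher.start() - range);
            int end = Math.min(htmlSource.length(), matcher.start() + range);
            windows.add(htmlSource.substring(begin, end));
        }
        return windows;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, for illustration only.
        String html = "<div class=\"stock\">In Stock</div><p>Ships today</p>";
        System.out.println(around("In Stock", html, 10));
    }
}
```

Returning a list avoids having to pick a delimiter that can never appear in the page source.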

We then match each extracted pattern against the scraped page using Java regex and find the product availability status using the code below:

[code language="java"]
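import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Looks up each stored snippet in htmlSource and returns the text around
 * every place it occurs (up to RANGE characters on either side).
 */
public static String[] getProductSupplyMatches(String productPatterns, String htmlSource, int RANGE) {
    if (productPatterns == null || htmlSource == null) {
        return new String[0];
    }

    List<String> productSupply = new ArrayList<>();

    // "^^^" is regex syntax, so quote it to split on the literal delimiter.
    for (String patternMatch : productPatterns.split(Pattern.quote("^^^"))) {
        if (patternMatch.isEmpty()) {
            continue;
        }
        // Stored snippets are raw HTML, so match them literally rather than as regexes.
        Matcher matcher = Pattern.compile(Pattern.quote(patternMatch)).matcher(htmlSource);
        while (matcher.find()) {
            int indexPrice = matcher.start();
            // Clamp the window to the bounds of the page source.
            int beginIndex = Math.max(0, indexPrice - RANGE);
            int endIndex = Math.min(htmlSource.length(), indexPrice + RANGE);
            productSupply.add(htmlSource.substring(beginIndex, endIndex));
        }
    }

    return productSupply.toArray(new String[0]);
}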
[/code]
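One detail is easy to get wrong with the `^^^` delimiter: `String.split` treats its argument as a regular expression, and `^` is the start-of-input anchor, so the delimiter has to be quoted to be matched literally. The short sketch below shows the difference. The class name `DelimiterDemo`, the helper `splitSnippets`, and the sample snippets are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class DelimiterDemo {

    static final String DELIMITER = "^^^";

    // Splits on the literal delimiter; Pattern.quote stops '^' being read as a regex anchor.
    public static String[] splitSnippets(String joined) {
        return joined.split(Pattern.quote(DELIMITER));
    }

    public static void main(String[] args) {
        // Two hypothetical snippets joined the same way as the stored patterns.
        String joined = "<b>In Stock</b>" + DELIMITER + "<i>Sold out</i>" + DELIMITER;

        // Unquoted: "^^^" is parsed as three anchors, so the snippets are not separated.
        System.out.println(Arrays.toString(joined.split(DELIMITER)));

        // Quoted: the two snippets come back as separate elements.
        System.out.println(Arrays.toString(splitSnippets(joined)));
    }
}
```

The same care applies to the snippets themselves: raw HTML often contains characters such as `(`, `?` or `$` that have a special meaning in a regular expression.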

These are a few of the approaches we adopted during development. I will keep you posted on further changes.

Conclusion

In this tutorial, I summarized my experience working on web mining algorithms.