matfax/dbscan-on-spark2


An implementation of DBSCAN running on top of Apache Spark 2.0

Download


Step 1. Add the JitPack repository to your build file

Add it to your root settings.gradle at the end of the repositories block:

	dependencyResolutionManagement {
		repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
		repositories {
			mavenCentral()
			maven { url 'https://jitpack.io' }
		}
	}

Add it to your settings.gradle.kts at the end of the repositories block:

	dependencyResolutionManagement {
		repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
		repositories {
			mavenCentral()
			maven { url = uri("https://jitpack.io") }
		}
	}

Add the repository to your pom.xml:

	<repositories>
		<repository>
		    <id>jitpack.io</id>
		    <url>https://jitpack.io</url>
		</repository>
	</repositories>

Add it to your build.sbt at the end of resolvers:

	resolvers += "jitpack" at "https://jitpack.io"

Add it to your project.clj at the end of repositories:

	:repositories [["jitpack" "https://jitpack.io"]]

Step 2. Add the dependency

Add it to your build.gradle:

	dependencies {
		implementation 'com.github.matfax:dbscan-on-spark:v0.3.1'
	}

Add it to your build.gradle.kts:

	dependencies {
		implementation("com.github.matfax:dbscan-on-spark:v0.3.1")
	}

Add it to your pom.xml:

	<dependency>
	    <groupId>com.github.matfax</groupId>
	    <artifactId>dbscan-on-spark</artifactId>
	    <version>v0.3.1</version>
	</dependency>

Add it to your build.sbt:

	libraryDependencies += "com.github.matfax" % "dbscan-on-spark" % "v0.3.1"

Add it to your project.clj:

	:dependencies [[com.github.matfax/dbscan-on-spark "v0.3.1"]]

Readme


DBSCAN on Spark

Overview

This is an implementation of the DBSCAN clustering algorithm on top of Apache Spark. It is loosely based on the paper by He, Yaobin, et al., "MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data".
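
If you are new to DBSCAN itself: the algorithm takes a neighborhood radius (eps) and a density threshold (minPoints), marks a point as a core point when at least minPoints points fall within eps of it, and grows clusters outward from core points; everything left over is noise. The following is a minimal plain-Scala sketch of that core-point rule, not this library's API; all names in it are illustrative:

object DbscanRuleSketch {

  // True when at least `minPoints` points (including `p` itself) lie within
  // distance `eps` of `p` -- the DBSCAN core-point condition.
  def isCore(p: (Double, Double), points: Seq[(Double, Double)],
             eps: Double, minPoints: Int): Boolean = {
    val neighbors = points.count { q =>
      math.hypot(p._1 - q._1, p._2 - q._2) <= eps
    }
    neighbors >= minPoints
  }

  def main(args: Array[String]): Unit = {
    val points = Seq((0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0))
    println(isCore(points.head, points, eps = 0.5, minPoints = 3)) // true: dense region
    println(isCore(points.last, points, eps = 0.5, minPoints = 3)) // false: isolated point
  }
}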

I have also created a visual guide that explains how the algorithm works.

DBSCAN on Spark is built against Scala 2.11.
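
If your project defaults to a different Scala version, you will likely need to pin 2.11 in your build. A minimal build.sbt sketch, combining this with the JitPack coordinates from the Download section above (the exact 2.11.x patch release is an illustrative choice):

// build.sbt -- the 2.11.12 patch release is illustrative; any 2.11.x should match
scalaVersion := "2.11.12"

// Repository and artifact coordinates from the Download section
resolvers += "jitpack" at "https://jitpack.io"
libraryDependencies += "com.github.matfax" % "dbscan-on-spark" % "v0.3.1"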

Example usage

I have created a sample project showing how DBSCAN on Spark can be used. The following, however, should give you a good idea of how to include it in your application.

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.dbscan.DBSCAN
import org.apache.spark.mllib.linalg.Vectors

object DBSCANSample {

  private val log = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {

    // Input and output paths from the command line; the clustering
    // parameters are illustrative values to be tuned for your data.
    val src = args(0)
    val dest = args(1)
    val eps = 0.3
    val minPoints = 10
    val maxPointsPerPartition = 250

    val conf = new SparkConf().setAppName("DBSCAN Sample")
    val sc = new SparkContext(conf)

    // Each input line is a comma-separated list of doubles
    val data = sc.textFile(src)
    val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

    log.info(s"EPS: $eps minPoints: $minPoints")

    val model = DBSCAN.train(
      parsedData,
      eps = eps,
      minPoints = minPoints,
      maxPointsPerPartition = maxPointsPerPartition)

    // Write each labeled point as "x,y,cluster"
    model.labeledPoints.map(p => s"${p.x},${p.y},${p.cluster}").saveAsTextFile(dest)

    sc.stop()
  }
}
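
Beyond saving the labeled points, the returned model can be queried directly. As a hedged follow-up to the example above, reusing its model and assuming only the labeledPoints RDD and cluster field it already shows, you could count how many points received each label:

// Count the points assigned to each cluster label
val clusterSizes = model.labeledPoints
  .map(p => (p.cluster, 1L))
  .reduceByKey(_ + _)

clusterSizes.collect().foreach { case (label, n) =>
  println(s"cluster $label: $n points")
}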

License

DBSCAN on Spark is available under the Apache 2.0 license. See the LICENSE file for details.

Credits

DBSCAN on Spark is maintained by Irving Cordova (irving@irvingc.com).