Step 1. Add the JitPack repository to your build file
Add it in your root settings.gradle at the end of repositories:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url 'https://jitpack.io' }
    }
}
Add it in your settings.gradle.kts at the end of repositories:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url = uri("https://jitpack.io") }
    }
}
Add it in your pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
Add it in your build.sbt at the end of resolvers:
resolvers += "jitpack" at "https://jitpack.io"
Add it in your project.clj at the end of repositories:
:repositories [["jitpack" "https://jitpack.io"]]
Step 2. Add the dependency
Add it in your build.gradle:

dependencies {
    implementation 'com.github.matfax:dbscan-on-spark:v0.3.1'
}

Add it in your build.gradle.kts:

dependencies {
    implementation("com.github.matfax:dbscan-on-spark:v0.3.1")
}

Add it in your pom.xml:

<dependency>
    <groupId>com.github.matfax</groupId>
    <artifactId>dbscan-on-spark</artifactId>
    <version>v0.3.1</version>
</dependency>

Add it in your build.sbt:

libraryDependencies += "com.github.matfax" % "dbscan-on-spark" % "v0.3.1"

Add it in your project.clj:

:dependencies [[com.github.matfax/dbscan-on-spark "v0.3.1"]]
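Afterwards you can check that the artifact resolves; for example, with the Gradle wrapper on a Unix-like shell:

./gradlew dependencies --configuration runtimeClasspath | grep dbscan-on-spark

Note that JitPack builds the artifact from source on the first request, so the initial resolution can take a few minutes.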
This is an implementation of the DBSCAN clustering algorithm on top of Apache Spark. It is loosely based on the paper by He et al., "MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data".
I have also created a visual guide that explains how the algorithm works.
DBSCAN on Spark is built against Scala 2.11.
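For reference, a minimal build.sbt that wires all of this together might look like the following; the Spark version shown is an assumption, so substitute whichever Scala 2.11-compatible release your cluster runs:

scalaVersion := "2.11.12"

resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-mllib" % "2.4.8" % "provided", // assumed Spark version; any 2.11 build works
  "com.github.matfax" % "dbscan-on-spark" % "v0.3.1"
)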
I have created a sample project showing how DBSCAN on Spark can be used. The following snippet should nevertheless give you a good idea of how to include it in your application.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.dbscan.DBSCAN

object DBSCANSample {

  def main(args: Array[String]): Unit = {
    val src = args(0)                  // input file: one comma-separated point per line, e.g. "1.0,2.0"
    val dest = args(1)                 // output directory for the labeled points
    val eps = 0.3                      // example value: neighborhood radius, tune for your data
    val minPoints = 10                 // example value: minimum neighbors for a dense region
    val maxPointsPerPartition = 250000 // example value: upper bound on points per partition

    val conf = new SparkConf().setAppName("DBSCAN Sample")
    val sc = new SparkContext(conf)

    val data = sc.textFile(src)
    val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

    println(s"EPS: $eps minPoints: $minPoints")

    val model = DBSCAN.train(
      parsedData,
      eps = eps,
      minPoints = minPoints,
      maxPointsPerPartition = maxPointsPerPartition)

    // Save each point as "x,y,cluster".
    model.labeledPoints.map(p => s"${p.x},${p.y},${p.cluster}").saveAsTextFile(dest)

    sc.stop()
  }
}
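Once packaged (for example with sbt package or sbt assembly), the job can be submitted in the usual Spark way; the jar name, master, and paths below are placeholders:

spark-submit --class DBSCANSample --master local[4] dbscan-sample.jar input.csv output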
DBSCAN on Spark is available under the Apache 2.0 license. See the LICENSE file for details.
DBSCAN on Spark is maintained by Irving Cordova (irving@irvingc.com).