Step 1. Add the JitPack repository to your build file
Add it to your root settings.gradle at the end of the repositories block:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url 'https://jitpack.io' }
    }
}
Or, if you use the Kotlin DSL, add it to your settings.gradle.kts at the end of the repositories block:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url = uri("https://jitpack.io") }
    }
}
For Maven, add it to your pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
For sbt, add it to your build.sbt at the end of the resolvers:
resolvers += "jitpack" at "https://jitpack.io"
For Leiningen, add it to your project.clj at the end of the repositories:
:repositories [["jitpack" "https://jitpack.io"]]
Step 2. Add the dependency
dependencies {
    implementation 'com.github.danneu:reddit:0.0.2'
}
dependencies {
    implementation("com.github.danneu:reddit:0.0.2")
}
<dependency>
    <groupId>com.github.danneu</groupId>
    <artifactId>reddit</artifactId>
    <version>0.0.2</version>
</dependency>
libraryDependencies += "com.github.danneu" % "reddit" % "0.0.2"
:dependencies [[com.github.danneu/reddit "0.0.2"]]
A Reddit crawler implemented in Kotlin for consuming submissions and comments with a flattened iterator.

The crawler's methods return Iterator<Submission> or Iterator<Comment>, which makes it useful for consuming the entire set of submissions and comments for a subreddit. Internally, the iterator lazily unrolls pagination, including "Continue Thread" and "Load More" nodes in comment trees, as you iterate.

By default, an ApiClient instance never sends more than one request per second, to comply with Reddit's API requirements.
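For a quick taste, here is a minimal sketch (commentsOf and the client's one-request-per-second default are covered below; asSequence() and take() are plain Kotlin stdlib). Because pagination unrolls lazily, stopping after five comments only fetches the first page or so of the listing:

import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient() // throttled to one request per second by default
    // take(5) ends iteration early, so only the first page of the listing
    // is requested rather than the subreddit's entire comment history
    client.commentsOf("futurology").asSequence().take(5).forEach { comment ->
        println(comment)
    }
}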
repositories {
    maven { url "https://jitpack.io" }
}

dependencies {
    // Always get the latest commit:
    implementation "com.danneu:reddit:master-SNAPSHOT"
    // Or get a specific GitHub release:
    implementation "com.danneu:reddit:0.0.1"
}
You can see all releases here: https://github.com/danneu/reddit/releases
Reddit's API only returns the latest 1,000 comments, but because requests hit the CDN cache, this endpoint can paginate through 10,000+ comments. Due to that cache, though, comments become sparser the farther back you paginate, e.g. fewer comments per hour.
import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient()
    client.recentComments().forEach { comment ->
        println(comment)
    }
}
However, if you run this 24/7 in a loop, you can consume all Reddit comments as they are published.
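For example, a minimal polling sketch. Note that the id() accessor on Comment is an assumption for illustration, not a documented method; substitute whatever unique identifier the library actually exposes:

import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient()
    // Remember ids we've already handled so overlapping passes don't print duplicates
    val seen = HashSet<String>()
    while (true) {
        client.recentComments().forEach { comment ->
            // comment.id() is assumed here; the client's built-in throttle
            // keeps this loop within Reddit's rate limits
            if (seen.add(comment.id())) {
                println(comment)
            }
        }
    }
}

In a real crawler you would cap or persist seen, since it grows without bound in this sketch.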
You'll need to use the following subreddit-scoped methods to consume historical comments.
import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient()
    client.submissionsOf("futurology").forEach { submission ->
        println(submission)
    }
}
Iterates comments depth-first.
import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient()
    client.commentsOf("futurology").forEach { comment ->
        println(comment)
    }
}
Or:
import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient()
    client.submissionsOf("futurology").forEach { submission ->
        println("Now crawling: ${submission.url()}")
        client.commentsOf(submission).forEach { comment ->
            println(comment)
        }
    }
}
ApiClient#fork() returns a new ApiClient with the same configuration as the original. Pass in a builder lambda to customize the copy.
import com.danneu.reddit.ApiClient
import java.net.InetSocketAddress
import java.net.Proxy
import java.time.Duration

fun main(args: Array<String>) {
    val client1 = ApiClient {
        proxy = Proxy(Proxy.Type.HTTP, InetSocketAddress("1.2.3.4", 8080))
    }
    val client2 = client1.fork {
        throttle = Duration.ofSeconds(2)
    }
}
client2 uses the same proxy as client1, but it waits two seconds between requests instead of the default one second.
import com.danneu.reddit.ApiClient
import com.danneu.reddit.HasContent
import java.net.URI

// Extend submissions and comments with a method that returns only the absolute URLs
fun HasContent.absoluteUrls(): List<URI> = urls().filter { it.isAbsolute }

fun processUrl(uri: URI) = println("found absolute url: $uri")

fun main(args: Array<String>) {
    val client = ApiClient()
    client.submissionsOf("futurology").forEach { submission ->
        // Process URLs in the submission body
        submission.absoluteUrls().forEach(::processUrl)
        client.commentsOf(submission).forEach { comment ->
            // Process URLs in each comment body
            comment.absoluteUrls().forEach(::processUrl)
        }
    }
}
Even better, in Kotlin we can rewrite the previous example with a client extension, #urlsOf(subreddit), that returns a lazy sequence of the URLs found in submission and comment bodies.
import com.danneu.reddit.ApiClient
import java.net.URI

fun ApiClient.urlsOf(subreddit: String): Sequence<URI> {
    return submissionsOf(subreddit).asSequence().flatMap { submission ->
        submission.urls().asSequence().plus(
            commentsOf(submission).asSequence().flatMap { comment ->
                comment.urls().asSequence()
            }
        )
    }
}

fun main(args: Array<String>) {
    val client = ApiClient()
    client.urlsOf("futurology").filter { it.isAbsolute }.forEach { url ->
        println("found absolute url: $url")
    }
}