Step 1. Add the JitPack repository to your build file
Add it in your root settings.gradle at the end of repositories:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url 'https://jitpack.io' }
    }
}
Add it in your settings.gradle.kts at the end of repositories:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url = uri("https://jitpack.io") }
    }
}
Add to pom.xml
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
Add it in your build.sbt at the end of resolvers:
resolvers += "jitpack" at "https://jitpack.io"
Add it in your project.clj at the end of repositories:
:repositories [["jitpack" "https://jitpack.io"]]
Step 2. Add the dependency
Gradle (Groovy):
dependencies {
    implementation 'com.github.milosmns:goose:2.1.26_2.11'
}
Gradle (Kotlin DSL):
dependencies {
    implementation("com.github.milosmns:goose:2.1.26_2.11")
}
Maven:
<dependency>
    <groupId>com.github.milosmns</groupId>
    <artifactId>goose</artifactId>
    <version>2.1.26_2.11</version>
</dependency>
sbt:
libraryDependencies += "com.github.milosmns" % "goose" % "2.1.26_2.11"
Leiningen:
:dependencies [[com.github.milosmns/goose "2.1.26_2.11"]]
Goose is an article extractor for web pages. Its algorithm determines where to look for relevant article information on a website,
extracts the "interesting" data, picks out the best images from the page, and assigns a confidence factor to the top-picked image based on various meta
information from the given page.
The command-line Goose library (written in Scala) is licensed by Gravity.com under the Apache 2.0 license; see the LICENSE file for more details.
The Android implementation addresses the issues that the Scala version imposes on Android, such as using external programs (image-magick), using outdated HTTP libraries (Apache),
downloading images to any location on the disk (there is no open disk on Android), managing cache, SD cards, battery consumption, network issues, redirects, etc.
Since version 1.5.0, Goose for Android also uses the official HttpURLConnection, which makes it compliant with Google's requests about deprecating outdated
SSL libraries such as OpenSSL.
A sample of how to use Goose for Android can be found in DemoActivity.java in the app folder source.
Document cleaning
When you pass a URL to Goose, the first thing it does is clean up the document to make it easier to parse. It goes through the whole document and removes comments and common social network sharing elements, converts em and other tags to plain text nodes, tries to convert divs used as text nodes to paragraphs, and performs a general document cleanup (spaces, new lines, quotes, encoding, etc.).
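As a rough illustration of the kind of cleanup described above, the following sketch strips comments, unwraps em tags, and collapses whitespace. It is a simplification under stated assumptions: the class name DocumentCleanerSketch and its clean method are hypothetical and not part of Goose, and regular expressions on a raw string stand in for what a real cleaner would normally do on a parsed document tree.

```java
import java.util.regex.Pattern;

/** Illustrative pre-parse cleanup; a hypothetical helper, not Goose's actual cleaner. */
final class DocumentCleanerSketch {
    // (?s) lets '.' match new lines, so multi-line comments are removed too.
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");
    // Opening or closing <em> tags with optional attributes, case-insensitive.
    // Requiring whitespace or '>' after "em" avoids matching tags such as <embed>.
    private static final Pattern EM_TAGS = Pattern.compile("(?i)</?em(\\s[^>]*)?>");
    // Any run of whitespace, including new lines.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private DocumentCleanerSketch() {
    }

    /** Removes comments, unwraps em tags (keeping their text), and normalizes whitespace. */
    static String clean(String html) {
        String withoutComments = COMMENTS.matcher(html).replaceAll("");
        String unwrapped = EM_TAGS.matcher(withoutComments).replaceAll("");
        return WHITESPACE.matcher(unwrapped).replaceAll(" ").trim();
    }
}
```

Calling DocumentCleanerSketch.clean on a fragment that contains a comment and an em tag returns the same fragment with the comment gone and the emphasized words left as plain text.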
Content / Images Extraction
When dealing with random article links, you are bound to come across the craziest of HTML files. Some sites even like to include two or more HTML files. Goose uses a scoring system based on the clustering of English stop words and other factors that you can find in the code. The scoring is also descending: the further down the page a node sits, the lower its score becomes. The goal is to find the strongest grouping of text nodes inside a parent container and to assume that this is the relevant group of content, as long as it sits high enough on the page.
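To make the idea of stop-word scoring with a position penalty concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the class StopWordScorerSketch, the tiny stop-word list, and the linear weight (1.0 for the first block, approaching 0.5 for the last) are hypothetical and are not the values or the formula Goose uses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative stop-word scoring with a position penalty; not Goose's actual scoring code. */
final class StopWordScorerSketch {
    // Tiny illustrative stop-word list; a real list is far longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "in", "is", "it", "of", "on", "that", "the", "to", "was"));

    private StopWordScorerSketch() {
    }

    /** Counts how many words in the text are stop words. */
    static int countStopWords(String text) {
        int count = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (STOP_WORDS.contains(word)) {
                count++;
            }
        }
        return count;
    }

    /** Stop-word count scaled by an assumed linear weight that shrinks with position. */
    static double score(String text, int index, int total) {
        double positionWeight = 1.0 - (double) index / (2.0 * total);
        return countStopWords(text) * positionWeight;
    }

    /** Returns the index of the highest-scoring block, or -1 if the list is empty. */
    static int bestBlock(List<String> blocks) {
        int best = -1;
        double bestScore = -1.0;
        for (int i = 0; i < blocks.size(); i++) {
            double current = score(blocks.get(i), i, blocks.size());
            if (current > bestScore) {
                bestScore = current;
                best = i;
            }
        }
        return best;
    }
}
```

Navigation bars and footers tend to contain few stop words, so a block of running prose near the top of the page wins under this scheme, which matches the intent described above.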
Image extraction is the step that takes the longest. Finding the most important image on a page proved to be challenging: it required downloading all the images and inspecting them manually using external tools (not all images are considered; Goose checks MIME types, dimensions, byte sizes, compression quality, etc.). Java's Image functions were just too unreliable and inaccurate. On Android, Goose uses the BitmapFactory class, which is well documented, tested, fast, and accurate. Images are analyzed starting from the top node in which Goose finds the content, followed by a recursive run outwards in search of good images. Goose also checks whether those images are ads, banners, or author logos, and ignores them if so.
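The checks listed above (MIME type, dimensions, byte size) can be pictured as a filter over image candidates. The sketch below is plain Java and only an approximation: ImageCandidate, ImagePickerSketch, and all thresholds are hypothetical names and values chosen for illustration, and picking the largest remaining image by area is an assumed tie-breaker, not Goose's ranking.

```java
import java.util.List;

/** Hypothetical metadata for one image found on a page. */
final class ImageCandidate {
    final String src;
    final String mimeType;
    final int width;
    final int height;
    final long byteSize;

    ImageCandidate(String src, String mimeType, int width, int height, long byteSize) {
        this.src = src;
        this.mimeType = mimeType;
        this.width = width;
        this.height = height;
        this.byteSize = byteSize;
    }
}

/** Illustrative candidate filter; thresholds are assumptions, not Goose's values. */
final class ImagePickerSketch {
    static final int MIN_SIDE_PX = 100;          // rejects icons and thin banners
    static final long MIN_BYTES = 4_000;         // rejects tracking pixels and tiny files
    static final double MAX_ASPECT_RATIO = 3.0;  // rejects very wide or very tall strips

    private ImagePickerSketch() {
    }

    static boolean isPlausible(ImageCandidate c) {
        if (c.mimeType == null || !c.mimeType.startsWith("image/")) {
            return false;
        }
        if (c.width < MIN_SIDE_PX || c.height < MIN_SIDE_PX || c.byteSize < MIN_BYTES) {
            return false;
        }
        double ratio = (double) Math.max(c.width, c.height) / Math.min(c.width, c.height);
        return ratio <= MAX_ASPECT_RATIO;
    }

    /** Returns the plausible candidate with the largest area, or null if none qualifies. */
    static ImageCandidate pickTop(List<ImageCandidate> candidates) {
        ImageCandidate best = null;
        for (ImageCandidate c : candidates) {
            if (!isPlausible(c)) {
                continue;
            }
            if (best == null || (long) c.width * c.height > (long) best.width * best.height) {
                best = c;
            }
        }
        return best;
    }
}
```

On Android, the width, height, and MIME type for such a filter could be read without decoding the full bitmap by setting inJustDecodeBounds on BitmapFactory.Options, which keeps memory use low while candidates are being rejected.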
Output Formatting
Once Goose has the top node where it thinks the content is, it tries to format the content of that node for the output. For example, for NLP-type applications, Goose's output formatter just extracts all the text and ignores everything else, while other (custom) extractors can be built to offer a more Flipboardy-type experience.
compile 'me.angrybyte.goose:goose:1.8.0'
This will run the app on your device. You may need to download a newer version of Gradle, which will be available in the Android Studio UI if compile fails.
Make sure you do this on a background thread (or in an AsyncTask). Some of the results are in the screenshots directory in the project root.
// Point Goose at the app's cache directory
Configuration config = new Configuration(getCacheDir().getAbsolutePath());
ContentExtractor extractor = new ContentExtractor(config);

// Blocking call: fetches and parses the page at the given URL
Article article = extractor.extractContent(url);
if (article == null) {
    Log.e(TAG, "Couldn't load the article, is your URL correct, is your Internet working?");
    return;
}

// Prefer the cleaned article text, fall back to the page description
String details = article.getCleanedArticleText();
if (details == null) {
    Log.w(TAG, "Couldn't load the article text, the page is messy. Trying with page description...");
    details = article.getMetaDescription();
}

// Download the top image, if Goose found one
Bitmap photo = null;
if (article.getTopImage() != null) { // topImage value holds an absolute URL
    photo = GooseDownloader.getPhoto(article.getTopImage().getImageSrc());
}
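Because the extraction call blocks on network I/O, one way to keep it off the main thread is a plain java.util.concurrent executor. The sketch below is a hedged example rather than part of the library: BackgroundExtractionSketch and extractAsync are hypothetical names, and the extractor parameter is a stand-in function that, in a real app, would wrap the Goose calls and return the article text.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Illustrative helper for running a blocking extraction on a worker thread. */
final class BackgroundExtractionSketch {

    private BackgroundExtractionSketch() {
    }

    /**
     * Submits the blocking extraction to the given executor and returns a handle to the result.
     * The extractor function is a placeholder for code that calls Goose and returns text.
     */
    static Future<String> extractAsync(ExecutorService executor,
                                       Function<String, String> extractor,
                                       String url) {
        Callable<String> task = () -> extractor.apply(url);
        return executor.submit(task);
    }
}
```

A caller would typically create the executor once (for example with Executors.newSingleThreadExecutor()), submit the work, and hand the result back to the UI thread when it is ready instead of calling Future.get() on the main thread, since get() blocks as well.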
Data you might be interested in is always available in the Article object:
<title> tag
<head> parent
<head> parent
<article> tag, but Goose checks for good <div> and <p> tags as well
If you found an error while using the library, please file an issue. All patches are encouraged, and may be submitted by forking this project and submitting a pull request through GitHub. Some more help can be found here:
The code is still not very clean. It works, but some things remain to be addressed:
jar archive
Tip: to comment all logs use:
find . -name "*\.java" | xargs grep -l 'Log\.' | xargs sed -i 's/Log\./;\/\/ Log\./g'
Tip: to uncomment all logs use:
find . -name "*\.java" | xargs grep -l 'Log\.' | xargs sed -i 's/;\/\/ Log\./Log\./g'
Any help is appreciated.