Nutch: using Java calls instead of the command line?
Java
Am I thick, or is it really impossible to call Apache Nutch programmatically from Java code? Where are the documents (or guides, or tutorials) on how to do this? Google let me down, so I actually tried Bing. Yes, I know, sad. Any ideas? Thank you in advance.
(Also, if Nutch is nonsense, is there any other crawler written in Java that has proven reliable at Internet scale and has actual documentation?)
Solution
If you look at the bin/nutch script, you will see that it maps each command to a Java class and then runs that class:
# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawl
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.Generator
elif [ "$COMMAND" = "freegen" ] ; then
  CLASS=org.apache.nutch.tools.FreeGenerator
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher
elif [ "$COMMAND" = "fetch2" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher2
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.parse.ParseSegment
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbReader
elif [ "$COMMAND" = "convdb" ] ; then
  CLASS=org.apache.nutch.tools.compat.CrawlDbConverter
elif [ "$COMMAND" = "mergedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbMerger
elif [ "$COMMAND" = "readlinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbReader
elif [ "$COMMAND" = "readseg" ] ; then
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "segread" ] ; then
  echo "[DEPRECATED] Command 'segread' is deprecated,use 'readseg' instead."
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "mergesegs" ] ; then
  CLASS=org.apache.nutch.segment.SegmentMerger
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDb
elif [ "$COMMAND" = "invertlinks" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDb
elif [ "$COMMAND" = "mergelinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbMerger
elif [ "$COMMAND" = "index" ] ; then
  CLASS=org.apache.nutch.indexer.Indexer
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexer
elif [ "$COMMAND" = "dedup" ] ; then
  CLASS=org.apache.nutch.indexer.DeleteDuplicates
elif [ "$COMMAND" = "solrdedup" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "merge" ] ; then
  CLASS=org.apache.nutch.indexer.IndexMerger
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "server" ] ; then
  CLASS='org.apache.nutch.searcher.DistributedSearch$Server'
else
  CLASS=$COMMAND
fi

# run it
exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS -classpath "$CLASSPATH" $CLASS "$@"
From there, it is just a matter of checking the API docs of these classes, and their source code if necessary, to see how to call them from your own Java code.
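As a rough illustration of the same dispatch done from Java, here is a minimal sketch that resolves a command name to a class name and invokes that class's `main` method through reflection, which is essentially what the script's final `exec` line does. The class `NutchCommandRunner` and its methods are hypothetical names made up for this example, not part of Nutch. The table contains only a handful of the mappings from the script, and the sketch assumes the Nutch jars and configuration directory are already on the classpath.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical helper (not part of Nutch): mirrors the command dispatch in bin/nutch.
final class NutchCommandRunner {

    // A few command-to-class pairs copied from the bin/nutch script; extend as needed.
    private static final Map<String, String> COMMANDS = Map.of(
            "crawl", "org.apache.nutch.crawl.Crawl",
            "inject", "org.apache.nutch.crawl.Injector",
            "generate", "org.apache.nutch.crawl.Generator",
            "fetch", "org.apache.nutch.fetcher.Fetcher",
            "parse", "org.apache.nutch.parse.ParseSegment",
            "updatedb", "org.apache.nutch.crawl.CrawlDb");

    private NutchCommandRunner() {
    }

    // Like the script's final "else" branch, an unknown command is treated as a class name.
    static String resolveClassName(String command) {
        return COMMANDS.getOrDefault(command, command);
    }

    // Loads the resolved class and calls its public static main(String[]) with the given arguments.
    static void run(String command, String... args) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(resolveClassName(command));
        Method main = cls.getMethod("main", String[].class);
        // Cast to Object so the array is passed as one argument rather than expanded as varargs.
        main.invoke(null, (Object) args);
    }
}
```

A call such as `NutchCommandRunner.run("inject", "crawl/crawldb", "urls")` would then correspond to `bin/nutch inject crawl/crawldb urls`; the two paths are placeholders. Keep in mind that a `main` method may call `System.exit`, which would also terminate the calling JVM, so for tighter integration it is probably better to use the classes' own APIs directly once you have read their docs.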