Nutch: using java calls instead of the command line?

Am I thick or really unable to programmatically call Apache nutch through some java code? Where are the documents (or guides or tutorials) on how to do this? Google let me down So I actually tried Bing Yes, I know, sad Ideas? Thank you in advance

(in addition, if nutch is nonsense, can any other crawler written in Java prove reliable on the Internet scale with actual documents?)

Solution

If you look at the bin / nutch script, you will see that it calls the Java class corresponding to your command:

# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawl
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.Generator
elif [ "$COMMAND" = "freegen" ] ; then
  CLASS=org.apache.nutch.tools.FreeGenerator
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher
elif [ "$COMMAND" = "fetch2" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher2
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.parse.ParseSegment
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbReader
elif [ "$COMMAND" = "convdb" ] ; then
  CLASS=org.apache.nutch.tools.compat.CrawlDbConverter
elif [ "$COMMAND" = "mergedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbMerger
elif [ "$COMMAND" = "readlinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbReader
elif [ "$COMMAND" = "readseg" ] ; then
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "segread" ] ; then
  echo "[DEPRECATED] Command 'segread' is deprecated,use 'readseg' instead."
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "mergesegs" ] ; then
  CLASS=org.apache.nutch.segment.SegmentMerger
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDb
elif [ "$COMMAND" = "invertlinks" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDb
elif [ "$COMMAND" = "mergelinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbMerger
elif [ "$COMMAND" = "index" ] ; then
  CLASS=org.apache.nutch.indexer.Indexer
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.solrIndexer
elif [ "$COMMAND" = "dedup" ] ; then
  CLASS=org.apache.nutch.indexer.DeleteDuplicates
elif [ "$COMMAND" = "solrdedup" ] ; then
  CLASS=org.apache.nutch.indexer.solr.solrDeleteDuplicates
elif [ "$COMMAND" = "merge" ] ; then
  CLASS=org.apache.nutch.indexer.IndexMerger
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "server" ] ; then
  CLASS='org.apache.nutch.searcher.DistributedSearch$Server'
else
  CLASS=$COMMAND
fi

# run it
exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS -classpath "$CLASSPATH" $CLASS "$@"

Since then, the only problem is to check the API docs and the source code of these classes if necessary

The content of this article comes from the network collection of netizens. It is used as a learning reference. The copyright belongs to the original author.
THE END
分享
二维码
< <上一篇
下一篇>>