Java – how to use Spark to process a range of HBase rows?

I am trying to use HBase as a data source for Spark, so the first step is to create an RDD from an HBase table. Since Spark works with Hadoop input formats, I found a way to create an RDD over all rows of a table, as described at http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase. But how do I create an RDD for a range scan?

All suggestions are welcome

Solution

The following is an example of using a Scan in Spark (Scala):

import java.io.{ByteArrayOutputStream, DataOutputStream}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Base64

// Serialise the Scan so it can be handed to TableInputFormat through the
// configuration (works on HBase versions where Scan implements Writable).
def convertScanToString(scan: Scan): String = {
  val out = new ByteArrayOutputStream
  val dos = new DataOutputStream(out)
  scan.write(dos)
  Base64.encodeBytes(out.toByteArray)
}

val conf = HBaseConfiguration.create()

val scan = new Scan()
scan.setCaching(500)        // rows fetched per RPC; tune for your workload
scan.setCacheBlocks(false)  // recommended for MapReduce/Spark-style scans

conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))

// Each element of the RDD is a (row key, Result) pair.
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
rdd.count
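
Since the question is specifically about range scanning: the Scan above reads the whole table. To restrict it to a row-key range, set start and stop rows on the Scan before serialising it into the configuration. A minimal sketch, with placeholder row keys:

import org.apache.hadoop.hbase.util.Bytes

// "row-0001" and "row-0500" are hypothetical keys; use your own boundaries.
scan.setStartRow(Bytes.toBytes("row-0001"))  // inclusive lower bound
scan.setStopRow(Bytes.toBytes("row-0500"))   // exclusive upper bound
conf.set(TableInputFormat.SCAN, convertScanToString(scan))

TableInputFormat will then only produce rows whose keys fall inside that range, so the resulting RDD covers exactly that slice of the table.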

You need to add the relevant HBase libraries to the Spark classpath and make sure they are compatible with your Spark and HBase versions. Tip: you can list the required jars with the hbase classpath command.
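
Note that on newer HBase releases (roughly 0.96 and later) Scan no longer implements Writable, so the scan.write(dos) call above will not compile. If you are on such a version, a sketch of the usual alternative is to let HBase's own MapReduce helper serialise the Scan:

import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil

// Serialises the Scan (via protobuf) into the Base64 string that
// TableInputFormat.SCAN expects.
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

Everything else in the example stays the same.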
