An in-depth analysis of the advantages of HBase

HBase is a NoSQL database that runs on Hadoop. It is a distributed, scalable big data store. In other words, HBase takes advantage of the distributed storage of HDFS and benefits from Hadoop's MapReduce programming model. This means it can store many large tables, with billions of rows and millions of columns, on a cluster of commodity hardware. Beyond the advantages it inherits from Hadoop, HBase is a powerful database in its own right: it combines the real-time query capability of a key/value store with the offline or batch processing capability of MapReduce. In short, HBase lets you look up individual records within a massive amount of data and also produce aggregate analytic reports.

Google once faced a challenging problem: how to provide real-time search results across the whole Internet? The answer was that it essentially needed to cache the Internet and define a new way to search such a huge cache quickly. To this end, Google developed the following technologies:

・ Google File System (GFS): a scalable distributed file system for large, distributed, data-intensive applications.

・ BigTable: a distributed storage system for managing structured data, designed to scale to a very large size: petabytes of data across thousands of commodity servers.

・ MapReduce: a programming model and an associated implementation for processing and generating large data sets.

Soon after Google published papers describing these technologies, open source implementations appeared. In 2007, Mike Cafarella released code for an open source implementation of BigTable, which he called HBase. Since then, HBase has become a top-level Apache project and runs at Facebook, Twitter, and Adobe, to name just a few examples.

HBase is not a relational database, and it requires a different approach to defining your data model. HBase defines a four-dimensional data model. The dimensions are as follows:

・ Row key: each row has a unique row key. The row key has no data type; internally it is treated as a byte array.

・ Column family: data within a row is organized into column families. Every row has the same set of column families, but across rows the same column family does not need to contain the same column qualifiers. Under the hood, HBase stores each column family in its own data files, so column families must be defined up front. In addition, changing column families later is not easy.

・ Column qualifier: the actual columns are defined within a column family and are called column qualifiers. You can think of the column qualifier as the column itself.

・ Version: each column can keep a configurable number of versions. You can retrieve the data for a specific version of a column qualifier.
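The four dimensions listed above can be pictured as nested maps. The following is only an in-memory sketch in Python, not the HBase API: the helper names put_cell and get_cell are hypothetical, integer timestamps stand in for versions, and the value /newpage is made up for illustration.

```python
from typing import Dict, Optional

# row key -> column family -> column qualifier -> {timestamp: value}
Table = Dict[bytes, Dict[str, Dict[str, Dict[int, bytes]]]]


def put_cell(table: Table, row_key: bytes, family: str, qualifier: str,
             value: bytes, timestamp: int) -> None:
    """Store one version of one cell, creating the nested maps as needed."""
    versions = (table.setdefault(row_key, {})
                     .setdefault(family, {})
                     .setdefault(qualifier, {}))
    versions[timestamp] = value


def get_cell(table: Table, row_key: bytes, family: str, qualifier: str,
             timestamp: Optional[int] = None) -> Optional[bytes]:
    """Return the requested version, or the newest one if no timestamp is given."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    if timestamp is None:
        timestamp = max(versions)  # newest version wins by default
    return versions.get(timestamp)


table: Table = {}
put_cell(table, b"rowkey1", "info", "page", b"/mypage", timestamp=1)
put_cell(table, b"rowkey1", "info", "page", b"/newpage", timestamp=2)

latest = get_cell(table, b"rowkey1", "info", "page")
older = get_cell(table, b"rowkey1", "info", "page", timestamp=1)
```

Addressing a single value takes all four coordinates: row key, column family, column qualifier, and version.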

Figure 1. HBase Four-Dimensional Data Model

As shown in Figure 1, a row is retrieved by its row key and is composed of one or more column families. Each column family has one or more column qualifiers (called columns in Figure 1), and each column can have one or more versions. To retrieve a specific piece of data, you need to know its row key, column family, column qualifier, and version. When designing an HBase data model, it helps to think about how the data will be accessed. You can access HBase data in the following two ways:

・ By row key, or through a table scan over a range of row keys.

・ In batch, using MapReduce.

This dual access model makes HBase very powerful. Typically, storing data in Hadoop means that it is well suited to offline or batch analysis but not to real-time access. HBase supports real-time access through its key/value store and batch analysis through MapReduce. Let's look at real-time access first. Viewed as a key/value store, the key is the row key and the value is the collection of column families, as shown in Figure 2.

Figure 2. HBase as a Key/Value Store

As you can see in Figure 2, the key is the row key discussed above and the value is the collection of column families. You can retrieve the value by its key; in other words, you can "get" a row by its row key, or you can retrieve a range of rows by giving start and end row keys, which is the table scan mentioned above. You cannot query on the values of columns in real time, which leads to an important topic: row key design.
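To make the difference between looking up by key and looking up by value concrete, here is a small Python sketch that models the store as a plain dictionary. It is an illustration, not HBase code: the function names get_row and find_by_value and the sample value /myotherpage are assumptions.

```python
from typing import Dict, List, Optional

Row = Dict[str, Dict[str, str]]  # column family -> column qualifier -> value

rows: Dict[str, Row] = {
    "rowkey1": {"info": {"page": "/mypage"}},
    "rowkey2": {"info": {"page": "/myotherpage"}},
}


def get_row(store: Dict[str, Row], row_key: str) -> Optional[Row]:
    """Key lookup: goes straight to the row, no other rows are touched."""
    return store.get(row_key)


def find_by_value(store: Dict[str, Row], family: str, qualifier: str,
                  wanted: str) -> List[str]:
    """Value lookup: nothing indexes the values, so every row is examined."""
    matches = []
    for row_key in sorted(store):
        if store[row_key].get(family, {}).get(qualifier) == wanted:
            matches.append(row_key)
    return matches


by_key = get_row(rows, "rowkey1")
by_value = find_by_value(rows, "info", "page", "/myotherpage")
```

The second function works, but only by walking the whole table, which is why the row key has to carry whatever you need to look up quickly.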

There are two reasons why the design of row keys is important:

・ Table scans operate on row keys, so the design of your row keys determines what real-time or direct access you can perform against HBase.

・ When HBase runs in production, it runs on top of HDFS, and data is distributed across HDFS based on row keys. If all your row keys start with user-, most of your data is likely to end up on a single node, which defeats the purpose of distributing the data. Your row keys should therefore differ enough to be spread across the whole deployment.

How you define row keys depends on how you want to access the rows. HBase stores row keys as byte arrays, so if you want to store data on a per-user basis, one strategy is to create a hash of the user ID (such as MD5 or SHA-1) and then append a timestamp (as a long) to the hash. Using a hash has two benefits: (1) it disperses the values so that the data is distributed across the cluster, and (2) it ensures that all keys have the same length, which makes them easier to work with in table scans.
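The hash-plus-timestamp strategy described above might be sketched in Python as follows. This is one possible reading, not a prescribed layout: the function name make_row_key, the choice of MD5, the big-endian 8-byte encoding of the timestamp, and the sample user IDs are all assumptions made for illustration.

```python
import hashlib
import struct


def make_row_key(user_id: str, timestamp_ms: int) -> bytes:
    """Build a fixed-length row key: MD5(user ID) followed by an 8-byte timestamp."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()  # always 16 bytes
    # ">q" packs a signed 64-bit integer in big-endian order, so non-negative
    # timestamps sort in the same order as the resulting bytes.
    return digest + struct.pack(">q", timestamp_ms)


key_a = make_row_key("user-1", 1_400_000_000_000)
key_b = make_row_key("user-1", 1_400_000_000_001)
key_c = make_row_key("user-2", 1_400_000_000_000)
```

All keys come out 24 bytes long. Keys for the same user share a 16-byte prefix and sort by time, while keys for different users start with unrelated bytes and no longer share the literal user- prefix.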

Enough theory. The following section shows how to set up an HBase environment and how to use it from the command line.

You can download HBase from the Apache website. At the time of writing, the latest version is 0.98.5. The HBase team recommends installing HBase in a UNIX/Linux environment. If you want to run it on Windows, you need to install Cygwin first and run HBase on top of it. When the download finishes, unpack the archive to your hard disk. You also need a Java environment; if you don't have one, download it from the Oracle website. Add an environment variable named HBASE_HOME whose value is the root directory where you unpacked HBase, and then run the start-hbase.sh script in the bin folder, which writes log files to the following directory:

$HBASE_HOME/logs/

You can enter the following URL in the browser to test whether it is installed correctly:

http://localhost:60010

If the installation is correct, you should see the following interface.

Figure 3. HBase Management Screen

Let's start working with HBase from the command line. Run the following command in the HBase bin directory:

./hbase shell

You should see output similar to the following:

Create a table named pageviews with a column family named info:

Every table must have at least one column family, so we created info. Now, to see our table, execute the list command:

As you can see, the list command returns a table named pageviews. We can get more information about the table through the describe command:

The describe command returns detailed information about the table, including its list of column families. Here we created only one: info. Now add some data to the table. The following command adds a new row to info:

The put command inserts a new record with the row key rowkey1, specifies the page column under info, and sets its value to /mypage. We can then query this record by the row key rowkey1 with the get command:

You can see the column info:page (column family info, column qualifier page) with the value /mypage, along with a timestamp indicating when the record was inserted. Let's add another row before scanning the table:

Now that we have two rows of records, let's query all the records in the pageviews table:

As mentioned earlier, we cannot query on column values, but we can scan the table. If you execute the scan command with only the table name, it returns every row in the table, which is probably not what you want. You can limit the results by specifying a range of row keys. Let's insert a new record whose row key begins with s:

Now, if I want to restrict the query to records with row keys between r and s, I can use the following structure:

This scan returns only the records whose row keys start with r. The comparison is based on the full row key: rowkey1 is greater than r, so it is returned. In addition, scan results include the startRow of the specified range but exclude the endRow. Note that endRow is optional. If we run the same query with only startRow, we get all records whose row keys are greater than or equal to r.
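The range semantics described above (startRow inclusive, endRow exclusive, keys compared byte by byte) can be mimicked over a sorted list of keys. The Python sketch below only models that behavior and does not call HBase: the function name scan_keys and the row key srowkey1 are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional


def scan_keys(sorted_keys: List[bytes], start_row: Optional[bytes] = None,
              end_row: Optional[bytes] = None) -> List[bytes]:
    """Return keys k with start_row <= k < end_row; either bound may be omitted."""
    # bisect_left finds the first position whose key is >= the bound, so the
    # start bound is included and the end bound is excluded by the slice.
    lo = 0 if start_row is None else bisect_left(sorted_keys, start_row)
    hi = len(sorted_keys) if end_row is None else bisect_left(sorted_keys, end_row)
    return sorted_keys[lo:hi]


keys = sorted([b"rowkey1", b"rowkey2", b"srowkey1"])

between_r_and_s = scan_keys(keys, b"r", b"s")
from_r_onward = scan_keys(keys, b"r")
```

Here b"rowkey1" sorts after b"r" and before b"s", so it falls inside the range, while b"srowkey1" sorts after b"s" and is left out unless the end bound is omitted.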

HBase is a NoSQL database, commonly known as the Hadoop database. It is open source and based on Google's BigTable white paper. HBase runs on HDFS, so it is highly scalable and supports Hadoop's MapReduce programming model. HBase has two access modes: random access by row key, and offline or batch access through MapReduce.

This article described the characteristics and advantages of HBase, briefly reviewed the key points of row key design, and showed how to set up HBase locally and use shell commands to create a table, insert data, retrieve a specific row, and scan a table.
