Java – The difference between row-oriented and column-oriented databases in how they store and retrieve data

Recently I began studying HBase (a column-oriented database). While going through the source code, one question kept popping up in my mind.

If I have made a mistake in any part of this question, please correct me.

Regards, Krishna

Solution

If I understand you correctly, you are more interested in the basic storage and retrieval issues, and less in the DDL and definition issues, of the column-oriented category of DBs, right?

I will assume you understand that almost all storage, regardless of the vendor, takes some form of:

>a B-tree for the indexes
>a heap for unorganized data

On top of that base, each vendor has its own optimizations and proprietary expertise. For example, Sybase (row-oriented) has:

>a clustered index, which combines the data rows with the B-tree and eliminates the heap

The next issue is that all vendors (except Oracle) have quite sophisticated engines with a modular design, and I/O is asynchronous and done at a low level to obtain speed. The unit of I/O is the page. An OLTP system usually uses 2-8 KB pages and a DSS usually 8-64 KB (note that I have avoided the row vs column question so far). Therefore, regardless of row or column orientation, a DSS engine is built for high-volume retrieval, because more index entries and data rows or columns are fetched in larger blocks, with fewer I/O requests.

You can read extents (8 pages) and larger allocation units (256 pages) into memory with a single I/O request, which is what "large I/O" means. But the basic unit is still the page.
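To make the effect of large I/O concrete, here is a minimal arithmetic sketch. The scan size of 10,000 pages is an assumption chosen for illustration; the 8-page and 256-page units are the ones mentioned above. It only counts requests and does not model any real engine.

```python
import math

# Units taken from the text above: 8 pages per extent,
# 256 pages per allocation unit.
EXTENT_PAGES = 8
ALLOC_UNIT_PAGES = 256


def io_requests(total_pages: int, pages_per_request: int) -> int:
    """Number of I/O requests needed to read total_pages."""
    return math.ceil(total_pages / pages_per_request)


if __name__ == "__main__":
    scan_pages = 10_000  # hypothetical size of a scan, in pages
    for label, unit in [("page", 1),
                        ("extent", EXTENT_PAGES),
                        ("allocation unit", ALLOC_UNIT_PAGES)]:
        print(f"{label:>15}: {io_requests(scan_pages, unit)} requests")
```

The amount of data read is the same in every case; only the number of requests changes, which is the point of building a DSS engine around larger blocks.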

Row vs column

>Row

>Each row is a contiguous unit on the page, and multiple rows are packed into a page.
>For the index this does not matter, because the whole structure consists of the columns that make up the key; an index entry is small (key plus pointer), so many more index entries are packed into the same page.
>They are very fast for fetching small rows; aggregating over a column is slow.

>Column

>Each column is a contiguous unit on the page. Because a column can be millions of entries (rows) long, it runs over many pages.
>The index is the same as for rows above. With the addition of specialized index forms, columnar navigation should be faster.
>They are very fast for column aggregates; constructing rows from column-based data is very slow.
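The trade-off between the two layouts can be sketched as a page count. This is a simplified model under stated assumptions: the row width (200 bytes), value width (8 bytes), page size (8 KB) and column count (20) are all hypothetical, index pages are ignored, and no real engine's page format is implied.

```python
import math

PAGE_BYTES = 8192  # assumed page size


def row_store_scan_pages(n_rows: int, row_bytes: int,
                         page_bytes: int = PAGE_BYTES) -> int:
    """Pages read to aggregate ONE column in a row store.

    Whole rows are packed into pages, so every data page is read
    even though only one column is needed.
    """
    rows_per_page = page_bytes // row_bytes
    return math.ceil(n_rows / rows_per_page)


def column_store_scan_pages(n_rows: int, value_bytes: int,
                            page_bytes: int = PAGE_BYTES) -> int:
    """Pages read to aggregate ONE column in a column store.

    Only the pages holding that column's values are read.
    """
    values_per_page = page_bytes // value_bytes
    return math.ceil(n_rows / values_per_page)


def row_store_fetch_pages() -> int:
    """Data pages read to fetch one full row: it sits on one page."""
    return 1


def column_store_fetch_pages(n_columns: int) -> int:
    """Data pages read to rebuild one row: one page per column."""
    return n_columns


if __name__ == "__main__":
    print("aggregate, row store:   ", row_store_scan_pages(1_000_000, 200))
    print("aggregate, column store:", column_store_scan_pages(1_000_000, 8))
    print("one row, row store:     ", row_store_fetch_pages())
    print("one row, column store:  ", column_store_fetch_pages(20))
```

Under these assumptions the column layout reads far fewer pages for the aggregate, while the row layout reads far fewer pages for a single-row fetch.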

All queries executed against the engine must navigate the index to retrieve the data rows or columns from the storage structures above.

The result is the multiplication of the above:

>small/large page size, times
>underlying physical structure, times
>row/column orientation

Is that what you are looking for? For Sybase ASE, a strictly row-oriented OLTP/DSS engine, I have a set of technical (not warm and fuzzy) diagrams. If you are interested, I can get my hands on them.

Responses to comments

So you mean that eventually it all comes down to pages, regardless of the database type?

Yes

If that is the case, how is clustering of the database done? Let's take a database that stores data. If I cluster this type of database, how will the table structure be hosted across the different nodes (if I have multiple nodes)? Will the table structure be tied to pages, or handled through a different mechanism?

You know, before I answer this question, I must acknowledge you. You have gone straight to the key point, as a knowledgeable person would, and with insight. Shiva I AI!

Yes, this is the key design issue for a clustered DBMS, and the key limiting issue. It comes first among the various design issues related to clustering. If the vendor handles it well, the cluster runs well; if not, the cluster is a dog's breakfast.

Everything in it is bound by the laws of physics. Nothing is free: every feature has a cost, in processing or in storage. There is no magic, except in the MS marketing manuals.

Good cluster database architecture

I don't know all the clustered DBMSs; I know Sybase CE and Oracle RAC really well, and I have a working knowledge of Sybase IQ.

>Oracle RAC has been around longer and is more mature. It treats this critical issue very seriously. As a result it ends up contending with itself and needs more CPU power (cores, CPUs, not nodes) than originally estimated. The more nodes, the more contention. It should be noted that the Oracle non-RAC architecture is nonsense, or rather non-existent, so RAC stands on a foundation of sand. Not to mention that it sucks dead bears.
>Sybase CE is only one year old, but the architecture is brilliant and it handles this key issue well. There is only one version of a page, and it lives on the SAN. All nodes are connected to the SAN, and any node can read or write that page. The nodes are connected by a dedicated LAN (separate from the normal client-server LAN used by everyone else on the network). The nodes coordinate locks, plus a little inter-node communication for balancing and so on. At the end of the day, even with Sybase CE, you need to partition the database logically, to separate the workload on each node so that each one accesses different file paths or separate physical areas of the shared database.
>Sybase IQ is 100% column-oriented. It is their DW product. It does load balancing. It can be clustered, but not in the CE sense above. I should have included it.
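The logical partitioning mentioned for Sybase CE can be illustrated with a small routing sketch. This is not how Sybase CE is configured; it is an application-level illustration, and the class name `RangeRouter`, the key ranges and the node names are all hypothetical. The idea is only that each node is handed a disjoint key range, so nodes rarely compete for the same pages.

```python
import bisect


class RangeRouter:
    """Maps a key to the node that owns its key range.

    boundaries are sorted, exclusive upper bounds for every node
    except the last, which takes everything above the final bound.
    """

    def __init__(self, boundaries, nodes):
        if len(nodes) != len(boundaries) + 1:
            raise ValueError("need exactly one more node than boundaries")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted")
        self.boundaries = list(boundaries)
        self.nodes = list(nodes)

    def node_for(self, key):
        # bisect_right sends a key equal to a boundary to the next node.
        return self.nodes[bisect.bisect_right(self.boundaries, key)]


if __name__ == "__main__":
    router = RangeRouter([1000, 2000], ["node_a", "node_b", "node_c"])
    for key in (5, 1000, 2500):
        print(key, "->", router.node_for(key))
```

Because each key range goes to exactly one node, the lock traffic between nodes stays low, which is the benefit the partitioning is meant to deliver.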

Poor cluster database architecture

The dog's-breakfast type of clustered DBMS does stupid things. To list a few:

>Store the pages on every node [massive replication], which means updated pages have to be moved around the cluster.
>Use MVCC to overcome that problem (but MVCC is more expensive and actually slows concurrency, so it is fighting itself).

Clusters are not suitable for dedicated DB servers

Basically, clustering is great for some applications, but for a dedicated database server it is a stupid idea (a fact lives in one place; shared resources are managed together; lock contention is handled most effectively when it is managed in one place, because the data is in one place). I would not recommend clustering database servers.

>Same as the SAN issue. Sure, many people store their databases on a SAN, but for maximum speed, and for isolation from the load problems of the other servers attached to the SAN, nothing comes close to local disks.
>Same as the VMware issue. Sure, many people run their database server under VMware, but for maximum speed you remove the VMware overhead; and to isolate it from the load problems of the other hosted units in the chassis, you move it out to a dedicated box.

Why database vendors bother with clustering

Oh, there is value in it; not so much now, but in the future. AFAIC, the Sybase architecture will dominate over time, while all the others will lag behind. Every vendor will copy it, as usual.

Sybase CE's strengths are:

>Real 100% uptime (a node can be added to the cluster and an old node shut down for maintenance), and
>Fully dynamic load balancing (say the existing node is a 4 x quad core; add a temporary 4 x quad core node; take down the old node; install 2 x quad cores; start it up and drop the temporary node). Then, within 60 seconds, with no fingers on any keyboard, the whole beast is rebalanced.

A shop that can stagger the nightly database maintenance schedules of several single-node servers can save a lot of money; it only needs a few extra machines to swap in and out.

>Data warehouses are different. They are mostly read-only. So there is no problem hosting one on a cluster (many reader nodes, only one writer node, no contention, and no one cares whether pages are being written while they are being read). Sybase IQ is such a product.

Sybase CE for column oriented

>Sybase IQ is column-oriented and can be deployed in a cluster, but it is not clustered in the CE sense above. Columns are mapped to pages. I should have included it under "Good cluster database architecture" above; that is now corrected.
>I do not know of a worthwhile hybrid that combines columns and rows.
>But the complete answer to this question is to use a pure DB (not a DW), such as Sybase ASE or ASE/CE, and implement a true Sixth Normal Form database. This is the ultimate, irreducible normal form, and it has several significant advantages, including speed and ease of pivoting. It gives you column-oriented storage on the page. Since SQL does not fully support 6NF, you need to supply views that deliver 5NF rows from the (stored) 6NF structures. I have written an extension to the catalogue, so I can generate that SQL code for developers.
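The 6NF-plus-view idea can be sketched as follows. This is an illustration only: it uses SQLite through Python's standard `sqlite3` module as a stand-in (not Sybase ASE), the table and column names are hypothetical, and it is not the catalogue extension mentioned above. Each attribute is stored in its own table keyed by the identifier, and a view joins them back into a conventional row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Anchor table: one row per customer, key only.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY
    );
    -- 6NF style: one table per attribute (key + a single value).
    CREATE TABLE customer_name (
        customer_id INTEGER PRIMARY KEY REFERENCES customer(customer_id),
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_city (
        customer_id INTEGER PRIMARY KEY REFERENCES customer(customer_id),
        city        TEXT NOT NULL
    );
    -- View that reassembles a conventional row for developers.
    CREATE VIEW customer_5nf AS
        SELECT c.customer_id, n.name, t.city
        FROM customer c
        LEFT JOIN customer_name n ON n.customer_id = c.customer_id
        LEFT JOIN customer_city t ON t.customer_id = c.customer_id;

    INSERT INTO customer VALUES (1), (2);
    INSERT INTO customer_name VALUES (1, 'Asha'), (2, 'Ravi');
    -- Customer 2 has no known city: no row is stored, so no NULL is stored.
    INSERT INTO customer_city VALUES (1, 'Pune');
""")

rows = conn.execute(
    "SELECT customer_id, name, city FROM customer_5nf ORDER BY customer_id"
).fetchall()
city_count = conn.execute("SELECT COUNT(*) FROM customer_city").fetchone()[0]

if __name__ == "__main__":
    print(rows)
    print(city_count)
```

An aggregate over one attribute (here the count of cities) touches only that attribute's table, which is the column-oriented behaviour; the view pays the join cost only when a full row is requested.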
