Hadoop file upload example code

So far, files have been uploaded to HDFS from the local Linux machine by manually running commands. In a real production environment it is impractical to upload files by hand every time; it is far too cumbersome. Instead, we can use the Java API provided by HDFS to upload files to HDFS, or even upload files from an FTP server straight to HDFS.

As a side note: previously, whenever I wanted to run a MapReduce job I had to execute yarn jar by hand, which is equally impractical in a real environment. For example, my company uses the soca scheduling/task-monitoring platform, which runs our programs on a schedule as workflows, including ordinary Java programs and MapReduce jobs. Under the hood this scheduling platform uses Quartz. It also provides other features such as a web UI and log viewing, so it is not free.

First, a brief introduction to HDFS. HDFS stores very large files with a streaming data access pattern. Its design idea is write once, read many times, which is the access pattern it handles most efficiently. HDFS is optimized for high data throughput, which comes at the cost of higher latency; for low-latency access requirements, HBase is a better fit.

Next, it helps to know about blocks in HDFS. The default block size is 64MB, and a block is the smallest unit of HDFS reads and writes; usually each map task processes one block at a time. The block concept also comes up in cluster capacity planning, where we consider things such as the number of nodes, disk space, CPU, the volume of data to be processed, and the network bandwidth of each node. You can run hadoop fsck / -files -blocks to list which blocks each file in the file system consists of.

Then we need to know about the namenode and datanode, which were introduced in a previous blog post. Let's take a look at the managers (namenodes) and workers (datanodes) of HDFS in a Cloudera Manager (CM) environment, as follows:

In a YARN environment there can be multiple namenodes. There is no secondary namenode in this setup, although one can certainly be configured.

That covers the basic concepts of HDFS. Now let's look at the actual code.

1、 Uploading local files to HDFS with Java

Here we can use the Java API provided by HDFS directly. The code is as follows:

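The original code listing is not reproduced here, so the following is only a minimal sketch of what such an upload might look like; the namenode URI and the local and HDFS paths are placeholders for illustration.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadDemo {
    public static void main(String[] args) throws Exception {
        // The namenode URI and paths below are placeholders; adjust them to your cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path localFile = new Path("/home/qiyongkang/test.txt"); // file on local Linux
        Path hdfsDir = new Path("/qiyongkang/");                // target directory on HDFS

        // copyFromLocalFile streams the local file into HDFS,
        // creating the target directory if it does not exist.
        fs.copyFromLocalFile(localFile, hdfsDir);
        fs.close();
        System.out.println("Upload finished.");
    }
}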
Note that if the target directory on HDFS does not exist, HDFS creates it automatically, which is quite convenient.

After packaging, upload the jar to the server and run yarn jar mr-demo-0.0.1-SNAPSHOT-jar-with-dependencies.jar, then run hadoop fs -ls /qiyongkang to see the result:

2、 Uploading files from FTP to HDFS with Java

First of all, we need an FTP server. There is plenty of material on setting up an FTP server, so I won't repeat it here.

Pulling files from FTP and uploading them to HDFS is actually not complicated. When uploading local files to HDFS we already work with streams, so we can simply read the file as a stream from FTP and write that stream directly to HDFS.

Next, here is the code:

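The original listing is also missing here, so below is a rough sketch of the idea, assuming Apache Commons Net's FTPClient for the FTP side; the host, credentials, and file paths are placeholders. The remote file is opened as an InputStream and copied straight into an HDFS output stream.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FtpToHdfsDemo {
    public static void main(String[] args) throws Exception {
        // FTP connection details are placeholders; replace with your server and account.
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com", 21);
        ftp.login("ftpuser", "ftppassword");
        ftp.setFileType(FTP.BINARY_FILE_TYPE);
        ftp.enterLocalPassiveMode();

        // Open the remote file as a stream instead of downloading it to local disk first.
        InputStream in = ftp.retrieveFileStream("/upload/test.txt");

        // Write the stream straight into HDFS.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        OutputStream out = fs.create(new Path("/qiyongkang/test.txt"));

        // copyBytes copies the FTP stream into HDFS and closes both streams when done.
        IOUtils.copyBytes(in, out, 4096, true);

        ftp.completePendingCommand(); // finalize the FTP transfer
        ftp.logout();
        ftp.disconnect();
        fs.close();
    }
}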
Then, after packaging and uploading in the same way, run yarn jar mr-demo-0.0.1-SNAPSHOT-jar-with-dependencies.jar, and you can see:

Summary

The above is the example code for the Hadoop file upload function. I hope it is helpful to you. If you have any questions, please leave a message and I will reply as soon as possible. Thank you very much for your support of the programming tips website!
