A detailed guide to writing and running Spark applications in Java

Let's start with a simple requirement:

We need to analyze the access logs of a website and count the number of visits from each distinct IP address, so that, with the help of GeoIP information, we can obtain the geographic distribution of visitors by country and region. I will use log lines from my own website as an example; a record line looks like the one shown below:
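The original sample line was not preserved in this copy of the article. As an illustration, a line in the common Apache combined log format (all values here are made up) looks like this; note that the first field is the client IP address:

```
192.0.2.10 - - [22/Feb/2014:06:25:14 +0800] "GET /archives/526.html HTTP/1.1" 200 12345 "-" "Mozilla/5.0"
```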

Implementing the Spark application in Java

The statistical analysis program implements the following functions:

Read the log data file from HDFS

Extract the first field (the IP address) from each line

Count the number of occurrences of each IP address

Sort the results in descending order by occurrence count

Look up the country for each IP address by calling the GeoIP library

Print the results, one record per line, in the format: [country code] IP address count

Next, let's look at the code of the statistical analysis application implemented in Java:
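The original listing was not preserved in this copy of the article. The following is a minimal sketch of what such a job could look like, reconstructed from the function points above and the class name org.shirdrn.spark.job.IpAddressStats mentioned later; it assumes the Spark 1.x Java API and the legacy MaxMind GeoIP API (com.maxmind.geoip.LookupService), so details may differ from the original:

```java
package org.shirdrn.spark.job;

import java.io.IOException;
import java.io.Serializable;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import com.maxmind.geoip.LookupService;

import scala.Tuple2;

public class IpAddressStats implements Serializable {

    private static final long serialVersionUID = 1L;

    // LookupService is not serializable, so it is marked transient and
    // created lazily; see the note on serialization below.
    private transient LookupService lookupService;
    private final String geoIpFile;

    public IpAddressStats(String geoIpFile) {
        this.geoIpFile = geoIpFile;
    }

    private LookupService lookupService() throws IOException {
        if (lookupService == null) {
            lookupService = new LookupService(geoIpFile, LookupService.GEOIP_MEMORY_CACHE);
        }
        return lookupService;
    }

    public void stat(String masterUrl, String inputFile) throws IOException {
        SparkConf conf = new SparkConf().setAppName("IpAddressStats").setMaster(masterUrl);
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // Read the log file from HDFS, extract the first field (the IP
            // address) of each line, and count occurrences per IP.
            JavaPairRDD<String, Integer> counts = sc.textFile(inputFile)
                .mapToPair(new PairFunction<String, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(String line) {
                        return new Tuple2<String, Integer>(line.split(" ")[0], 1);
                    }
                })
                .reduceByKey(new Function2<Integer, Integer, Integer>() {
                    @Override
                    public Integer call(Integer a, Integer b) {
                        return a + b;
                    }
                });

            // Swap (ip, count) to (count, ip) and sort descending by count.
            List<Tuple2<Integer, String>> sorted = counts
                .mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
                    @Override
                    public Tuple2<Integer, String> call(Tuple2<String, Integer> t) {
                        return new Tuple2<Integer, String>(t._2(), t._1());
                    }
                })
                .sortByKey(false)
                .collect();

            // Resolve the country of each IP via GeoIP and print the result,
            // one line per record: [country code] IP address count
            for (Tuple2<Integer, String> t : sorted) {
                String countryCode = lookupService().getCountry(t._2()).getCode();
                System.out.println("[" + countryCode + "] " + t._2() + " " + t._1());
            }
        } finally {
            sc.stop();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("Usage: IpAddressStats <masterUrl> <inputFile> <geoIpFile>");
            System.exit(1);
        }
        new IpAddressStats(args[2]).stat(args[0], args[1]);
    }
}
```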

For the specific implementation logic, please refer to the comments in the code. We use Maven to build the Java program, so first let's look at the packages my POM configuration depends on, as shown below:
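The original POM listing was not preserved here either; at minimum the build needs the Spark core artifact and the legacy MaxMind GeoIP API, roughly as sketched below (the artifact versions are illustrative):

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.0.0</version>
  </dependency>
  <dependency>
    <groupId>com.maxmind.geoip</groupId>
    <artifactId>geoip-api</artifactId>
    <version>1.2.14</version>
  </dependency>
</dependencies>
```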

Note that when we run the program on a Spark cluster, the job we write must be serializable. If a field does not need to be, or cannot be, serialized, we can simply mark it with the transient modifier. For example, the lookupService field above does not implement the Serializable interface, so transient is used to exclude it from serialization. Otherwise, an error like the following may occur:
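The original error log was not preserved; typically, Spark fails such a job with a task-serialization error that names the offending class, along these lines (the exact message varies by Spark version):

```
org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: com.maxmind.geoip.LookupService
```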

Running the Java program on a Spark cluster

As mentioned, I use Maven to build the Java program. After implementing the code above, I use the Maven Assembly Plugin to package it; the plugin configuration is as follows:
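The original plugin configuration was not preserved; a common setup that matches the description (bundling the dependencies into the application JAR via the jar-with-dependencies descriptor) is sketched below:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```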

This packs the required dependency libraries into the application JAR. Finally, copy the JAR file to a Linux machine (not necessarily the master node of the Spark cluster), making sure the Spark environment variables on that node are configured correctly. After unpacking the Spark distribution, you will find the script bin/run-example. We can modify this script directly, pointing the relevant paths at the JAR we built (change the EXAMPLES_DIR variable and the parts that refer to the location of our JAR file), and then use it to launch our program. The script contents are as follows:
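The original script listing was not preserved, and the shipped bin/run-example script differs between Spark versions, so the following is only a simplified sketch of what such a launcher does after the modification (all paths and file names are illustrative):

```bash
#!/usr/bin/env bash
# Simplified sketch of a run-example-style launcher (illustrative only).

SPARK_HOME="$(cd "$(dirname "$0")/.." && pwd)"

# Point EXAMPLES_DIR at the directory holding our assembled application JAR
# instead of Spark's bundled examples.
EXAMPLES_DIR="$SPARK_HOME/java-examples"
SPARK_EXAMPLES_JAR="$EXAMPLES_DIR/spark-java-examples-with-dependencies.jar"

# Build the classpath from our JAR and the Spark configuration directory.
CLASSPATH="$SPARK_EXAMPLES_JAR:$SPARK_HOME/conf"

# The first argument is the fully qualified main class;
# the remaining arguments are passed through to the program.
exec java -cp "$CLASSPATH" "$@"
```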

Now run the Java program we developed on Spark by executing the following command:
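The original command line was not preserved; based on the modified run-example script and the three arguments described below, the invocation would look roughly like this (the Spark install directory and the local path of the GeoIP database file are illustrative):

```bash
cd /usr/local/spark   # illustrative Spark installation directory
./bin/run-example org.shirdrn.spark.job.IpAddressStats \
    spark://m1:7077 \
    hdfs://m1:9000/user/shirdrn/wwwlog20140222.log \
    /home/shirdrn/GeoIP.dat
```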

The program class I implemented, org.shirdrn.spark.job.IpAddressStats, requires three arguments to run:

Spark cluster master node URL: for example, mine is spark://m1:7077

Input file path: application-specific; here I read the file from HDFS: hdfs://m1:9000/user/shirdrn/wwwlog20140222.log

GeoIP database file: an external file, specific to this service, used to determine the country to which an IP address belongs

If the program runs without errors, the console prints the program's running log, ending with the result lines. An example is as follows:
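The original console output was not preserved. Interleaved with Spark's own log messages, the program's result lines follow the [country code] IP address count format described earlier; with made-up illustrative values, they would look like:

```
[CN] 123.126.68.238 635
[US] 66.249.66.91 172
[CN] 110.75.173.48 58
```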

We can also view the status of the currently running application through the web console on port 8080 of the master node (for example, http://m1:8080/), where you can see the application status information for the cluster.

In addition, if you develop Spark applications in Java with Eclipse in a UNIX environment, you can also connect to the Spark cluster directly from Eclipse, submit the application you developed, and have the cluster process it.

Summary

That is the full content of this article on writing and running Spark applications in Java. I hope you find it helpful. If you have any questions, feel free to leave a comment, and I will reply as soon as possible.
