Java – need help running a MapReduce wordcount job on data stored in Amazon S3

I'm trying to run a MapReduce wordcount job on a text file stored in a bucket on Amazon S3. I have set up all the necessary authentication for the MapReduce framework to communicate with Amazon, but I keep getting this error. Do you know why?

13/01/20 13:22:15 ERROR security.UserGroupInformation:
PriviledgedActionException as:root
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist: s3://name-bucket/test.txt
Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: s3://name-bucket/test.txt
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.myorg.WordCount.main(WordCount.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Solution

You have to replace the s3 scheme with s3n. These are two different file systems with different properties:

> S3n is the S3 native file system: a native file system for reading and writing regular files on S3. The advantage of this file system is that you can access files on S3 that were written with other tools; conversely, other tools can access files written using Hadoop. The disadvantage is that S3 imposes a 5 GB limit on file size, so it is not suitable as a replacement for HDFS (which supports very large files).
>
> S3 is the block file system: a block-based file system backed by S3. Files are stored as blocks, just as they are in HDFS. This allows renames to be implemented efficiently. This file system requires you to dedicate a bucket to it; you should not use an existing bucket that contains files, or write other files to the same bucket. The files it stores can be larger than 5 GB, but they are not interoperable with other S3 tools.
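If you want to see which implementation each scheme resolves to, here is a minimal sketch (assuming Hadoop 1.x, where the defaults come from core-default.xml; the class name below is my own):

    import org.apache.hadoop.conf.Configuration;

    public class S3SchemeDefaults {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // On Hadoop 1.x these typically resolve to:
            //   fs.s3n.impl -> org.apache.hadoop.fs.s3native.NativeS3FileSystem (native file system)
            //   fs.s3.impl  -> org.apache.hadoop.fs.s3.S3FileSystem (block file system)
            System.out.println("s3n -> " + conf.get("fs.s3n.impl"));
            System.out.println("s3  -> " + conf.get("fs.s3.impl"));
        }
    }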

(source)

In your case, your bucket is probably using the s3n file system. I believe that is the default; most buckets I use are also s3n. So you should use s3n://name-bucket/test.txt.
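For reference, here is a minimal sketch of what the job driver could look like with the corrected scheme, using the old mapred API that appears in your stack trace. The output path, driver class name, and setting the credentials via the fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey properties are assumptions on my part; adapt it to your existing org.myorg.WordCount.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // Hypothetical placeholders: supply your real AWS credentials here,
            // or set these properties in core-site.xml instead.
            conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
            conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

            // Mapper, reducer, and output key/value classes omitted;
            // keep the ones from your existing WordCount job here.

            // Note the s3n:// scheme rather than s3://
            FileInputFormat.setInputPaths(conf, new Path("s3n://name-bucket/test.txt"));
            FileOutputFormat.setOutputPath(conf, new Path("s3n://name-bucket/output"));

            JobClient.runJob(conf);
        }
    }

The same credential properties can also be set in core-site.xml or passed on the command line with -D, so the driver code itself does not have to change beyond the input path scheme.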
