Java – Wildcards (globs) in Hadoop FileSystem listing API calls

TL;DR: To be able to use wildcards (globs) in the paths you list, just use globStatus(...) instead of listStatus(...).

Context

The files on my HDFS cluster are organized in partitions, with the date as the "root" partition. A simplified example of the file structure looks like this:

/schemas_folder
├── date=20140101
│ ├── A-schema.avsc
│ ├── B-schema.avsc
├── date=20140102
│ ├── A-schema.avsc
│ ├── B-schema.avsc
│ ├── C-schema.avsc
└── date=20140103
  ├── B-schema.avsc
  └── C-schema.avsc

In my case, the directory stores Avro schemas for different types of data (A, B, and C in this example) at different dates. Schemas may start existing, evolve, and stop existing over time.

Goal

I need to get all the schemas that exist for a given type, as fast as possible. In the example, to get all the schemas that exist for type A, I would do the following:

hdfs dfs -ls /schemas_folder/date=*/A-schema.avsc

That would give me:

Found 1 items
-rw-r--r--   3 user group 1234 2014-01-01 12:34 /schemas_folder/date=20140101/A-schema.avsc
Found 1 items
-rw-r--r--   3 user group 2345 2014-01-02 23:45 /schemas_folder/date=20140102/A-schema.avsc

Problem

I don't want to use shell commands, and I can't seem to find the equivalent of the command above in the Java APIs. When I try to implement the loop myself, I get terrible performance. I want at least the performance of the command line (around 3 seconds in my case)!

What I have found so far

One can notice that the shell prints "Found 1 items" twice, once before each result, instead of printing "Found 2 items" once at the beginning. This probably means that the wildcard is not implemented on the FileSystem side but is somehow handled by the client. I can't seem to find the right source code to look at to see how that is implemented.
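
If that hypothesis is right, the shell's behavior would amount to something like the sketch below (an assumption about the shell's internals, not taken from its source): expand the glob on the client first, then list each match separately, which would produce one header per result.

// Hypothetical reconstruction of the shell's client-side glob handling
// (an assumption; not taken from the FsShell source code).
FileStatus[] expanded = filesystem.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));
for (FileStatus match : expanded)
{
    // Listing a single file returns a one-element array,
    // hence one "Found 1 items" header per result.
    FileStatus[] listed = filesystem.listStatus(match.getPath());
    System.out.println("Found " + listed.length + " items");
    for (FileStatus listedStatus : listed)
    {
        System.out.println(listedStatus.getPath());
    }
}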

Here are my first shots; they are probably a bit naïve. All the snippets assume an existing FileSystem instance named filesystem (see the setup sketch below).
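
As a minimal setup sketch (the snippets never show how filesystem is obtained, so this initialization is an assumption based on the usual Hadoop client pattern):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Assumption: filesystem is created from the cluster configuration
// found on the classpath (core-site.xml, hdfs-site.xml).
Configuration conf = new Configuration();
FileSystem filesystem = FileSystem.get(conf);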

Using listFiles(...)

Code:

// Recursively list everything under /schemas_folder, then filter client-side.
RemoteIterator<LocatedFileStatus> files = filesystem.listFiles(new Path("/schemas_folder"), true);
Pattern pattern = Pattern.compile("^.*/date=[0-9]{8}/A-schema\\.avsc$");
while (files.hasNext()) {
    Path path = files.next().getPath();
    if (pattern.matcher(path.toString()).matches())
    {
        System.out.println(path);
    }
}

Result:

This prints exactly what I expect, but since it first lists everything recursively and only then filters, the performance is really bad. With my current dataset it takes almost 25 seconds.

Using listStatus(...)

Code:

// Step 1: list only the date=YYYYMMDD partition directories.
FileStatus[] statuses = filesystem.listStatus(new Path("/schemas_folder"), new PathFilter()
{
    private final Pattern pattern = Pattern.compile("^date=[0-9]{8}$");

    @Override
    public boolean accept(Path path)
    {
        return pattern.matcher(path.getName()).matches();
    }
});
// Step 2: list each partition directory, keeping only A-schema.avsc.
Path[] paths = new Path[statuses.length];
for (int i = 0; i < statuses.length; i++) { paths[i] = statuses[i].getPath(); }
statuses = filesystem.listStatus(paths, new PathFilter()
{
    @Override
    public boolean accept(Path path)
    {
        return "A-schema.avsc".equals(path.getName());
    }
});
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}

Result:

Thanks to the PathFilters and the use of arrays, it performs faster (around 12 seconds). The code is more complex, though, and harder to adapt to different situations. Most importantly, it is still 3 to 4 times slower than the command-line version!

Question

What am I missing here? What is the quickest way to get the results I want?

Update

2014.07.09 – 13:38

The answer proposed by Mukesh S is apparently the best API approach.

With the example given above, the final code looks like this:

FileStatus[] statuses = filesystem.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}

This is the nicest and fastest code I have found so far, but it still does not perform as well as the shell version.

Solution

Instead of listStatus, you can try Hadoop's globStatus. Hadoop provides two FileSystem methods for processing globs:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

You can specify an optional PathFilter to further restrict the matches, as in the sketch below.
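
For example, a minimal sketch combining the glob with a PathFilter (the filter shown here, which keeps only the January 2014 partitions, is purely illustrative):

// Illustrative only: restrict the glob matches to January 2014 partitions.
FileStatus[] statuses = filesystem.globStatus(
        new Path("/schemas_folder/date=*/A-schema.avsc"),
        new PathFilter()
        {
            @Override
            public boolean accept(Path path)
            {
                // Keep only files whose parent partition starts with date=201401.
                return path.getParent().getName().startsWith("date=201401");
            }
        });
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}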

For more explanation, you can check Hadoop: The Definitive Guide.

Hope it helps!!!
