Java – Wildcard (glob) listing API calls on the Hadoop FileSystem
TL;DR: To be able to use wildcards (globs) in the paths you list, just use globStatus(...) instead of listStatus(...).
Context
The files on my HDFS cluster are organized in partitions, with the date as the "root" partition. A simplified example of the file structure looks like this:
/schemas_folder
├── date=20140101
│   ├── A-schema.avsc
│   ├── B-schema.avsc
├── date=20140102
│   ├── A-schema.avsc
│   ├── B-schema.avsc
│   ├── C-schema.avsc
└── date=20140103
    ├── B-schema.avsc
    └── C-schema.avsc
In my case, the directory stores Avro schemas for different types of data (A, B, and C in this example) on different dates. Over time, schemas may start to exist, evolve, and stop existing.
Goal
I need to get all the schemas that exist for a given type as fast as possible. In the example, to get all the schemas that exist for type A, I would like to do the equivalent of:
hdfs dfs -ls /schemas_folder/date=*/A-schema.avsc
That would give me:
Found 1 items
-rw-r--r--   3 user group   1234 2014-01-01 12:34 /schemas_folder/date=20140101/A-schema.avsc
Found 1 items
-rw-r--r--   3 user group   2345 2014-01-02 23:45 /schemas_folder/date=20140102/A-schema.avsc
Problem
I don't want to use shell commands, and I can't seem to find an equivalent of the command above in the Java API. When I try to implement the loop myself, I get terrible performance. I want at least command-line performance (about 3 seconds in my case).
What I have found so far
Notice that it prints Found 1 items twice, once before each result, instead of printing Found 2 items once at the beginning. This probably means that wildcards are not implemented on the FileSystem side but are handled by the client. I can't seem to find the right source code to look at to see how it is implemented.
Below are my first shots, probably a bit naive.
Using listFiles(...)
Code:
RemoteIterator<LocatedFileStatus> files = filesystem.listFiles(new Path("/schemas_folder"), true);
Pattern pattern = Pattern.compile("^.*/date=[0-9]{8}/A-schema\\.avsc$");
while (files.hasNext()) {
    Path path = files.next().getPath();
    if (pattern.matcher(path.toString()).matches()) {
        System.out.println(path);
    }
}
Result:
This prints exactly what I expect, but since it first lists everything recursively and only then filters, the performance is really poor. With my current dataset it takes almost 25 seconds.
Using listStatus(...)
Code:
// First list the date=YYYYMMDD partition directories...
FileStatus[] statuses = filesystem.listStatus(new Path("/schemas_folder"), new PathFilter() {
    private final Pattern pattern = Pattern.compile("^date=[0-9]{8}$");
    @Override
    public boolean accept(Path path) {
        return pattern.matcher(path.getName()).matches();
    }
});
Path[] paths = new Path[statuses.length];
for (int i = 0; i < statuses.length; i++) {
    paths[i] = statuses[i].getPath();
}
// ...then list their contents, keeping only A-schema.avsc files.
statuses = filesystem.listStatus(paths, new PathFilter() {
    @Override
    public boolean accept(Path path) {
        return "A-schema.avsc".equals(path.getName());
    }
});
for (FileStatus status : statuses) {
    System.out.println(status.getPath());
}
Result:
Thanks to the PathFilters and the use of arrays, it seems to run faster (about 12 seconds). The code is more complex, though, and harder to adapt to different situations. Most importantly, it is still 3 to 4 times slower than the command-line version!
Question
What am I missing here? What is the quickest way to get the results I want?
Update
2014.07.09 – 13:38
The answer proposed by Mukesh S (see the Solution below) is apparently the best API approach.
In the example given above, the final code looks like this:
FileStatus[] statuses = filesystem.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));
for (FileStatus status : statuses) {
    System.out.println(status.getPath());
}
This is the nicest-looking and best-performing code so far, but it still doesn't run as fast as the shell version.
Solution
Instead of listStatus, you can try Hadoop's globStatus. Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
An optional PathFilter can be specified to further restrict the matches.
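For illustration, here is a minimal, self-contained sketch of the two-argument form, assuming a configured FileSystem and reusing the example path from the question; the particular filter shown (skipping hidden or temporary entries) is just an example, not part of the original answer:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class GlobWithFilter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The glob matches A-schema.avsc under every date=* partition.
        // The PathFilter further restricts the matches; here it skips
        // hidden or temporary entries such as "_temporary" or ".staging".
        PathFilter notHidden = new PathFilter() {
            @Override
            public boolean accept(Path path) {
                String name = path.getName();
                return !name.startsWith("_") && !name.startsWith(".");
            }
        };

        FileStatus[] statuses = fs.globStatus(
                new Path("/schemas_folder/date=*/A-schema.avsc"), notHidden);
        for (FileStatus status : statuses) {
            System.out.println(status.getPath());
        }
    }
}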
For more explanation, you can check Hadoop: The Definitive Guide here.
Hope it helps!!!