Reading scattered data from multiple files in Java

I am working on a reader/writer for DNG/TIFF files. Since there are several options for processing files (FileInputStream, FileChannel, RandomAccessFile), I would like to know which strategy fits my needs.

A DNG/TIFF file consists of the following:

- some (5-20) small blocks (tens to hundreds of bytes)
- very few (1-3) large continuous blocks of image data (up to 100 MiB)
- several (possibly 20-50) very small blocks (4-16 bytes)

The overall file size ranges from 15 MiB (compressed 14-bit raw data) up to about 100 MiB (uncompressed floating-point data). The number of files to process is 50-400.

There are two modes of use:

- read all metadata from all files (everything except the image data)
- read all image data from all files

I am currently using a FileChannel and calling map() to obtain a MappedByteBuffer covering the entire file. This seems wasteful if I am only interested in reading the metadata. Another problem is freeing the mapped memory: when I pass slices of the mapped buffer around for parsing etc., the underlying MappedByteBuffer will not be garbage-collected.

I have now decided to copy the smaller chunks with several FileChannel.read() calls and to map() only the large raw-data regions. The downside is that reading a single value feels awkward, because there is no readShort() and the like:

short readShort(long offset) throws IOException, InterruptedException {
    return read(offset, Short.BYTES).getShort();
}

ByteBuffer read(long offset, long byteCount) throws IOException, InterruptedException {
    // copy byteCount bytes at offset into a small heap buffer using the file's byte order
    ByteBuffer buffer = ByteBuffer.allocate(Math.toIntExact(byteCount));
    buffer.order(GenericTiffFileReader.this.byteOrder);
    GenericTiffFileReader.this.readInto(buffer, offset);
    return buffer;
}

private void readInto(ByteBuffer buffer, long startOffset)
        throws IOException, InterruptedException {

    long offset = startOffset;
    while (buffer.hasRemaining()) {
        int bytesRead = this.channel.read(buffer, offset);
        switch (bytesRead) {
        case 0:
            Thread.sleep(10);
            break;
        case -1:
            throw new EOFException("unexpected end of file");
        default:
            offset += bytesRead;
        }
    }
    buffer.flip();
}
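For illustration, with these helpers an IFD (image file directory) can then be walked inside the same class roughly as sketched below. readIfd() is not part of my actual reader; the 2-byte entry count and the 12-byte entries (tag, type, count, value/offset) are what the TIFF spec prescribes:

// Sketch only: walk one IFD using the small-read helpers above.
void readIfd(long ifdOffset) throws IOException, InterruptedException {
    int entryCount = readShort(ifdOffset) & 0xFFFF;
    for (int i = 0; i < entryCount; i++) {
        ByteBuffer entry = read(ifdOffset + 2 + i * 12L, 12);
        int tag = entry.getShort() & 0xFFFF;    // which field this entry describes
        int type = entry.getShort() & 0xFFFF;   // BYTE, SHORT, LONG, RATIONAL, ...
        long count = entry.getInt() & 0xFFFFFFFFL;
        // ... interpret the remaining 4-byte value/offset depending on tag, type and count ...
    }
}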

RandomAccessFile provides convenient methods such as readShort() or readFully(), but it cannot handle little-endian byte order.
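Of course the byte swapping can be done by hand on top of RandomAccessFile, along the lines of the sketch below (the class name is made up), but then I would have to wrap every read method myself:

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: little-endian reads on top of RandomAccessFile by swapping the
// bytes of the big-endian values it returns.
final class LittleEndianRaf {
    private final RandomAccessFile raf;

    LittleEndianRaf(RandomAccessFile raf) {
        this.raf = raf;
    }

    short readShortLE(long offset) throws IOException {
        raf.seek(offset);
        return Short.reverseBytes(raf.readShort());
    }

    int readIntLE(long offset) throws IOException {
        raf.seek(offset);
        return Integer.reverseBytes(raf.readInt());
    }
}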

So, is there a common way to handle both scattered reads of single values and huge blocks? Is memory-mapping an entire 100 MiB file just to read a few hundred bytes wasteful, or slow?

Solution

OK, I finally did some rough benchmarks:

- drop all read caches: echo 3 > /proc/sys/vm/drop_caches
- repeat 8 times: read 1000 times 8 bytes from each file (about 20 files ranging from 20 MiB to 1 GiB)

The total size of the files exceeded my installed system memory.

Method 1: FileChannel and temporary ByteBuffers:

private static long method1(Path file, long dummyUsage) throws IOException, Error {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {

        for (int i = 0; i < 1000; i++) {
            ByteBuffer dst = ByteBuffer.allocate(8);

            if (channel.position(i * 10000).read(dst) != dst.capacity())
                throw new Error("partial read");
            dst.flip();
            dummyUsage += dst.order(ByteOrder.LITTLE_ENDIAN).getInt();
            dummyUsage += dst.order(ByteOrder.BIG_ENDIAN).getInt();
        }
    }
    return dummyUsage;
}

Result:

1. 3422 ms
2. 56 ms
3. 24 ms
4. 24 ms
5. 27 ms
6. 25 ms
7. 23 ms
8. 23 ms

Method 2: MappedByteBuffer covering the entire file:

private static long method2(Path file, long dummyUsage) throws IOException {

    final MappedByteBuffer buffer;
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        buffer = channel.map(MapMode.READ_ONLY, 0L, Files.size(file));
    }
    for (int i = 0; i < 1000; i++) {
        dummyUsage += buffer.order(ByteOrder.LITTLE_ENDIAN).getInt(i * 10000);
        dummyUsage += buffer.order(ByteOrder.BIG_ENDIAN).getInt(i * 10000 + 4);
    }
    return dummyUsage;
}

Result:

1. 749 ms
2. 21 ms
3. 17 ms
4. 16 ms
5. 18 ms
6. 13 ms
7. 15 ms
8. 17 ms

Method 3: RandomAccessFile:

private static long method3(Path file, long dummyUsage) throws IOException {

    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
        for (int i = 0; i < 1000; i++) {

            raf.seek(i * 10000);
            // readInt() is big-endian; reverseBytes() yields the little-endian value
            dummyUsage += Integer.reverseBytes(raf.readInt());
            raf.seek(i * 10000 + 4);
            dummyUsage += raf.readInt();
        }
    }
    return dummyUsage;
}

Result:

1. 3479 ms
2. 104 ms
3. 81 ms
4. 84 ms
5. 78 ms
6. 81 ms
7. 81 ms
8. 81 ms

Conclusion: The MappedByteBuffer approach uses more page-cache memory (340 MB instead of 140 MB), but it performs better on the first run as well as on all subsequent runs, and it seems to have the lowest overhead. As a bonus, it provides a very comfortable interface for dealing with byte order, scattered small values and huge data blocks. RandomAccessFile performed worst.

To answer my own question: a MappedByteBuffer covering the whole file seems to be the idiomatic and fastest way to handle random access to large files, without wasting memory.
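As a sketch of what that looks like in practice (the class and method names here are only illustrative, not taken from my actual reader):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: map the whole file once, set the byte order, then use absolute
// getXxx() calls for the scattered small values and a slice for image data.
final class MappedTiffReader {
    private final MappedByteBuffer buffer;

    MappedTiffReader(Path file, ByteOrder order) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            buffer = channel.map(MapMode.READ_ONLY, 0L, Files.size(file));
        }
        buffer.order(order);
    }

    short readShort(long offset) {
        return buffer.getShort(Math.toIntExact(offset)); // absolute read, no copying
    }

    ByteBuffer imageData(long offset, int length) {
        // A view into the mapped region; the pixel data is not copied.
        ByteBuffer view = buffer.duplicate();
        view.position(Math.toIntExact(offset));
        view.limit(Math.toIntExact(offset + length));
        return view.slice().order(buffer.order());
    }
}

Reading metadata then boils down to calls like readShort(offset), and getting at the image data to something like imageData(stripOffset, stripByteCount), all without copying the large block.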
