Java: applying a regular expression to strings decoded from a small byte-array buffer over a large file

I'm reading a file that I can't load into memory all at once, because its size ranges from 256 MB to 2 GB.

After opening the file, I read a block into a byte array (say 512 bytes), convert it to a String, and run a regular expression on it. If the pattern is detected, my program records it.

The problem is that my program misses many places in the file where the pattern should be detected.

I'm 90% sure the problem is that the pattern is present but incomplete, because it straddles the end of the buffer. The pattern I am looking for is 8 bytes long. For example, the first four bytes of the pattern land in the last four positions of the array, so when the array is refilled, its first four bytes are the last four bytes of the pattern. In those cases my regular expression always fails.

I guess what I need to do is fill the buffer and then, when refilling it, keep the last 20 or so bytes so that I won't miss any pattern I'm looking for.

Any suggestions would be appreciated. Thank you in advance.

Tony

Solution

First, you cannot apply Java regular expressions to a byte array. You must apply them to a String (more precisely, a CharSequence). Therefore you must convert from byte[] to String, and in doing so you may (a) use the wrong encoding, or (b) cut a string in the middle.
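To make risk (b) concrete, here is a minimal sketch (the class and method names are hypothetical, not from the question) that decodes a byte array in two chunks and compares the result with decoding it in one go. With UTF-8, a cut that falls inside a multi-byte character produces replacement characters. With ISO-8859-1, every byte maps to exactly one char, which is one possible way to run a regex over raw binary data, assuming the pattern is written in terms of byte values.

```java
import java.nio.charset.Charset;

public class ChunkDecoding {
    /** Decodes bytes [from, to) of data as one chunk, the way a naive block reader would. */
    public static String decode(byte[] data, int from, int to, Charset cs) {
        // Malformed or truncated sequences are silently replaced by this constructor.
        return new String(data, from, to - from, cs);
    }

    /** True if decoding in two chunks split at cut gives the same text as decoding all at once. */
    public static boolean splitIsSafe(byte[] data, int cut, Charset cs) {
        String whole = decode(data, 0, data.length, cs);
        String parts = decode(data, 0, cut, cs) + decode(data, cut, data.length, cs);
        return whole.equals(parts);
    }
}
```

For the UTF-8 bytes of "café!", cutting after the fourth byte splits the two-byte "é", so splitIsSafe returns false for UTF-8 but true for ISO-8859-1.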

Once you've done that, you need a streaming algorithm to reassemble what you're reading. I can describe one that may or may not apply:

> Read a large amount of data into the buffer.
> Find the last sentence boundary in the buffer.
> Process from the start to the boundary.
> Move the remainder to the front of the buffer.
> Refill the rest of the buffer from the source.
> Lather, rinse, repeat.
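The steps above can be sketched for the binary case in the question. This is only an illustration under stated assumptions, not a definitive implementation: binary data has no sentence boundaries, so the sketch places the boundary maxMatchLen - 1 bytes before the end of the buffer, where maxMatchLen is an assumed upper bound on the match length (8 in the question). It also assumes bufSize is larger than maxMatchLen and that the pattern has no anchors or lookbehind. The class name and parameters are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkedRegexScanner {
    /**
     * Returns the absolute byte offset of every match of pattern in the stream.
     * Assumption: no match is longer than maxMatchLen, and bufSize > maxMatchLen.
     */
    public static List<Long> findAll(InputStream in, Pattern pattern,
                                     int maxMatchLen, int bufSize) throws IOException {
        List<Long> offsets = new ArrayList<>();
        byte[] buf = new byte[bufSize];
        int filled = 0;      // number of valid bytes currently in buf
        long base = 0;       // file offset of buf[0]
        boolean eof = false;
        while (!eof) {
            // Refill the rest of the buffer from the source.
            while (filled < buf.length) {
                int n = in.read(buf, filled, buf.length - filled);
                if (n < 0) { eof = true; break; }
                filled += n;
            }
            // ISO-8859-1 maps each byte to exactly one char, so char index == byte index.
            String text = new String(buf, 0, filled, StandardCharsets.ISO_8859_1);
            // A match starting at or after the boundary might be cut off; defer it unless at EOF.
            int boundary = eof ? filled : filled - (maxMatchLen - 1);
            int keepFrom = boundary;
            Matcher m = pattern.matcher(text);
            while (m.find() && m.start() < boundary) {
                offsets.add(base + m.start());
                // Do not rescan bytes that a reported match already covered.
                keepFrom = Math.max(keepFrom, m.end());
            }
            // Move the unprocessed remainder to the front of the buffer.
            System.arraycopy(buf, keepFrom, buf, 0, filled - keepFrom);
            base += keepFrom;
            filled -= keepFrom;
        }
        return offsets;
    }
}
```

Deferring matches that start in the tail, rather than simply re-reading an overlap, is a design choice that keeps a short match lying entirely inside the overlap from being reported twice.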

If this is a plain text file, wrap the input stream in a Reader so that the bytes are decoded to characters for you:

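// inputByteStream is the InputStream opened on the file; needs java.io.* and java.nio.charset.StandardCharsets
Reader r = new InputStreamReader(inputByteStream, StandardCharsets.UTF_8);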

Then apply the algorithm above to the characters read from the Reader, so that no match is lost at a buffer boundary.
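For a text file, one way this might look is sketched below. It assumes that a newline can serve as the "sentence boundary", meaning no match ever spans a '\n'; that assumption, the class name, and the bufSize parameter are illustrative and not part of the original answer. Because the Reader does the decoding, a multi-byte character is never split between two reads.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineBoundaryScanner {
    /** Collects the text of every match of pattern, assuming no match spans a '\n'. */
    public static List<String> findAll(Reader r, Pattern pattern, int bufSize)
            throws IOException {
        List<String> found = new ArrayList<>();
        char[] buf = new char[bufSize];
        StringBuilder pending = new StringBuilder(); // carried-over remainder plus new chars
        int n;
        while ((n = r.read(buf, 0, buf.length)) != -1) {
            pending.append(buf, 0, n);
            // Boundary is just past the last newline seen so far (0 if there is none yet).
            int boundary = pending.lastIndexOf("\n") + 1;
            collect(pattern, pending.substring(0, boundary), found);
            pending.delete(0, boundary); // keep only the incomplete last line
        }
        collect(pattern, pending, found); // whatever is left at end of input
        return found;
    }

    private static void collect(Pattern pattern, CharSequence text, List<String> out) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
    }
}
```

Note that the pending remainder grows without bound if the input contains a very long run with no newline, so this variant only suits files with reasonably short lines.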
