How to quickly search for strings in large files in Java?
I'm trying to search a large text file (400MB) for a specific string using the following:
    File file = new File("fileName.txt");
    try {
        int count = 0;
        Scanner scanner = new Scanner(file);
        while (scanner.hasNextLine()) {
            if (scanner.nextLine().contains("particularString")) {
                count++;
                System.out.println("Number of instances of String: " + count);
            }
        }
    } catch (FileNotFoundException e) {
        System.out.println(e);
    }
This works fine for small files, but it takes too long (> 10 minutes) for this particular file and for other large files.
What is the fastest and most effective way to do this?
I have now changed to the following, and it completes in a few seconds:
    try {
        int count = 0;
        FileReader fileIn = new FileReader(file);
        BufferedReader reader = new BufferedReader(fileIn);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.contains("particularString")) {
                count++;
                System.out.println("Number of instances of String " + count);
            }
        }
    } catch (IOException e) {
        System.out.println(e);
    }
Solution
First, find out how long it takes just to read the entire file content, and how long it takes to scan that content for your pattern.
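One rough way to take these two measurements is to time a read-only pass and a read-and-search pass separately. This is only a sketch: the class name ReadVsScanTiming and its method names are made up for illustration, the file is assumed to be UTF-8 text, and the first pass warms the OS file cache, so run it several times before trusting the numbers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original post.
public class ReadVsScanTiming {

    // Pass 1: read every line and discard it, to approximate the pure read cost.
    static long readOnly(Path path) throws IOException {
        long lines = 0;
        // Assumption: the file is UTF-8; malformed input will throw an IOException.
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    // Pass 2: read every line and also search it, giving read + scan cost.
    static long readAndSearch(Path path, String needle) throws IOException {
        long matchingLines = 0;
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(needle)) {
                    matchingLines++;
                }
            }
        }
        return matchingLines;
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args.length > 0 ? args[0] : "fileName.txt");
        String needle = args.length > 1 ? args[1] : "particularString";

        long t0 = System.nanoTime();
        long lines = readOnly(path);
        long t1 = System.nanoTime();
        long matches = readAndSearch(path, needle);
        long t2 = System.nanoTime();

        System.out.println("Lines read: " + lines);
        System.out.println("Matching lines: " + matches);
        System.out.println("Read only (ms): " + (t1 - t0) / 1_000_000);
        System.out.println("Read + search (ms): " + (t2 - t1) / 1_000_000);
    }
}
```

If the two timings are close, reading dominates; if the second is much larger, the search itself is the cost worth attacking.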
If your results are dominated by the read time (and assuming you read the file properly, i.e. with channels or at least buffered readers), there is not much left to do.
If the scanning time dominates instead, you can read all the lines and hand small batches of lines to a work queue, where multiple threads pick up the batches and search them.
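The batching idea could look roughly like the sketch below, where one thread reads lines and a fixed thread pool searches the batches. The class name BatchedSearch, the batch size of 10,000 lines, and the UTF-8 charset are assumptions chosen for illustration, not values from the original answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of the original post.
public class BatchedSearch {

    // Assumed batch size; tune it by measuring.
    static final int BATCH_SIZE = 10_000;

    // Builds a task that counts the lines in one batch containing the needle.
    static Callable<Long> searchTask(List<String> batch, String needle) {
        return () -> {
            long count = 0;
            for (String line : batch) {
                if (line.contains(needle)) {
                    count++;
                }
            }
            return count;
        };
    }

    // Reads on the calling thread and searches batches on a pool of worker threads.
    static long countMatchingLines(Path path, String needle, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> results = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    // Submit the full batch and start a fresh list for the next one.
                    results.add(pool.submit(searchTask(batch, needle)));
                    batch = new ArrayList<>(BATCH_SIZE);
                }
            }
            if (!batch.isEmpty()) {
                results.add(pool.submit(searchTask(batch, needle)));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // waits for each batch to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as BatchedSearch.countMatchingLines(Paths.get("fileName.txt"), "particularString", 4) would return the number of matching lines. Note that the executor's queue is unbounded here, so if searching is much slower than reading, pending batches can pile up in memory; a bounded queue would be a reasonable refinement.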
Ballpark figures
Assuming a hard disk read speed of 50 MB/s (which is slow by modern standards), you should be able to read the entire 400 MB file into memory in under 10 seconds (400 MB / 50 MB/s = 8 s).
Looking at MD5 hashing speed benchmarks (for example, here) shows that the hashing rate is at least as fast as the disk read speed, and often faster. In addition, string searching is faster, simpler, and parallelizes much better than hashing.
Putting these two estimates together, I think a proper implementation can easily get you a run time of about 10 seconds (if you start submitting search jobs while you are still reading line batches), and that time will be largely dominated by your disk read time.