How to parse large (50 GB) XML files in Java
At present I am using a SAX parser, but about three quarters of the way through the file it freezes completely. I have tried allocating more memory and so on, but that brought no improvement.
Is there any way to speed this up? Is there a better approach?
I have stripped it down to the bare bones, so I now have the following code. Even when run from the command line, it is still not as fast as I would like.
Running it with "java -Xms4096m -Xmx8192m -jar reader.jar", I get a "GC overhead limit exceeded" error at around page 700000.
Main:
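import java.util.ArrayList;

public class Read {
    public static void main(String[] args) {
        // Collects every parsed page into a single in-memory list
        ArrayList<Page> pages = XMLManager.getPages();
    }
}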
XMLManager:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {
    public static ArrayList<Page> getPages() {
        ArrayList<Page> pages = null;
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("..\\enwiki-20140811-pages-articles.xml");
            PageHandler pageHandler = new PageHandler();
            parser.parse(file, pageHandler);
            // Only available once the whole file has been parsed
            pages = pageHandler.getPages();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            e.printStackTrace();
        }
        return pages;
    }
}
PageHandler:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {
    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler() {
        super();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // Start a fresh text buffer for every element
        stringBuilder = new StringBuilder();
        if (qName.equals("page")) {
            page = new Page();
            idSet = false;
        } else if (qName.equals("redirect")) {
            if (page != null) {
                page.setRedirecting(true);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (page != null && !page.isRedirecting()) {
            if (qName.equals("title")) {
                page.setTitle(stringBuilder.toString());
            } else if (qName.equals("id")) {
                // Only the first <id> inside a page is the page id
                if (!idSet) {
                    page.setId(Integer.parseInt(stringBuilder.toString()));
                    idSet = true;
                }
            } else if (qName.equals("text")) {
                String articleText = stringBuilder.toString();
                articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); //remove references
                articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); //remove links underneath headings
                articleText = articleText.replaceAll("(?s)==See also==.+", " "); //remove everything after see also
                articleText = articleText.replaceAll("\\|", " "); //separate multiple links
                articleText = articleText.replaceAll("\\n", " "); //remove new lines
                articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); //remove all non alphanumeric except dashes and spaces
                articleText = articleText.trim().replaceAll(" +", " "); //convert all multiple spaces to 1 space
                Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
                Matcher matcher = pattern.matcher(articleText);
                if (matcher.find()) {
                    page.setSummaryText(matcher.group());
                } else {
                    page.setSummaryText("None");
                }
                page.setText(articleText);
            } else if (qName.equals("page")) {
                pages.add(page);
                page = null;
            }
        } else {
            page = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        stringBuilder.append(ch, start, length);
    }

    public ArrayList<Page> getPages() {
        return pages;
    }
}
Solution
Your parsing code is probably working fine, but the volume of data you are loading is most likely too large to hold in memory in that ArrayList.
You need some kind of pipeline that passes the data on to its actual destination without ever holding all of it in memory at once.
Here is what I have sometimes done in this kind of situation.
Create an interface for handling individual elements:
public interface PageProcessor { void process(Page page); }
Provide an implementation to the PageHandler through its constructor:
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class Read {
    public static void main(String[] args) {
        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing,
                // but I don't know what that is...
                System.out.println(page);
            }
        });
    }
}

public class XMLManager {
    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            // The handler pushes each finished page to the processor
            PageHandler pageHandler = new PageHandler(processor);
            parser.parse(file, pageHandler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            e.printStackTrace();
        }
    }
}
Send data to this processor instead of putting it in a list:
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {
    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (page != null && !page.isRedirecting()) {
            if (qName.equals("title")) {
                // Title, id and text handling elided: unchanged from your implementation
            } else if (qName.equals("page")) {
                // Hand the finished page off instead of adding it to a list
                processor.process(page);
                page = null;
            }
        } else {
            page = null;
        }
    }
}
Of course, you could make the interface handle batches of records rather than just one: have the PageHandler collect pages locally in a smaller list, periodically send that list off for processing, and then clear it.
Or, perhaps better, you could implement the PageProcessor interface as defined here and build the buffering logic into that implementation, which then passes each batch on for further processing.
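The buffering variant could look roughly like the sketch below. It is only an illustration under some assumptions: BufferingPageProcessor, its batchSize parameter, the flush() method and the Consumer callback are hypothetical names that do not appear above, and Page is reduced to a minimal stand-in so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal stand-in for the Page class from the question (assumption: title only).
class Page {
    private final String title;
    Page(String title) { this.title = title; }
    String getTitle() { return title; }
}

// Same single-page interface as defined in the answer.
interface PageProcessor {
    void process(Page page);
}

// Hypothetical implementation: buffers pages and hands them on in fixed-size batches,
// so at most batchSize pages are held in memory at any time.
class BufferingPageProcessor implements PageProcessor {
    private final int batchSize;
    private final Consumer<List<Page>> batchConsumer;
    private final List<Page> buffer;

    BufferingPageProcessor(int batchSize, Consumer<List<Page>> batchConsumer) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.batchConsumer = batchConsumer;
        this.buffer = new ArrayList<>(batchSize);
    }

    @Override
    public void process(Page page) {
        buffer.add(page);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is currently buffered; does nothing if the buffer is empty.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Pass a copy so the consumer can keep the batch while the buffer is reused.
        batchConsumer.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}

class BufferingDemo {
    public static void main(String[] args) {
        BufferingPageProcessor processor = new BufferingPageProcessor(
                2, batch -> System.out.println("Processing batch of " + batch.size()));
        // Simulates the calls PageHandler would make while parsing.
        for (int i = 1; i <= 5; i++) {
            processor.process(new Page("Page " + i));
        }
        // Push the final, partially filled batch.
        processor.flush();
    }
}
```

With a processor like this, flush() would need to be called once after parser.parse(...) returns; otherwise the last partial batch would be lost. The PageHandler itself stays as shown above, since it only ever sees the PageProcessor interface.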