How to parse large (50 GB) XML files in Java
Currently I'm trying to use a SAX parser, but about 3/4 of the way through the file it completely freezes. I have tried allocating more memory and so on, but I'm not getting any improvement.
Is there any way to speed this up? A better approach?
I have stripped it down to the bare bones, so I now have the following code. When run from the command line, it still isn't as fast as I would like.
Running it with "java -Xms4096m -Xmx8192m -jar reader.jar", I get a "GC overhead limit exceeded" error at around page 700000.
Main:
import java.util.ArrayList;

public class Read {
    public static void main(String[] args) {
        // Loads every parsed page into memory at once
        ArrayList<Page> pages = XMLManager.getPages();
    }
}
XMLManager
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {
    public static ArrayList<Page> getPages() {
        ArrayList<Page> pages = null;
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("..\\enwiki-20140811-pages-articles.xml");
            PageHandler pageHandler = new PageHandler();
            parser.parse(file, pageHandler);
            pages = pageHandler.getPages();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return pages;
    }
}
PageHandler
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {
    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler() {
        super();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        stringBuilder = new StringBuilder();
        if (qName.equals("page")) {
            page = new Page();
            idSet = false;
        } else if (qName.equals("redirect")) {
            if (page != null) {
                page.setRedirecting(true);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (page != null && !page.isRedirecting()) {
            if (qName.equals("title")) {
                page.setTitle(stringBuilder.toString());
            } else if (qName.equals("id")) {
                // Only the first <id> inside a page is the page id
                if (!idSet) {
                    page.setId(Integer.parseInt(stringBuilder.toString()));
                    idSet = true;
                }
            } else if (qName.equals("text")) {
                String articleText = stringBuilder.toString();
                articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); // remove references
                articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); // remove links underneath headings
                articleText = articleText.replaceAll("(?s)==See also==.+", " "); // remove everything after see also
                articleText = articleText.replaceAll("\\|", " "); // separate multiple links
                articleText = articleText.replaceAll("\\n", " "); // remove new lines
                articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); // remove all non-alphanumeric except dashes and spaces
                articleText = articleText.trim().replaceAll(" +", " "); // convert all multiple spaces to 1 space
                Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); // get first 75 words of text
                Matcher matcher = pattern.matcher(articleText);
                if (matcher.find()) {
                    page.setSummaryText(matcher.group());
                } else {
                    page.setSummaryText("None");
                }
                page.setText(articleText);
            } else if (qName.equals("page")) {
                pages.add(page);
                page = null;
            }
        } else {
            page = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        stringBuilder.append(ch, start, length);
    }

    public ArrayList<Page> getPages() {
        return pages;
    }
}
Solution
Your parsing code is probably working fine, but the volume of data you are loading is most likely just too large to hold in memory in that ArrayList.
You need some kind of pipeline that passes the data on to its actual destination without ever storing all of it in memory at once.
Here is what I have sometimes done in this sort of situation.
Create an interface for processing a single element:
public interface PageProcessor {
void process(Page page);
}
Provide an implementation of this interface to the PageHandler through its constructor:
public class Read {
    public static void main(String[] args) {
        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing,
                // but I don't know what that is...
                System.out.println(page);
            }
        });
    }
}
public class XMLManager {
    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            PageHandler pageHandler = new PageHandler(processor);
            parser.parse(file, pageHandler);
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Send the data to this processor instead of putting it in the list:
public class PageHandler extends DefaultHandler {
    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (page != null && !page.isRedirecting()) {
            if (qName.equals("title")) {
                // Unchanged from your implementation
            } else if (qName.equals("id")) {
                // Unchanged from your implementation
            } else if (qName.equals("text")) {
                // Unchanged from your implementation
            } else if (qName.equals("page")) {
                // Hand the finished page off instead of adding it to a list
                processor.process(page);
                page = null;
            }
        } else {
            page = null;
        }
    }
}
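To see the callback approach end to end, here is a minimal, self-contained sketch that parses a tiny in-memory document instead of the real dump. SimplePage, SimplePageProcessor, StreamingPageHandler and StreamingDemo are hypothetical names made up for this example (your Page class is not shown, so a trimmed stand-in is used), and redirect pages are simply skipped when the page closes, which is a slight simplification of your handler.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the Page class, which the post does not show.
class SimplePage {
    String title;
    int id;
    boolean redirecting;
}

// Same role as PageProcessor, redeclared so this sketch compiles on its own.
interface SimplePageProcessor {
    void process(SimplePage page);
}

class StreamingPageHandler extends DefaultHandler {
    private final SimplePageProcessor processor;
    private final StringBuilder text = new StringBuilder();
    private SimplePage page;
    private boolean idSet;

    StreamingPageHandler(SimplePageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // reset the shared buffer for the new element
        if (qName.equals("page")) {
            page = new SimplePage();
            idSet = false;
        } else if (qName.equals("redirect") && page != null) {
            page.redirecting = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (page == null) {
            return;
        }
        if (qName.equals("title")) {
            page.title = text.toString();
        } else if (qName.equals("id") && !idSet) {
            // Only the first <id> in a page is the page id; later ones belong to revisions.
            page.id = Integer.parseInt(text.toString().trim());
            idSet = true;
        } else if (qName.equals("page")) {
            if (!page.redirecting) {
                processor.process(page); // hand off; the handler keeps no reference
            }
            page = null;
        }
    }
}

class StreamingDemo {
    // Collects "id:title" strings only so the demo has something to show;
    // a real processor would write each page to its destination instead.
    static List<String> parseTitles(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        parser.parse(in, new StreamingPageHandler(p -> seen.add(p.id + ":" + p.title)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki>"
                + "<page><title>A</title><id>1</id><revision><id>99</id></revision></page>"
                + "<page><title>B</title><id>2</id><redirect title=\"A\"/></page>"
                + "</mediawiki>";
        System.out.println(parseTitles(xml));
    }
}
```

The handler holds at most one page at a time, so memory use should stay roughly flat no matter how many pages the file contains, as long as the processor itself does not accumulate them.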
Of course, you could make your interface handle chunks of multiple records rather than just one, and have the PageHandler collect pages locally in a smaller list, periodically send that list off for processing, and then clear it.
Or, perhaps better, you could implement the PageProcessor interface as defined here and build the buffering logic into that implementation, so that it collects the data and sends it on for further handling in chunks.
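A buffering processor of that kind might look like the following minimal sketch. BatchingProcessor is a hypothetical name, and it is written generically so that it compiles without the Page class; in the code above it would implement PageProcessor with T being Page. The batch size and the downstream consumer are assumptions you would replace with whatever actually stores the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical buffering processor: collects items and passes them on in chunks.
class BatchingProcessor<T> {
    private final int batchSize;
    private final Consumer<List<T>> downstream;
    private List<T> buffer = new ArrayList<>();

    BatchingProcessor(int batchSize, Consumer<List<T>> downstream) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.downstream = downstream;
    }

    // Called once per parsed item; flushes whenever the buffer is full.
    void process(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered downstream and starts a fresh buffer.
    // Must also be called once after parsing ends to deliver the last partial batch.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> batch = buffer;
        buffer = new ArrayList<>();
        downstream.accept(batch);
    }
}

class BatchingDemo {
    public static void main(String[] args) {
        BatchingProcessor<Integer> processor =
                new BatchingProcessor<>(2, batch -> System.out.println("batch: " + batch));
        for (int i = 0; i < 5; i++) {
            processor.process(i);
        }
        processor.flush(); // deliver the final, partially filled batch
    }
}
```

With this shape the PageHandler stays unchanged; you would call flush() after parser.parse(...) returns so the last few pages are not left in the buffer.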
