How to parse large (50 GB) XML files in Java

I'm currently trying to use a SAX parser, but about three quarters of the way through the file it freezes completely. I have tried allocating more memory and so on, but have not seen any improvement.

Is there any way to speed this up? Is there a better approach?

I stripped it down to the bare bones, so I now have the following code. When run from the command line, it still isn't as fast as I would like.

Running it with "java -Xms4096m -Xmx8192m -jar reader.jar", I get a "GC overhead limit exceeded" error at around article 700000.

Main:

import java.util.ArrayList;

public class Read {
    public static void main(String[] args) {
        ArrayList<Page> pages = XMLManager.getPages();
    }
}

XMLManager:

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {
    public static ArrayList<Page> getPages() {

        ArrayList<Page> pages = null;
        SAXParserFactory factory = SAXParserFactory.newInstance();

        try {

            SAXParser parser = factory.newSAXParser();
            File file = new File("..\\enwiki-20140811-pages-articles.xml");
            PageHandler pageHandler = new PageHandler();

            parser.parse(file, pageHandler);
            pages = pageHandler.getPages();

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return pages;
    }
}

PageHandler:

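import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {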

    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(){
        super();
    }

    @Override
    public void startElement(String uri,String localName,String qName,Attributes attributes) throws SAXException {

        stringBuilder = new StringBuilder();

         if (qName.equals("page")){

            page = new Page();
            idSet = false;

        } else if (qName.equals("redirect")){
             if (page != null){
                 page.setRedirecting(true);
             }
        }
    }

     @Override
     public void endElement(String uri, String localName, String qName) throws SAXException {

         if (page != null && !page.isRedirecting()){

             if (qName.equals("title")){

                 page.setTitle(stringBuilder.toString());

             } else if (qName.equals("id")){

                 if (!idSet){

                     page.setId(Integer.parseInt(stringBuilder.toString()));
                     idSet = true;

                 }

             } else if (qName.equals("text")){

                 String articleText = stringBuilder.toString();

                 articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>"," "); //remove references
                 articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}"," "); //remove links underneath headings
                 articleText = articleText.replaceAll("(?s)==See also==.+"," "); //remove everything after see also
                 articleText = articleText.replaceAll("\\|"," "); //Separate multiple links
                 articleText = articleText.replaceAll("\\n"," "); //remove new lines
                 articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]"," "); //remove all non alphanumeric except dashes and spaces
                 articleText = articleText.trim().replaceAll(" +"," "); //convert all multiple spaces to 1 space

                 Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
                 Matcher matcher = pattern.matcher(articleText);
                 matcher.find();

                 try {
                     page.setSummaryText(matcher.group());
                 } catch (IllegalStateException se){
                     page.setSummaryText("None");
                 }
                 page.setText(articleText);

             } else if (qName.equals("page")){

                 pages.add(page);
                 page = null;

            }
        } else {
            page = null;
        }
     }

     @Override
     public void characters(char[] ch,int start,int length) throws SAXException {
         stringBuilder.append(ch,start,length); 
     }

     public ArrayList<Page> getPages() {
         return pages;
     }
}

Solution

Your parsing code is probably working fine, but the volume of data you are loading is most likely just too large to hold in memory in that ArrayList.

You need some sort of pipeline that passes the data on to its actual destination without ever holding all of it in memory at once.

Here is what I have sometimes done in this sort of situation.

Create an interface for processing a single element:

public interface PageProcessor {
    void process(Page page);
}

Provide an implementation of this to the PageHandler through its constructor:

public class Read  {
    public static void main(String[] args) {

        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing,
                // but I don't know what that is...
                System.out.println(page);
            }
        });
    }

}


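import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {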

    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();

        try {

            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            PageHandler pageHandler = new PageHandler(processor);

            parser.parse(file,pageHandler);

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

Send the data to this processor instead of putting it in the list:

public class PageHandler extends DefaultHandler {

    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
         //Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
         //Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
            // Code that does not need to change is elided

            } else if (qName.equals("page")){

                processor.process(page);
                page = null;

            }
        } else {
            page = null;
        }
    }

}

Of course, you could make your interface handle chunks of multiple records rather than just one, and have the PageHandler collect pages locally in a smaller list, periodically sending the list off for processing and then clearing it.
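
As a rough illustration of that chunked variant, here is a minimal sketch. PageBatchProcessor, BatchingPageHandler, the batch size and the one-field Page class are all made-up names and assumptions for this example (the real Page class is not shown in the question), and the handler only reads the title element to keep things short. It hands over a fresh list for each batch rather than clearing a shared one, so the processor is free to keep a reference to the list it received.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Demo entry point: parses a tiny in-memory document in batches of 2.
class BatchDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>A</title></page>"
                   + "<page><title>B</title></page>"
                   + "<page><title>C</title></page></mediawiki>";
        BatchingPageHandler handler = new BatchingPageHandler(2,
                batch -> System.out.println("Got batch of " + batch.size()));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }
}

// Hypothetical stand-in for the question's Page class (title only).
class Page {
    String title;
}

// Hypothetical batch-oriented counterpart of PageProcessor.
interface PageBatchProcessor {
    void process(List<Page> batch);
}

class BatchingPageHandler extends DefaultHandler {
    private final int batchSize;
    private final PageBatchProcessor processor;
    private final StringBuilder text = new StringBuilder();
    private List<Page> batch = new ArrayList<>();
    private Page page;

    BatchingPageHandler(int batchSize, PageBatchProcessor processor) {
        this.batchSize = batchSize;
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // reuse one buffer instead of allocating per element
        if (qName.equals("page")) {
            page = new Page();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (page == null) {
            return;
        }
        if (qName.equals("title")) {
            page.title = text.toString();
        } else if (qName.equals("page")) {
            batch.add(page);
            page = null;
            if (batch.size() >= batchSize) {
                flush();
            }
        }
    }

    @Override
    public void endDocument() {
        flush(); // hand over the final, possibly partial, batch
    }

    private void flush() {
        if (batch.isEmpty()) {
            return;
        }
        processor.process(batch);
        batch = new ArrayList<>(); // start a new list; the old one belongs to the processor
    }
}
```

The endDocument() override matters here: without it, any pages left over after the last full batch would never reach the processor.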

Or (perhaps better) you could implement the PageProcessor interface as defined here and build the buffering logic into that implementation, so that it collects the data and sends it on for further processing in chunks.
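
A minimal sketch of what such a buffering implementation might look like follows. BufferingPageProcessor, its chunk size and the Consumer it forwards to are assumptions made up for this example, and Page is an empty stand-in because the real class is not shown in the question. The PageHandler stays exactly as above and never needs to know that buffering is happening.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Demo entry point: feeds five pages through a buffer of size 2.
class BufferDemo {
    public static void main(String[] args) {
        BufferingPageProcessor buffer = new BufferingPageProcessor(2,
                chunk -> System.out.println("Processing chunk of " + chunk.size()));
        for (int i = 0; i < 5; i++) {
            buffer.process(new Page()); // in real use, PageHandler makes these calls
        }
        buffer.flush(); // push out the final partial chunk once parsing is done
    }
}

// Hypothetical stand-in for the question's Page class.
class Page {
}

// Same single-method interface as in the answer.
interface PageProcessor {
    void process(Page page);
}

// Collects pages and forwards them downstream in chunks of a fixed size.
class BufferingPageProcessor implements PageProcessor {
    private final int chunkSize;
    private final Consumer<List<Page>> downstream;
    private final List<Page> buffer = new ArrayList<>();

    BufferingPageProcessor(int chunkSize, Consumer<List<Page>> downstream) {
        this.chunkSize = chunkSize;
        this.downstream = downstream;
    }

    @Override
    public void process(Page page) {
        buffer.add(page);
        if (buffer.size() >= chunkSize) {
            flush();
        }
    }

    // Must be called once after parsing finishes, or leftover pages are lost.
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        downstream.accept(new ArrayList<>(buffer)); // pass a copy so clearing is safe
        buffer.clear();
    }
}
```

With this design the caller has to remember the final flush(), for example right after XMLManager.load(...) returns. Only one chunk of pages is ever held in memory at a time, which is the whole point of the pipeline.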
