Java – use VTD-XML to optimize the speed of parsing XML files
I am using VTD-XML to parse a large number of XML files. I think I am using the tool correctly, but I am not sure, and parsing a file takes far too long.
The XML files (DATEX II format) are stored as zipped files on the hard disk. Unzipped, they are about 31 MB and contain more than 850,000 lines of text. I only need to extract a few fields and store them in the database.
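import java.io.File;

import com.ximpleware.AutoPilot;
import com.ximpleware.NavException;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;
import com.ximpleware.XPathEvalException;
import com.ximpleware.XPathParseException;
import org.apache.commons.lang3.math.NumberUtils;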
...
private static void test(File zipFile) throws XPathEvalException, NavException, XPathParseException {
    // init timer
    long step1 = System.currentTimeMillis();
    // parse the XML entry inside the zip file (namespace-aware)
    VTDGen vg = new VTDGen();
    vg.parseZIPFile(zipFile.getAbsolutePath(), zipFile.getName().replace(".zip", ".xml"), true);
    VTDNav vn = vg.getNav();
    // select all measurement sites via XPath
    AutoPilot apSites = new AutoPilot();
    apSites.declareXPathNameSpace("ns1", "http://schemas.xmlsoap.org/soap/envelope/");
    apSites.selectXPath("/ns1:Envelope/ns1:Body/d2LogicalModel/payloadPublication/siteMeasurements");
    apSites.bind(vn);
    long step2 = System.currentTimeMillis();
    System.out.println("Prep took " + (step2 - step1) + "ms; ");
    // init variables
    String siteID, timeStr;
    boolean reliable;
    int index, flow, ctr = 0;
    short speed;
    while (apSites.evalXPath() != -1) {
        vn.toElement(VTDNav.FIRST_CHILD, "measurementSiteReference");
        siteID = vn.toString(vn.getText());
        // loop all measured values of this measurement site
        while (vn.toElement(VTDNav.NEXT_SIBLING, "measuredValue")) {
            ctr++;
            // extract index attribute
            index = NumberUtils.toInt(vn.toString(vn.getAttrVal("index")));
            // go one level deeper into basicDataValue
            vn.toElement(VTDNav.FIRST_CHILD, "basicDataValue");
            // we need either FIRST_CHILD or NEXT_SIBLING depending on whether we find something
            int next = VTDNav.FIRST_CHILD;
            if (vn.toElement(next, "time")) {
                timeStr = vn.toString(vn.getText());
                next = VTDNav.NEXT_SIBLING;
            }
            if (vn.toElement(next, "averageVehicleSpeed")) {
                speed = NumberUtils.toShort(vn.toString(vn.getText()));
                next = VTDNav.NEXT_SIBLING;
            }
            if (vn.toElement(next, "vehicleFlow")) {
                flow = NumberUtils.toInt(vn.toString(vn.getText()));
                next = VTDNav.NEXT_SIBLING;
            }
            if (vn.toElement(next, "fault")) {
                reliable = vn.toString(vn.getText()).equals("0");
            }
            // insert into database here...
            // if we moved onto a child of basicDataValue, step back up to basicDataValue first
            if (next == VTDNav.NEXT_SIBLING) {
                vn.toElement(VTDNav.PARENT);
            }
            // step back up from basicDataValue to measuredValue
            vn.toElement(VTDNav.PARENT);
        }
    }
    System.out.println("Loop took " + (System.currentTimeMillis() - step2) + "ms; ");
    System.out.println("Total number of measured values: " + ctr);
}
The output of the function above for my XML file is:
Prep took 25756ms;
Loop took 26889ms;
Total number of measured values: 112611
Note that no data is actually inserted into the database yet. The problem is that I receive one such file every minute. The total parsing time is already close to 1 minute, downloading the file takes about another 10 seconds, and I still need to store the data in the database, so I am falling behind real time.
Is there any way to speed this up? Things I tried that did not help:
- Using AutoPilots for all fields, which actually slows down the second step by 30000 ms
- Unzipping the file myself and passing the byte array to VTD-XML, which makes no difference
- Looping over the file myself with BufferedReader.readLine(), which is not fast enough either
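For reference, the manual-unzip attempt mentioned above can be sketched with the standard library alone. Timing this step on its own shows whether disk access and decompression, rather than XML parsing, account for the "Prep" time. This is only a sketch: the class `ZipEntryReader` and its method `readEntry` are hypothetical names, not part of VTD-XML.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of VTD-XML): loads one zip entry fully into memory.
class ZipEntryReader {

    // Returns the uncompressed bytes of the named entry, or throws if it is missing.
    static byte[] readEntry(File zipFile, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipFile)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No entry named " + entryName + " in " + zipFile);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return in.readAllBytes(); // requires Java 9 or later
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File zipFile = new File(args[0]);
        String entryName = zipFile.getName().replace(".zip", ".xml");

        long start = System.nanoTime();
        byte[] xml = readEntry(zipFile, entryName);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Unzipped " + xml.length + " bytes in " + elapsedMs + " ms");
        // The byte array could then be handed to the parser (for VTD-XML,
        // VTDGen.setDoc(xml) followed by VTDGen.parse(true)) and timed separately.
    }
}
```

If the unzip step alone takes most of the time, the bottleneck is I/O or decompression and a different parser will not help much.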
Does anyone see a way to speed this up, or do I need to start thinking about a heavier machine or multithreading? Of course, 850,000 lines per minute (about 1.2 billion lines per day) is a lot, but I still think it should not take a minute to parse 31 MB of data.
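On the multithreading question: because each file is independent, several files can be processed concurrently on a pool of worker threads. This can raise throughput when files queue up, but it does not make parsing a single file any faster. A minimal sketch follows; `ParallelFileProcessor` and `processAll` are hypothetical names, and the `task` parameter stands in for whatever per-file parsing routine is used.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

// Hypothetical sketch: runs a per-file task on a fixed pool of worker threads.
class ParallelFileProcessor {

    // Applies `task` to every file and returns the sum of the results
    // (for example, the number of measured values found in each file).
    static int processAll(List<File> files, ToIntFunction<File> task, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (File f : files) {
                Callable<Integer> job = () -> task.applyAsInt(f);
                results.add(pool.submit(job));
            }
            int total = 0;
            for (Future<Integer> r : results) {
                total += r.get(); // blocks until that file has been processed
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each worker should use its own parser instance; sharing one parser or navigator object across threads is not assumed to be safe here.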
Solution
You can try unzipping the archive up front and collecting the XML files it contains in an array:
File[] files = new File("foldername").listFiles();
Then you can loop over each file. I'm not sure whether this will speed things up, but it is worth trying.
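The suggestion above could look roughly like this using only the standard library. It is a sketch under assumptions: `UnzipAndList`, `unzip` and `xmlFiles` are hypothetical names, and the target folder is whatever directory you choose. Note that `listFiles()` returns null when the path is not a directory, so the sketch guards against that.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical sketch: extract an archive once, then list the XML files in the folder.
class UnzipAndList {

    // Extracts every entry of zipFile into targetDir.
    static void unzip(File zipFile, File targetDir) throws IOException {
        Path root = targetDir.toPath().toAbsolutePath().normalize();
        try (ZipFile zip = new ZipFile(zipFile)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                Path out = root.resolve(entry.getName()).normalize();
                // Reject entries that would be written outside the target folder.
                if (!out.startsWith(root)) {
                    throw new IOException("Entry outside target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                    continue;
                }
                Files.createDirectories(out.getParent());
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    // Returns the .xml files directly inside dir; never returns null.
    static File[] xmlFiles(File dir) {
        File[] files = dir.listFiles((d, name) -> name.endsWith(".xml"));
        return files != null ? files : new File[0];
    }
}
```

A loop such as `for (File f : UnzipAndList.xmlFiles(dir)) { ... }` can then pass each extracted file to the parser. Whether this is faster than parsing straight from the zip depends on where the time is actually spent.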
