Handling large datasets in Java/Clojure: littleBig data

I've been working on a graphing/data-processing application in Clojure (you can see a screenshot here), though usually I think I'm using more Java than Clojure. I've started testing my application with larger datasets: I have no problem with around 100k points, but when I start going higher than that, I run into heap-space problems

Now, in theory, about half a gig should be enough to hold around 70 million doubles. Granted, I'm doing many things that require some overhead, and I may in fact be holding 2-3 copies of the data in memory at the same time, but I haven't optimized much yet, and 500k or so is still orders of magnitude smaller than what I should be able to load

I understand that Java imposes artificial limits on the size of the heap, and that these limits can be changed through options specified when the JVM starts. This leads to my first questions:

> If I start the JVM via Swank Clojure (through Leiningen), can I change the maximum allowed heap space?
>
> If I package this application (as I intend to) as an uberjar, can I ensure my JVM has some kind of minimum heap space?

But I'm not content to rely on the JVM heap to power my application. I don't know the size of the data I may eventually work with, but it could reach millions of points, and perhaps the heap couldn't accommodate that. Therefore, I'm interested in finding alternatives to simply cramming the data in. Here are some ideas, and questions about them:

> Can I read only part of a large (text) file at a time, so I can import and process the data in "chunks", e.g. n lines at a time? If so, how?
>
> Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time? I guess I'm asking here for any tips/hacks that have worked for you in the past, if you've done something similar.
>
> Can I "sample" from the file, e.g. read only every z-th line, effectively downsampling my data?

Right now I plan, if there are answers to the above (I'll keep searching!), or the insight offered leads to an equivalent solution, to read in a chunk of data at a time, graph it to the timeline (see the screenshot; the timeline is green), and allow the user to interact with just that bit until she clicks for the next chunk (or something). Then I'd save the changes to the file, load the next "chunk" of data, and display it

Alternatively, I would display the whole timeline of all the data (downsampled, so I could load it), but only allow access to one "chunk" of it at a time in the main window (the part being viewed above the green timeline, as outlined by the viewport rectangle in the timeline)

Most of all, though, is there a better way? Note that I cannot downsample the main window's data, because I need to be able to process it and let the user interact with it (e.g. click a point, or near one, to add a "marker" to that point: the marker is drawn as a vertical rule at that point)

I'd appreciate any insights, answers, suggestions, or corrections! I'm also willing to expand on my question in any way you'd like

Hopefully at least some of this will be open-sourced; I'd like an easy-to-use yet fast way to make xy-plots of lots of data in the Clojure world

EDIT: Downsampling is possible only when graphing, and not always even then, depending on the parts being graphed. I need access to all the data to perform analysis on it. (Just clearing that up!) Although I should definitely look into downsampling, I don't think it will solve my memory problems in the least, since all I do to graph is draw onto a BufferedImage

Solution

You can change the Java heap size by providing the -Xms (minimum heap) and -Xmx (maximum heap) options at startup; see the docs

So something like java -Xms256m -Xmx1024m ... would give a 256MB initial heap with the option to grow to 1GB

I don't use Leiningen/Swank, but I expect it's possible to change this; if nothing else, there should be a startup script for Java somewhere where you can change the arguments
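As a sketch of what that could look like: Leiningen projects can pass JVM arguments via the :jvm-opts key in project.clj (support and exact behavior may depend on your Leiningen version, and the project name here is made up):

```clojure
;; project.clj – JVM options supplied when Leiningen starts the project's JVM
(defproject littlebig "0.1.0"
  :jvm-opts ["-Xms256m" "-Xmx1024m"])
```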

Memory is not controlled from within a jar file, but from the startup script, normally a .sh or .bat file that calls java and supplies the arguments

java.io.RandomAccessFile gives random file access by byte index, on top of which you can build sampling of the contents
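A minimal sketch of index-based access (the file contents and fixed-width record layout are invented for illustration): with fixed-width records, seek() lets you jump straight to the n-th record without reading anything before it.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SampleFile {
    public static void main(String[] args) throws IOException {
        // Create a small fixed-width file: 10 records of 8 bytes each.
        File f = File.createTempFile("samples", ".dat");
        f.deleteOnExit();
        try (FileWriter w = new FileWriter(f)) {
            for (int i = 0; i < 10; i++) {
                w.write(String.format("%07d\n", i)); // 7 digits + newline = 8 bytes
            }
        }
        // Random access by byte index: jump straight to record 7
        // without reading records 0..6.
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            int recordLength = 8;
            raf.seek(7L * recordLength);            // byte offset of record 7
            byte[] buf = new byte[recordLength];
            raf.readFully(buf);
            System.out.println(new String(buf).trim()); // prints 0000007
        }
    }
}
```

The same seek-and-read pattern supports sampling: to read every z-th record, seek to multiples of z * recordLength.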

line-seq returns a lazy sequence of each line in a file, so you can process as much of it at a time as you wish

Alternatively, use the Java mechanisms in java.io – BufferedReader.readLine() or FileInputStream.read(byte[] buffer)

There is a BufferedReader available in Java/Clojure, or you can maintain your own byte buffer and read larger chunks at a time
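A sketch of chunked line reading with BufferedReader, answering the "n lines at a time" question above (the readChunk helper is mine, and the in-memory input stands in for a file; a real application would wrap a FileReader instead of a StringReader):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ChunkedReader {
    // Read up to n lines from the reader; returns fewer (possibly zero)
    // lines at end of input.
    static List<String> readChunk(BufferedReader r, int n) throws IOException {
        List<String> chunk = new ArrayList<>();
        String line;
        while (chunk.size() < n && (line = r.readLine()) != null) {
            chunk.add(line);
        }
        return chunk;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large data file: 10 lines of text.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) sb.append("point-").append(i).append('\n');

        try (BufferedReader r = new BufferedReader(new StringReader(sb.toString()))) {
            List<String> chunk;
            int chunkNo = 0;
            while (!(chunk = readChunk(r, 4)).isEmpty()) {
                // Process one chunk at a time; only these lines are in memory.
                System.out.println("chunk " + chunkNo++ + ": " + chunk.size() + " lines");
            }
        }
    }
}
```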

To make the most of the memory available, keep your data as primitive as possible

For some actual numbers, let's assume that you want to draw the contents of a music CD:

> A CD has two channels, each sampled 44,100 times per second
>
> 60 minutes of music is then about 300 million data points
>
> Stored as 16 bits (2 bytes, a short) per data point: 600MB
>
> Stored as a primitive int array (4 bytes per data point): 1.2GB
>
> Stored as an Integer array (32 bytes per data point): 10GB

Using the numbers for object size from this blog (16 bytes of overhead per object, 4 bytes for the primitive int, objects aligned to 8-byte boundaries, 8-byte pointers in the array = 32 bytes per Integer data point)
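The arithmetic above can be checked directly (sizes rounded down to decimal megabytes):

```java
public class CdMemory {
    public static void main(String[] args) {
        long samplesPerSecond = 44_100;
        long channels = 2;
        long seconds = 60 * 60;                        // 60 minutes
        long points = channels * samplesPerSecond * seconds;
        System.out.println("data points: " + points);  // ~300 million

        // Memory needed for the three representations discussed above.
        System.out.println("as short[]    (2 B/point): " + points * 2 / 1_000_000 + " MB");
        System.out.println("as int[]      (4 B/point): " + points * 4 / 1_000_000 + " MB");
        System.out.println("as Integer[] (32 B/point): " + points * 32 / 1_000_000 + " MB");
    }
}
```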

Even 600MB of data is a stretch to hold in memory at one time on a "normal" computer, since you will probably be using lots of memory elsewhere too. But the switch from primitives to boxed numbers will by itself reduce the number of data points you can hold in memory by an order of magnitude

If you were to graph the data from a 60-minute CD on a 1900-pixel-wide overview timeline, you would have one pixel to display two seconds of music (about 180,000 data points). This clearly cannot show any level of detail; you would need some form of subsampling or summary data there
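One common form of summary data for a waveform overview – offered as an illustration, not necessarily what you'd use – is keeping only the minimum and maximum sample per pixel column, which is enough to draw the envelope:

```java
public class MinMaxDownsample {
    // Collapse `samples` into `pixels` columns, keeping the min and max
    // of each column – enough to draw a waveform envelope.
    static short[][] downsample(short[] samples, int pixels) {
        short[][] out = new short[pixels][2]; // [pixel][0] = min, [pixel][1] = max
        int perPixel = samples.length / pixels;
        for (int p = 0; p < pixels; p++) {
            short min = Short.MAX_VALUE, max = Short.MIN_VALUE;
            for (int i = p * perPixel; i < (p + 1) * perPixel; i++) {
                if (samples[i] < min) min = samples[i];
                if (samples[i] > max) max = samples[i];
            }
            out[p][0] = min;
            out[p][1] = max;
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy signal: 1000 samples of a ramp, summarized into 4 pixel columns.
        short[] samples = new short[1000];
        for (int i = 0; i < samples.length; i++) samples[i] = (short) i;
        for (short[] col : downsample(samples, 4)) {
            System.out.println(col[0] + ".." + col[1]);
        }
    }
}
```

Because each column needs only two shorts, the whole overview fits in memory even when the full dataset does not.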

So the solution you describe – processing the full dataset one chunk at a time for a summary display in the overview timeline, and keeping only a small subset in memory for the main detail window – sounds perfectly reasonable

Update:

On fast file reads: this article times 13 different ways to read a 100MB file in Java – the results vary from 0.5 seconds to 10 minutes(!). In general, reading is fast with a decent buffer size (4K to 8K bytes) and (very) slow when reading one byte at a time

The article also has a comparison to C, in case anyone is interested. (Spoiler: the fastest Java reads come within a factor of two of a memory-mapped file in C.)
