Java – how do you diff XML at the element level rather than at the attribute level?

I need to compare two XML documents. I've looked at many of the XML diffing tools commonly mentioned on Stack Overflow, but of course my needs are very special, so none of them is suitable. In short, I need to compare the documents element by element rather than by the content inside each element (taking element order into account), and I need a very specific output format instead of a traditional diff patch.

Please forgive the volume of text, but I find this difficult to explain briefly.

First, my limitations

The solution must be Java-based, or something that can be integrated with a command-line Java application. It must also be free, because I'm not allowed to spend "real money" on it, only my working hours (but of course not too many; my deadline is imminent)... Sounds familiar? Finally, my goal is not a traditional diff patch, but something closer to a merge of the two source files.

Second, the description of my data

Each document contains nodes of type text or section. A text is a simple string, but a section can contain texts and further sections (sections also have a name, given as an attribute). In addition, every node is marked with revision information.

Here is a sample document. Note that it looks like a list only for brevity; in reality it is more like prose, that is, the order of the elements is very important.

<document diff="=" revision="1">
  <text diff="=" revision="1">Apples</text>
  <text diff="=" revision="1">Chxrries</text>
  <section diff="=" revision="1" name="Blue ones">
    <text diff="=" revision="1">Grapes</text>
    <section diff="=" revision="1" name="More">
      <text diff="=" revision="1">Blueberries</text>
    </section>
    <text diff="=" revision="1">Oranges</text>
  </section>
</document>

This needs to be compared with the new version, which contains changes but no revision information (not yet!). In this example I fixed the spelling error in the second element and moved another element, but there may be more extensive changes, such as adding or removing whole sections.

<document>
  <text>Apples</text>
  <text>Oranges</text>
  <text>Cherries</text>
  <section name="Blue ones">
    <text>Grapes</text>
    <section name="More">
      <text>Blueberries</text>
    </section>
  </section>
</document>
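
Since the old version carries diff and revision attributes and the new one does not, an element-level comparison has to ignore them. The following is a minimal sketch of one way to do that with the JDK's built-in DOM parser. The class name ElementKeys, the method childKeys and the key format are made up for illustration, and matching sections by their name only is an assumption based on the sample data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Illustrative helper: reduces child elements to keys that ignore diff/revision. */
class ElementKeys {

    /** Returns one key per direct child element of the root, in document order. */
    static List<String> childKeys(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> keys = new ArrayList<>();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace-only text nodes between elements
            }
            Element e = (Element) n;
            if (e.getTagName().equals("section")) {
                // Assumption: a section is identified by its name attribute.
                keys.add("section:" + e.getAttribute("name"));
            } else {
                // A text element is identified by its string content.
                keys.add("text:" + e.getTextContent());
            }
        }
        return keys;
    }
}
```

The two key lists can then be compared as ordinary ordered lists, and sections that match by name can be compared recursively in the same way.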

The goal is to create a third XML document that contains all of this information. Note that the diff attribute of each affected element has changed ("*" indicates a change within the element) and its revision number has been bumped; unchanged elements keep their old revision information.

<document diff="*" revision="2">
  <text diff="=" revision="1">Apples</text>
  <text diff="+" revision="2">Oranges</text>
  <text diff="-" revision="2">Chxrries</text>
  <text diff="+" revision="2">Cherries</text>
  <section diff="*" revision="1" name="Blue ones">
    <text diff="=" revision="1">Grapes</text>
    <section diff="=" revision="1" name="More">
      <text diff="=" revision="1">Blueberries</text>
    </section>
    <text diff="-" revision="2">Oranges</text>
  </section>
</document>

Therefore, the result is not a diff patch but a complete document with updated revision information.

Third, my work – and my problems

I have most of this working, using custom Java code that compares line by line. It fails in one specific case, though: when the old version contains the same text more than once and an occurrence other than the last one is changed in the new version. This "tricks" the comparator into matching the old text with a later occurrence in the new version instead of recognizing a single text change. Although the result is technically correct, the unnecessary additions and deletions are "noise" that masks what looks like a simple change to a human (and, by the way, this markup is meant for human readability). Because of my line-by-line approach, I find this hard to fix.

Here is an example of a case that fools my code. First, a simple fruit basket:

<document diff="=" revision="1">
  <text diff="=" revision="1">Apples</text>
  <text diff="=" revision="1">Oranges</text>
  <text diff="=" revision="1">Apples</text>
  <text diff="=" revision="1">Cherries</text>
  <text diff="=" revision="1">Apples</text>
  <text diff="=" revision="1">Grapes</text>
</document>

Now, let's change the second "Apples" item:

<document>
  <text>Apples</text>
  <text>Oranges</text>
  <text>Bananas</text>   <--- I've only changed this
  <text>Cherries</text>
  <text>Apples</text>
  <text>Grapes</text>
</document>

The result incorrectly becomes:

<document diff="*" revision="2">
  <text diff="=" revision="1">Apples</text>
  <text diff="=" revision="1">Oranges</text>
  <text diff="+" revision="2">Bananas</text>   <--- Addition, okay
  <text diff="+" revision="2">Cherries</text>   <--- Incorrectly added
  <text diff="=" revision="1">Apples</text>   <--- Incorrectly matches the next occurrence
  <text diff="-" revision="2">Cherries</text>   <--- Incorrectly removed
  <text diff="-" revision="2">Apples</text>   <--- Incorrectly removed
  <text diff="=" revision="1">Grapes</text>   <--- Back on track, after the next occurrence of the changed element
</document>

Yes, I could probably mitigate this by implementing some form of look-ahead, but I can't see how to do that cleanly, so it sounds like a very messy workaround rather than a real solution.

... so finally, I urgently need an XML diff tool that lets me analyze the data content and create this very special output. Either that, or any hint on how I can avoid this particular trap.

If you have any suggestions, or questions that need a more detailed explanation, I'd very much like to hear from you.

This is a restatement of a previous question. Unfortunately, I can't offer a bounty to promote it, but I hope my new explanation here is better.
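
For comparison, a longest-common-subsequence (LCS) diff is the standard technique behind line-based diff tools and does not fall into this trap on flat lists. The sketch below is not my routine and not a complete answer: it works on a flat list of strings, does not detect moves, and would have to be applied recursively to nested sections. The class name LcsDiff and the "= / + / -" prefixes are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative LCS-based diff over two ordered lists of strings. */
class LcsDiff {

    /** Returns "= x" for matches, "- x" for deletions and "+ x" for additions. */
    static List<String> diff(List<String> left, List<String> right) {
        int n = left.size();
        int m = right.size();
        // lcs[i][j] = length of the LCS of left[i..] and right[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = left.get(i).equals(right.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        // Walk the table from the start, emitting one operation per step.
        List<String> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (left.get(i).equals(right.get(j))) {
                result.add("= " + left.get(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                result.add("- " + left.get(i)); // dropping the old element loses nothing
                i++;
            } else {
                result.add("+ " + right.get(j)); // otherwise the new element was added
                j++;
            }
        }
        while (i < n) result.add("- " + left.get(i++));
        while (j < m) result.add("+ " + right.get(j++));
        return result;
    }

    public static void main(String[] args) {
        List<String> oldItems = Arrays.asList(
                "Apples", "Oranges", "Apples", "Cherries", "Apples", "Grapes");
        List<String> newItems = Arrays.asList(
                "Apples", "Oranges", "Bananas", "Cherries", "Apples", "Grapes");
        diff(oldItems, newItems).forEach(System.out::println);
    }
}
```

For the fruit basket (assuming both versions end with Grapes), this reports the second Apples as removed and Bananas as added, and leaves Cherries and the last Apples untouched. The table costs time and memory proportional to the product of the two list lengths.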

For what it's worth, this is my algorithm. It doesn't seem to be listed on the diff algorithm page that @larsh linked to:

Compare two lists, the left-hand one and the right-hand one; call them LL and LR. Create two "main" pointers, IL and IR, and set each to the first element of its list.

In a loop, use the main pointers to pick the current elements EL and ER, so that EL = LL(IL) and ER = LR(IR), and compare EL with ER.

If EL matches ER, copy EL into the result and advance both main pointers by one slot.

If EL and ER do not match, create an auxiliary pointer IR2, initialize it to the slot after IR (IR2 = IR + 1), and scan the rest of LR, setting ER2 = LR(IR2) as we go.

If EL matches nothing in the rest of LR, EL must have been deleted. Add EL to the result as a deletion and advance only the main pointer IL, so that the next comparison checks the next EL against the same ER.

If an ER2 matching EL is found (at position IR2 > IR), all elements from IR up to, but not including, IR2 must be additions. Add each of them to the result as an addition, set IR = IR2, and add EL to the result as a match (because it matched ER2); both main pointers then move past the match.

Finally, repeat the comparison at the new main pointer positions. Do all of this while iterating over the shorter of the two lists; afterwards, add whatever remains of LL as deletions, or whatever remains of LR as additions.
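
The algorithm above can be sketched in Java roughly as follows. This is a simplified reconstruction over flat lists of strings, not my actual code: the class name GreedyDiff and the "= / + / -" prefixes are made up for illustration, and the sample input assumes that both versions of the fruit basket end with Grapes, as the result listing implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch of the greedy scan-ahead comparison. */
class GreedyDiff {

    /** Returns "= x" for matches, "+ x" for additions and "- x" for deletions. */
    static List<String> diff(List<String> left, List<String> right) {
        List<String> result = new ArrayList<>();
        int il = 0; // main pointer into the old (left) list
        int ir = 0; // main pointer into the new (right) list

        while (il < left.size() && ir < right.size()) {
            String el = left.get(il);
            if (el.equals(right.get(ir))) {
                result.add("= " + el);
                il++;
                ir++;
                continue;
            }
            // Scan the rest of the right list for the next occurrence of el.
            int ir2 = ir + 1;
            while (ir2 < right.size() && !el.equals(right.get(ir2))) {
                ir2++;
            }
            if (ir2 == right.size()) {
                // el does not occur again on the right: treat it as deleted.
                result.add("- " + el);
                il++;
            } else {
                // Everything skipped on the right counts as added, then el matches.
                for (int k = ir; k < ir2; k++) {
                    result.add("+ " + right.get(k));
                }
                result.add("= " + el);
                il++;
                ir = ir2 + 1;
            }
        }
        // Leftovers: old elements become deletions, new elements become additions.
        for (; il < left.size(); il++) result.add("- " + left.get(il));
        for (; ir < right.size(); ir++) result.add("+ " + right.get(ir));
        return result;
    }

    public static void main(String[] args) {
        List<String> oldItems = Arrays.asList(
                "Apples", "Oranges", "Apples", "Cherries", "Apples", "Grapes");
        List<String> newItems = Arrays.asList(
                "Apples", "Oranges", "Bananas", "Cherries", "Apples", "Grapes");
        diff(oldItems, newItems).forEach(System.out::println);
    }
}
```

Run on the fruit basket, this reproduces the noisy result shown earlier: the changed Apples is matched against the later Apples in the new list, so Bananas and Cherries are reported as added and Cherries and Apples as removed.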

Solution

It turned out that there was no existing solution for my needs at the time! In the meantime, I developed my own XML diff routine, specific to my problem, so I did end up with a working solution.

Then, at the end of 2011, this was published: Slashdot: Researchers Expanding diff, grep Unix Tools

Dartmouth computer scientists introduced variants of the grep and diff Unix command-line utilities that can handle more complex types of data. These new programs, called context-free grep and hierarchical diff, will provide the ability to parse blocks of data rather than single lines. The research was partly funded by Google and the U.S. Department of Energy.
