Java regex – how to replace patterns
I have a pile of HTML files. In these files, I need to correct the src attribute of the img tags. The img tags typically look like this:
<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
The attributes are not in any particular order. I need to remove the dot and forward slash at the beginning of the src attribute of the img tags so that they look like this:
<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
So far, I have the following class:
import java.util.regex.*;
public class Replacer {
    // this PATTERN should find all img tags with 0 or more attributes before the src-attribute
    private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
    private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
    private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN,Pattern.CASE_INSENSITIVE);
    public static void findMatches(String html){
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        // Check all occurrences
        System.out.println("------------------------");
        System.out.println("Following Matches found:");
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
        System.out.println("------------------------");
    }
    public static String replaceMatches(String html){
        //Pattern replace = Pattern.compile("\\s+");
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        html = matcher.replaceAll(REPLACEMENT);
        return html;
    }
}
So my method findMatches(String html) seems to correctly find all img tags whose src attribute starts with ./
However, my method replaceMatches(String html) does not replace the matches correctly. I'm new to regex, but I think either the replacement regex is incorrect, or I'm using the replaceAll method incorrectly, or both. As you can see, the replacement string contains two parts that are the same in all img tags: <img and src="./. Between these two parts there should be the 0 or more HTML attributes from the original string. How do I formulate such a replacement string? Can anyone give me advice?
Solution
Try these:
PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"
Basically, you capture everything except the ./ in group #1, then plug it back in with the $1 placeholder, which effectively strips off the ./
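As a minimal sketch of that capture-and-reinsert idea, the class below applies the pattern and the $1 replacement to a sample tag. The class name SrcFixer and the method name stripDotSlash are made up for illustration and are not part of the original Replacer class.

```java
import java.util.regex.Pattern;

public class SrcFixer {
    // Group 1 captures everything from "<img" up to and including src=" ;
    // the "./" sits outside the group, so it is not part of $1.
    private static final Pattern IMG_SRC =
            Pattern.compile("(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);

    public static String stripDotSlash(String html) {
        // "$1" in the replacement re-inserts group 1, so only the "./" is dropped.
        return IMG_SRC.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<img alt=\"\" src=\"./Suitbert_files/233px-Suitbertus.jpg\" class=\"thumbimage\" />";
        // prints: <img alt="" src="Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" />
        System.out.println(stripDotSlash(html));
    }
}
```

The replacement argument of replaceAll is not a regex: only $n group references and backslash escapes are special in it, which is why the original REPLACEMENT string could not reproduce the attributes between <img and src.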
Also note how I changed your .* to [^>]*. If you happen to have two img tags on the same line, as follows:
<img src="good" /><img src="./bad" />
... your regular expression will match this:
<img src="good" /><img src="./
It would do so even if you used a non-greedy .*? instead. [^>]* makes sure that the match is always contained within a single tag.
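To see the difference, here is a small sketch that runs the greedy, non-greedy, and [^>]* variants against the two-tag line above. The class name TagBoundaryDemo and the helper firstMatch are hypothetical names used only for this demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagBoundaryDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<img src=\"good\" /><img src=\"./bad\" />";

        // Greedy .* runs across the tag boundary.
        // prints: <img src="good" /><img src="./
        System.out.println(firstMatch("<img.*\\ssrc=\"\\./", html));

        // Non-greedy .*? still starts at the first <img and expands until src="./ is found.
        // prints: <img src="good" /><img src="./
        System.out.println(firstMatch("<img.*?\\ssrc=\"\\./", html));

        // [^>]* cannot cross a '>', so the match stays inside the second tag.
        // prints: <img src="./
        System.out.println(firstMatch("<img[^>]*\\ssrc=\"\\./", html));
    }
}
```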
