Java regex – how to replace a pattern

I have a pile of HTML files. In these files, I need to correct the src attribute of the img tags. The img tags look like this:

<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />

The attributes are not in any particular order. I need to delete the dot and forward slash at the beginning of the src attribute of each img tag, so that the tags look like this:

<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />

So far, I have the following class:

import java.util.regex.*;


public class Replacer {

    // this PATTERN should find all img tags with 0 or more attributes before the src-attribute
    private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
    private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
    private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN,Pattern.CASE_INSENSITIVE);


    public static void findMatches(String html){
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        // Check all occurrences
        System.out.println("------------------------");
        System.out.println("Following Matches found:");
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
        System.out.println("------------------------");
    }

    public static String replaceMatches(String html){
        //Pattern replace = Pattern.compile("\\s+");
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        html = matcher.replaceAll(REPLACEMENT);
        return html;
    }
}

So my method findMatches(String html) seems to correctly find all img tags whose src attribute starts with ./

However, my method replaceMatches(String html) does not replace the matches correctly. I'm new to regex, but I assume that the REPLACEMENT regex is incorrect, or my use of the replaceAll method, or both. As you can see, the replacement string contains two parts that are identical in all img tags: <img and src="./. Between these two parts there should be the 0 or more HTML attributes from the original string. How do I formulate such a replacement string? Can anyone give me advice?

Solution

Try these:

PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"

Basically, you capture everything except the ./ in group #1, then plug it back in with the $1 placeholder, which effectively strips off the ./.
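Here is a minimal sketch of how that could look as a complete class. The class name SrcFixer and the method name stripDotSlash are made up for illustration; only the pattern and the "$1" replacement come from the answer above. Keep in mind that the second argument of replaceAll is not a regex: it is literal replacement text in which $1, $2, ... refer to captured groups, which is why a replacement like "<img\\.*\\ssrc=\"" cannot work.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this example.
class SrcFixer {
    // Group 1 captures everything from "<img" up to and including src=" ;
    // the "./" that follows is matched but stays outside the group.
    private static final Pattern LEADING_DOT_SLASH =
            Pattern.compile("(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);

    static String stripDotSlash(String html) {
        // "$1" puts group 1 back, so only the "./" is dropped.
        return LEADING_DOT_SLASH.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String tag = "<img alt=\"\" src=\"./Suitbert_files/233px-Suitbertus.jpg\" "
                + "class=\"thumbimage\" height=\"243\" width=\"233\" />";
        System.out.println(stripDotSlash(tag));
        // prints: <img alt="" src="Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
    }
}
```

Tags whose src does not start with ./ are left untouched, because the pattern simply does not match them.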

Note how I changed .* to [^>]*. If you happened to have two img tags on the same line, like this:

<img src="good" /><img src="./bad" />

... a regex using .* would match this:

<img src="good" /><img src="./

It would do so even if you used the non-greedy .*? instead. Using [^>]* makes sure that the match always stays within a single tag.
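To see the difference for yourself, here is a small sketch that runs the three variants against the two-tag line. The class QuantifierDemo and the helper firstMatch are hypothetical names used only for this demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are assumptions for this example.
class QuantifierDemo {
    // Returns the first match of regex in input, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "<img src=\"good\" /><img src=\"./bad\" />";

        // Greedy .* runs across the first tag's closing bracket.
        System.out.println(firstMatch("<img.*\\ssrc=\"\\./", line));
        // prints: <img src="good" /><img src="./

        // Reluctant .*? still starts at the first <img, so the result is the same.
        System.out.println(firstMatch("<img.*?\\ssrc=\"\\./", line));
        // prints: <img src="good" /><img src="./

        // [^>]* cannot cross a '>', so the match stays inside the second tag.
        System.out.println(firstMatch("<img[^>]*\\ssrc=\"\\./", line));
        // prints: <img src="./
    }
}
```

The reluctant quantifier does not help here because the regex engine always takes the leftmost position at which a match is possible; reluctance only affects how far the match extends from that starting point.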
