Java regex – how to replace a pattern
I have a pile of HTML files. In these files, I need to correct the src attribute of the IMG tags. The IMG tags typically look like this:
<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
The attributes are not in any particular order. I need to remove the dot and the forward slash at the beginning of the src attribute of the IMG tags to make them look like this:
<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
So far, I have the following class:
import java.util.regex.*;

public class Replacer {

    // this PATTERN should find all img tags with 0 or more attributes before the src-attribute
    private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
    private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
    private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN, Pattern.CASE_INSENSITIVE);

    public static void findMatches(String html) {
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        // Check all occurrences
        System.out.println("------------------------");
        System.out.println("Following Matches found:");
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
        System.out.println("------------------------");
    }

    public static String replaceMatches(String html) {
        //Pattern replace = Pattern.compile("\\s+");
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        html = matcher.replaceAll(REPLACEMENT);
        return html;
    }
}
So my method findMatches(String html) seems to correctly find all IMG tags whose src attribute starts with ./
Now my method replaceMatches(String html) does not replace the matches correctly. I'm new to regex, but I assume that the REPLACEMENT regex is incorrect, or my use of the replaceAll method, or both. As you can see, the replacement string contains two parts that are identical in all IMG tags: <img and src="./. Between these two parts there should be the 0 or more HTML attributes from the original string. How do I formulate such a replacement string? Can anyone give me advice?
Solution
Try these:
PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"
Basically, you capture everything except the ./ in group #1, then plug it back in with the $1 placeholder, effectively stripping off the ./
Notice how I changed your .* to [^>]* as well. If you happened to have two IMG tags on the same line, like this:
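To make the capture-and-reinsert idea concrete, here is a minimal sketch that applies the pattern and replacement above to a single tag. The class name SrcPrefixStripper and the method stripDotSlash are hypothetical names chosen for illustration; they are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SrcPrefixStripper {

    // Group 1 captures everything from "<img" up to and including src=" ;
    // the trailing \./ is part of the match but sits outside the group.
    private static final Pattern IMG_SRC_DOT_SLASH =
            Pattern.compile("(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);

    static String stripDotSlash(String html) {
        Matcher matcher = IMG_SRC_DOT_SLASH.matcher(html);
        // "$1" re-inserts group 1, so only the "./" is dropped from each match.
        return matcher.replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<img alt=\"\" src=\"./Suitbert_files/233px-Suitbertus.jpg\" class=\"thumbimage\" />";
        // prints: <img alt="" src="Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" />
        System.out.println(stripDotSlash(html));
    }
}
```

The replacement argument of replaceAll is not a regex: only group references such as $1 and backslash escapes are interpreted in it, which is why a pattern-like replacement string cannot reproduce the attributes that were matched.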
<img src="good" /><img src="./bad" />
... your regular expression will match this:
<img src="good" /><img src="./
It would do that even if you used a non-greedy .*? instead. [^>]* makes sure the match is always contained within a single tag.
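The difference between the three quantifiers can be checked with a small sketch that prints the first match each variant produces on the two-tag line above. The class name TagBoundaryDemo and the helper firstMatch are hypothetical names used only for this illustration, and the patterns omit the capture group because only the matched text is of interest here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class TagBoundaryDemo {

    // Returns the text of the first match, or null if the regex does not match.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String html = "<img src=\"good\" /><img src=\"./bad\" />";

        // Greedy .* starts at the first tag and runs across the tag boundary.
        // prints: <img src="good" /><img src="./
        System.out.println(firstMatch("<img.*\\ssrc=\"\\./", html));

        // Non-greedy .*? still starts at the first <img and expands until it reaches "./".
        // prints: <img src="good" /><img src="./
        System.out.println(firstMatch("<img.*?\\ssrc=\"\\./", html));

        // [^>]* cannot cross the closing > of the first tag, so only the second tag matches.
        // prints: <img src="./
        System.out.println(firstMatch("<img[^>]*\\ssrc=\"\\./", html));
    }
}
```

The non-greedy version fails for the same reason as the greedy one: the regex engine reports the leftmost match, and a match starting at the first <img is possible as long as the quantifier is allowed to consume a > character.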