Java: I have a large string of HTML and need to extract the href="..." text
I have a string that contains a large chunk of HTML, and I am trying to extract the link from the href="..." part of it. The href can appear in either of the following forms:
<a href="..." />
<a class="..." href="..." />
I don't usually have trouble with regular expressions, but for some reason the following code does not do what I expect:
String innerHTML = getHTML();
Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
if (m.find()) {
    // Get all groups for this match
    for (int i = 0; i <= m.groupCount(); i++) {
        String groupStr = m.group(i);
        System.out.println(groupStr);
    }
}
Can someone tell me what's wrong with my code? I have done this kind of thing in PHP, but in Java I am doing something wrong. Whenever I print the match, it prints the whole HTML string.
Edit: so everyone knows what kind of string I'm dealing with:
<a class="Wrap" href="item.PHP?id=43241"><input type="button">
<span class="chevron"></span>
</a>
<div class="menu"></div>
Every time I run the code, it prints the entire string. That's the problem.
As for JTidy: I'm using it, but I would still like to know what is going wrong in this case.
Solution
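.* is greedy: it matches as many characters as possible, including quotation marks (and, with Pattern.DOTALL, line breaks as well). The group therefore runs up to the last double quote in the whole string instead of stopping at the quote that closes the href value.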
Try something like this instead:
"href=\"([^\"]*)\""

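To see the corrected pattern in context, here is a minimal sketch that collects every href value from a string. The class name HrefExtractor and the method extractHrefs are made up for illustration, and the sample input is the snippet from the question; it assumes the attribute is always written with double quotes and no spaces around the equals sign.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class HrefExtractor {
    // [^"]* matches any run of characters except a double quote,
    // so the capture stops at the quote that closes the href value.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        // find() advances through the input, so this picks up every href.
        while (m.find()) {
            links.add(m.group(1)); // group(1) is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a class=\"Wrap\" href=\"item.PHP?id=43241\"><input type=\"button\">\n"
                + "<span class=\"chevron\"></span>\n"
                + "</a>\n"
                + "<div class=\"menu\"></div>";
        System.out.println(extractHrefs(html)); // prints [item.PHP?id=43241]
    }
}
```

A reluctant quantifier, "href=\"(.*?)\"", should also stop at the first closing quote; the negated character class is used here because it states the intent directly and never crosses a quote.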