Java: I have a large string of HTML and need to extract the href="..." text
I have a string that contains a chunk of HTML, and I am trying to extract the link from the href="..." part of it. The href can appear in either of the following forms:
<a href="..." />
<a class="..." href="..." />
I don't normally have trouble with regular expressions, but for some reason the following code does not behave as I expect:
String innerHTML = getHTML();
Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
if (m.find()) {
    // Get all groups for this match
    for (int i = 0; i <= m.groupCount(); i++) {
        String groupStr = m.group(i);
        System.out.println(groupStr);
    }
}
Can someone tell me what's wrong with my code? I have done this kind of thing in PHP, but in Java I am doing something wrong: whenever I try to print the result, it prints the whole HTML string.
Edit: so everyone knows what kind of string I'm dealing with:
<a class="Wrap" href="item.PHP?id=43241"><input type="button"> <span class="chevron"></span> </a> <div class="menu"></div>
Every time I run the code, it prints the entire string. That's the problem.
As for using JTidy: I am using it, but I would still like to know what is going wrong in this case.
Solution
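.* is greedy: it matches as many characters as it can, and "any character" includes quotation marks (and, with Pattern.DOTALL, line breaks too). The capture therefore does not stop at the quote that closes the href value; it runs on to the last quotation mark in the input.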
Try something like this instead:
"href=\"([^\"]*)\""
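To make the difference concrete, here is a minimal sketch that runs both patterns against the sample string from the question. The class name HrefExtractor and the helper extract are made up for this illustration, and the sample HTML is hard-coded in place of getHTML(). It assumes the href value is always wrapped in double quotes with no spaces around the equals sign.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Pattern from the question: (.*) is greedy and runs to the last quote in the input.
    static final Pattern GREEDY = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);

    // Suggested fix: [^\"]* cannot cross a quote, so it stops at the closing quote.
    static final Pattern FIXED = Pattern.compile("href=\"([^\"]*)\"");

    // Hypothetical helper: collects group 1 of every match of p in html.
    static List<String> extract(Pattern p, String html) {
        List<String> values = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Sample string from the question, hard-coded instead of calling getHTML().
        String html = "<a class=\"Wrap\" href=\"item.PHP?id=43241\"><input type=\"button\"> "
                + "<span class=\"chevron\"></span> </a> <div class=\"menu\"></div>";

        // Captures everything from the href value up to the last quote in the string.
        System.out.println(extract(GREEDY, html));

        // Captures only the href value: [item.PHP?id=43241]
        System.out.println(extract(FIXED, html));
    }
}
```

Looping with while (m.find()) rather than a single if also picks up every link when the HTML contains more than one. A reluctant quantifier, "href=\"(.*?)\"", would stop at the first closing quote as well; the negated character class just states the intent more directly. For anything beyond a quick extraction, an HTML parser is still the more robust choice.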