Java – extract date from web page
I want to extract dates in different formats from a web page. I am using the Selenium 2 Java API to interact with the browser, and I also use jQuery to interact with the document, so a solution in either layer is welcome.
Dates can have very different formats in different locales. In addition, the month can be written as a name or as a number. I need to match as many dates as possible, and I am aware that there are many combinations.
For example, if I have such an HTML element:
<div class="tag_view"> Last update: May,22,2011 View :40 </div>
I want to extract and identify the relevant parts of the date:
May,2011
These parts should then be converted to a regular Java Date object.
Update
This should work on HTML from any web page, and the dates can appear in any element, in any format. For example, on Stack Overflow the source looks like this:
<span class="relativetime" title="2011-05-13 14:45:06Z">May 13 at 14:45</span>
I want this done in the most efficient way. I imagine it would be a jQuery selector or filter that returns a normalized date representation, but I am open to your suggestions.
Solution
Since we can't limit ourselves to any specific element type or to the children of any particular element, you are essentially talking about searching the text of the entire page for dates. The only way to do that with any efficiency is with regular expressions. Since you are looking for dates in any format, you need one regular expression for each format you accept. Once you have defined what those are, compile the regular expressions and run them as follows:
var datePatterns = [];
datePatterns.push(/\d\d\/\d\d\/\d\d\d\d/g);
datePatterns.push(/\d\d\d\d\/\d\d\/\d\d/g);
...
// Change this to be more specific if at all possible.
var stringToSearch = $('body').html();
var allMatches = [];
for (var i = 0; i < datePatterns.length; i++) {
    // match() returns null when nothing matches, so fall back to an empty array.
    var matches = stringToSearch.match(datePatterns[i]) || [];
    allMatches = allMatches.concat(matches);
}
You can find more date regular expressions with a Google search, or write them yourself; they are fairly simple. One thing to note: you could combine some of the regular expressions above into a single, more efficient pattern. I would be very careful with that, because it can quickly make your code hard to read. Running one regular expression per date format seems clearer.
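If you would rather do the matching and the conversion to a Date on the Java side, the same one-pattern-per-format idea might look roughly like the sketch below. The class name DateExtractor, the three formats, the month-first reading of dd/dd/dddd, and the choice of midnight UTC are all assumptions made for illustration; you would replace them with whatever formats you decide to accept. The input is just a string, for example the page source you get from Selenium.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateExtractor {

    // Pairs the regular expression that finds a format with the formatter that parses it.
    private static final class Format {
        final Pattern pattern;
        final DateTimeFormatter formatter;

        Format(String regex, String datePattern) {
            this.pattern = Pattern.compile(regex);
            this.formatter = DateTimeFormatter.ofPattern(datePattern, Locale.ENGLISH);
        }
    }

    // Hypothetical list of accepted formats: one entry per format, as suggested above.
    private static final List<Format> FORMATS = Arrays.asList(
            new Format("\\b\\d{4}-\\d{2}-\\d{2}\\b", "uuuu-MM-dd"),
            new Format("\\b\\d{2}/\\d{2}/\\d{4}\\b", "MM/dd/uuuu"), // assumption: month first
            new Format("\\b[A-Z][a-z]{2},\\d{1,2},\\d{4}\\b", "MMM,d,uuuu"));

    // Returns every date found in the text, converted to java.util.Date at midnight UTC.
    static List<Date> extractDates(String text) {
        List<Date> dates = new ArrayList<>();
        for (Format format : FORMATS) {
            Matcher matcher = format.pattern.matcher(text);
            while (matcher.find()) {
                try {
                    LocalDate parsed = LocalDate.parse(matcher.group(), format.formatter);
                    dates.add(Date.from(parsed.atStartOfDay(ZoneOffset.UTC).toInstant()));
                } catch (DateTimeParseException e) {
                    // The text looked like a date but is not one (for example 99/99/2011); skip it.
                }
            }
        }
        return dates;
    }
}
```

Calling DateExtractor.extractDates(...) on the div from the question should give a single Date for 22 May 2011. Results come back grouped by format rather than in page order, so sort them if the order matters to you.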