Scrape External Site Content With ASP
There are many times when you have wanted to grab something from another site that isn’t provided via RSS. You could type it in, but that would be time consuming. So how do we do this? We scrape the content from the site using the XMLHTTP object. If you have read the previous article on how to cache an RSS feed, you will no doubt see many similarities in this article, so I won’t re-explain those portions. So let’s have a look at the code.
<%
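FUNCTION LoadThePage(strPageText, strInputURL)
    'strPageText is passed ByRef (the VBScript default), so the caller receives the downloaded HTML
    Dim objXMLHTTP
    Set objXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")
    'Open a synchronous GET request (False = do not run asynchronously)
    objXMLHTTP.Open "GET", strInputURL, False
    objXMLHTTP.Send
    strPageText = objXMLHTTP.responseText
    Set objXMLHTTP = Nothing
End FUNCTION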
FUNCTION GrabTheContent(strStart, strEnd)
    'Reads the page-level variable strPageText filled in by LoadThePage
    Dim strStartPos, strEndPos, strLength, myContent
    strStartPos = 0
    strEndPos = 0
    strLength = 0
    GrabTheContent = ""
    'Find the start position of the start string
    strStartPos = InStr(strPageText, strStart)
    'Give up if the start string is missing; InStr raises an error when asked to search from position 0
    If strStartPos = 0 Then Exit Function
    'Searching from the start position, find the end string and call that the end position
    strEndPos = InStr(strStartPos, strPageText, strEnd)
    'Give up if the end string is missing; otherwise the length below would be negative
    If strEndPos = 0 Then Exit Function
    'Compute the length of the string in between the start and end positions
    strLength = strEndPos - strStartPos
    'Filter the content, use Trim to eliminate leading and trailing spaces
    myContent = Trim(Mid(strPageText, strStartPos, Len(strStart))) & "<br/>" & vbCRLF
    myContent = myContent & Trim(Mid(strPageText, strStartPos + Len(strStart), strLength - Len(strStart))) & "<br/>" & vbCRLF
    GrabTheContent = myContent
End FUNCTION
'Declare the string used to hold the HTTP and the start/end strings
Dim strPageText, strStart, strEnd
'Declare and initialize the string used to hold the Input Page URL
Dim strInputURL
if DateDiff("h", Application("updated"), Now()) >=1 then
    strInputURL = "http://www.austmus.gov.au/factSheets/galah.htm"
    'Load the desired page into a string
    LoadThePage strPageText, strInputURL
    strHTML = GrabTheContent("Cacatua roseicapilla","Food and feeding")
    Application.Lock
    Application("content") = strHTML
    Application("updated") = Now()
    Application.Unlock
end if
strHTML = Application("content")
response.write (strHTML)
%>
The code contains two functions, LoadThePage and GrabTheContent, and their names explain what they do. LoadThePage uses an XMLHTTP object to download the external HTML and saves a copy of it in the string strPageText. GrabTheContent searches this string and returns the portion we want as a string that we can use. We give this function two parameters: the text at the start of the section we want to grab and the text that marks the end of it. Pretty simple. For this example, we will be using http://www.austmus.gov.au/factSheets/galah.htm as the page we retrieve text from.
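To make the extraction step concrete, here is a minimal sketch of the same InStr/Mid approach applied to a short, made-up string instead of a downloaded page. The function name ExtractBetween and the sample text are assumptions for illustration only. Unlike GrabTheContent, this sketch takes the source text as a parameter, looks for the end marker only after the start marker, and returns an empty string when either marker is missing.

```vbscript
' Hypothetical helper: returns the text from strStart up to (but not
' including) strEnd, or "" if either marker cannot be found.
Function ExtractBetween(strText, strStart, strEnd)
    Dim lngStartPos, lngEndPos
    ExtractBetween = ""
    lngStartPos = InStr(strText, strStart)
    If lngStartPos = 0 Then Exit Function
    ' Look for the end marker only after the start marker
    lngEndPos = InStr(lngStartPos + Len(strStart), strText, strEnd)
    If lngEndPos = 0 Then Exit Function
    ExtractBetween = Trim(Mid(strText, lngStartPos, lngEndPos - lngStartPos))
End Function

' Made-up sample text standing in for a downloaded page
Dim strSample, strResult
strSample = "Header Cacatua roseicapilla A pink and grey cockatoo. Food and feeding Seeds."
strResult = ExtractBetween(strSample, "Cacatua roseicapilla", "Food and feeding")
' strResult is now "Cacatua roseicapilla A pink and grey cockatoo."
```

Everything before the start marker and everything from the end marker onwards is discarded, which is why it pays to choose marker text that appears only once on the page.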
Now for the main portion of the program. First it checks the date stamp stored in the application variable when the content was last cached and, if it is an hour or more old, retrieves a fresh copy. We do this so we don’t slow down the performance of our page or hammer someone else’s website with heaps of connections. Once we have either retrieved a new copy or decided to use the existing one, we display the contents on the page. So we would see these results:
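The freshness check can also be pulled out into a small function, which makes the "nothing cached yet" case explicit. This is only a sketch: IsCacheStale and its parameter names are hypothetical and not part of the script above, and the dates are made up.

```vbscript
' Hypothetical helper: True when no timestamp is stored yet, or when the
' stored timestamp is at least intMaxAgeHours old according to DateDiff.
Function IsCacheStale(varLastUpdated, dtmNow, intMaxAgeHours)
    If Not IsDate(varLastUpdated) Then
        ' Nothing cached yet (for example an Empty application variable)
        IsCacheStale = True
    Else
        IsCacheStale = (DateDiff("h", varLastUpdated, dtmNow) >= intMaxAgeHours)
    End If
End Function

' Made-up timestamps to show the three cases
Dim dtmStored, blnAfterHalfHour, blnAfterThreeHours, blnNeverCached
dtmStored = DateSerial(2024, 1, 1) + TimeSerial(10, 0, 0)
blnAfterHalfHour = IsCacheStale(dtmStored, DateAdd("n", 30, dtmStored), 1)   ' False
blnAfterThreeHours = IsCacheStale(dtmStored, DateAdd("h", 3, dtmStored), 1)  ' True
blnNeverCached = IsCacheStale(Empty, dtmStored, 1)                           ' True
```

In the page itself the call would look something like If IsCacheStale(Application("updated"), Now(), 1) Then, with the body of the If left as it is.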
OK, so that is nice, but how would we format this retrieved text? How do you remove the HTML tags?
To strip the HTML tags, we will be using a function I have previously published named stripHTML. This function strips ALL HTML tags using a regular expression and returns plain text. To format the text’s appearance, we then use the string REPLACE function to change, remove or insert tags in the appropriate places. Essentially, you scan through the retrieved text and add in the formatting you like. Finally, we present this text inside a DIV with an inline style applied. So first we add in the stripHTML function:
<%
FUNCTION stripHTML(strHTML)
    Dim objRegExp, strOutput, tempStr
    Set objRegExp = New Regexp
    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    'Match any tag: "<", then as few characters (including newlines) as possible, then ">"
    objRegExp.Pattern = "<(.|\n)+?>"
    'Replace all HTML tag matches with the empty string
    strOutput = objRegExp.Replace(strHTML, "")
    'Replace all < and > with &lt; and &gt;
    strOutput = Replace(strOutput, "<", "&lt;")
    strOutput = Replace(strOutput, ">", "&gt;")
    stripHTML = strOutput 'Return the value of strOutput
    Set objRegExp = Nothing
END FUNCTION
%>
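To see why a formatting step is needed after stripping, here is a small worked example on a made-up HTML fragment. StripTags is a compact stand-in written only to keep the example self-contained (it is not the published function), and the sample markup is an assumption.

```vbscript
' Compact stand-in for the tag-stripping step, for illustration only
Function StripTags(strHTML)
    Dim objRegExp
    Set objRegExp = New RegExp
    objRegExp.Global = True
    ' Match "<", then as few characters as possible, then ">"
    objRegExp.Pattern = "<(.|\n)+?>"
    StripTags = objRegExp.Replace(strHTML, "")
    Set objRegExp = Nothing
End Function

' Made-up fragment with a heading and a paragraph
Dim strSample, strPlain, strFormatted
strSample = "<h2>Description</h2><p>Galahs are <i>pink</i> and grey.</p>"
strPlain = StripTags(strSample)
' strPlain is now "DescriptionGalahs are pink and grey."
strFormatted = Replace(strPlain, "Description", "<b>Description</b><br/>")
' strFormatted is now "<b>Description</b><br/>Galahs are pink and grey."
```

Once the tags are gone the heading runs straight into the paragraph text, so the Replace call is what puts a visible heading and a line break back in.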
And finally, we would change how the text is presented at the bottom of the script to this:
<%
strHTML = Application("content")
strHTML = stripHTML(strHTML)
strHTML = REPLACE(strHTML,"Cacatua roseicapilla","<B>Cacatua Roseicapilla</B><br/><br/>")
strHTML = REPLACE(strHTML,"Description","<b style='font:bold 12px/15px Arial,Helvetica;color:#cc0000;'>Description</b><br/>")
strHTML = REPLACE(strHTML,"Distribution and Habitat ","<br/><br/><b style='font:bold 12px/15px Arial, Helvetica;color:#cc0000;'>Distribution and Habitat </b><br/>")
response.write "<DIV style='width:250px;font:normal 12px/15px Arial,Helvetica;color:#333333;text-align:justify;'>" & strHTML & "</div>"
%>
So let’s take a look at how it appears now:
There you go, much better. The content is exactly the same as in the first example, but it is much more readable and, with your own formatting applied, easier to adapt to your site. Hopefully you will find this useful one day.