WinBatch Tech Support Home

Parse HTML

 Keywords: Parsing extract grab get text HTM strip HTML MSIE IE Internet Explorer COM OLE Regular Expressions httpStripHTML

Question:

I am trying to parse some HTML files for specific values. What is the preferred method in WinBatch for parsing an HTML file? Should I do it with the regular string functions? Are there any tricks or extenders? A regular expression extender, perhaps?

Answer:

There are several ways to parse HTML with WinBatch:

Via OLE COM:

Get text from HTML pages

From a URL:

Browser = ObjectCreate("InternetExplorer.Application")
browser.visible = @true
browser.navigate ("http://www.winbatch.com") ;to access an internet file
while browser.readystate <> 4
       timedelay(0.5)
endwhile
BrowserDoc = Browser.Document
BrowserBody = BrowserDoc.Body
BrowserPage = BrowserBody.CreateTextRange
BrowserText = BrowserPage.Text
message("Text", BrowserText)
exit

From an HTML string:

;***************************************************************************
;**
;**     Internet Explorer Strip HTML
;**
;***************************************************************************

#DefineFunction udfStripHTML(strHTML)
   ;This code doesn't include TITLE in results
   objIE = ObjectCreate("InternetExplorer.Application")
   objIE.visible = @FALSE
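   objIE.navigate("about:blank")
   While objIE.readyState <> 4   ; wait until the blank document has finished loading
      TimeDelay(0.1)
   EndWhile
   objDiv = objIE.Document.createElement("div")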
   objDiv.innerHTML = strHTML
   strTxt = objDiv.InnerText
   ;Pause('InnerText',strTxt)
   objIE.quit
   objDiv = 0
   objIE = 0
   Return strTxt
#EndFunction

strHTML=  `<HTML><HEAD><TITLE>Basic HTML Sample Page</TITLE></HEAD><BODY><CENTER><H1>A Simple Sample Web Page</H1></CENTER></BODY></HTML> `
strText = udfStripHTML(strHTML)
Pause('udfStripHTML',strText)
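
Since the question asks about pulling out specific values, the same Internet Explorer approach can be narrowed to a single element. The following is only a sketch: udfGetTagText is a hypothetical helper name, and it assumes the element collection returned by getElementsByTagName exposes length and item the way the IE DOM normally does.

```winbatch
#DefineFunction udfGetTagText(strHTML, strTag)
   ; Hypothetical helper: returns the inner text of the first strTag element, or "" if there is none
   objIE = ObjectCreate("InternetExplorer.Application")
   objIE.visible = @FALSE
   objIE.navigate("about:blank")
   While objIE.readyState <> 4   ; wait until the blank document has finished loading
      TimeDelay(0.1)
   EndWhile
   objDiv = objIE.Document.createElement("div")
   objDiv.innerHTML = strHTML
   objList = objDiv.getElementsByTagName(strTag)   ; assumption: standard IE DOM collection
   strTxt = ""
   If objList.length > 0
      objItem = objList.item(0)
      strTxt = objItem.innerText
      objItem = 0
   EndIf
   objIE.quit
   objList = 0
   objDiv = 0
   objIE = 0
   Return strTxt
#EndFunction

strHTML = `<HTML><HEAD><TITLE>Basic HTML Sample Page</TITLE></HEAD><BODY><CENTER><H1>A Simple Sample Web Page</H1></CENTER></BODY></HTML>`
Pause('udfGetTagText', udfGetTagText(strHTML, "H1"))
```

With the sample string, this should show only the heading text rather than the whole page text.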

Using WinSock Extender:

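See the functions httpRecvText and httpStripHTML:

AddExtender("wwwsk34i.dll")
URL = "http://www.winbatch.com/"
serv = httpGetServer(URL, "")            ; server name portion of the URL
path = httpGetPath(URL, "")              ; path portion of the URL
a = httpRecvText(serv, path, 30000, 0)   ; fetch the page text (up to 30000 bytes)
a = httpStripHTML(a)                     ; remove the HTML tags
Message("File Contents are", a)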

Using Regular Expressions:

Some regular expression UDFs are collected in the "Regular Expression UDFs" article.

;***************************************************************************
;**
;**     RegEx Strip HTML
;**
;***************************************************************************

#DefineFunction udfRegExStripHTML(strHTML)
   objRegExp = ObjectCreate("VBScript.RegExp") ; Creates a regular expression object for use by WinBatch.
   objRegExp.IgnoreCase = @TRUE     ; Set case insensitivity. Default is @false.
   objRegExp.Global     = @TRUE   ; Set global applicability. Default is @false.
   ; Pattern thanks to Roy Osherove, Oisin and Hugh Brown:
   ; http://osherove.com/blog/2003/5/13/strip-html-tags-from-a-string-using-regular-expressions.html
   objRegExp.Pattern = `<(.|\n)*?>`
   ;objRegExp.Pattern = `<[^>]*>`   ; alternative pattern
   strTxt = objRegExp.Replace(strHTML,"")
   objRegExp = 0
   Return strTxt
#EndFunction

strHTML=  `<HTML><HEAD><TITLE>Basic HTML Sample Page</TITLE></HEAD><BODY><CENTER><H1>A Simple Sample Web Page</H1></CENTER></BODY></HTML> `
strText = udfRegExStripHTML(strHTML)
Pause('udfRegExStripHTML',strText)
Exit
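
To grab one specific value instead of stripping every tag, the same VBScript.RegExp object can be used with a capture group. This is a minimal sketch rather than a definitive implementation: udfRegExGetTitle is a hypothetical name, and it assumes the page has a single TITLE element. It relies on the Test method and on "$1" in the Replace string, both of which VBScript.RegExp 5.5 provides.

```winbatch
#DefineFunction udfRegExGetTitle(strHTML)
   ; Hypothetical helper: returns the text between <TITLE> and </TITLE>, or "" if no title is found
   objRegExp = ObjectCreate("VBScript.RegExp")
   objRegExp.IgnoreCase = @TRUE    ; match <title> and <TITLE> alike
   objRegExp.Global     = @FALSE   ; a single match spanning the whole string is enough
   ; The pattern consumes the entire string and captures only the title text in group 1
   objRegExp.Pattern = `[\s\S]*<title[^>]*>([\s\S]*?)</title>[\s\S]*`
   strTxt = ""
   If objRegExp.Test(strHTML) Then strTxt = objRegExp.Replace(strHTML, "$1")
   objRegExp = 0
   Return strTxt
#EndFunction

strHTML = `<HTML><HEAD><TITLE>Basic HTML Sample Page</TITLE></HEAD><BODY><CENTER><H1>A Simple Sample Web Page</H1></CENTER></BODY></HTML>`
Pause('udfRegExGetTitle', udfRegExGetTitle(strHTML))
```

Replacing the whole match with "$1" keeps the example to the Replace call already used above; the Test call guards against returning the unmodified HTML when there is no title.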

Using BinaryTag Functions:

http://techsupt.winbatch.com/webcgi/webbatch.exe?techsupt/nftechsupt.web+Tutorials+Template~File~Processing.txt

That tutorial is mostly about "template" files, but its section on BinaryTags is relevant here.
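
As a rough illustration of the BinaryTag idea applied to HTML, the sketch below copies an HTML string into a binary buffer and extracts the text between one pair of markers. It assumes the markers appear in the HTML exactly as written (same case, no attributes), and the 100 bytes of slack in the buffer is an arbitrary choice.

```winbatch
strHTML = `<HTML><HEAD><TITLE>Basic HTML Sample Page</TITLE></HEAD><BODY><CENTER><H1>A Simple Sample Web Page</H1></CENTER></BODY></HTML>`

bb = BinaryAlloc(StrLen(strHTML) + 100)     ; buffer with some slack
BinaryPokeStr(bb, 0, strHTML)               ; copy the HTML into the buffer

tag = BinaryTagInit(bb, "<H1>", "</H1>")    ; start marker and end marker to search for
tag = BinaryTagFind(tag)                    ; returns "" when no tagged text is found
If tag == ""
   Message("BinaryTag", "No <H1> text found")
Else
   Message("BinaryTag", BinaryTagExtr(tag, 1))   ; flag 1 strips tabs, CRs and LFs
EndIf

BinaryFree(bb)
```

For an HTML file on disk, the buffer would instead be sized from FileSize and filled with BinaryRead.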


Article ID:   W15522
File Created: 2014:07:18:09:51:38
Last Updated: 2014:07:18:09:51:38