Parse HTML

 Keywords: Parsing extract grab get text HTM strip HTML MSIE IE Internet Explorer COM OLE Regular Expressions httpStripHTML

Question:

I am trying to parse some HTML files for specific values. What is the preferred method in WinBatch for parsing an HTML file? Should I do it with the regular string functions? Are there any tricks or extenders? A regular expression extender, perhaps?

Answer:

There are quite a few ways to parse HTML using WinBatch:

Via OLE COM:

Get text from HTML pages

URL

Browser = ObjectCreate("InternetExplorer.Application")
Browser.Visible = @TRUE
Browser.Navigate("http://www.winbatch.com") ; to access an internet file
; Wait until the page has finished loading (ReadyState 4 = complete)
While Browser.ReadyState <> 4
   TimeDelay(0.5)
EndWhile
BrowserDoc = Browser.Document
BrowserBody = BrowserDoc.Body
BrowserPage = BrowserBody.CreateTextRange
BrowserText = BrowserPage.Text ; plain text of the page body
Message("Text", BrowserText)
Browser.Quit ; close the browser instance when done
Exit

HTML String

;***************************************************************************
;**
;**     Internet Explorer Strip HTML
;**
;***************************************************************************

#DefineFunction udfStripHTML(strHTML)
   ;This code doesn't include TITLE in results
   objIE = ObjectCreate("InternetExplorer.Application")
   objIE.visible = @FALSE
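   objIE.navigate("about:blank")
   ; Wait for the blank page to finish loading before using its document
   While objIE.readystate <> 4
      TimeDelay(0.5)
   EndWhile
   objDiv = objIE.Document.createElement("div")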
   objDiv.innerHTML = strHTML
   strTxt = objDiv.InnerText
   ;Pause('InnerText',strTxt)
   objIE.quit
   objDiv = 0
   objIE = 0
   Return strTxt
#EndFunction

strHTML=  `<HTML><HEAD><TITLE>Basic HTML Sample Page</TITLE></HEAD><BODY><CENTER><H1>A Simple Sample Web Page</H1></CENTER></BODY></HTML> `
strText = udfStripHTML(strHTML)
Pause('udfStripHTML',strText)

Using WinSock Extender:

See the functions httpRecvText and httpStripHTML

AddExtender("wwwsk34i.dll")
URL = "http://www.winbatch.com/"
serv = httpGetServer(URL, "")
path = httpGetPath(URL, "")
a = httpRecvText(serv, path, 30000, 0)
a = httpStripHTML(a)
Message("File Contents are", a)

Using Regular Expressions:

See also the article "Regular Expression UDFs".

;***************************************************************************
;**
;**     RegEx Strip HTML
;**
;***************************************************************************

#DefineFunction udfRegExStripHTML(strHTML)
   objRegExp = ObjectCreate("VBScript.RegExp") ; Creates a regular expression object for use by WinBatch.
   objRegExp.IgnoreCase = @TRUE     ; Set case insensitivity. Default is @false.
   objRegExp.Global     = @TRUE   ; Set global applicability. Default is @false.
   ; Pattern thanks to Roy Osherove, Oisin and Hugh Brown: http://osherove.com/blog/2003/5/13/strip-html-tags-from-a-string-using-regular-expressions.html
   objRegExp.Pattern = `<(.|\n)*?>` ; Non-greedy match of everything from < to the next >, newlines included
   ;objRegExp.Pattern = `<[^>]*>`
   strTxt = objRegExp.Replace(strHTML,"")
   objRegExp = 0
   Return strTxt
#EndFunction

strHTML=  `<HTML><HEAD><TITLE>Basic HTML Sample Page</TITLE></HEAD><BODY><CENTER><H1>A Simple Sample Web Page</H1></CENTER></BODY></HTML> `
strText = udfRegExStripHTML(strHTML)
Pause('udfRegExStripHTML',strText)
Exit

Using BinaryTag Functions:

http://techsupt.winbatch.com/webcgi/webbatch.exe?techsupt/nftechsupt.web+Tutorials+Template~File~Processing.txt

The article is mostly about "template" files, but its section on BinaryTags is relevant.
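
As a rough illustration of the BinaryTag approach (a minimal sketch, not taken from the linked article), the script below pulls the text between a start tag and an end tag out of an HTML string. The variable names (bb, tagstruct, strTitle) and the choice of the TITLE tag are only assumptions made for this example. The tags are written in the same case as they appear in the sample HTML.

```winbatch
; Sketch: extract the text between <TITLE> and </TITLE> with the BinaryTag functions.
strHTML = `<HTML><HEAD><TITLE>Basic HTML Sample Page</TITLE></HEAD><BODY><CENTER><H1>A Simple Sample Web Page</H1></CENTER></BODY></HTML>`

bb = BinaryAlloc(StrLen(strHTML) + 100)   ; binary buffer with some slack
BinaryPokeStr(bb, 0, strHTML)             ; copy the HTML into the buffer

tagstruct = BinaryTagInit(bb, "<TITLE>", "</TITLE>")  ; define the start and end tags
tagstruct = BinaryTagFind(tagstruct)      ; returns "" when no (more) tags are found

If tagstruct != ""
   strTitle = BinaryTagExtr(tagstruct, 0) ; text between the start and end tags
   Message("Title", strTitle)
Else
   Message("Title", "No TITLE tag found")
EndIf

BinaryFree(bb)                            ; release the buffer
Exit
```

To process a file instead of a string, the buffer could be sized with FileSize and filled with BinaryRead. To collect every occurrence of a tag, BinaryTagFind can be called in a loop until it returns an empty string.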

Article ID:   W15522

File Created: 2014:07:18:09:51:38

Last Updated: 2014:07:18:09:51:38
