HTMLPage

mn8 Language Reference | Index

HTMLPage is the concept returned by every retrieval function (FROM) whose main purpose is fetching documents, when the document fetched is an HTML page. The page content is available as a stream stored in the content element of the concept, and the concept does not alter the source of the page in any way. One or more pieces of the page can be extracted with regular expressions through the select method provided by the String concept. Once extracted, the pieces can be transformed into a Series of tags through the Tag/getTags method. The getTags method of HTMLPage returns the content of the whole page directly as a Series of tags.

Usage

        $page = HTMLPage.create("http://www.nolimits.ro")
        PRINT $page@url
            http://www.nolimits.ro

        PRINT $page/getLinks
            http://www.nolimits.ro/corpinfo_en.shtml
            http://www.nolimits.ro/documents/index.shtml
            http://www.nolimits.ro/product_en.shtml
            http://www.nolimits.ro/sl_en.shtml
            http://www.nolimits.ro/news_en.shtml
            http://www.nolimits.ro/contact_en.shtml
            http://www.nolimits.ro/search_en.shtml
            http://www.nolimits.ro/forum/
            http://www.nolimits.ro/help_en.shtml
            http://www.nolimits.ro/ro/
            http://www.nolimits.ro/

This example shows how to log in to a web page using forms and cookies. It assumes you have already completed the required form; if not, see the examples in HTMLForm.

        $url = "http://192.168.1.22/oursite/"

        # Store the HTMLPage from the URL in $page.
        # This step is needed only to obtain and store the cookies.
        # It does not log you in!
        $page from $url + "index.php"

        # Saving cookies
        $cookies = $page.getCookies
        each $i in $cookies do [
          $i.storeCookie ]

        # creating a simplex expression
        $expr = Simplex.create( $url + "*" )

        # collecting the stored forms using the simplex
        # We need to use only the first form
        $form = HTMLForm.getStoredForms($expr)/1

        # Apply the form (with the POST method); this is the step that logs in
        HTMLPage.create( $form )

        # Finally, fetch the page from the URL; we are now logged in
        $page from $url + "index.php"

Version:	0.1
Authors:	Remus Pereni (http://neuro.nolimits.ro)
Location:
Inherits:	Concept

Attribute List

	@contentType TYPEOF String LABEL "contentType"
	@lastModified TYPEOF Integer LABEL "lastModified"
	@length TYPEOF Integer LABEL "length"
	@title TYPEOF integer LABEL "title"
	@url TYPEOF String LABEL "url"

Element List

content TYPEOF String LABEL "content"

Constructor List

create (String $url)

create (HTMLForm $form)

Method List

Concept	getContent
Series	getCookies
Series	getForms
Map	getHeaders
Series	getLinks
Series	getTags
Series	getTagsWithText
Series	getURIs
Integer	getResponseCode
String	getResponseMessage
static	setFollowRedirects (Logical $bol)
static Logical	isFollowRedirects

Methods inherited from: Concept

cloneConcept, extendsConcept, fromXML, getAllInheritedConcepts, getConceptAttribute, getConceptAttributeField, getConceptAttributeFields, getConceptAttributes, getConceptConstructors, getConceptElement, getConceptElementField, getConceptElementFields, getConceptElements, getConceptLabel, getConceptMethod, getConceptMethods, getConceptOperators, getConceptType, getErrorHandler, getInheritedConcepts, getResourceURI, hasConceptAttribute, hasConceptElement, hasConceptMethod, hasPath, isHidden, loadContent, setConceptLabel, setErrorHandler, setHidden, setShowEmpty, showEmpty, toTXT, toXML, setResourceURI

Detailed Attribute Info

@contentType

Label:	contentType
Type:	String
Is Static:	false
Is Hidden:	false
Show Empty:	true

Contains the type of the content of this page which is the value of the content-type header field.

@lastModified

Label:	lastModified
Type:	Integer
Is Static:	false
Is Hidden:	false
Show Empty:	true

An Integer representing the time the file was last modified, measured in milliseconds since the epoch (00:00:00 GMT, January 1, 1970).
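
As a rough illustration of the unit, the sketch below converts such a millisecond count into a calendar date. It is written in Python rather than mn8, and both the helper name `millis_to_utc` and the sample value are assumptions made for this example; neither is part of the HTMLPage concept.

```python
from datetime import datetime, timedelta, timezone

# The epoch named above: 00:00:00 GMT, January 1, 1970.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_utc(millis: int) -> datetime:
    """Convert milliseconds since the epoch to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


# Hypothetical @lastModified value, chosen only for illustration.
sample = 1028291174000
print(millis_to_utc(sample).isoformat())  # 2002-08-02T12:26:14+00:00
```

Integer arithmetic with `timedelta` is used here instead of dividing by 1000, so no precision is lost for large values.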

@length

Label:	length
Type:	Integer
Is Static:	false
Is Hidden:	false
Show Empty:	true

Contains the length of this page.

@title

Label:	title
Type:	integer
Is Static:	false
Is Hidden:	false
Show Empty:	true

Contains the title of this document, extracted from between the <title>...</title> tags of the HTML page.
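
The kind of extraction described here can be sketched in a few lines. The following is a Python approximation, not the mn8 implementation: the function name `extract_title`, the regular expression, and the choice to trim surrounding whitespace are all assumptions.

```python
import re
from typing import Optional

# Matches the first <title>...</title> pair, ignoring case and line breaks.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str) -> Optional[str]:
    """Return the text between the title tags, or None if there is no title."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


sample = "<html><head><title>Google</title></head><body></body></html>"
print(extract_title(sample))  # Google
```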

@url

Label:	url
Type:	String
Is Static:	false
Is Hidden:	false
Show Empty:	true

Contains the URL of the document reproduced in this HTMLPage concept.

Detailed Element Info

content

Label:	content
Type:	String
Is Static:	false
Is Hidden:	false
Is Multi:	false
Show Empty:	true

Contains a stream with the actual page content. This is the primary way the content of the page is kept. The reason is that regular expressions can be applied to the stream to extract the relevant pieces of the page, and the selected pieces can then be passed to a Tag constructor for further processing.
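
To make the "select pieces with a regular expression, then process them" idea concrete, here is a small Python sketch. It only mirrors the workflow; `select_fragments`, the sample markup, and the pattern are hypothetical and do not reproduce the mn8 select method or the Tag constructor.

```python
import re
from typing import List


def select_fragments(content: str, pattern: str) -> List[str]:
    """Return every piece of the page content that matches the pattern."""
    flags = re.IGNORECASE | re.DOTALL
    return [match.group(0) for match in re.finditer(pattern, content, flags)]


# Hypothetical page content, used only for illustration.
page = '<p>intro</p><a href="/news/">News</a><p>more</p><a href="/ads/">Ads</a>'

# Keep only the anchor elements; each piece could then be parsed further.
anchors = select_fragments(page, r"<a\b[^>]*>.*?</a>")
print(anchors)  # ['<a href="/news/">News</a>', '<a href="/ads/">Ads</a>']
```

Because the source is kept unaltered, different patterns can be applied to the same content without fetching the page again.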

Detailed Constructor Info

create (String $url)

Parameters:

$url : The URL from which this page will be constructed.

Exceptions:

badURLException : (Error)	If the URL is not valid.
IOException : (Error)	If an I/O exception occurs.
httpOperationException : (Error)	If the HTML page header or page content cannot be retrieved.

Constructor which produces an HTMLPage concept from the URL given as parameter. The concept produced in this way will have no headers.

create (HTMLForm $form)

Parameters:

$form : The HTMLForm from which this page will be constructed.

Exceptions:

badURLException : (Error)	If the URL is not valid.
httpOperationException : (Error)	If the HTML page handler or page content cannot be retrieved.
IOException : (Error)	If an I/O exception occurs.

Constructor which produces an HTMLPage concept from the HTMLForm given as parameter. The concept produced in this way will have no headers.

Detailed Method Info

getContent

Returns: Concept

Exceptions:

httpOperationException : (Error)	If the HTML page content cannot be retrieved.

Returns a String or a ByteArray representing the content of this page.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getContent
            -- the result is --
            <html><head><META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">
            <title>Google</title><style>
            <!-- body,td,a,p,.h{font-family:arial,sans-serif;} .h{font-size: 20px;} .h{color:} 
            .q{text-decoration:none; color:#0000cc;}
            //--></style>
            ...
            </html>

getCookies

Returns: Series

Returns a Series containing Cookie concepts of all the cookie directives found in the HTTP header of this page.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getCookies
            -- the result is --
            Name   : PREF
            Value  : ID=01ea0cb64090ca3a:TM=1028290720:LM=1028290720:S=MnaQWUJlyfw
            Domain : .google.com
            Path   : /
            Expires: 2147368447000
            Secure : FALSE
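
For readers unfamiliar with cookie directives, the Python sketch below shows how a single Set-Cookie header line breaks down into the fields printed above. It is an illustration only: `parse_set_cookie` and the sample header are hypothetical, and the real Cookie concept may parse headers differently (for example, it reports the expiry as a number, which this sketch does not attempt).

```python
from typing import Dict, Union


def parse_set_cookie(header: str) -> Dict[str, Union[str, bool]]:
    """Split one Set-Cookie header value into its name, value and attributes."""
    parts = [part.strip() for part in header.split(";") if part.strip()]

    # The first part is always name=value; the value itself may contain "=".
    name, _, value = parts[0].partition("=")
    cookie: Dict[str, Union[str, bool]] = {
        "name": name,
        "value": value,
        "secure": False,
    }

    for attribute in parts[1:]:
        key, _, val = attribute.partition("=")
        # Flag attributes such as "secure" carry no value, so store True.
        cookie[key.strip().lower()] = val.strip() if val else True
    return cookie


# Hypothetical header, used only for illustration.
header = "PREF=ID=abc123; domain=.example.com; path=/; expires=Sun, 17-Jan-2038 19:14:07 GMT"
for field, content in parse_set_cookie(header).items():
    print(field, ":", content)
```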

getForms

Returns: Series

Returns a Series containing HTMLForm concepts of all the HTML forms found in this particular page.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getForms
            -- the result is --
            HTMLForm
            method: GET     url: http://www.google.com/search       name: f id: -1309137107
            hl en
            ie ISO-8859-1
            q 
            btnG Google Search
            btnI I'm Feeling Lucky

getHeaders

Returns: Map

Returns a Map containing all the HTTP header key/value pairs, as returned by the server which served this page.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getHeaders
            -- the result is --
            Map
            Content-Length 3152
            Server GWS/2.0
            Date Fri, 02 Aug 2002 12:26:14 GMT
            Content-Type text/html
            Cache-control private
            Set-Cookie PREF=ID=2b87fa35752aa236:TM=1028291174:LM=1028291174:S=mGb7J7tJ95Q; 
            domain=.google.com; path=/; expires=Sun, 17-Jan-2038 19:14:07 GMT

getLinks

Returns: Series

Returns a Series of Elements containing all the links found in the page. Relative links are returned too, but in their absolute form (with the URL of the page prepended).

            $page = HTMLPage.create("http://www.google.com")
            print $page.getLinks
            -- the result is --
            http://www.google.com/imghp?hl=en&ie=UTF-8
            http://www.google.com/grphp?hl=en&ie=UTF-8
            http://www.google.com/dirhp?hl=en&ie=UTF-8
            http://www.google.com/advanced_search?hl=en
            http://www.google.com/preferences?hl=en
            http://www.google.com/language_tools?hl=en
            http://www.google.com/ads/
            http://www.google.com/services/
            http://www.google.com/news/
            http://toolbar.google.com
            http://www.google.com/about.html
            http://www.google.com/mgyhp.html
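
The conversion of relative links to absolute form can be pictured with the Python sketch below. Note the assumptions: `absolutize` and the sample href values are hypothetical, and `urljoin` applies standard URL resolution rules, which may differ in edge cases from whatever getLinks does internally.

```python
from typing import List
from urllib.parse import urljoin


def absolutize(page_url: str, hrefs: List[str]) -> List[str]:
    """Resolve each link against the URL of the page it was found in."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "http://www.google.com/"
# Hypothetical href values as they might appear in the page markup.
hrefs = ["/ads/", "about.html", "http://toolbar.google.com"]

for link in absolutize(page_url, hrefs):
    print(link)
# http://www.google.com/ads/
# http://www.google.com/about.html
# http://toolbar.google.com
```

Links that are already absolute pass through unchanged, which matches the sample output above, where a link to another host appears next to links on the page's own host.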

getTags

Returns: Series

Returns a Series containing the Tags and Strings that result from processing the content of this HTML page.

The rules are:

* All tags opened by the < symbol and closed by the > symbol are transformed into a Tag concept.
* All strings outside tags are transformed into a String concept.
* The CR LF, CR, and LF symbols cause the parser to create a new String concept.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getTags
            -- the result is --
            <html>
            <head>
            <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">
            <title>
            </title>
            <style>
            ...
            </html>
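
The three rules can be sketched as a small tokenizer. This Python version is only an approximation of the described behaviour: the function name, the tuple representation, and the decision to drop whitespace-only strings are assumptions, and a stray < with no closing > is simply skipped.

```python
import re
from typing import List, Tuple

# A tag is everything from "<" to the next ">"; anything else is text.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")
# CR LF, CR and LF each start a new string.
LINE_BREAK_RE = re.compile(r"\r\n|\r|\n")


def split_tags_and_strings(html: str) -> List[Tuple[str, str]]:
    """Split page content into ("tag", text) and ("string", text) items."""
    items: List[Tuple[str, str]] = []
    for token in TOKEN_RE.findall(html):
        if token.startswith("<"):
            items.append(("tag", token))
            continue
        for piece in LINE_BREAK_RE.split(token):
            # Assumption: whitespace-only pieces are not worth keeping.
            if piece.strip():
                items.append(("string", piece))
    return items


sample = "<head>\r\n<title>Google</title>\nfirst line\nsecond line</head>"
for kind, text in split_tags_and_strings(sample):
    print(kind, repr(text))
```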

getTagsWithText

Returns: Series

Returns a Series containing the tags, together with their text, that result from processing the content of this HTMLPage.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getTagsWithText
            -- the result is --
            <html>
            <head>
            <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">
            <title>
            Google
            </title>
            <style>
            ...
            </html>

getURIs

Returns: Series

Returns a Series with all URIs contained by this HTMLPage.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getURIs
            -- the result is --
            http://www.google.com/imghp?hl=en&ie=UTF-8
            http://www.google.com/grphp?hl=en&ie=UTF-8
            http://www.google.com/dirhp?hl=en&ie=UTF-8
            http://www.google.com/advanced_search?hl=en
            http://www.google.com/preferences?hl=en
            http://www.google.com/language_tools?hl=en
            http://www.google.com/ads/
            http://www.google.com/services/
            http://www.google.com/news/
            http://toolbar.google.com
            http://www.google.com/about.html
            http://www.google.com/mgyhp.html

getResponseCode

Returns: Integer

Returns the HTTP response code from the header of this HTMLPage.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getResponseCode
            -- the result is --
            200

getResponseMessage

Returns: String

Returns the HTTP response message from the header of this HTMLPage.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getResponseMessage
            -- the result is --
            OK

static setFollowRedirects (Logical $bol)

Parameters:

$bol : Boolean expression.

Returns:

Sets whether HTTP redirects (responses with a 3xx status code) should be followed automatically by the connection.

            $page = HTMLPage.create("http://www.google.com")
            print $page.isFollowRedirects
            $page.setFollowRedirects(FALSE)
            print $page.isFollowRedirects
            -- the result is --
            true
            false

static isFollowRedirects

Returns: Logical

Returns a boolean indicating whether or not HTTP redirects (3xx) should be automatically followed.

            $page = HTMLPage.create("http://www.google.com")
            print $page.isFollowRedirects
            $page.setFollowRedirects(FALSE)
            print $page.isFollowRedirects
            -- the result is --
            true
            false