mn8 Language Reference | Index    

HTMLPage


Description

HTMLPage is the concept returned by all functions (FROM) whose main purpose is document retrieval, when the retrieved document is an HTML page. The content of the HTML page is available as a Stream stored in the content element of the concept; the concept does not alter the source of the HTML page. One or more pieces of the page can be extracted with regular expressions through the select method provided by the String concept. Once the desired HTML pieces are extracted, they can be transformed into a Series of Tags through the Tag/getTags method. The concept also provides its own getTags method, which returns the content of the whole page directly as a Series of tags.

Usage

        $page = HTMLPage.create("http://www.nolimits.ro")
        PRINT $page@url
            http://www.nolimits.ro

        PRINT $page/getLinks
            http://www.nolimits.ro/corpinfo_en.shtml
            http://www.nolimits.ro/documents/index.shtml
            http://www.nolimits.ro/product_en.shtml
            http://www.nolimits.ro/sl_en.shtml
            http://www.nolimits.ro/news_en.shtml
            http://www.nolimits.ro/contact_en.shtml
            http://www.nolimits.ro/search_en.shtml
            http://www.nolimits.ro/forum/
            http://www.nolimits.ro/help_en.shtml
            http://www.nolimits.ro/ro/
            http://www.nolimits.ro/

This example shows how to log in to a web page using forms and cookies. (It assumes that you have already completed the needed form; if not, see the examples in HTMLForm.)

        $url = "http://192.168.1.22/oursite/"

        # Retrieve the HTMLPage from the URL and store it in $page.
        # This step is needed only to obtain and store the cookies.
        # It does not log you in!
        $page from $url + "index.php"

        # Saving cookies
        $cookies = $page.getCookies
        each $i in $cookies do [
          $i.storeCookie ]

        # creating a simplex expression
        $expr = Simplex.create( $url + "*" )

        # collecting the stored forms using the simplex
        # We need to use only the first form
        $form = HTMLForm.getStoredForms($expr)/1

        # Applying the form (with the POST method) logs the script in
        HTMLPage.create( $form )

        # Finally, we can retrieve the information from the URL; we are already logged in
        $page from $url + "index.php"

Version: 0.1
Authors: Remus Pereni (http://neuro.nolimits.ro)
Location:
Inherits: Concept

Attribute List

 @contentType TYPEOF String LABEL "contentType"
 @lastModified TYPEOF Integer LABEL "lastModified"
 @length TYPEOF Integer LABEL "length"
 @title TYPEOF integer LABEL "title"
 @url TYPEOF String LABEL "url"

Element List

 content TYPEOF String LABEL "content"

Constructor List

create (String $url)
create (HTMLForm $form)

Method List

Concept getContent
Series getCookies
Series getForms
Map getHeaders
Series getLinks
Series getTags
Series getTagsWithText
Series getURIs
Integer getResponseCode
String getResponseMessage
static setFollowRedirects (Logical $bol)
static Logical isFollowRedirects
Methods inherited from: Concept
cloneConcept, extendsConcept, fromXML, getAllInheritedConcepts, getConceptAttribute, getConceptAttributeField, getConceptAttributeFields, getConceptAttributes, getConceptConstructors, getConceptElement, getConceptElementField, getConceptElementFields, getConceptElements, getConceptLabel, getConceptMethod, getConceptMethods, getConceptOperators, getConceptType, getErrorHandler, getInheritedConcepts, getResourceURI, hasConceptAttribute, hasConceptElement, hasConceptMethod, hasPath, isHidden, loadContent, setConceptLabel, setErrorHandler, setHidden, setShowEmpty, showEmpty, toTXT, toXML, setResourceURI

Detailed Attribute Info

@contentType

Label: contentType
Type: String
Is Static: false
Is Hidden: false
Show Empty: true

Contains the content type of this page, that is, the value of the Content-Type header field.


@lastModified

Label: lastModified
Type: Integer
Is Static: false
Is Hidden: false
Show Empty: true

An Integer representing the time the page was last modified, measured in milliseconds since the epoch (00:00:00 GMT, January 1, 1970).
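
To make the unit concrete, the following Python sketch (Python, not mn8 code) converts a millisecond epoch value into a calendar date. The function name millis_to_utc is hypothetical; the sample value is the cookie expiry shown in the getCookies example of this reference, which is expressed in the same unit.

```python
from datetime import datetime, timedelta, timezone

# The epoch named in the attribute description: 00:00:00 GMT, January 1, 1970.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_utc(millis: int) -> datetime:
    """Convert milliseconds since the epoch to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


# Sample value taken from the getCookies example output (Expires field).
stamp = millis_to_utc(2147368447000)
print(stamp.isoformat())  # 2038-01-17T19:14:07+00:00
```

The result agrees with the expires date in the Set-Cookie header of the getHeaders example (Sun, 17-Jan-2038 19:14:07 GMT).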


@length

Label: length
Type: Integer
Is Static: false
Is Hidden: false
Show Empty: true

Contains the length of this page.


@title

Label: title
Type: integer
Is Static: false
Is Hidden: false
Show Empty: true

Contains the title of this document, extracted from between the <title>...</title> tags of the HTML page.
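
The extraction described above can be pictured with a small Python sketch (Python, not mn8 code). The regular expression, the extract_title name, and the choice to return an empty string when no title is present are all assumptions made for illustration; the reference does not say how mn8 handles these cases.

```python
import re

# Match the text between <title> and </title>, ignoring case and line breaks.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str) -> str:
    """Return the page title, or an empty string if there is none (assumption)."""
    match = TITLE_RE.search(html)
    if match is None:
        return ""
    # Collapse whitespace so a title split across lines reads as one line.
    return " ".join(match.group(1).split())


sample = "<html><head><TITLE>\n  Google\n</TITLE></head><body></body></html>"
print(extract_title(sample))  # Google
```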


@url

Label: url
Type: String
Is Static: false
Is Hidden: false
Show Empty: true

Contains the URL of the document represented by this HTMLPage concept.


Detailed Element Info

content

Label: content
Type: String
Is Static: false
Is Hidden: false
Is Multi: false
Show Empty: true

Contains a stream with the actual page content. This is the primary way the content of the page is kept. The reason is that regular expressions can be applied to the stream to extract the relevant pieces of the page, and the selected pieces can then be passed to a Tag constructor for further processing.


Detailed Constructor Info

create (String $url)
Parameters:
$url : The URL from which this page will be constructed.
Exceptions:
badURLException (Error) : If the URL is not valid.
IOException (Error) : If an I/O exception occurs.
httpOperationException (Error) : If the HTML page header or page content cannot be retrieved.

Constructor which produces an HTMLPage concept from the URL given as parameter. The concept produced in this way will have no headers.

create (HTMLForm $form)
Parameters:
$form : The HTMLForm from which this page will be constructed.
Exceptions:
badURLException (Error) : If the URL is not valid.
httpOperationException (Error) : If the HTML page header or page content cannot be retrieved.
IOException (Error) : If an I/O exception occurs.

Constructor which produces an HTMLPage concept from the HTMLForm given as parameter. The concept produced in this way will have no headers.


Detailed Method Info

getContent
Returns: Concept
Exceptions:
httpOperationException (Error) : If the HTML page content cannot be retrieved.

Returns a String or a ByteArray representing the content of this page.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getContent
            -- the result is --
            <html><head><META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">
            <title>Google</title><style>
            <!-- body,td,a,p,.h{font-family:arial,sans-serif;} .h{font-size: 20px;} .h{color:} 
            .q{text-decoration:none; color:#0000cc;}
            //--></style>
            ...
            </html>
            

getCookies
Returns: Series

Returns a Series of Cookie concepts, one for each cookie directive found in the HTTP header of this page.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getCookies
            -- the result is --
            Name   : PREF
            Value  : ID=01ea0cb64090ca3a:TM=1028290720:LM=1028290720:S=MnaQWUJlyfw
            Domain : .google.com
            Path   : /
            Expires: 2147368447000
            Secure : FALSE
            

getForms
Returns: Series

Returns a Series containing HTMLForm concepts of all the HTML forms found in this particular page.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getForms
            -- the result is --
            HTMLForm
            method: GET     url: http://www.google.com/search       name: f id: -1309137107
            hl en
            ie ISO-8859-1
            q 
            btnG Google Search
            btnI I'm Feeling Lucky
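
As a rough picture of what collecting forms from a page involves, the Python sketch below (Python, not mn8 code) gathers each form's method, action and named input fields using the standard html.parser module. The FormCollector class, its dictionary layout, and the default of GET for a missing method attribute are assumptions for illustration and say nothing about how HTMLForm is implemented.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect forms and their named <input> fields from an HTML string."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form being filled in, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "method": (attrs.get("method") or "GET").upper(),
                "action": attrs.get("action") or "",
                "name": attrs.get("name") or "",
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            name = attrs.get("name")
            if name:
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


sample = (
    '<form action="/search" name="f">'
    '<input type="hidden" name="hl" value="en">'
    '<input name="q" value="">'
    '<input type="submit" name="btnG" value="Google Search">'
    "</form>"
)
collector = FormCollector()
collector.feed(sample)
form = collector.forms[0]
print(form["method"], form["action"], form["fields"])
```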
            

getHeaders
Returns: Map

Returns a Map containing all the HTTP header key/value pairs, as returned by the server which served this page.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getHeaders
            -- the result is --
            Map
            Content-Length 3152
            Server GWS/2.0
            Date Fri, 02 Aug 2002 12:26:14 GMT
            Content-Type text/html
            Cache-control private
            Set-Cookie PREF=ID=2b87fa35752aa236:TM=1028291174:LM=1028291174:S=mGb7J7tJ95Q; 
            domain=.google.com; path=/; expires=Sun, 17-Jan-2038 19:14:07 GMT
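
The Set-Cookie value in the output above carries the same fields that the getCookies example prints (name, value, domain, path, expires). The Python sketch below (Python, not mn8 code) splits such a header value into those parts. The parse_set_cookie function is hypothetical and handles only the simple semicolon-separated form shown here, not quoted values or other edge cases.

```python
def parse_set_cookie(header: str):
    """Split a Set-Cookie value into (name, value, attributes)."""
    parts = [p.strip() for p in header.split(";") if p.strip()]
    # The first part is name=value; only the first "=" separates the two,
    # because the value itself may contain "=" characters.
    name, _, value = parts[0].partition("=")
    attributes = {}
    for part in parts[1:]:
        key, _, val = part.partition("=")
        attributes[key.strip().lower()] = val.strip()
    return name, value, attributes


header = (
    "PREF=ID=2b87fa35752aa236:TM=1028291174:LM=1028291174:S=mGb7J7tJ95Q; "
    "domain=.google.com; path=/; expires=Sun, 17-Jan-2038 19:14:07 GMT"
)
name, value, attributes = parse_set_cookie(header)
print(name)                   # PREF
print(attributes["domain"])   # .google.com
print(attributes["expires"])  # Sun, 17-Jan-2038 19:14:07 GMT
```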
            

getLinks
Returns: Series

Returns a Series of Elements containing all the links found in the page. Relative links are returned too, but in their absolute form (with the URL of the page prepended).

            $page = HTMLPage.create("http://www.google.com")
            print $page.getLinks
            -- the result is --
            http://www.google.com/imghp?hl=en&ie=UTF-8
            http://www.google.com/grphp?hl=en&ie=UTF-8
            http://www.google.com/dirhp?hl=en&ie=UTF-8
            http://www.google.com/advanced_search?hl=en
            http://www.google.com/preferences?hl=en
            http://www.google.com/language_tools?hl=en
            http://www.google.com/ads/
            http://www.google.com/services/
            http://www.google.com/news/
            http://toolbar.google.com
            http://www.google.com/about.html
            http://www.google.com/mgyhp.html
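
How relative links become absolute can be illustrated with a short Python sketch (Python, not mn8 code). It uses the standard urljoin function, which applies the usual URL resolution rules; that may differ in corner cases from simply prepending the page URL as described above. The LinkCollector class and get_links function are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Absolute links pass through unchanged; relative ones are resolved.
            self.links.append(urljoin(self.base_url, href))


def get_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = (
    '<a href="/imghp?hl=en">Images</a> '
    '<a href="about.html">About</a> '
    '<a href="http://toolbar.google.com">Toolbar</a>'
)
for link in get_links(sample, "http://www.google.com/"):
    print(link)
```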
            

getTags
Returns: Series

Returns a Series containing the Tags and Strings that result from processing the content of this HTML page.

The rules are:
 * everything opened by the < symbol and closed by the > symbol is transformed into a Tag type of concept.
 * all strings outside tags are transformed into the String type of concept.
 * the CR LF, CR and LF symbols cause the parser to create a new String type concept.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getTags
            -- the result is --
            <html>
            <head>
            <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">
            <title>
            </title>
            <style>
            ...
            </html>
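
The three rules stated for getTags can be sketched in Python (Python, not mn8 code) as a small tokenizer. The function name split_tags_and_strings and the (kind, text) tuples are hypothetical. Two behaviors are assumptions not stated in this reference: strings that contain only whitespace are dropped, and a stray < with no closing > is skipped.

```python
import re

# Either a tag (from "<" to the next ">") or a run of text outside tags.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")
# Any of the three line-break forms starts a new String.
LINE_BREAK_RE = re.compile(r"\r\n|\r|\n")


def split_tags_and_strings(html: str):
    """Return a list of ("Tag", text) and ("String", text) pairs."""
    result = []
    for match in TOKEN_RE.finditer(html):
        token = match.group(0)
        if token.startswith("<"):
            result.append(("Tag", token))
            continue
        for line in LINE_BREAK_RE.split(token):
            text = line.strip()
            if text:  # assumption: whitespace-only strings are dropped
                result.append(("String", text))
    return result


for kind, text in split_tags_and_strings("<title>Google</title>\r\n<p>a\nb</p>"):
    print(kind, text)
```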
            

getTagsWithText
Returns: Series

Returns a Series containing the tags, together with their text, that result from processing the content of this HTMLPage.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getTagsWithText
            -- the result is --
            <html>
            <head>
            <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">
            <title>
            Google
            </title>
            <style>
            ...
            </html>
            

getURIs
Returns: Series

Returns a Series with all the URIs contained in this HTMLPage.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getURIs
            -- the result is --
            http://www.google.com/imghp?hl=en&ie=UTF-8
            http://www.google.com/grphp?hl=en&ie=UTF-8
            http://www.google.com/dirhp?hl=en&ie=UTF-8
            http://www.google.com/advanced_search?hl=en
            http://www.google.com/preferences?hl=en
            http://www.google.com/language_tools?hl=en
            http://www.google.com/ads/
            http://www.google.com/services/
            http://www.google.com/news/
            http://toolbar.google.com
            http://www.google.com/about.html
            http://www.google.com/mgyhp.html
            

getResponseCode
Returns: Integer

Returns the HTTP response code from the header of this HTMLPage.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getResponseCode
            -- the result is --
            200
            

getResponseMessage
Returns: String

Returns the HTTP response message from the header of this HTMLPage.

            $page = HTMLPage.create("http://www.google.com")
            print $page.getResponseMessage
            -- the result is --
            OK
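
The relation between a response code, its message, and the 3xx redirect class can be shown with a short Python sketch (Python, not mn8 code). It relies on the standard http.HTTPStatus table; the describe function is a hypothetical helper, and the messages a real server sends may differ from the standard phrases.

```python
from http import HTTPStatus


def describe(code: int):
    """Return (standard reason phrase, whether the code is a 3xx redirect)."""
    status = HTTPStatus(code)
    return status.phrase, 300 <= code < 400


print(describe(200))  # ('OK', False)
print(describe(302))  # ('Found', True)
```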
            

static setFollowRedirects (Logical $bol)
Parameters:
$bol : Boolean expression.
Returns:

Sets whether HTTP redirects (responses with a 3xx response code) should be followed automatically by the connection.

            $page = HTMLPage.create("http://www.google.com")
            print $page.isFollowRedirects
            $page.setFollowRedirects(FALSE)
            print $page.isFollowRedirects
            -- the result is --
            true
            false
            

static isFollowRedirects
Returns: Logical

Returns a boolean indicating whether or not HTTP redirects (3xx) should be automatically followed.

            $page = HTMLPage.create("http://www.google.com")
            print $page.isFollowRedirects
            $page.setFollowRedirects(FALSE)
            print $page.isFollowRedirects
            -- the result is --
            true
            false
            
