A blog about web programming experiences, techniques and ideas for ASP, PHP, ASP.NET and JavaScript.

XML, Xpath and DOM

April 10th, 2008 by Roy L. `dshiznitz Besiera

Once I was asked by my boss to automatically detect feeds on a given website URL. Of course you can’t say “no” to your boss, so I said “right away, boss :p”. I have always understood that when your boss asks you to do something, he does not need to know whether the task is hard for you or not.

My idea is to read a website’s HTML and locate the link tags, hoping that they point to RSS or Atom feeds, then parse the XML feed and display it to the user as human-readable data. Retrieving the remote page contents is easy in PHP using the cURL library, and in ASP.NET using the HttpWebRequest class.

Step 1. Reading a Website’s Contents

Fig. 1.a Using CURL to retrieve remote webpage contents in PHP.

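<?php
// Fetch the remote page and keep its HTML in a string for the next step.
$ch = curl_init("http://genusproject.com/");
// Return the body from curl_exec() instead of writing it out;
// with CURLOPT_FILE, curl_exec() only returns true or false.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, 0); // keep response headers out of the body
$html = curl_exec($ch);
if ($html === false) {
  die("Error getting document. " . curl_error($ch));
}
curl_close($ch);
// Optional: keep a copy of the page on disk, as before.
file_put_contents("example_homepage.txt", $html);
// To be continued below
?>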

Fig. 1.b Using the HttpWebRequest class (in the System.Net namespace) in C#
You can use code-behind or inline scripting, as long as you import the System.Net namespace (plus System.IO for StreamReader).

HttpWebRequest request = null;
string url="http://genusproject.com";
request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
request.ProtocolVersion = HttpVersion.Version11;
request.AllowAutoRedirect = false;
request.Accept = "*/*";
// Emulate the Request Agent
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; " +
    "Windows NT 5.1; SV1; .NET CLR 2.0.50727)";
request.Headers.Add("Accept-Language", "en-us");
request.KeepAlive = true;
StreamReader responseStream = null;
HttpWebResponse webResponse = null;
string webResponseStream = string.Empty;
try{
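  // Use the variables declared above (the original mixed in undeclared names).
  webResponse = (HttpWebResponse)request.GetResponse();
  responseStream = new StreamReader(webResponse.GetResponseStream());
  webResponseStream = responseStream.ReadToEnd();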
}catch(Exception e){
  Response.Write("Error Getting Document. " + e.Message);
  Response.End();
}

Step 2. Finding the “link” Tags
Fortunately, XPath can do this, so there is no need for regular expressions to locate the tags.
Fig 2.a Using XPath to get all the link tags of an HTML document in PHP

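<?php
// Continuation from the PHP code above
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = $xpath->evaluate("/html/head//link");
$arrURL = array();
for ($i = 0; $i < $links->length; $i++) {
  // Use a separate variable so the node list itself is not overwritten.
  $link = $links->item($i);
  // Only rel="alternate" links can point to a feed.
  if ($link->getAttribute('rel') != "alternate") {
    continue;
  }
  $attrib = $link->getAttribute('type');
  if ($attrib == "application/rss+xml" ||
          $attrib == "application/atom+xml") {
    $arrURL[] = $link->getAttribute('href');
  }
}
?>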

Currently I have no idea how to directly transform an HTML stream into a DOM object in ASP.NET. devcomponents offers a utility class for this called “HTML Document Class Library”, which loads an HTML file or stream and transforms it into a DOM object for XPath queries. If you know a native way to do this, please post a comment.

The “@” symbol suppresses the warning messages that can appear when the HTML string is turned into a full DOM object. $links holds the resulting node list. We only want the link tags whose “rel” attribute has the value “alternate”. When one is found, we check whether its “type” attribute is “application/rss+xml” or “application/atom+xml”. These two formats are the most popular ones in use on websites today. A website usually has two or more links to feeds with the same content, so we will just pick one of the links collected in the $arrURL variable.
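
The rel and type checks described above can also be folded into the XPath expression itself. The following is only a sketch: the inline HTML sample and the helper name findFeedLinks are made up for illustration, and it assumes the attribute values are written in lowercase, as they usually are.

```php
<?php
// Hypothetical helper: returns the href of every feed <link> in an HTML string.
function findFeedLinks($html)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // "@" hides warnings about sloppy markup
    $xpath = new DOMXPath($dom);

    // Filter on rel and type inside the expression instead of in PHP.
    $query = "//link[@rel='alternate' and " .
             "(@type='application/rss+xml' or @type='application/atom+xml')]";

    $urls = array();
    foreach ($xpath->query($query) as $link) {
        $urls[] = $link->getAttribute('href');
    }
    return $urls;
}

// Made-up sample page: two feeds plus a stylesheet that must be ignored.
$sample = '<html><head><title>Demo</title>'
        . '<link rel="stylesheet" type="text/css" href="/style.css">'
        . '<link rel="alternate" type="application/rss+xml" href="/feed/rss">'
        . '<link rel="alternate" type="application/atom+xml" href="/feed/atom">'
        . '</head><body></body></html>';

$arrURL = findFeedLinks($sample);
print_r($arrURL);
```

With a real page you would pass in the $html string fetched in Step 1 instead of $sample. Note that the href values come back exactly as written in the page, so relative links still have to be resolved against the site URL before they can be fetched.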

Step 3. Accessing the Node Set and Displaying the Data

Fig 3.a Loading the XML feed and displaying it in human-readable format.

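<?php
// Continuation: pick the first feed URL collected in step 2.
$url = $arrURL[0];
// Load the feed into its own DOM object (not the HTML page's $dom).
$feed = new DOMDocument();
if (@$feed->load($url)) {
  // From here on you can use XPath to query the DOM object.
  $xpath = new DOMXPath($feed);
  // This path matches RSS feeds, which keep their title under <channel>.
  $titles = $xpath->query("//channel/title");
  foreach ($titles as $title) {
    echo $title->nodeValue . "\n";
  }
}
?>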

If you’re new to XPath, you should read about it. It is the foundation of all advanced XML manipulation and transformation. Once you know a bit of XPath, you can move on to XQuery, XSL and maybe XSLT. Again, I’m no genius at all of this stuff. Sometimes saying “yes boss, I will do that” can benefit you. You will be *forced* (:p) to do some research, and the deadline will keep your butt on the edge of your seat. But after that, you can’t deny yourself the feeling of accomplishment. OK now, to learn XPath expressions you can go here, my favorite place for learning.
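
For a first taste of XPath, here is a small sketch that runs a few expressions against a made-up RSS document held in a string, so no network access is needed. The feed content and the variable names are invented for illustration only.

```php
<?php
// Made-up RSS 2.0 feed used only for this example.
$rss = '<rss version="2.0"><channel>'
     . '<title>Demo Feed</title>'
     . '<item><title>First post</title><link>http://example.com/1</link></item>'
     . '<item><title>Second post</title><link>http://example.com/2</link></item>'
     . '</channel></rss>';

$feed = new DOMDocument();
$feed->loadXML($rss);
$xpath = new DOMXPath($feed);

// Absolute path: the title of the channel itself.
$channelTitle = $xpath->evaluate("string(/rss/channel/title)");

// "//" searches at any depth: every item title in the feed.
$itemTitles = array();
foreach ($xpath->query("//item/title") as $node) {
    $itemTitles[] = $node->nodeValue;
}

// A predicate in square brackets selects by position: the link of the second item.
$secondLink = $xpath->evaluate("string(//item[2]/link)");

// count() is one of XPath's built-in functions; evaluate() returns a number for it.
$itemCount = (int) $xpath->evaluate("count(//item)");

echo $channelTitle . "\n";
echo implode(", ", $itemTitles) . "\n";
echo $secondLink . "\n";
echo $itemCount . "\n";
```

query() always returns a node list, while evaluate() can also return a plain string or number, which is handy for expressions wrapped in string() or count().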

In ASP.NET, it’s very simple to do that too. We can use the classes XPathDocument, XPathNavigator and XPathNodeIterator, from the namespaces System.Xml and System.Xml.XPath. We could actually use System.Xml alone for XML manipulation or for accessing the nodes, but since the point here is the usage of XPath, we will use that instead.

using System.Xml;
using System.Xml.XPath;
/*
Once we know the URL of a feed, we will get it using
httpWebRequest Class again and pass the stream as
constructor parameter on XPathDocument class.
Or you can use a file somewhere on the server
*/
XPathDocument Doc = new XPathDocument("feedurlstream");
XPathNavigator navigator = Doc.CreateNavigator();
XPathNodeIterator iterator = navigator.Select("//channel/title");
while (iterator.MoveNext()){
  Response.Write(iterator.Current.Name);
  Response.Write(iterator.Current.Value);
}

Now that we know how to access the node set, we can format it and display it to our users. I hope this is enough to get you excited about XPath. I am still familiarizing myself with all the expressions and built-in functions of XPath. I want to learn, and I can’t stop learning.
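
As a rough illustration of that last formatting step, the sketch below turns the items of a feed into an HTML list in PHP. The sample feed and the helper name renderFeedItems are hypothetical; a real feed would be loaded from one of the URLs found in Step 2.

```php
<?php
// Hypothetical helper: builds an HTML list of links from the items of an RSS feed.
function renderFeedItems($rssXml)
{
    $feed = new DOMDocument();
    if (!@$feed->loadXML($rssXml)) {
        return ''; // not well-formed XML, nothing to show
    }
    $xpath = new DOMXPath($feed);

    $out = "<ul>\n";
    foreach ($xpath->query("//channel/item") as $item) {
        // Passing $item as the context node makes the paths relative to it.
        $title = $xpath->evaluate("string(title)", $item);
        $link  = $xpath->evaluate("string(link)", $item);
        // Escape feed text before putting it into the page.
        $out .= '  <li><a href="' . htmlspecialchars($link) . '">'
              . htmlspecialchars($title) . "</a></li>\n";
    }
    return $out . "</ul>\n";
}

// Made-up feed with a single item.
$sampleFeed = '<rss version="2.0"><channel><title>Demo Feed</title>'
            . '<item><title>Tips &amp; tricks</title><link>http://example.com/1</link></item>'
            . '</channel></rss>';

$listHtml = renderFeedItems($sampleFeed);
echo $listHtml;
```

Escaping with htmlspecialchars() matters here because feed titles come from someone else’s site and should not be trusted as ready-made HTML.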

Posted in C#/ASP.NET Programming, PHP Programming
