Blog

2008 03 posts (4)


1. PHP 5 features: Exceptions

2008-03-25 10:18:59 by Martynas Jusevičius

A useful new feature in PHP 5 is exception handling via the try/throw/catch paradigm.

An exception can be thrown and caught. If an exception is thrown inside a try block, the remaining statements in that block are not executed, and the exception is handled by the first matching catch block. Each try must have at least one corresponding catch (PHP 5.5 and later also accept a finally block instead), and multiple catch blocks can be defined to handle different types of exceptions. Normal execution continues after the last catch block.
The built-in Exception class can be extended to implement user-defined exception classes.
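
As a minimal sketch of these mechanics (not code from the DIY Framework), the following shows a user-defined exception hierarchy and a try with two catch blocks. The class names, the saveTitle() function and its parameters are hypothetical and serve only as illustration:

```php
<?php
// Hypothetical user-defined exceptions extending the built-in Exception class.
class ValidationException extends Exception {}
class NoPermissionException extends Exception {}

// Hypothetical function: $isGuest and $title stand in for real request data.
function saveTitle($isGuest, $title)
{
	$result = null;

	try
	{
		if ($isGuest) throw new NoPermissionException("Login required");
		if (trim($title) === "") throw new ValidationException("Title is empty");

		$result = "saved"; // reached only when nothing was thrown above
	}
	catch (NoPermissionException $e) // the first matching catch block handles the exception
	{
		$result = "redirect: ".$e->getMessage();
	}
	catch (ValidationException $e)
	{
		$result = "error: ".$e->getMessage();
	}

	return $result; // normal execution continues after the last catch block
}

print saveTitle(false, "Hello")."\n"; // prints "saved"
print saveTitle(true, "Hello")."\n";  // prints "redirect: Login required"
print saveTitle(false, "  ")."\n";    // prints "error: Title is empty"
```

Removing the try and both catch blocks leaves only the path that runs when no error occurs.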

Exceptions are a powerful construct that can improve code but are also easily abused. The PHP5 Exception Use Guidelines define one basic rule of thumb:

Exceptions should never be used as normal program flow. If removing all exception handling logic (try-catch statements) from the program, the remaining code should represent the "One True Path" -- the flow that would be executed in the absence of errors.
This requirement is equivalent to requiring that exceptions be thrown only on error conditions, and never in normal program states.

When developing with the DIY Framework, we mostly use exceptions to indicate form validation errors or non-permitted access. Here is an example from a SettingsResource class that handles a user settings page, which should be accessible only when the user is logged on:

public function doPost(Request $request, Response $response)
{
	$view = null;
	$parent = parent::doPost($request, $response);

	if ($parent != null) $view = $parent;
	else
	{
		try
		{
			if ($request->getSession()->getAttribute("user") instanceof GuestUser) throw new NoPermissionException();

			$view = new SettingsGeneralView($this);

			// ... some other code to save submitted settings
		}
		catch (NoPermissionException $e)
		{
			$response->sendRedirect(FrontEndMapping::getHost().LoginResource::getInstance()->getURI());
		}
	}

	return $view;
}

If the user in session is a guest user (not logged on), an exception is thrown, and a redirect to the login page follows.


2. Joining documents in XSLT

2008-03-18 11:26:42 by Martynas Jusevičius

Sometimes an XSLT stylesheet needs access to several XML documents. In many cases they are inter-related, for example through @id/@idref attributes, and need to be joined to produce the desired result.
We have already written about how to pass side documents to XSLT using the XSLTView and URIResolver classes from the DIY Framework. Now we are going to show how the actual XSLT join works.

Let us say we have an array of product categories with IDs, and an array of products, where every product has a category ID. We want to display a table of products, with the product name and the product category name on each row. Since the product array only contains category IDs (references to categories), we need to join the two arrays on category IDs to get the corresponding category names.

First we pass both serialized arrays as side documents that can be accessed in XSLT as document('arg://products') and document('arg://product-categories'). Then we start building the table by applying templates on products:

<table>
	<xsl:apply-templates select="document('arg://products')/Array/Product"/>
</table>

To complete it, we need to define a template that produces table rows from products. This is where the actual join on category ID attributes occurs:

<xsl:template match="Product">
	<tr>
		<td>
			<xsl:value-of select="Name"/>
		</td>
		<td>
			<xsl:for-each select="document('arg://product-categories')/Array/Category[@id = current()/@categoryID]">
				<xsl:value-of select="Name"/>
			</xsl:for-each>
		</td>
	</tr>
</xsl:template>

Notice <xsl:for-each>, which is normally used to iterate over a node set. Here we simply use it to change the context node to the Category element from the document('arg://product-categories') document whose @id equals the @categoryID of the Product currently being processed by the row template. Therefore the second <xsl:value-of select="Name"/> outputs the name of the category, not the product.
Another important construct is the current() function. Here it is used in a predicate expression and refers to the current Product node, while the context node in that expression is Category.

Sounds complicated? It makes a lot of sense once you get used to it.
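
For readers who think more easily in procedural terms, here is a rough PHP sketch of the same join done with the DOM extension instead of XSLT. It is not how the stylesheet works internally. The sample XML and the joinProducts() helper are assumptions for illustration; only the element and attribute names are taken from the XPath expressions above:

```php
<?php
// Assumed sample data shaped like the two side documents.
$categoriesXml = '<Array><Category id="1"><Name>Books</Name></Category><Category id="2"><Name>Music</Name></Category></Array>';
$productsXml = '<Array><Product categoryID="2"><Name>Guitar strings</Name></Product><Product categoryID="1"><Name>Novel</Name></Product></Array>';

// Hypothetical helper: returns one array(product name, category name) pair per product.
function joinProducts($productsXml, $categoriesXml)
{
	$products = new DOMDocument();
	$products->loadXML($productsXml);
	$categories = new DOMDocument();
	$categories->loadXML($categoriesXml);

	// index category names by their @id
	$names = array();
	foreach ($categories->getElementsByTagName("Category") as $category)
	{
		$names[$category->getAttribute("id")] = $category->getElementsByTagName("Name")->item(0)->textContent;
	}

	$rows = array();
	foreach ($products->getElementsByTagName("Product") as $product)
	{
		$categoryID = $product->getAttribute("categoryID"); // plays the role of current()/@categoryID
		$rows[] = array(
			$product->getElementsByTagName("Name")->item(0)->textContent,
			isset($names[$categoryID]) ? $names[$categoryID] : "" // empty when no category matches
		);
	}

	return $rows;
}

foreach (joinProducts($productsXml, $categoriesXml) as $row) print implode(" | ", $row)."\n";
```

The XSLT version expresses the lookup declaratively in a single predicate, while this sketch has to build the index by hand.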


3. Scraping HTML with DOM

2008-03-11 11:28:45 by Martynas Jusevičius

HTML scraping is used to extract structured data from webpages intended for human readers. It is a common way to work around old school websites which do not provide data feeds in machine-readable formats such as RDF or at least custom XML.
Scraping can be implemented in several ways. Regular expressions (regexp) are probably the most widely used technique. They employ specific syntax rules to define patterns that have to be matched in a string.

However, the regexp approach has several issues. Because HTML source is complex, pattern strings soon become lengthy. The pattern syntax is not trivial and may differ between systems. That leads to complicated and non-intuitive code, where the program logic (such as conditional cases) is hard-coded in the pattern and therefore not obvious.

Moreover, regular expressions operate on a generic string level and have no knowledge of HTML tags and the tree-shaped document model that they form. It means that special care has to be taken of insignificant whitespace (such as line breaks), control characters have to be escaped, etc. In situations where tree-like structures need to be scraped (for example, nested comment or e-mail threads), sequential matching makes it difficult to figure out the item's position in the tree, i.e. to relate the item to its parent, if there is one. Recursion would be an appropriate technique in this case, but regular expressions aren't really meant for recursive solutions.

Another approach to scraping is using the Document Object Model (DOM). It is a standard object model for representing HTML or XML. DOM is usually associated with client-side scripting, but can be used equally well on the server side: support is provided on many platforms, including Java libraries and a PHP extension. Fortunately, PHP's DOM extension is able to parse even invalid HTML, which is what most webpages contain.

Here is a simple PHP scraper which turns a table with student information into FOAF data:

class StudentScraper
{
	private $doc = null;

	public function __construct($url)
	{
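		libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings on invalid HTML
		$htmlString = file_get_contents($url); // load HTML file as a string
		$this->doc = new DOMDocument();
		$this->doc->loadHTML($htmlString); // load string into DOM document object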
	}

	public function process() // return RDF/XML string
	{
		$xml = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\">\n";
		$table = $this->doc->getElementsByTagName("table")->item(0); // table element

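		foreach ($table->getElementsByTagName("tr") as $tr) // iterate rows; childNodes would also include whitespace text nodes
		{
			$cells = $tr->getElementsByTagName("td");
			if ($cells->length < 2) continue; // skip header or malformed rows

			$name = trim($cells->item(0)->textContent); // content of first cell
			$eMail = trim($cells->item(1)->textContent); // content of second cell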

			$xml .= "	<foaf:Person>\n"; // construct RDF/XML
			$xml .= "		<foaf:name>".htmlspecialchars($name)."</foaf:name>\n";
			$xml .= "		<foaf:mbox rdf:resource=\"mailto:".htmlspecialchars($eMail)."\"/>\n";
			$xml .= "	</foaf:Person>\n";

		}
		$xml .= "</rdf:RDF>";

		return $xml;
	}

}

$scraper = new StudentScraper("http://some.university.com/students.htm"); // run scraper
print $scraper->process();

Here we load the contents of the HTML file into a DOM document. Student data is found in a table, which contains a row for each student. Each row contains 2 cells: one for the name, and one for the e-mail address. We iterate through the rows and create one foaf:Person entry per student. When run, the scraper returns an RDF/XML string with FOAF data.
Constructing XML as a plain string is an error-prone practice and should also be done using DOM, but we leave it here for simplicity and clarity.
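
For comparison, here is a minimal sketch of how the same RDF/XML could be built with DOM rather than string concatenation. The buildFoaf() helper and its input format (an array of name and e-mail pairs) are assumptions made for this example and are not part of the scraper above:

```php
<?php
// Hypothetical helper: builds FOAF RDF/XML from array(name, e-mail) pairs using DOM.
function buildFoaf(array $students)
{
	$rdfNs = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
	$foafNs = "http://xmlns.com/foaf/0.1/";

	$doc = new DOMDocument("1.0", "UTF-8");
	$doc->formatOutput = true;

	$root = $doc->createElementNS($rdfNs, "rdf:RDF");
	$doc->appendChild($root);
	// declare the foaf prefix once on the root element
	$root->setAttributeNS("http://www.w3.org/2000/xmlns/", "xmlns:foaf", $foafNs);

	foreach ($students as $student)
	{
		list($name, $eMail) = $student;

		$person = $doc->createElementNS($foafNs, "foaf:Person");
		$root->appendChild($person);

		$nameNode = $doc->createElementNS($foafNs, "foaf:name");
		$nameNode->appendChild($doc->createTextNode($name)); // special characters are escaped on serialization
		$person->appendChild($nameNode);

		$mbox = $doc->createElementNS($foafNs, "foaf:mbox");
		$mbox->setAttributeNS($rdfNs, "rdf:resource", "mailto:".$eMail);
		$person->appendChild($mbox);
	}

	return $doc->saveXML();
}

print buildFoaf(array(array("Tom & Jerry", "tom@example.org")));
```

With this approach escaping and namespace handling are left to the DOM serializer, so no htmlspecialchars() calls are needed.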


4. RESTful URIs

2008-03-03 11:17:29 by Martynas Jusevičius

The RESTful Web Services book provides three basic rules for URI design, born of collective experience:

  1. Use path variables to encode hierarchy: /parent/child
  2. Put punctuation characters in path variables to avoid implying hierarchy where none exists: /parent/child1;child2
  3. Use query variables to imply inputs into an algorithm, for example: /search?q=jellyfish&start=20

We have arrived at much the same principles in our own designs. The second rule, however, seems to be rarely used in practice.
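
To make the three rules concrete, here is a small PHP sketch that takes a URI apart along those lines. The describeUri() helper is hypothetical and not part of the DIY Framework; it only uses the standard parse_url() and parse_str() functions:

```php
<?php
// Hypothetical helper: splits a request URI according to the three rules.
function describeUri($uri)
{
	$parts = parse_url($uri);
	$path = isset($parts["path"]) ? $parts["path"] : "/";

	// rule 1: slashes in the path encode hierarchy
	$segments = array();
	foreach (explode("/", trim($path, "/")) as $segment)
	{
		if ($segment === "") continue;
		// rule 2: punctuation separates items that are not hierarchical
		$segments[] = explode(";", $segment);
	}

	// rule 3: query variables are inputs into an algorithm
	$inputs = array();
	if (isset($parts["query"])) parse_str($parts["query"], $inputs);

	return array("segments" => $segments, "inputs" => $inputs);
}

print_r(describeUri("/parent/child1;child2"));
print_r(describeUri("/search?q=jellyfish&start=20"));
```

The first call yields two path levels, the second of which holds two siblings. The second call yields one path level plus the inputs q and start.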

The rules are tested and precise enough, but it is the definitions used in them that are not always clear. What constitutes a hierarchy? A strict class/subclass/instance tree is probably the most logical and intuitive form of it, for example Places/Clubs/VEGA. It should be used whenever it applies, but that is not always the case. Another good example is things that are embedded one into another, such as Countries/USA/Georgia/Atlanta.

Many hierarchical paths, however, break these two patterns or mix them. For example, Places/Clubs/VEGA/Pictures/ or user/john/tags/ do not represent a strict hierarchy. Pictures/ is neither an instance of Places/Clubs/VEGA/ nor embedded in it in a strict sense. Still, it is a kind of resource that should be given a RESTful URI.

Imagine that we would like to add pictures for every location in our location hierarchy. The pictures of USA would then be addressed as Countries/USA/Pictures/. This breaks our URI “embeddedness” hierarchy by adding some “metadata leaf nodes”, which are not true nodes (locations). When handling such a request, most Web applications (especially those based on URL rewriting) would not be able to easily tell that Pictures/ does not refer to one of the US states as Georgia/ does. Last.fm, for example, distinguishes between these two kinds of URIs by using a + prefix: music/Radiohead/In+Rainbows (album) vs. music/Radiohead/+events. With the DIY Framework that is not necessary, since each resource has an explicit type which does not depend on its URI.

Another example found in practice is path URIs used for pagination, e.g. user/john/tags/all or /posts/4/pages/2. Again, they break the strict hierarchy, and they also bring us to the notion of an algorithm in the third rule. In our eyes, pagination is much more an algorithm than a kind of hierarchy. One can argue philosophically about resources and how best to design them, but we see a list of posts as a single resource whose representation can be paginated using query parameters passed to an algorithm, rather than as a separate resource for every page. The algorithm can also take more inputs than the page number, e.g. the number of posts per page or the sort order, which would be hard to embed in the path. Therefore we use pagination URIs like Products/?offset=1200&limit=20.
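
Seen as an algorithm, pagination is little more than slicing a list by two inputs. The sketch below assumes the full list is already in memory; the paginate() helper, its default limit and its upper bound are hypothetical choices, not DIY Framework behavior:

```php
<?php
// Hypothetical helper: returns one page of $items for query parameters such as ?offset=1200&limit=20.
function paginate(array $items, array $query, $defaultLimit = 20, $maxLimit = 100)
{
	$offset = isset($query["offset"]) ? max(0, (int) $query["offset"]) : 0; // negative offsets fall back to 0
	$limit = isset($query["limit"]) ? (int) $query["limit"] : $defaultLimit;
	$limit = max(1, min($limit, $maxLimit)); // keep the page size within sane bounds

	return array_slice($items, $offset, $limit);
}

$posts = range(1, 50); // stand-in for a list of 50 posts
parse_str("offset=40&limit=20", $query);
print implode(",", paginate($posts, $query))."\n"; // prints 41,42,43,44,45,46,47,48,49,50
```

Further inputs such as a sort order would simply become additional query parameters, without touching the path.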

The bottom line is that the URI space is infinite, but in practice the design choices for logical RESTful URIs are often difficult and constrained.

