Blog

All posts (105)


41. Joining documents in XSLT

2008-03-18 11:26:42 by Martynas Jusevičius

Sometimes an XSLT stylesheet needs access to several XML documents. In many cases they are interrelated, for example through @id/@idref attributes, and need to be joined to produce the desired result.
We have already written about how to pass side documents to XSLT using the XSLTView and URIResolver classes from the DIY Framework. Now we are going to show how the actual XSLT join works.

Let us say we have an array of product categories with IDs, and an array of products, where every product has a category ID. We want to display a table of products, with the product name and the product category name on each row. Since the product array only contains category IDs (references to categories), we need to join the two arrays on category ID to get the corresponding category names.

First we pass both serialized arrays as side documents that can be accessed in XSLT as document('arg://products') and document('arg://product-categories'). Then we start building the table by applying templates on products:

<table>
	<xsl:apply-templates select="document('arg://products')/Array/Product"/>
</table>

To complete it, we need to define a template that produces table rows from products. This is where the actual join on category ID attributes occurs:

<xsl:template match="Product">
	<tr>
		<td>
			<xsl:value-of select="Name"/>
		</td>
		<td>
			<xsl:for-each select="document('arg://product-categories')/Array/Category[@id = current()/@categoryID]">
				<xsl:value-of select="Name"/>
			</xsl:for-each>
		</td>
	</tr>
</xsl:template>

Notice <xsl:for-each>, which is normally used to iterate over a node set. Here we simply use it to change the context node to the Category element in document('arg://product-categories') whose @id equals the @categoryID of the Product currently being processed by the row template. Therefore the second <xsl:value-of select="Name"/> outputs the name of the category, not the product.
Another important construct is the current() function. Here it is used in a predicate expression and refers to the current Product node, while the context node inside that predicate is Category.
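For readers who think in procedural terms, the same join can be sketched outside XSLT. The Python snippet below is only an illustration: the sample data and the join_products function are made up, and only the element and attribute names (Array, Product, Category, Name, @id, @categoryID) come from the stylesheet above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the element and attribute names
# follow the stylesheet, the values are invented for illustration.
PRODUCTS_XML = """
<Array>
  <Product id="1" categoryID="c2"><Name>Hammer</Name></Product>
  <Product id="2" categoryID="c1"><Name>Apple</Name></Product>
</Array>
"""
CATEGORIES_XML = """
<Array>
  <Category id="c1"><Name>Food</Name></Category>
  <Category id="c2"><Name>Tools</Name></Category>
</Array>
"""

def join_products(products_xml, categories_xml):
    """Return (product name, category name) pairs joined on category ID."""
    products = ET.fromstring(products_xml)
    categories = ET.fromstring(categories_xml)
    # Index categories by @id; the lookup below plays the role of the
    # predicate Category[@id = current()/@categoryID].
    names_by_id = {c.get("id"): c.findtext("Name")
                   for c in categories.findall("Category")}
    rows = []
    for product in products.findall("Product"):
        # A product without a matching category gets an empty cell,
        # just as <xsl:for-each> over an empty node set outputs nothing.
        category_name = names_by_id.get(product.get("categoryID"), "")
        rows.append((product.findtext("Name"), category_name))
    return rows

print(join_products(PRODUCTS_XML, CATEGORIES_XML))
```

The dictionary lookup stands in for the XPath predicate: for each Product, it finds the Category with the matching ID.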

Sounds complicated? It makes a lot of sense once you get used to it.


42. Scraping HTML with DOM

2008-03-11 11:28:45 by Martynas Jusevičius

HTML scraping is used to extract structured data from webpages intended for human readers. It is a common way to work around old-school websites that do not provide data feeds in machine-readable formats such as RDF, or at least custom XML.
Scraping can be implemented in several ways. Regular expressions (regexps) are probably the most widely used technique. They employ specific syntax rules to define patterns that have to be matched in a string.

However, the regexp approach has several issues. Because HTML source is complex, pattern strings soon become lengthy. The pattern syntax is not trivial and may differ between systems. That leads to complicated and non-intuitive code, where the program logic (such as conditional cases) is hard-coded in the pattern and therefore not obvious.

Moreover, regular expressions operate on a generic string level and have no knowledge of HTML tags and the tree-shaped document model that they form. This means that special care has to be taken with insignificant whitespace (such as line breaks), control characters have to be escaped, and so on. In situations where tree-like structures need to be scraped (for example, nested comment or e-mail threads), sequential matching makes it difficult to figure out an item's position in the tree, i.e. to relate the item to its parent, if there is one. Recursion would be an appropriate technique in this case, but regular expressions are not really meant for recursive solutions.

Another approach to scraping is to use the Document Object Model (DOM). It is a standard object model for representing HTML or XML. DOM is usually associated with client-side scripting, but it can be used equally well on the server side: it is supported on many platforms, including Java libraries and a PHP extension. Fortunately, PHP's DOM extension is able to parse even invalid HTML, which is what most real-world pages contain.

Here is a simple PHP scraper which turns a table with student information into FOAF data:

class StudentScraper
{
	private $doc = null;

	public function __construct($url)
	{
		$htmlString = file_get_contents($url); // load HTML file as a string
		$this->doc = new DOMDocument();
		libxml_use_internal_errors(true); // collect parse warnings for invalid HTML instead of printing them
		$this->doc->loadHTML($htmlString); // load string into DOM document object
	}

	public function process() // return RDF/XML string
	{
		$xml = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\">\n";
		$table = $this->doc->getElementsByTagName("table")->item(0); // first table element

		// select rows by tag name: childNodes would also return whitespace text nodes
		foreach ($table->getElementsByTagName("tr") as $tr) // iterate rows
		{
			$cells = $tr->getElementsByTagName("td");
			if ($cells->length < 2) continue; // skip header or incomplete rows

			$name = trim($cells->item(0)->textContent); // content of first cell
			$eMail = trim($cells->item(1)->textContent); // content of second cell

			$xml .= "	<foaf:Person>\n"; // construct RDF/XML
			$xml .= "		<foaf:name>".htmlspecialchars($name)."</foaf:name>\n";
			$xml .= "		<foaf:mbox rdf:resource=\"mailto:".htmlspecialchars($eMail)."\"/>\n";
			$xml .= "	</foaf:Person>\n";

		}
		$xml .= "</rdf:RDF>";

		return $xml;
	}

}

$scraper = new StudentScraper("http://some.university.com/students.htm"); // run scraper
print $scraper->process();

Here we load the contents of the HTML file into a DOM document. Student data is found in a table, which contains a row for each student. Each row contains two cells: one for the name and one for the e-mail address. We iterate through the rows and create one foaf:Person entry per student. When run, the scraper returns an RDF/XML string with FOAF data.
Constructing XML as a plain string is an error-prone practice and should also be done using DOM, but we leave it here for simplicity and clarity.
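To give an idea of what building the output through an object model instead of string concatenation might look like, here is a minimal sketch. It uses Python's ElementTree rather than PHP's DOM extension, and the build_foaf function and its input format are hypothetical. Only the two namespace URIs and the foaf:Person / foaf:name / foaf:mbox structure are taken from the scraper above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"

def build_foaf(students):
    """Build an RDF/XML string from (name, e-mail) pairs."""
    # Ask the serializer to use the conventional prefixes.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("foaf", FOAF_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for name, email in students:
        person = ET.SubElement(root, f"{{{FOAF_NS}}}Person")
        # Text and attribute values are escaped by the serializer,
        # so no manual escaping step is needed here.
        ET.SubElement(person, f"{{{FOAF_NS}}}name").text = name
        ET.SubElement(person, f"{{{FOAF_NS}}}mbox",
                      {f"{{{RDF_NS}}}resource": "mailto:" + email})
    return ET.tostring(root, encoding="unicode")

print(build_foaf([("Tom & Jerry", "tj@example.org")]))
```

The library handles well-formedness and escaping, so a name containing & or < cannot break the output.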


43. RESTful URIs

2008-03-03 11:17:29 by Martynas Jusevičius

The RESTful Web Services book provides three basic rules for URI design, born of collective experience:

  1. Use path variables to encode hierarchy: /parent/child
  2. Put punctuation characters in path variables to avoid implying hierarchy where none exists: /parent/child1;child2
  3. Use query variables to imply inputs into an algorithm, for example: /search?q=jellyfish&start=20

We have come up with very much the same principles in our designs. The second rule, however, seems to be rarely used in practice.

The rules are tested and precise enough, but it is the definitions used in them that are not always clear. What constitutes a hierarchy? A strict class/subclass/instance tree is probably the most logical and intuitive form of it, for example Places/Clubs/VEGA. It should be used whenever it applies, but that is not always the case. Another good example is things that are embedded one into another, such as Countries/USA/Georgia/Atlanta.

Many hierarchical paths, however, break these two patterns or mix them. For example, Places/Clubs/VEGA/Pictures/ or user/john/tags/ do not represent a strict hierarchy. Pictures/ is neither an instance of Places/Clubs/VEGA/ nor embedded in it in a strict sense. Still, it is a kind of resource that should be given a RESTful URI.

Imagine that we would like to add pictures for every location in our location hierarchy. The pictures of USA would then be addressed as Countries/USA/Pictures/. This breaks our URI “embeddedness” hierarchy by adding some “metadata leaf nodes”, which are not true nodes (locations). When handling such a request, most Web applications (especially those based on URL rewriting) would not be able to easily tell that Pictures/ does not refer to one of the US states as Georgia/ does. Last.fm, for example, distinguishes between these two kinds of URIs by using a + prefix: music/Radiohead/In+Rainbows (album) vs. music/Radiohead/+events. With the DIY Framework that is not necessary, since each resource has an explicit type which does not depend on its URI.

Another example found in practice is path URIs used for pagination, e.g. user/john/tags/all or /posts/4/pages/2. Again, they break the strict hierarchy, but they also lead to the definition of an algorithm in the third rule. In our eyes, pagination is much more an algorithm than a kind of hierarchy. One can argue philosophically about resources and how best to design them, but we see a list of posts as a single resource whose representation can be paginated using query parameters passed to an algorithm, rather than as a number of resources, one for every page. The algorithm can also have more inputs than the page number, e.g. the number of posts per page or the sort order, which would be hard to embed in the path. Therefore we use pagination URIs like Products/?offset=1200&limit=20.
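As a rough illustration of treating pagination as an algorithm with query inputs, the following Python sketch slices a single list resource according to offset and limit. The paginate function, its default page size, and the product list are assumptions made for the example; only the URI shape Products/?offset=1200&limit=20 comes from the text.

```python
from urllib.parse import urlsplit, parse_qs

def paginate(items, uri, default_limit=20):
    """Return the slice of items selected by the URI's offset and limit."""
    query = parse_qs(urlsplit(uri).query)
    # Missing parameters fall back to defaults; negative or zero values
    # are clamped so that the slice is always valid.
    offset = max(int(query.get("offset", ["0"])[0]), 0)
    limit = max(int(query.get("limit", [str(default_limit)])[0]), 1)
    return items[offset:offset + limit]

# Hypothetical collection standing in for the Products/ resource.
products = [f"product-{n}" for n in range(1500)]
page = paginate(products, "Products/?offset=1200&limit=20")
print(page[0], page[-1], len(page))  # product-1200 product-1219 20
```

The path Products/ identifies one resource throughout. A further input such as a sort order would simply become another query parameter.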

The bottom line is that the URI space is infinite, but in practice the design choices for logical RESTful URIs are often difficult and constrained.


44. The Rule of customization

2008-02-23 13:51:57 by Martynas Jusevičius

Whenever I need to use complex GUIs that are supposed to hide the complexities of the code or generate it for you as a convenience, or complex configuration files, I get the feeling that something is not right, that they actually stand in the way of doing my work rather than helping me.

One example could be ASP.NET in Visual Studio and that kind of interface. A few clicks in a wizard allow you to bind the database and show a table of data on your page. You can of course add or remove columns, change styles and appearance etc. But since you do not actually control the code behind it, if you are doing something more advanced you eventually hit the wall, since there is just no way to do it using the interface. Then you still need to go down and hack the code. And in the meantime you have been learning all these knobs and buttons of a proprietary program instead of using and extending your knowledge of SQL or HTML.

Another example I can think of is huge declarative configuration files, usually written in XML. They are common in Web frameworks such as Struts, but probably elsewhere, too. They were probably built to make things simpler and just hold some constants, but then got out of hand and blew up. At some point it might seem that you are configuring more than coding, and still are not able to achieve what you want. And again, to do that you probably had to figure out a whole specification of a custom XML schema that you will not be able to use somewhere else, instead of sticking to your plain old Java code.

I am not saying customizable interfaces and configuration files are useless, but I think this rule applies:

At some point, customizable tools which are meant to ease software development become so complex that it takes more effort to figure them out and customize them for your needs than to build what you want from scratch.


45. RESTful cache

2008-02-15 13:08:41 by Martynas Jusevičius

We've been recently thinking about how to implement a cache over the DIY Framework. Ideally it should be an extra layer in the application and not require making changes in the underlying design.

The golden rule of caching says that it is best to cache as close to the final product as possible. It is good to cache a result set, but best to cache the whole webpage. So we'll focus on that, since it is also easier to implement as a separate layer.

Imagine a product webpage. There are a couple of forms on it (e.g. for comments), and the page is only updated when they are submitted. Otherwise the content stays the same, so it can be cached and served until the page is updated again. The important thing is to know when the update happens and when to invalidate the cache.
Unfortunately (or not), there are web pages that are not updated directly via HTTP methods. They change because other pages get updated. Imagine a list of most recent product comments. It would take some logic to figure out when it was updated — at least retrieving the timestamp for the most recent comment. It makes cache invalidation hard.

The benefit of the REST architecture and our framework is that resources are fine-grained. If there is a resource with the URI Products/123 and it receives a POST (or basically any non-GET) request, we can assume it was updated. That would be harder to figure out in a non-RESTful design, for example if all the product requests were handled by a single script and the URIs were something like products.php?id=123.
Making all forms submit to the same URI they are served on also seems to be a good practice. If a product comment were submitted to some comment.php instead, the cache would not know that the product page was updated.
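The invalidation rule described above can be sketched as a thin layer in front of the application. This is only a Python illustration under simple assumptions, not the DIY Framework's design: the PageCache class and the app callable are hypothetical, pages are kept in an in-process dictionary, and HTTP headers are ignored.

```python
class PageCache:
    """Caches whole pages per URI and drops them on any non-GET request."""

    def __init__(self, app):
        self.app = app    # hypothetical callable: (method, uri) -> response body
        self.pages = {}   # URI -> cached response body

    def handle(self, method, uri):
        if method != "GET":
            # A POST, PUT or DELETE to this URI is assumed to update the
            # resource, so the cached page is invalidated.
            self.pages.pop(uri, None)
            return self.app(method, uri)
        if uri not in self.pages:
            self.pages[uri] = self.app(method, uri)
        return self.pages[uri]


calls = []

def app(method, uri):
    # Stand-in for the real application; records every call it receives.
    calls.append((method, uri))
    return f"{uri} rendered after {len(calls)} call(s)"

cache = PageCache(app)
cache.handle("GET", "Products/123")    # rendered and stored
cache.handle("GET", "Products/123")    # served from the cache
cache.handle("POST", "Products/123")   # passes through, invalidates the page
cache.handle("GET", "Products/123")    # rendered again
print(len(calls))  # 3
```

A sketch like this only catches updates that arrive at the page's own URI, which is why submitting forms to the same URI matters. Pages that change as a side effect of other resources would still need separate handling.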

Now we need to figure out how to implement this using some memory cache (such as eAccelerator) and sending correct Last-Modified and Cache-Control headers :)


46. Data portability

2008-02-05 12:33:27 by Martynas Jusevičius

Recently, a number of initiatives have appeared, especially in the social networking area, that criticize Web applications for being built as walled gardens which do not let users control their data or transfer it to another application: Data Portability, Open Social Web, OpenID. They demand ownership and control over profiles and relationships, and their publication using open standards.

Although many Web 2.0 websites have started offering APIs, these do not solve the problem completely. Most of them are based on XML, which is machine-readable and therefore a great advantage, but still leads to the N^2 problem: in order to be integrated, each pair of applications has to be programmed accordingly, with knowledge of the API and formats on the other end.

The true solution here might also become a bootstrap for the Semantic Web, which is not about some kind of Artificial Intelligence right now, but about data integration in the first place. In fact, the Web 2.0 data portability initiatives resemble the semantic Linking Open Data community project a lot. To support true data portability, Web applications should publish their data as RDF Linked Data, and ultimately provide SPARQL query endpoints and employ OWL ontologies.


47. XHTML 2 vs. HTML 5

2008-01-28 12:11:50 by Martynas Jusevičius

W3C has published the first working draft of HTML 5. In it, new features are introduced to help Web application authors, new elements are introduced based on research into prevailing authoring practices, and special attention has been given to defining clear conformance criteria for user agents in an effort to improve interoperability. However, at the same time there is ongoing work on XHTML 2.

How do these specifications relate to each other? HTML 5 claims that XHTML 2 only fits the document-oriented paradigm, but there is a need to extend the HTML vocabulary to support non-document content such as forum sites and online shops. They come in different namespaces, so there should be no conflict in that.
XHTML 2 should further increase semantics and the separation between content and presentation, move to a modular approach in combination with XForms, and be based on XML. HTML 5, on the other hand, tries to incorporate features used in practice.
XHTML 2 seems to be pushed by the W3C, while HTML 5 is backed by vendors such as Mozilla and Opera, which started the work on it in WHATWG but eventually joined W3C's HTML working group.

To make things even more confusing, the HTML 5 draft proposes two authoring formats: one based on XML (called XHTML 5), and one based on a custom format inspired by SGML (called HTML 5). Vendors are encouraged to implement both.

So far the future of the relationship seems very unclear, as both HTML 5 and XHTML 2 seem to compete in trying to replace HTML 4 and XHTML 1. It would be hard to imagine future versions of HTML not based on XML. Anyway, if this is about to become a standards war, which is the last thing the Web needs today, the real losers will be users and developers.

More insights in X/HTML 5 Versus XHTML 2 by XHTML.com and HTML V5 and XHTML V2 by IBM developerWorks.


48. Why PHP rocks

2008-01-24 16:17:22 by Martynas Jusevičius

OdinJobs published an interview with some PHP folks on why they think PHP rocks. I was glad to contribute a little :)


49. SPARQL became W3C recommendation

2008-01-24 16:14:06 by Martynas Jusevičius

For those who missed it, SPARQL Query Language for RDF became a W3C recommendation on January 15th.


50. DBpedia: A Nucleus for a Web of Open Data

2008-01-17 12:26:09 by Martynas Jusevičius

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. It is a fine example of the next generation semantic applications and can be compared by functionality to Freebase which also uses Wikipedia's dataset. The difference is that DBpedia's data and framework are open-source, and it is built using W3C standards such as RDF/OWL and SPARQL.

DBpedia makes it possible to ask sophisticated queries, such as selecting “people influenced by Friedrich Nietzsche” or “German musicians who were born in Berlin”. The second query looks like this in SPARQL (prefixes omitted):

SELECT ?name ?birth ?description ?person WHERE {
     ?person dbpedia2:birthPlace <http://dbpedia.org/resource/Berlin> .
     ?person skos:subject <http://dbpedia.org/resource/Category:German_musicians> .
     ?person dbpedia2:birth ?birth .
     ?person foaf:name ?name .
     ?person rdfs:comment ?description .
     FILTER (LANG(?description) = 'en') .
}
ORDER BY ?name

It can be executed using one of the several SPARQL endpoints (see results here).
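For those who would rather call an endpoint from a script, a query is typically sent in the query parameter of an HTTP GET request, as defined by the SPARQL Protocol. The Python sketch below only builds such a URL and does not send it. The sparql_get_url helper is hypothetical, the endpoint address http://dbpedia.org/sparql is assumed to be the public DBpedia endpoint, and the short query is a placeholder rather than the query above, whose prefixes are omitted.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL carrying the query text."""
    # urlencode percent-encodes the query so it is safe inside a URL.
    return endpoint + "?" + urlencode({"query": query})

# Placeholder query; a real request would carry the full query
# together with its PREFIX declarations.
url = sparql_get_url("http://dbpedia.org/sparql",
                     "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
print(url)
```

Fetching the resulting URL with any HTTP client should return the result set in a format chosen by the endpoint or negotiated through the Accept header.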

The datasets, now containing 103 million triples that describe 1.95 million things, are published online as linked data and can be browsed using semantic browsers such as Tabulator. They are also available for download.
What is even more exciting, DBpedia is also being linked to other semantic datasets such as MusicBrainz (information about music and artists) and GeoNames (information about geographical features), becoming the core of W3C's Linking Open Data project.

