Blog

2008 posts (36)

21. kbh.dk

2008-04-18 12:04:15 by Martynas Jusevičius

I started working as a freelancer on kbh.dk. From the English description:

Kbh.dk is a social network for everybody that live in or care about Copenhagen.

We will launch a Public Beta version in May 2008. In the first phase the web site will be mainly in Danish but we’re working hard to get full internationalization as soon as possible.

The project is driven by the wish to create a socially connected metropolis. Kbh.dk will be the place where the people of Copenhagen inspire each other to use the city - together. The project is financed by the Municipality of Copenhagen.

Kbh.dk helps you to

  • Find people that share your interests and connect through groups
  • Promote your interests with video, photos and blogs
  • Network with your friends - and use the city with them
  • Invite others to the events you are organizing or following

I am responsible for the events part. Users will be able to create them, but there will also be events imported from a third-party aggregation webservice.

It seems like it is going to be fun, I'm excited :)

22. Recursion in XSLT

2008-04-09 23:52:48 by Martynas Jusevičius

XSLT, which is great for web templating, is a functional language and has no general-purpose loops (only iteration over node sets with <xsl:for-each>) or mutable variables. These constructs have to be replaced with recursion and parameters. For example, if you would like to count up to a certain number and print the counter, you would have to use recursion.
Recursion in XSLT is implemented using <xsl:apply-templates> or <xsl:call-template> together with <xsl:with-param> and <xsl:param>.

The general logic is to call or apply a template with some initial parameter value, get the necessary work done inside of the template, and let it call itself with a modified parameter. Considering the loop example, it would mean calling the template with 0 as initial counter parameter, printing it out in the template, and calling the template recursively with counter increased by one with the condition that it still is smaller than the target number.
As a complete example we provide a template for converting line breaks in a string into HTML <br/> elements:

<xsl:template name="process-line-breaks">
	<xsl:param name="text"/>
	<xsl:variable name="break" select="'&#10;'"/> <!-- entity for line break character, might differ according to platform -->
	<xsl:choose>
		<xsl:when test="contains($text, $break)">
			<xsl:value-of select="substring-before($text, $break)"/>
			<br/>
			<xsl:call-template name="process-line-breaks">
				<xsl:with-param name="text" select="substring-after($text, $break)"/>
			</xsl:call-template>
		</xsl:when>
		<xsl:otherwise>
			<xsl:value-of select="$text"/>
		</xsl:otherwise>
	</xsl:choose>
</xsl:template>

The template takes a text string as a parameter and tries to locate a line break in it. If that succeeds, the part of the string before the break is output, followed by a <br/>, and the template is called recursively on the part of the string after the break. If there are no more breaks in the string parameter, the remaining string becomes the output and the template terminates. In each step the template is left with a smaller and smaller remainder of the original string, while the preceding part becomes template output with line breaks replaced.
The template can be applied on a text content like this:

<xsl:template match="text()" mode="process-line-breaks">
	<xsl:call-template name="process-line-breaks">
		<xsl:with-param name="text" select="."/>
	</xsl:call-template>
</xsl:template>
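
For readers less familiar with XSLT, the same recursion pattern can be sketched in Python. This is only an illustration, not part of the stylesheet: the function names count_up and process_line_breaks are made up here, and the first function covers the counter "loop" described earlier while the second mirrors the process-line-breaks template.

```python
def count_up(counter, target):
    """Recursive stand-in for a loop: collect counter, counter + 1, ... below target."""
    if counter >= target:          # termination condition replaces the loop test
        return []
    # "print" the counter, then recurse with the parameter increased by one
    return [counter] + count_up(counter + 1, target)


def process_line_breaks(text, line_break="\n"):
    """Mirror of the process-line-breaks template: replace breaks with <br/>."""
    if line_break not in text:     # xsl:otherwise branch: no break left, output the rest
        return text
    # partition() plays the role of substring-before() and substring-after()
    before, _, after = text.partition(line_break)
    return before + "<br/>" + process_line_breaks(after, line_break)
```

As in the template, each call works on a shorter remainder of the string, so the recursion depth equals the number of line breaks in the input.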

23. Semantic social networks

2008-04-02 14:05:45 by Martynas Jusevičius

Lately there has been a growing dissatisfaction in the user communities of social networks and other user-based websites which were built with an intention to lock and control the data that users provide. Users complain about not having control over profile data and social relationships they have established and not being able to export them in machine readable formats, which makes reuse and integration of this data virtually impossible. This resulted in several initiatives:

Open Social Web
Demands ownership, control, and freedom of personal information on the social web
Social Network Portability
Demands ability to import profile information and social network (based on Microformats)
DataPortability
Encourages standards-based data portability in general
OpenID
Promotes single digital identity
OpenSocial
Defines a common API for social applications across multiple websites

The main problem seems to be data portability in general, and we have already argued that it is a sweet spot for semantic technologies. However, there are several Semantic Web projects directly addressing issues of social networks:

Friend of a Friend (FOAF)
Aimed at creating a Web of machine-readable pages describing people, the links between them and the things they create and do
Semantically Interlinked Online Communities (SIOC)
Provides methods for interconnecting discussion methods such as blogs, forums and mailing lists to each other

In combination with Linked Data, FOAF would allow import of friend profiles from one website to another, and SIOC would enable structured queries over distributed user-generated content. Social networks could reuse semantic data from sites like DBpedia and be valuable sources of such data themselves. These were the main ideas in Tim Berners-Lee's Giant Global Graph vision and Brad Fitzpatrick's thoughts on the Social Graph.
Leigh Dodds wrote about this back in 2004, but things have advanced slowly. There exist SIOC plugins for several blog engines, and LiveJournal is publishing FOAF data, but portability still has to gain momentum, most likely by support from major sites.

24. PHP 5 features: Exceptions

2008-03-25 10:18:59 by Martynas Jusevičius

A useful new feature in PHP 5 is exception handling via the try/throw/catch paradigm.

An exception may be thrown and caught. If an exception is thrown in code surrounded by try, the following statements will not be executed, and the exception will be handled by the first matching catch block. Each try has to have at least one corresponding catch, and multiple catch blocks can be defined to handle different types of exceptions. Normal execution continues after the last catch block.
The built-in Exception class can be extended and a user-defined exception class can be implemented.

Exceptions are a powerful construct which can be used to improve code but can also be easily abused. PHP5 Exception Use Guidelines defines one basic rule of thumb:

Exceptions should never be used as normal program flow. If removing all exception handling logic (try-catch statements) from the program, the remaining code should represent the "One True Path" -- the flow that would be executed in the absence of errors.
This requirement is equivalent to requiring that exceptions be thrown only on error conditions, and never in normal program states.

When developing with the DIY Framework, we are mostly using exceptions to indicate form validation or non-permitted access errors. Here is an example from a SettingsResource class which handles a page of user settings, which should be accessible only when the user is logged on:

public function doPost(Request $request, Response $response)
{
	$view = null;
	$parent = parent::doPost($request, $response);

	if ($parent != null) $view = $parent;
	else
	{
		try
		{
			if ($request->getSession()->getAttribute("user") instanceof GuestUser) throw new NoPermissionException();

			$view = new SettingsGeneralView($this);

			// ... some other code to save submitted settings
		}
		catch (NoPermissionException $e)
		{
			$response->sendRedirect(FrontEndMapping::getHost().LoginResource::getInstance()->getURI());
		}
	}

	return $view;
}

If the user in session is a guest user (not logged on), an exception is thrown, and a redirect to the login page follows.
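
The try/throw/catch flow is not specific to PHP. As a rough sketch only, here is the same idea in Python, where the exception is raised in a called function and handled by the first matching handler in the caller. NoPermissionException is named after the class in the example above; ValidationException, save_settings, handle_post and the "/login" target are hypothetical names introduced for illustration.

```python
class NoPermissionException(Exception):
    """User-defined exception for non-permitted access."""


class ValidationException(Exception):
    """Hypothetical second exception type for form validation errors."""


def save_settings(user_is_guest, settings):
    """Raise on error conditions only; the normal path simply returns."""
    if user_is_guest:
        raise NoPermissionException("login required")
    if not settings.get("email"):
        raise ValidationException("email is required")
    return "saved"


def handle_post(user_is_guest, settings):
    """Handle each exception type in its own block, as with multiple catch blocks."""
    try:
        result = save_settings(user_is_guest, settings)
    except NoPermissionException:
        return "redirect:/login"          # hypothetical redirect target
    except ValidationException as error:
        return "error:" + str(error)
    return result                          # normal execution continues here
```

Removing the try block leaves the single call to save_settings, which is the "One True Path" the guidelines describe.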

25. Joining documents in XSLT

2008-03-18 11:26:42 by Martynas Jusevičius

Sometimes an XSLT stylesheet needs access to several XML documents. In many cases they are inter-related, for example through @id/@idref attributes, and need to be joined to produce the desired result.
We have already written on how to pass side documents to XSLT using XSLTView and URIResolver classes from the DIY Framework. Now we are going to show how the actual XSLT join works.

Let us say we have an array of product categories with IDs, and an array of products, where every product has a category ID. We want to display a table of products, with the product name and the product category name on each row. Since the product array only contains category IDs (references to categories), we need to join the two arrays on category IDs to get the corresponding category names.

First we pass both serialized arrays as side documents that can be accessed in XSLT as document('arg://products') and document('arg://product-categories'). Then we start building the table by applying templates on products:

<table>
	<xsl:apply-templates select="document('arg://products')/Array/Product"/>
</table>

To complete it, we need to define a template which will produce table rows from products. This is where the actual join on category ID attributes needs to occur:

<xsl:template match="Product">
	<tr>
		<td>
			<xsl:value-of select="Name"/>
		</td>
		<td>
			<xsl:for-each select="document('arg://product-categories')/Array/Category[@id = current()/@categoryID]">
				<xsl:value-of select="Name"/>
			</xsl:for-each>
		</td>
	</tr>
</xsl:template>

Notice <xsl:for-each>, which is normally used to iterate over a node set. Here we simply use it to change the context node to the Category element in document('arg://product-categories') which has the same @id as the @categoryID of the current Product being processed in the row template. Therefore the second <xsl:value-of select="Name"/> will output the name of the category, not the product.
Another important construct is the current() function. Here it is used in a predicate expression and refers to the current Product node, while the context node in that expression is Category.

Sounds complicated? But it makes lots of sense when you get used to it.
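
For comparison, the same join can be sketched outside XSLT. The Python below is an illustration under assumptions: the element and attribute names (Array, Product, Category, Name, @id, @categoryID) follow the stylesheet above, while the sample data and the function name join_products are invented. A product whose category is missing gets an empty category name, just as the <xsl:for-each> would select nothing.

```python
import xml.etree.ElementTree as ET

# Invented sample data in the shape the stylesheet expects
PRODUCTS_XML = """<Array>
  <Product categoryID="c1"><Name>Espresso machine</Name></Product>
  <Product categoryID="c2"><Name>Desk lamp</Name></Product>
  <Product categoryID="c9"><Name>Mystery box</Name></Product>
</Array>"""

CATEGORIES_XML = """<Array>
  <Category id="c1"><Name>Kitchen</Name></Category>
  <Category id="c2"><Name>Lighting</Name></Category>
</Array>"""


def join_products(products_xml, categories_xml):
    """Return (product name, category name) pairs joined on the category ID."""
    products = ET.fromstring(products_xml)
    categories = ET.fromstring(categories_xml)
    # Index categories by @id once, so each product lookup is a dict access
    names_by_id = {c.get("id"): c.findtext("Name")
                   for c in categories.findall("Category")}
    rows = []
    for product in products.findall("Product"):
        # Equivalent of Category[@id = current()/@categoryID]
        category_name = names_by_id.get(product.get("categoryID"), "")
        rows.append((product.findtext("Name"), category_name))
    return rows
```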

26. Scraping HTML with DOM

2008-03-11 11:28:45 by Martynas Jusevičius

HTML scraping is used to extract structured data from human-intended webpages. It is a common way to work around old school websites which do not provide data feeds in machine-readable formats such as RDF or at least custom XML.
Scraping can be implemented in several ways. Regular expressions (regexp) are probably the most widely used technique. They employ specific syntax rules to define patterns that have to be matched in a string.

However, there are several issues with the regexp approach. Because HTML source is complex, pattern strings soon become lengthy. The pattern syntax is not trivial and may differ between systems. That leads to complicated and non-intuitive code, where the program logic (such as conditional cases) is hard-coded in the pattern and therefore not obvious.

Moreover, regular expressions operate on a generic string level and have no knowledge of HTML tags and the tree-shaped document model that they form. This means that special care has to be taken with insignificant whitespace (such as line breaks), control characters have to be escaped, etc. In situations where tree-like structures need to be scraped (for example, nested comment or e-mail threads), sequential matching makes it difficult to figure out the item's position in the tree, i.e. to relate the item to its parent, if there is one. Recursion would be an appropriate technique in this case, but regular expressions aren't really meant for recursive solutions.

Another approach to scraping is to use the Document Object Model (DOM), a standard object model for representing HTML or XML. DOM is usually associated with client-side scripting, but it can be used equally well on the server side: support is available on many platforms, including Java libraries and a PHP extension. Fortunately, PHP's DOM extension is able to parse even invalid HTML, which is what most pages in the wild contain.

Here is a simple PHP scraper which turns a table with student information into FOAF data:

class StudentScraper
{
	private $doc = null;

	public function __construct($url)
	{
		$htmlString = file_get_contents($url); // load HTML file as a string
		$this->doc = new DOMDocument();
		@$this->doc->loadHTML($htmlString); // load string into DOM document object; @ suppresses the warnings raised for invalid HTML
	}

	public function process() // return RDF/XML string
	{
		$xml = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\">\n";
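		$table = $this->doc->getElementsByTagName("table")->item(0); // first table element
		if ($table === null) return $xml."</rdf:RDF>"; // no table found, return an empty graph

		foreach ($table->getElementsByTagName("tr") as $tr) // iterate rows (childNodes would also yield whitespace text nodes)
		{
			$cells = $tr->getElementsByTagName("td");
			if ($cells->length < 2) continue; // skip header or malformed rows

			$name = trim($cells->item(0)->textContent); // content of first cell
			$eMail = trim($cells->item(1)->textContent); // content of second cell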

			$xml .= "	<foaf:Person>\n"; // construct RDF/XML
			$xml .= "		<foaf:name>".htmlspecialchars($name)."</foaf:name>\n";
			$xml .= "		<foaf:mbox rdf:resource=\"mailto:".htmlspecialchars($eMail)."\"/>\n";
			$xml .= "	</foaf:Person>\n";

		}
		$xml .= "</rdf:RDF>";

		return $xml;
	}

}

$scraper = new StudentScraper("http://some.university.com/students.htm"); // run scraper
print $scraper->process();

Here we load the contents of the HTML file into a DOM document. Student data is found in a table, which contains a row for each student. Each row contains two cells: one for the name and one for the e-mail address. We iterate through the rows and create one foaf:Person entry per student. When run, the scraper returns an RDF/XML string with FOAF data.
Constructing XML as a plain string is an error-prone practice, and the output should also be built using DOM, but we leave it as is here for simplicity and clarity.
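
To show what building the output as a tree rather than a string could look like, here is a minimal sketch in Python using only the standard library. It is not the PHP scraper above: RowCollector and students_to_foaf are hypothetical names, and the parser assumes that <tr> and <td> tags are properly closed, which real pages do not guarantee. The element tree takes care of escaping names and addresses.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"


class RowCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text chunks of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def students_to_foaf(html):
    """Return an RDF/XML string with one foaf:Person per two-cell table row."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()

    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("foaf", FOAF_NS)
    root = ET.Element("{%s}RDF" % RDF_NS)
    for row in collector.rows:
        if len(row) < 2:    # skip header or malformed rows
            continue
        person = ET.SubElement(root, "{%s}Person" % FOAF_NS)
        ET.SubElement(person, "{%s}name" % FOAF_NS).text = row[0]
        mbox = ET.SubElement(person, "{%s}mbox" % FOAF_NS)
        mbox.set("{%s}resource" % RDF_NS, "mailto:" + row[1])
    return ET.tostring(root, encoding="unicode")
```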

27. RESTful URIs

2008-03-03 11:17:29 by Martynas Jusevičius

The RESTful Web Services book provides three basic rules for URI design, born of collective experience:

  1. Use path variables to encode hierarchy: /parent/child
  2. Put punctuation characters in path variables to avoid implying hierarchy where none exists: /parent/child1;child2
  3. Use query variables to imply inputs into an algorithm, for example: /search?q=jellyfish&start=20

We have come up with very much the same principles in our designs. The second rule however seems to be rarely used in practice.

The rules are tested and precise enough, but it is the definitions used in them that are not always clear. What constitutes a hierarchy? A strict class/subclass/instance tree is probably the most logical and intuitive form of it, for example Places/Clubs/VEGA. It should be used whenever it applies, but that is not always the case. Another good example is things that are embedded one into another, such as Countries/USA/Georgia/Atlanta.

Many hierarchical paths however break these two patterns, or mix them. For example, Places/Clubs/VEGA/Pictures/ or user/john/tags/ do not represent a strict hierarchy. Pictures/ is neither an instance of Places/Clubs/VEGA/ nor embedded in it in a strict sense. Still, it is a kind of resource that should be given a RESTful URI.

Imagine that we would like to add pictures for every location in our location hierarchy. The pictures of USA would then be addressed as Countries/USA/Pictures/. This breaks our URI “embeddedness” hierarchy by adding some “metadata leaf nodes”, which are not true nodes (locations). When handling such a request, most Web applications (especially those based on URL rewriting) would not be able to easily tell that Pictures/ does not refer to one of the US states as Georgia/ does. Last.fm, for example, distinguishes between these two kinds of URIs by using a + prefix: music/Radiohead/In+Rainbows (album) vs. music/Radiohead/+events. With the DIY Framework that is not necessary, since each resource has an explicit type which does not depend on its URI.

Another example found in practice is path URIs used for pagination, e.g. user/john/tags/all or /posts/4/pages/2. Again, they break the strict hierarchy, but they also lead us to the definition of an algorithm in the third rule. In our eyes, pagination is much more an algorithm than a kind of hierarchy. One can argue philosophically about resources and how best to design them, but we see a list of posts as a single resource whose representation can be paginated using query parameters passed to an algorithm, rather than as a separate resource for every page. The algorithm can also have more inputs than the page number, e.g. the number of posts per page or the sort order, which would be hard to embed in the path. Therefore we use pagination URIs like Products/?offset=1200&limit=20.
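
A small sketch of how such pagination URIs could be built and read back, written in Python for illustration. Only the Products/?offset=1200&limit=20 shape comes from the text above; the function names page_uri and page_params and the default limit of 20 are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def page_uri(path, offset=0, limit=20):
    """Build a pagination URI: the path names the resource, the query holds the inputs."""
    return path + "?" + urlencode({"offset": offset, "limit": limit})


def page_params(uri, default_limit=20):
    """Split a pagination URI back into (path, offset, limit), applying defaults."""
    parts = urlsplit(uri)
    query = parse_qs(parts.query)
    offset = int(query.get("offset", ["0"])[0])
    limit = int(query.get("limit", [str(default_limit)])[0])
    return parts.path, offset, limit
```

Further inputs, such as a sort order, would be added as more query parameters without touching the path, so the list stays a single resource.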

The bottom line is that the URI space is infinite, but in practice the design choices for logical RESTful URIs are often difficult and constrained.

28. The Rule of customization

2008-02-23 13:51:57 by Martynas Jusevičius

Whenever I need to use complex GUIs that are supposed to hide the complexities of the code or generate it for you as a convenience, or complex configuration files, I get the feeling that something is not right, that they actually stand in the way of doing my work rather than helping me.

One example is ASP.NET in Visual Studio and interfaces of that kind. A few clicks in a wizard allow you to bind the database and show a table of data on your page. You can of course add or remove columns, change styles and appearance, etc. But since you do not actually control the code behind it, you eventually hit a wall when doing something more advanced, because there is just no way to do it using the interface. Then you still need to go down and hack the code. And in the meantime you have been learning all the knobs and buttons of a proprietary program instead of using and extending your knowledge of SQL or HTML.

Another example I can think of is huge declarative configuration files, usually written in XML. They are common in Web frameworks such as Struts, but probably elsewhere, too. They were probably built to make things simpler and just hold some constants, but then got out of hand and blew up. At some point it might seem that you are configuring more than coding, and still are not able to achieve what you want. And again, to do that you probably had to figure out a whole specification of a custom XML schema that you will not be able to use somewhere else, instead of sticking to your plain old Java code.

I am not saying customizable interfaces and configuration files are useless, but I think this rule applies:

At some point, customizable tools which are meant to ease software development become so complex that it takes more effort to figure them out and customize them for your needs than to build what you want from scratch.

29. RESTful cache

2008-02-15 13:08:41 by Martynas Jusevičius

We've been recently thinking about how to implement a cache over the DIY Framework. Ideally it should be an extra layer in the application and not require making changes in the underlying design.

The golden rule of caching says that it is best to cache as close to the final product as possible. It is good to cache a result set, but best to cache the whole webpage. So we'll focus on that, since it is also easier to implement as a separate layer.

Imagine a product webpage. There are a couple of forms on it (e.g. for comments), and the page is only updated when they are submitted. Otherwise the content stays the same, so it can be cached and served until the page is updated again. The important thing is to know when the update happens and when to invalidate the cache.
Unfortunately (or not), there are web pages that are not updated directly via HTTP methods. They change because other pages get updated. Imagine a list of most recent product comments. It would take some logic to figure out when it was updated — at least retrieving the timestamp for the most recent comment. It makes cache invalidation hard.

The benefit of the REST architecture and our framework is that resources are fine-grained. If there is a resource with a URI Products/123 and it receives a POST (or basically any non-GET) request, we can assume it was updated. That would be harder to figure out in a non-RESTful design, for example if all the product requests were handled by a single script and URIs were something like products.php?id=123.
Making all forms submit to the same URI as they are served on also seems to be a good practice. If a product comment were submitted to some comment.php instead, the cache would not know that the product page was updated.
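
The invalidation rule can be sketched as a thin layer in front of the application. This Python sketch is an assumption-laden illustration, not the DIY Framework's cache: PageCache and the app callable are hypothetical, the cache is a plain in-process dictionary, and HTTP headers are left out entirely.

```python
class PageCache:
    """Caches whole GET responses per URI; any non-GET request to a URI evicts it."""

    def __init__(self, app):
        self.app = app      # hypothetical callable: (method, uri) -> response body
        self.pages = {}     # uri -> cached body of the last GET

    def handle(self, method, uri):
        if method != "GET":
            # POST, PUT, DELETE: assume the resource at this URI was updated
            self.pages.pop(uri, None)
            return self.app(method, uri)    # never serve or store non-GET responses
        if uri not in self.pages:
            # Cache miss: render the page once and remember it
            self.pages[uri] = self.app(method, uri)
        return self.pages[uri]
```

Because the key is the URI alone, this only works when a form posts to the URI of the page it changes. Pages that change as a side effect of other resources, such as the list of most recent comments, would still need their own invalidation logic.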

Now we need to figure out how to implement this using some memory cache (such as eAccelerator) and sending correct Last-Modified and Cache-Control headers :)

30. Data portability

2008-02-05 12:33:27 by Martynas Jusevičius

Recently, a number of initiatives have appeared, especially in the social networks area, that complain about Web applications being built as walled gardens which do not allow users to control their data or transfer it to another application: Data Portability, Open Social Web, OpenID. They demand ownership of and control over profiles and relationships, and their publication using open standards.

Although many Web 2.0 websites have started offering APIs, these do not solve the problem completely. Most of them are based on XML, which has the great advantage of being machine-readable, but still leads to the N^2 problem: in order to be integrated, each pair of applications has to be programmed accordingly, with knowledge of the API and formats on the other end.
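
The arithmetic behind the N^2 problem can be made concrete with a back-of-the-envelope sketch in Python. It rests on two simplifying assumptions that are ours: with custom APIs every application needs one importer for each other application, and with a shared format every application needs just one exporter and one importer. The function names are made up.

```python
def pairwise_adapters(n):
    """Importers needed when each of n applications reads every other one's custom format."""
    return n * (n - 1)


def shared_format_adapters(n):
    """Exporters plus importers needed when all n applications share one format."""
    return 2 * n
```

Under these assumptions, 10 applications need 90 importers in the pairwise case but only 20 adapters with a shared format, and the gap widens quadratically as more applications join.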

The true solution here might also become a bootstrap for the Semantic Web, which is not about some kind of Artificial Intelligence right now, but about data integration in the first place. In fact, the Web 2.0 data portability initiatives resemble the semantic Linking Open Data community project a lot. To support true data portability, Web applications should publish their data as RDF Linked Data, and ultimately provide SPARQL query endpoints and employ OWL ontologies.
