Blog

2009 03 posts (5)


1. The Unreasonable Effectiveness of Data

2009-03-27 00:36:35 by Martynas Jusevičius

It is nice to see Google embracing the Semantic Web in a recent paper by its top research scientists, “The Unreasonable Effectiveness of Data”. Here is an excerpt from the summary:

So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.


2. Regexp string replace with PHP XSL

2009-03-19 13:26:52 by Martynas Jusevičius

While XSLT 2.0 has regular expression support, XSLT 1.0 does not. Tasks like string pattern matching and replacement cannot be done in native XSLT 1.0 code; extension functions are needed for that. Luckily, this can be achieved with PHP XSL in the same fashion as URL-encoding: by registering PHP function support on the XSLT processor and calling native PHP functions as XPath functions in the PHP namespace.

A common use case for this when using XSLT for View templates is replacing http:// links in text (for example, message content from a database) with actual <a> hyperlink elements. Here is a quick-and-dirty PHP function that does that:

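abstract class FrontEndView extends XSLTView
{
	// ...

	public static function replaceLinks($text)
	{
		// Escape markup characters so the text can be parsed as XML below
		$text = htmlspecialchars($text);
		// Wrap anything that looks like scheme://... in an <a> element
		// (preg_replace is used because ereg_replace was removed in PHP 7)
		$text = preg_replace('~[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]~', '<a href="$0">$0</a>', $text);
		// Parse the result as an XHTML fragment and return its root element
		$text = "<div xmlns=\"http://www.w3.org/1999/xhtml\">".$text."</div>";
		$doc = new DOMDocument();
		$doc->loadXML($text);
		return $doc->documentElement;
	}
}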

It replaces link text with hyperlinks at the string level, then loads the result into a DOM document and returns its root element. Before the function can be called from a stylesheet, registerPhpFunctions() needs to be called on the XSLT processor instance.

The XSLT code then looks like this:

<xsl:copy-of select="php:function('FrontEndView::replaceLinks', string($text))/node()"/>

Notice that while you have to send string content to the function, what you get back is a node list, because the replaced text becomes XML elements, which are mixed with the unreplaced text nodes.
Don't forget to register the PHP namespace (http://php.net/xsl) in your stylesheet and use exclude-result-prefixes="php" so that it does not appear in the result document.
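To see the pieces working together, here is a minimal end-to-end sketch. It is only an illustration under some assumptions: the class name LinkView, the <message> input element and the example.org URL are made up for this example, and the link function uses preg_replace because ereg_replace is no longer available in PHP 7 and later. It needs the PHP dom and xsl extensions.

```php
<?php
// Sketch only: LinkView is a hypothetical stand-in for the view class,
// so that the example does not depend on an XSLTView base class.
class LinkView
{
    public static function replaceLinks($text)
    {
        // Escape markup, then wrap scheme://... runs in <a> elements
        $text = htmlspecialchars($text);
        $text = preg_replace(
            '~[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]~',
            '<a href="$0">$0</a>',
            $text
        );
        $doc = new DOMDocument();
        $doc->loadXML('<div xmlns="http://www.w3.org/1999/xhtml">' . $text . '</div>');
        return $doc->documentElement;
    }
}

// The stylesheet declares the php namespace and excludes its prefix from the output
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:php="http://php.net/xsl"
    exclude-result-prefixes="php">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/message">
    <p xmlns="http://www.w3.org/1999/xhtml">
      <xsl:copy-of select="php:function('LinkView::replaceLinks', string(.))/node()"/>
    </p>
  </xsl:template>
</xsl:stylesheet>
XSL;

$stylesheet = new DOMDocument();
$stylesheet->loadXML($xsl);

// Hypothetical input document
$source = new DOMDocument();
$source->loadXML('<message>See http://example.org/page for details.</message>');

$processor = new XSLTProcessor();
$processor->registerPHPFunctions(); // enables php:function() in XPath expressions
$processor->importStylesheet($stylesheet);

$result = $processor->transformToXml($source);
echo $result;
```

The resulting paragraph should contain the URL wrapped in an <a> element, with the surrounding text left as plain text nodes.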

I guess similar functions could be used to implement BBCode or wiki syntax support.
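As a rough illustration of that idea, a BBCode variant might look like the sketch below. The function name replaceBBCode and the choice of tags ([b] and [i] only) are assumptions made for this example; a real implementation would need many more tags and sturdier parsing.

```php
<?php
// Sketch only: converts [b]...[/b] and [i]...[/i] into XHTML elements
// and returns them as a DOM element, in the same way as replaceLinks().
function replaceBBCode(string $text): DOMElement
{
    $open = '<div xmlns="http://www.w3.org/1999/xhtml">';
    $escaped = htmlspecialchars($text);

    $markup = preg_replace(
        ['~\[b\](.*?)\[/b\]~s', '~\[i\](.*?)\[/i\]~s'],
        ['<strong>$1</strong>', '<em>$1</em>'],
        $escaped
    );

    $doc = new DOMDocument();
    // Badly nested tags such as [b][i]x[/b][/i] do not produce well-formed
    // XML, so fall back to the escaped plain text in that case.
    if (!@$doc->loadXML($open . $markup . '</div>')) {
        $doc = new DOMDocument();
        $doc->loadXML($open . $escaped . '</div>');
    }
    return $doc->documentElement;
}

$element = replaceBBCode('[b]bold[/b] & [i]italic[/i]');
echo $element->ownerDocument->saveXML($element), "\n";
```

Once registered, it could be called from a stylesheet with php:function() exactly like the link replacement function.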


3. Tim Berners-Lee on Linked Data at TED

2009-03-14 03:14:29 by Martynas Jusevičius

A great and accessible presentation on Linked Data (the core technology behind the Semantic Web, together with RDF) and DBpedia, given at the TED conference by the inventor of the World Wide Web.


4. E-book formats

2009-03-11 20:06:27 by Martynas Jusevičius

I was recently doing an analysis of e-book formats for a client, and can share some of my findings here.

Criteria

Here are the main criteria of a good e-book format that I came up with (in order of importance, most important first):

# | Criterion | Description
1 | Open standard | The format is documented in detail and publicly available free of charge
2 | XML-based | The e-book file format itself (not the source formats) is based on XML and can be produced using standard XML tools only
3 | End-user-oriented | The format is meant to be used for the final workflow product, i.e. the e-book file downloaded to the user's (software or hardware) reader
4 | Reader-independent | There is more than one reader for this format, released by different vendors
5 | Workflow-oriented | The format can be easily transformed to and from a variety of different e-book formats and used in the whole workflow (e.g. exchange and storage) of e-book production
6 | Vendor-independent | No vendor-dependent tools are necessary to produce the book file
7 | Reflowable | Text is rearranged to fit the size of the reader window, instead of requiring zooming and panning to see the full document

If we stick to the criteria and look for a long-term solution, formats like Mobipocket and iSilo fail at the very first criteria for being binary, proprietary, and/or vendor-dependent. They require commercial software to transform source documents into end-user e-book files. That leaves us with basically two file formats.

PDF

Adobe's Portable Document Format (PDF) is the first of them. Strictly speaking, it does not meet the criteria above either, but it is so popular and so widely supported that it cannot be ignored in such a comparison.

The main cause of PDF's limitations is its historical relation to printed media. Its binary structure does not preserve many of the semantics of the original layout and simply places characters and graphical elements at specified coordinates on the page's surface. For example, a table cannot easily be extracted from a PDF document, since it appears only as text placed at specific positions plus a bunch of graphic lines. Alternatively, a table may exist as an image in the file. Both variants result in increased file size.

Since each document is intended for a specific page size, it is problematic to display it on screens of limited size or resolution, such as those of mobile devices. Reflow is only possible if the document was specifically tagged at creation time, which excludes the majority of existing documents. Even then, support for reflow is not guaranteed on the reader.

For these reasons, PDF is meant for the final products of the digital publishing workflow, i.e. documents to be printed or displayed to the end user. It cannot easily be used as an intermediary format in the workflow (a format for source documents) and transformed to other formats. There are many software readers available, usually distributed free of charge, the best known of course being Adobe Acrobat. Most hardware readers support PDF as well.

There are also many PDF creation tools. Most of Adobe's products support the format, as do office packages such as OpenOffice and Microsoft Office, as well as TeX and DocBook tools. There are also a number of virtual PDF printers, which produce a PDF file instead of actual printer output.

Choices of editing software are limited because of the format's complexities.

PDF supports encryption and DRM.

ePub

ePub is a relatively new XML-based format for reflowable digital books and publications, which seeks to increase interoperability between software as well as hardware tools in the digital publishing industry.

ePub is an open standard, defined not by a single vendor but by a standards body with members from across the industry, namely the International Digital Publishing Forum (IDPF). It is meant to be used both as a workflow format and as an end-user reader format.

ePub is actually a set of three related standards: the Open Publication Structure (OPS), the Open Packaging Format (OPF), and the OEBPS Container Format (OCF).

OPS defines a standard presentation of electronic books which would be accessible on different readers and displayed equivalently. OPS 2.0 is based on XML and uses the XHTML 1.1 vocabulary (DTD) with some extensions (such as SVG) to describe content structure, and CSS 2 to describe its style. That means OPS and ePub can be easily produced and consumed using standard XML and CSS tools such as XSLT; no special software is necessary. Readers can also be easily implemented on top of existing Web browser libraries.

OPF defines the mechanism by which the various components of an OPS publication are tied together and provides additional structure and semantics to the electronic publication. For example, it describes the publication's components (markup files, images etc.), metadata, and table of contents.

OCF is a general-purpose container technology based on the widely used ZIP compression format. It defines the standard mechanism by which all components of an electronic publication can be packaged together into a single file for transmission, delivery and archival.
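To illustrate how these layers can be consumed with standard XML tools, here is a small sketch using PHP's DOM extension. The XML samples, the file names (OEBPS/content.opf, chapter1.xhtml and so on) and the function names are assumptions invented for this example, and the package document is heavily simplified. In a real ePub the container file is found at META-INF/container.xml inside the ZIP archive.

```php
<?php
// Hypothetical, simplified container file (the OCF layer)
$containerXml = <<<'XML'
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
XML;

// Hypothetical, simplified package document (the OPF layer)
$opfXml = <<<'XML'
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="book-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample Book</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="book-id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="style" href="style.css" media-type="text/css"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="chapter1"/>
  </spine>
</package>
XML;

// OCF: the container file points to the package document
function findPackagePath(string $containerXml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($containerXml);
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('c', 'urn:oasis:names:tc:opendocument:xmlns:container');
    return $xpath->evaluate('string(/c:container/c:rootfiles/c:rootfile/@full-path)');
}

// OPF: the spine gives the reading order, the manifest maps ids to files
function readingOrder(string $opfXml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($opfXml);
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('opf', 'http://www.idpf.org/2007/opf');

    $hrefs = [];
    foreach ($xpath->query('/opf:package/opf:spine/opf:itemref') as $itemref) {
        $id = $itemref->getAttribute('idref');
        $hrefs[] = $xpath->evaluate(
            "string(/opf:package/opf:manifest/opf:item[@id='$id']/@href)"
        );
    }
    return $hrefs;
}

echo findPackagePath($containerXml), "\n";
echo implode(', ', readingOrder($opfXml)), "\n";
```

The content files listed in the reading order (the OPS layer) are ordinary XHTML documents, so they can be processed further with the same DOM and XSLT tools.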

DRM is not integrated into these standards, but may be layered on top (for example, implemented in the reader application). However, many experts discourage the use of DRM, with ePub and in general. Some even blame it for the unpopularity of e-books and for the failures already experienced by the music industry. Many publishers are abandoning it as well, or use so-called “social” DRM.

The format is being adopted by publishers such as Penguin Books and O'Reilly, as well as by public libraries, and is backed by associations in the digital publishing industry such as IDPF and AAP (Association of American Publishers).

A growing variety of software and hardware readers is available, including the iPhone with the Stanza application and smartphones running FBReader.

Conclusions

Currently ePub seems to be the most reasonable choice as an e-book file format.

First of all, it is an open standard and not tied to any specific vendor. No fees have to be paid for implementing it, no specific software or hardware readers need to be used, and no specific creation software needs to be purchased.

Secondly, the format is based on XHTML/XML+CSS and ZIP, technologies that are widely supported and implemented and have a strong tool base.

Thirdly, the format is rather unique in that it can be used both as a workflow/source format in the publishing pipeline and as an end-user format, which minimizes the need to support several standards. It does not lose layout semantics when packaged, so it can be used to store and/or exchange e-book files.

And finally, the format has received strong backing from the digital publishing industry as well as from public institutions such as libraries. Several well-known publishing houses offer ePub among their e-book formats and report its increasing popularity. Software and hardware reader support is constantly increasing, as is support by commercial publishing software.


5. War on Internet Explorer 6

2009-03-05 10:58:51 by Martynas Jusevičius

The Internet Explorer 6 browser was first released by Microsoft in 2001 and is badly outdated by today's standards. It has long been hated by Web developers for its bugs and lack of support for Web standards. Recently some of them got fed up, and a campaign has been started to encourage IE 6 users to upgrade to a modern browser such as Firefox, Opera, or Safari. It seems to have started in Norway, but has now spread across Scandinavia and as far as Australia. There are ready-made widgets that Web developers can include in their sites to warn IE 6 users about the problem.

I guess there have been similar attempts before, but this time it is becoming widespread, probably because it has been initiated collectively by some well-known big sites. I have been wanting to do a similar thing for quite a while, but did not dare to annoy the users. Now I might actually have a reason to implement it, in support of the campaign. Hopefully we will all be rid of IE 6 soon.
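For illustration, a warning of this kind could be driven by a simple server-side check of the User-Agent header. This is only a sketch and not how the ready-made widgets work: the function name, the CSS class and the warning text are made up, and User-Agent sniffing is inherently unreliable.

```php
<?php
// Sketch only: returns true when the first MSIE token in the User-Agent
// string reports major version 6.
function isInternetExplorer6(string $userAgent): bool
{
    // Old Opera versions could identify themselves as MSIE, so skip them
    if (stripos($userAgent, 'Opera') !== false) {
        return false;
    }
    // Only the first MSIE token counts; later ones may be leftovers
    // appended by toolbars or compatibility modes
    return preg_match('/MSIE (\d+)\./', $userAgent, $matches) === 1
        && $matches[1] === '6';
}

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (isInternetExplorer6($userAgent)) {
    echo '<p class="ie6-warning">You are using an outdated browser. ',
         'Please consider upgrading to a modern one.</p>';
}
```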

