Blog

2009 posts (16)


1. HTML 5 won over XHTML 2

2009-07-06 21:23:42 by Martynas Jusevičius

Some time ago I was wondering what the future of XHTML 2 and HTML 5 was going to be. Now it seems to be clear: the W3C has announced the end of work on XHTML 2, and hopes to accelerate work on HTML 5.

It's good to have it finally sorted out, and HTML 5 seems to bring some long-awaited features. However, one thing about it worries me a lot, and that is the syntax. The specification now defines two variants of it: HTML 5 (old-school HTML style, where closing tags may be missing) and XHTML 5 (XML serialization, with namespaces etc.).

Years have been spent on developer education and on improving XML support in browsers and parsers, and HTML 5 seems to make that effort useless and bring even more confusion. Even though the W3C promises that the XML serialization of HTML will remain compatible with XML (I hope there were no plans to change that, at least?), we will still have 2 syntaxes and 2 sets of tools to process them. Convergence to a single XML serialization, which was once beginning to seem possible, will be as far away as it was 10 years ago.

HTML is not easier to parse than XHTML, is supported at basically the same level, has no extension mechanism (like namespaces), and is arguably less logical to write (when is there a closing tag, and when not?). So why the nonsense?
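To make the "2 syntaxes, 2 sets of tools" point concrete, here is a minimal sketch in Python. It uses only the standard library; the two markup strings are made-up examples, and the helper names are hypothetical. The HTML-style fragment is rejected by an XML parser and needs a separate HTML parser, while the XML serialization works with generic XML tooling.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up fragments: the same content in the two syntaxes.
html_style = "<p>First<br>Second"
xml_style = '<p xmlns="http://www.w3.org/1999/xhtml">First<br/>Second</p>'


def is_well_formed(markup):
    """Return True if a generic XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Collects start tags using the separate, HTML-specific parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(html_style)
collector.close()

print(is_well_formed(xml_style))   # the XML serialization parses with XML tools
print(is_well_formed(html_style))  # the HTML serialization does not
print(collector.tags)              # it needs the HTML parser instead
```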


2. Wolfram|Alpha

2009-05-18 02:04:12 by Martynas Jusevičius

Wolfram|Alpha is just mind-blowing.

I don't know if it uses any technologies from the Semantic Web stack like RDF, but it is believed to contain “10+ trillion of pieces of data”. If we say one piece is similar to an RDF triple, then current triple stores come nowhere near: tens of billions of triples is still a challenge for them.
However, it truly is one of the first important pieces of the Semantic Web, and it is still very young. It's what the future Web should be like.

If Alpha keeps its promise, “Who wants to be a millionaire” should get rid of factual/scientific/statistical questions. Or of the help over the phone.


3. Semantic Web explained

2009-05-15 01:57:21 by Martynas Jusevičius

The article Tying Web 3.0, the Semantic Web and Linked Data Together explains the main concepts of the Semantic Web clearly and at length. It also reminded me of a couple of things.

The first one is a one-and-a-half-year-old post on this blog, Relational model does not fit the Web, which received a mixed response.

The other one is something I have had in mind for a while: a list of very basic differences/advantages of RDF/OWL over the relational model:

  1. your record ID (primary key) becomes a URI, so you can define foreign keys pointing to any record in any data source on the Web
  2. there are standard generic serialization formats, so you don't have to invent your own every time
  3. the schema uses the same data model as the data, but the data can just as well live without any schema
  4. it allows you to build applications that are generic, that is, can operate on different data without knowing its specific concepts and data types
  5. schemas can be global, shared, reused, and extended
  6. you can merge your databases without making any changes
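Points 1, 4 and 6 can be illustrated with a small sketch in plain Python, without any RDF library. Triples are modeled as (subject, predicate, object) tuples; the example.org URIs and the helper names `merge` and `objects` are assumptions made up for this illustration, while the two FOAF property URIs are real.

```python
# Triples are (subject, predicate, object) tuples.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Two independent data sources (hypothetical URIs), each just a set of triples.
source_a = {
    ("http://a.example.org/people/1", FOAF_NAME, "Alice"),
    # A "foreign key" pointing at a record that lives in another data source.
    ("http://a.example.org/people/1", FOAF_KNOWS, "http://b.example.org/users/bob"),
}
source_b = {
    ("http://b.example.org/users/bob", FOAF_NAME, "Bob"),
}


def merge(*graphs):
    """Merging is a plain set union: no schema changes, no ID remapping."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def objects(graph, subject, predicate):
    """Generic lookup that knows nothing about people or names."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


merged = merge(source_a, source_b)
friend = objects(merged, "http://a.example.org/people/1", FOAF_KNOWS)[0]
friend_names = objects(merged, friend, FOAF_NAME)
print(friend_names)
```

Because the identifiers are global URIs, the cross-source link resolves as soon as the two sets are combined. With numeric primary keys, the IDs of the two databases would first have to be reconciled.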

Some more?


4. ODF to XHTML or ePub

2009-05-08 13:25:44 by Martynas Jusevičius

I am looking for a way to convert Open Document Format (ODF) files into XHTML, and later package them as ePub. It should be a tool that can be integrated into a publishing workflow, not user-oriented software.

I would especially like an XSLT solution, and it seems that there is one in the form of the odf2xhtml filter for OpenOffice.org. Has anyone worked with it? The stylesheets can be found in CVS:
http://framework.openoffice.org/source/browse/framework/filter/source/xslt/

But maybe there are other options? Or a direct way to ePub?


5. API as remote Model

2009-04-15 15:38:05 by Martynas Jusevičius

When developing with the DIY Framework we're using an internal XML serialization of objects, and XSLT stylesheets as templates.
I'm not proud of the serialization code. It started as a quick hack to get the data into a usable format, and has improved only a little since then. It builds XML at the string level, when in fact it should be using DOM for that. Serialized objects retain their database IDs, on which they can later be joined in the stylesheet across several XML documents.

Later I was developing a RESTful API to publish XML that was intended to be public and used remotely, not internally. Then I caught myself producing yet another serialization of Model objects, this time using DOM and based on HTTP principles and resource URIs.

I soon realized that I was doing the same work twice, and that a single XML serialization could serve both purposes. It should be based on the API version, i.e. employ URIs and not IDs (and use DOM of course). I would have to adjust the stylesheets (especially the joins) accordingly, but that should not be that hard, because both serializations only support 2 general types of content: a list of items (array), and a single item.

A useful outcome of such an architecture would be that the API would effectively serve as the Model (as in MVC), either internally or remotely. On the other hand, the XSLT stylesheets would also become decoupled: they could be used on the server side as well as the client side. They would use internal XSLT processor calls to retrieve the necessary XML documents in the first case, and remote HTTP calls to the API in the second.
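A rough sketch of what such a single serialization could look like, written in Python rather than the framework's own language. It is not the DIY Framework's code: the base URI, the element and attribute names, and the record layout are all assumptions for illustration. It builds the XML through DOM, identifies records by URI instead of database ID, and covers the 2 content types (a single item and a list of items).

```python
from xml.dom.minidom import Document

BASE_URI = "http://example.org/api"  # hypothetical API root


def item_element(doc, kind, record):
    """Serialize one record as an element identified by a URI, not a database ID."""
    element = doc.createElement(kind)
    element.setAttribute("uri", f"{BASE_URI}/{kind}s/{record['id']}")
    for name, value in record.items():
        if name == "id":
            continue  # the database ID is only used to build the URI
        child = doc.createElement(name)
        child.appendChild(doc.createTextNode(str(value)))
        element.appendChild(child)
    return element


def serialize_item(kind, record):
    """Content type 1: a single item."""
    doc = Document()
    doc.appendChild(item_element(doc, kind, record))
    return doc.toxml()


def serialize_list(kind, records):
    """Content type 2: a list of items under one root element."""
    doc = Document()
    root = doc.createElement(kind + "s")
    root.setAttribute("uri", f"{BASE_URI}/{kind}s")
    doc.appendChild(root)
    for record in records:
        root.appendChild(item_element(doc, kind, record))
    return doc.toxml()


post_xml = serialize_item("post", {"id": 7, "title": "Tom & Jerry"})
list_xml = serialize_list("post", [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}])
print(post_xml)
print(list_xml)
```

Going through DOM means special characters such as the ampersand in the title are escaped by the serializer, which is exactly what string-level XML building tends to get wrong.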


6. The Unreasonable Effectiveness of Data

2009-03-27 00:36:35 by Martynas Jusevičius

It is nice to see Google embracing the Semantic Web in a recent paper by its top research scientists, “The Unreasonable Effectiveness of Data”. Here is an excerpt from the summary:

So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.


7. Regexp string replace with PHP XSL

2009-03-19 13:26:52 by Martynas Jusevičius

While XSLT 2.0 has regular expression support, it is missing in XSLT 1.0. Tasks like string pattern matching and replacement cannot be done in native XSLT 1.0 code; extension functions are needed for that. Luckily, it can be achieved with PHP XSL in the same fashion as URL-encoding: by registering PHP function support on the XSLT processor and calling native PHP functions as XPath functions in the PHP namespace.

One of the common use cases for this functionality, when using XSLT for View templates, is replacing http:// links in text (for example, message content from a database) with actual <a> hyperlink elements. Here is a quick-and-dirty PHP function that does that:

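abstract class FrontEndView extends XSLTView
{
	// ...

	/**
	 * Turns URLs in plain text into <a> elements and returns the result as a DOM element.
	 */
	public static function replaceLinks($text)
	{
		// Escape markup first so that the text can later be parsed as XML
		$text = htmlspecialchars($text);
		// preg_replace() is used because ereg_replace() was deprecated in PHP 5.3 and removed in PHP 7
		$text = preg_replace('~[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]~', '<a href="$0">$0</a>', $text);
		// Wrap everything in a single XHTML root element so that the fragment is well-formed
		$text = "<div xmlns=\"http://www.w3.org/1999/xhtml\">".$text."</div>";
		$doc = new DOMDocument();
		$doc->loadXML($text);
		return $doc->documentElement;
	}
}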

It replaces link text with hyperlinks at the string level, then loads the result into a DOM document and returns its root element. Before the function can be called from a stylesheet, registerPHPFunctions() needs to be called on the XSLT processor instance.

The XSLT code then looks like this:

<xsl:copy-of select="php:function('FrontEndView::replaceLinks', string($text))/node()"/>

Notice that while you have to pass string content to the function, what you get back is a node list, because the replaced text becomes XML elements, which are mixed with the unreplaced text nodes.
Don't forget to declare the PHP namespace (http://php.net/xsl) in your stylesheet and use exclude-result-prefixes="php" so that it does not appear in the result document.

I guess similar functions could be used to implement BBCode or wiki syntax support.
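For comparison, here is the same idea (plain text in, an XHTML element with mixed text and <a> nodes out) as a standalone Python sketch. It is not part of the PHP XSL setup above, and it deliberately differs in one respect: URLs are matched on the raw text and each piece is escaped separately, so that escaped quotes next to a URL cannot end up inside the match. The function name and the regular expression are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET
from html import escape

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Scheme, "://", then anything except markup, whitespace and quotes,
# ending on an alphanumeric character or a slash.
URL = re.compile(r"[A-Za-z]+://[^<>\s\"']+[A-Za-z0-9/]")


def replace_links(text):
    """Return a <div> element in which every URL in text is wrapped in <a>."""
    parts = []
    pos = 0
    for match in URL.finditer(text):
        parts.append(escape(text[pos:match.start()]))  # plain text before the URL
        url = escape(match.group(0))
        parts.append(f'<a href="{url}">{url}</a>')
        pos = match.end()
    parts.append(escape(text[pos:]))  # remaining plain text
    markup = f'<div xmlns="{XHTML_NS}">' + "".join(parts) + "</div>"
    return ET.fromstring(markup)


div = replace_links('see http://example.org/a?b=1&c=2 "now"')
link = div.find(f"{{{XHTML_NS}}}a")
print(link.get("href"))
```

The returned element mixes text nodes and <a> elements, which mirrors the node list that the stylesheet receives from the PHP function.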


8. Tim Berners-Lee on Linked Data at TED

2009-03-14 03:14:29 by Martynas Jusevičius

A great and accessible presentation on Linked Data (the core technology behind the Semantic Web, together with RDF) and DBpedia, given at the TED conference by the inventor of the World Wide Web.


9. E-book formats

2009-03-11 20:06:27 by Martynas Jusevičius

I was recently doing an analysis of e-book formats for a client, and can share some of my findings here.

Criteria

Here are the main criteria for a good e-book format that I came up with (in descending order of importance):

  1. Open standard: the format is documented in detail and publicly available free of charge
  2. XML-based: the e-book file format itself (not the source formats) is based on XML and can be produced using standard XML tools only
  3. End-user-oriented: the format is meant to be used for the final workflow product, i.e. the e-book file downloaded to the user's (software or hardware) reader
  4. Reader-independent: there is more than one reader for this format, released by different vendors
  5. Workflow-oriented: the format can be easily transformed to and from a variety of different e-book formats and used in the whole workflow (e.g. exchange and storage) of e-book production
  6. Vendor-independent: no vendor-dependent tools are necessary to produce the book file
  7. Reflowable: text is rearranged to fit the size of the reader window, instead of the user having to zoom and pan to see the full document

If we stick to the criteria and look for a long-term solution, formats like Mobipocket and iSilo fail at the first stage for being binary, proprietary, and/or vendor-dependent. They require commercial software to transform source documents into end-user e-book files. That leaves us with basically 2 file formats.

PDF

Adobe's Portable Document Format (PDF) is the first of them. Actually, it does not meet the criteria above either, but it is so popular and so widely supported that it cannot be ignored in such a comparison.

The main cause of PDF's limitations is the historical relation to printed media. Its binary structure does not preserve many of the semantics of the original layout and simply places characters and graphical elements at specified coordinates on the page's surface. For example, a table cannot be extracted from a PDF document, since it appears as text placed at specific places and a bunch of graphic lines. Alternatively, a table may exist as an image in the file. Both variants result in increased file size.

Since each document is intended for a specific page size, it is problematic to display it on screens of limited size or resolution, such as those of mobile devices. Reflow is only possible if the document was specifically tagged at creation time, which excludes a majority of existing documents. Even then, support for reflow is not guaranteed on the reader.

For these reasons PDF is meant for the final products in the digital publishing workflow, i.e. documents to be printed or displayed to the end-user. It cannot be easily used as an intermediary format in the workflow (format of source documents) and transformed to other formats. There are many software readers available, usually distributed free of charge, the best known of course being Adobe Acrobat. Most hardware readers support PDF as well.

There are also many PDF creation tools. Most of Adobe's products support the format, as do office packages such as OpenOffice and Microsoft Office, and TeX and DocBook tools. There are also a number of virtual PDF printers, which create a PDF file instead of actual printer output.

Choices of editing software are limited because of the format's complexities.

PDF supports encryption and DRM.

ePub

ePub is a relatively new XML-based format for reflowable digital books and publications, which seeks to increase interoperability between software as well as hardware tools in the digital publishing industry.

ePub is an open standard, defined not by some vendor but by a standards body with members from the industry, namely the International Digital Publishing Forum (IDPF). It is meant to be used both as a workflow format and as an end-user reader format.

ePub is actually a set of 3 related standards:

OPS defines a standard presentation of electronic books that is accessible on different readers and displayed equivalently. OPS 2.0 is based on XML and uses the XHTML 1.1 vocabulary (DTD) with some extensions (such as SVG) to describe content structure, and CSS 2 to describe its style. That means OPS and ePub can be easily produced and consumed using standard XML and CSS tools such as XSLT; no special software is necessary. Readers can also be easily implemented on top of existing Web browser libraries.

OPF defines the mechanism by which the various components of an OPS publication are tied together and provides additional structure and semantics to the electronic publication: for example, it describes its components (markup files, images etc.), metadata, and table of contents.

OCF is a general-purpose container technology based on the widely-used ZIP compression format. It defines the standard mechanism by which all components of an electronic publication can be packaged together into a single file for transmission, delivery and archival.
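Since OCF is plain ZIP, packaging can be sketched with nothing but Python's standard zipfile module. The sketch below assumes the usual OCF layout: a mimetype entry that comes first and is stored uncompressed, and a META-INF/container.xml that points to the OPF file. The OEBPS/ paths, the function name `package_epub` and the placeholder file contents are made up for illustration, and the result is not a complete, valid e-book.

```python
import io
import zipfile

# container.xml tells the reader where to find the OPF package file.
CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def package_epub(files):
    """Pack a {path: bytes} mapping into an in-memory OCF (ZIP) container."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry goes first and is stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        for path, data in files.items():
            archive.writestr(path, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


epub_bytes = package_epub({
    "OEBPS/content.opf": b"<package/>",    # placeholder, not a valid OPF document
    "OEBPS/chapter1.xhtml": b"<html/>",    # placeholder, not a valid OPS document
})
print(len(epub_bytes) > 0)
```

In a real workflow the OPF and XHTML files would come out of the XSLT transformation step, and this packaging step would simply be the last stage of the pipeline.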

DRM is not integrated into these standards, but may be layered on top (for example, implemented in the reader application). However, many experts discourage the use of DRM (with ePub and in general), and even blame it for the unpopularity of e-books and for failures already experienced by the music industry. Many publishers are abandoning it as well, or use so-called “social” DRM.

The format is being adopted by publishers such as Penguin Books and O'Reilly, as well as by public libraries, and is backed by associations in the digital publishing industry such as IDPF and AAP (Association of American Publishers).

A growing variety of readers is available, both software and hardware, including the iPhone with the Stanza application and smartphones running FBReader.

Conclusions

Currently ePub seems to be the most reasonable choice as an e-book file format.

First of all, it is an open standard and not tied to any specific vendor. No fees have to be paid for implementing it, no specific software or hardware readers have to be used, and no specific creation software has to be purchased.

Secondly, the format is based on XHTML/XML+CSS and ZIP, technologies that are widely supported and implemented and have a strong tool base. Thirdly, the format is rather unique in the sense that it can be used both as a workflow/source format in the publishing pipeline and as an end-user format, which minimizes the need to support several standards. It does not lose layout semantics when packaged, therefore it can be used to store and/or exchange e-book files.

And finally, the format has received strong backing from the digital publishing industry as well as from public institutions such as libraries. Several well-known publishing houses are offering ePub among their e-book formats and report its increasing popularity. Software and hardware reader support is constantly increasing, as is support by commercial publishing software.


10. War on Internet Explorer 6

2009-03-05 10:58:51 by Martynas Jusevičius

The Internet Explorer 6 browser was first released by Microsoft in 2001 and is badly outdated by today's standards. It has long been hated by Web developers for its bugs and lack of support for Web standards. Recently some of them got fed up with that, and a campaign has been started to encourage IE 6 users to upgrade to a modern browser such as Firefox, Opera, or Safari. It seems to have started in Norway, but has now spread across Scandinavia and as far as Australia. There are ready-made widgets that Web developers can include in their sites to warn IE 6 users about the problem.

I guess there have been similar attempts before, but this time it is becoming widespread, probably because it has been initiated collectively by some well-known big sites. I have been wanting to do a similar thing for quite a while, but did not dare to annoy the users. Now I might actually have a reason to implement it, in support of the campaign. Hopefully we will all get rid of IE 6 soon.

