Scraping HTML with DOM
2008-03-11 11:28:45 by Martynas Jusevičius
HTML scraping is used to extract structured data from webpages intended for human readers. It is a common way to work around old-school websites which do not provide data feeds in machine-readable formats such as RDF or at least custom XML.
Scraping can be implemented in several ways. Regular expressions (regexps) are probably the most widely used technique. They employ specific syntax rules to define patterns that have to be matched in a string.
However, the regexp approach has several issues. Because HTML source is complex, pattern strings soon become lengthy. The pattern syntax is not trivial and may differ between systems. That leads to complicated and non-intuitive code, where the program logic (such as conditional cases) is hard-coded in the pattern and therefore not obvious.
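To make these issues concrete, here is a minimal sketch of the regexp approach applied to a two-column table of names and e-mail addresses. The sample markup, the pattern and the variable names are assumptions made up for illustration; a real page would most likely need a longer pattern.

```php
<?php
// Hypothetical sample markup: one row per student, with name and e-mail cells.
$html = <<<HTML
<table>
<tr><td>Alice Smith</td><td>alice@example.org</td></tr>
<tr>
  <td class="name">Bob Jones</td>
  <td>bob@example.org</td>
</tr>
</table>
HTML;

// Every detail of the markup has to be spelled out in the pattern:
// optional attributes, whitespace between tags, and non-greedy cell content.
$pattern = '~<tr[^>]*>\s*<td[^>]*>(.*?)</td>\s*<td[^>]*>(.*?)</td>\s*</tr>~is';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$students = [];
foreach ($matches as $m) {
    // $m[1] and $m[2] are the two captured cells; strip_tags() drops markup nested inside a cell.
    $students[] = [trim(strip_tags($m[1])), trim(strip_tags($m[2]))];
}

print_r($students);
```

Even for this tiny table the pattern is hard to read, and the assumption that each row has exactly two td cells is buried inside it rather than stated in the program logic.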
Moreover, regular expressions operate on a generic string level and have no knowledge of HTML tags or the tree-shaped document model that they form. This means that special care has to be taken with insignificant whitespace (such as line breaks), control characters have to be escaped, and so on. In situations where tree-like structures need to be scraped (for example, nested comment or e-mail threads), sequential matching makes it difficult to figure out an item's position in the tree, i.e. to relate the item to its parent, if there is one. Recursion would be an appropriate technique in this case, but regular expressions aren't really meant for recursive solutions.
Another approach to scraping is to use the Document Object Model (DOM), a standard object model for representing HTML or XML documents. DOM is usually associated with client-side scripting, but it can be used equally well on the server side: support is provided on many platforms, including Java libraries and a PHP extension. Fortunately, PHP's DOM extension is able to parse even invalid HTML, which is what most pages actually contain.
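As a rough sketch of how DOM handles tree-like structures, the following snippet walks a nested comment thread recursively. The markup (a thread written as nested ul/li lists with its final closing tags missing) and the walkThread() function are hypothetical and exist only for this example.

```php
<?php
// Hypothetical comment thread as nested lists; the final </li> and </ul> are missing on purpose.
$html = '<ul><li>First comment<ul><li>Reply A</li><li>Reply B</li></ul></li><li>Second comment';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Recursively collect each list item together with its nesting depth.
function walkThread(DOMNode $list, $depth, array &$out)
{
    foreach ($list->childNodes as $item) {
        if ($item->nodeName !== 'li') {
            continue; // skip text nodes and anything that is not a list item
        }
        $text = '';
        $replies = [];
        foreach ($item->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $text .= $child->nodeValue; // the item's own text
            } elseif ($child->nodeName === 'ul') {
                $replies[] = $child; // nested thread, handled below
            }
        }
        $out[] = [$depth, trim($text)];
        foreach ($replies as $sublist) {
            walkThread($sublist, $depth + 1, $out); // the parent is known from the call stack
        }
    }
}

$thread = [];
$root = $doc->getElementsByTagName('ul')->item(0); // outermost list
walkThread($root, 0, $thread);

foreach ($thread as [$depth, $text]) {
    echo str_repeat('  ', $depth), $text, "\n"; // indent replies under their parent
}
```

Because every item is reached through its parent list, relating a reply to the comment it answers needs no extra bookkeeping, which is exactly the part that sequential pattern matching makes difficult.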
Here is a simple PHP scraper which turns a table with student information into FOAF data:
<?php
class StudentScraper
{
    private $doc = null;

    public function __construct($url)
    {
        $htmlString = file_get_contents($url); // load HTML file as a string
        if ($htmlString === false) {
            throw new RuntimeException("Could not load " . $url);
        }
        libxml_use_internal_errors(true); // collect warnings about invalid HTML instead of printing them
        $this->doc = new DOMDocument();
        $this->doc->loadHTML($htmlString); // load string into DOM document object
        libxml_clear_errors();
    }

    public function process() // return RDF/XML string
    {
        $xml = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\">\n";
        $table = $this->doc->getElementsByTagName("table")->item(0); // first table element
        if ($table === null) {
            return $xml . "</rdf:RDF>"; // no table found, return an empty document
        }
        // select rows by tag name: childNodes would also include whitespace text nodes
        foreach ($table->getElementsByTagName("tr") as $tr)
        {
            $cells = $tr->getElementsByTagName("td");
            if ($cells->length < 2) {
                continue; // skip header rows and rows without both cells
            }
            $name = trim($cells->item(0)->textContent); // content of first cell
            $eMail = trim($cells->item(1)->textContent); // content of second cell
            $xml .= "  <foaf:Person>\n"; // construct RDF/XML
            $xml .= "    <foaf:name>".htmlspecialchars($name)."</foaf:name>\n";
            $xml .= "    <foaf:mbox rdf:resource=\"mailto:".htmlspecialchars($eMail)."\"/>\n";
            $xml .= "  </foaf:Person>\n";
        }
        $xml .= "</rdf:RDF>";
        return $xml;
    }
}

$scraper = new StudentScraper("http://some.university.com/students.htm"); // run scraper
print $scraper->process();
Here we load the contents of the HTML file into a DOM document. Student data is found in a table, which contains a row for each student. Each row contains two cells: one for the name and one for the e-mail address. We iterate through the rows and create one foaf:Person entry per student. When run, the scraper returns an RDF/XML string with FOAF data.
Constructing XML as a plain string is an error-prone practice and should also be done using DOM, but we leave it here for simplicity and clarity.
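For comparison, here is one possible sketch of building the same RDF/XML with DOM instead of string concatenation. The buildFoaf() helper and its input format (a list of name and e-mail pairs) are assumptions introduced for this example, not part of the scraper above.

```php
<?php
// Hypothetical helper: $students is assumed to be a list of [name, e-mail] pairs.
function buildFoaf(array $students)
{
    $rdfNs  = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
    $foafNs = 'http://xmlns.com/foaf/0.1/';

    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true; // indent the serialized XML

    $root = $doc->createElementNS($rdfNs, 'rdf:RDF');
    $doc->appendChild($root);
    // declare the foaf prefix once on the root element
    $root->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns:foaf', $foafNs);

    foreach ($students as [$name, $eMail]) {
        $person = $doc->createElementNS($foafNs, 'foaf:Person');
        $root->appendChild($person);

        $nameEl = $doc->createElementNS($foafNs, 'foaf:name');
        $person->appendChild($nameEl);
        $nameEl->appendChild($doc->createTextNode($name)); // text is escaped on output

        $mbox = $doc->createElementNS($foafNs, 'foaf:mbox');
        $person->appendChild($mbox);
        $mbox->setAttributeNS($rdfNs, 'rdf:resource', 'mailto:' . $eMail); // attribute values are escaped too
    }

    return $doc->saveXML();
}

$xml = buildFoaf([['Alice & Bob Smith', 'alice@example.org']]);
echo $xml;
```

The ampersand in the sample name is serialized as &amp; without any call to htmlspecialchars(), and a missing closing tag can no longer slip in, because the document is assembled from nodes rather than text.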
