Good citizenship on the Web of Data

August 6th, 2009

<executive_summary>

“If you consume open data you should publish as open data.”
I’m going to call this Principle 5, as things need to be named, and also of course give it a URI: http://www.rkbexplorer.com/blog/?p=33

What am I talking about?

In the bright new world, there will be lots of data from government (and elsewhere) out there for others to consume. Much of the discussion on this list advises the owners of that data to publish it in a way that makes it easily accessible.
However, we should remember that every time some of that data gets consumed, new data is generated. The people who generate that new data should feel at least as much obligation to publish it as open data as was placed on the publishers of the data they consumed.

One might say: “What’s sauce for the goose is sauce for the gander.”

</executive_summary>

Of course many people already do this, but I think it is worth pointing out as people start to build more systems that consume.
It is quite easy to build a system that publishes, as long as publishing is designed in from the start; on the other hand, once an intricate web page has been built, adding the data publishing facilities later can be difficult and incredibly time-consuming.

As a simple example of what I mean, let’s take a site that consumes education data, and tells you a little bit about it, perhaps by showing a map where I can click on an area and find out how many kids go to school there.
Does the site publish URIs for each of these statistics?
Is it easy to find these things, or do I need some complex API?
Can I get a dump of the whole dataset?
What formats are offered?
Are there interesting html fragments that someone else might use?
Is the license clear?
Even, is there a SPARQL or other querying endpoint?
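
As a rough illustration of the questions above, here is a minimal sketch of a consuming site giving one derived statistic its own URI and offering it in more than one format. Everything specific in it is an assumption made for illustration: the example.org URIs, the pupilCount predicate and the function names are hypothetical, and are not taken from any real site.

```php
<?php
// Hypothetical sketch: one derived statistic, published under its own URI
// in three formats. All URIs and names below are invented for illustration.

/** Build the URI that identifies the statistic for one area. */
function statistic_uri($base, $area_id) {
    return rtrim($base, "/") . "/id/pupil-count/" . rawurlencode($area_id);
}

/** Quote and escape a string as an N-Triples literal. */
function ntriples_literal($s) {
    $map = array("\\" => "\\\\", "\"" => "\\\"", "\n" => "\\n", "\r" => "\\r");
    return '"' . strtr($s, $map) . '"';
}

/** Render the statistic as N-Triples, JSON or an HTML fragment; null if the format is unknown. */
function render_statistic($uri, $area_label, $count, $source_uri, $format) {
    $count = (int) $count;
    switch ($format) {
        case "ntriples":
            $xsd_int = "http://www.w3.org/2001/XMLSchema#integer";
            return "<$uri> <http://www.w3.org/2000/01/rdf-schema#label> "
                 . ntriples_literal("Pupils at school in $area_label") . " .\n"
                 // example.org/def/pupilCount is a made-up predicate
                 . "<$uri> <http://example.org/def/pupilCount> \"$count\"^^<$xsd_int> .\n"
                 // record where the consumed data came from
                 . "<$uri> <http://purl.org/dc/terms/source> <$source_uri> .\n";
        case "json":
            return json_encode(array(
                "uri" => $uri,
                "area" => $area_label,
                "pupilCount" => $count,
                "source" => $source_uri,
            ));
        case "html":
            // a fragment that someone else could drop into their own page
            return "<span class=\"pupil-count\" data-uri=\"" . htmlspecialchars($uri) . "\">"
                 . htmlspecialchars($area_label) . ": $count pupils</span>";
        default:
            return null;
    }
}
```

A page that already draws the map has all of these values to hand, so calling something like render_statistic() for each area is a small step compared with adding publishing after the fact.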

I think quite often a consumer who does something like this doesn’t really think they have generated much data, and so doesn’t engage with publishing; but each step along the way adds value, and they should celebrate the fruits of their labours by making them easily accessible.
Even taking some data and doing a nice html rendering can be really useful to someone who just wants to add something interesting to their own page.

This leads to another issue on dataset directories.
We should not consider it satisfactory just to list the source datasets: we should consider everything a source, and so try to record a graph of dataset derivation.
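
To make the idea of a derivation graph a little more concrete, here is a small sketch, under assumptions of my own, of how a directory might record it. It uses the Dublin Core property dcterms:source for each "derived from" link; the dataset URIs and function names are hypothetical.

```php
<?php
// Hypothetical sketch: a directory that records which datasets were derived
// from which, rather than listing only the original sources.

/** Emit one dcterms:source triple per derivation link, as N-Triples. */
function derivation_triples(array $derived_from) {
    $out = "";
    foreach ($derived_from as $dataset => $sources) {
        foreach ($sources as $source) {
            $out .= "<$dataset> <http://purl.org/dc/terms/source> <$source> .\n";
        }
    }
    return $out;
}

/** Follow the links back to every dataset that $dataset depends on, directly or not. */
function all_sources(array $derived_from, $dataset) {
    $seen = array();
    $stack = array($dataset);
    while ($stack) {
        $current = array_pop($stack);
        $parents = isset($derived_from[$current]) ? $derived_from[$current] : array();
        foreach ($parents as $parent) {
            if (!isset($seen[$parent])) {   // also guards against cycles
                $seen[$parent] = true;
                $stack[] = $parent;
            }
        }
    }
    $result = array_keys($seen);
    sort($result);
    return $result;
}

// Made-up example: a map site built on a statistics site built on two sources.
$derived_from = array(
    "http://example.org/dataset/school-map" => array(
        "http://example.org/dataset/pupil-counts",
    ),
    "http://example.org/dataset/pupil-counts" => array(
        "http://example.org/dataset/gov-education",
        "http://example.org/dataset/gov-boundaries",
    ),
);
```

With the links recorded this way, all_sources($derived_from, "http://example.org/dataset/school-map") walks back through the intermediate dataset to the two original ones, so the directory can show the whole chain rather than only its ends.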

Finally, why call it Principle 5?
That relates to Linked Data: there are four principles at the moment, and I happen to think that this is so important that Tim might decide to add it as a fifth principle: “If you consume Linked Data, you should publish as Linked Data”.

As they say, my 2 cents worth.
Best
Hugh

How do we generate the sitemap.xml and submit to the search engines?

August 2nd, 2009
#!/usr/bin/php -q
<?php
require_once "/usr/lib/rkb/functions-utf.inc.php";
/**
Ian Millard and Hugh Glaser
generates sitemap.xml for a linked data site that has
a triplestore which resolves URIs to provide a Symmetric Concise Bounded Description,
as well as a SPARQL endpoint,
and RDF files which are the source that populated the triplestore.
Now submits to search engines:
SWSE, Sindice, Ping the Semantic Web
*/
$usage = "Usage: {$argv[0]} sub_domain_name\n";
if(!isset($argv[1])) die($usage);
$base_domain = "rkbexplorer.com";
$sub_domain = $argv[1];
$domain = $sub_domain.".".$base_domain;
$outfile = "../$domain/sitemap.xml";
$file = fopen($outfile, "w");
$slicing = "subject-object";
// optional per-site metadata, read from the about/ directory when present
$name = ""; if (file_exists("../$domain/about/name.txt")) $name = trim(entities2accents(file_get_contents("../$domain/about/name.txt")));
$typical = ""; if (file_exists("../$domain/about/typical.txt")) $typical = trim(file_get_contents("../$domain/about/typical.txt"));
$updated = array(); exec("/var/www/vhosts/wildcard.rkbexplorer.com/repositories/tools/rkb-utils last-update-w3c ".escapeshellarg($sub_domain), $updated);
// changefreq.txt overrides the default change frequency
$changefreq = "monthly"; if (file_exists("../$domain/about/changefreq.txt")) $changefreq = trim(file_get_contents("../$domain/about/changefreq.txt"));
fwrite($file,"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
fwrite($file,"<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"\n");
fwrite($file,"        xmlns:sc=\"http://sw.deri.org/2007/07/sitemapextension/scschema.xsd\">\n");
fwrite($file,"  <sc:dataset>\n");
fwrite($file,"    <sc:linkedDataPrefix slicing=\"$slicing\">http://$domain/id/</sc:linkedDataPrefix>\n");
fwrite($file,"    <sc:sparqlEndpointLocation>http://$domain/sparql/</sc:sparqlEndpointLocation>\n");
// list every RDF source file in models/ as a data dump
$models = opendir("../$domain/models");
while (false !== ($model = readdir($models))) {
    if (preg_match('/\.(rdf|ttl|n3|turtle|ntriples)$/', $model))
        fwrite($file,"    <sc:dataDumpLocation>http://$domain/models/$model</sc:dataDumpLocation>\n");
}
closedir($models);
fwrite($file,"    <sc:datasetURI>http://$domain/</sc:datasetURI>\n");
fwrite($file,"    <sc:datasetURI>http://$domain/id/void</sc:datasetURI>\n");
if ($name != "") fwrite($file,"    <sc:datasetLabel>$name RDF dataset from RKBExplorer.com</sc:datasetLabel>\n");
if ($typical != "") fwrite($file,"    <sc:sampleURI>$typical</sc:sampleURI>\n");
fwrite($file,"    <lastmod>$updated[0]</lastmod>\n");
fwrite($file,"    <changefreq>$changefreq</changefreq>\n");
fwrite($file,"  </sc:dataset>\n");
fwrite($file,"</urlset>\n");
fclose($file);
$sitemap_url = "http://$domain/sitemap.xml";
// Submit to SWSE
print "Submitting $domain to SWSE: \n";
$ch = curl_init("http://swse.deri.org/ping?sitemap=$sitemap_url");
curl_exec($ch);
print "\n";
curl_close($ch);
// Submit POST request to Sindice
print "Submitting $domain to Sindice: \n";
$data = "url=".urlencode($sitemap_url);
$fp = fsockopen("sindice.com", 80);
fputs($fp, "POST /api/v1/sitemap HTTP/1.0\r\n");
fputs($fp, "Host: sindice.com\r\n");
fputs($fp, "Content-type: application/x-www-form-urlencoded\r\n");
fputs($fp, "Content-length: ". strlen($data) ."\r\n\r\n");
fputs($fp, $data);
//  read result back from the sindice server
$result = '';
while(!feof($fp)) $result .= fgets($fp, 128);
fclose($fp);
//  report server response: the HTTP status line and the page heading, if any
$status = substr($result, 0, strpos($result, "\n"));
preg_match('@<h1>(.*?)</h1>@', $result, $matches);
$message = isset($matches[1]) ? $matches[1] : "";
print "\t$status\n\t$message\n\n";
//  submit to Ping the Semantic Web
passthru("./ptsw.py $sitemap_url");
?>
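
One thing the script does not do is escape the values it writes, so a dataset name containing an ampersand or an angle bracket would give a sitemap.xml that is not well-formed. The following is a minimal sketch of one way round that, not part of the script itself; the helper name xml_element is my own invention.

```php
<?php
// Hypothetical helper, not part of the script above: builds one XML element
// with its text and attribute values escaped for XML.
function xml_element($name, $text, array $attrs = array(), $indent = "    ") {
    $flags = ENT_XML1 | ENT_QUOTES;
    $attr_text = "";
    foreach ($attrs as $key => $value) {
        $attr_text .= " $key=\"" . htmlspecialchars($value, $flags, "UTF-8") . "\"";
    }
    return "$indent<$name$attr_text>"
         . htmlspecialchars($text, $flags, "UTF-8")
         . "</$name>\n";
}
```

In the script, a line such as the datasetLabel one could then be written as fwrite($file, xml_element("sc:datasetLabel", "$name RDF dataset from RKBExplorer.com")); and the slicing attribute passed as array("slicing" => $slicing).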

How to do the 303 redirect easily?

July 5th, 2009

  1. Create a web-accessible directory with all your .rdf, .ttl, .ntriples and .html files in it.
  2. Copy lodpub.php and path.php into it.
  3. Access path.php from your web server.
  4. Follow the instructions to paste that text into .htaccess
  5. You can then remove path.php; it was only there to help you get the .htaccess right.

lodpub.php
path.php
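
The contents of lodpub.php are not reproduced here, so the following is not that script. It is only a minimal sketch of the general idea behind a 303 redirect for Linked Data: look at the Accept header, pick one of the files in the directory, and send the client there with "303 See Other". The media type table, the function names and the example.org URLs are assumptions for illustration.

```php
<?php
// Hypothetical sketch of 303 redirection with content negotiation.

/** Media types this sketch assumes are available, mapped to file extensions. */
function media_type_extensions() {
    return array(
        "application/rdf+xml"   => "rdf",
        "text/turtle"           => "ttl",
        "application/n-triples" => "ntriples",
        "text/html"             => "html",
    );
}

/** Pick a file extension from an HTTP Accept header, preferring the highest q-value. */
function choose_extension($accept, $default = "html") {
    $known = media_type_extensions();
    $best = $default;
    $best_q = 0.0;
    foreach (explode(",", $accept) as $part) {
        $pieces = array_map("trim", explode(";", $part));
        $type = strtolower($pieces[0]);
        $q = 1.0;                                   // q defaults to 1 when absent
        foreach (array_slice($pieces, 1) as $param) {
            if (strpos($param, "q=") === 0) $q = (float) substr($param, 2);
        }
        if (isset($known[$type]) && $q > $best_q) {
            $best = $known[$type];
            $best_q = $q;
        }
    }
    return $best;
}

/** URL of the document that describes the thing identified by $id. */
function redirect_target($base, $id, $accept) {
    return rtrim($base, "/") . "/" . rawurlencode($id) . "." . choose_extension($accept);
}
```

In a script that handles requests for the identifier URIs, the redirect itself would then be something like header("Location: " . redirect_target($base, $id, $_SERVER["HTTP_ACCEPT"]), true, 303); followed by exit.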

What we do to work out co-reference.

July 4th, 2009

François Scharffe <francois.scharffe@inria.fr> asked us, so this is what I responded.

Here is a description of what we do in English.

The implementation may vary, but this is the rough idea.

It is really quite simple, I guess.

We are primarily concerned with organisations, people, publications, projects, research areas.

Research areas we have done by hand against relatively fixed ontologies.

Organisations and projects work in similar ways to publications, but I will describe the publications part.

1) To start with, there is absolutely no linkage, so we do a “coldstart”, and this is done on paper titles only.

Extracting all the string/URI pairs from all the KBs, we map each title to a lower-case string of its alphanumeric characters; if the result is sufficiently long (>= 20 characters) and identical for two URIs, then the URIs are considered the same (“smushed”).
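
The coldstart step can be sketched roughly as follows. This is an illustration of the description above rather than the code we actually run; the function names and the example URIs are made up, and it only treats ASCII letters and digits as alphanumeric.

```php
<?php
// Sketch of the title-only "coldstart" step; names and data are hypothetical.

/** Reduce a title to its lower-case ASCII alphanumeric characters. */
function title_key($title) {
    return preg_replace('/[^a-z0-9]/', '', strtolower($title));
}

/**
 * $pairs is a list of array(title string, URI).
 * Returns bundles of URIs whose title keys are identical and long enough.
 */
function coldstart_bundles(array $pairs, $min_length = 20) {
    $by_key = array();
    foreach ($pairs as $pair) {
        list($title, $uri) = $pair;
        $key = title_key($title);
        if (strlen($key) >= $min_length) {   // short titles are too ambiguous
            $by_key[$key][$uri] = true;
        }
    }
    $bundles = array();
    foreach ($by_key as $uris) {
        if (count($uris) > 1) $bundles[] = array_keys($uris);
    }
    return $bundles;
}
```

For example, "A Study of Co-reference on the Web of Data" and "A study of coreference on the web of data." both reduce to the same 33-character key, so their URIs end up in one bundle, whereas two papers that are both just called "Linked Data" are left alone because the key is too short.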

2) Now we can work on authors (string matching out of context would be too liberal). For the same (co-reffed) papers, the authors are fuzzy matched (cross product).
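
As a sketch of that cross product: the description does not say which fuzzy measure is used, so the one below (Levenshtein distance scaled by the longer name, with a threshold of 0.8) is purely an assumption for illustration, as are the function names.

```php
<?php
// Sketch of fuzzy-matching the authors of two records of the same paper.
// The similarity measure and threshold are assumptions, not the real ones.

/** Similarity in [0, 1] based on Levenshtein distance; 1 means identical. */
function name_similarity($a, $b) {
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) return 1.0;
    return 1.0 - levenshtein($a, $b) / $longest;
}

/**
 * $authors_a and $authors_b map author URI => name string for two records
 * that have already been co-reffed as the same paper.
 * Returns the pairs of URIs whose names are similar enough.
 */
function match_authors(array $authors_a, array $authors_b, $threshold = 0.8) {
    $matches = array();
    foreach ($authors_a as $uri_a => $name_a) {
        foreach ($authors_b as $uri_b => $name_b) {
            if (name_similarity($name_a, $name_b) >= $threshold) {
                $matches[] = array($uri_a, $uri_b);
            }
        }
    }
    return $matches;
}
```

Because only the handful of authors on one known paper are compared, a fairly liberal measure is tolerable here in a way that it would not be across the whole dataset.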

3) For each author string name, we find the co-authorship sets for each paper (we do this by starting with each unique name, to make it easier).

If there is an overlap of two or more co-author strings between different sets, then these authors are smushed.

The matching of names for this is not fuzzy, but does match name variants, as identified by previous co-ref work on the URI for the author name we are looking at.

(Another way of looking at it is that if we find three authors of the same name as paper authors, we smush them.)

Also, if there are exactly two authors with similar names, then we smush them.
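
The co-author overlap test in step 3 might look something like this. It is a simplified sketch: it compares co-author strings exactly and leaves out the name variants mentioned above, and the record layout and function name are hypothetical.

```php
<?php
// Sketch of the co-authorship overlap test; data layout is an assumption.

/**
 * $records is a list of array("uri" => author URI, "coauthors" => name strings),
 * one per paper, all for author URIs that share the same name string.
 * Returns pairs of URIs whose co-author sets share at least $min_overlap names.
 */
function coauthor_matches(array $records, $min_overlap = 2) {
    $records = array_values($records);
    $pairs = array();
    $n = count($records);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            $shared = array_intersect(
                array_unique($records[$i]["coauthors"]),
                array_unique($records[$j]["coauthors"])
            );
            if (count($shared) >= $min_overlap) {
                $pairs[] = array($records[$i]["uri"], $records[$j]["uri"]);
            }
        }
    }
    return $pairs;
}
```

Two records for the same name that share two co-authors are paired; a third record that shares only one co-author with them is not.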

4) The rest is dynamic. As users browse the data at rkbexplorer.com, we compute networks (communities of practice of closely related entities by domain-specific weighted RDF predicate). If strings are similar in the network, then they are smushed.


New RKB Blog

July 4th, 2009

This sort of should be the first posting, I guess.

A new blog where we can put stuff about RKB, sameAs.org, etc.