How do we decide what goes in sameas.org?

September 9th, 2010

A reply to Joel Sachs:
Hi,
Thank you for your interest.
Here are some answers, of a sort, to this and other questions.
In fact, this has become something of a dialogue with myself :-)

sameas.org does not itself do any interesting inference, other than
A sameas B & B sameas C => A sameas C when asked about A.
It aims to gather equivalence information from existing sources and serve the results in a convenient (single) place.
(It also aims to address the problem of owl:sameAs being a pairwise statement, which gives an unpleasant explosion (n**2) of statements for groups of equivalences, which can be quite hard to handle.)
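
To make the bundle idea concrete, here is a minimal sketch, and not the sameas.org implementation, of how pairwise statements can be folded into groups with a union-find structure. The class and method names are made up for illustration. Each URI is stored once, so a group of n equivalent URIs needs n entries rather than roughly n**2 pairwise statements, and the transitive step (A sameas B & B sameas C => A sameas C) falls out of the merge.

```python
class Bundles:
    """Groups URIs into equivalence bundles with a union-find structure."""

    def __init__(self):
        self.parent = {}

    def find(self, uri):
        # Unknown URIs start out as their own one-element bundle.
        self.parent.setdefault(uri, uri)
        root = uri
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited URI straight at the root.
        while self.parent[uri] != root:
            self.parent[uri], uri = root, self.parent[uri]
        return root

    def assert_same(self, a, b):
        # Record one pairwise statement by merging the two bundles.
        self.parent[self.find(a)] = self.find(b)

    def bundle(self, uri):
        # Everything equivalent to `uri`, including `uri` itself.
        root = self.find(uri)
        return sorted(u for u in list(self.parent) if self.find(u) == root)


b = Bundles()
b.assert_same("A", "B")
b.assert_same("B", "C")
print(b.bundle("A"))  # ['A', 'B', 'C']
```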

Who chooses what data is acceptable?
Er, me.
I look at it and decide.

Is it a spider (people sometimes ask this)?
No – when I am bored with the other things I am doing I add more to it, by downloading dumps or querying SPARQL endpoints, often as a result of messages on this and other lists.

Is owl:sameAs the only predicate recognised?
As you have worked out, no.
It is a service giving equivalent URIs, and one of the formats you can get back is owl:sameAs. But you can get other formats if you want. So the inputs include things like skos:exactMatch and skos:closeMatch (as I recall).
And we could output other formats such as these if asked.
At the moment we only do rdf+xml, text/n3, application/json and text/plain; see http://www.sameas.org/about.php.
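
As a rough illustration of "one bundle, several output formats", the sketch below renders the same set of equivalent URIs as owl:sameAs statements, as JSON and as plain text. The function names and the JSON layout are assumptions made up for this example; they are not what sameas.org actually returns.

```python
import json

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def as_ntriples(uri, equivalents):
    # One owl:sameAs statement from the queried URI to each other member.
    return "".join(
        f"<{uri}> <{OWL_SAME_AS}> <{other}> .\n"
        for other in equivalents
        if other != uri
    )


def as_json(uri, equivalents):
    # Hypothetical JSON layout, chosen only for this sketch.
    return json.dumps({"uri": uri, "equivalents": sorted(equivalents)})


def as_text(equivalents):
    # One URI per line.
    return "".join(f"{u}\n" for u in sorted(equivalents))
```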
What has now been noticed is that I decided that dbpedia redirects should be treated as equivalent.
The reason I did this is that it meant that a lot of expected URIs now worked.
Eg http://dbpedia.org/resource/UN/LOCODE:GBLON and
even http://dbpedia.org/resource/Capital_of_the_UK get to
http://data.ordnancesurvey.co.uk/id/7000000000041428 and
http://statistics.data.gov.uk/id/eer/07.
The downside is that there is quite a lot of cruft in the redirects, and so some strange things happen (as has been observed).

Do I know about errors in sameas.org?
Yes.
I like the Iron Maiden one to opencyc, for example.
But I don’t aim to correct these, any more than Google aims to correct things it links to.

Why such a liberal attitude to equivalence?
I eventually worked out that sameas.org was a discovery service.
We have other sameas services, called crs services, on our systems (eg http://opencyc.rkbexplorer.com/crs/ is an external one) which are definitional (I hesitate to use a word like authoritative, with all its other connotations).
And so in that vein, I have cast the net wider for sameas.org.
This was the case early in its life, as the wordnet equivalence to dbpedia is in fact the equivalence of the word to the thing, which is wrong at some/any level.
But I have taken the view that people/agents that come to sameas.org are looking for things, and might not care about such subtleties, not least because they may not have understood them when they constructed their RDF.

If I had the time/funding, I would provide other services that took different views of equivalence, in terms of discovery/definitional or liberal/conservative (precision/recall is another way of saying that).

Mind you it is probably the case that the sameas.org data is no worse than a lot of the data in the LOD diagram, in terms of reliably identifying resources, as I have rejected a bunch of them as being substandard.

On 08/09/2010 15:42, “joel sachs” wrote:

> So, a request for the sameas.org folks: Would it be possible to include a
> provenance column for all sameAs assertions you keep track of? In cases
> where the sameAs assertion isn’t actually asserted on the web, you could
> indicate the provenance as “inferred” in the provenance column. Also, have
> you published the heuristics you use (if any) to infer sameAs relations?
>
> Thanks!
> Joel.

So finally getting round to your specific question (although hopefully the other stuff has also helped).
It would be hard to provide the extra column for quite a few reasons.
We do know where we got the data from, but it may be a SPARQL endpoint, a downloaded dump, or an email sent to me, for example. So it would not be very easy to interpret.
But only a small number of the pairs would be so identified, as all the rest are inferred from the other pairwise assertions.
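
To show why most pairs would carry no source, here is a small sketch (not how sameas.org stores things) that takes a few asserted pairs, each with a source label, and labels every pair in the resulting bundle. The function name and the source labels are made up for the example; “inferred” is the label Joel suggested.

```python
from itertools import combinations


def label_pairs(asserted):
    """asserted maps frozenset({a, b}) -> source label for each stated pair."""
    # Build disjoint bundles by merging every bundle that shares a URI
    # with the pair being added.
    bundles = []
    for pair in asserted:
        merged = set(pair)
        rest = []
        for bundle in bundles:
            if bundle & merged:
                merged |= bundle
            else:
                rest.append(bundle)
        bundles = rest + [merged]
    # Label every pair inside each bundle: its source if it was asserted,
    # otherwise "inferred".
    labelled = {}
    for bundle in bundles:
        for x, y in combinations(sorted(bundle), 2):
            labelled[(x, y)] = asserted.get(frozenset((x, y)), "inferred")
    return labelled


asserted = {
    frozenset(("A", "B")): "SPARQL endpoint",
    frozenset(("B", "C")): "dump",
}
print(label_pairs(asserted))
# {('A', 'B'): 'SPARQL endpoint', ('A', 'C'): 'inferred', ('B', 'C'): 'dump'}
```

A bundle of n URIs has n*(n-1)/2 pairs but can be held together by as few as n-1 assertions, so the share of “inferred” rows grows quickly with bundle size.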
We do actually have our own visualisation tools for bundles, with assertions and dates, etc, but the tool is hard to read if you don’t know what is happening, and…
1) Finding the resources to make it more accessible would be hard.
sameas.org has effectively never been funded – it is my hobby with Ian Millard, and we would love to have the resources to do this sort of stuff.
I actually have plans for a more sophisticated architecture behind sameas.org which would facilitate this and a lot of other stuff, but again it is a question of resources.

2) What is the Ontology?
A big question with giving more information is, what is the ontology?
We live in the Linked Data world (for sameas.org), and machine-interpretable structures.
So sameas.org is designed to be used by services, and the ontology of provenance (and trust) is still an open question.
So it might be that if you can tell us the ontology for provenance that could be used, we might be able to add something to the service.

3) Simple services
I am a great believer in things that do a small number of things simply, and (hopefully) do them well.
I don’t yet understand how to keep the simplicity of sameas.org, while offering more sophisticated facilities to users.

Oh dear, that went on a while, but hopefully it has addressed a lot of the questions, asked and unasked.

I’ve just remembered there is a blog, so I will put this message there as well:
http://www.rkbexplorer.com/blog/?p=40

Best
Hugh Glaser

Good citizenship on the Web of Data

August 6th, 2009

<executive_summary>

“If you consume open data you should publish as open data.”
I’m going to call this Principle 5, as things need to be named, and also of course give it a URI: http://www.rkbexplorer.com/blog/?p=33

What am I talking about?

In the bright new world, there will be lots of data from government (and elsewhere) out there for others to consume. Much of the discussion on this list advises the owners of that data to publish it in a way that makes it easily accessible.
However, we should remember that every time some of that data gets consumed, new data is generated. And the people who generated that new data should feel an obligation to publish it as open data that is at least as strong as the pressure applied to the publishers of the data they consumed.

One might say: “What’s sauce for the goose is sauce for the gander.”

</executive_summary>

Of course many people already do this, but I think it is worth pointing it out as people start to build more systems that consume.
It is quite easy to build a system that publishes, as long as it is designed in from the start; on the other hand, having built an intricate web page, it can be incredibly time-consuming and even very difficult to add on the data publishing facilities later.

As a simple example of what I mean, let’s take a site that consumes education data, and tells you a little bit about it, perhaps by showing a map where I can click on an area and find out how many kids go to school there.
Does the site publish URIs for each of these statistics?
Is it easy to find these things, or do I need some complex API?
Can I get a dump of the whole dataset?
What formats are offered?
Are there interesting html fragments that someone else might use?
Is the license clear?
Even, is there a SPARQL or other querying endpoint?

I think quite often a consumer who does something like this doesn’t really think they have generated much data, and so doesn’t engage with publishing; but each step along the way adds value, and they should celebrate the fruits of their labours by making them easily accessible.
Even taking some data and doing a nice html rendering can be really useful to someone who just wants to add something interesting to their own page.

This leads to another issue on dataset directories.
We should not consider it satisfactory just to list the source datasets:- we should consider everything a source, and so try to record a graph of dataset derivation.
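
A graph of dataset derivation can be very small. The sketch below assumes a directory that records, for each dataset, the datasets it was directly derived from, and then walks that graph to find everything a dataset ultimately draws on. The dataset names are hypothetical.

```python
def all_sources(derived_from, dataset):
    """Return every dataset that `dataset` directly or indirectly draws on."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for source in derived_from.get(current, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


# Hypothetical directory entries: dataset -> datasets it was derived from.
derived_from = {
    "school-map": ["education-stats", "area-boundaries"],
    "education-stats": ["school-census"],
}
print(sorted(all_sources(derived_from, "school-map")))
# ['area-boundaries', 'education-stats', 'school-census']
```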

Finally, why call it Principle 5?
That relates to Linked Data – there are four principles at the moment:- I happen to think that this is so important that Tim might decide to add it as a fifth principle: “If you consume Linked Data, you should publish as Linked Data”.

As they say, my 2 cents worth.
Best
Hugh

How do we generate the sitemap.xml and submit to the search engines?

August 2nd, 2009
#!/usr/bin/php -q
<?php
require_once "/usr/lib/rkb/functions-utf.inc.php";
/**
Ian Millard and Hugh Glaser
generates sitemap.xml for a linked data site that has
a triplestore which resolves URIs to provide a Symmetric Concise Bounded Description,
as well as a SPARQL endpoint,
and RDF files which are the source that populated the triplestore.
Now submits to search engines:
SWSE, Sindice and Ping the Semantic Web
*/
$usage = "Usage: {$argv[0]} sub_domain_name\n";
if(!isset($argv[1])) die($usage);
$base_domain = "rkbexplorer.com";
$sub_domain = $argv[1];
$domain = $sub_domain.".".$base_domain;
$outfile = "../$domain/sitemap.xml";
$file = fopen($outfile, "w");
$slicing = "subject-object";
// Optional per-dataset settings are read from ../$domain/about/
$name = ""; if (file_exists("../$domain/about/name.txt")) $name = trim(entities2accents(file_get_contents("../$domain/about/name.txt")));
$typical = ""; if (file_exists("../$domain/about/typical.txt")) $typical = trim(file_get_contents("../$domain/about/typical.txt"));
// exec() fills $updated with the lines of output; the first line is the date
$updated = array(); exec("/var/www/vhosts/wildcard.rkbexplorer.com/repositories/tools/rkb-utils last-update-w3c ".$sub_domain, $updated);
// changefreq.txt, if present, overrides the default change frequency
$changefreq = "monthly"; if (file_exists("../$domain/about/changefreq.txt")) $changefreq = trim(file_get_contents("../$domain/about/changefreq.txt"));
fwrite($file,"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
fwrite($file,"<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"\n");
fwrite($file,"        xmlns:sc=\"http://sw.deri.org/2007/07/sitemapextension/scschema.xsd\">\n");
fwrite($file,"  <sc:dataset>\n");
fwrite($file,"    <sc:linkedDataPrefix slicing=\"$slicing\">http://$domain/id/</sc:linkedDataPrefix>\n");
fwrite($file,"    <sc:sparqlEndpointLocation>http://$domain/sparql/</sc:sparqlEndpointLocation>\n");
// List every RDF file in the models directory as a data dump
$models = opendir("../$domain/models");
while (false !== ($model = readdir($models))) {
if (preg_match('/\.(rdf|ttl|n3|turtle|ntriples)$/', $model))
fwrite($file,"    <sc:dataDumpLocation>http://$domain/models/$model</sc:dataDumpLocation>\n");
}
closedir($models);
fwrite($file,"    <sc:datasetURI>http://$domain/</sc:datasetURI>\n");
fwrite($file,"    <sc:datasetURI>http://$domain/id/void</sc:datasetURI>\n");
if ($name != "") fwrite($file,"    <sc:datasetLabel>$name RDF dataset from RKBExplorer.com</sc:datasetLabel>\n");
if ($typical != "") fwrite($file,"    <sc:sampleURI>$typical</sc:sampleURI>\n");
fwrite($file,"    <lastmod>$updated[0]</lastmod>\n");
fwrite($file,"    <changefreq>$changefreq</changefreq>\n");
fwrite($file,"  </sc:dataset>\n");
fwrite($file,"</urlset>\n");
fclose($file);
$sitemap_url = "http://$domain/sitemap.xml";
// Submit to SWSE
print "Submitting $domain to SWSE: \n";
$ch = curl_init("http://swse.deri.org/ping?sitemap=$sitemap_url");
curl_exec($ch);
print "\n";
curl_close($ch);
// Submit POST request to Sindice
print "Submitting $domain to Sindice: \n";
$data = "url=".urlencode($sitemap_url);
$fp = fsockopen("sindice.com", 80);
fputs($fp, "POST /api/v1/sitemap HTTP/1.0\r\n");
fputs($fp, "Host: sindice.com\r\n");
fputs($fp, "Content-type: application/x-www-form-urlencoded\r\n");
fputs($fp, "Content-length: ". strlen($data) ."\r\n\r\n");
fputs($fp, $data);
//  read result back from the sindice server
$result = '';
while(!feof($fp)) $result .= fgets($fp, 128);
fclose($fp);
//  report server response: the HTTP status line and the <h1> of the reply, if any
$status = substr($result, 0, strpos($result, "\n"));
preg_match('@<h1>(.*?)</h1>@', $result, $matches);
$message = isset($matches[1]) ? $matches[1] : "";
print "\t$status\n\t$message\n\n";
//  submit to Ping the Semantic Web
passthru("./ptsw.py $sitemap_url");
?>

How to do the 303 redirect easily?

July 5th, 2009

  1. Create a web-accessible directory with all your .rdf, .ttl, .ntriples and .html files in it.
  2. Copy lodpub.php and path.php into it.
  3. Access path.php from your web server.
  4. Follow the instructions to paste that text into .htaccess.
  5. You can remove path.php, it was only there to help you get the .htaccess right.

lodpub.php
path.php
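
For readers who have not met the 303 pattern, here is a minimal sketch of the decision the server has to make. It is not a description of what lodpub.php does internally. It assumes one file per format sits next to each resource name, using the extensions listed above; the media-type mapping and the function name are assumptions for illustration.

```python
# Assumed mapping from media type to the file extensions listed above.
EXTENSIONS = {
    "application/rdf+xml": ".rdf",
    "text/turtle": ".ttl",
    "application/n-triples": ".ntriples",
    "text/html": ".html",
}


def redirect_for(resource_path, accept_header, default="text/html"):
    """Return (status, headers) for a request for a non-information resource."""
    chosen = default
    # Take the first media type in the Accept header that we can serve;
    # q-values are ignored to keep the sketch short.
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in EXTENSIONS:
            chosen = media_type
            break
    location = resource_path + EXTENSIONS[chosen]
    # 303 See Other: the document describes the thing, it is not the thing.
    return 303, {"Location": location, "Vary": "Accept"}


print(redirect_for("/id/person-1", "application/rdf+xml, text/html;q=0.5"))
# (303, {'Location': '/id/person-1.rdf', 'Vary': 'Accept'})
```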

What we do to work out co-reference.

July 4th, 2009

François Scharffe <francois.scharffe@inria.fr> asked us, so this is what I responded.

Here is a description of what we do in English.

The implementation may vary, but this is the rough idea.

It is really quite simple, I guess.

We are primarily concerned with organisations, people, publications, projects, research areas.

Research areas we have done by hand against relatively fixed ontologies.

Organisations and projects work in similar ways to publications, but I will do the publications bit.

1) To start with, there is absolutely no linkage, so we do a “coldstart”, and this is done on paper titles only.

Extracting all the string/URI pairs from all the KBs, we map each title to a lower-case string of its alphanumerics; if the result is sufficiently long (>=20) and identical, then the URIs are considered the same (“smushed”).
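
The coldstart step can be sketched in a few lines. This is an illustration and not our production code; it assumes the 20 refers to the number of characters left after normalisation, and the function names, titles and URIs are made up.

```python
from collections import defaultdict


def normalise(title):
    # Lower-case the title and keep only letters and digits.
    return "".join(ch for ch in title.lower() if ch.isalnum())


def coldstart(title_uri_pairs, min_length=20):
    """Group URIs whose normalised titles are identical and long enough."""
    by_key = defaultdict(set)
    for title, uri in title_uri_pairs:
        key = normalise(title)
        if len(key) >= min_length:
            by_key[key].add(uri)
    # Only keys shared by two or more URIs produce a smush.
    return [sorted(uris) for uris in by_key.values() if len(uris) > 1]


pairs = [
    ("A Survey of Co-reference Resolution on the Web", "http://example.org/kb1/p1"),
    ("a survey of coreference resolution on the web.", "http://example.org/kb2/p9"),
    ("Short title", "http://example.org/kb3/p3"),
]
print(coldstart(pairs))
# [['http://example.org/kb1/p1', 'http://example.org/kb2/p9']]
```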

2) Now we can work on authors (string matching out of context would be too liberal). For the same (co-reffed) papers, the authors are fuzzy matched (cross product).
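
A sketch of the cross-product comparison for two records of the same paper follows. Which fuzzy measure to use and where to set the cut-off are not stated above, so the use of difflib and the 0.85 threshold are assumptions for illustration only, as are the function name and the identifiers.

```python
from difflib import SequenceMatcher
from itertools import product


def fuzzy_author_matches(authors_a, authors_b, threshold=0.85):
    """Compare every author of one record with every author of the other.

    Each author is a (uri, name) tuple; returns the URI pairs that match.
    """
    matches = []
    for (uri_a, name_a), (uri_b, name_b) in product(authors_a, authors_b):
        score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if score >= threshold:
            matches.append((uri_a, uri_b))
    return matches


record_a = [("kb1:a1", "Hugh Glaser"), ("kb1:a2", "Ian Millard")]
record_b = [("kb2:b1", "hugh glaser")]
print(fuzzy_author_matches(record_a, record_b))
# [('kb1:a1', 'kb2:b1')]
```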

3) For each author name string, we find the co-authorship sets for each paper (we do this by starting with each unique name, to make it easier).

If there is an overlap of two or more co-author strings between different sets, then these authors are smushed.

The matching of names for this is not fuzzy, but does match name variants, as identified by previous co-ref work on the URI for the author name we are looking at.

(Another way of looking at it is that if we find three authors of the same name as paper authors, we smush them.)

Also, if there are exactly two authors with similar names, then we smush them.
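
The co-author overlap rule from step 3 might look like the sketch below. It is a simplification: names are compared exactly, whereas the real matching also accepts name variants found by earlier co-ref work, and the “exactly two authors with similar names” rule is left out. The function name, record layout and identifiers are hypothetical.

```python
from itertools import combinations


def smush_by_coauthors(records, min_overlap=2):
    """records: (author_uri, name, coauthor_names) tuples, one per paper.

    Returns pairs of author URIs that share a name and at least
    `min_overlap` co-author names.
    """
    pairs = []
    for (uri_a, name_a, co_a), (uri_b, name_b, co_b) in combinations(records, 2):
        # Only compare different URIs that carry the same name string.
        if uri_a == uri_b or name_a != name_b:
            continue
        if len(set(co_a) & set(co_b)) >= min_overlap:
            pairs.append((uri_a, uri_b))
    return pairs


records = [
    ("kb1:p7", "H. Glaser", {"A. Author", "B. Writer"}),
    ("kb2:p3", "H. Glaser", {"A. Author", "B. Writer", "C. Scribe"}),
    ("kb3:p9", "H. Glaser", {"D. Other"}),
]
print(smush_by_coauthors(records))
# [('kb1:p7', 'kb2:p3')]
```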

4) The rest is dynamic. As users browse the data at rkbexplorer.com, we compute networks (communities of practice of closely related entities by domain-specific weighted RDF predicate). If strings are similar in the network, then they are smushed.


New RKB Blog

July 4th, 2009

This sort of should be the first posting, I guess.

A new blog where we can put stuff about RKB, sameAs.org, etc.