François Scharffe <francois.scharffe@inria.fr> asked us, so this is what I responded.
Here is a description of what we do in English.
The implementation may vary, but this is the rough idea.
It is really quite simple, I guess.
We are primarily concerned with organisations, people, publications, projects and research areas.
Research areas we have handled by hand, against relatively fixed ontologies.
Organisations and projects work in similar ways to publications, so I will describe the publications case.
1) To start with, there is absolutely no linkage, so we do a “coldstart”, and this is done on paper titles only.
Extracting all the string/URI pairs from all the KBs, we map each title to a lower-case string of its alphanumeric characters; if the result is sufficiently long (>=20) and identical for two URIs, then the URIs are considered the same (“smushed”).
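A minimal sketch of this coldstart step, in Python. The function names, the example URIs and the reading of ">=20" as a count of characters in the normalised title are my assumptions for illustration; the real implementation may differ.

```python
from collections import defaultdict

MIN_KEY_LENGTH = 20  # assumption: the threshold counts characters of the normalised title


def title_key(title):
    """Lower-case the title and keep only its alphanumeric characters."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def coldstart(pairs):
    """Group URIs whose normalised titles are identical and long enough.

    `pairs` is an iterable of (uri, title) tuples gathered from all the KBs.
    Returns a list of sets of URIs; each set is one "smushed" bundle.
    """
    by_key = defaultdict(set)
    for uri, title in pairs:
        key = title_key(title)
        if len(key) >= MIN_KEY_LENGTH:
            by_key[key].add(uri)
    # Only keys shared by more than one URI produce a co-reference.
    return [uris for uris in by_key.values() if len(uris) > 1]


# Hypothetical data: two records of one paper, plus a title too short to trust.
example_pairs = [
    ("http://a.example/p1", "Linked Data: The Story So Far"),
    ("http://b.example/x9", "linked data - the story so far."),
    ("http://c.example/p2", "Short title"),
    ("http://d.example/p3", "Short Title!"),
]
bundles = coldstart(example_pairs)
```

The two short titles normalise to the same string but are not smushed, because the key is under the length threshold.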
2) Now we can work on authors (string matching out of context would be too liberal). For papers that have already been found to be the same (co-reffed), the authors of the different records are fuzzy-matched against each other (cross product).
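As a rough illustration of the cross-product matching, here is a Python sketch. The fuzzy matcher is not specified above, so I have swapped in the standard library's difflib.SequenceMatcher with a made-up threshold of 0.85; the function names and example URIs are also hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

SIMILARITY_THRESHOLD = 0.85  # hypothetical cut-off, not taken from the real system


def similar(a, b, threshold=SIMILARITY_THRESHOLD):
    """Case-insensitive fuzzy comparison of two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def match_authors(authors_a, authors_b):
    """Fuzzy-match the authors of two records of the same (co-reffed) paper.

    Each argument maps an author URI to its name string.
    Returns the (uri_a, uri_b) pairs whose names are similar enough.
    """
    return [
        (uri_a, uri_b)
        for (uri_a, name_a), (uri_b, name_b) in product(authors_a.items(), authors_b.items())
        if similar(name_a, name_b)
    ]


# Hypothetical author lists for one paper as it appears in two KBs.
kb_a = {"a:1": "Hugh Glaser", "a:2": "Ian Millard"}
kb_b = {"b:7": "Hugh Glaser", "b:8": "Ian C. Millard"}
matches = match_authors(kb_a, kb_b)
```

Because the comparison is restricted to the authors of one paper, a fairly liberal matcher is tolerable here in a way it would not be across the whole dataset.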
3) For each author name string, we find the co-authorship set for each paper (we do this by starting with each unique name, to make it easier).
If there is an overlap of two or more co-author strings between different sets, then these authors are smushed.
The matching of names for this is not fuzzy, but does match name variants, as identified by previous co-ref work on the URI for the author name we are looking at.
(Another way of looking at it is that if we find three authors of the same name as paper authors, we smush them.)
Also, if there are exactly two authors with similar names, then we smush them.
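The co-author overlap rule in step 3 could be sketched like this in Python. It covers only the "two or more shared co-author strings" rule, compares co-author names exactly, and leaves out the name-variant lookup; the function name, data layout and example URIs are assumptions of mine.

```python
from itertools import combinations

MIN_SHARED_COAUTHORS = 2  # "an overlap of two or more co-author strings"


def smush_by_coauthors(records):
    """Smush author URIs that share enough co-authors.

    `records` holds one (author_uri, coauthor_names) pair per paper, all for
    the same author name string; coauthor_names is a set of name strings.
    Returns a list of sets of author URIs considered to be the same person.
    """
    parent = {uri: uri for uri, _ in records}

    def find(uri):
        # Union-find lookup with path halving.
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for (uri_a, co_a), (uri_b, co_b) in combinations(records, 2):
        if len(co_a & co_b) >= MIN_SHARED_COAUTHORS:
            parent[find(uri_a)] = find(uri_b)

    groups = {}
    for uri in parent:
        groups.setdefault(find(uri), set()).add(uri)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical records for the name string "H. Glaser" in three KBs.
records = [
    ("a:1", {"I. Millard", "A. Jaffri"}),
    ("b:7", {"I. Millard", "A. Jaffri", "N. Shadbolt"}),
    ("c:3", {"I. Millard", "J. Smith"}),
]
same_person = smush_by_coauthors(records)
```

Here a:1 and b:7 share two co-authors and are smushed, while c:3 shares only one with each and is left alone.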
4) The rest is dynamic. As users browse the data at rkbexplorer.com, we compute networks (communities of practice of closely related entities, using domain-specific weighted RDF predicates). If strings within a network are similar, then they are smushed.