cradledincaricature

2014-11-20T13:36:14+00:00

Hi James,

The British Museum data (as you probably know) uses the CIDOC-CRM model (http://www.cidoc-crm.org). This is a rich but complex way of modelling cultural heritage data, leading to some challenges for querying the data with SPARQL.

While some (or possibly all) of the data you are interested in is surfaced via ‘bmo:PX_display_wrap’, most (if not all) of it is also modelled more richly in CIDOC-CRM.

CIDOC-CRM tends to abstract each possible aspect of an object into a concept or event, and then uses this to surface details about that aspect. To be a bit more concrete: rather than having a ‘date of production’, each object has a ‘production’ entity, which in turn has its own properties, including the production techniques used, the people involved in the production, and the date of production (see http://collection.britishmuseum.org/resource/ecrm/E12_Production for a more detailed description of this class).

This means you have to ‘follow’ various URIs through the SPARQL query until you get to the part of the graph where the literal value you want is stored.
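
As a rough illustration of what ‘following’ URIs means, here is a minimal Python sketch that walks a chain of properties through a toy in-memory graph. The identifiers (`obj:1` and so on) and the date value are made up for illustration; only the property names mirror the ones used in the SPARQL query in this thread, and this is not how the British Museum endpoint itself works internally.

```python
# Toy in-memory graph of (subject, predicate, object) triples.
# The subjects and the literal "1793" are hypothetical sample data.
TRIPLES = [
    ("obj:1", "crm:P108i_was_produced_by", "obj:1/production"),
    ("obj:1/production", "crm:P9_consists_of", "obj:1/production/1"),
    ("obj:1/production/1", "crm:P4_has_time-span", "obj:1/production/1/date"),
    ("obj:1/production/1/date", "rdfs:label", "1793"),
]


def objects_of(subject, predicate, triples=TRIPLES):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow(start, path, triples=TRIPLES):
    """Walk `path` (a list of predicates) from `start`, one hop at a time."""
    frontier = [start]
    for predicate in path:
        frontier = [
            o for node in frontier for o in objects_of(node, predicate, triples)
        ]
    return frontier


# object -> production -> production part -> time-span -> label
PRODUCTION_DATE_PATH = [
    "crm:P108i_was_produced_by",
    "crm:P9_consists_of",
    "crm:P4_has_time-span",
    "rdfs:label",
]

dates = follow("obj:1", PRODUCTION_DATE_PATH)
print(dates)
```

Each hop in `PRODUCTION_DATE_PATH` corresponds to one triple pattern in a SPARQL query; the literal only appears at the end of the chain.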

I find that using the British Museum web interface to their linked data, and following links to explore it, is a good way of getting to know the constructs I’m interested in.

Here is a ‘production’ for an object:

http://collection.britishmuseum.org/resource?uri=http%3A%2F%2Fcollection.britishmuseum.org%2Fid%2Fobject%2FPPA42612%2Fproduction

A timespan for that production:
http://collection.britishmuseum.org/resource?uri=http%3A%2F%2Fcollection.britishmuseum.org%2Fid%2Fobject%2FPPA42612%2Fproduction%2F1

The dates associated with that timespan:
http://collection.britishmuseum.org/resource?uri=http%3A%2F%2Fcollection.britishmuseum.org%2Fid%2Fobject%2FPPA42612%2Fproduction%2F1%2Fdate

I’ve tried to write a SPARQL query that gets all the information you originally mentioned as desirable. The SPARQL query is at:

	PREFIX crm: <http://erlangen-crm.org/current/>
	PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
	PREFIX thes: <http://collection.britishmuseum.org/id/thesauri/>
	PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
	PREFIX bmo: <http://collection.britishmuseum.org/id/ontology/>
	PREFIX thesIdentifier: <http://collection.britishmuseum.org/id/>
	SELECT DISTINCT ?id ?title ?name ?desc ?date
	WHERE {
		# Start from every object documented in this bibliography entry
		?object crm:P70i_is_documented_in <http://collection.britishmuseum.org/id/bibliography/294> .
		OPTIONAL { ?object crm:P48_has_preferred_identifier ?id }
		OPTIONAL { ?object bmo:PX_physical_description ?desc }
		# Date: object -> production -> production part -> time-span -> label
		OPTIONAL {
			?object crm:P108i_was_produced_by ?prodevent .
			?prodevent crm:P9_consists_of ?prodpart .
			?prodpart crm:P4_has_time-span ?timespan .
			?timespan rdfs:label ?date
		}
		# Creator: object -> production -> production part -> person/institution
		OPTIONAL {
			?object crm:P108i_was_produced_by ?prodevent2 .
			?prodevent2 crm:P9_consists_of ?prodpart2 .
			?prodpart2 crm:P14_carried_out_by ?creator .
			?creator skos:prefLabel ?name .
			?creator skos:inScheme thesIdentifier:person-institution
		}
		# rdfs:label gives a single title per object
		OPTIONAL { ?object rdfs:label ?title }
	}

There are a number of compromises/decisions I’ve had to make in writing this query. For example, to avoid the complications of multiple titles, I’ve used the object’s rdfs:label. This gives me a single ‘title’ per object, but misses alternative titles and titles in other languages.

I’ve tried to note major compromises or issues in the comments at the URL above.

Finally, this query results in multiple lines per object where there are multiple descriptions or multiple creators (multiplying up if there are multiples of both). I used OpenRefine to sort on the ID, then de-duplicated by joining together descriptions, creators, etc. This eventually gave me 23365 unique objects. I think this is right, but I didn’t spend enough time checking to be sure I haven’t made a mistake in the de-duplication process.
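
For anyone who would rather script the de-duplication than do it in OpenRefine, the same idea can be sketched in Python: group the result rows by ID and join the distinct values in each column with a pipe. The function name `collapse_rows` and the sample rows are hypothetical, and this is only an approximation of the OpenRefine steps described above, not a record of them.

```python
def collapse_rows(rows, key="id", sep="|"):
    """Merge rows sharing the same `key`, joining distinct values per column."""
    merged = {}  # dicts keep insertion order, so first-seen order is preserved
    for row in rows:
        record = merged.setdefault(row[key], {})
        for column, value in row.items():
            if column == key or value in (None, ""):
                continue  # skip the key itself and unbound/empty cells
            values = record.setdefault(column, [])
            if value not in values:
                values.append(value)
    return [
        {key: ident, **{col: sep.join(vals) for col, vals in record.items()}}
        for ident, record in merged.items()
    ]


# Hypothetical query results: 2 descriptions x 2 creators = 4 rows for "A1".
rows = [
    {"id": "A1", "desc": "first description", "name": "Creator X"},
    {"id": "A1", "desc": "first description", "name": "Creator Y"},
    {"id": "A1", "desc": "second description", "name": "Creator X"},
    {"id": "A1", "desc": "second description", "name": "Creator Y"},
    {"id": "B2", "desc": "only description", "name": None},
]

collapsed = collapse_rows(rows)
for record in collapsed:
    print(record)
```

Counting the collapsed records is then one way to cross-check the number of unique objects.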

My SPARQL knowledge isn’t good enough to do a better job with the query, but I’d be really surprised if there isn’t a way of avoiding at least some of the duplication I get in my results.

2014-11-20T13:40:50+00:00

Oh and a final thought / comment (for now). Once you’ve got the URIs in OpenRefine, you can use the ‘Add column by fetching URLs’ option to get JSON representations of the RDF for each URI you have. You can then use the parseJson function to extract further information. This can be a bit slow, but it works and means you can always extend the available information by fetching more from the BM site when you need it.
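
Outside OpenRefine, the ‘fetch JSON, then parse it’ step might look something like the Python sketch below. It assumes the response is shaped like the W3C RDF/JSON serialisation (subject, then predicate, then a list of value objects); I have not confirmed that this is what the BM site returns, so treat the shape, the `literal_values` helper and the example.org sample data as assumptions.

```python
import json


def literal_values(rdf_json_text, subject, predicate):
    """Return the literal values for `predicate` on `subject`.

    Assumes an RDF/JSON-style layout:
    {subject: {predicate: [{"type": ..., "value": ...}, ...]}}
    """
    data = json.loads(rdf_json_text)
    entries = data.get(subject, {}).get(predicate, [])
    return [e["value"] for e in entries if e.get("type") == "literal"]


# Hypothetical sample standing in for a fetched document.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SAMPLE = json.dumps({
    "http://example.org/thing/1": {
        RDFS_LABEL: [
            {"type": "literal", "value": "Example label"},
            {"type": "uri", "value": "http://example.org/not-a-literal"},
        ]
    }
})

labels = literal_values(SAMPLE, "http://example.org/thing/1", RDFS_LABEL)
print(labels)
```

If the real JSON is laid out differently, only the two `.get(...)` lookups should need to change.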

2014-11-20T16:56:10+00:00

I realised I could use GROUP_CONCAT to do in the SPARQL query some of the work I was doing in OpenRefine, so I have an updated SPARQL query that returns a single line per item. Where there are multiple titles, creators, creation dates, or descriptions, these are concatenated into a single string in the query result, using the pipe character ‘|’ to join the separate values.

	PREFIX crm: <http://erlangen-crm.org/current/>
	PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
	PREFIX thes: <http://collection.britishmuseum.org/id/thesauri/>
	PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
	PREFIX bmo: <http://collection.britishmuseum.org/id/ontology/>
	PREFIX thesIdentifier: <http://collection.britishmuseum.org/id/>
	# '|' needs no backslash in a SPARQL string literal ("\|" is not a defined escape)
	SELECT DISTINCT ?id
		(GROUP_CONCAT(?title; SEPARATOR = "|") AS ?titles)
		(GROUP_CONCAT(?name; SEPARATOR = "|") AS ?names)
		(GROUP_CONCAT(?desc; SEPARATOR = "|") AS ?descs)
		(GROUP_CONCAT(?date; SEPARATOR = "|") AS ?dates)
	WHERE {
		# Start from every object documented in this bibliography entry
		?object crm:P70i_is_documented_in <http://collection.britishmuseum.org/id/bibliography/294> .
		OPTIONAL { ?object crm:P48_has_preferred_identifier ?id }
		OPTIONAL { ?object bmo:PX_physical_description ?desc }
		# Date: object -> production -> production part -> time-span -> label
		OPTIONAL {
			?object crm:P108i_was_produced_by ?prodevent .
			?prodevent crm:P9_consists_of ?prodpart .
			?prodpart crm:P4_has_time-span ?timespan .
			?timespan rdfs:label ?date
		}
		# Creator: object -> production -> production part -> person/institution
		OPTIONAL {
			?object crm:P108i_was_produced_by ?prodevent2 .
			?prodevent2 crm:P9_consists_of ?prodpart2 .
			?prodpart2 crm:P14_carried_out_by ?creator .
			?creator skos:prefLabel ?name .
			?creator skos:inScheme thesIdentifier:person-institution
		}
		OPTIONAL { ?object rdfs:label ?title }
	}
	# One result row per identifier
	GROUP BY ?id
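
One thing to watch for: GROUP_CONCAT without DISTINCT joins one value per matching row, so when an object has, say, two descriptions and two creators, each value appears twice in the joined string. SPARQL 1.1 also allows `GROUP_CONCAT(DISTINCT ?name; ...)`; alternatively the repeats can be dropped afterwards. A minimal Python sketch of the latter, with a hypothetical helper `split_unique` and made-up sample cells:

```python
def split_unique(cell, sep="|"):
    """Split a joined cell on `sep`, dropping empty and repeated values.

    First-seen order is kept, so the result is stable across runs.
    """
    seen = []
    for value in cell.split(sep):
        if value and value not in seen:
            seen.append(value)
    return seen


# Hypothetical ?names cell for an object with 2 creators and 2 descriptions.
names_cell = "Creator X|Creator Y|Creator X|Creator Y"
names = split_unique(names_cell)
print(names)
```

This will also split any value that itself contains a pipe, so it is worth checking that the separator does not occur in the data.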

Pingback: Metadata for all the British Museum Satires: part four | cradledincaricature

Metadata for all the British Museum Satires: part three

…some thoughts on digital history, cartoons, and satire.