sshoc-skosmapping/sshoc_31_skos.ipynb at d486fd2c004360363221f662f511a053f9d0e364


Mapping Data Stewardship terminology and Metadata from spreadsheets to SKOS resources

This notebook implements a simple parser that transforms the Data Stewardship terminology and the Metadata, created in Task 3.1 of the SSHOC project and published as spreadsheets, into SKOS resources. The parser reads each spreadsheet and converts its content into SKOS data according to a set of mapping rules. Each result is stored as a Turtle file and as an RDF/XML file.

In [1]:

import pandas as pd
import rdflib
import itertools
import yaml

The file config.yaml contains the external settings used by the parser, including the location of the spreadsheets. Set the correct values before running the notebook.

In [2]:

# conf stays None if the file is missing or cannot be parsed
conf = None
try:
    with open("config.yaml", 'r') as stream:
        conf = yaml.safe_load(stream)
except FileNotFoundError:
    print('Warning: config.yaml not found! Please store it in the same directory as the notebook')
except yaml.YAMLError as exc:
    print(exc)
#print (conf)
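The later cells read a fixed set of keys from `conf`, so a quick check after loading can report missing entries before any download starts. The following is a minimal sketch: the key names are the ones this notebook reads, while the helper name `missing_config_keys` and the sample values are hypothetical.

```python
# Keys read by this notebook, grouped by top-level section of config.yaml.
REQUIRED_KEYS = {
    "Namespaces": ["SSHOCTERM", "SSHOCCMD"],
    "Source": ["VOCABULARYSOURCE", "METADATASOURCE"],
    "Texts": [
        "LANG",
        "VOCABULARYTITLE", "VOCABULARYDESCRIPTION", "VOCABULARYID",
        "VOCABULARYCREATEDATE", "VOCABULARYMODDATE", "VOCABULARYVERSION",
        "METADATATITLE", "METADATADESCRIPTION", "METADATAID",
        "METADATACREATEDATE", "METADATAMODDATE", "METADATAVERSION",
    ],
}


def missing_config_keys(conf):
    """Return 'Section.KEY' strings for every required entry absent from conf."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = (conf or {}).get(section) or {}
        for key in keys:
            if key not in values:
                missing.append(f"{section}.{key}")
    return missing


# Hypothetical, deliberately incomplete configuration with placeholder values.
sample_conf = {
    "Namespaces": {"SSHOCTERM": "http://example.org/term/",
                   "SSHOCCMD": "http://example.org/cmd/"},
    "Source": {"VOCABULARYSOURCE": "vocabulary.csv"},
}
print(missing_config_keys(sample_conf)[:2])
```

An empty list from `missing_config_keys(conf)` would mean that every key used below is present; it says nothing about whether the values are valid.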

The following cell defines the namespaces used in the parsing.

In [3]:

from rdflib.namespace import DC, DCAT, DCTERMS, OWL, \
                            RDF, RDFS, SKOS,  \
                            XMLNS, XSD
from rdflib import Namespace
from rdflib import URIRef, BNode, Literal

sshocterm = Namespace(conf['Namespaces']['SSHOCTERM'])
sshoccmd= Namespace(conf['Namespaces']['SSHOCCMD'])
dc11=Namespace("http://purl.org/dc/elements/1.1/")
dct = Namespace("http://purl.org/dc/terms/")
# The base must end with "/" so that iso369.eng expands to .../iso639-3/eng
iso369=Namespace("http://id.loc.gov/vocabulary/iso639-3/")

Download the Data Stewardship terminology spreadsheet, then give its columns readable names. Display df_data afterwards to check that the download succeeded.

In [4]:

url=conf['Source']['VOCABULARYSOURCE']
df_data=pd.read_csv(url)

In [5]:

df_data.rename(columns = {'Unnamed: 0': 'Concept ID', 'Unnamed: 1':'Subject', 'Unnamed: 2':'Term',
                             'Unnamed: 3':'Source of definition', 'Translations':'Dutch', 'Unnamed: 5':'French', 
                             'Unnamed: 6':'German', 'Unnamed: 7':'Greek',
                             'Unnamed: 8':'Italian', 'Unnamed: 9':'Slovenian',
                             'Linking':'Loterre Open Science Thesaurus', 'Unnamed: 11':'Terms4FAIRSkills',
                             'Unnamed: 12':'CCR metadata', 'Unnamed: 13':'Linked Open Vocabularies',
                             'Unnamed: 14':'LOV 2', 'Unnamed: 15':'ISO',
                             'Unnamed: 16':'Broader Concept'}, inplace = True)
df_data=df_data.drop(0)
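The rename mapping above suggests that the sheet has a two-row header: a first row with group titles such as "Translations" and "Linking", and a second row with the actual column names. That layout is an assumption inferred from the mapping, since the spreadsheet itself is not shown here. A miniature, made-up CSV shows how pandas names the empty header cells and why row 0 is dropped:

```python
import io
import pandas as pd

# Hypothetical miniature of a sheet with a two-row header: the first row holds
# group titles (mostly empty), the second row holds the real column names.
csv_text = (
    ",,,Translations,\n"
    "Concept ID,Subject,Term,Dutch,French\n"
    "c1,prefLabel,data,gegevens,données\n"
)
df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))  # empty header cells become 'Unnamed: <position>'

df = df.rename(columns={"Unnamed: 0": "Concept ID", "Unnamed: 1": "Subject",
                        "Unnamed: 2": "Term", "Translations": "Dutch",
                        "Unnamed: 4": "French"})
df = df.drop(0)  # row 0 is the second header row, not data
print(df)
```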

Create a graph for the SKOS data and bind the namespaces to it.

In [6]:

c1rdf = rdflib.Graph()
c1rdf.bind("sshocterm", sshocterm)
c1rdf.bind("dc11", dc11)
c1rdf.bind("dct", dct)
c1rdf.bind("iso369-3", iso369)
c1rdf.bind("skos", SKOS)
c1rdf.bind("dc", DC)
c1rdf.bind("rdf", RDF)
c1rdf.bind("owl", OWL)
c1rdf.bind("xsd", XSD)

Insert the SKOS.ConceptScheme into the graph.

In [7]:

title=Literal(conf['Texts']['VOCABULARYTITLE'], lang=conf['Texts']['LANG'])
description=Literal(conf['Texts']['VOCABULARYDESCRIPTION'], lang=conf['Texts']['LANG'])
identifier=Literal(conf['Texts']['VOCABULARYID'], lang=conf['Texts']['LANG'])
createddate= Literal(conf['Texts']['VOCABULARYCREATEDATE'],datatype=XSD.date)
moddate= Literal(conf['Texts']['VOCABULARYMODDATE'],datatype=XSD.date)
version= Literal(conf['Texts']['VOCABULARYVERSION'],datatype=XSD.string)

c1rdf.add((sshocterm[''], RDF.type, SKOS.ConceptScheme))
c1rdf.add((sshocterm[''], DC.title, title))
c1rdf.add((sshocterm[''], DC.identifier, identifier))
c1rdf.add((sshocterm[''], DC.description, description))
c1rdf.add((sshocterm[''], dct.created, createddate))
c1rdf.add((sshocterm[''], dct.modified, moddate))
c1rdf.add((sshocterm[''], OWL.versionInfo, version))
c1rdf.add((sshocterm[''], dct.language, iso369.eng))
c1rdf.add((sshocterm[''], dct.language, iso369.ger))
c1rdf.add((sshocterm[''], dct.language, iso369.fra))
c1rdf.add((sshocterm[''], dct.language, iso369.ell))
c1rdf.add((sshocterm[''], dct.language, iso369.ita))
c1rdf.add((sshocterm[''], dct.language, iso369.dut))
c1rdf.add((sshocterm[''], dct.language, iso369.slv))

Out[7]:

<Graph identifier=N354cb351c0e54b49974ca1568175540a (<class 'rdflib.graph.Graph'>)>

In [8]:

#c1rdf.serialize(destination='data/skostest.rdf', format="n3");#format="pretty-xml")
#comrdf.serialize(destination='data/parsed_rdf/prima_cantica_forme_com.rdf', format="n3");

The following cell implements the mapping rules for creating SKOS resources.

In [9]:

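# A "prefLabel" row starts a new concept and sets `label`; the "altLabel" and
# "definition" rows that follow reuse that `label` until the next "prefLabel" row.
for index, row in df_data.iterrows():
    
    if row.Subject.lower()=="preflabel":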
        label=row["Concept ID"].strip()
        enlabel=Literal(row["Term"].strip(), lang='en')
        frlabel=Literal(row["French"].strip(), lang='fr')
        nllabel=Literal(row['Dutch'].strip(), lang='nl')
        delabel=Literal(row['German'].strip(), lang='de')
        itlabel=Literal(row['Italian'].strip(), lang='it')
        sllabel=Literal(row['Slovenian'].strip(), lang='sl')
        ellabel=Literal(row['Greek'].strip(), lang='el')
        
        c1rdf.add((sshocterm[label], RDF.type, SKOS.Concept))
        c1rdf.add((sshocterm[label], SKOS.inScheme, sshocterm['']))
        c1rdf.add((sshocterm[label], SKOS.topConceptOf, sshocterm['']))
        c1rdf.add((sshocterm[label], SKOS.prefLabel, enlabel))
        c1rdf.add((sshocterm[label], SKOS.prefLabel, frlabel))
        c1rdf.add((sshocterm[label], SKOS.prefLabel, nllabel))
        c1rdf.add((sshocterm[label], SKOS.prefLabel, delabel))
        c1rdf.add((sshocterm[label], SKOS.prefLabel, itlabel))
        c1rdf.add((sshocterm[label], SKOS.prefLabel, sllabel))
        c1rdf.add((sshocterm[label], SKOS.prefLabel, ellabel))
    if row.Subject.lower()=="altlabel":
        if not pd.isna(row['Term']):
            c1rdf.add((sshocterm[label], SKOS.altLabel, Literal(row["Term"].strip(), lang='en')))
        if not pd.isna(row['French']):
            c1rdf.add((sshocterm[label], SKOS.altLabel, Literal(row["French"].strip(), lang='fr')))
        if not pd.isna(row['Dutch']):
            c1rdf.add((sshocterm[label], SKOS.altLabel, Literal(row["Dutch"].strip(), lang='nl')))
        if not pd.isna(row['German']):
            c1rdf.add((sshocterm[label], SKOS.altLabel, Literal(row["German"].strip(), lang='de')))
        if not pd.isna(row['Italian']):
            c1rdf.add((sshocterm[label], SKOS.altLabel, Literal(row["Italian"].strip(), lang='it')))
        if not pd.isna(row['Slovenian']):
            c1rdf.add((sshocterm[label], SKOS.altLabel, Literal(row["Slovenian"].strip(), lang='sl')))
        if not pd.isna(row['Greek']):
            c1rdf.add((sshocterm[label], SKOS.altLabel, Literal(row["Greek"].strip(), lang='el')))
        
    if row.Subject.lower()=="definition":
        endef=Literal(row["Term"].strip(), lang='en')
        frdef=Literal(row["French"].strip(), lang='fr')
        nldef=Literal(row['Dutch'].strip(), lang='nl')
        dedef=Literal(row['German'].strip(), lang='de')
        itdef=Literal(row['Italian'].strip(), lang='it')
        sldef=Literal(row['Slovenian'].strip(), lang='sl')
        eldef=Literal(row['Greek'].strip(), lang='el')
        
        c1rdf.add((sshocterm[label], SKOS.definition, endef))
        c1rdf.add((sshocterm[label], SKOS.definition, frdef))
        c1rdf.add((sshocterm[label], SKOS.definition, nldef))
        c1rdf.add((sshocterm[label], SKOS.definition, dedef))
        c1rdf.add((sshocterm[label], SKOS.definition, itdef))
        c1rdf.add((sshocterm[label], SKOS.definition, sldef))
        c1rdf.add((sshocterm[label], SKOS.definition, eldef))
        if not pd.isna(row['Source of definition']):
            source=Literal(row['Source of definition'].strip())
            #print (f'{label}, {source}')
            c1rdf.add((sshocterm[label], dct.source, source))
    if not pd.isna(row['Loterre Open Science Thesaurus']):
        lote=URIRef(row['Loterre Open Science Thesaurus'])
        c1rdf.add((sshocterm[label], SKOS.exactMatch, lote))
        
    if not pd.isna(row['Linked Open Vocabularies']):
        lov=URIRef(row['Linked Open Vocabularies'])
        c1rdf.add((sshocterm[label], SKOS.exactMatch, lov))
    
    if not pd.isna(row['LOV 2']):
        lov2=URIRef(row['LOV 2'])
        c1rdf.add((sshocterm[label], SKOS.exactMatch, lov2))
    #Terms4FAIRSkills ISO    
    if not pd.isna(row['Terms4FAIRSkills']):
        t4fs=Literal(row['Terms4FAIRSkills'].strip())
        c1rdf.add((sshocterm[label], SKOS.note, t4fs))
    if not pd.isna(row['ISO']):
        tiso=Literal(row['ISO'].strip())
        c1rdf.add((sshocterm[label], SKOS.note, tiso))
    if not pd.isna(row['Broader Concept']):
        broc=URIRef(row['Broader Concept'])
        c1rdf.add((sshocterm[label], SKOS.broadMatch, broc))
    
print(len(c1rdf))
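The loop above is stateful: `label` is set on a "prefLabel" row and reused by the rows that follow it. The sketch below isolates that carry-forward logic in plain Python, without rdflib. The rows, the helper name `group_by_concept`, and the empty "Concept ID" cells on follow-up rows are all hypothetical.

```python
def group_by_concept(rows):
    """Collect (subject, term) pairs per concept, carrying the last ID forward."""
    concepts = {}
    current = None
    for row in rows:
        subject = (row.get("Subject") or "").lower()
        if subject == "preflabel":
            current = row["Concept ID"].strip()
            concepts[current] = []
        if current is None:
            continue  # rows before the first prefLabel cannot be attached
        term = row.get("Term")
        if term:  # skip empty cells instead of calling .strip() on them
            concepts[current].append((subject, term.strip()))
    return concepts


# Made-up rows in the assumed sheet layout.
rows = [
    {"Concept ID": "c1", "Subject": "prefLabel", "Term": "data steward "},
    {"Concept ID": None, "Subject": "altLabel", "Term": "data curator"},
    {"Concept ID": None, "Subject": "altLabel", "Term": None},
    {"Concept ID": "c2", "Subject": "prefLabel", "Term": "metadata"},
    {"Concept ID": None, "Subject": "definition", "Term": "Data about data."},
]
print(group_by_concept(rows))
```

Because the grouping depends on row order, sorting or filtering the DataFrame before the loop would attach labels and definitions to the wrong concept.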

In [10]:

#for s, p, o in c1rdf.triples((None,  None, None)):
#    print("{}  {}".format(s, o.n3))

Write the SKOS resources for the Data Stewardship terminology to the data directory as a Turtle file and an RDF/XML file.

In [11]:

c1rdf.serialize(destination='data/mdstskos.ttl', format="turtle");
c1rdf.serialize(destination='data/mdstskos.rdf', format="pretty-xml");

Download the Metadata spreadsheet, then give its columns readable names. Display df_metadata afterwards to check that the download succeeded.

In [12]:

mdurl=conf['Source']['METADATASOURCE']
df_metadata=pd.read_csv(mdurl)

In [13]:

df_metadata.rename(columns = {'English': 'Englishterm', 'Unnamed: 1':'Englishdefinition', 'Unnamed: 2':'source',
                             'Unnamed: 3':'URI', 'Dutch':'Dutchterm', 'Unnamed: 5':'Dutchdefinition', 
                             'French':'Frenchterm', 'Unnamed: 7':'Frenchdefinition',
                             'Greek':'Greekterm', 'Unnamed: 9':'Greekdefinition',
                             'Italian':'Italianterm', 'Unnamed: 11':'Italiandefinition'}, inplace = True)
df_metadata=df_metadata.drop(0)

Create a graph for the SKOS data and bind the namespaces to it.

In [14]:

ccr = rdflib.Graph()
ccr.bind("sshoccmd", sshoccmd)
ccr.bind("dc11", dc11)
ccr.bind("dct", dct)
ccr.bind("iso369-3", iso369)
ccr.bind("skos", SKOS)
ccr.bind("dc", DC)
ccr.bind("rdf", RDF)
ccr.bind("owl", OWL)
ccr.bind("xsd", XSD)

In [15]:

title=Literal(conf['Texts']['METADATATITLE'], lang=conf['Texts']['LANG'])
description=Literal(conf['Texts']['METADATADESCRIPTION'], lang=conf['Texts']['LANG'])
identifier=Literal(conf['Texts']['METADATAID'], lang=conf['Texts']['LANG'])
createddate= Literal(conf['Texts']['METADATACREATEDATE'],datatype=XSD.date)
moddate= Literal(conf['Texts']['METADATAMODDATE'],datatype=XSD.date)
version= Literal(conf['Texts']['METADATAVERSION'],datatype=XSD.string)

ccr.add((sshoccmd[''], RDF.type, SKOS.ConceptScheme))
ccr.add((sshoccmd[''], DC.title, title))
ccr.add((sshoccmd[''], DC.description, description))
ccr.add((sshoccmd[''], DC.identifier, identifier))
ccr.add((sshoccmd[''], dct.created, createddate))
ccr.add((sshoccmd[''], dct.modified, moddate))
ccr.add((sshoccmd[''], OWL.versionInfo, version))
ccr.add((sshoccmd[''], dct.language, iso369.eng))
ccr.add((sshoccmd[''], dct.language, iso369.ger))
ccr.add((sshoccmd[''], dct.language, iso369.fra))
ccr.add((sshoccmd[''], dct.language, iso369.ell))
ccr.add((sshoccmd[''], dct.language, iso369.ita))
ccr.add((sshoccmd[''], dct.language, iso369.dut))
ccr.add((sshoccmd[''], dct.language, iso369.slv))

Out[15]:

<Graph identifier=N876b7db85e864943a1e0342d4e4dafdc (<class 'rdflib.graph.Graph'>)>

The following cell implements the mapping rules for creating SKOS resources.

In [16]:

for index, row in df_metadata.iterrows():
    
    label=row["URI"]
    urilabel=URIRef(label)
    lastslash=label.rfind('/')
    label='sshoc_'+label[lastslash+1:]
    
    
    strsource=row['source']
    
    strsource=strsource.replace('(source: ','')
    strsource=strsource.replace(')','')
    source=Literal(strsource.strip())
    enterm=Literal(row["Englishterm"].strip(), lang='en')
    frterm=Literal(row["Frenchterm"].strip(), lang='fr')
    nlterm=Literal(row['Dutchterm'].strip(), lang='nl')
    #determ=Literal(row['Germanterm'], lang='de')
    itterm=Literal(row['Italianterm'].strip(), lang='it')
    #slterm=Literal(row['Slovenianterm'].strip(), lang='sl')
    elterm=Literal(row['Greekterm'].strip(), lang='el')
    
    endef=Literal(row["Englishdefinition"].strip(), lang='en')
    frdef=Literal(row["Frenchdefinition"].strip(), lang='fr')
    nldef=Literal(row['Dutchdefinition'].strip(), lang='nl')
    #dedef=Literal(row['Germandefinition'], lang='de')
    itdef=Literal(row['Italiandefinition'].strip(), lang='it')
    #sldef=Literal(row['Sloveniandefinition'], lang='sl')
    eldef=Literal(row['Greekdefinition'].strip(), lang='el')
        
    ccr.add((sshoccmd[label], RDF.type, SKOS.Concept))
    ccr.add((sshoccmd[label], SKOS.prefLabel, enterm))
    ccr.add((sshoccmd[label], SKOS.prefLabel, frterm))
    ccr.add((sshoccmd[label], SKOS.prefLabel, nlterm))
    #ccr.add(sshoccmd[label], SKOS.prefLabel, determ))
    ccr.add((sshoccmd[label], SKOS.prefLabel, itterm))
    #ccr.add((sshoccmd[label], SKOS.prefLabel, slterm))
    ccr.add((sshoccmd[label], SKOS.prefLabel, elterm))
    
    ccr.add((sshoccmd[label], SKOS.definition, endef))
    ccr.add((sshoccmd[label], SKOS.definition, frdef))
    ccr.add((sshoccmd[label], SKOS.definition, nldef))
    #ccr.add(sshoccmd[label], SKOS.definition, dedef))
    ccr.add((sshoccmd[label], SKOS.definition, itdef))
    #ccr.add((sshoccmd[label], SKOS.definition, sldef))
    ccr.add((sshoccmd[label], SKOS.definition, eldef))
   
    ccr.add((sshoccmd[label], dct.source, source))
    ccr.add((sshoccmd[label], SKOS.exactMatch, urilabel))
    
        
print(len(ccr))
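Two string manipulations in the loop above are easy to check in isolation: deriving the local name from the last path segment of the URI, and stripping the "(source: ...)" wrapper. The sketch below restates them as small helpers. The function names and the sample values are hypothetical.

```python
def local_name(uri, prefix="sshoc_"):
    """Build the concept's local name from the text after the last '/' in uri."""
    return prefix + uri[uri.rfind("/") + 1:]


def clean_source(text):
    """Drop the '(source: ' wrapper and every ')' as the loop above does."""
    return text.replace("(source: ", "").replace(")", "").strip()


# Made-up cell values in the shape the loop expects.
print(local_name("http://example.org/registry/c_1234"))
print(clean_source("(source: Example Registry)"))
```

Note that `replace(')', '')` removes every closing parenthesis, so a source name that itself contains parentheses would lose them. A URI ending in `/` would yield the bare prefix `sshoc_` as its local name.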

Write the SKOS resources for the Metadata to the data directory as an RDF/XML file and a Turtle file.

In [17]:

ccr.serialize(destination='data/skosccr.rdf', format="pretty-xml");
ccr.serialize(destination='data/skosccr.ttl', format="turtle");
