CLEF. Crowdsourcing Linked Entities via web Form


Source code


CLEF is a lightweight Linked Open Data (LOD) native cataloguing system tailored to small-medium collaborative projects. It offers a web-ready solution for setting up data collection or crowdsourcing campaigns.

CLEF is designed to facilitate admin tasks, and to allow collaborators to produce high-quality linked open data via a user interface, without the burden of understanding what all this fuss around LOD is about!

Some highlights

Install and run

On Mac you can install via installer, from source or with Docker

With the installer

From source code (No virtualenv)

With Docker

See section Setup for detail on how to change default ports.


On Windows you can install CLEF with Vagrant

If you change the configuration or update the git repository, reload the running Vagrant machine with vagrant reload.

Default setup

You can modify the default configuration of the application from the Member area > Setup. Changes to the configuration have immediate effect (no need to restart the application).

What you can modify:

PROJECT (SHORT) NAME: Personalise the project name and the tagline (payoff), which appear on the homepage and in the menu across pages.

MY ENDPOINT: The local SPARQL endpoint runs on port 3000. Changes are disabled. To modify the default port you'll have to modify the following files:

MY PUBLIC ENDPOINT: the public URL of your SPARQL endpoint, used for front-end functionalities, e.g. autocomplete. As with the local endpoint address, to change the port of the web application, change the aforementioned files.

URI BASE: The URI base is the persistent URI of new entities. You can use external services such as w3id. Be aware that content negotiation is not automatically enabled. See an example on how to enable it with w3id.

LIMIT REQUESTS: Limit the number of daily anonymous contributions per user - requests coming from the same IP address.

PAGINATION LIMIT: Choose the number of records to be shown in the backend and frontend (pagination of results).
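
To make the LIMIT REQUESTS setting more concrete, here is a minimal sketch of a daily quota counted per IP address. It is not CLEF's actual implementation: the class name DailyLimiter, its allow method, and the in-memory storage are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


class DailyLimiter:
    """Hypothetical per-IP daily quota; not CLEF's actual code."""

    def __init__(self, limit):
        self.limit = limit
        # Maps (ip, day) to the number of contributions counted so far.
        self.counts = defaultdict(int)

    def allow(self, ip, day=None):
        """Return True and count the request if the IP is under its quota."""
        day = day or date.today()
        key = (ip, day)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = DailyLimiter(limit=2)
day = date(2024, 1, 1)
results = [limiter.allow("203.0.113.7", day) for _ in range(3)]
print(results)  # [True, True, False]
```

Because the key includes the day, the same address is accepted again on the following day, and other addresses are unaffected.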

GitHub backup

CLEF can be synchronised with a GitHub repository to (1) create a versioned backup of data (and keep track of changes to records) and (2) create a team to collaborate on the catalogue.

Before modifying the Setup page, you will need:

When selecting the scopes of the permissions for the bearer token, ensure repo rights are selected.

In the setup page you can then enable GitHub synchronization by modifying the following fields:

Remember that the repository must either be yours or belong to an organisation where you have been accredited as a collaborator with admin privileges.


To allow only a restricted number of collaborators to access the backend of your application, you can use GitHub for authentication. Every collaborator must have:

In addition to the prior requirements for synchronization, the owner of the repository must also create a GitHub OAuth application connected to the repository and the web application.

GitHub authentication is strongly recommended for applications that run online. If you do not enable it, any visitor will be able to access the backend of your application. Once it is enabled, only accredited GitHub users that are collaborators of your repository will be allowed to access the Member area from the menu. Notice that anonymous contributions will still be possible (from the menu Add a resource). Only accredited users will be able to review and publish the new record though.

Templates and ontologies

Templates are interfaces designed to set up the web forms for data collection. Each template corresponds to a topic of interest to be described (an entity). Templates can be created, modified, and deleted from the Member area.

Resource templates

To create a new template, click on Create a new template in the Member area. First, you must provide a meaningful unique name (e.g. Book, Person) and the URI of an OWL class. Note that the name and class cannot be modified at later stages (you can only delete the template and start a new one).

Once you have filled in the name and class, you are redirected to a dedicated webpage for customizing the template. Templates are lists of fields, each corresponding to an RDF property whose subject is an instance of the class already specified.

You can create new fields choosing between: textboxes (short texts), text areas (long texts), dropdown (select 1 term from list), and checkbox (multiple terms from list).

For each new field you are asked to fill in a few details, like: display name (to be shown in the final web form for data entry), a description (that will appear next to the field), the RDF property associated to the field, values type or list (see below), and placeholder (an example value of the field).

Tip! You can type the complete URI of the RDF property or, if known, start with the prefix and property name. Autocompletion suggestions will appear (powered by LOV). To accept a suggestion, click on the short name in violet. If you ignore suggestions, type the full URI of your property.

You can modify the order of fields or delete fields using the icons that appear at the bottom of each field box.


Text boxes can be used to record 3 types of information. In the VALUE TYPE field you can choose between:

Tip! Fields of type Free text (Literal) can be used to record URLs of web resources (e.g. an online video or a blog post). In the final form you will be asked if a copy of the website should be preserved in the long term. While you cannot store external documents in CLEF, CLEF sends a request to the Internet Archive Wayback Machine to store a copy of your favourite webpages.
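
How CLEF sends this request is not documented here. As a rough sketch, assuming the public Save Page Now address pattern of the Wayback Machine (https://web.archive.org/save/ followed by the page URL), a request could look like the following. The function names and the User-Agent string are hypothetical.

```python
import urllib.request

WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def wayback_save_url(page_url):
    """Build the Save Page Now address for a web page."""
    return WAYBACK_SAVE_PREFIX + page_url


def request_snapshot(page_url, timeout=30):
    """Ask the Wayback Machine to archive the page; return the HTTP status."""
    req = urllib.request.Request(
        wayback_save_url(page_url),
        headers={"User-Agent": "clef-example/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.status


# Only the address is built here; request_snapshot() performs the network call.
print(wayback_save_url("https://example.org/blog/post"))
```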


A text area can include a long textual description.

In this field you can automatically extract Named Entities (powered by spaCy). After filling in the field, press return and wait for suggestions to appear at the bottom of the text area. Suggested entities (people, places, organisations, etc.) are matched to Wikidata entities and are stored as keywords associated with the record (schema:keywords), not as values of the field. You can accept or reject suggestions.


Dropdown and checkboxes behave similarly. These allow the final user to choose one or more terms from a list of controlled values. Specifically, dropdowns restrict the selection to one term from the list, while checkboxes allow multiple choice selections.

In the field VALUES you must provide the terms that populate the dropdown or checkbox list, one per line. Each term is a URI associated with a label. Both internal and external vocabularies can be used (and mixed). On each line, write the URI of a resource, followed by a comma and the label.
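
As an illustration of the expected VALUES format, the sketch below parses lines of the form URI, label into pairs. It is not CLEF's own parser: the function name parse_values and the example.org URI are assumptions, and the sketch assumes URIs contain no commas.

```python
def parse_values(text):
    """Parse 'URI, label' lines into (uri, label) pairs."""
    terms = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first comma only, so that labels may contain commas.
        uri, _, label = line.partition(",")
        terms.append((uri.strip(), label.strip()))
    return terms


values = """
http://www.wikidata.org/entity/Q6581072, female
https://example.org/vocab/unknown, unknown, not stated
"""
print(parse_values(values))
```

The first line reuses an external (Wikidata) term, while the second uses a hypothetical internal vocabulary URI, showing that the two can be mixed in the same list.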

Data model

In CLEF it is not possible to import ontologies and vocabularies. Instead, the data model is created from the classes and properties specified in templates. The final data model is documented in a dedicated web page called Data Model, available from the footer of every web page. For each template, class, and property, the LOV catalogue is queried to retrieve original labels associated with the URI. If the class or property is not indexed in LOV, the local label (the display name) is shown instead.

Getting started!

CLEF comes with the following webpages, accessible from the menu:

From the footer:

Moreover, each record has a dedicated webpage, permanently identified by its URI. Likewise, terms from internal controlled vocabularies and new entities referenced in records (e.g. new entities created in Free text fields) have a dedicated webpage. External terms do not have a dedicated page. Rather, a link to the source is provided (e.g. Wikidata items).

Create a record

Users can create a new record in two modalities: anonymous or authenticated mode.

The first step to create a new record is to select the template for the resource to be created.

After selecting the template the user is redirected to the data entry interface. The page shows the fields specified in the associated template. On the top-right, a light bulb icon shows a shortcut to the editorial guidelines, including tips for data entry. Click on the icon to toggle the helper.

Each field includes three elements: label, description (i icon), and input area. A tooltip shows a description of the expected value when hovering with the mouse. Likewise, a placeholder in the input area can show an example value. The input area can be a free-text field, a dropdown, or a checkbox.


If the label is followed by *, the field is mandatory and it is used to associate a title with the record. While typing, a lookup service searches the catalogue to show whether records with a similar title already exist, which helps prevent data duplication. However, the lookup does not enforce any behaviour (duplicates can still be created).

If followed by the Wikidata icon (a bar code), an autocomplete service is called while typing. The user is encouraged to reuse terms from Wikidata. If no matches with the input text are found in Wikidata, terms from the catalogue are also suggested, to encourage reuse and consistency of data.

To accept a suggestion, click on the link in the result (e.g. Federico Fellini). The selected value appears under the input area (highlighted in violet). Users can reject all suggestions and create a new value: press enter and the new value will appear under the input area (highlighted in orange). Multiple values are allowed in this type of field.

Text areas allow longer descriptions to be included and can be expanded vertically.

Once users finish typing, they can press return and wait for named entities to be extracted from the text (e.g. people, places, organizations). Such entities are reconciled to Wikidata and are stored as keywords associated with the record (not with the specific field). Suggestions can be rejected by clicking on the x.

Manage records

Records are accessible via the member area to authenticated users. Records are paginated and sorted by date (from the most recent to the oldest).

The list of records can be filtered by publication status:

In the column ACTIONS, the button modify allows a reviewer to modify a record. When clicking, the template is shown filled in with data, and values can be modified. After saving changes, the reviewer's name (if GitHub authentication is enabled) appears in the backend in the column modified by, and the status of the record changes to modified. Once it is reviewed at least once, the record appears in the Explore page.

NB. Before being reviewed, records do not appear in the Explore page. Records must be reviewed at least once before being published. Once a record has been published, it cannot be temporarily removed from the Explore page (e.g. while a published record is being modified). Rather, the record keeps appearing in the Explore page, and the title is flagged with the label draft.

The button review allows a reviewer to modify a record and, if the review process is deemed over, to publish it straightaway. When modifying the record, the reviewer may decide to save the changes without publishing the record. After publication, the status changes to published and the label draft is removed from the title of published records.

To remove the record from the Explore page, it must be deleted. The button delete in the column ACTIONS allows a reviewer to delete a record permanently. If GitHub synchronisation is enabled, the action also affects the file stored in the repository.

Visualize records

New records are available at {YOURDOMAIN}/view-{RESOURCEID}. The web page shows fields in the same order as in the template. When clicking on values, the website can redirect users to Wikidata pages (e.g. Department of classical philology), GeoNames pages (e.g. Bologna) or to internal pages describing terms belonging to controlled vocabularies (e.g. female).

Explore and search

Records can be browsed in the page Explore. Records are grouped by template in tabs, which also show the number of records falling under that category. In each tab, sections are shown for each field specified as a filter in the template.

By default, an initial filter is created for the text field defined as primary label, and records are sorted alphabetically. Filters based on entities (i.e. text fields referencing entity URIs and locations, or dropdowns and checkboxes referencing controlled vocabularies) are grouped by frequency of values, and then sorted alphabetically.
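
One possible reading of this ordering, sketched below, is that the most frequent values come first and values with equal frequency are listed alphabetically. The function name rank_filter_values and the sample data are hypothetical, and the actual ordering in CLEF may differ.

```python
from collections import Counter


def rank_filter_values(values):
    """Order distinct values by descending frequency, then alphabetically."""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0].lower()))


places = ["Bologna", "Rome", "Bologna", "Milan", "Rome", "Bologna", "Assisi"]
ranked = rank_filter_values(places)
print(ranked)
# [('Bologna', 3), ('Rome', 2), ('Assisi', 1), ('Milan', 1)]
```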

The top-right search bar in the menu searches the catalogue for resource titles. The search is performed on the primary label of records. Suggestions are shown while typing.

Data access


New resources (records) are associated with the class and the URI base specified in the template. Instead, URIs from Wikidata and GeoNames are directly reused and no information on their classes or properties is stored.

For every new resource a named graph is generated, which includes triples that all share the same subject: the {resourceURI} identifying the resource. The named graph appears in the form {resourceURI}/ (the same URI of the resource, followed by a slash).
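
The naming convention can be sketched as follows: given a resource URI, the named graph URI is obtained by appending a slash, and a SPARQL query can then be restricted to that graph. The function names and the example URI are hypothetical.

```python
def named_graph_uri(resource_uri):
    """Return the named graph URI: the resource URI followed by a slash."""
    return resource_uri + "/"


def graph_query(resource_uri):
    """Build a SPARQL query listing the triples stored in the record's graph."""
    graph = named_graph_uri(resource_uri)
    return (
        "SELECT ?p ?o WHERE {\n"
        f"  GRAPH <{graph}> {{ <{resource_uri}> ?p ?o }}\n"
        "}"
    )


uri = "https://example.org/base/1234-abcd"  # hypothetical resource URI
print(graph_query(uri))
```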

Basic provenance information is associated with named graphs. Whenever applicable the PROV ontology is reused, namely:


A reference page dedicated to the data model is automatically generated by the system to support developers in data reuse. The webpage is available at {YOURDOMAIN}/model (link in the footer). The documentation is automatically generated by querying CLEF, to retrieve the classes and properties actually used, and Linked Open Vocabularies (LOV), to retrieve labels and comments associated with the original specification in the ontology. If a property is not available from the LOV catalogue, a default label is shown.


New records are available at {YOURDOMAIN}/view-{RESOURCEID}. The webpage also serves data as RDFa. NB. Dereferencing is not a built-in feature. Users must refer to and configure external persistent URI providers (e.g. w3id).

SPARQL endpoint

CLEF comes with a built-in SPARQL endpoint. A GUI for querying the SPARQL endpoint (read only) and a REST API for querying the triplestore programmatically are available at {YOURDOMAIN}/sparql.
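
As a minimal sketch of programmatic access, the example below prepares a request following the standard SPARQL Protocol (a GET request with a query parameter, asking for JSON results). Whether the CLEF endpoint accepts GET requests and this result format is an assumption, and the domain example.org stands in for {YOURDOMAIN}.

```python
import json
import urllib.parse
import urllib.request


def build_sparql_request(endpoint, query):
    """Prepare a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(endpoint, query, timeout=30):
    """Send the query and return the list of result bindings."""
    req = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]


ENDPOINT = "https://example.org/sparql"  # hypothetical {YOURDOMAIN}/sparql
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"
request = build_sparql_request(ENDPOINT, QUERY)
# Only the request is built here; run_select() performs the network call.
print(request.full_url)
```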


When GitHub backup is enabled, a backup of records is provided in the repository in Turtle (one file for each record). By default, files are included in a folder called records. Versioning is provided by GitHub. Every time a record changes in the application, an update is sent to GitHub. Be aware that the synchronization between the triplestore and the repository is one-way, that is, changes made only on GitHub are not sent to the triplestore.


CLEF is based on webpy. To deploy CLEF on a production server, you'll need a professional web server process, such as gunicorn, which will serve the app. See how to deploy applications.


The server deployment is set up using three Docker containers, with nginx handling the static files.

CLEF is part of Polifonia, an H2020-funded project (101004746). The repository is maintained by