User Guide Help

Metadata Discovery Model

Cafe Variome V3 can consume, store, serve, and query metadata both locally and as part of a federated network. This guide explains the metadata model that CV3 accepts and the principles behind its metadata functions.

Concept of metadata

Generally, metadata means "data about data": for example, information about a dataset, or about a record inside a dataset. In CV3, the definition is slightly narrower: metadata is the general information about a data collection, whether that collection is held inside CV3 or hosted somewhere else. Information about an individual subject is not split into data and metadata; all of it is stored inside a data source. Examples of metadata include:

  • Name, email or address of the contacting PI

  • Name, email or URL for the data publisher

  • The license this dataset is released under

  • The use conditions or agreement for the data

  • ...

In general, any information about a collection of data (a dataset, catalog, cohort, and so on) can be stored in CV3. This is done through a concept called a meta-source.

Meta-sources

A meta-source is a single document representing a dataset or a collection of subject records. Internally, it's stored inside the collection of subject records (or dataset), directly in the database.

The meta-source can link to a dataset inside CV3, meaning that the metadata recorded is for that data source. Alternatively, the meta-source may refer to a dataset outside the system, meaning that the administrators of the installation are aware of the data and have metadata about it, but cannot or will not host the data in CV3.

In principle, the metadata recorded in the system is open for discovery: no authentication is needed to access it. Every format that CV3 generates from this metadata, for example a FAIR Data Point, is also open to the public.

Meta source types

Meta-sources are designed to be flexible, and can contain extra information with custom fields. There are also pre-defined meta-source models, which fit commonly used metadata models for a given resource type. They can be selected according to the type of data being described, and/or the metadata collected.

We currently have the following pre-defined metadata models:

  • EPND Cohort: The cohort model used by the European Platform for Neurodegenerative Disease to register their collaborative cohorts. An official catalog of cohorts can be found at EPND Cohort Catalog.

  • Dataset: A collection of records that have been collected for a defined purpose, such as to answer a specific research question. Datasets are atomic, meaning that they cannot be further divided into smaller datasets. As such, all requests and permissions granted apply to the entire dataset.

  • Data collection: A collection of datasets that have been collated for a common purpose, but which can be further divided into smaller collections. For example, it may hold all data collected as part of a given research programme or within a given consortium. Users are expected to be granted only partial access, and hence will typically request access to a subset of a data collection.

  • Catalog: A collection of datasets and/or data collections. Nesting of catalogs is permitted, meaning it is possible to create a "catalog of catalogs."

  • Biobank: A collection of biological samples (and associated data) collected from patients. These collections are usually based on certain criteria, for example, a disease or a geographical location.

  • Registry: A patient registry containing information about patients. This is usually clinical information, but a registry can store any information about patients.

  • Guideline: A collection of one or more guidelines for processes relating to a certain disease or a type of data collection.

  • Custom: This is a general type, and can be used to describe any type of data collection that is not covered in the above types.

Relationship between meta sources

Meta-sources can be related to each other, forming a hierarchy or a graph of metadata, as shown in the following diagram:

[Diagram: MetaSource, Catalog, Cohort, Dataset, and DataCollection, linked by relationships with 0..* and 0..1 multiplicities]

Relationship between meta sources and data sources

Meta-sources are designed to store metadata. Internally, they can be linked to record-level data stored in regular data sources, mainly through a "belongs to" relationship.

The Basic model

The structure of the metadata follows a general model. The basic model is used in all cases and is then extended to accommodate other types of data with more detailed fields.

Internal fields and manual assignment

The fields explained in the sections below form the metadata about a given resource. However, several other fields are used internally within Cafe Variome to enable specific features, such as interlinking of metadata entries. They are not visible in the editing interface, but may be assigned manually, provided that the data is accurate and the admin follows the required formatting.

sourceId

The UUID of the source. If omitted, Cafe Variome will assign a UUID to the source. If present, it should be a valid UUID4 string. This is used within each Cafe Variome instance to identify the source (the UUID may not be unique across instances). It can be used to link one metadata entry to another, for example, by filling the datasetIds field in the cohort model.

connectionId

The UUID of a data source this metadata entry describes. This is usually only valid when the metadata describes a dataset, but in rare cases can be assigned to other metadata. Assigning this field directly is not recommended, as the only way to find the UUID of a data source is to check the database.
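Both sourceId and connectionId are UUID strings, and a manually assigned sourceId is expected to be a valid UUID4. The following is a minimal sketch of how a file could be checked before upload, using only Python's standard uuid module. The helper names is_uuid4 and new_source_id are hypothetical and are not part of Cafe Variome; requiring the canonical hyphenated form is an assumption of this sketch.

```python
import uuid


def is_uuid4(value: str) -> bool:
    """Return True if value is a version-4 UUID in canonical hyphenated form."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # Assumption: only the canonical form (no braces, no "urn:uuid:" prefix) is accepted.
    return parsed.version == 4 and str(parsed) == value.lower()


def new_source_id() -> str:
    """Generate a fresh random UUID4 string that could be used as a sourceId."""
    return str(uuid.uuid4())
```

A script preparing an upload file could call is_uuid4 on every sourceId it sets by hand, and leave the field out when it has no need to reference the entry elsewhere.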

Common fields definition

These are the fields that are present in all types of meta-sources.

sourceId

The source ID is a UUID4-compatible code generated by the system. However, if the input metadata has some form of hierarchical structure and the ID is used to denote the parent-child relationship, the ID may be included in a manually uploaded file. It must be unique and in UUID4 format for the system to accept it.

sourceName

The name of the source.

sourceType

The type of the source. This is very important, as it determines how this entry is interpreted. It's an enumerable field and can only be one of the following (case-sensitive):

custom

A custom type that is not covered by the following types.

cohort

An EPND Cohort metadata model.

catalog

A catalog of datasets.

biobank

A collection of biological samples.

registry

A patient registry.

guideline

A guideline.

dataset

A collection of records.

dataCollection

A collection of datasets.
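Because sourceType is case-sensitive and limited to the values listed above, a client can catch mistakes early by modelling it as an enumeration. This is only an illustrative sketch: the SourceType class and the is_valid_source_type helper are hypothetical names, not part of Cafe Variome.

```python
from enum import Enum


class SourceType(str, Enum):
    """The sourceType values listed in this guide (case-sensitive)."""

    CUSTOM = "custom"
    COHORT = "cohort"
    CATALOG = "catalog"
    BIOBANK = "biobank"
    REGISTRY = "registry"
    GUIDELINE = "guideline"
    DATASET = "dataset"
    DATA_COLLECTION = "dataCollection"


def is_valid_source_type(raw: str) -> bool:
    """Return True only for an exact, case-sensitive match with a listed value."""
    try:
        SourceType(raw)
    except ValueError:
        return False
    return True
```

With this sketch, is_valid_source_type("dataCollection") is accepted, while "DataCollection" or "datacollection" is rejected because the comparison is exact.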

resourceUrls

The URLs of the source. They should be fully qualified URLs that include the scheme (e.g. https://example.com/). They should point to the main resource of the source, its description, or any point of interest that a requesting user may need.
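As a rough illustration of what "fully qualified" means here, the sketch below checks that a URL has both a scheme and a host, using Python's standard urllib.parse module. The function name is hypothetical, and restricting the scheme to http and https is an assumption of this sketch rather than a documented CV3 rule.

```python
from urllib.parse import urlparse


def is_fully_qualified_url(url: str) -> bool:
    """Return True if url has an http(s) scheme and a network location."""
    parts = urlparse(url)
    # Assumption: only web URLs are expected in resourceUrls.
    return parts.scheme in {"http", "https"} and bool(parts.netloc)
```

A value such as www.example.com fails this check because it has no scheme, whereas https://www.example.com passes.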

publisher

The publisher of the resource. It's a nested JSON object, with the following fields:

publisherType

The type of the publisher. It's an enumerable field and can only be one of the following (case-sensitive):

individual

An individual person.

organization

An organization.

agency

An agency that is not the generator/owner of the resource, but is responsible for managing the resource.

other

Any other type of publisher.

name

The name of the publisher.

contactEmail

The contact email of the publisher.

contactName

The contact name of the publisher, in case the publisher is not an individual.

url

The URL of the publisher.

location

The location of the publisher, as a free-text string, for example "UK" or "Leicester, UK, Europe".

description

The description of the source. It can be empty, but this is not recommended, as it is the main field used to describe the source and is used in free-text search.

themes

The themes of the source. It's an array of strings, each string being a URI to an RDF data structure theme. Useful when a custom theme is required when generating FDP data from this source. If omitted, the default theme will be used.

releaseLicense

The release license of the source. It should be a URL to the license, and if omitted, it will be considered that there is "no license," meaning all rights reserved with no permission to use, modify, or distribute the data.

language

The language of the source. It should be a 2-character code adhering to the ISO 639-1 standard, in lower case.

customFields

The custom fields of the source. It's a set of key-value pairs, where each key is a string and each value is either a string or an array of strings. It's designed to contain the custom metadata in a searchable form. The key cannot contain special characters including dot ., dollar sign $, slash /, or backslash \. If a key is present, the value cannot be null but can be an empty string or an empty array.
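The customFields rules above can be expressed as a small pre-upload check. The sketch below is an illustration under stated assumptions: the function name custom_field_errors is hypothetical, and the forbidden character set contains only the four characters named in this guide, which may not be the complete list that CV3 enforces.

```python
from typing import Any, Dict, List

# Only the characters named in this guide; CV3 may reject others as well.
FORBIDDEN_KEY_CHARS = set(".$/\\")


def custom_field_errors(custom_fields: Dict[Any, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem found."""
    errors: List[str] = []
    for key, value in custom_fields.items():
        if not isinstance(key, str):
            errors.append(f"key {key!r} is not a string")
            continue
        bad = sorted(FORBIDDEN_KEY_CHARS.intersection(key))
        if bad:
            errors.append(f"key {key!r} contains forbidden characters: {' '.join(bad)}")
        if value is None:
            errors.append(f"value for {key!r} is null")
        elif isinstance(value, str):
            continue  # a string, including the empty string, is allowed
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue  # an array of strings, including an empty array, is allowed
        else:
            errors.append(f"value for {key!r} must be a string or an array of strings")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a curator fix every key in one pass.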

Basic model example

The basic model, aka the Custom type, expects the following fields:

{
  "sourceId": "8df136d8-7fb0-4bec-a72a-5deed972bbb6",
  "sourceName": "A maximum custom source",
  "sourceType": "custom",
  "publisher": {
    "publisherType": "organization",
    "name": "University of Leicester",
    "contactEmail": "brookeslab@le.ac.uk",
    "contactName": "John Doe",
    "url": "https://www.le.ac.uk",
    "location": "Leicester, UK, Europe"
  },
  "resourceUrls": ["https://www.example.com"],
  "description": "This is a maximum example of a custom source",
  "themes": ["https://example.com/theme1", "https://example.com/theme2"],
  "releaseLicense": "https://opensource.org/licenses/MIT",
  "language": "en",
  "connectionId": "b1120b19-e631-46ad-915c-c964c8a278a2",
  "customFields": {
    "Some custom field": "Some value",
    "Another custom field": ["Value 1", "Value 2"]
  }
}

This metadata model is also used in other meta-source types, where the metadata is similar or identical in structure.

EPND project extensions

Cohort

Cohort fields definition

The cohort model extends the base model with more information, specifically the fields deemed necessary in the EPND project. Aside from the fields explained above, it can also have:

cohortDetails

The details of the cohort. It's a nested JSON object, with the following fields:

siteType

The type of the site. It's an enumerable string, being one of the following (case-sensitive):

singleSite

A cohort that has only a single site.

multiSite

A cohort that has multiple sites.

multiCountry

A cohort that has multiple sites in multiple countries.

country

The country of the cohort. It's a 2-character code adhering to the ISO 3166-1 standard, in upper case.

yearStart

The year the cohort started. It should be a 4-digit integer.

collectedTypes

The types of data collected for the cohort. It's a nested JSON object, with the following fields:

participants

The number of participants in the cohort, and their conditions. It's a nested JSON object, with the following fields:

diseases

The diseases that the participants are categorized by. An empty array will cause the entire participants object to be ignored. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

controlGroup

Control group participants.

ad

Alzheimer's disease.

pd

Parkinson's disease.

irbd

Isolated REM Sleep Behavior Disorder.

dlb

Dementia with Lewy Bodies.

caa

Cerebral Amyloid Angiopathy.

ftd

Frontotemporal Dementia.

als

Amyotrophic Lateral Sclerosis.

psp

Progressive Supranuclear Palsy.

cbd

Corticobasal Degeneration.

msa

Multiple System Atrophy.

hd

Huntington's Disease.

ataxia

Ataxia.

other

Other diseases not listed above.

numberOfSubjects

The number of subjects in the cohort. It's a number and should be above 0. If it's 0, the entire participants object will be ignored.
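The two "ignored" rules for the participants object (an empty diseases array, or a numberOfSubjects of 0) can be summarised in a few lines. This is a sketch of the rule as described here, not CV3's actual implementation; the function name is hypothetical, and treating a missing numberOfSubjects as 0 is an assumption.

```python
from typing import Optional


def effective_participants(collected_types: dict) -> Optional[dict]:
    """Return the participants object if it would be kept, otherwise None."""
    participants = collected_types.get("participants")
    if not participants:
        return None
    diseases = participants.get("diseases") or []
    # Assumption: a missing numberOfSubjects is treated like 0.
    count = participants.get("numberOfSubjects", 0)
    if not diseases or count <= 0:
        return None  # the whole participants object is ignored
    return participants
```

Running this check before upload makes it visible when a participants object would be silently dropped.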

bioSamples

The types of samples collected in the study. It's an array of enumerable strings, each being one of the following (case-sensitive):

csf

Cerebrospinal fluid.

serum

Serum.

plasma

Plasma.

dna

DNA.

saliva

Saliva.

urine

Urine.

stool

Stool.

images

The types of images collected for the study. It's an array of enumerable strings, each being one of the following (case-sensitive):

mri

MRI.

petAmyloid

PET Amyloid.

petTau

PET Tau.

spect

SPECT.

ocular

Ocular.

cognitiveData

The types of cognitive data collected for the study. It's an array of enumerable strings, each being one of the following (case-sensitive):

crossSectional

Cross-sectional data.

longitudinal

Longitudinal data.

datasetIds

The UUIDs of the dataset type metadata entries that are related to this cohort. It's an array of strings, each being a UUID of a dataset metadata entry. The datasets have to be either present in the same file or already uploaded to the system.
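Since every ID in datasetIds has to resolve to a dataset entry, either in the same file or already in the system, it can help to check the references before uploading. The sketch below assumes the upload file has been loaded as a list of metadata entries (the file layout is not specified in this guide) and that the IDs of already uploaded datasets are supplied by the caller; the function name is hypothetical.

```python
from typing import Dict, Iterable, List


def unresolved_dataset_ids(
    entries: List[dict], uploaded_ids: Iterable[str] = ()
) -> Dict[str, List[str]]:
    """Map each cohort's sourceName to the datasetIds that cannot be resolved."""
    # Known IDs: datasets already in the system plus dataset entries in this file.
    known = set(uploaded_ids)
    known.update(
        entry["sourceId"]
        for entry in entries
        if entry.get("sourceType") == "dataset" and "sourceId" in entry
    )
    problems: Dict[str, List[str]] = {}
    for entry in entries:
        if entry.get("sourceType") != "cohort":
            continue
        missing = [d for d in entry.get("datasetIds", []) if d not in known]
        if missing:
            problems[entry.get("sourceName", "<unnamed>")] = missing
    return problems
```

An empty result means every cohort in the file points only at datasets that are known.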

Cohort model example

{
  "sourceId": "a6e001cb-bb60-48b9-a47a-3dccee13c085",
  "sourceName": "A maximum cohort",
  "sourceType": "cohort",
  "publisher": {
    "publisherType": "organization",
    "name": "University of Leicester",
    "contactEmail": "brookeslab@le.ac.uk",
    "contactName": "John Doe",
    "url": "https://www.le.ac.uk",
    "location": "Leicester, UK, Europe"
  },
  "resourceUrls": ["https://www.example.com"],
  "description": "This is a maximum example of a cohort",
  "releaseLicense": "https://opensource.org/licenses/MIT",
  "language": "en",
  "themes": ["https://example.com/theme1", "https://example.com/theme2"],
  "cohortDetails": {
    "siteType": "multiSite",
    "country": "GB",
    "yearStart": 2023
  },
  "collectedTypes": {
    "participants": {
      "diseases": ["controlGroup", "ad", "hd"],
      "numberOfSubjects": 1000
    },
    "bioSamples": ["csf", "serum", "plasma", "dna", "saliva", "urine"],
    "images": ["mri", "petTau", "datScan"],
    "cognitiveData": ["crossSectional"]
  },
  "connectionId": "6c3968af-3d29-4f81-8747-b2337c1cc01b",
  "datasetIds": ["adbec8c2-9460-4814-9574-06a0dfe2efb5"],
  "customFields": {
    "Some custom field": "Some value",
    "Another custom field": ["Value 1", "Value 2"]
  }
}

Dataset

Dataset fields definition

Datasets contain the following fields:

datasetVersions

The versions of the dataset. It's an array of nested JSON objects, each object being a representation of a version of the dataset. Each object has the following fields:

datasetDetails

The details of the dataset. It's a nested JSON object, with the following fields:

versionId

The UUID of the version. If omitted, Cafe Variome will assign a UUID to the version. If present, it should be a valid UUID4 string. Either way, it will always be present in the database.

versionName

The version number of the dataset. Semantic versioning is recommended. Using a format that does not fit with semantic versioning will disable the parsing, comparison, and sorting of the versions.

keywords

The keywords of the dataset. It's an array of strings, each being a keyword.

publishedDate

The date when the dataset is released. It should be a date string in the format of YYYY-MM-DD.

updateDate

The update date of the dataset. It should be a date string in the format of YYYY-MM-DD.
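Both date fields use the YYYY-MM-DD format. A minimal sketch of a strict check with Python's standard library is shown below; the function name parse_metadata_date is hypothetical. The regular expression enforces zero-padded fields, and strptime then rejects impossible dates such as 30 February.

```python
import re
from datetime import date, datetime
from typing import Optional

_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_metadata_date(value: str) -> Optional[date]:
    """Parse a YYYY-MM-DD string, returning None if it is malformed or not a real date."""
    if not _DATE_SHAPE.fullmatch(value):
        return None
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None
```

The same helper can be applied to publishedDate and updateDate alike.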

datasetContent

The information on the data content of the dataset. It's a nested JSON object, with the following fields:

numberOfSubjects

The number of data records in the dataset. This should be an integer above 0. It may be an approximated number if the exact number is kept private for confidentiality reasons.

minAge

The minimum age of the participants in the dataset. It should be a number and should be above 0.

maxAge

The maximum age of the participants in the dataset. It should be a number and should be above 0.

countries

The countries of the participants in the dataset. It's an array of 2-character codes adhering to the ISO 3166-1 standard, in upper case.

diseases

The diseases of the participants in the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

controlGroup

Control group participants.

ad

Alzheimer's disease.

pd

Parkinson's disease.

irbd

Isolated REM Sleep Behavior Disorder.

dlb

Dementia with Lewy Bodies.

caa

Cerebral Amyloid Angiopathy.

ftd

Frontotemporal Dementia.

als

Amyotrophic Lateral Sclerosis.

psp

Progressive Supranuclear Palsy.

cbd

Corticobasal Degeneration.

msa

Multiple System Atrophy.

hd

Huntington's Disease.

ataxia

Ataxia.

other

Other diseases not listed above.

sex

The genders covered in the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

male

Biological male.

female

Biological female.

other

Other biological genders.

undifferential

Gender data is irrelevant to the data, or is not differentiated on purpose.

unknown

No information regarding the gender composition or collection status.

clinical

The clinical data collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

comorbidities

Comorbidities.

medicationUse

Medication use.

familyHistory

Family history.

ageOfSymptomOnset

Age of symptom onset.

clinicalDiagnosis

Clinical diagnosis.

exposure

Exposure.

lifeStyleInfo

Lifestyle information.

vitalSigns

Vital signs.

markers

The biological or digital markers collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

amyloid

Amyloid protein markers.

tau

Tau protein markers.

neurofilamentLightChain

Neurofilament light chain protein markers.

alphaSynuclein

Alpha-synuclein protein markers.

dat

Direct Antibody Test.

images

The types of images collected within the dataset. It's an optional array of enumerable strings, each can only be one of the following (case-sensitive):

mri

MRI.

petAmyloid

PET Amyloid.

petTau

PET Tau.

spect

SPECT.

ocular

Ocular.

electrophysiology

The types of electrophysiology data collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

eeg

EEG.

meg

MEG.

erp

ERP.

dataTypes

The types of data collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

demographics

Demographics.

clinical

Clinical.

lifestyle

Lifestyle.

functionalRatings

Functional ratings.

motor

Motor.

neuropsychiatric

Neuropsychiatric.

neuropsychological

Neuropsychological.

qualityOfLife

Quality of life.

sleepScales

Sleep scales.

digitalData

Digital data.

imaging

Imaging.

electrophysiology

Electrophysiology.

neuroPathology

Neuro pathology.

other

Other.

The information about each version is stored in the datasetVersions field, which is a list. Each version contains the details of the version, and the content of the dataset. Semantic versioning is recommended to enable version comparison and search.
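To illustrate why a parseable versionName matters, the sketch below picks the latest entry from datasetVersions. It is a simplified illustration, not CV3's own parser: it accepts only MAJOR.MINOR.PATCH, ignores the pre-release and build-metadata parts of full semantic versioning, and tolerates an optional leading "v" as an assumption. The function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Simplified pattern: MAJOR.MINOR.PATCH with an optional leading "v".
_VERSION = re.compile(r"v?(\d+)\.(\d+)\.(\d+)")


def parse_version(name: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) as integers, or None if name does not fit."""
    match = _VERSION.fullmatch(name)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def latest_version(dataset: dict) -> Optional[dict]:
    """Return the datasetVersions entry with the highest versionName.

    Returns None if there are no versions or if any name cannot be parsed,
    since comparison is then not meaningful.
    """
    keyed = []
    for version in dataset.get("datasetVersions", []):
        key = parse_version(version["datasetDetails"]["versionName"])
        if key is None:
            return None
        keyed.append((key, version))
    if not keyed:
        return None
    return max(keyed, key=lambda pair: pair[0])[1]
```

Comparing integer tuples rather than raw strings is what makes 1.10.0 sort after 1.2.0.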

Dataset model example

{
  "sourceId": "adbec8c2-9460-4814-9574-06a0dfe2efb5",
  "sourceName": "A maximum dataset",
  "sourceType": "dataset",
  "publisher": {
    "publisherType": "organization",
    "name": "University of Leicester",
    "contactEmail": "brookeslab@le.ac.uk",
    "contactName": "John Doe",
    "url": "https://www.le.ac.uk",
    "location": "Leicester, UK, Europe"
  },
  "resourceUrls": ["https://www.example.com"],
  "description": "This is a maximum example of a dataset",
  "themes": ["https://example.com/theme1", "https://example.com/theme2"],
  "datasetVersions": [
    {
      "datasetDetails": {
        "versionId": "1b71b513-33be-45ee-b6e9-a24b2bc9dc05",
        "versionName": "v1.0.0",
        "keywords": ["keyword1", "keyword2"],
        "publishedDate": "2023-12-02",
        "updateDate": "2023-12-12"
      },
      "datasetContent": {
        "numberOfSubjects": 100,
        "minAge": 18,
        "maxAge": 35,
        "countries": ["GB", "US"],
        "diseases": ["controlGroup", "ad", "hd"],
        "sex": ["male", "female"],
        "clinical": ["lifeStyleInfo", "vitalSigns"],
        "markers": ["amyloid", "tau"],
        "images": ["mri", "petTau", "datScan"],
        "electrophysiology": ["eeg", "meg"],
        "dataTypes": ["demographics"]
      }
    },
    {
      "datasetDetails": {
        "versionId": "4114682d-73f5-45eb-9b7c-023e18cd12c9",
        "versionName": "v2.0.0",
        "keywords": ["keyword3", "keyword4"],
        "publishedDate": "2024-01-01",
        "updateDate": "2024-01-11"
      },
      "datasetContent": {
        "numberOfSubjects": 200,
        "minAge": 20,
        "maxAge": 40,
        "countries": ["GB", "US", "CA"],
        "diseases": ["controlGroup", "ad", "hd"],
        "sex": ["male", "female", "other"],
        "clinical": ["lifeStyleInfo", "vitalSigns"],
        "markers": ["amyloid", "tau", "neurofilamentLightChain"],
        "images": ["mri", "petTau", "datScan", "spect"],
        "electrophysiology": ["eeg", "meg", "erp"],
        "dataTypes": ["demographics", "clinical", "lifestyle"]
      }
    }
  ],
  "releaseLicense": "https://opensource.org/licenses/MIT",
  "language": "en",
  "connectionId": "ac743200-c8ff-485e-a82d-45d0e636f862",
  "customFields": {
    "Some custom field": "Some value",
    "Another custom field": ["Value 1", "Value 2"]
  }
}

Data collection

Data collection fields definition

These are the fields specific to the data collection type.

dataCollectionDetails

The details of the data collection. It's a nested JSON object, with the following fields:

keywords

The keywords of the data collection. It's an array of strings, each being a keyword.

publishedDate

The date when the data collection is released. It should be a date string in the format of YYYY-MM-DD.

updateDate

The update date of the data collection. It should be a date string in the format of YYYY-MM-DD.

dataCollectionContent

The information on the data content of the data collection. It's a nested JSON object, with the following fields:

numberOfSubjects

The number of data records in the data collection. It should be a number and should be above 0. It may be an approximated number if the exact number is kept private for confidentiality reasons.

minAge

The minimum age of the participants in the data collection. It should be a number and should be above 0.

maxAge

The maximum age of the participants in the data collection. It should be a number and should be above 0.

countries

The countries of the participants in the data collection. It's an array of 2-character codes adhering to the ISO 3166-1 standard, in upper case.

diseases

The diseases of the participants in the data collection. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

controlGroup

Control group participants.

ad

Alzheimer's disease.

pd

Parkinson's disease.

irbd

Isolated REM Sleep Behavior Disorder.

dlb

Dementia with Lewy Bodies.

caa

Cerebral Amyloid Angiopathy.

ftd

Frontotemporal Dementia.

als

Amyotrophic Lateral Sclerosis.

psp

Progressive Supranuclear Palsy.

cbd

Corticobasal Degeneration.

msa

Multiple System Atrophy.

hd

Huntington's Disease.

ataxia

Ataxia.

other

Other diseases not listed above.

sex

The genders covered in the data collection. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

male

Biological male.

female

Biological female.

other

Other biological genders.

undifferential

Gender data is irrelevant to the data, or is not differentiated on purpose.

unknown

No information regarding the gender composition or collection status.

clinical

The clinical data collected within the data collection. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

comorbidities

Comorbidities.

medicationUse

Medication use.

familyHistory

Family history.

ageOfSymptomOnset

Age of symptom onset.

clinicalDiagnosis

Clinical diagnosis.

exposure

Exposure.

lifeStyleInfo

Lifestyle information.

vitalSigns

Vital signs.

markers

The biological or digital markers collected within the data collection. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

amyloid

Amyloid protein markers.

tau

Tau protein markers.

neurofilamentLightChain

Neurofilament light chain protein markers.

alphaSynuclein

Alpha-synuclein protein markers.

dat

Direct Antibody Test.

images

The types of images collected within the data collection. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

mri

MRI.

petAmyloid

PET Amyloid.

petTau

PET Tau.

spect

SPECT.

ocular

Ocular.

electrophysiology

The types of electrophysiology data collected within the data collection. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

eeg

EEG.

meg

MEG.

erp

ERP.

dataTypes

The types of data collected within the data collection. It's an array of enumerable strings, each can only be one of the following (case-sensitive):

demographics

Demographics.

clinical

Clinical.

lifestyle

Lifestyle.

functionalRatings

Functional ratings.

motor

Motor.

neuropsychiatric

Neuropsychiatric.

neuropsychological

Neuropsychological.

qualityOfLife

Quality of life.

sleepScales

Sleep scales.

digitalData

Digital data.

imaging

Imaging.

electrophysiology

Electrophysiology.

neuroPathology

Neuro pathology.

other

Other.

Data collection model example

{
  "sourceId": "adbec8c2-9460-4814-9574-06a0dfe2efb5",
  "sourceName": "A maximum data collection",
  "sourceType": "dataCollection",
  "publisher": {
    "publisherType": "organization",
    "name": "University of Leicester",
    "contactEmail": "brookeslab@le.ac.uk",
    "contactName": "John Doe",
    "url": "https://www.le.ac.uk",
    "location": "Leicester, UK, Europe"
  },
  "resourceUrls": ["https://www.example.com"],
  "description": "This is a maximum example of a data collection",
  "themes": ["https://example.com/theme1", "https://example.com/theme2"],
  "dataCollectionDetails": {
    "keywords": ["keyword1", "keyword2"],
    "publishedDate": "2023-12-02",
    "updateDate": "2023-12-12"
  },
  "dataCollectionContent": {
    "numberOfSubjects": 100,
    "minAge": 18,
    "maxAge": 35,
    "countries": ["GB", "US"],
    "diseases": ["controlGroup", "ad", "hd"],
    "sex": ["male", "female"],
    "clinical": ["lifeStyleInfo", "vitalSigns"],
    "markers": ["amyloid", "tau"],
    "images": ["mri", "petTau", "datScan"],
    "electrophysiology": ["eeg", "meg"],
    "dataTypes": ["demographics"]
  },
  "releaseLicense": "https://opensource.org/licenses/MIT",
  "language": "en",
  "connectionId": "ac743200-c8ff-485e-a82d-45d0e636f862",
  "customFields": {
    "Some custom field": "Some value",
    "Another custom field": ["Value 1", "Value 2"]
  }
}

Last modified: 08 November 2024