Metadata Discovery Model

Cafe Variome V3 (CV3) can consume, store, serve, and query metadata either locally or as part of a federated network. This guide explains the metadata model that CV3 accepts and the principles behind its metadata functions.

Concept of metadata

Generally, metadata refers to "data about data": for example, information about a dataset or about a record inside a dataset. In CV3, the definition is slightly different: metadata refers to the general information about a data collection, whether it is held inside CV3 or hosted somewhere else. Information about a subject is not divided into data and metadata; it is all stored inside a data source. Examples of metadata include:

Name, email or address of the contacting PI
Name, email or URL for the data publisher
The license this dataset is released under
The use conditions or agreement for the data

In general, any information about a collection of data, such as a dataset, catalog, or cohort, can be stored in CV3. This is done via a concept called a meta-source.

Meta sources

A meta-source is a single document representing a dataset or a collection of subject records. Internally, it's stored directly within the corresponding collection of subject records (or dataset) in the database.

The meta-source can link to a dataset inside CV3, meaning that the metadata recorded is for that data source. Alternatively, the meta-source may refer to a dataset outside the system, meaning that the administrators of the installation are aware of the data and have metadata about it, but cannot or will not host the data in CV3.

In principle, the metadata recorded in the system is open for discovery: no authentication is necessary to access it. All formats generated by CV3 for this metadata (for example, a FAIR Data Point) are also open to the public.

Meta source types

Meta-sources are designed to be flexible, and can contain extra information with custom fields. There are also pre-defined meta-source models, which fit commonly used metadata models for a given resource type. They can be selected according to the type of data being described, and/or the metadata collected.

We currently have the following pre-defined metadata models:

EPND Cohort: The cohort model used by the European Platform for Neurodegenerative Disease to register their collaborative cohorts. An official catalog of cohorts can be found at EPND Cohort Catalog.
Dataset: A collection of records that have been collected for a defined purpose, such as to answer a specific research question. Datasets are atomic, meaning that they cannot be further divided into smaller datasets. As such, all requests and permissions granted apply to the entire dataset.
Data collection: A collection of datasets that have been collated for a common purpose, but which can be further divided into smaller collections. For example, all data collected as part of a given research programme or within a given consortium. Users are only expected to be granted partial access, and hence will typically request access to only a subset of a data collection.
Catalog: A collection of datasets and/or data collections. Nesting of catalogs is permitted, meaning it is possible to create a "catalog of catalogs."
Biobank: A collection of biological samples (and associated data) collected from patients. These collections are usually based on certain criteria, for example, a disease or a geographical location.
Registry: A patient registry containing information about patients, usually clinical information, but any information about patients can be stored in a registry.
Guideline: A collection of one or more guidelines for processes relating to a certain disease or a type of data collection.
Custom: This is a general type, and can be used to describe any type of data collection that is not covered in the above types.

Relationship between meta sources

Meta-sources can be related to each other, forming a hierarchy or a graph of metadata, as shown in the following diagram:

[Diagram: relationships between meta-sources]

Relationship between meta sources and data sources

Meta-sources are designed to store metadata. Internally, they can therefore be linked to the record-level data stored in regular data sources, mainly through a "belongs to" relationship.

The Basic model

The structure of the metadata follows a general model. The basic model is used in all cases and is then extended to accommodate other types of data with more detailed fields.

Internal fields and manual assignment

The fields explained in the sections below form the metadata about a given resource. However, several other fields are used internally within Cafe Variome to enable specific features, such as the interlinking of metadata entries. These are not visible in the editing interface, but may be assigned manually, provided the data is accurate and the administrator formats them correctly.

sourceId: The UUID of the source. If omitted, Cafe Variome will assign a UUID to the source. If present, it should be a valid UUID4 string. This is used within each Cafe Variome instance to identify the source (the UUID may not be unique across instances). It can be used to link one metadata entry to another, for example, by filling in the datasetIds field in the cohort model.
connectionId: The UUID of the data source this metadata entry describes. This is usually only valid when the metadata describes a dataset, but in rare cases it can be assigned to other metadata. Assigning this field directly is not recommended, as the only way to find the UUID of a data source is to check the database.

Common fields definition

These are the fields that are present in all types of meta-sources.

sourceId

The source ID is a UUID4-compatible code generated by the system. However, if the input metadata has a hierarchical structure and the ID is used to denote parent-child relationships, the ID may be included in a manually uploaded file. It must be unique and in UUID4 format for the system to accept it.
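When preparing an upload file by hand, it can help to check candidate IDs before submitting them. The following is a minimal sketch using Python's standard uuid module; the function names is_uuid4 and new_source_id are hypothetical helpers, and the requirement for the canonical hyphenated form is an assumption, not a documented CV3 rule.

```python
import uuid


def is_uuid4(value) -> bool:
    """Return True if value is a canonical, hyphenated UUID version 4 string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # Assumption: only the canonical form is accepted, so brace-wrapped or
    # hyphen-less variants that uuid.UUID() tolerates are rejected here.
    return parsed.version == 4 and str(parsed) == value.lower()


def new_source_id() -> str:
    """Generate a fresh UUID4 string for a manually prepared entry."""
    return str(uuid.uuid4())


print(is_uuid4("123e4567-e89b-42d3-a456-426614174000"))
print(is_uuid4("123e4567-e89b-12d3-a456-426614174000"))
```

The second call uses a version 1 UUID, which the check rejects because only version 4 is described as acceptable.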

sourceName

The name of the source.

sourceType

The type of the source. This is very important, as it determines how this entry is interpreted. It's an enumerable field and can only be one of the following (case-sensitive):

custom: A custom type that is not covered by the following types.
cohort: An EPND Cohort metadata model.
catalog: A catalog of datasets.
biobank: A collection of biological samples.
registry: A patient registry.
guideline: A guideline.
dataset: A collection of records.
dataCollection: A collection of datasets.
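Because the value is case-sensitive, a simple membership test catches the most common mistakes, such as writing "Cohort" or "datacollection". This is a minimal sketch; the names SOURCE_TYPES and is_valid_source_type are hypothetical, and only the listed values come from this guide.

```python
# The allowed sourceType values, exactly as listed above (case-sensitive).
SOURCE_TYPES = frozenset({
    "custom", "cohort", "catalog", "biobank",
    "registry", "guideline", "dataset", "dataCollection",
})


def is_valid_source_type(value) -> bool:
    """Return True only for an exact, case-sensitive match."""
    return value in SOURCE_TYPES


print(is_valid_source_type("dataCollection"))
print(is_valid_source_type("datacollection"))
```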

resourceUrls

The URLs of the source. They should be fully qualified URLs with schema (e.g. https://example.com/). They should point to the main resource of the source, its description, or any point of interest that a requesting user may need.
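One way to check that a URL is fully qualified is to parse it and require both a scheme and a host. The sketch below uses urllib.parse from the Python standard library; the function name is hypothetical, and restricting the scheme to http and https is an assumption, since this guide does not list the schemes CV3 accepts.

```python
from urllib.parse import urlparse


def is_fully_qualified_url(url: str, allowed_schemes=("http", "https")) -> bool:
    """Return True if url has an allowed scheme and a non-empty host part."""
    parsed = urlparse(url)
    # Assumption: only http(s) URLs are wanted; pass other schemes if needed.
    return parsed.scheme in allowed_schemes and bool(parsed.netloc)


print(is_fully_qualified_url("https://example.com/"))
print(is_fully_qualified_url("example.com/data"))
```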

publisher

The publisher of the resource. It's a nested JSON object, with the following fields:

publisherType

The type of the publisher. It's an enumerable field and can only be one of the following (case-sensitive):

individual: An individual person.
organization: An organization.
agency: An agency that is not the generator/owner of the resource, but is responsible for managing the resource.
other: Any other type of publisher.

name

The name of the publisher.

contactEmail

The contact email of the publisher.

contactName

The contact name of the publisher, in case the publisher is not an individual.

url

The URL of the publisher.

location

The location of the publisher. This is a free-text string, for example "UK" or "Leicester, UK, Europe".

description

The description of the source. While it can be left empty, it's recommended to include a description, as this field primarily describes the source and is used for free-text searches.

themes

The themes of the source. This is an array of strings, with each string being a URI to an RDF data structure theme. This is useful when a custom theme is required when generating FDP data from this source. If omitted, the default theme will be used.

releaseLicense

The release license of the source. This should be provided as a URL linking to the license. If omitted, the system will assume there is "no license," meaning all rights are reserved, and no permission is granted to use, modify, or distribute the data.

language

The language of the source. It should be a 2-character code adhering to the ISO 639-1 standard, in lower case.

customFields

The custom fields of the source. These are key-value pairs, where each key is a string and each value is either a string or an array of strings. They are designed to hold custom metadata in a searchable form. A key cannot contain special characters, including the dot (.), dollar sign ($), slash (/), or backslash (\). If a key is present, its value cannot be null, but it can be an empty string or an empty array.
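The rules above can be checked before upload. The following is a minimal sketch, not CV3's own validation: the function custom_field_errors is hypothetical, and it only tests for the four characters named above, although other special characters may also be rejected by the system.

```python
# The four characters named in this guide; the real list may be longer.
FORBIDDEN_KEY_CHARS = (".", "$", "/", "\\")


def custom_field_errors(custom_fields: dict) -> list:
    """Return a list of human-readable problems found in customFields."""
    errors = []
    for key, value in custom_fields.items():
        if not isinstance(key, str):
            errors.append(f"{key!r}: key must be a string")
            continue
        bad = [ch for ch in FORBIDDEN_KEY_CHARS if ch in key]
        if bad:
            errors.append(f"{key!r}: key contains forbidden character(s) {bad}")
        # A value may be a string or a list of strings; empty ones are fine.
        if isinstance(value, str):
            continue
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        errors.append(f"{key!r}: value must be a string or a list of strings")
    return errors


print(custom_field_errors({"funder": "Example Funder", "tags": []}))
print(custom_field_errors({"a.b": "x", "notes": None}))
```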

Basic model example

The basic model, also known as the Custom type, expects the common fields defined above.

This metadata model is also used in other meta-source types where the metadata is similar or identical in structure.
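As an illustration, a basic entry can be sketched as a Python dictionary and serialised to JSON. The field names follow the common fields defined above; every value (names, URLs, email address, UUID, theme URI) is invented for the example, and the exact document layout CV3 expects on upload is an assumption.

```python
import json

# Illustrative values only; field names follow the common fields above.
basic_meta_source = {
    "sourceId": "123e4567-e89b-42d3-a456-426614174000",
    "sourceName": "Example neurodegeneration resource",
    "sourceType": "custom",
    "resourceUrls": ["https://example.com/resource"],
    "publisher": {
        "publisherType": "organization",
        "name": "Example University",
        "contactEmail": "data-office@example.com",
        "contactName": "Data Office",
        "url": "https://example.com/",
        "location": "Leicester, UK, Europe",
    },
    "description": "A free-text description used for discovery searches.",
    "themes": ["https://example.com/themes/neurodegeneration"],
    "releaseLicense": "https://creativecommons.org/licenses/by/4.0/",
    "language": "en",
    "customFields": {"funder": "Example Funder", "tags": ["pilot", "synthetic"]},
}

basic_json = json.dumps(basic_meta_source, indent=2)
print(basic_json)
```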

EPND project extensions

Cohort

Cohort fields definition

The cohort model extends the base model with additional fields, specifically those deemed necessary in the EPND project. Aside from the fields explained above, it can also have:

cohortDetails

The details of the cohort. This is a nested JSON object, with the following fields:

siteType

The type of the site. It's an enumerable string, being one of the following (case-sensitive):

singleSite: A cohort that has only a single site.
multiSite: A cohort that has multiple sites.
multiCountry: A cohort that has multiple sites in multiple countries.

country

The country of the cohort. It's a 2-character code adhering to the ISO 3166-1 alpha-2 standard, in upper case.

yearStart

The year the cohort started. It should be a 4-digit integer.

collectedTypes

The types of data collected for the cohort. This is a nested JSON object, with the following fields:

participants

The number of participants in the cohort and their associated conditions. This information is stored in a nested JSON object containing the following fields:

diseases

The diseases that the participants are categorized by. An empty array will cause the entire participants object to be ignored. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

controlGroup: Control group participants.
ad: Alzheimer's disease.
pd: Parkinson's disease.
irbd: Isolated REM Sleep Behavior Disorder.
dlb: Dementia with Lewy Bodies.
caa: Cerebral Amyloid Angiopathy.
ftd: Frontotemporal Dementia.
als: Amyotrophic Lateral Sclerosis.
psp: Progressive Supranuclear Palsy.
cbd: Corticobasal Degeneration.
msa: Multiple System Atrophy.
hd: Huntington's Disease.
ataxia: Ataxia.
other: Other diseases not listed above.

numberOfSubjects

The number of subjects in the cohort. This is a number and should be above 0. If it's 0, the entire participants object will be ignored.
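The two "ignored" rules (an empty diseases array, or a numberOfSubjects of 0) can be expressed as a single check. This is a minimal sketch of the documented behaviour, not CV3's implementation; the function name is hypothetical, and treating a missing field the same as an empty one is an assumption.

```python
def participants_are_recorded(participants: dict) -> bool:
    """Return False when the participants object would be ignored."""
    # Assumption: a missing field is treated like an empty array or zero.
    diseases = participants.get("diseases") or []
    number = participants.get("numberOfSubjects") or 0
    return len(diseases) > 0 and number > 0


print(participants_are_recorded({"diseases": ["ad"], "numberOfSubjects": 120}))
print(participants_are_recorded({"diseases": [], "numberOfSubjects": 120}))
```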

bioSamples

The types of samples collected in the study. This is an array of enumerable strings, each being one of the following (case-sensitive):

csf: Cerebrospinal fluid.
serum: Serum.
plasma: Plasma.
dna: DNA.
saliva: Saliva.
urine: Urine.
stool: Stool.

images

The types of images collected for the study. This is an array of enumerable strings, each being one of the following (case-sensitive):

mri: MRI.
petAmyloid: PET Amyloid.
petTau: PET Tau.
spect: SPECT.
ocular: Ocular.

cognitiveData

The types of cognitive data collected for the study. This is an array of enumerable strings, each being one of the following (case-sensitive):

crossSectional: Cross-sectional data.
longitudinal: Longitudinal data.

datasetIds

The UUIDs of dataset type metadata entries associated with this cohort. This is an array of strings, where each string is a UUID corresponding to a dataset metadata entry. Each referenced dataset must either be included in the same file or already uploaded to the system.
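The requirement that every referenced dataset is either in the same file or already uploaded can be checked ahead of time. The sketch below assumes that an upload file is a list of entries and that datasetIds sits at the top level of a cohort entry; both are assumptions, and unresolved_dataset_ids is a hypothetical helper.

```python
def unresolved_dataset_ids(entries, uploaded_ids=frozenset()):
    """Map each cohort name to the dataset IDs it references but cannot resolve."""
    # IDs are resolvable if already uploaded or defined as a dataset in this file.
    known = set(uploaded_ids)
    known.update(
        e["sourceId"] for e in entries
        if e.get("sourceType") == "dataset" and "sourceId" in e
    )
    missing = {}
    for e in entries:
        if e.get("sourceType") != "cohort":
            continue
        absent = [i for i in e.get("datasetIds", []) if i not in known]
        if absent:
            missing[e.get("sourceName", "<unnamed>")] = absent
    return missing


entries = [
    {"sourceId": "a", "sourceName": "D1", "sourceType": "dataset"},
    {"sourceName": "C1", "sourceType": "cohort", "datasetIds": ["a", "b", "c"]},
]
print(unresolved_dataset_ids(entries, uploaded_ids={"b"}))
```

The short IDs "a", "b", and "c" stand in for real UUID4 strings to keep the example readable.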

Cohort model example
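The following sketch shows what a cohort entry might look like, written as a Python dictionary and serialised to JSON. The nesting is inferred from the order of the field descriptions above and is an assumption: in particular, whether collectedTypes sits beside or inside cohortDetails, and whether participants is a single object, should be confirmed against a real CV3 instance. All values are invented.

```python
import json

# Nesting is an assumption inferred from the field descriptions above.
cohort_meta_source = {
    "sourceName": "Example cohort",
    "sourceType": "cohort",
    "description": "An invented multi-site cohort used for illustration.",
    "cohortDetails": {
        "siteType": "multiSite",
        "country": "GB",
        "yearStart": 2015,
    },
    "collectedTypes": {
        "participants": {
            "diseases": ["ad", "controlGroup"],
            "numberOfSubjects": 250,
        },
        "bioSamples": ["csf", "plasma"],
        "images": ["mri", "petAmyloid"],
        "cognitiveData": ["longitudinal"],
    },
    # Each ID must refer to a dataset entry in the same file or already uploaded.
    "datasetIds": ["123e4567-e89b-42d3-a456-426614174000"],
}

cohort_json = json.dumps(cohort_meta_source, indent=2)
print(cohort_json)
```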

Dataset

Dataset fields definition

Datasets contain the following fields:

datasetVersions

The versions of the dataset. This is an array of nested JSON objects, each representing a specific version of the dataset. Each object contains the following fields:

datasetDetails

The details of the dataset. This is a nested JSON object, with the following fields:

versionId: The UUID of the version. If omitted, Cafe Variome will assign a UUID to the version. If present, it should be a valid UUID4 string. Either way, it will always be present in the database.
versionName: The version number of the dataset. Semantic versioning is recommended. Using a format that does not fit with semantic versioning will disable the parsing, comparison, and sorting of the versions.
keywords: The keywords of the dataset. This is an array of strings, each being a keyword.
publishedDate: The date when the dataset is released. It should be a date string in the format of YYYY-MM-DD.
updateDate: The update date of the dataset. It should be a date string in the format of YYYY-MM-DD.
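The YYYY-MM-DD format of publishedDate and updateDate can be checked with the standard datetime module. This is a minimal sketch with a hypothetical function name; requiring zero-padded months and days is an assumption based on the stated format.

```python
from datetime import datetime


def is_iso_date(value) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except (ValueError, TypeError):
        return False
    # strptime tolerates "2025-3-31"; the round trip enforces zero padding.
    return parsed.strftime("%Y-%m-%d") == value


print(is_iso_date("2025-03-31"))
print(is_iso_date("31/03/2025"))
```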

datasetContent

The information on the data content of the dataset. It's a nested JSON object, with the following fields:

numberOfSubjects

The number of data records in the dataset. This should be an integer above 0. It may be an approximated number if the exact number is kept private for confidentiality reasons.

minAge

The minimum age of the participants in the dataset. It should be a number and should be above 0.

maxAge

The maximum age of the participants in the dataset. It should be a number and should be above 0.

countries

The countries of the participants in the dataset. This is an array of 2-character codes adhering to the ISO 3166-1 alpha-2 standard, in upper case.

diseases

The diseases of the participants in the dataset. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

controlGroup: Control group participants.
ad: Alzheimer's disease.
pd: Parkinson's disease.
irbd: Isolated REM Sleep Behavior Disorder.
dlb: Dementia with Lewy Bodies.
caa: Cerebral Amyloid Angiopathy.
ftd: Frontotemporal Dementia.
als: Amyotrophic Lateral Sclerosis.
psp: Progressive Supranuclear Palsy.
cbd: Corticobasal Degeneration.
msa: Multiple System Atrophy.
hd: Huntington's Disease.
ataxia: Ataxia.
other: Other diseases not listed above.

sex

The genders covered in the dataset. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

male: Biological male.
female: Biological female.
other: Other biological genders.
undifferential: Gender data is irrelevant to the data, or is not differentiated on purpose.
unknown: No information regarding the gender composition or collection status.

clinical

The clinical data collected within the dataset. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

comorbidities: Comorbidities.
medicationUse: Medication use.
familyHistory: Family history.
ageOfSymptomOnset: Age of symptom onset.
clinicalDiagnosis: Clinical diagnosis.
exposure: Exposure.
lifeStyleInfo: Lifestyle information.
vitalSigns: Vital signs.

markers

The biological or digital markers collected within the dataset. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

amyloid: Amyloid protein markers.
tau: Tau protein markers.
neurofilamentLightChain: Neurofilament light chain protein markers.
alphaSynuclein: Alpha-synuclein protein markers.
dat: Direct Antibody Test.

images

Optional. The types of images collected within the dataset. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

mri: MRI.
petAmyloid: PET Amyloid.
petTau: PET Tau.
spect: SPECT.
ocular: Ocular.

electrophysiology

The types of electrophysiology data collected within the dataset. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

eeg: EEG.
meg: MEG.
erp: ERP.

dataTypes

The types of data collected within the dataset. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

demographics: Demographics.
clinical: Clinical.
lifestyle: Lifestyle.
functionalRatings: Functional ratings.
motor: Motor.
neuropsychiatric: Neuropsychiatric.
neuropsychological: Neuropsychological.
qualityOfLife: Quality of life.
sleepScales: Sleep scales.
digitalData: Digital data.
imaging: Imaging.
electrophysiology: Electrophysiology.
neuroPathology: Neuro pathology.
other: Other.

The information about each version is stored in the datasetVersions field, which is a list. Each version contains the details of the version, and the content of the dataset. Semantic versioning is recommended to enable version comparison and search.
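To illustrate why semantic versioning matters for comparison and sorting, the sketch below parses MAJOR.MINOR.PATCH names into integer tuples and sorts them numerically. It is a simplified stand-in, not CV3's parser: pre-release and build suffixes are not handled, and the function names are hypothetical.

```python
import re

# Core MAJOR.MINOR.PATCH only; pre-release and build metadata are not handled.
SEMVER_CORE = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse_version(name: str):
    """Return (major, minor, patch) as integers, or None if name does not fit."""
    match = SEMVER_CORE.fullmatch(name)
    return tuple(int(part) for part in match.groups()) if match else None


def sort_versions(names):
    """Return (sorted parseable names, names that could not be parsed)."""
    parsed = [(parse_version(n), n) for n in names]
    sortable = sorted((p, n) for p, n in parsed if p is not None)
    unsortable = [n for p, n in parsed if p is None]
    return [n for _, n in sortable], unsortable


print(sort_versions(["1.10.0", "1.2.0", "final", "1.9.3"]))
# → (['1.2.0', '1.9.3', '1.10.0'], ['final'])
```

A plain string sort would place "1.10.0" before "1.2.0"; comparing integer tuples avoids that, which is the practical benefit of a parseable version name.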

Dataset model example
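The following sketch shows a dataset entry with a single version, written as a Python dictionary and serialised to JSON. It follows the description above, where each item in datasetVersions holds a datasetDetails and a datasetContent object; the values are invented, only a few common fields are shown, and the exact upload layout is an assumption.

```python
import json

dataset_meta_source = {
    "sourceId": "123e4567-e89b-42d3-a456-426614174000",
    "sourceName": "Example dataset",
    "sourceType": "dataset",
    "datasetVersions": [
        {
            "datasetDetails": {
                # versionId is omitted here so that the system assigns one.
                "versionName": "1.0.0",
                "keywords": ["alzheimer", "biomarkers"],
                "publishedDate": "2024-06-01",
                "updateDate": "2025-01-15",
            },
            "datasetContent": {
                "numberOfSubjects": 250,
                "minAge": 55,
                "maxAge": 90,
                "countries": ["GB", "NL"],
                "diseases": ["ad", "controlGroup"],
                "sex": ["male", "female"],
                "clinical": ["clinicalDiagnosis", "medicationUse"],
                "markers": ["amyloid", "tau"],
                "images": ["mri"],
                "electrophysiology": ["eeg"],
                "dataTypes": ["demographics", "clinical", "imaging"],
            },
        }
    ],
}

dataset_json = json.dumps(dataset_meta_source, indent=2)
print(dataset_json)
```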

Data collection

Data collection fields definition

These are the fields specific to the data collection type.

dataCollectionDetails

The details of the data collection. It's a nested JSON object, with the following fields:

keywords: The keywords of the data collection. This is an array of strings, each being a keyword.
publishedDate: The date when the data collection is released. It should be a date string in the format of YYYY-MM-DD.
updateDate: The update date of the data collection. It should be a date string in the format of YYYY-MM-DD.

dataCollectionContent

The information on the data content of the data collection. It's a nested JSON object, with the following fields:

numberOfSubjects

The number of data records in the data collection. It should be a number above 0. It may be an approximate number if the exact number is kept private for confidentiality reasons.

minAge

The minimum age of the participants in the data collection. It should be a number above 0.

maxAge

The maximum age of the participants in the data collection. It should be a number above 0.

countries

The countries of the participants in the data collection. This is an array of 2-character codes adhering to the ISO 3166-1 alpha-2 standard, in upper case.

diseases

The diseases of the participants in the data collection. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

controlGroup: Control group participants.
ad: Alzheimer's disease.
pd: Parkinson's disease.
irbd: Isolated REM Sleep Behavior Disorder.
dlb: Dementia with Lewy Bodies.
caa: Cerebral Amyloid Angiopathy.
ftd: Frontotemporal Dementia.
als: Amyotrophic Lateral Sclerosis.
psp: Progressive Supranuclear Palsy.
cbd: Corticobasal Degeneration.
msa: Multiple System Atrophy.
hd: Huntington's Disease.
ataxia: Ataxia.
other: Other diseases not listed above.

sex

The genders covered in the data collection. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

male: Biological male.
female: Biological female.
other: Biological other.
undifferential: Gender data is irrelevant to the data, or is not differentiated on purpose.
unknown: No information regarding the gender composition or collection status.

clinical

The clinical data collected within the data collection. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

comorbidities: Comorbidities.
medicationUse: Medication use.
familyHistory: Family history.
ageOfSymptomOnset: Age of symptom onset.
clinicalDiagnosis: Clinical diagnosis.
exposure: Exposure.
lifeStyleInfo: Lifestyle information.
vitalSigns: Vital signs.

markers

The biological or digital markers collected within the data collection. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

amyloid: Amyloid protein markers.
tau: Tau protein markers.
neurofilamentLightChain: Neurofilament light chain protein markers.
alphaSynuclein: Alpha-synuclein protein markers.
dat: Direct Antibody Test.

images

The types of images collected within the data collection. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

mri: MRI.
petAmyloid: PET Amyloid.
petTau: PET Tau.
spect: SPECT.
ocular: Ocular.

electrophysiology

The types of electrophysiology data collected within the data collection. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

eeg: EEG.
meg: MEG.
erp: ERP.

dataTypes

The types of data collected within the data collection. This is an array of enumerable strings, each of which can only be one of the following (case-sensitive):

demographics: Demographics.
clinical: Clinical.
lifestyle: Lifestyle.
functionalRatings: Functional ratings.
motor: Motor.
neuropsychiatric: Neuropsychiatric.
neuropsychological: Neuropsychological.
qualityOfLife: Quality of life.
sleepScales: Sleep scales.
digitalData: Digital data.
imaging: Imaging.
electrophysiology: Electrophysiology.
neuroPathology: Neuro pathology.
other: Other.

Data collection model example
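The following sketch shows what a data collection entry might look like, written as a Python dictionary and serialised to JSON. Placing dataCollectionDetails and dataCollectionContent at the top level of the entry is an assumption inferred from the field descriptions above; the values are invented and only a few common fields are shown.

```python
import json

# Top-level placement of the two nested objects is an assumption.
data_collection_meta_source = {
    "sourceName": "Example consortium data collection",
    "sourceType": "dataCollection",
    "description": "Invented data gathered within one research programme.",
    "dataCollectionDetails": {
        "keywords": ["parkinson", "consortium"],
        "publishedDate": "2023-09-01",
        "updateDate": "2025-02-10",
    },
    "dataCollectionContent": {
        "numberOfSubjects": 1200,
        "minAge": 40,
        "maxAge": 95,
        "countries": ["GB", "DE", "FR"],
        "diseases": ["pd", "msa", "controlGroup"],
        "sex": ["male", "female", "unknown"],
        "clinical": ["familyHistory", "ageOfSymptomOnset"],
        "markers": ["alphaSynuclein"],
        "images": ["mri", "spect"],
        "electrophysiology": ["eeg"],
        "dataTypes": ["demographics", "motor", "sleepScales"],
    },
}

data_collection_json = json.dumps(data_collection_meta_source, indent=2)
print(data_collection_json)
```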

Last modified: 31 March 2025