Metadata Discovery Model
Cafe Variome V3 (CV3) can consume, store, serve, and query metadata, either locally or as part of a federated network. This guide explains the metadata model that CV3 accepts and the principles behind its metadata functions.
Concept of metadata
Generally, metadata refers to "data about data": for example, information about a dataset, or about a record inside a dataset. In CV3, the definition is slightly different: metadata refers to the general information on a data collection, whether it is held inside CV3 or hosted somewhere else. Information about a subject is not split into data and metadata; all of it is stored inside a data source. Examples of metadata include:
Name, email or address of the contacting PI
Name, email or URL for the data publisher
The license this dataset is released under
The use conditions or agreement for the data
...
In general, any information about a collection of data (a dataset, catalog, cohort, etc.) can be stored in CV3. This is done via a concept called a meta-source.
Meta-sources
A meta-source is a single document representing a dataset or a collection of subject records. Internally, it's stored inside the collection of subject records (or dataset), directly in the database.
The meta-source can link to a dataset inside CV3, meaning that the metadata recorded is for that data source. Alternatively, the meta-source may refer to a dataset outside the system, meaning that the administrators of the installation are aware of the data and have metadata about it, but cannot or are unwilling to host the data in CV3.
In principle, the metadata recorded in the system is open for discovery, and no authentication is necessary to access it. All formats generated by CV3 for this metadata, for example, a FAIR Data Point, will also be open to the public.
Meta source types
Meta-sources are designed to be flexible, and can contain extra information with custom fields. There are also pre-defined meta-source models, which fit commonly used metadata models for a given resource type. They can be selected according to the type of data being described, and/or the metadata collected.
We currently have the following pre-defined metadata models:
EPND Cohort: The cohort model used by the European Platform for Neurodegenerative Disease to register their collaborative cohorts. An official catalog of cohorts can be found at EPND Cohort Catalog.
Dataset: A collection of records that have been collected for a defined purpose, such as to answer a specific research question. Datasets are atomic, meaning that they cannot be further divided into smaller datasets. As such, all requests and permissions granted apply to the entire dataset.
Data collection: A collection of datasets that have been collated for a common purpose, but which can be further divided into smaller collections. An example is all data collected as part of a given research programme or within a given consortium. Users are only expected to be granted partial access, and hence typically will only request access to a subset of a data collection.
Catalog: A collection of datasets and/or data collections. Nesting of catalogs is permitted, meaning it is possible to create a "catalog of catalogs."
Biobank: A collection of biological samples (and associated data) collected from patients. These collections are usually based on certain criteria, for example, a disease or a geographical location.
Registry: A patient registry containing information about patients, usually clinical information, but any information about patients can be stored in a registry.
Guideline: A collection of one or more guidelines for processes relating to a certain disease or a type of data collection.
Custom: This is a general type, and can be used to describe any type of data collection that is not covered in the above types.
Relationship between meta sources
Meta-sources can be related to each other, forming a hierarchy or a graph of metadata, as in the following diagram:
Relationship between meta sources and data sources
Meta-sources are designed to store metadata. Internally, they can therefore be linked to record-level data stored in regular data sources, mainly through the "belong to" relationship.
The Basic model
The structure of the metadata follows a general model. The basic model is used in all cases and is then extended to accommodate other types of data with more detailed fields.
Internal fields and manual assignment
The fields explained in the sections below are the fields that form the metadata about a given resource. However, there are several other fields used internally within Cafe Variome to enable specific features, like interlinking of metadata entries. These are not visible in the editing interface, but may be assigned manually provided the data is accurate and the admin follows the required processes to format them correctly.
- sourceId
The UUID of the source. If omitted, Cafe Variome will assign a UUID to the source. If present, it should be a valid UUID4 string. This is used within each Cafe Variome instance to identify the source (the UUID may not be unique across instances). It can be used to link one metadata entry to another, for example, by filling the datasetIds field in the cohort model.
- connectionId
The UUID of a data source this metadata entry describes. This is usually only valid when the metadata describes a dataset, but in rare cases can be assigned to other metadata. Assigning this field directly is not recommended, because the only way to find the UUID of a data source is to check the database.
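The sourceId rule above (assign a UUID when the field is omitted, otherwise require a valid UUID4 string) can be sketched in Python as follows. This only illustrates the rule and is not Cafe Variome's own implementation; the function name ensure_source_id is hypothetical.

```python
import uuid


def ensure_source_id(entry: dict) -> dict:
    """Return a copy of the entry that carries a valid UUID4 sourceId.

    Hypothetical helper: it mirrors the documented rule, not CV3's code.
    """
    result = dict(entry)
    source_id = result.get("sourceId")
    if source_id is None:
        # Field omitted: generate a fresh random (version 4) UUID.
        result["sourceId"] = str(uuid.uuid4())
        return result
    try:
        parsed = uuid.UUID(source_id)
    except (ValueError, AttributeError, TypeError):
        raise ValueError(f"sourceId is not a UUID: {source_id!r}")
    # Note: uuid.UUID(value, version=4) would overwrite the version bits
    # instead of checking them, so the version is compared explicitly.
    if parsed.version != 4:
        raise ValueError(f"sourceId is not a version 4 UUID: {source_id!r}")
    return result
```

Calling ensure_source_id({"sourceName": "Example"}) returns a copy with a generated sourceId, while an entry that carries a version 1 UUID raises ValueError.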
Common fields definition
These are the fields that are present in all types of meta-sources.
- sourceId
Source ID is a UUID4-compatible code generated by the system. However, if the input metadata has a hierarchical structure and the ID is used to denote parent-child relationships, the ID may be included in a manually uploaded file. It must be unique and in UUID4 format for the system to accept it.
- sourceName
The name of the source.
- sourceType
The type of the source. This is very important, as it determines how this entry is interpreted. It's an enumerable field and can only be one of the following (case-sensitive):
- custom
A custom type that is not covered by the following types.
- cohort
An EPND Cohort metadata model.
- catalog
A catalog of datasets.
- biobank
A collection of biological samples.
- registry
A patient registry.
- guideline
A guideline.
- dataset
A collection of records.
- dataCollection
A collection of datasets.
- resourceUrls
The URLs of the source. They should be fully qualified URLs with a scheme (e.g. https://example.com/). They should point to the main resource of the source, its description, or any point of interest that a requesting user may need.
- publisher
The publisher of the resource. It's a nested JSON object, with the following fields:
- publisherType
The type of the publisher. It's an enumerable field and can only be one of the following (case-sensitive):
- individual
An individual person.
- organization
An organization.
- agency
An agency that is not the generator/owner of the resource, but is responsible for managing the resource.
- other
Any other type of publisher.
- name
The name of the publisher.
- contactEmail
The contact email of the publisher.
- contactName
The contact name of the publisher, in case the publisher is not an individual.
- url
The URL of the publisher.
- location
The location of the publisher. A string containing free text, for example, "UK" or "Leicester, UK, Europe".
- description
The description of the source. It can be empty, but this is not recommended, as it is the main field used to describe the source and is used in free-text search.
- themes
The themes of the source. It's an array of strings, each string being a URI to an RDF data structure theme. This is useful when a custom theme is required when generating FDP data from this source. If omitted, the default theme will be used.
- releaseLicense
The release license of the source. It should be a URL to the license, and if omitted, it will be considered that there is "no license," meaning all rights reserved with no permission to use, modify, or distribute the data.
- language
The language of the source. It should be a 2-character code adhering to the ISO 639-1 standard, in lower case.
- customFields
The custom fields of the source. It's a set of key-value or key-values pairs, where the key is a string and the value is a string or an array of strings. It's designed to contain the custom metadata in a searchable form. The key cannot contain special characters, including dot (.), dollar sign ($), slash (/), or backslash (\). If a key is present, the value cannot be null, but can be an empty string or an empty array.
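The customFields constraints above can be checked before upload with a small function like the one below. It is a sketch under assumptions: the name custom_field_errors is hypothetical, and only the four characters named in the text are rejected, although the wording suggests other special characters may be refused as well.

```python
# The four characters the documentation names explicitly.
FORBIDDEN_KEY_CHARS = set(".$/\\")


def custom_field_errors(custom_fields: dict) -> list[str]:
    """Collect problems found in a customFields mapping (hypothetical helper)."""
    errors = []
    for key, value in custom_fields.items():
        if not isinstance(key, str) or FORBIDDEN_KEY_CHARS & set(key):
            errors.append(f"invalid key: {key!r}")
        if isinstance(value, str):
            continue  # a string, including the empty string, is allowed
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue  # an array of strings, including an empty one, is allowed
        # Anything else, including None (JSON null), is rejected.
        errors.append(f"invalid value for {key!r}: {value!r}")
    return errors
```

An empty list means that no problem was found; each entry otherwise names the offending key or value.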
Basic model example
The basic model, aka the Custom type, expects the following fields:
This metadata model is also used in other meta-source types, where the metadata is similar or same in structure.
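As an illustration, a document using only the common fields described above might be assembled as follows. This is a sketch: every value is a placeholder invented for this example, the optional sourceId and themes fields are left out, and the exact file layout expected by the upload function is not shown here.

```python
import json

# All values below are placeholders invented for illustration.
basic_meta_source = {
    "sourceName": "Example rare disease survey",
    "sourceType": "custom",
    "resourceUrls": ["https://example.com/"],
    "publisher": {
        "publisherType": "organization",
        "name": "Example Institute",
        "contactEmail": "data@example.com",
        "contactName": "Data Office",
        "url": "https://example.com/",
        "location": "Leicester, UK, Europe",
    },
    "description": "Placeholder description used for free-text search.",
    "releaseLicense": "https://creativecommons.org/licenses/by/4.0/",
    "language": "en",
    "customFields": {"fundingBody": "Example Funder", "tags": ["survey", "pilot"]},
}

# Serialize to JSON text for inspection.
print(json.dumps(basic_meta_source, indent=2))
```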
EPND project extensions
Cohort
Cohort fields definition
The cohort model extends the base model with more information, specifically the information deemed necessary in the EPND project. Aside from the fields explained above, it can also have:
- cohortDetails
The details of the cohort. It's a nested JSON object, with the following fields:
- siteType
The type of the site. It's an enumerable string, being one of the following (case-sensitive):
- singleSite
A cohort that has only a single site.
- multiSite
A cohort that has multiple sites.
- multiCountry
A cohort that has multiple sites in multiple countries.
- country
The country of the cohort. It's a 2-character code adhering to the ISO3166-1 standard, in upper case.
- yearStart
The year the cohort started. It should be a 4-digit integer.
- collectedTypes
The types of data collected for the cohort. It's a nested JSON object, with the following fields:
- participants
The number of participants in the cohort, and their conditions. It's a nested JSON object, with the following fields:
- diseases
The diseases that the participants are categorized by. An empty array will cause the entire participants object to be ignored. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- controlGroup
Control group participants.
- ad
Alzheimer's disease.
- pd
Parkinson's disease.
- irbd
Isolated REM Sleep Behavior Disorder.
- dlb
Dementia with Lewy Bodies.
- caa
Cerebral Amyloid Angiopathy.
- ftd
Frontotemporal Dementia.
- als
Amyotrophic Lateral Sclerosis.
- psp
Progressive Supranuclear Palsy.
- cbd
Corticobasal Degeneration.
- msa
Multiple System Atrophy.
- hd
Huntington's Disease.
- ataxia
Ataxia.
- other
Other diseases not listed above.
- numberOfSubjects
The number of subjects in the cohort. It's a number and should be above 0. If it's 0, the entire participants object will be ignored.
- bioSamples
The types of samples collected in the study. It's an array of enumerable strings, each being one of the following (case-sensitive):
- csf
Cerebrospinal fluid.
- serum
Serum.
- plasma
Plasma.
- dna
DNA.
- saliva
Saliva.
- urine
Urine.
- stool
Stool.
- images
The types of images collected for the study. It's an array of enumerable strings, each being one of the following (case-sensitive):
- mri
MRI.
- petAmyloid
PET Amyloid.
- petTau
PET Tau.
- spect
SPECT.
- ocular
Ocular.
- cognitiveData
The types of cognitive data collected for the study. It's an array of enumerable strings, each being one of the following (case-sensitive):
- crossSectional
Cross-sectional data.
- longitudinal
Longitudinal data.
- datasetIds
The UUIDs of the dataset type metadata entries that are related to this cohort. It's an array of strings, each being a UUID of a dataset metadata entry. The datasets have to be either present in the same file or already uploaded to the system.
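The two conditions under which the participants object is ignored (an empty diseases array, or a numberOfSubjects of 0) can be expressed as a single check. The sketch below is illustrative only: the function name is hypothetical, and treating a missing key like an empty array or 0 is an assumption that the text above does not state.

```python
def participants_ignored(participants: dict) -> bool:
    """Return True when the documented rules say the object is ignored."""
    # Assumption: a missing key behaves like an empty array or a zero count.
    no_diseases = len(participants.get("diseases", [])) == 0
    no_subjects = participants.get("numberOfSubjects", 0) == 0
    return no_diseases or no_subjects
```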
Cohort model example
Dataset
Dataset fields definition
Datasets contain the following fields:
- datasetVersions
The versions of the dataset. It's an array of nested JSON objects, each object being a representation of a version of the dataset. Each object has the following fields:
- datasetDetails
The details of the dataset. It's a nested JSON object, with the following fields:
- versionId
The UUID of the version. If omitted, Cafe Variome will assign a UUID to the version. If present, it should be a valid UUID4 string. Either way, it will always be present in the database.
- versionName
The version number of the dataset. Semantic versioning is recommended. Using a format that does not fit with semantic versioning will disable the parsing, comparison, and sorting of the versions.
- keywords
The keywords of the dataset. It's an array of strings, each being a keyword.
- publishedDate
The date when the dataset is released. It should be a date string in the format YYYY-MM-DD.
- updateDate
The update date of the dataset. It should be a date string in the format YYYY-MM-DD.
- datasetContent
The information on the data content of the dataset. It's a nested JSON object, with the following fields:
- numberOfSubjects
The number of data records in the dataset. This should be an integer above 0. It may be an approximated number if the exact number is kept private for confidentiality reasons.
- minAge
The minimum age of the participants in the dataset. It should be a number and should be above 0.
- maxAge
The maximum age of the participants in the dataset. It should be a number and should be above 0.
- countries
The countries of the participants in the dataset. It's an array of 2-character codes adhering to the ISO3166-1 standard, in upper case.
- diseases
The diseases of the participants in the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- controlGroup
Control group participants.
- ad
Alzheimer's disease.
- pd
Parkinson's disease.
- irbd
Isolated REM Sleep Behavior Disorder.
- dlb
Dementia with Lewy Bodies.
- caa
Cerebral Amyloid Angiopathy.
- ftd
Frontotemporal Dementia.
- als
Amyotrophic Lateral Sclerosis.
- psp
Progressive Supranuclear Palsy.
- cbd
Corticobasal Degeneration.
- msa
Multiple System Atrophy.
- hd
Huntington's Disease.
- ataxia
Ataxia.
- other
Other diseases not listed above.
- sex
The genders covered in the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- male
Biological male.
- female
Biological female.
- other
Other biological genders.
- undifferential
Gender data is irrelevant to the data, or is not differentiated on purpose.
- unknown
No information regarding the gender composition or collection status.
- clinical
The clinical data collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- comorbidities
Comorbidities.
- medicationUse
Medication use.
- familyHistory
Family history.
- ageOfSymptomOnset
Age of symptom onset.
- clinicalDiagnosis
Clinical diagnosis.
- exposure
Exposure.
- lifeStyleInfo
Lifestyle information.
- vitalSigns
Vital signs.
- markers
The biological or digital markers collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- amyloid
Amyloid protein markers.
- tau
Tau protein markers.
- neurofilamentLightChain
Neurofilament light chain protein markers.
- alphaSynuclein
Alpha-synuclein protein markers.
- dat
Direct Antibody Test.
- images
The types of images collected within the dataset. This field is optional. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- mri
MRI.
- petAmyloid
PET Amyloid.
- petTau
PET Tau.
- spect
SPECT.
- ocular
Ocular.
- electrophysiology
The types of electrophysiology data collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- eeg
EEG.
- meg
MEG.
- erp
ERP.
- dataTypes
The types of data collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- demographics
Demographics.
- clinical
Clinical.
- lifestyle
Lifestyle.
- functionalRatings
Functional ratings.
- motor
Motor.
- neuropsychiatric
Neuropsychiatric.
- neuropsychological
Neuropsychological.
- qualityOfLife
Quality of life.
- sleepScales
Sleep scales.
- digitalData
Digital data.
- imaging
Imaging.
- electrophysiology
Electrophysiology.
- neuroPathology
Neuro pathology.
- other
Other.
The information about each version is stored in the datasetVersions field, which is a list. Each version contains the details of the version, and the content of the dataset. Semantic versioning is recommended to enable version comparison and search.
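To show why semantic versioning matters for comparison and sorting, here is a minimal Python sketch. It is not the parser CV3 uses: the function names are hypothetical, only the MAJOR.MINOR.PATCH core is handled (pre-release and build suffixes are left out), and placing unparseable names last is a choice made for this example.

```python
import re

# MAJOR.MINOR.PATCH without leading zeros; suffixes are outside this sketch.
SEMVER_CORE = re.compile(r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)")


def parse_version(version_name: str):
    """Return (major, minor, patch), or None if the name does not fit."""
    match = SEMVER_CORE.fullmatch(version_name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def sort_version_names(names: list[str]) -> list[str]:
    """Sort parseable names numerically; keep the others, in order, at the end."""
    parseable = [n for n in names if parse_version(n) is not None]
    others = [n for n in names if parse_version(n) is None]
    return sorted(parseable, key=parse_version) + others
```

Because the parts are compared as integers, "1.10.0" sorts after "1.2.1", which plain string comparison would get wrong. A name such as "draft" cannot be compared at all, which is the situation the versionName description warns about.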
Dataset model example
Data collection
Data collection fields definition
These are the fields specific to the data collection type.
- dataCollectionDetails
The details of the data collection. It's a nested JSON object, with the following fields:
- keywords
The keywords of the dataset. It's an array of strings, each being a keyword.
- publishedDate
The date when the dataset is released. It should be a date string in the format YYYY-MM-DD.
- updateDate
The update date of the dataset. It should be a date string in the format YYYY-MM-DD.
- dataCollectionContent
The information on the data content of the data collection. It's a nested JSON object, with the following fields:
- numberOfSubjects
The number of data records in the dataset. It should be a number and should be above 0. It may be an approximated number if the exact number is kept private for confidentiality reasons.
- minAge
The minimum age of the participants in the dataset. It should be a number and should be above 0.
- maxAge
The maximum age of the participants in the dataset. It should be a number and should be above 0.
- countries
The countries of the participants in the dataset. It's an array of 2-character codes adhering to the ISO3166-1 standard, in upper case.
- diseases
The diseases of the participants in the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- controlGroup
Control group participants.
- ad
Alzheimer's disease.
- pd
Parkinson's disease.
- irbd
Isolated REM Sleep Behavior Disorder.
- dlb
Dementia with Lewy Bodies.
- caa
Cerebral Amyloid Angiopathy.
- ftd
Frontotemporal Dementia.
- als
Amyotrophic Lateral Sclerosis.
- psp
Progressive Supranuclear Palsy.
- cbd
Corticobasal Degeneration.
- msa
Multiple System Atrophy.
- hd
Huntington's Disease.
- ataxia
Ataxia.
- other
Other diseases not listed above.
- sex
The genders covered in the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- male
Biological male.
- female
Biological female.
- other
Other biological genders.
- undifferential
Gender data is irrelevant to the data, or is not differentiated on purpose.
- unknown
No information regarding the gender composition or collection status.
- clinical
The clinical data collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- comorbidities
Comorbidities.
- medicationUse
Medication use.
- familyHistory
Family history.
- ageOfSymptomOnset
Age of symptom onset.
- clinicalDiagnosis
Clinical diagnosis.
- exposure
Exposure.
- lifeStyleInfo
Lifestyle information.
- vitalSigns
Vital signs.
- markers
The biological or digital markers collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- amyloid
Amyloid protein markers.
- tau
Tau protein markers.
- neurofilamentLightChain
Neurofilament light chain protein markers.
- alphaSynuclein
Alpha-synuclein protein markers.
- dat
Direct Antibody Test.
- images
The types of images collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- mri
MRI.
- petAmyloid
PET Amyloid.
- petTau
PET Tau.
- spect
SPECT.
- ocular
Ocular.
- electrophysiology
The types of electrophysiology data collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- eeg
EEG.
- meg
MEG.
- erp
ERP.
- dataTypes
The types of data collected within the dataset. It's an array of enumerable strings, each can only be one of the following (case-sensitive):
- demographics
Demographics.
- clinical
Clinical.
- lifestyle
Lifestyle.
- functionalRatings
Functional ratings.
- motor
Motor.
- neuropsychiatric
Neuropsychiatric.
- neuropsychological
Neuropsychological.
- qualityOfLife
Quality of life.
- sleepScales
Sleep scales.
- digitalData
Digital data.
- imaging
Imaging.
- electrophysiology
Electrophysiology.
- neuroPathology
Neuro pathology.
- other
Other.
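The format rules used by the date fields (YYYY-MM-DD) and the countries field (2-character upper-case codes) lend themselves to a quick pre-upload check. The sketch below is an assumption-laden illustration: both function names are hypothetical, and the country check only tests the shape of the code, not whether it is actually assigned in ISO 3166-1.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
COUNTRY_PATTERN = re.compile(r"[A-Z]{2}")


def is_valid_date(value: str) -> bool:
    """True for a real calendar date written exactly as YYYY-MM-DD."""
    if not DATE_PATTERN.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2023-02-29
    except ValueError:
        return False
    return True


def looks_like_country_code(value: str) -> bool:
    """True for two upper-case ASCII letters; the ISO list is not consulted."""
    return COUNTRY_PATTERN.fullmatch(value) is not None
```

The pattern check runs first because date.fromisoformat accepts additional ISO 8601 layouts in recent Python versions, which the documented format does not allow.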