Data Repository: A data repository is an online collection of data, either raw or processed. To distinguish between different types of data repositories, the term data repository has come to mean a repository that stores processed data from academic research, versus a data lake, which is a data repository that stores raw data in various forms.
The data repository makes datasets available for use and further study. A data repository may be public or private (access), generalist or specialist (datasets), institutional or open (source of datasets).
Public data repositories allow anyone to access the datasets stored in the repository. Some repositories charge fees for access; others, such as Data.gov, provide free access. Private data repositories allow only members of the owning institution or organization to access the datasets. Many institutional repositories are also private.
A data repository can be a generalist repository, which holds general collections of data from various subjects (e.g. Dryad, Vivli, or Zenodo), or it can be a specialist repository, which focuses on data from select subjects, disciplines, or narrow areas of research (e.g. ImmPort, GitHub, or GenBank). Specialist repositories may focus on a very narrow type of dataset (e.g. ImmPort, which only hosts immunology-related data), or may have a subject or discipline focus and accept a larger variety of datasets (e.g. GitHub, which has a discipline focus on coding data).
Like other types of collections, a data repository can be institutional, which means that deposits of datasets are limited to research done within its institution or organization, or open, which means that the repository accepts datasets from any qualified research that falls within its areas of emphasis, regardless of where the research was done.
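For readers who think in code, the three ways of classifying a repository (access, datasets, and source of datasets) can be sketched as a small data structure. This is only an illustrative sketch: the class names, field names, and the "Example University Repository" entry are hypothetical, and the deposit policy shown for Dryad is an assumption made for the example rather than something stated in this guide.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    PUBLIC = "public"    # anyone may access the datasets
    PRIVATE = "private"  # only members of the owning institution may access them


class Scope(Enum):
    GENERALIST = "generalist"  # data from various subjects
    SPECIALIST = "specialist"  # data from select subjects or disciplines


class DepositPolicy(Enum):
    INSTITUTIONAL = "institutional"  # deposits limited to the owning institution
    OPEN = "open"                    # deposits accepted from any qualified research


@dataclass(frozen=True)
class Repository:
    name: str
    access: Access
    scope: Scope
    deposits: DepositPolicy

    def describe(self) -> str:
        """Return a one-line summary of the three classifications."""
        return f"{self.name}: {self.access.value}, {self.scope.value}, {self.deposits.value}"


# Hypothetical institutional repository, used only for illustration.
campus = Repository(
    "Example University Repository",
    Access.PRIVATE,
    Scope.GENERALIST,
    DepositPolicy.INSTITUTIONAL,
)

# This guide lists Dryad as public and generalist; the OPEN deposit
# policy here is an assumption made for the sake of the example.
dryad = Repository("Dryad", Access.PUBLIC, Scope.GENERALIST, DepositPolicy.OPEN)

for repo in (campus, dryad):
    print(repo.describe())
```

The three classifications are kept as separate fields because they are independent of one another: a repository can, for example, be institutional in what it accepts and still public in who may access it.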
Data Lake: A data lake is a general repository, used by industry, that accepts large volumes of data in various forms, such as sales transaction data, cellular phone data, search engine data, or drug research data. These repositories are usually institutional and private, which means that they accept data only from the owning companies, and only researchers from those companies may access the data stored in the repository. (One big exception is government data lakes, which are institutional but offer public access.) The data can be structured (e.g., Excel spreadsheets), semi-structured (e.g., webpages), or unstructured (e.g., images, tweets). The data can also be raw, cleansed, or curated so that users can find datasets that fit their analysis needs. Data is captured from a variety of sources, such as social media, IoT devices, and mobile apps, and is deposited as it comes in, into giant datasets. Data lakes are used for big data analytics, machine learning, and other types of large-volume data analysis.
Data Lake House: A data lake house is an open source, open standards solution to quality problems that can often arise with data in a data lake. It sits on top of a data lake and allows data scientists and engineers to perform deep analysis and processing on data stored within the lake.
Data Warehouse: A data warehouse is the industry version of a specialist data repository. It is a data lake that is built for one purpose, and all of the data and datasets that are deposited into the warehouse are treated and transformed to serve that purpose. Because industry generates data continuously for specific needs, data warehouses allow industry to regularly analyze that data and support decisions and business needs based on that data.
Data repository versus data source: what is the difference?
Data repositories are created to store, curate, and make available datasets that their customers deposit with them. Thus, the customer traffic moves two ways: depositing and withdrawal. Data repositories may be public (e.g. Dryad, Zenodo, Vivli) or private (e.g. a university institutional repository). The datasets that are deposited in repositories are processed and ready for curation. Datasets in public repositories usually come from academic research.
Data sources are special repositories that operate in only one direction for customers: withdrawal of stored datasets. The repository owner (e.g., Data.gov, Sage Data, National Institutes of Health) selects and curates specific datasets that are made available for withdrawals. These datasets may be generated by the organization that owns the repository, or they may be collected from various repositories and made available to researchers.
Why submit datasets to a repository?
Many researchers deposit their data in repositories for preservation and reproducibility purposes. Doing this ensures that the data connected to a research project will be saved, even if research groups change their personnel or research interests. Some publishers (e.g., Nature) or funders require the sharing of research data. Sharing data through repositories enables easier analysis of published research and supports the research pillar of Reproducibility. Researchers also reap the same benefits they get from sharing articles and publications: their work is easily findable and discoverable, resulting in more views, citations, and impact.
More Information
Directories of Data Repositories
Florida Atlantic University Libraries
777 Glades Road
Boca Raton, FL 33431
(561) 297-6911