LibGuides: Data Management: E: Preserving Data: Repositories, Lakes and Warehouses

About Data Repositories

Data Repository: A data repository is an online collection of data, either raw or processed. To distinguish between different types of data repositories, the term data repository has come to mean a repository that stores processed data from academic research, versus a data lake, which is a data repository that stores raw data in various forms.

The data repository makes data sets available for use and further study. A data repository may be public or private (access), generalist or specialist (datasets), institutional or open (source of datasets).

Public data repositories allow anyone to access the datasets stored in the repository. Some repositories charge fees for access, while others, such as Data.gov, provide free access. Private data repositories only allow members of the owning institution or organization to access the datasets. Many institutional repositories are also private.

A data repository can be a generalist repository, which focuses on general collections of data from various subjects (e.g., Dryad, Vivli, or Zenodo), or it can be a specialist repository, which focuses on data from select subjects, disciplines, or narrow areas of research (e.g., ImmPort, GitHub, or GenBank). Specialist repositories may focus on a very narrow type of dataset (e.g., ImmPort, which only hosts immunology-related data), or may have a subject or discipline focus and accept a larger variety of datasets (e.g., GitHub, which has a discipline focus on coding data).

Like other types of collections, a data repository can be institutional, which means that deposits of datasets are limited to research done within its institution or organization, or open, which means that the repository accepts datasets from any qualified research that falls within its areas of emphasis, regardless of where the research was done.
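The three distinctions above (access, datasets, and source of datasets) can be treated as independent labels on a repository record. The following is a minimal sketch, not a standard schema: the class and field names are hypothetical, "Example University Repository" is an invented placeholder, and labeling Dryad as "open" is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    PUBLIC = "public"
    PRIVATE = "private"


class Scope(Enum):
    GENERALIST = "generalist"
    SPECIALIST = "specialist"


class DepositPolicy(Enum):
    INSTITUTIONAL = "institutional"
    OPEN = "open"


@dataclass(frozen=True)
class Repository:
    """One repository, labeled on each of the three axes."""
    name: str
    access: Access
    scope: Scope
    deposit_policy: DepositPolicy

    def describe(self) -> str:
        return (f"{self.name}: {self.access.value}, {self.scope.value}, "
                f"{self.deposit_policy.value}")


# Illustrative records only; the labels are assumptions, not official metadata.
dryad = Repository("Dryad", Access.PUBLIC, Scope.GENERALIST, DepositPolicy.OPEN)
campus = Repository("Example University Repository", Access.PRIVATE,
                    Scope.GENERALIST, DepositPolicy.INSTITUTIONAL)

for repo in (dryad, campus):
    print(repo.describe())
```

Keeping the three labels as separate fields reflects that they vary independently: a repository can be institutional and still offer public access.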

Data Lake: A data lake is a general repository, used by industry, that accepts large volumes of data in various forms, such as sales transaction data, cellular phone data, search engine data, or drug research data. These repositories are usually institutional and private, which means that they only accept data from the owning companies, and only researchers from those companies may access the data stored in the repository. (One big exception is government data lakes, which are institutional but offer public access.) The data can be structured (e.g., Excel spreadsheets), semi-structured (e.g., webpages), or unstructured (e.g., images, tweets). The data can also be raw, cleansed, or curated so that users can find datasets that fit their analysis needs. Data is captured from a variety of sources, such as social media, IoT devices, and mobile apps, and is deposited as it comes in, into giant datasets. Data lakes are used for big data analytics, machine learning, and other types of large-volume data analysis.

Data Lake House: A data lake house is an open source, open standards solution to quality problems that can often arise with data in a data lake. It sits on top of a data lake and allows data scientists and engineers to perform deep analysis and processing on data stored within the lake.

Data Warehouse: A data warehouse is the industry version of a specialist data repository. It is a data lake that is built for one purpose, and all of the data and datasets that are deposited into the warehouse are treated and transformed to serve that purpose. Because industry generates data continuously for specific needs, data warehouses allow industry to regularly analyze that data and support decisions and business needs based on that data.
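The contrast between a data lake and a data warehouse can be illustrated with a small sketch. It assumes a toy in-memory model: the class names, the `to_sales_row` function, and the sample record are all hypothetical, and real systems are far more elaborate. The lake keeps whatever arrives, while the warehouse transforms each record to fit its single purpose before storing it.

```python
from typing import Any, Callable


class DataLake:
    """Stores records exactly as they arrive, whatever their form."""

    def __init__(self) -> None:
        self.records: list[Any] = []

    def deposit(self, record: Any) -> None:
        self.records.append(record)


class DataWarehouse:
    """Transforms each record to serve one purpose before storing it."""

    def __init__(self, transform: Callable[[Any], dict]) -> None:
        self.transform = transform
        self.rows: list[dict] = []

    def deposit(self, record: Any) -> None:
        self.rows.append(self.transform(record))


def to_sales_row(record: dict) -> dict:
    # Hypothetical purpose: sales reporting. Keep only the fields that
    # purpose needs, in a consistent type and format.
    return {
        "sku": str(record["sku"]).upper(),
        "amount": round(float(record["amount"]), 2),
    }


raw = {"sku": "ab-1", "amount": "19.990", "note": "gift wrap"}

lake = DataLake()
lake.deposit(raw)                 # stored untouched
lake.deposit("free-text tweet")   # unstructured data is accepted too

warehouse = DataWarehouse(to_sales_row)
warehouse.deposit(raw)            # stored in purpose-built shape

print(lake.records)
print(warehouse.rows)
```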

Data repository versus data source: what is the difference?

Data repositories are created to store, curate, and make available datasets that their customers deposit with them. Thus, the customer traffic moves two ways: depositing and withdrawal. Data repositories may be public (e.g., Dryad, Zenodo, Vivli) or private (e.g., a university institutional repository). The datasets that are deposited in repositories are processed and ready for curation. Datasets in public repositories usually come from academic research.

Data sources are special repositories that operate in only one direction for customers: withdrawal of stored datasets. The repository owner (e.g., Data.gov, Sage Data, National Institutes of Health) selects and curates specific datasets that are made available for withdrawals. These datasets may be generated by the organization that owns the repository, or they may be collected from various repositories and made available to researchers.

Why submit datasets to a repository?

Many researchers deposit their data in repositories for preservation and reproducibility purposes. Doing this assures that the data connected to a research project will be saved, even if research groups change their personnel or research interests. Some publishers (e.g., Nature) or funders require the sharing of research data. Sharing data through repositories enables easier analysis of published research and supports the research pillar of Reproducibility. Researchers also reap the same benefits as sharing articles and publications: their work is easily findable and discoverable, resulting in more views, citations, and impact.

More Information

Florida Atlantic University Research Registries and Repositories Policy
Florida Atlantic University Research Data Policy
Florida Atlantic University Data Storage Recommendations
Division of Research Data Security, Storage, Access, Transfer and Destruction recommendations
Florida Atlantic University Library NIH Data Management Resources
NIH Data Management information and policies for the College of Medicine
What is Open Data? (SPARC Europe)
Recommended Data Repositories (Nature)
This list includes recommended generalist data repositories, and also those for STEM and social sciences.

Directories of Data Repositories

DataCite Commons
Find a data repository within the Re3Data.org directory.
Re3Data.Org
A search engine of data repositories. Search by discipline, subject, or topic.
OpenDOAR - Directory of Open Access Repositories
Includes some data repository information.

Find a Generalist or Domain-Specific Data Repository

US Department of Education - Data and Statistics
Open ICPSR
Find and share social, behavioral, and health sciences research data.
Child & Family Data Archive
LDbase
A data repository for educational and developmental sciences.

NIH-Supported Data Sharing Resources
A list of data sharing repositories by the National Institutes of Health (NIH).
Open ICPSR
Find and share social, behavioral, and health sciences research data.
Health and Medical Care Archive (ICPSR)
A data archive in ICPSR with focus on health science data.
PubMed Central Article Datasets
PubMed Central and the NCBI Bookshelf offer several large datasets of journal articles and other scientific publications made available for retrieval under license terms that generally allow for more liberal redistribution and reuse than a traditional copyrighted work (e.g., Creative Commons licenses).
Vivli
A repository of data from clinical trials.

CORE: Open Access for the Humanities
Formerly known as Humanities Commons.
Isidore
Collects data, documents, and more in social sciences and the humanities.
National Data on Arts and Culture (ICPSR)
Qualitative Data Repository

Data.gov
EarthChem
Data repository for earth science.
GitHub
One of the largest repositories for open source software.
HEPData
Repository for publication-related High-Energy Physics data.
Marine Geoscience Data System
OpenNeuro
A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data.
ODISCat
Oceanographic data from the International Oceanographic Data and Information Exchange of the Intergovernmental Oceanographic Commission of UNESCO.
SAO/NASA Astrophysics Data System
Science Data Repository (NASA)
Includes datasets on Earth, heliophysics, planetary, and astrophysics observations.
Research Data Archive (National Center for Atmospheric Research - NCAR)
Biological and Chemical Oceanography Data Management Office (BCO-DMO)
CAIDA Resource Catalog
https://catalog.caida.org/
A repository of datasets, software, solutions, and more.
National Oceanographic Data Center (NODC)

Child & Family Data Archive
Isidore
Collects data, documents, and more in social sciences and the humanities.
National Archive on Criminal Justice Data (ICPSR)
Open ICPSR
Find and share social, behavioral, and health sciences research data.
Resource Center for Minority Data (ICPSR)
Qualitative Data Repository
LDbase
A data repository for educational and developmental sciences.

Government Datasets, Statistical Data & Census Information (Research Guide)
Find access points to government datasets and statistical data.
OAD: Data repositories
Wiki listing of data repositories.
Dryad
FigShare
Harvard DataVerse
ICPSR (University of Michigan)
A large, online data repository spanning many subjects in the social sciences, education, arts and humanities, and healthcare.
ICPSR Thematic Data Collections (University of Michigan)
ICPSR hosts thematic collections of data in its repositories.
Mendeley Data (Elsevier)
Open Context
A fee-based data repository service and site.
Synapse
Vivli
A repository of data from clinical trials.
Zenodo
Zenodo emphasizes science but also includes other areas. It also integrates with GitHub and provides altmetrics on items in its collection.

Find a Repository: Generalist and Domain-Specific

arXiv.org
Cogprints
Repository of cognitive sciences.
Cryptology ePrint Archive
CERN Document Server
NSF Public Access Registry (PAR)
A portal where scholars can log in to deposit their NSF-funded work.

CORE: Open Access for the Humanities
Formerly known as Humanities Commons.
The Digital Archaeological Record (tDAR)
Isidore
Collects data, documents, and more in social sciences and the humanities.
Semantics Archive

Open Anthropology Research Repository
RePEc: Research Papers in Economics
Pre-print server for economics and related topics.
Social Science Research Network (SSRN)
Social Science Research Network (SSRN) is a worldwide collaborative of over 171,000 authors and more than 1.3 million users that is devoted to the rapid dissemination of social science research. It is composed of a number of specialized research networks in each of the social sciences. Each of SSRN's networks encourages the early distribution of research results by reviewing and distributing submitted abstracts and full-text papers from scholars around the world.
Policy Archive
A repository of public policy research.
Isidore
Collects data, documents, and more in social sciences and the humanities.

PubMed Central (PMC)
NIH repository for peer-reviewed primary research reports in the life sciences. View the full text of articles online.
Vivli
A repository of data from clinical trials.
NIH-Supported Data Sharing Resources
searchRxiv

Dryad
Harvard DataVerse
FigShare
Mendeley Data (Elsevier)
OSF - Open Science Framework Repository
searchRxiv
Synapse
Vivli
A repository of data from clinical trials.
Zenodo
Zenodo emphasizes science but also includes other areas. It also integrates with GitHub and provides altmetrics on items in its collection.
Zenodo Communities
Explore Zenodo's expansive list of collections, ranging from linguistics to COVID-19 and more.

Last updated on Nov 25, 2025 9:55 AM