LibGuides: Data Management: E: Data Repositories, Lakes and Warehouses

About Data Repositories


Data Repository: A data repository is an online collection of data from academic research. It makes data sets available for use and further study. A data repository can focus on general collections of data from various subjects, or it can be discipline-specific with data from select subjects or areas of research. Like other types of collections, a data repository may limit the data in its collections to research done within its institution or organization (also known as institutional data repositories) or may accept data collections that fall within its areas of emphasis.

Data Lake: A data lake is a general repository, used by industry, that accepts large volumes of data in various forms, such as sales transaction data, cellular phone data, search engine data, or drug research data. These repositories are usually private (though not always), so they only accept data from the owning companies, and only researchers from those companies may access the data stored in the repository. The data can be structured (e.g., Excel spreadsheets), semi-structured (e.g., webpages), or unstructured (e.g., images, tweets). The data can also be raw, cleansed, or curated so that users can find datasets that fit their analysis needs. Data is captured and deposited from a variety of sources, such as social media, IoT devices, mobile apps, and more. The data is deposited as it comes in, into giant datasets. Data lakes are used for big data analytics, machine learning, and other types of large-volume data analysis.

Data Lakehouse: A data lakehouse is an open source, open standards solution to quality problems that can often arise with data in a data lake. It sits on top of a data lake and allows data scientists and engineers to perform deep analysis and processing on data stored within the lake.

Data Warehouse: A data warehouse is a data lake that is built for one purpose, and all of the datasets that are deposited into the warehouse are treated and transformed to serve that purpose. Because industry generates data continuously for specific needs, data warehouses allow industry to regularly analyze that data and support decisions and business needs based on that data.
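
The contrast between a lake (data deposited as it comes in) and a warehouse (data transformed to serve one purpose) can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the sample records, the sales schema, and the names `data_lake`, `sales_warehouse`, and `to_sales_row` do not come from any particular product.

```python
import json

# Hypothetical raw inputs arriving from different sources, in different shapes.
raw_events = [
    '{"store": "A", "amount": "19.99", "ts": "2025-01-03"}',  # semi-structured JSON
    "B,5.00,2025-01-03",                                      # structured CSV line
    "free-text note: customer asked about returns",           # unstructured text
]

# Data lake: deposit everything as it comes in, untouched.
data_lake = list(raw_events)


def to_sales_row(raw):
    """Transform one raw record into a fixed (assumed) sales schema, or return None."""
    try:
        rec = json.loads(raw)
    except ValueError:
        rec = None  # not JSON; try the CSV layout below
    try:
        if isinstance(rec, dict) and all(k in rec for k in ("store", "amount", "ts")):
            return {"store": rec["store"], "amount": float(rec["amount"]), "date": rec["ts"]}
        parts = raw.split(",")
        if len(parts) == 3:
            return {"store": parts[0], "amount": float(parts[1]), "date": parts[2]}
    except ValueError:
        pass  # amount was not a number
    return None


# Data warehouse: keep only records that can be transformed to serve
# one purpose (here, sales reporting).
sales_warehouse = [row for row in map(to_sales_row, data_lake) if row is not None]

# The single purpose the warehouse was built for: totals per store.
total_by_store = {}
for row in sales_warehouse:
    total_by_store[row["store"]] = total_by_store.get(row["store"], 0.0) + row["amount"]
```

In this sketch the lake keeps all three records, including the free-text note, while the warehouse holds only the two records that fit the sales schema. Real systems make this trade-off at much larger scale and with dedicated tooling.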

Data repository versus data source: what is the difference?

Data repositories are created to store, curate, and make available datasets that their customers deposit with them. Thus, the customer traffic moves two ways: deposit and withdrawal. Data repositories may be public (e.g., Dryad, Zenodo, Vivli) or private (e.g., a university institutional repository). The datasets that are deposited are processed and ready for curation, and usually come from academic research.

Data sources are special repositories that operate in only one direction for customers: withdrawal of stored datasets. The repository owner (e.g., Data.gov, Sage Data, National Institutes of Health) selects and curates specific datasets that are made available for withdrawal. These datasets may be generated by the organization that owns the repository, or they may be collected from various repositories and made available to researchers.

Why post data in a repository?

Many researchers deposit their data in repositories for preservation and reproducibility. Doing so ensures that the data connected to a research project will be saved, even if research groups change their personnel or research interests. Some publishers (e.g., Nature) and funders require the sharing of research data. Sharing data through repositories makes published research easier to analyze and supports reproducibility, a pillar of research. Researchers also reap the same benefits that come from sharing articles and publications: their work is more easily discoverable, resulting in more views, citations, and impact.

More Information

Florida Atlantic University Research Registries and Repositories Policy
Florida Atlantic University Research Data Policy
Florida Atlantic University Data Storage Recommendations
Division of Research Data Security, Storage, Access, Transfer and Destruction recommendations
Florida Atlantic University Library NIH Data Management Resources
NIH Data Management information and policies for the College of Medicine
What is Open Data? (SPARC Europe)
Recommended Data Repositories (Nature)
This list includes recommended generalist data repositories, and also those for STEM and social sciences.

Directories of Data Repositories

DataCite Commons
Find a data repository within the Re3Data.org directory.
Re3Data.Org
A search engine of data repositories. Search by discipline, subject, or topic.
OpenDOAR - Directory of Open Access Repositories
Includes some data repository information.

Find a Generalist or Domain-Specific Data Repository

US Department of Education - Data and Statistics
Open ICPSR
Find and share social, behavioral, and health sciences research data.
Child & Family Data Archive
LDbase
A data repository for educational and developmental sciences.

NIH-Supported Data Sharing Resources
A list of data sharing repositories by the National Institutes of Health (NIH).
Open ICPSR
Find and share social, behavioral, and health sciences research data.
Health and Medical Care Archive (ICPSR)
A data archive in ICPSR with focus on health science data.
PubMed Central Article Datasets
PubMed Central and the NCBI Bookshelf offer several large datasets of journal articles and other scientific publications made available for retrieval under license terms that generally allow for more liberal redistribution and reuse than a traditional copyrighted work (e.g., Creative Commons licenses).
Vivli
A repository of data from clinical trials.

CORE: Open Access for the Humanities
Formerly known as Humanities Commons.
Isidore
Collects data, documents, and more in social sciences and the humanities.
National Data on Arts and Culture (ICPSR)
Qualitative Data Repository

Data.gov
EarthChem
Data repository for earth science.
GitHub
One of the largest repositories for open source software.
HEPData
Repository for publication-related High-Energy Physics data.
Marine Geoscience Data System
OpenNeuro
A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data.
ODISCat
Oceanographic data from the International Oceanographic Data and Information Exchange (IODE) of the Intergovernmental Oceanographic Commission of UNESCO.
SAO/NASA Astrophysics Data System
The SAO/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory (SAO) under a NASA grant. The ADS maintains three bibliographic databases containing more than 9.9 million records: Astronomy and Astrophysics, Physics, and arXiv e-prints.
Science Data Repository (NASA)
Includes datasets on Earth, heliophysics, planetary, and astrophysics observations.
Research Data Archive (National Center for Atmospheric Research - NCAR)
Biological and Chemical Oceanography Data Management Office (BCO-DMO)
CAIDA Resource Catalog
https://catalog.caida.org/
A repository of datasets, software, solutions, and more.
National Oceanographic Data Center (NODC)

Child & Family Data Archive
Isidore
Collects data, documents, and more in social sciences and the humanities.
National Archive on Criminal Justice Data (ICPSR)
Open ICPSR
Find and share social, behavioral, and health sciences research data.
Resource Center for Minority Data (ICPSR)
Qualitative Data Repository
LDbase
A data repository for educational and developmental sciences.

Government Datasets, Statistical Data & Census Information (Research Guide)
Find access points to government datasets and statistical data.
OAD: Data repositories
Wiki listing of data repositories.
Dryad
FigShare
Harvard DataVerse
ICPSR (University of Michigan)
A large, online data repository spanning many subjects in the social sciences, education, arts and humanities, and healthcare.
ICPSR Thematic Data Collections (University of Michigan)
ICPSR hosts thematic collections of data in its repositories.
Mendeley Data (Elsevier)
Open Context
A fee-based data repository service and site.
Synapse
Vivli
A repository of data from clinical trials.
Zenodo
Zenodo emphasizes the sciences but also includes other areas. It also integrates with GitHub and provides altmetrics on items in its collection.

Find a Repository: Generalist and Domain-Specific

arXiv.org
Cogprints
Repository of cognitive sciences.
Cryptology ePrint Archive
CERN Document Server
NSF Public Access Registry (PAR)
A portal where scholars can log in to deposit their NSF-funded work.

CORE: Open Access for the Humanities
Formerly known as Humanities Commons.
The Digital Archaeological Record (tDAR)
Isidore
Collects data, documents, and more in social sciences and the humanities.
Semantics Archive

Open Anthropology Research Repository
RePEc: Research Papers in Economics
Pre-print server for economics and related topics.
Social Science Research Network (SSRN)
Social Science Research Network (SSRN) is a worldwide collaborative of over 171,000 authors and more than 1.3 million users that is devoted to the rapid dissemination of social science research. It is composed of a number of specialized research networks in each of the social sciences. Each of SSRN's networks encourages the early distribution of research results by reviewing and distributing submitted abstracts and full-text papers from scholars around the world.
Policy Archive
A repository of public policy research.
Isidore
Collects data, documents, and more in social sciences and the humanities.

PubMed Central (PMC)
NIH repository for peer-reviewed primary research reports in the life sciences. View the full text of articles online.
Vivli
A repository of data from clinical trials.
NIH-Supported Data Sharing Resources
This page provides a list of NIH-supported domain-specific data repositories. Researchers can submit and see data that is accessible and open for reuse. Data in these repositories is usually limited to a discipline or certain types of data.
searchRxiv

Dryad
Harvard DataVerse
FigShare
Mendeley Data (Elsevier)
OSF - Open Science Framework Repository
searchRxiv
Synapse
Vivli
A repository of data from clinical trials.
Zenodo
Zenodo emphasizes the sciences but also includes other areas. It also integrates with GitHub and provides altmetrics on items in its collection.
Zenodo Communities
Explore Zenodo's expansive list of collections, ranging from linguistics to COVID-19 and more.

Last updated on May 7, 2025 3:35 PM