Data Repository: A data repository is an online collection of data from academic research. It makes datasets available for use and further study. A data repository can hold general collections of data from various subjects, or it can be discipline-specific, with data from select subjects or areas of research. Like other types of collections, a data repository may limit its holdings to data from research done within its own institution or organization (also known as an institutional data repository), or it may accept data collections that fall within its areas of emphasis.
Data Lake: A data lake is a general repository, used by industry, that accepts large volumes of data in various forms, such as sales transaction data, cellular phone data, search engine data, or drug research data. These repositories are usually private (though not always), so they accept data only from the owning companies, and only researchers from those companies may access the data stored in the repository. The data can be structured (e.g., Excel spreadsheets), semi-structured (e.g., webpages), or unstructured (e.g., images, tweets). The data can also be raw, cleansed, or curated so that users can find datasets that fit their analysis needs. Data is captured from a variety of sources, such as social media, IoT devices, and mobile apps, and is deposited as it comes in, accumulating into giant datasets. Data lakes are used for big data analytics, machine learning, and other types of large-volume data analysis.
Data Lakehouse: A data lakehouse is an open-source, open-standards solution to the quality problems that often arise with data in a data lake. It sits on top of a data lake and allows data scientists and engineers to perform deep analysis and processing on the data stored within the lake.
Data Warehouse: A data warehouse is a data lake that is built for one purpose, and all of the datasets deposited into the warehouse are treated and transformed to serve that purpose. Because industry continuously generates data for specific needs, a data warehouse lets a company analyze that data on a regular schedule and use the results to support business decisions.
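The contrast between a data lake and a data warehouse described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names DataLake and SalesWarehouse, the "region" and "amount" fields, and the sample records are all hypothetical and do not come from any real product. The lake stores whatever arrives, unchanged, while the warehouse transforms each record to fit its single purpose at the moment of deposit.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DataLake:
    """Hypothetical lake: keeps every record exactly as it arrives."""
    records: list[Any] = field(default_factory=list)

    def deposit(self, record: Any) -> None:
        # No cleaning and no required shape: the record is stored as-is.
        self.records.append(record)


@dataclass
class SalesWarehouse:
    """Hypothetical warehouse built for one purpose: sales totals by region."""
    rows: list[dict[str, Any]] = field(default_factory=list)

    def deposit(self, record: dict[str, Any]) -> None:
        # Transform each record into the single shape this warehouse serves.
        self.rows.append({
            "region": str(record["region"]).strip().upper(),
            "amount": float(record["amount"]),
        })

    def total_by_region(self) -> dict[str, float]:
        totals: dict[str, float] = {}
        for row in self.rows:
            totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
        return totals


lake = DataLake()
lake.deposit({"region": " east ", "amount": "19.50"})   # structured
lake.deposit("<html><body>store page</body></html>")    # semi-structured
lake.deposit(b"\x89PNG...")                              # unstructured (stand-in for image bytes)

# Only records that fit the warehouse's purpose are transformed and loaded.
warehouse = SalesWarehouse()
for record in lake.records:
    if isinstance(record, dict):
        warehouse.deposit(record)
```

In this sketch the lake ends up holding all three records in their original forms, while the warehouse holds only the one sales record, cleaned into a consistent shape that can be analyzed directly.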
Data repository versus data source: what is the difference?
Data repositories are created to store, curate, and make available the datasets that their customers deposit with them. Thus, customer traffic moves two ways: deposit and withdrawal. Data repositories may be public (e.g., Dryad, Zenodo, Vivli) or private (e.g., a university institutional repository). The datasets that are deposited are processed and ready for curation, and usually come from academic research.
Data sources are special repositories that operate in only one direction for customers: withdrawal of stored datasets. The repository owner (e.g., Data.gov, Sage Data, National Institutes of Health) selects and curates specific datasets that are made available for withdrawal. These datasets may be generated by the organization that owns the repository, or they may be collected from various repositories and made available to researchers.
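The two-way versus one-way distinction can be pictured with a minimal Python sketch. The class and method names here (DataRepository, DataSource, deposit, withdraw) and the sample dataset names are hypothetical, chosen only to mirror the wording above; they do not describe how any particular repository is implemented.

```python
class DataRepository:
    """Hypothetical repository: customers can both deposit and withdraw."""

    def __init__(self) -> None:
        self._datasets: dict[str, list] = {}

    def deposit(self, name: str, data: list) -> None:
        self._datasets[name] = list(data)

    def withdraw(self, name: str) -> list:
        # Return a copy so the stored dataset is not altered by the customer.
        return list(self._datasets[name])


class DataSource:
    """Hypothetical data source: the owner chooses the contents,
    and customers can only withdraw."""

    def __init__(self, curated: dict[str, list]) -> None:
        # Datasets are selected and curated by the owner up front.
        self._datasets = {name: list(data) for name, data in curated.items()}

    def withdraw(self, name: str) -> list:
        return list(self._datasets[name])


repository = DataRepository()
repository.deposit("field-survey", [3, 5, 8])         # customer deposits
survey = repository.withdraw("field-survey")          # customer withdraws

source = DataSource({"census-sample": [101, 102]})    # owner-curated contents
sample = source.withdraw("census-sample")             # customer withdraws only
```

The data source deliberately has no deposit method, which is the one-direction behavior the paragraph above describes.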
Why post data in a repository?
Many researchers deposit their data in repositories for preservation and reproducibility. Doing so ensures that the data connected to a research project will be saved, even if research groups change their personnel or research interests. Some publishers (e.g., Nature) and funders require the sharing of research data. Sharing data through repositories makes published research easier to analyze and supports reproducibility, one of the pillars of research. Researchers also reap the same benefits they get from sharing articles and publications: their work is easily discoverable, resulting in more views, citations, and impact.
More Information
Directories of Data Repositories
Florida Atlantic University Libraries