Dataget
Dataget is an easy-to-use, framework-agnostic dataset library that gives you quick access to a collection of Machine Learning datasets through a simple API.
Main features:
- Minimal: Downloads entire datasets with just 1 line of code.
- Framework Agnostic: Loads data as numpy arrays or pandas dataframes, which can be easily used with the majority of Machine Learning frameworks.
- Transparent: By default stores the data in your current project so you can easily inspect it.
- Memory Efficient: When a dataset doesn't fit in memory it will return metadata instead so you can iteratively load it.
- Integrates with Kaggle: Supports loading datasets directly from Kaggle in a variety of formats.
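As a rough illustration of the "Memory Efficient" point, the sketch below shows one way to iterate over a dataset when only metadata is returned. It does not use dataget itself: the `path` and `label` column names, the `.npy` files, and the `iter_batches` helper are all assumptions made up for this example, and the real metadata layout depends on the dataset.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def iter_batches(metadata: pd.DataFrame, batch_size: int):
    """Yield (arrays, labels) batches, loading only `batch_size` files at a time."""
    for start in range(0, len(metadata), batch_size):
        chunk = metadata.iloc[start:start + batch_size]
        # Only the files of the current chunk are read into memory.
        arrays = np.stack([np.load(path) for path in chunk["path"]])
        yield arrays, chunk["label"].to_numpy()


# Build a tiny stand-in dataset on disk so the sketch runs on its own.
# (Hypothetical layout: one .npy file per sample plus a metadata table.)
tmp_dir = Path(tempfile.mkdtemp())
rows = []
for i in range(5):
    path = tmp_dir / f"sample_{i}.npy"
    np.save(path, np.full((4, 4), i, dtype=np.float32))
    rows.append({"path": str(path), "label": i % 2})
metadata = pd.DataFrame(rows)

batch_shapes = [arrays.shape for arrays, labels in iter_batches(metadata, batch_size=2)]
print(batch_shapes)
```

The same pattern applies to any loader: keep the metadata table in memory and read the heavy files lazily, one batch at a time.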
Check out the documentation for the list of available datasets.
Getting Started
In dataget you just have to do two things:
- Instantiate a Dataset from our collection.
- Call the get method to download the data to disk and load it into memory.
Both are usually done in one line:
    import dataget

    X_train, y_train, X_test, y_test = dataget.image.mnist().get()
This example downloads the MNIST dataset to ./data/image_mnist and loads it as numpy arrays.
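Because the data arrives as plain numpy arrays, it can be prepared for most frameworks with ordinary numpy code. The sketch below is a minimal example that uses a random stand-in array instead of the real download. The `(N, 28, 28)` shape and `uint8` dtype are assumptions about what `X_train` looks like, and `prepare_images` is a hypothetical helper, not part of dataget.

```python
import numpy as np


def prepare_images(X):
    """Flatten each image to a vector and scale pixel values to [0, 1]."""
    X = np.asarray(X, dtype=np.float32)
    return X.reshape(len(X), -1) / 255.0


# Stand-in for X_train (assumed shape and dtype; not the real MNIST data).
rng = np.random.default_rng(0)
X_fake = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)

features = prepare_images(X_fake)
print(features.shape)
```

The resulting 2D array can be passed to any estimator that accepts a samples-by-features matrix.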
Kaggle Support
Kaggle promotes the use of csv files and dataget loves it! With dataget you can quickly download any dataset from the platform and have immediate access to the data:
    import dataget

    df_train, df_test = dataget.kaggle(dataset="cristiangarcia/pointcloudmnist2d").get(
        files=["train.csv", "test.csv"]
    )
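Conceptually, the `files` argument maps each listed csv file to one dataframe, returned in the same order. The sketch below imitates that behaviour with pandas alone on a temporary directory. `load_csv_files` is a hypothetical helper written for illustration and is not how dataget is necessarily implemented.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_files(directory, files):
    """Read each listed csv file and return the dataframes in the same order."""
    return tuple(pd.read_csv(Path(directory) / name) for name in files)


# Create two small csv files standing in for a downloaded dataset.
tmp_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"x": [0.1, 0.2, 0.3], "label": [1, 0, 1]}).to_csv(
    tmp_dir / "train.csv", index=False
)
pd.DataFrame({"x": [0.4], "label": [0]}).to_csv(tmp_dir / "test.csv", index=False)

df_train, df_test = load_csv_files(tmp_dir, files=["train.csv", "test.csv"])
print(len(df_train), len(df_test))
```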
Planned improvements to Kaggle support:
- Be able to load any file that numpy or pandas can read.
- Have generic support for other types of datasets like images, audio, video, etc.
  - e.g. dataget.data.kaggle(..., type="image").get(...)
Installation
pip install dataget
Contributing
Adding a new dataset is easy! Read our guide on Creating a Dataset if you are interested in contributing a dataset.
License
MIT License