dataget.text.imdb_reviews

Downloads the IMDB Reviews dataset and loads it as pandas DataFrames.

import dataget

df_train, df_test = dataget.text.imdb_reviews().get()
This dataset also contains unlabeled (unsupervised) samples. To load them, set the include_unlabeled argument:

import dataget

df_train, df_test = dataget.text.imdb_reviews().get(include_unlabeled=True)
All unlabeled samples have a label of -1.
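
Rows with a label of -1 usually need to be separated from the labeled rows before supervised training. The following is a minimal sketch of that split, assuming a small hand-built frame in place of the real df_train. The rows and paths are invented for illustration, and split_labeled is a hypothetical helper, not part of dataget.

```python
import pandas as pd


def split_labeled(df: pd.DataFrame):
    """Return (labeled rows, unlabeled rows), where unlabeled means label == -1."""
    unlabeled_mask = df["label"] == -1
    labeled = df[~unlabeled_mask].reset_index(drop=True)
    unlabeled = df[unlabeled_mask].reset_index(drop=True)
    return labeled, unlabeled


# Stand-in for df_train; the real frame would come from dataget.
df_train = pd.DataFrame(
    {
        "text": ["great film", "dull plot", "no rating given"],
        "label": [1, 0, -1],
        "text_path": ["pos/0_9.txt", "neg/1_2.txt", "unsup/2_0.txt"],
    }
)

df_labeled, df_unlabeled = split_labeled(df_train)
```

The labeled frame can then be used for training a classifier, while the unlabeled frame remains available for tasks such as language-model pretraining.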

Format

| name | type | shape |
| --- | --- | --- |
| df_train | pd.DataFrame | (75_000, 3) |
| df_test | pd.DataFrame | (25_000, 3) |

The df_train shape above counts the 50_000 unlabeled samples; with the default arguments df_train contains only the 25_000 labeled samples.

Features

| column | type |
| --- | --- |
| text | str |
| label | int64 |
| text_path | str |
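
The integer label codes can be mapped to readable names when inspecting the data. This sketch assumes the codes used by the loader (1 for positive, 0 for negative, -1 for unlabeled). The LABEL_NAMES mapping, the sentiment column, and the sample rows are illustrative additions, not part of dataget.

```python
import pandas as pd

# Label codes as assigned by the loader: pos -> 1, neg -> 0, unsup -> -1.
LABEL_NAMES = {1: "positive", 0: "negative", -1: "unlabeled"}

# Stand-in frame with the three documented columns.
df = pd.DataFrame(
    {
        "text": ["great film", "dull plot", "no rating given"],
        "label": pd.Series([1, 0, -1], dtype="int64"),
        "text_path": ["pos/0_9.txt", "neg/1_2.txt", "unsup/2_0.txt"],
    }
)

# Add a human-readable column next to the integer label.
df["sentiment"] = df["label"].map(LABEL_NAMES)

# Count how many rows fall into each class.
counts = df["sentiment"].value_counts().to_dict()
```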

Info

  • Folder name: text_imdb_reviews
  • Size on disk: 490MB

API Reference

imdb_reviews

load(self, include_unlabeled=False)

Source code in imdb_reviews.py:
    def load(self, include_unlabeled=False):
        """
        Arguments:
            include_unlabeled: whether or not to include the unlabeled samples.
        """
        train_path = self.path / "aclImdb" / "train"
        test_path = self.path / "aclImdb" / "test"

        # train
        df_train = [
            self.load_df(train_path / "pos", label=1),
            self.load_df(train_path / "neg", label=0),
        ]

        if include_unlabeled:
            df_train.append(self.load_df(train_path / "unsup", label=-1))

        df_train = pd.concat(df_train, axis=0)

        # test
        df_test = pd.concat(
            [
                self.load_df(test_path / "pos", label=1),
                self.load_df(test_path / "neg", label=0),
            ],
            axis=0,
        )

        return df_train, df_test
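
The load method relies on self.load_df, whose source is not shown here. As a rough illustration of the pattern, the sketch below assumes that such a helper reads every .txt file in a folder into one row with the text, label, and text_path columns. This load_df is a hypothetical stand-in written for this example, and the temporary directory only mimics the aclImdb/train layout.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_df(folder: Path, label: int) -> pd.DataFrame:
    """Hypothetical stand-in for self.load_df: one row per .txt file in folder."""
    rows = [
        {"text": path.read_text(encoding="utf-8"), "label": label, "text_path": str(path)}
        for path in sorted(folder.glob("*.txt"))
    ]
    return pd.DataFrame(rows, columns=["text", "label", "text_path"])


# Build a tiny directory tree that mimics aclImdb/train for demonstration.
root = Path(tempfile.mkdtemp())
for name, text in [("pos", "loved it"), ("neg", "hated it")]:
    (root / name).mkdir()
    (root / name / "0_1.txt").write_text(text, encoding="utf-8")

# Same concatenation pattern as load(), with a fresh index on the result.
df_train = pd.concat(
    [load_df(root / "pos", label=1), load_df(root / "neg", label=0)],
    axis=0,
    ignore_index=True,
)
```

Unlike the method above, this sketch passes ignore_index=True so that the combined frame has unique row positions. Whether the real loader's output has repeated index values depends on how load_df builds its index.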

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| include_unlabeled | bool | Whether to include the unlabeled samples. | False |