

dataget.text.imdb_reviews¶

Downloads the IMDB Reviews dataset and loads it as pandas DataFrames.

import dataget

df_train, df_test = dataget.text.imdb_reviews().get()

This dataset also contains unlabeled samples. To load them, set the include_unlabeled argument:

import dataget

df_train, df_test = dataget.text.imdb_reviews().get(include_unlabeled=True)

All unlabeled samples have a label of -1.
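
Because the samples without a sentiment label share a single -1 value, they can be separated from the labeled reviews with an ordinary boolean mask. The sketch below uses a small hand-made DataFrame as a stand-in for `df_train`, so nothing is downloaded; the rows and file names are invented for illustration and only the column names follow this page.

```python
import pandas as pd

# Toy stand-in for df_train loaded together with the -1 samples.
# The rows are invented; only the column names come from the documentation.
df_train = pd.DataFrame(
    {
        "text": ["Great film.", "Dull plot.", "No rating given."],
        "label": [1, 0, -1],
        "text_path": ["pos/0_9.txt", "neg/1_2.txt", "unsup/2_0.txt"],
    }
)

# Split on the -1 sentinel label.
is_unlabeled = df_train["label"] == -1
df_labeled = df_train[~is_unlabeled].reset_index(drop=True)
df_unlabeled = df_train[is_unlabeled].reset_index(drop=True)

print(len(df_labeled), len(df_unlabeled))
```

Filtering before training a supervised classifier avoids treating -1 as a third sentiment class.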

Format¶

	type	shape
df_train	pd.DataFrame	`(75_000, 3)`
df_test	pd.DataFrame	`(25_000, 3)`

The df_train shape shown corresponds to loading with the -1 samples included (25,000 labeled plus 50,000 unlabeled reviews). With the default arguments, df_train should hold only the 25,000 labeled reviews.

Features¶

column	type
text	str
label	int64
text_path	str
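
A quick way to confirm that a loaded frame matches the table above is to compare its columns and the dtype of `label`. The helper name `check_schema` and the sample rows below are hypothetical and exist only for this sketch; they are not part of dataget.

```python
import pandas as pd

# Column names taken from the Features table.
EXPECTED_COLUMNS = {"text", "label", "text_path"}


def check_schema(df):
    """Return True if df has the documented columns and an int64 label column."""
    if set(df.columns) != EXPECTED_COLUMNS:
        return False
    return bool(df["label"].dtype == "int64")


# Invented rows that follow the documented schema.
sample = pd.DataFrame(
    {
        "text": ["Great film.", "Dull plot."],
        "label": pd.Series([1, 0], dtype="int64"),
        "text_path": ["pos/0_9.txt", "neg/1_2.txt"],
    }
)

print(check_schema(sample))
```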

Info¶

Folder name: text_imdb_reviews
Size on disk: 490MB

API Reference¶

`imdb_reviews`¶

`load(self, include_unlabeled=False)`¶

Source code in imdb_reviews.py

    def load(self, include_unlabeled=False):
        """
        Arguments:
            include_unlabeled: whether or not to include the unlabeled samples.
        """
        train_path = self.path / "aclImdb" / "train"
        test_path = self.path / "aclImdb" / "test"

        # train
        df_train = [
            self.load_df(train_path / "pos", label=1),
            self.load_df(train_path / "neg", label=0),
        ]

        if include_unlabeled:
            df_train.append(self.load_df(train_path / "unsup", label=-1))

        df_train = pd.concat(df_train, axis=0)

        # test
        df_test = pd.concat(
            [
                self.load_df(test_path / "pos", label=1),
                self.load_df(test_path / "neg", label=0),
            ],
            axis=0,
        )

        return df_train, df_test

Parameters

Name	Type	Description	Default
`include_unlabeled`	`bool`	Whether to include the unlabeled samples, which are loaded with a label of -1.	`False`
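
The source above relies on a `load_df` helper that is not shown on this page. The sketch below is one plausible stand-in, assuming each review is stored as a separate `.txt` file in the given folder; it is not dataget's actual implementation, and the temporary directory and file names are made up so the example runs without downloading anything.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_df(folder, label):
    """Hypothetical stand-in for load_df; not dataget's actual code.

    Reads every .txt file in `folder` into one row with the given label.
    """
    rows = [
        {"text": path.read_text(encoding="utf-8"), "label": label, "text_path": str(path)}
        for path in sorted(Path(folder).glob("*.txt"))
    ]
    return pd.DataFrame(rows, columns=["text", "label", "text_path"])


# Build a tiny directory tree that mimics aclImdb/train/{pos,neg}.
with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "aclImdb" / "train"
    for name, review in [("pos", "Loved it."), ("neg", "Hated it.")]:
        (train_path / name).mkdir(parents=True)
        (train_path / name / "0_1.txt").write_text(review, encoding="utf-8")

    # Same concatenation pattern as in load() above.
    df_train = pd.concat(
        [load_df(train_path / "pos", label=1), load_df(train_path / "neg", label=0)],
        axis=0,
    )

print(df_train[["text", "label"]])
```

Note that `pd.concat` with `axis=0` keeps each frame's original index, so the combined frame in this sketch has repeated index values; calling `reset_index(drop=True)` afterwards gives a unique index if one is needed.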