Package pandas_profiling

Main module of pandas-profiling.

Pandas Profiling

Documentation | Slack | Stack Overflow

Generates profile reports from a pandas DataFrame.

The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

  • Type inference: detect the types of columns in a dataframe.
  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
  • File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information.
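
To make a few of the listed statistics concrete, here is a minimal sketch that computes some of them for a single numeric column using only Python's standard library. It illustrates the definitions and is not the code pandas-profiling runs; the helper name `summarize` and the sample values are made up, and quantile conventions differ between libraries, so quartiles may not match the report exactly.

```python
import statistics


def summarize(values):
    """Compute a handful of the statistics listed above for a numeric column."""
    ordered = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    mean = statistics.mean(ordered)
    std = statistics.stdev(ordered)  # sample standard deviation
    return {
        "minimum": ordered[0],
        "Q1": q1,
        "median": median,
        "Q3": q3,
        "maximum": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "interquartile range": q3 - q1,
        "mean": mean,
        "standard deviation": std,
        "coefficient of variation": std / mean,
        # Median absolute deviation: median distance from the median.
        "median absolute deviation": statistics.median(
            abs(v - median) for v in ordered
        ),
        "sum": sum(ordered),
    }


stats = summarize([1, 2, 3, 4, 5, 6, 7, 8, 9])
for name, value in stats.items():
    print(f"{name}: {value}")
```

The report adds type-dependent statistics and plots on top of numbers like these; the sketch only covers the arithmetic.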

Announcements

Version v3.0.0 has been released. In this release the report configuration was completely overhauled, providing a more intuitive API and fixing issues inherent to the previous global config.

This is the first release to adhere to the Semver and Conventional Commits specifications.

Spark backend in progress: We are happy to announce that we are nearing v1 for the Spark backend for generating profile reports. Beta testers wanted! The Spark backend will be released as a pre-release for this package.

Support pandas-profiling

The development of pandas-profiling relies completely on contributions. If you find value in the package, we welcome you to support the project directly through GitHub Sponsors! Please help me to continue to support this package. It's extra exciting that GitHub matches your contribution for the first year.

Find more information here:

May 9, 2021 💘


Contents: Examples | Installation | Documentation | Large datasets | Command line usage | Advanced usage | Integrations | Support | Types | How to contribute | Editor Integration | Dependencies


Examples

The following examples can give you an impression of what the package can do:

  • Census Income (US Adult Census data relating income)
  • NASA Meteorites (comprehensive set of meteorite landings)
  • Titanic (the "Wonderwall" of datasets)
  • NZA (open data from the Dutch Healthcare Authority)
  • Stata Auto (1978 Automobile data)
  • Vektis (Vektis Dutch Healthcare data)
  • Colors (a simple colors dataset)
  • UCI Bank Dataset (banking marketing dataset)
  • RDW (the Dutch DMV's vehicle registration data: 10 million rows, 71 features)

Specific features:

Tutorials:

Installation

Using pip

You can install using the pip package manager by running

pip install pandas-profiling[notebook]

Alternatively, you could install the latest version directly from GitHub:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using conda

You can install using the conda package manager by running

conda install -c conda-forge pandas-profiling

From source

Download the source code by cloning the repository or by pressing 'Download ZIP' on this page.

Install by navigating to the proper directory and running:

python setup.py install

Documentation

The documentation for pandas_profiling can be found here. Previous documentation is still available here.

Getting started

Start by loading in your pandas DataFrame, e.g. by using:

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])

To generate the report, run:

profile = ProfileReport(df, title="Pandas Profiling Report")

Explore deeper

You can configure the profile report in any way you like. The example code below loads the explorative configuration file, which includes many features for text (length distribution, unicode information), files (file size, creation time) and images (dimensions, EXIF information). If you are interested in the exact settings that were used, you can compare with the default configuration file.

profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

Learn more about configuring pandas-profiling on the Advanced usage page.

Jupyter Notebook

We recommend generating reports interactively by using the Jupyter notebook. There are two interfaces: through widgets and through an HTML report.

The widgets interface is achieved by simply displaying the report. In the Jupyter Notebook, run:

profile.to_widgets()

The HTML report can be included in a Jupyter notebook. Run the following code:

profile.to_notebook_iframe()

Saving the report

If you want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:

profile.to_file("your_report.html")

Alternatively, you can obtain the data as JSON:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")
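
The file extension passed to to_file() decides the output format. The sketch below mirrors that decision in isolation, based on the to_file() source shown later on this page. `choose_format` is a hypothetical helper written for illustration and is not part of pandas-profiling.

```python
import warnings
from pathlib import Path


def choose_format(output_file):
    """Return (format, path) the way to_file() resolves an output name.

    Hypothetical helper: ".json" selects JSON, anything else is treated
    as HTML, and unknown extensions are replaced by ".html" with a warning.
    """
    path = Path(str(output_file))
    if path.suffix == ".json":
        return "json", path
    if path.suffix != ".html":
        warnings.warn(
            f"Extension {path.suffix} not supported, assuming .html was intended."
        )
        path = path.with_suffix(".html")
    return "html", path


print(choose_format("your_report.json"))
print(choose_format("your_report.html"))
```

In practice this means a name such as "your_report.txt" would end up as "your_report.html", so it is simplest to always pass .html or .json explicitly.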

Large datasets

Version 2.4 introduces minimal mode.

This is a default configuration that disables expensive computations (such as correlations and duplicate row detection).

Use the following syntax:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")

Benchmarks are available here.

Command line usage

For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable.

Run the following for information about options and arguments.

pandas_profiling -h

Advanced usage

A set of options is available in order to adapt the report generated.

  • title (str): Title for the report ('Pandas Profiling Report' by default).
  • pool_size (int): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
  • progress_bar (bool): If True, pandas-profiling will display a progress bar.
  • infer_dtypes (bool): When True (default), the dtypes of variables are inferred by visions using the typeset logic (for instance, a column that has integers stored as strings will be analyzed as if it were numeric).

More settings can be found in the default configuration file and minimal configuration file.

You can find the configuration docs on the advanced usage page.

Example

profile = df.profile_report(
    title="Pandas Profiling Report", plot={"histogram": {"bins": 8}}
)
profile.to_file("output.html")
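
Keyword arguments such as plot={"histogram": {"bins": 8}} address nested settings. As a rough mental model only (the library applies overrides through its own Settings object, which is not reproduced here), a nested override behaves like a recursive dictionary merge. The `deep_update` helper and the default values below are invented for illustration.

```python
import copy


def deep_update(defaults, overrides):
    """Return a copy of `defaults` with nested `overrides` applied."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections instead of replacing them wholesale.
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = value
    return merged


# Made-up defaults; the real ones live in the default configuration file.
defaults = {
    "title": "Pandas Profiling Report",
    "plot": {"histogram": {"bins": 50, "other_option": True}},
}
settings = deep_update(defaults, {"plot": {"histogram": {"bins": 8}}})
print(settings["plot"]["histogram"])
```

The point of the sketch is that overriding one nested key leaves its sibling settings untouched.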

Integrations

Great Expectations

Profiling your data is closely related to data validation: often validation rules are defined in terms of well-known statistics. For that purpose, `pandas-profiling` integrates with [Great Expectations](https://www.greatexpectations.io). This is a world-class open-source library that helps you to maintain data quality and improve communication about data between teams. Great Expectations allows you to create Expectations (which are basically unit tests for your data) and Data Docs (conveniently shareable HTML data reports). `pandas-profiling` features a method to create a suite of Expectations based on the results of your ProfileReport, which you can store, and use to validate another (or future) dataset. You can find more details on the Great Expectations integration [here](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/great_expectations_integration.html).

Supporting open source

Maintaining and developing the open-source code for pandas-profiling, with millions of downloads and thousands of users, would not be possible without support of our gracious sponsors.

Lambda Labs [Lambda workstations](https://lambdalabs.com/), servers, laptops, and cloud services power engineers and researchers at Fortune 500 companies and 94% of the top 50 universities. [Lambda Cloud](https://lambdalabs.com/service/gpu-cloud) offers 4 & 8 GPU instances starting at $1.50 / hr. Pre-installed with TensorFlow, PyTorch, Ubuntu, CUDA, and cuDNN.

We would like to thank our generous GitHub Sponsors supporters who make pandas-profiling possible:

Martin Sotir, Brian Lee, Stephanie Rivera, abdulAziz, gramster

More info if you would like to appear here: GitHub Sponsor page

Types

Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float etc.). pandas-profiling currently recognizes the following types: Boolean, Numerical, Date, Categorical, URL, Path, File and Image.

We have developed a type system for Python, tailored for data analysis: visions. Choosing an appropriate typeset can both improve the overall expressiveness and reduce the complexity of your analysis/code. To learn more about pandas-profiling's type system, check out the default implementation here. In the meantime, user customized summarizations and type definitions are now fully supported - if you have a specific use-case please reach out with ideas or a PR!

Contributing

Read about getting involved in the Contribution Guide.

A low-threshold way to ask questions or start contributing is to reach out on the pandas-profiling Slack. Join the Slack community.

Editor integration

PyCharm integration

  1. Install pandas-profiling via the instructions above
  2. Locate your pandas-profiling executable.
    • On macOS / Linux / BSD:

      $ which pandas_profiling
      (example) /usr/local/bin/pandas_profiling

    • On Windows:

      $ where pandas_profiling
      (example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe
  3. In PyCharm, go to Settings (or Preferences on macOS) > Tools > External tools
  4. Click the + icon to add a new external tool
  5. Insert the following values
    • Name: Pandas Profiling
    • Program: The location obtained in step 2
    • Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"
    • Working Directory: $ProjectFileDir$

To use the PyCharm integration, right-click on any dataset file and select External Tools > Pandas Profiling.
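
Outside PyCharm, the same external-tool setup can be approximated in a short script. The sketch below locates the executable (step 2) and builds the two arguments from step 5. `report_path_for` and `build_command` are hypothetical helpers, and the command is only constructed here, not executed.

```python
import shutil
from pathlib import Path


def report_path_for(dataset):
    """Mimic "$FileDir$/$FileNameWithoutAllExtensions$_report.html"."""
    dataset = Path(dataset)
    # Drop every extension, e.g. "data.csv.gz" -> "data".
    stem = dataset.name.split(".")[0]
    return dataset.parent / f"{stem}_report.html"


def build_command(dataset, executable="pandas_profiling"):
    """Return the argument list an external tool entry would run."""
    # shutil.which returns None when the executable is not on PATH,
    # in which case the bare name is kept as a fallback.
    program = shutil.which(executable) or executable
    return [program, str(dataset), str(report_path_for(dataset))]


print(build_command("datasets/titanic.csv"))
```

The resulting list could be handed to subprocess.run once the executable is known to be installed.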

Other integrations

Other editor integrations may be contributed via pull requests.

Dependencies

The profile report is written in HTML and CSS, which means pandas-profiling requires a modern browser.

You need Python 3 to run this package. Other dependencies can be found in the requirements files:

Filename               Requirements
requirements.txt       Package requirements
requirements-dev.txt   Requirements for development
requirements-test.txt  Requirements for testing
setup.py               Requirements for Widgets etc.

"""Main module of pandas-profiling.

.. include:: ../../README.md
"""

from pandas_profiling.controller import pandas_decorator
from pandas_profiling.profile_report import ProfileReport
from pandas_profiling.version import __version__

__all__ = [
    "pandas_decorator",
    "ProfileReport",
    "__version__",
]

Sub-modules

pandas_profiling.config

Configuration for the package.

pandas_profiling.controller

The controller module handles all user interaction with the package (console, jupyter, etc.).

pandas_profiling.expectations_report
pandas_profiling.model

The model module handles all logic/calculations, e.g. calculate statistics, testing for special conditions.

pandas_profiling.profile_report
pandas_profiling.report

All functionality concerned with presentation to the user.

pandas_profiling.serialize_report
pandas_profiling.utils

Utility functions for the complete package.

pandas_profiling.version

This file is auto-generated by setup.py, please do not alter.

pandas_profiling.visualisation

Code for generating plots

Classes

class ProfileReport (df: Union[pandas.core.frame.DataFrame, NoneType] = None, minimal: bool = False, explorative: bool = False, sensitive: bool = False, dark_mode: bool = False, orange_mode: bool = False, sample: Union[dict, NoneType] = None, config_file: Union[pathlib.Path, str] = None, lazy: bool = True, typeset: Union[visions.typesets.typeset.VisionsTypeset, NoneType] = None, summarizer: Union[BaseSummarizer, NoneType] = None, config: Union[Settings, NoneType] = None, **kwargs)

Generate a profile report from a Dataset stored as a pandas DataFrame.

Used as is, it will output its content as an HTML report in a Jupyter notebook.

Generate a ProfileReport based on a pandas DataFrame

Args

  • df: the pandas DataFrame
  • minimal: minimal mode is a default configuration with minimal computation
  • config_file: a config file (.yml), mutually exclusive with minimal
  • lazy: compute when needed
  • sample: optional dict(name="Sample title", caption="Caption", data=pd.DataFrame())
  • typeset: optional user typeset to use for type inference
  • summarizer: optional user summarizer to generate custom summary output
  • **kwargs: other arguments, for valid arguments, check the default configuration file.
class ProfileReport(SerializeReport, ExpectationsReport):
    """Generate a profile report from a Dataset stored as a pandas `DataFrame`.

    Used as is, it will output its content as an HTML report in a Jupyter notebook.
    """

    _description_set = None
    _report = None
    _html = None
    _widgets = None
    _json = None
    config: Settings

    def __init__(
        self,
        df: Optional[pd.DataFrame] = None,
        minimal: bool = False,
        explorative: bool = False,
        sensitive: bool = False,
        dark_mode: bool = False,
        orange_mode: bool = False,
        sample: Optional[dict] = None,
        config_file: Union[Path, str] = None,
        lazy: bool = True,
        typeset: Optional[VisionsTypeset] = None,
        summarizer: Optional[BaseSummarizer] = None,
        config: Optional[Settings] = None,
        **kwargs,
    ):
        """Generate a ProfileReport based on a pandas DataFrame

        Args:
            df: the pandas DataFrame
            minimal: minimal mode is a default configuration with minimal computation
            config_file: a config file (.yml), mutually exclusive with `minimal`
            lazy: compute when needed
            sample: optional dict(name="Sample title", caption="Caption", data=pd.DataFrame())
            typeset: optional user typeset to use for type inference
            summarizer: optional user summarizer to generate custom summary output
            **kwargs: other arguments, for valid arguments, check the default configuration file.
        """

        if df is None and not lazy:
            raise ValueError("Cannot init a non-lazy ProfileReport without a DataFrame")

        report_config: Settings = Settings() if config is None else config

        if config_file is not None and minimal:
            raise ValueError(
                "Arguments `config_file` and `minimal` are mutually exclusive."
            )

        if config_file or minimal:
            if not config_file:
                config_file = get_config("config_minimal.yaml")

            with open(config_file) as f:
                data = yaml.safe_load(f)

            report_config = report_config.parse_obj(data)

        if explorative:
            report_config = report_config.update(Config.get_arg_groups("explorative"))
        if sensitive:
            report_config = report_config.update(Config.get_arg_groups("sensitive"))
        if dark_mode:
            report_config = report_config.update(Config.get_arg_groups("dark_mode"))
        if orange_mode:
            report_config = report_config.update(Config.get_arg_groups("orange_mode"))
        if len(kwargs) > 0:
            report_config = report_config.update(Config.shorthands(kwargs))

        self.df = None
        self.config = report_config
        self._df_hash = None
        self._sample = sample
        self._typeset = typeset
        self._summarizer = summarizer

        if df is not None:
            # preprocess df
            self.df = self.preprocess(df)

        if not lazy:
            # Trigger building the report structure
            _ = self.report

    def invalidate_cache(self, subset: Optional[str] = None) -> None:
        """Invalidate report cache. Useful after changing a setting.

        Args:
            subset:
            - "rendering" to invalidate the html, json and widget report rendering
            - "report" to remove the caching of the report structure
            - None (default) to invalidate all caches

        Returns:
            None
        """
        if subset is not None and subset not in ["rendering", "report"]:
            raise ValueError(
                "'subset' parameter should be None, 'rendering' or 'report'"
            )

        if subset is None or subset in ["rendering", "report"]:
            self._widgets = None
            self._json = None
            self._html = None

        if subset is None or subset == "report":
            self._report = None

        if subset is None:
            self._description_set = None

    @property
    def typeset(self) -> Optional[VisionsTypeset]:
        if self._typeset is None:
            self._typeset = ProfilingTypeSet(self.config)
        return self._typeset

    @property
    def summarizer(self) -> BaseSummarizer:
        if self._summarizer is None:
            self._summarizer = PandasProfilingSummarizer(self.typeset)
        return self._summarizer

    @property
    def description_set(self) -> Dict[str, Any]:
        if self._description_set is None:
            self._description_set = describe_df(
                self.config,
                self.df,
                self.summarizer,
                self.typeset,
                self._sample,
            )
        return self._description_set

    @property
    def df_hash(self) -> Optional[str]:
        if self._df_hash is None and self.df is not None:
            self._df_hash = hash_dataframe(self.df)
        return self._df_hash

    @property
    def report(self) -> Root:
        if self._report is None:
            self._report = get_report_structure(self.config, self.description_set)
        return self._report

    @property
    def html(self) -> str:
        if self._html is None:
            self._html = self._render_html()
        return self._html

    @property
    def json(self) -> str:
        if self._json is None:
            self._json = self._render_json()
        return self._json

    @property
    def widgets(self) -> Renderable:
        if self._widgets is None:
            self._widgets = self._render_widgets()
        return self._widgets

    def get_duplicates(self) -> Optional[pd.DataFrame]:
        """Get duplicate rows and counts based on the configuration

        Returns:
            A DataFrame with the duplicate rows and their counts.
        """
        return self.description_set["duplicates"]

    def get_sample(self) -> dict:
        """Get head/tail samples based on the configuration

        Returns:
            A dict with the head and tail samples.
        """
        return self.description_set["sample"]

    def get_description(self) -> dict:
        """Return the description (a raw statistical summary) of the dataset.

        Returns:
            Dict containing a description for each variable in the DataFrame.
        """
        return self.description_set

    def get_rejected_variables(self) -> set:
        """Get variables that are rejected for analysis (e.g. constant, mixed data types)

        Returns:
            a set of column names that are unsupported
        """
        return {
            message.column_name
            for message in self.description_set["messages"]
            if message.message_type == MessageType.REJECTED
        }

    def to_file(self, output_file: Union[str, Path], silent: bool = True) -> None:
        """Write the report to a file.

        The output format is derived from the file extension.

        Args:
            output_file: The name or the path of the file to generate including the extension (.html, .json).
            silent: if False, opens the file in the default browser or downloads it in a Google Colab environment
        """
        if not isinstance(output_file, Path):
            output_file = Path(str(output_file))

        if output_file.suffix == ".json":
            data = self.to_json()
        else:
            if not self.config.html.inline:
                self.config.html.assets_path = str(output_file.parent)
                if self.config.html.assets_prefix is None:
                    self.config.html.assets_prefix = str(output_file.stem) + "_assets"
                create_html_assets(self.config, output_file)

            data = self.to_html()

            if output_file.suffix != ".html":
                suffix = output_file.suffix
                output_file = output_file.with_suffix(".html")
                warnings.warn(
                    f"Extension {suffix} not supported. For now we assume .html was intended. "
                    f"To remove this warning, please use .html or .json."
                )

        disable_progress_bar = not self.config.progress_bar
        with tqdm(
            total=1, desc="Export report to file", disable=disable_progress_bar
        ) as pbar:
            output_file.write_text(data, encoding="utf-8")
            pbar.update()

        if not silent:
            try:
                from google.colab import files  # noqa: F401

                files.download(output_file.absolute().as_uri())
            except ModuleNotFoundError:
                import webbrowser

                webbrowser.open_new_tab(output_file.absolute().as_uri())

    def _render_html(self) -> str:
        from pandas_profiling.report.presentation.flavours import HTMLReport

        report = self.report

        with tqdm(
            total=1, desc="Render HTML", disable=not self.config.progress_bar
        ) as pbar:
            html = HTMLReport(copy.deepcopy(report)).render(
                nav=self.config.html.navbar_show,
                offline=self.config.html.use_local_assets,
                inline=self.config.html.inline,
                assets_prefix=self.config.html.assets_prefix,
                primary_color=self.config.html.style.primary_color,
                logo=self.config.html.style.logo,
                theme=self.config.html.style.theme,
                title=self.description_set["analysis"]["title"],
                date=self.description_set["analysis"]["date_start"],
                version=self.description_set["package"]["pandas_profiling_version"],
            )

            if self.config.html.minify_html:
                from htmlmin.main import minify

                html = minify(html, remove_all_empty_space=True, remove_comments=True)
            pbar.update()
        return html

    def _render_widgets(self) -> Renderable:
        from pandas_profiling.report.presentation.flavours import WidgetReport

        report = self.report

        with tqdm(
            total=1,
            desc="Render widgets",
            disable=not self.config.progress_bar,
            leave=False,
        ) as pbar:
            widgets = WidgetReport(copy.deepcopy(report)).render()
            pbar.update()
        return widgets

    def _render_json(self) -> str:
        def encode_it(o: Any) -> Any:
            if isinstance(o, dict):
                return {encode_it(k): encode_it(v) for k, v in o.items()}
            else:
                if isinstance(o, (bool, int, float, str)):
                    return o
                elif isinstance(o, list):
                    return [encode_it(v) for v in o]
                elif isinstance(o, set):
                    return {encode_it(v) for v in o}
                elif isinstance(o, (pd.DataFrame, pd.Series)):
                    return encode_it(o.to_dict(orient="records"))
                elif isinstance(o, np.ndarray):
                    return encode_it(o.tolist())
                elif isinstance(o, Sample):
                    return encode_it(o.dict())
                elif isinstance(o, np.generic):
                    return o.item()
                else:
                    return str(o)

        description = self.description_set

        with tqdm(
            total=1, desc="Render JSON", disable=not self.config.progress_bar
        ) as pbar:
            description = format_summary(description)
            description = encode_it(description)
            data = json.dumps(description, indent=4)
            pbar.update()
        return data

    def to_html(self) -> str:
        """Generate and return complete template as lengthy string
            for using with frameworks.

        Returns:
            Profiling report html including wrapper.

        """
        return self.html

    def to_json(self) -> str:
        """Represent the ProfileReport as a JSON string

        Returns:
            JSON string
        """

        return self.json

    def to_notebook_iframe(self) -> None:
        """Used to output the HTML representation to a Jupyter notebook.
        When config.notebook.iframe.attribute is "src", this function creates a temporary HTML file
        in `./tmp/profile_[hash].html` and displays an Iframe pointing to its contents.
        When config.notebook.iframe.attribute is "srcdoc", the same HTML is injected in the "srcdoc" attribute of
        the Iframe.

        Notes:
            This construction solves problems with conflicting stylesheets and navigation links.
        """
        from IPython.core.display import display

        from pandas_profiling.report.presentation.flavours.widget.notebook import (
            get_notebook_iframe,
        )

        # Ignore warning: https://github.com/ipython/ipython/pull/11350/files
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            display(get_notebook_iframe(self.config, self))

    def to_widgets(self) -> None:
        """The ipython notebook widgets user interface."""
        try:
            from google.colab import files  # noqa: F401

            warnings.warn(
                "Ipywidgets is not yet fully supported on Google Colab (https://github.com/googlecolab/colabtools/issues/60). "
                "As an alternative, you can use the HTML report. See the documentation for more information."
            )
        except ModuleNotFoundError:
            pass

        from IPython.core.display import display

        display(self.widgets)

    def _repr_html_(self) -> None:
        """The ipython notebook widgets user interface gets called by the jupyter notebook."""
        self.to_notebook_iframe()

    def __repr__(self) -> str:
        """Override so that Jupyter Notebook does not print the object."""
        return ""

    @staticmethod
    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        """Preprocess the dataframe

        - Appends the index to the dataframe when it contains information
        - Rename the "index" column to "df_index", if exists
        - Convert the DataFrame's columns to str

        Args:
            df: the pandas DataFrame

        Returns:
            The preprocessed DataFrame
        """
        # Treat index as any other column
        if (
            not pd.Index(np.arange(0, len(df))).equals(df.index)
            or df.index.dtype != np.int64
        ):
            df = df.reset_index()

        # Rename reserved column names
        df = rename_index(df)

        # Ensure that columns are strings
        df.columns = df.columns.astype("str")
        return df

Ancestors

Class variables

var config : Settings

Static methods

def preprocess(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame

Preprocess the dataframe

  • Appends the index to the dataframe when it contains information
  • Rename the "index" column to "df_index", if exists
  • Convert the DataFrame's columns to str

Args

  • df: the pandas DataFrame

Returns

The preprocessed DataFrame

@staticmethod
def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Preprocess the dataframe

    - Appends the index to the dataframe when it contains information
    - Rename the "index" column to "df_index", if exists
    - Convert the DataFrame's columns to str

    Args:
        df: the pandas DataFrame

    Returns:
        The preprocessed DataFrame
    """
    # Treat index as any other column
    if (
        not pd.Index(np.arange(0, len(df))).equals(df.index)
        or df.index.dtype != np.int64
    ):
        df = df.reset_index()

    # Rename reserved column names
    df = rename_index(df)

    # Ensure that columns are strings
    df.columns = df.columns.astype("str")
    return df

Instance variables

var description_set : Dict[str, Any]
@property
def description_set(self) -> Dict[str, Any]:
    if self._description_set is None:
        self._description_set = describe_df(
            self.config,
            self.df,
            self.summarizer,
            self.typeset,
            self._sample,
        )
    return self._description_set
var df_hash : Union[str, NoneType]
@property
def df_hash(self) -> Optional[str]:
    if self._df_hash is None and self.df is not None:
        self._df_hash = hash_dataframe(self.df)
    return self._df_hash
var html : str
@property
def html(self) -> str:
    if self._html is None:
        self._html = self._render_html()
    return self._html
var json : str
@property
def json(self) -> str:
    if self._json is None:
        self._json = self._render_json()
    return self._json
var report : Root
@property
def report(self) -> Root:
    if self._report is None:
        self._report = get_report_structure(self.config, self.description_set)
    return self._report
var summarizer : BaseSummarizer
@property
def summarizer(self) -> BaseSummarizer:
    if self._summarizer is None:
        self._summarizer = PandasProfilingSummarizer(self.typeset)
    return self._summarizer
var typeset : Union[visions.typesets.typeset.VisionsTypeset, NoneType]
@property
def typeset(self) -> Optional[VisionsTypeset]:
    if self._typeset is None:
        self._typeset = ProfilingTypeSet(self.config)
    return self._typeset
var widgets : Renderable
@property
def widgets(self) -> Renderable:
    if self._widgets is None:
        self._widgets = self._render_widgets()
    return self._widgets

Methods

def get_description(self) -> dict

Return the description (a raw statistical summary) of the dataset.

Returns

Dict containing a description for each variable in the DataFrame.

def get_description(self) -> dict:
    """Return the description (a raw statistical summary) of the dataset.

    Returns:
        Dict containing a description for each variable in the DataFrame.
    """
    return self.description_set
def get_duplicates(self) -> Union[pandas.core.frame.DataFrame, NoneType]

Get duplicate rows and counts based on the configuration

Returns

A DataFrame with the duplicate rows and their counts.

def get_duplicates(self) -> Optional[pd.DataFrame]:
    """Get duplicate rows and counts based on the configuration

    Returns:
        A DataFrame with the duplicate rows and their counts.
    """
    return self.description_set["duplicates"]
def get_rejected_variables(self) -> set

Get variables that are rejected for analysis (e.g. constant, mixed data types)

Returns

a set of column names that are unsupported
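
The filtering performed here can be illustrated without the library. In the sketch below, `MessageType` and `Message` are simplified stand-ins for the objects stored under the "messages" key of the description, and the sample messages are invented; only the set comprehension mirrors the source of this method.

```python
from dataclasses import dataclass
from enum import Enum


class MessageType(Enum):
    """Stand-in enum; the real one defines more message types."""

    CONSTANT = "constant"
    REJECTED = "rejected"


@dataclass
class Message:
    """Stand-in for a report message attached to one column."""

    column_name: str
    message_type: MessageType


def rejected_columns(messages):
    """Collect the names of columns that carry a REJECTED message."""
    return {
        message.column_name
        for message in messages
        if message.message_type == MessageType.REJECTED
    }


# Invented sample messages for illustration.
messages = [
    Message("id", MessageType.REJECTED),
    Message("country", MessageType.CONSTANT),
    Message("id", MessageType.CONSTANT),
]
rejected = rejected_columns(messages)
print(rejected)
```

Because the result is a set, a column is listed once even if several messages refer to it.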

def get_rejected_variables(self) -> set:
    """Get variables that are rejected for analysis (e.g. constant, mixed data types)

    Returns:
        a set of column names that are unsupported
    """
    return {
        message.column_name
        for message in self.description_set["messages"]
        if message.message_type == MessageType.REJECTED
    }
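
The filter used by get_rejected_variables can be tried in isolation. The sketch below is self-contained and does not import pandas-profiling; its `MessageType` enum and `Message` dataclass are simplified, hypothetical stand-ins for the library's own types, and `rejected_columns` is an illustrative helper name.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Set


class MessageType(Enum):
    """Hypothetical subset of message types, for illustration only."""

    CONSTANT = auto()
    REJECTED = auto()


@dataclass
class Message:
    """Hypothetical message record: which column, and what kind of message."""

    column_name: str
    message_type: MessageType


def rejected_columns(messages: List[Message]) -> Set[str]:
    # Keep only the column names whose message marks them as rejected.
    return {m.column_name for m in messages if m.message_type == MessageType.REJECTED}


messages = [
    Message("id", MessageType.REJECTED),
    Message("country", MessageType.CONSTANT),
    Message("notes", MessageType.REJECTED),
]
rejected = rejected_columns(messages)
```

Because the result is a set, a column that triggers several rejection messages is reported only once.
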
def get_sample(self) -> dict

Get head/tail samples based on the configuration

Returns

A dict with the head and tail samples.

def get_sample(self) -> dict:
    """Get head/tail samples based on the configuration

    Returns:
        A dict with the head and tail samples.
    """
    return self.description_set["sample"]
def invalidate_cache(self, subset: Union[str, NoneType] = None) -> NoneType

Invalidate the report cache. Useful after changing a setting.

Args

subset
One of:
- "rendering" to invalidate the html, json and widget report rendering
- "report" to remove the caching of the report structure
- None (default) to invalidate all caches

Returns

None

def invalidate_cache(self, subset: Optional[str] = None) -> None:
    """Invalidate the report cache. Useful after changing a setting.

    Args:
        subset:
        - "rendering" to invalidate the html, json and widget report rendering
        - "report" to remove the caching of the report structure
        - None (default) to invalidate all caches

    Returns:
        None
    """
    if subset is not None and subset not in ["rendering", "report"]:
        raise ValueError(
            "'subset' parameter should be None, 'rendering' or 'report'"
        )

    if subset is None or subset in ["rendering", "report"]:
        self._widgets = None
        self._json = None
        self._html = None

    if subset is None or subset == "report":
        self._report = None

    if subset is None:
        self._description_set = None
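
Reading the source above, the three caches are cleared in a cascade: "report" also resets the rendered outputs, and None resets everything including the description set. The sketch below restates that logic as a pure function so it can be checked without the library; `slots_to_clear` and the dictionary keys are hypothetical names, not part of pandas-profiling.

```python
from typing import Dict, Optional


def slots_to_clear(subset: Optional[str] = None) -> Dict[str, bool]:
    """Report which cache slots a given `subset` value would reset."""
    if subset is not None and subset not in ("rendering", "report"):
        raise ValueError("'subset' parameter should be None, 'rendering' or 'report'")

    return {
        # Rendered outputs are reset for every valid value of `subset`.
        "rendering": True,
        # The report structure is reset for "report" and for None.
        "report": subset is None or subset == "report",
        # The description set is reset only when everything is invalidated.
        "description_set": subset is None,
    }


only_rendering = slots_to_clear("rendering")
report_and_rendering = slots_to_clear("report")
everything = slots_to_clear()
```

The cascade mirrors the dependency order: rendered output depends on the report structure, which in turn depends on the description set.
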
def to_file(self, output_file: Union[str, pathlib.Path], silent: bool = True) -> NoneType

Write the report to a file.

By default a name is generated.

Args

output_file
The name or the path of the file to generate including the extension (.html, .json).
silent
If False, opens the file in the default browser, or downloads it in a Google Colab environment.
def to_file(self, output_file: Union[str, Path], silent: bool = True) -> None:
    """Write the report to a file.

    By default a name is generated.

    Args:
        output_file: The name or the path of the file to generate including the extension (.html, .json).
        silent: If False, opens the file in the default browser, or downloads it in a Google Colab environment.
    """
    if not isinstance(output_file, Path):
        output_file = Path(str(output_file))

    if output_file.suffix == ".json":
        data = self.to_json()
    else:
        if not self.config.html.inline:
            self.config.html.assets_path = str(output_file.parent)
            if self.config.html.assets_prefix is None:
                self.config.html.assets_prefix = str(output_file.stem) + "_assets"
            create_html_assets(self.config, output_file)

        data = self.to_html()

        if output_file.suffix != ".html":
            suffix = output_file.suffix
            output_file = output_file.with_suffix(".html")
            warnings.warn(
                f"Extension {suffix} not supported. For now we assume .html was intended. "
                f"To remove this warning, please use .html or .json."
            )

    disable_progress_bar = not self.config.progress_bar
    with tqdm(
        total=1, desc="Export report to file", disable=disable_progress_bar
    ) as pbar:
        output_file.write_text(data, encoding="utf-8")
        pbar.update()

    if not silent:
        try:
            from google.colab import files  # noqa: F401

            files.download(output_file.absolute().as_uri())
        except ModuleNotFoundError:
            import webbrowser

            webbrowser.open_new_tab(output_file.absolute().as_uri())
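
The extension handling in to_file can be summarised as: .json selects JSON output, .html selects HTML, and any other suffix is replaced by .html with a warning. The following sketch isolates just that decision using the standard library; `resolve_output` is a hypothetical helper written for illustration and it does not write any file.

```python
import warnings
from pathlib import Path
from typing import Tuple, Union


def resolve_output(output_file: Union[str, Path]) -> Tuple[Path, str]:
    """Return the path that would be written and the format ("json" or "html")."""
    path = Path(str(output_file))

    if path.suffix == ".json":
        return path, "json"

    if path.suffix != ".html":
        # Unsupported extension: warn and fall back to .html.
        warnings.warn(f"Extension {path.suffix} not supported; assuming .html.")
        path = path.with_suffix(".html")

    return path, "html"


# Record warnings so the fallback case can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    json_result = resolve_output("report.json")
    html_result = resolve_output("report.html")
    txt_result = resolve_output("report.txt")
```

Only the third call triggers a warning; its path comes back with the suffix changed to .html.
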
def to_html(self) -> str

Generate and return the complete template as a long string for use with frameworks.

Returns

Profiling report HTML, including the wrapper.

def to_html(self) -> str:
    """Generate and return the complete template as a long string
        for use with frameworks.

    Returns:
        Profiling report html including wrapper.

    """
    return self.html
def to_json(self) -> str

Represent the ProfileReport as a JSON string

Returns

JSON string

def to_json(self) -> str:
    """Represent the ProfileReport as a JSON string

    Returns:
        JSON string
    """

    return self.json
def to_notebook_iframe(self) -> NoneType

Output the HTML representation to a Jupyter notebook. When config.notebook.iframe.attribute is "src", this function creates a temporary HTML file in ./tmp/profile_[hash].html and displays an iframe pointing to that file. When config.notebook.iframe.attribute is "srcdoc", the same HTML is injected into the "srcdoc" attribute of the iframe.

Notes

This construction solves problems with conflicting stylesheets and navigation links.

def to_notebook_iframe(self) -> None:
    """Output the HTML representation to a Jupyter notebook.
    When config.notebook.iframe.attribute is "src", this function creates a temporary HTML file
    in `./tmp/profile_[hash].html` and displays an iframe pointing to that file.
    When config.notebook.iframe.attribute is "srcdoc", the same HTML is injected into the "srcdoc" attribute of
    the iframe.

    Notes:
        This construction solves problems with conflicting stylesheets and navigation links.
    """
    from IPython.core.display import display

    from pandas_profiling.report.presentation.flavours.widget.notebook import (
        get_notebook_iframe,
    )

    # Ignore warning: https://github.com/ipython/ipython/pull/11350/files
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        display(get_notebook_iframe(self.config, self))
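
The difference between the "src" and "srcdoc" modes can be illustrated with plain string building. This is a rough sketch of the two kinds of iframe markup, not the output of get_notebook_iframe; the helper names, the default sizes and the file name `tmp/profile_example.html` are all made up for the example.

```python
import html


def iframe_src(path: str, width: str = "100%", height: str = "800px") -> str:
    """Build an iframe that loads the report from a separate file."""
    return f'<iframe src="{html.escape(path)}" width="{width}" height="{height}"></iframe>'


def iframe_srcdoc(document: str, width: str = "100%", height: str = "800px") -> str:
    """Build an iframe that embeds the report markup in its srcdoc attribute."""
    # The markup must be escaped so that its quotes do not end the attribute.
    return f'<iframe srcdoc="{html.escape(document)}" width="{width}" height="{height}"></iframe>'


report_html = '<p class="title">Report</p>'
by_file = iframe_src("tmp/profile_example.html")  # hypothetical file name
inline = iframe_srcdoc(report_html)
```

In both cases the report lives in its own browsing context, which is what keeps its stylesheets and navigation links from interfering with the notebook page.
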
def to_widgets(self) -> NoneType

Display the report using the IPython notebook widgets user interface.

def to_widgets(self) -> None:
    """The ipython notebook widgets user interface."""
    try:
        from google.colab import files  # noqa: F401

        warnings.warn(
            "Ipywidgets is not yet fully supported on Google Colab (https://github.com/googlecolab/colabtools/issues/60). "
            "As an alternative, you can use the HTML report. See the documentation for more information."
        )
    except ModuleNotFoundError:
        pass

    from IPython.core.display import display

    display(self.widgets)

Inherited members