Module pandas_profiling

Main module of pandas-profiling.

Pandas Profiling

Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values
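To make a few of the statistics above concrete, here is a minimal sketch that computes them for one numeric column using only the Python standard library. The function name `summarize` and the sample values are hypothetical, and the package itself may compute these figures differently (for example with another quantile interpolation method).

```python
import statistics


def summarize(values):
    """Compute a few of the listed statistics for one numeric column."""
    # Missing entries are represented as None in this sketch.
    present = sorted(v for v in values if v is not None)
    # Quartiles with the "inclusive" method (an assumption of this sketch).
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    mean = statistics.fmean(present)
    std = statistics.stdev(present)  # sample standard deviation
    return {
        "unique values": len(set(present)),
        "missing values": len(values) - len(present),
        "minimum": present[0],
        "Q1": q1,
        "median": median,
        "Q3": q3,
        "maximum": present[-1],
        "range": present[-1] - present[0],
        "interquartile range": q3 - q1,
        "mean": mean,
        "standard deviation": std,
        "coefficient of variation": std / mean,
    }


summary = summarize([4, 1, None, 9, 2, 7, 3, 8, 5, 6])
for name, value in summary.items():
    print(f"{name}: {value}")
```

The report presents the same kind of figures per column, skipping those that are not relevant for the column type.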

Examples

The following examples can give you an impression of what the package can do:

Installation

Using pip

You can install using the pip package manager by running

pip install pandas-profiling

Alternatively, you can install directly from GitHub:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using conda

You can install using the conda package manager by running

conda install -c anaconda pandas-profiling

From source

Download the source code by cloning the repository or by pressing 'Download ZIP' on this page. Install by navigating to the proper directory and running

python setup.py install

Usage

The profile report is written in HTML5 and CSS3, which means pandas-profiling requires a modern browser.

Documentation

The documentation for pandas_profiling can be found here. The documentation is generated using pdoc3. If you are contributing to this project, you can rebuild the documentation using:

make docs

or on Windows:

make.bat docs

Jupyter Notebook

We recommend generating reports interactively by using the Jupyter notebook.

Start by loading in your pandas DataFrame, e.g. by using

import numpy as np
import pandas as pd
import pandas_profiling

df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=['a', 'b', 'c', 'd', 'e']
)

To display the report in a Jupyter notebook, run:

df.profile_report(style={'full_width':True})

To retrieve the list of variables which are rejected due to high correlation:

profile = df.profile_report()
rejected_variables = profile.get_rejected_variables(threshold=0.9)

If you want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:

profile = df.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="output.html")

Command line usage

For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable. Run

pandas_profiling -h

for information about options and arguments.

Advanced usage

A set of options is available in order to adapt the report generated.

  • title (str): Title for the report ('Pandas Profiling Report' by default).
  • pool_size (int): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
  • minify_html (boolean): Whether to minify the output HTML.
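The `pool_size` rule above (zero means "use the number of available CPUs") can be sketched as follows. The helper `resolve_pool_size` is hypothetical and only illustrates the documented behaviour; it is not taken from the package.

```python
import os
from multiprocessing.pool import ThreadPool


def resolve_pool_size(pool_size: int) -> int:
    """Map the documented 0 to the CPU count; keep positive values as-is."""
    if pool_size < 0:
        raise ValueError("pool_size must be zero or a positive integer")
    if pool_size == 0:
        # os.cpu_count() can return None, so fall back to a single worker.
        return os.cpu_count() or 1
    return pool_size


def square(x):
    return x * x


# Run a trivial task on a thread pool sized by the rule above.
with ThreadPool(resolve_pool_size(0)) as pool:
    squares = pool.map(square, range(5))

print(squares)
```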

More settings can be found in the default configuration file.

Example

profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram': {'bins': 8}})
profile.to_file(output_file="output.html")

How to contribute

The package is actively maintained and developed as open-source software. If pandas-profiling was helpful or interesting to you, you might want to get involved. There are several ways of contributing and helping our thousands of users. If you would like to be an industry partner or sponsor, please drop us a line.

Read more on getting involved in the Contribution Guide.

Dependencies

You need Python 3 to run this package. Other dependencies can be found in the requirements files:

Filename               Requirements
requirements.txt       Package requirements
requirements-dev.txt   Requirements for development
requirements-test.txt  Requirements for testing

Source code
"""Main module of pandas-profiling.

.. include:: ../README.md
"""
import sys
import warnings

import pandas as pd

from pandas_profiling.version import __version__
from pandas_profiling.utils.dataframe import clean_column_names, rename_index
from pandas_profiling.utils.paths import get_config_default, get_project_root

from pathlib import Path
import numpy as np

from pandas_profiling.config import config
from pandas_profiling.controller import pandas_decorator
import pandas_profiling.view.templates as templates
from pandas_profiling.model.describe import describe as describe_df
from pandas_profiling.view.notebook import display_notebook_iframe
from pandas_profiling.view.report import to_html


class ProfileReport(object):
    """Generate a profile report from a Dataset stored as a pandas `DataFrame`.
    
    Used as is, it will output its content as an HTML report in a Jupyter notebook.
    """

    html = ""
    """the HTML representation of the report, without the wrapper (containing `<head>` etc.)"""

    def __init__(self, df, **kwargs):
        config.set_kwargs(kwargs)

        # Treat index as any other column
        if (
            not pd.Index(np.arange(0, len(df))).equals(df.index)
            or df.index.dtype != np.int64
        ):
            df = df.reset_index()

        # Rename reserved column names
        df = rename_index(df)

        # Remove spaces and colons from column names
        df = clean_column_names(df)

        # Sort column names
        sort = config["sort"].get(str)
        if sys.version_info[1] <= 5 and sort != "None":
            warnings.warn("Sorting is supported from Python 3.6+")

        if sort in ["asc", "ascending"]:
            df = df.reindex(sorted(df.columns, key=lambda s: s.casefold()), axis=1)
        elif sort in ["desc", "descending"]:
            df = df.reindex(
                reversed(sorted(df.columns, key=lambda s: s.casefold())), axis=1
            )
        elif sort != "None":
            raise ValueError('"sort" should be "ascending", "descending" or None.')

        # Store column order
        config["column_order"] = df.columns.tolist()

        # Get dataset statistics
        description_set = describe_df(df)

        # Get sample
        sample = {}
        n_head = config["samples"]["head"].get(int)
        if n_head > 0:
            sample["head"] = df.head(n=n_head)

        n_tail = config["samples"]["tail"].get(int)
        if n_tail > 0:
            sample["tail"] = df.tail(n=n_tail)

        # Render HTML
        self.html = to_html(sample, description_set)
        self.minify_html = config["minify_html"].get(bool)
        self.use_local_assets = config["use_local_assets"].get(bool)
        self.title = config["title"].get(str)
        self.description_set = description_set
        self.sample = sample

    def get_description(self) -> dict:
        """Return the description (a raw statistical summary) of the dataset.
        
        Returns:
            Dict containing a description for each variable in the DataFrame.
        """
        return self.description_set

    def get_rejected_variables(self, threshold: float = 0.9) -> list:
        """Return the names of variables rejected for high correlation
        with one of the remaining variables.

        Args:
            threshold: variables whose correlation value is above this threshold are rejected (Default value = 0.9)

        Returns:
            A list of rejected variables.
        """
        variable_profile = self.description_set["variables"]
        result = []
        for col, values in variable_profile.items():
            if "correlation" in values:
                if values["correlation"] > threshold:
                    result.append(col)
        return result

    def to_file(self, output_file: Path or str) -> None:
        """Write the report to a file.
        
        By default a name is generated.

        Args:
            output_file: The name or the path of the file to generate including the extension (.html).        
        """
        if isinstance(output_file, str):
            output_file = Path(output_file)

        with output_file.open("w", encoding="utf8") as f:
            wrapped_html = self.to_html()
            if self.minify_html:
                from htmlmin.main import minify

                wrapped_html = minify(
                    wrapped_html, remove_all_empty_space=True, remove_comments=True
                )
            f.write(wrapped_html)

    def to_html(self) -> str:
        """Generate and return the complete report as one long HTML string,
        for use with frameworks.

        Returns:
            Profiling report html including wrapper.
        
        """
        return templates.template("wrapper.html").render(
            content=self.html,
            title=self.title,
            correlation=len(self.description_set["correlations"]) > 0,
            missing=len(self.description_set["missing"]) > 0,
            sample=len(self.sample) > 0,
            version=__version__,
            offline=self.use_local_assets,
            primary_color=config["style"]["primary_color"].get(str),
            theme=config["style"]["theme"].get(str),
        )

    def get_unique_file_name(self):
        """Generate a unique file name."""
        return (
            "profile_"
            + str(np.random.randint(1000000000, 9999999999, dtype=np.int64))
            + ".html"
        )

    def _repr_html_(self):
        """Used to output the HTML representation to a Jupyter notebook.
        When config.notebook.iframe.attribute is "src", this function creates a temporary HTML file
        in `./tmp/profile_[hash].html` and returns an Iframe pointing to that content.
        When config.notebook.iframe.attribute is "srcdoc", the same HTML is injected in the "srcdoc" attribute of
        the Iframe.

        Notes:
            This construction solves problems with conflicting stylesheets and navigation links.
        """
        display_notebook_iframe(self)

    def __repr__(self):
        """Override so that Jupyter Notebook does not print the object."""
        return ""

Sub-modules

pandas_profiling.config

Configuration for the package is handled in this wrapper for confuse.

pandas_profiling.controller

The controller module handles all user interaction with the package (console, jupyter, etc.).

pandas_profiling.model

The model module handles all logic/calculations, e.g. calculate statistics, testing for special conditions.

pandas_profiling.utils

Utility functions for the complete package.

pandas_profiling.version

This file is auto-generated by setup.py, please do not alter.

pandas_profiling.view

All functionality concerned with presentation to the user.

Classes

class ProfileReport (df, **kwargs)

Generate a profile report from a Dataset stored as a pandas DataFrame.

Used as is, it will output its content as an HTML report in a Jupyter notebook.
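Before computing statistics, the constructor shown in the source below orders the columns according to the `sort` option. The following is a minimal sketch of that ordering on a plain list of names instead of a DataFrame; the function name `order_columns` is hypothetical.

```python
def order_columns(columns, sort="None"):
    """Order column names case-insensitively, mirroring the `sort` option."""
    names = list(columns)
    if sort in ("asc", "ascending"):
        return sorted(names, key=str.casefold)
    if sort in ("desc", "descending"):
        # Reverse the ascending order, as the constructor does.
        return list(reversed(sorted(names, key=str.casefold)))
    if sort != "None":
        raise ValueError('"sort" should be "ascending", "descending" or None.')
    # The string "None" means: keep the original column order.
    return names


print(order_columns(["b", "A", "c"], sort="ascending"))
print(order_columns(["b", "A", "c"], sort="descending"))
```

Using `casefold` as the key means `"A"` sorts before `"b"`, which a plain string comparison would not guarantee for mixed-case names.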

Source code
class ProfileReport(object):
    """Generate a profile report from a Dataset stored as a pandas `DataFrame`.
    
    Used as is, it will output its content as an HTML report in a Jupyter notebook.
    """

    html = ""
    """the HTML representation of the report, without the wrapper (containing `<head>` etc.)"""

    def __init__(self, df, **kwargs):
        config.set_kwargs(kwargs)

        # Treat index as any other column
        if (
            not pd.Index(np.arange(0, len(df))).equals(df.index)
            or df.index.dtype != np.int64
        ):
            df = df.reset_index()

        # Rename reserved column names
        df = rename_index(df)

        # Remove spaces and colons from column names
        df = clean_column_names(df)

        # Sort column names
        sort = config["sort"].get(str)
        if sys.version_info[1] <= 5 and sort != "None":
            warnings.warn("Sorting is supported from Python 3.6+")

        if sort in ["asc", "ascending"]:
            df = df.reindex(sorted(df.columns, key=lambda s: s.casefold()), axis=1)
        elif sort in ["desc", "descending"]:
            df = df.reindex(
                reversed(sorted(df.columns, key=lambda s: s.casefold())), axis=1
            )
        elif sort != "None":
            raise ValueError('"sort" should be "ascending", "descending" or None.')

        # Store column order
        config["column_order"] = df.columns.tolist()

        # Get dataset statistics
        description_set = describe_df(df)

        # Get sample
        sample = {}
        n_head = config["samples"]["head"].get(int)
        if n_head > 0:
            sample["head"] = df.head(n=n_head)

        n_tail = config["samples"]["tail"].get(int)
        if n_tail > 0:
            sample["tail"] = df.tail(n=n_tail)

        # Render HTML
        self.html = to_html(sample, description_set)
        self.minify_html = config["minify_html"].get(bool)
        self.use_local_assets = config["use_local_assets"].get(bool)
        self.title = config["title"].get(str)
        self.description_set = description_set
        self.sample = sample

    def get_description(self) -> dict:
        """Return the description (a raw statistical summary) of the dataset.
        
        Returns:
            Dict containing a description for each variable in the DataFrame.
        """
        return self.description_set

    def get_rejected_variables(self, threshold: float = 0.9) -> list:
        """Return the names of variables rejected for high correlation
        with one of the remaining variables.

        Args:
            threshold: variables whose correlation value is above this threshold are rejected (Default value = 0.9)

        Returns:
            A list of rejected variables.
        """
        variable_profile = self.description_set["variables"]
        result = []
        for col, values in variable_profile.items():
            if "correlation" in values:
                if values["correlation"] > threshold:
                    result.append(col)
        return result

    def to_file(self, output_file: Path or str) -> None:
        """Write the report to a file.
        
        By default a name is generated.

        Args:
            output_file: The name or the path of the file to generate including the extension (.html).        
        """
        if isinstance(output_file, str):
            output_file = Path(output_file)

        with output_file.open("w", encoding="utf8") as f:
            wrapped_html = self.to_html()
            if self.minify_html:
                from htmlmin.main import minify

                wrapped_html = minify(
                    wrapped_html, remove_all_empty_space=True, remove_comments=True
                )
            f.write(wrapped_html)

    def to_html(self) -> str:
        """Generate and return the complete report as one long HTML string,
        for use with frameworks.

        Returns:
            Profiling report html including wrapper.
        
        """
        return templates.template("wrapper.html").render(
            content=self.html,
            title=self.title,
            correlation=len(self.description_set["correlations"]) > 0,
            missing=len(self.description_set["missing"]) > 0,
            sample=len(self.sample) > 0,
            version=__version__,
            offline=self.use_local_assets,
            primary_color=config["style"]["primary_color"].get(str),
            theme=config["style"]["theme"].get(str),
        )

    def get_unique_file_name(self):
        """Generate a unique file name."""
        return (
            "profile_"
            + str(np.random.randint(1000000000, 9999999999, dtype=np.int64))
            + ".html"
        )

    def _repr_html_(self):
        """Used to output the HTML representation to a Jupyter notebook.
        When config.notebook.iframe.attribute is "src", this function creates a temporary HTML file
        in `./tmp/profile_[hash].html` and returns an Iframe pointing to that content.
        When config.notebook.iframe.attribute is "srcdoc", the same HTML is injected in the "srcdoc" attribute of
        the Iframe.

        Notes:
            This construction solves problems with conflicting stylesheets and navigation links.
        """
        display_notebook_iframe(self)

    def __repr__(self):
        """Override so that Jupyter Notebook does not print the object."""
        return ""

Class variables

var html

the HTML representation of the report, without the wrapper (containing <head> etc.)

Methods

def get_description(self)

Return the description (a raw statistical summary) of the dataset.

Returns

Dict containing a description for each variable in the DataFrame.

Source code
def get_description(self) -> dict:
    """Return the description (a raw statistical summary) of the dataset.
    
    Returns:
        Dict containing a description for each variable in the DataFrame.
    """
    return self.description_set

def get_rejected_variables(self, threshold=0.9)

Return the names of variables rejected for high correlation with one of the remaining variables.

Args

threshold
variables whose correlation value is above this threshold are rejected (Default value = 0.9)

Returns

A list of rejected variables.
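The selection rule can be sketched on a plain dictionary, without building a report. The function name `rejected_variables` and the sample data are hypothetical; the only structure assumed is the one the method reads, namely an optional `"correlation"` entry per variable.

```python
def rejected_variables(variables, threshold=0.9):
    """Return names whose "correlation" entry is strictly above the threshold."""
    return [
        name
        for name, values in variables.items()
        if "correlation" in values and values["correlation"] > threshold
    ]


# Hypothetical per-variable descriptions.
variables = {
    "a": {},                     # no correlation entry: never rejected
    "b": {"correlation": 0.97},  # above the threshold: rejected
    "c": {"correlation": 0.90},  # equal to the threshold: kept
}

print(rejected_variables(variables))
print(rejected_variables(variables, threshold=0.5))
```

Because the comparison is strict, a variable whose correlation equals the threshold is not rejected.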

Source code
def get_rejected_variables(self, threshold: float = 0.9) -> list:
    """Return the names of variables rejected for high correlation
    with one of the remaining variables.

    Args:
        threshold: variables whose correlation value is above this threshold are rejected (Default value = 0.9)

    Returns:
        A list of rejected variables.
    """
    variable_profile = self.description_set["variables"]
    result = []
    for col, values in variable_profile.items():
        if "correlation" in values:
            if values["correlation"] > threshold:
                result.append(col)
    return result

def get_unique_file_name(self)

Generate a unique file name.
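A minimal sketch of the same naming scheme using the standard library instead of NumPy is shown here. The function name `unique_file_name` is hypothetical, and `random.randrange` is swapped in for `np.random.randint`; both exclude the upper bound.

```python
import random


def unique_file_name() -> str:
    """Build a name of the form profile_<10 digits>.html."""
    # randrange excludes the upper bound, like np.random.randint.
    number = random.randrange(1_000_000_000, 9_999_999_999)
    return f"profile_{number}.html"


name = unique_file_name()
print(name)
```

The name is random rather than guaranteed unique; collisions are unlikely but possible.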

Source code
def get_unique_file_name(self):
    """Generate a unique file name."""
    return (
        "profile_"
        + str(np.random.randint(1000000000, 9999999999, dtype=np.int64))
        + ".html"
    )

def to_file(self, output_file)

Write the report to a file.

By default a name is generated.

Args

output_file
The name or the path of the file to generate including the extension (.html).
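As a sketch of how a function can accept either a string or a `Path` for the output location, consider the following. The helper `write_report` is hypothetical and writes a fixed HTML snippet instead of a real report.

```python
import tempfile
from pathlib import Path
from typing import Union


def write_report(html: str, output_file: Union[str, Path]) -> Path:
    """Write `html` to `output_file`, given as a str or a Path."""
    path = Path(output_file)  # a no-op conversion if it is already a Path
    path.write_text(html, encoding="utf8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = write_report("<p>report</p>", Path(tmp) / "output.html")
    content = written.read_text(encoding="utf8")

print(content)
```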
Source code
def to_file(self, output_file: Path or str) -> None:
    """Write the report to a file.
    
    By default a name is generated.

    Args:
        output_file: The name or the path of the file to generate including the extension (.html).        
    """
    if isinstance(output_file, str):
        output_file = Path(output_file)

    with output_file.open("w", encoding="utf8") as f:
        wrapped_html = self.to_html()
        if self.minify_html:
            from htmlmin.main import minify

            wrapped_html = minify(
                wrapped_html, remove_all_empty_space=True, remove_comments=True
            )
        f.write(wrapped_html)

def to_html(self)

Generate and return the complete report as one long HTML string, for use with frameworks.

Returns

Profiling report html including wrapper.
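The idea of placing the report body inside a wrapper page can be sketched with the standard library's `string.Template`. This is only a stand-in: the package renders its own `wrapper.html` template with more parameters, and the names `WRAPPER` and `wrap` are hypothetical.

```python
import html
from string import Template

# Simplified stand-in for a wrapper page.
WRAPPER = Template(
    "<!doctype html>\n"
    "<html><head><title>$title</title></head>\n"
    "<body>$content</body></html>"
)


def wrap(content: str, title: str) -> str:
    """Return a complete HTML page around already-rendered content."""
    # The title is plain text, so escape it; the content is HTML already.
    return WRAPPER.substitute(title=html.escape(title), content=content)


page = wrap("<p>stats</p>", "A & B")
print(page)
```

The returned string is a complete page, so a framework can serve it as a response body without writing a file first.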

Source code
def to_html(self) -> str:
    """Generate and return the complete report as one long HTML string,
    for use with frameworks.

    Returns:
        Profiling report html including wrapper.
    
    """
    return templates.template("wrapper.html").render(
        content=self.html,
        title=self.title,
        correlation=len(self.description_set["correlations"]) > 0,
        missing=len(self.description_set["missing"]) > 0,
        sample=len(self.sample) > 0,
        version=__version__,
        offline=self.use_local_assets,
        primary_color=config["style"]["primary_color"].get(str),
        theme=config["style"]["theme"].get(str),
    )