Integrations

Pandas

pandas-profiling is built on pandas and numpy. Pandas supports a wide range of data formats, including CSV, XLSX, SQL, JSON, HDF5, SAS, BigQuery and Stata. See the pandas documentation for the full list of supported formats.
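Whichever format the data comes from, it loads into the same DataFrame structure. The sketch below uses an in-memory CSV buffer for illustration; the other formats load through their matching pandas readers (pd.read_excel, pd.read_json, pd.read_sql, and so on). The column names are made up for the example.

```python
import io

import pandas as pd

# An in-memory CSV stands in for any supported source; other formats
# load through their matching reader (pd.read_excel, pd.read_json, ...)
csv_data = io.StringIO("age,fare\n22,7.25\n38,71.28\n26,7.92\n")
df = pd.read_csv(csv_data)

print(df.shape)  # (3, 2)
```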

Other frameworks

If you have data in another Python framework, you can use pandas-profiling by converting it to a pandas DataFrame. For large datasets you may need to sample first. Direct integrations are not yet supported.
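A minimal sketch of sampling before profiling; the 10,000-row cap is an arbitrary choice for illustration, not a recommendation from the library:

```python
import numpy as np
import pandas as pd

# Stand-in for a large frame obtained from one of the conversions below
df = pd.DataFrame({"value": np.arange(1_000_000)})

# Profile a reproducible subsample instead of the full data
sample = df.sample(n=10_000, random_state=42)
# report = sample.profile_report()  # then profile the sample as usual
```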

PySpark to Pandas
 # Convert a Spark DataFrame to a pandas DataFrame
 df = spark_df.toPandas()
Dask to Pandas
 # Convert dask DataFrame to a pandas DataFrame
 df = df.compute()
Vaex to Pandas
 # Convert vaex DataFrame to a pandas DataFrame
 df = df.to_pandas_df()
Modin to Pandas
 # Convert modin DataFrame to a pandas DataFrame
 df = df._to_pandas()

# Note that:
#   "This is not part of the API as pandas.DataFrame, naturally, does not posses such a method.
#   You can use the private method DataFrame._to_pandas() to do this conversion.
#   If you would like to do this through the official API you can always save the Modin DataFrame to
#   storage (csv, hdf, sql, ect) and then read it back using Pandas. This will probably be the safer
#   way when working big DataFrames, to avoid out of memory issues."
# Source: https://github.com/modin-project/modin/issues/896

User interfaces

This section lists the various ways the user can interact with the profiling results.

HTML Report

[Animation: the HTML report rendered in an iframe]

Jupyter Lab/Notebook

[Animation: the report widgets in a Jupyter notebook]

Command line

For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable. Run

pandas_profiling -h

for information about options and arguments.

[Screenshot: command-line usage]

Streamlit

Streamlit is an open-source Python library for building web apps for machine learning and data science.

[Animation: a pandas-profiling report embedded in a Streamlit app]
import pandas as pd
import pandas_profiling
import streamlit as st
from streamlit_pandas_profiling import st_profile_report

df = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)
pr = df.profile_report()

st.title("Pandas Profiling in Streamlit")
st.write(df)
st_profile_report(pr)

You can install this Pandas Profiling component for Streamlit with pip:

pip install streamlit-pandas-profiling

Panel

For more information on how to use pandas-profiling in Panel, see https://github.com/pandas-profiling/pandas-profiling/issues/491 and the Pandas Profiling example at https://awesome-panel.org.

Cloud Integrations

Lambda GPU Cloud


pandas-profiling comes pre-installed on one of the Lambda GPU Cloud images. pandas-profiling itself does not provide GPU acceleration, but it supports workflows in which GPU acceleration is possible; for example, this is a great setup for profiling your image datasets while developing computer vision applications. Learn how to launch a 4x GPU instance here.

Google Cloud

The Google Cloud Platform documentation features an article that uses pandas-profiling.

Read it here: Building a propensity model for financial services on Google Cloud.

Kaggle

pandas-profiling is available in Kaggle notebooks by default, as it is included in the standard Kaggle image.

Pipeline Integrations

With Python, command-line and Jupyter interfaces, pandas-profiling integrates seamlessly with DAG execution tools like Airflow, Dagster, Kedro and Prefect.

Integration with Dagster or Prefect can be achieved in a similar way as with Airflow.

Airflow

Integration with Airflow can be easily achieved through the BashOperator or the PythonOperator.

# Using the command-line interface
from airflow.operators.bash import BashOperator

profiling_task = BashOperator(
    task_id="profile_data",  # task_ids may not contain spaces
    bash_command="pandas_profiling dataset.csv report.html",
    dag=dag,
)
# Using the Python interface
import pandas as pd
import pandas_profiling
from airflow.operators.python import PythonOperator


def profile_data(file_name, report_file):
    df = pd.read_csv(file_name)
    report = pandas_profiling.ProfileReport(df, title="Profiling Report in Airflow")
    report.to_file(report_file)

    return "Report generated at {}".format(report_file)


profiling_task2 = PythonOperator(
    task_id="profile_data_python",  # must be unique within the DAG
    op_kwargs={"file_name": "dataset.csv", "report_file": "report.html"},
    python_callable=profile_data,
    dag=dag,
)

Kedro

There is a community-created Kedro plugin available.

Editor Integrations

PyCharm

  1. Install pandas-profiling via the instructions above

  2. Locate your pandas_profiling executable.

On macOS / Linux / BSD:

$ which pandas_profiling
(example) /usr/local/bin/pandas_profiling

On Windows:

$ where pandas_profiling
(example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe

  3. In PyCharm, go to Settings (or Preferences on macOS) > Tools > External tools

  4. Click the + icon to add a new external tool

  5. Insert the following values

    • Name: Pandas Profiling

    • Program: the location obtained in step 2

    • Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"

    • Working Directory: $ProjectFileDir$

PyCharm Integration

To use the PyCharm integration, right-click on any dataset file and select External Tools > Pandas Profiling.