Advanced Usage
A set of options is available to adapt the generated report.
General settings
Parameter | Type | Default | Description |
---|---|---|---|
`title` | string | Pandas Profiling Report | Title for the report, shown in the header and title bar. |
`pool_size` | integer | 0 | Number of workers in the thread pool. When set to zero, the number of available CPUs is used. |
`progress_bar` | boolean | True | If True, pandas-profiling displays a progress bar. |
The configuration can be changed in the following ways:
# Change the config when creating the report
profile = df.profile_report(title="Pandas Profiling Report", pool_size=1)
# Change the config after the report has been created
profile.set_variable("html.minify_html", False)
profile.to_file("output.html")
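The `pool_size` rule described above (zero means "use every available CPU") can be illustrated with a small stand-alone sketch. The helper name `resolve_pool_size` is hypothetical and is not part of pandas-profiling; it only mirrors the documented behaviour using the standard library.

```python
import os


def resolve_pool_size(pool_size: int = 0) -> int:
    """Return the effective number of workers for a requested pool size.

    Hypothetical helper: mirrors the documented rule, not the library's code.
    """
    if pool_size < 0:
        raise ValueError("pool_size must be zero or a positive integer")
    if pool_size == 0:
        # os.cpu_count() may return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    return pool_size


print(resolve_pool_size(1))  # an explicit value is used as-is
print(resolve_pool_size(0))  # zero resolves to the CPU count of this machine
```

Setting `pool_size=1`, as in the example above, keeps the computation on a single worker, which can make debugging easier.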
Variable summary settings
Parameter | Type | Default | Description |
---|---|---|---|
`sort` | None, asc or desc | None | Sort the variables asc(ending), desc(ending) or None (leaves original sorting). |
`variables.descriptions` | dict | {} | Display a description alongside the descriptive statistics of each variable (`{'var_name': 'Description'}`). |
`vars.num.quantiles` | list[float] | [0.05,0.25,0.5,0.75,0.95] | The quantiles to calculate. Note that .25, .5 and .75 are required for other metrics (median and IQR). |
`vars.num.skewness_threshold` | integer | 20 | Warn if the skewness is above this threshold. |
`vars.num.low_categorical_threshold` | integer | 5 | If the number of distinct values is smaller than this number, the series is considered to be categorical. Set to 0 to disable. |
`vars.num.chi_squared_threshold` | float | 0.999 | Set to zero to disable chi squared calculation. |
`vars.cat.length` | boolean | True | Check the string length and aggregate values (min, max, mean, median). |
`vars.cat.characters` | boolean | False | Check the distribution of characters and their Unicode properties. Often informative, but may be computationally expensive. |
`vars.cat.words` | boolean | False | Check the distribution of words. Often informative, but may be computationally expensive. |
`vars.cat.cardinality_threshold` | integer | 50 | Warn if the number of distinct values is above this threshold. |
`vars.cat.n_obs` | integer | 5 | Display this number of observations. |
`vars.cat.chi_squared_threshold` | float | 0.999 | Set to zero to disable chi squared calculation. |
`vars.bool.n_obs` | integer | 3 | Display this number of observations. |
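The note on the quantiles setting can be made concrete: the median is the 0.5 quantile and the interquartile range (IQR) is the 0.75 quantile minus the 0.25 quantile, so dropping any of the three makes those statistics unavailable. The sketch below uses only the standard library and a made-up sample; it is not how pandas-profiling computes its statistics.

```python
import statistics

# Made-up sample data, for illustration only.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 yields the three quartile cut points: the 0.25, 0.5 and 0.75 quantiles.
q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")

median = q50
iqr = q75 - q25

print(median, iqr)
```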
profile = df.profile_report(
    sort='ascending',
    vars={
        'num': {'low_categorical_threshold': 0},
        'cat': {
            'length': True,
            'characters': False,
            'words': False,
            'n_obs': 5,
        }
    }
)
profile.set_variable('variables.descriptions',
    {
        'files': 'Files in the filesystem',
        'datec': 'Creation date',
        'datem': 'Modification date',
    }
)
profile.to_file("report.html")
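The examples on this page address settings in two ways: as nested dictionaries passed as keyword arguments (`vars={'num': {...}}`) and as dotted names passed to `set_variable` (`'variables.descriptions'`, `'html.minify_html'`). Assuming a dotted name simply spells out the path through the nested dictionaries, the relationship can be sketched as follows. `flatten_config` is a hypothetical helper written for this illustration, not a pandas-profiling function.

```python
def flatten_config(nested: dict, prefix: str = "") -> dict:
    """Flatten nested settings into {"dotted.name": value} pairs."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            # Descend into sub-sections such as vars -> cat -> length.
            flat.update(flatten_config(value, name))
        else:
            flat[name] = value
    return flat


settings = {
    "sort": "ascending",
    "vars": {
        "num": {"low_categorical_threshold": 0},
        "cat": {"length": True, "words": False},
    },
}

for name, value in flatten_config(settings).items():
    print(name, "=", value)
```

Settings whose value is itself a dictionary, such as `variables.descriptions`, would have to be treated as leaves rather than flattened further.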
Missing data overview plots
Parameter | Type | Default | Description |
---|---|---|---|
`missing_diagrams.bar` | boolean | True | Display a bar chart with counts of missing values for each column. |
`missing_diagrams.matrix` | boolean | True | Display a matrix of missing values. Similar to the bar chart, but can give an overview of the co-occurrence of missing values in rows. |
`missing_diagrams.heatmap` | boolean | True | Display a heatmap of missing values that measures nullity correlation (i.e. how strongly the presence or absence of one variable affects the presence of another). |
`missing_diagrams.dendrogram` | boolean | True | Display a dendrogram. Provides insight into the co-occurrence of missing values (i.e. columns that are both filled or both empty). |
profile = df.profile_report(
    missing_diagrams={
        'heatmap': False,
        'dendrogram': False,
    }
)
profile.to_file("report.html")
The missing data diagrams are generated by the missingno package.
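The nullity correlation shown in the heatmap can be read as an ordinary correlation computed on "is this value missing?" indicators. The sketch below computes a Pearson correlation between two such indicator columns by hand on made-up data; it illustrates the idea only and is not taken from missingno or pandas-profiling.

```python
from math import sqrt


def nullity_correlation(col_a: list, col_b: list) -> float:
    """Pearson correlation between the missingness indicators of two columns."""
    a = [1.0 if v is None else 0.0 for v in col_a]
    b = [1.0 if v is None else 0.0 for v in col_b]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A column that is always present (or always missing) has no variation.
        return float("nan")
    return cov / sqrt(var_a * var_b)


# Made-up columns: `age` and `income` are missing in the same rows,
# `city` is missing exactly where the other two are present.
age = [None, 31, None, 45]
income = [None, 52000, None, 61000]
city = ["Oslo", None, "Lima", None]

print(nullity_correlation(age, income))  # missing together
print(nullity_correlation(age, city))    # missing in opposite rows
```

A value near 1 means two columns tend to be missing together, a value near -1 means one tends to be present when the other is missing.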
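Correlations
Parameter | Type | Default | Description |
---|---|---|---|
`correlations.pearson.calculate` | boolean | True | Whether to calculate this coefficient. |
`correlations.pearson.warn_high_correlations` | boolean | True | Warn for correlations higher than the threshold. |
`correlations.pearson.threshold` | float | 0.9 | Warning threshold. |
`correlations.spearman.calculate` | boolean | True | Whether to calculate this coefficient. |
`correlations.spearman.warn_high_correlations` | boolean | False | Warn for correlations higher than the threshold. |
`correlations.spearman.threshold` | float | 0.9 | Warning threshold. |
`correlations.kendall.calculate` | boolean | True | Whether to calculate this coefficient. |
`correlations.kendall.warn_high_correlations` | boolean | False | Warn for correlations higher than the threshold. |
`correlations.kendall.threshold` | float | 0.9 | Warning threshold. |
`correlations.phi_k.calculate` | boolean | True | Whether to calculate this coefficient. |
`correlations.phi_k.warn_high_correlations` | boolean | False | Warn for correlations higher than the threshold. |
`correlations.phi_k.threshold` | float | 0.9 | Warning threshold. |
`correlations.cramers.calculate` | boolean | True | Whether to calculate this coefficient. |
`correlations.cramers.warn_high_correlations` | boolean | True | Warn for correlations higher than the threshold. |
`correlations.cramers.threshold` | float | 0.9 | Warning threshold. |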
Disable all correlations:
profile = df.profile_report(
    title="Report without correlations",
    correlations={
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)
# or using a shorthand that is available for correlations
profile = df.profile_report(
    title="Report without correlations",
    correlations=None,
)
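When the explicit form is preferred over the `correlations=None` shorthand, the dictionary from the first example can be built programmatically. This sketch only constructs the dictionary; the coefficient names are the ones used in the example above, and the final call is left as a comment because it needs a DataFrame `df`.

```python
# Coefficient names as they appear in the example above.
COEFFICIENTS = ["pearson", "spearman", "kendall", "phi_k", "cramers"]

# One {"calculate": False} entry per coefficient.
no_correlations = {name: {"calculate": False} for name in COEFFICIENTS}

print(no_correlations)

# With a DataFrame `df` at hand, the dictionary is passed as before:
# profile = df.profile_report(title="Report without correlations",
#                             correlations=no_correlations)
```

Removing a name from `COEFFICIENTS` leaves that coefficient at its default setting, which is a convenient way to keep a single coefficient enabled.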
Interactions
Parameter | Type | Default | Description |
---|---|---|---|
`interactions.continuous` | boolean | True | Generate a 2D scatter plot (or hexagonal binned plot) for all continuous variable pairs. |
`interactions.targets` | list | [] | When a list of variable names is given, only interactions between these and all other variables are computed. |
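The effect of the list of variable names described above can be sketched without the library: with an empty list every variable is paired with every other one, and with a non-empty list only pairs that start from a listed variable remain. The function `interaction_pairs` and the column names are hypothetical, and skipping the pairing of a variable with itself is a choice made for this sketch rather than documented behaviour.

```python
def interaction_pairs(columns: list, targets: list) -> list:
    """List the (x, y) variable pairs for which an interaction plot is drawn."""
    # An empty target list means "no restriction": use every column.
    sources = targets if targets else columns
    return [(x, y) for x in sources for y in columns if x != y]


columns = ["price", "area", "rooms"]  # made-up continuous variables

print(interaction_pairs(columns, []))         # every variable against every other
print(interaction_pairs(columns, ["price"]))  # only pairs starting from "price"
```

On wide datasets the number of pairs grows quadratically, so restricting the list is one way to keep the report small.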
The HTML Report
Parameter | Type | Default | Description |
---|---|---|---|
`html.minify_html` | boolean | True | If True, the output HTML is minified using the htmlmin package. |
`html.use_local_assets` | boolean | True | If True, all assets (stylesheets, scripts, images) are stored locally. If False, a CDN is used for some stylesheets and scripts. |
`html.inline` | boolean | True | If True, all assets are contained in the report. If False, a web export is created, where all assets are stored in the ‘[REPORT_NAME]_assets/’ directory. |
`html.navbar_show` | boolean | True | Whether to include a navigation bar in the report. |
`html.style.theme` | string | None | Select a ‘bootswatch’ theme. Available options: ‘flatly’ (dark) and ‘united’ (orange). |
`html.style.logo` | string | | A base64 encoded logo, to display in the navigation bar. |
`html.style.primary_color` | string | #337ab7 | The primary color to use in the report. |
`html.style.full_width` | boolean | False | By default, the width of the report is fixed. If set to True, the full width of the screen is used. |
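A logo file can be turned into a base64 string with the standard library. The helper below is a sketch: `encode_logo` and the file name are hypothetical, and this page does not state whether the setting expects the bare base64 text or a full `data:` URI, so check the sample configuration before relying on either form.

```python
import base64
from pathlib import Path


def encode_logo(path: str) -> str:
    """Read an image file and return its contents as base64 text."""
    raw = Path(path).read_bytes()
    return base64.b64encode(raw).decode("ascii")


# Hypothetical file name; replace with the path to an existing image.
logo_path = Path("logo.png")
if logo_path.exists():
    encoded = encode_logo(str(logo_path))
    print(encoded[:40], "...")
```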
Using a custom configuration file
To configure pandas-profiling through a custom file, start from one of the sample configuration files below, then change the configuration to your liking.
from pandas_profiling import ProfileReport
profile = ProfileReport(df, config_file="your_config.yml")
profile.to_file("report.html")
Sample configuration files
A great way to get an overview of the possible configuration options is to look through the sample configuration files. The repository contains the following files:
default configuration file (default)
minimal configuration file (minimal computation, optimized for performance)
Configuration shorthands
It’s possible to disable certain groups of features through configuration shorthands.
# Disable samples, correlations, missing diagrams, duplicates and interactions at once
r = ProfileReport(samples=None, correlations=None, missing_diagrams=None, duplicates=None, interactions=None)
# Or use the .set_variable method
r = ProfileReport()
r.set_variable("samples", None)
r.set_variable("duplicates", None)
r.set_variable("correlations", None)
r.set_variable("missing_diagrams", None)
r.set_variable("interactions", None)