Changelog v3.0.0

This is the first release to adhere to the SemVer and Conventional Commits specifications.

🎉 Features

  • The report configuration was completely overhauled, providing a more intuitive API and fixing issues inherent to the previous global config.

🐛 Bug fixes

  • Various issues that could not be (easily) solved in the previous configuration architecture are fixed in this release ([584], [644], [698], [720] and [724])

  • Fix crash with exotic characters ([707])

  • Fixed the way (sub)titles were shown in the report grids.

📖 Documentation

  • Enforce QA using flake8 for documentation, for instance checking for backticks and enforcing black code style on examples.

  • Automated configuration documentation API.

👷‍♂️ Internal Improvements

  • CI: mypy type checking was moved to the pre-commit hooks.

🚨 Breaking changes

The configuration syntax has changed!

The yaml configuration now requires the official syntax (e.g. null instead of None). The previously used configuration library could not handle comments with indentation - you are now free to use conventional yaml.

For the Python configuration, the set_variable method has been replaced by direct access to the configuration object, which is more intuitive. For example, you can now set the title as follows: report.config.title = "My title".

The docs provide additional examples.
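
For code that still passes dotted keys to set_variable, a small shim along the following lines may ease migration. This is only a sketch: set_by_path is a hypothetical helper, not part of the library, and SimpleNamespace merely stands in for report.config. Whether every old dotted key maps one-to-one onto an attribute in this release is an assumption to check against the docs.

```python
from types import SimpleNamespace


def set_by_path(config, dotted_key, value):
    """Hypothetical helper: apply an old-style dotted key as attribute access."""
    *parents, leaf = dotted_key.split(".")
    target = config
    for name in parents:
        # Walk down nested sections, e.g. "variables" in "variables.descriptions".
        target = getattr(target, name)
    setattr(target, leaf, value)


# Stand-in for report.config; the real object is provided by the library.
config = SimpleNamespace(title="Report", variables=SimpleNamespace(descriptions={}))

# New style from this release: plain attribute assignment.
config.title = "My title"

# Old-style dotted key (as used with set_variable) mapped onto attributes.
set_by_path(config, "variables.descriptions", {"var1": "Identifier"})
```

The helper raises AttributeError when a parent section does not exist, which makes keys that have no equivalent in the new configuration easy to spot.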

⬆️ Dependencies

  • pydantic and PyYaml are dependencies for the new configuration.

  • confuse and attrs are no longer (explicit) dependencies.

  • Upgraded tangled-up-in-unicode to 0.0.7.

Changelog v2.13.0

🎉 Features

  • configurable numeric precision

👷‍♂️ Internal Improvements

  • string type detection performance optimization

  • various improvements to software quality (flake8, commitlint)

⬆️ Dependencies

  • upgrade from visions 0.6.0 to 0.7.1

  • upgrade from coverage <5 to ~=5.5

Changelog v2.12.0

🎉 Features

  • Add the number and the percentage of negative values for numerical variables [695] (contributed by @gverbock)

  • Enable setting of typeset/summarizer (contributed by @ieaves)

  • Allow empty data frames [678] (contributed by @spbail, @fwd2020-c)

🐛 Bug fixes

  • Patch args for great_expectations datetime profiler [727] (contributed by @jstammers)

  • Negative exponent formatting [723] (reported by @rdpapworth)

📖 Documentation

  • Fix link syntax (contributed by @ChrisCarini)

👷‍♂️ Internal Improvements

  • Several performance improvements (minimal mode, duplicates, frequency table sorting)

  • Introduce pytest-benchmark in CI to monitor commit performance impact

  • Introduce commitlint in CI to start automating the changelog generation

⬆️ Dependencies

  • The ipywidgets dependency was moved to the [notebook] extra, so most of Jupyter will not be installed alongside this package by default (contributed by @akx)

  • Replaced the (testing only) fastparquet dependency with pyarrow (default pandas parquet engine, contributed by @kurosch)

  • Upgrade phik. This drops the hard dependency on numba (contributed by @akx)

Changelog v2.11.0

🎉 Features

  • Great Expectations integration [430] docs (thanks @spbail, @talagluck and the Great Expectations team).

  • Introduced the infer_dtypes parameter to control automatic inference of data types [676] (thanks @mohith7548 and @ieaves).

  • Improved JSON representation for pd.Series, pd.DataFrame, numpy data and Samples.

🚨 Breaking changes

  • Global config setting removed; config resets on report initialization.

⬆️ Dependencies

  • Update pyupgrade to 2.10.0.

Changelog v2.10.1

🐛 Bug fixes

  • Fixed recursion error for NaN values [683] and [671]

  • Fixed error for empty dataframe [664]

  • Fixed Jupyter notebook widget string rendering issue [668]

  • Fixed histogram of string length with NaNs [642] and [613]

  • Fixed slugify logic for interaction columns [663]

📖 Documentation

  • Update Slack community link on readme [673]

  • Include recent contributions to the “Resources” page.

Changelog v2.10.0

🎉 Features

  • Restructured the overview for categorical variables.

  • Handling of compressed files

  • Option for random sample

👷‍♂️ Internal Improvements

  • Full visions integration for type system: read more here.

  • Migrate from Travis CI to Github Actions…

🚨 Breaking changes

  • The configuration parameter is replaced by

Changelog v2.9.0

🎉 Features

  • Description per variable now possible (see the metadata page or the Census example).

🐛 Bug fixes

  • Fixed bug for small DataFrames with unused categories.

  • Fixed bug where parallelization would have side effects.

  • Removed warning where colormap was modified in place.

  • Distinguish between unique and distinct correctly.
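
The difference between the two terms in the last item can be illustrated with a short sketch. It assumes the usual reading (distinct: the number of different values; unique: the values that occur exactly once). The function name is hypothetical and this is not the library's implementation.

```python
from collections import Counter


def distinct_and_unique(values):
    """Return (n_distinct, n_unique) under the assumed definitions."""
    counts = Counter(values)
    n_distinct = len(counts)  # how many different values appear at all
    n_unique = sum(1 for c in counts.values() if c == 1)  # values seen exactly once
    return n_distinct, n_unique


n_distinct, n_unique = distinct_and_unique(["a", "b", "b", "c", "c", "c"])
print(n_distinct, n_unique)  # 3 distinct values, of which only "a" is unique
```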

📖 Documentation

  • Extend documentation for frequent issues.

  • Extended documentation for Streamlit and Panel.

  • Provide visibility to our supporters.

⬆️ Dependencies

  • Pandas 1.1.0 contains bugs that make it incompatible. Please up- or downgrade.

  • Upgraded visions to 0.5.0.

Changelog v2.9.0rc1

🎉 Features

  • Working with sensitive data: Introduced sensitive=True option to mask non-aggregated data (such as samples, duplicates, frequency tables for categorical columns) [#503].

  • The sample section can be parametrized with a custom sample (for instance mock data).

  • Introduce shorthands for groups of parameters for styles and explorative mode [#499].

  • Metadata of a dataset can be added to the report (see documentation).

  • Numeric columns now report monotonicity information.

  • A pie chart can be generated for boolean and (low) categorical columns.

🐛 Bug fixes

  • NaT in date columns were interpreted as a date in 1680 by histograms [#507].

  • ValueError: (‘widget type not understood’, ‘select’) [#493].

  • Fixed regression in working with pandas’ nullable integers [#502].

  • Formatting of precision of numeric values has been improved in a few places.

👷‍♂️ Internal Improvements

  • Histograms used to be calculated at view time (single thread) and are now computed in parallel.

  • Matplotlib’s rcParams are now modified through the contextmanager [#494].

📖 Documentation

  • Links to Colab and Binder notebooks [#480 and #497].

  • The documentation for sensitive data, large datasets and metadata has been extended.

🚨 Breaking changes

  • bayesian_blocks binning has been removed, together with the astropy dependency.

  • Config files config_dark.yaml, config_united.yaml and config_explorative.yaml have been removed in favour of shorthand for groups of parameters.

⬆️ Dependencies

  • isort updated to major version 5.

  • attrs is now required for classes.

Changelog v2.8.0

🎉 Features

  • Expanded the Unicode analysis capabilities: next to the most occurring unicode scripts, categories and blocks, it’s now possible to inspect the most frequent characters for each of them.

  • ProfileReport.set_variable now accepts nested parameters such as report.set_variable("variables.descriptions", {"var1": "Identifier"}).

  • Ability to have descriptions of the variables alongside the descriptive statistics (#232, #402).

  • Config: Introducing config shorthands.

  • Config: plot.scatter_threshold sets the value above which scatter plots are replaced with hexbin plots.

  • Config: html.inline allows for rendering assets as vector images to package export as folder and file (similar to exporting a website) (#452).

  • It’s now possible to specify which interactions to compute, to filter out unneeded interactions between columns (#451).

  • When the output_file is omitted in the CLI, it uses the input_file with an HTML extension. This can be useful when profiling a complete directory from the command line, e.g. find . -type f -name "*.csv" -exec pandas_profiling {} \;.

  • Config: Split the in and for more control on the summaries.

  • Config: Included a new configuration sample file config_explorative.yml, including Text (length distribution, unicode information), File (file size, creation time), Image (dimensions, exif information).
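
The default output name described in the CLI item above amounts to swapping the input file's extension for .html. A minimal sketch of that rule, assuming a hypothetical helper default_output_file (the library's actual implementation may differ):

```python
from pathlib import Path


def default_output_file(input_file, output_file=None):
    """Hypothetical helper: fall back to the input path with an .html suffix."""
    if output_file is not None:
        return Path(output_file)
    # with_suffix replaces only the last extension: data.csv -> data.html
    return Path(input_file).with_suffix(".html")


print(default_output_file("titanic.csv"))
print(default_output_file("titanic.csv", "report.html"))
```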

🐛 Bug fixes

  • Resolved color ValueError on Mac (#464).

  • Style: too many interactions overflowed tabs. Now they elegantly turn into a select control.

  • Unique variables are always uniform and have high cardinality, hence we can remove the redundant labels.

  • The counts for unicode properties were based on unique characters, instead of following the original frequency distribution.

  • Slimmed down the HTML by removing classes and using more effective CSS.

👷‍♂️ Internal Improvements

  • CI: Added macOS and Windows to the testing environments (experimental).

  • CI: Added python3.9-dev to the testing environment (experimental).

  • CI: Reduced the number of permutations for code formatting and type checking.

📖 Documentation

  • API documentation is now available.

⚠️ Deprecated

  • The bayesian_bins parameter will be removed in the next release.

🚨 Breaking changes

  • Config: is replaced by and

⬆️ Dependencies

  • Update visions to 0.4.4 for more informative Unicode summaries.

Changelog v2.7.1

⬆️ Dependencies

  • Fix version of visions due to breaking changes in new summarization functions.

Changelog v2.7.0

🎉 Features

  • Reports are built in phases, see issue for details (#421)

  • The most frequently occurring duplicate rows are included in the report.

  • ProfileReports can now be saved to and loaded from disk (for caching).

  • Explicit analysis duration is added to the reproduction section of the report.

  • Doc: this version introduces documentation powered by Sphinx. The previously used pdoc3 was adequate initially, but lacks functionality and extensibility.

  • Doc: Dedicated page for large datasets is created (#420).

  • Doc: The installation instructions have been extended; installation via conda would default to 1.4.1 (#449, #448).

  • CI: Linting, building the documentation and examples, and uploading the package to PyPI have been automated using git flow and GitHub Actions.

🐛 Bug fixes

  • Warnings were not shown in the “warnings” tab, but only at variable level (#389).

  • The “median absolute deviation” is now reported instead of the “mean absolute deviation” (#453).

  • Several style-related fixes for Jupyter lab and notebooks (tables, warnings, wide images).

  • pd.NA, introduced in pandas 1, is now supported (#437).

  • The logic for calculating infinite values is now correct (#397).
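
To see why the two statistics behind the “median absolute deviation” fix differ, consider a small sample with one outlier. The sketch below uses only the standard library. It takes the mean absolute deviation around the mean, which is an assumption about what was reported before, and applies no scaling constant.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 100]

# Median absolute deviation: median of |x - median(data)|.
med = median(data)
median_abs_dev = median([abs(x - med) for x in data])

# Mean absolute deviation: mean of |x - mean(data)|.
avg = mean(data)
mean_abs_dev = mean([abs(x - avg) for x in data])

print(median_abs_dev, mean_abs_dev)
```

The single outlier barely affects the median-based figure (1) but dominates the mean-based one (31.2), which is why the two should not be reported under the same name.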

👷‍♂️ Internal Improvements

  • The number of progress bars is reduced. The progress bars are now grouped by build phase (e.g. describing dataset, building report structure, rendering report, exporting to file).

  • The progress bars provide more information about the current step to the user (#434).

  • Invalid correlation coefficients no longer cause the complete variable to be dropped; instead, the plot now propagates the NaN (#417).

  • Performance: type inference tests now short-circuit, as visions does by default.

  • Performance: the numerical summary is optimized to use numpy directly, instead of slower methods provided by pandas.

  • Config: dynamic histogram bins are now disabled by default for better computational performance (#441).

  • Config: the type inference check that warns when date variables are processed as categorical is now set to False by default, as it is a bottleneck for larger datasets.

  • Warn: the user is warned that to_widgets does not work in Google Colab, which doesn’t support ipywidgets properly (#462).

  • Cln: Moved ProfileReport out of __init__ to its own class file.

  • Cln: removed the output_file parameter from the examples.

  • Cln: the HTML representations of the footer and wrapper are moved out of ProfileReport to the report structure.

  • Cln: the imports are automatically ordered with isort.

⚠️ Deprecated

  • Doc: the pdoc3 documentation will be removed in the future.

  • Config: using the config globally is deprecated. In the future, the configuration will be tied to the ProfileReport.

🚨 Breaking changes

  • Doc: the example HTML reports were removed from the repository (still available in the gh-pages branch and documentation).

  • The recoded “correlation” was removed for not being informative enough to justify its costs.

⬆️ Dependencies

  • Requirements now correctly exclude pandas 1.0.0, 1.0.1 and 1.0.2. Either use pandas <1 or >= 1.0.3.

Prior to v2.7.0

Previously, there was no explicit changelog. However, changes were included in the release description on GitHub, which you can find at this page.