Changelog v2.13.0

🎉 Features

  • configurable numeric precision

👷‍♂️ Internal Improvements

  • string type detection performance optimization

  • various improvements to software quality (flake8, commitlint)

⬆️ Dependencies

  • upgrade from visions 0.6.0 to 0.7.1

  • upgrade from coverage <5 to ~=5.5

Changelog v2.12.0

🎉 Features

  • Add the number and the percentage of negative values for numerical variables [695] (contributed by @gverbock)

  • Enable setting of typeset/summarizer (contributed by @ieaves)

  • Allow empty data frames [678] (contributed by @spbail, @fwd2020-c)
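
As an illustration of the negative-value statistics mentioned above, here is a minimal sketch in plain Python. The helper name negative_stats is hypothetical, and the choice to compute the percentage over non-missing values is an assumption, not necessarily how the report defines it.

```python
def negative_stats(values):
    """Return (count, percentage) of strictly negative values.

    Assumption: None marks a missing value and is ignored, so the
    percentage is relative to the non-missing values only.
    """
    present = [v for v in values if v is not None]
    if not present:
        # Empty input: nothing to count, avoid division by zero.
        return 0, 0.0
    n_negative = sum(1 for v in present if v < 0)
    return n_negative, 100.0 * n_negative / len(present)


count, percentage = negative_stats([3, -1, 0, -7, 12, 5, -2, 8])
print(count, percentage)  # 3 37.5
```

Zero is not counted as negative in this sketch.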

🐛 Bug fixes

  • Patch args for great_expectations datetime profiler [727] (contributed by @jstammers)

  • Negative exponent formatting [723] (reported by @rdpapworth)

📖 Documentation

  • Fix link syntax (contributed by @ChrisCarini)

👷‍♂️ Internal Improvements

  • Several performance improvements (minimal mode, duplicates, frequency table sorting)

  • Introduce pytest-benchmark in CI to monitor commit performance impact

  • Introduce commitlint in CI to start automating the changelog generation

⬆️ Dependencies

  • The ipywidgets dependency was moved to the [notebook] extra, so most of Jupyter will not be installed alongside this package by default (contributed by @akx)

  • Replaced the (testing only) fastparquet dependency with pyarrow (default pandas parquet engine, contributed by @kurosch)

  • Upgrade phik. This drops the hard dependency on numba (contributed by @akx)

Changelog v2.11.0

🎉 Features

  • Great Expectations integration [430] docs (thanks @spbail, @talagluck and the Great Expectations team).

  • Introduced the infer_dtypes parameter to control automatic inference of data types [676] (thanks @mohith7548 and @ieaves).

  • Improved JSON representation for pd.Series, pd.DataFrame, numpy data and Samples.

🚨 Breaking changes

  • Global config setting removed; config resets on report initialization.

⬆️ Dependencies

  • Update pyupgrade to 2.10.0.

Changelog v2.10.1

🐛 Bug fixes

  • Fixed recursion error for NaN values [683] and [671]

  • Fixed error for empty dataframe [664]

  • Fixed Jupyter notebook widget string rendering issue [668]

  • Fixed histogram of string length with NaNs [642] and [613]

  • Fixed slugify logic for interaction columns [663]

📖 Documentation

  • Update Slack community link on readme [673]

  • Include recent contributions to the “Resources” page.

Changelog v2.10.0

🎉 Features

  • Restructured the overview for categorical variables.

  • Handling of compressed files

  • Option for random sample

👷‍♂️ Internal Improvements

  • Full visions integration for type system: read more here.

  • Migrate from Travis CI to GitHub Actions…

🚨 Breaking changes

  • The configuration parameter is replaced by

Changelog v2.9.0

🎉 Features

  • Description per variable is now possible (see the metadata page or the Census example).

🐛 Bug fixes

  • Fixed bug for small DataFrames with unused categories.

  • Fixed bug where parallelization would have side effects.

  • Removed warning where colormap was modified in place.

  • Distinguish between unique and distinct correctly.

📖 Documentation

  • Extend documentation for frequent issues.

  • Extended documentation for Streamlit and Panel.

  • Provide visibility to our supporters.

⬆️ Dependencies

  • Pandas 1.1.0 contains bugs that make it incompatible. Please up- or downgrade.

  • Upgraded visions to 0.5.0.

Changelog v2.9.0rc1

🎉 Features

  • Working with sensitive data: Introduced sensitive=True option to mask non-aggregated data (such as samples, duplicates, frequency tables for categorical columns) [#503].

  • The sample section can be parametrized with a custom sample (for instance mock data).

  • Introduce shorthands for groups of parameters for styles and explorative mode [#499].

  • Metadata of a dataset can be added to the report (see documentation).

  • Numeric columns now report monotonicity information.

  • A pie chart can be generated for boolean and (low) categorical columns.
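
To make the monotonicity item above concrete, the sketch below classifies a sequence of numbers. The function name and the returned labels are hypothetical and chosen for illustration only; the report may word or compute this differently.

```python
def monotonicity(values):
    """Classify a sequence by comparing each value with its successor.

    Sequences with fewer than two values have no pairs to compare and
    are reported as "strictly increasing" by this sketch.
    """
    pairs = list(zip(values, values[1:]))
    if all(a < b for a, b in pairs):
        return "strictly increasing"
    if all(a <= b for a, b in pairs):
        return "increasing"
    if all(a > b for a, b in pairs):
        return "strictly decreasing"
    if all(a >= b for a, b in pairs):
        return "decreasing"
    return "not monotonic"


print(monotonicity([1, 2, 2, 3]))  # increasing
print(monotonicity([5, 3, 1]))     # strictly decreasing
print(monotonicity([1, 3, 2]))     # not monotonic
```

Because the checks run in order, a constant sequence such as [2, 2, 2] is labelled "increasing" here, although it is also non-increasing.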

🐛 Bug fixes

  • NaT values in date columns were interpreted as a date in 1680 by histograms [#507].

  • ValueError: (‘widget type not understood’, ‘select’) [#493].

  • Fixed regression in working with pandas’ nullable integers [#502].

  • Formatting of precision of numeric values has been improved in a few places.

👷‍♂️ Internal Improvements

  • Histograms used to be calculated at view time (single thread) and are now computed in parallel.

  • Matplotlib’s rcParams are now modified through the contextmanager [#494].

📖 Documentation

  • Links to Colab and Binder notebooks [#480 and #497].

  • The documentation for sensitive data, large datasets and metadata has been extended.

🚨 Breaking changes

  • bayesian_blocks binning has been removed, together with the astropy dependency.

  • Config files config_dark.yaml, config_united.yaml and config_explorative.yaml have been removed in favour of shorthand for groups of parameters.

⬆️ Dependencies

  • isort updated to major version 5.

  • attrs is now required for classes.

Changelog v2.8.0

🎉 Features

  • Expanded the Unicode analysis capabilities: next to the most frequently occurring Unicode scripts, categories and blocks, it’s now possible to inspect the most frequent characters for each of them.

  • ProfileReport.set_variable now accepts nested parameters such as report.set_variable("variables.descriptions", {"var1": "Identifier"}).

  • Ability to have descriptions of the variables alongside the descriptive statistics (#232, #402).

  • Config: Introducing config shorthands.

  • Config: plot.scatter_threshold sets the value above which scatter plots are replaced with hexbin plots.

  • Config: html.inline allows for rendering assets as vector images to package export as folder and file (similar to exporting a website) (#452).

  • It’s now possible to specify which interactions to compute to filter out un-needed interactions between columns (#451).

  • When the output_file is omitted in the CLI, it uses the input_file with the HTML extension. This can be useful when profiling a complete directory from the command line, e.g. find . -type f -name "*.csv" -exec pandas_profiling {} \;.

  • Config: Split the in and for more control on the summaries.

  • Config: Included a new configuration sample file config_explorative.yml, including Text (length distribution, unicode information), File (file size, creation time), Image (dimensions, exif information).
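
The dotted key accepted by set_variable (as in "variables.descriptions" above) can be read as a path into a nested configuration. The sketch below only illustrates that idea: set_nested and the plain dict used as config are hypothetical and do not reflect the library's actual implementation.

```python
def set_nested(config, dotted_key, value):
    """Set a value in a nested dict using a key such as 'a.b.c'.

    Intermediate dictionaries are created when they do not exist yet.
    """
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Walk down one level, creating the level if needed.
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


config = {"variables": {}}
set_nested(config, "variables.descriptions", {"var1": "Identifier"})
print(config)  # {'variables': {'descriptions': {'var1': 'Identifier'}}}
```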

🐛 Bug fixes

  • Resolved color ValueError on Mac (#464).

  • Style: too many interactions overflowed tabs. Now they elegantly turn into a select control.

  • Unique variables are always uniform and have high cardinality, hence we can remove the redundant labels.

  • The counts for unicode properties were based on unique characters, instead of following the original frequency distribution.

  • Slimmed down the HTML by removing classes and using more effective CSS.

👷‍♂️ Internal Improvements

  • CI: Added macOS and Windows to the testing environments (experimental).

  • CI: Added python3.9-dev to the testing environment (experimental).

  • CI: Reduced the number of permutations for code formatting and type checking.

📖 Documentation

  • API documentation is now available.

⚠️ Deprecated

  • The bayesian_bins parameter will be removed in the next release.

🚨 Breaking changes

  • Config: is replaced by and

⬆️ Dependencies

  • Update visions to 0.4.4 for more informative Unicode summaries.

Changelog v2.7.1

⬆️ Dependencies

  • Fix version of visions due to breaking changes in new summarization functions.

Changelog v2.7.0

🎉 Features

  • Reports are built in phases, see issue for details (#421)

  • The most frequently occurring duplicate rows are included in the report.

  • ProfileReports can now be saved to and loaded from disk (for caching).

  • Explicit analysis duration is added to the reproduction section of the report.

  • Doc: this version introduces documentation powered by Sphinx. The previously used pdoc3 was adequate initially, but lacks functionality and extensibility.

  • Doc: Dedicated page for large datasets is created (#420).

  • Doc: The installation instructions have been extended, since installation via conda would default to 1.4.1 (#449, #448).

  • CI: Linting, building the documentation and examples and uploading the package to PyPI have been automated using git flow and GitHub Actions.

🐛 Bug fixes

  • Warnings were not shown in the “warnings” tab, but only at variable level (#389).

  • The “median absolute deviation” is now reported instead of the “mean absolute deviation” (#453).

  • Several style-related fixes for Jupyter lab and notebooks (tables, warnings, wide images).

  • pd.NA, introduced in pandas 1, is now supported (#437).

  • The logic for calculating infinite values is now correct (#397).
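
The difference between the two "absolute deviation" statistics named above matters most when outliers are present. The sketch below uses only the standard library; the function names are illustrative, and the median version is unscaled (some definitions multiply it by a consistency constant), which is an assumption about what is reported.

```python
from statistics import mean, median


def mean_absolute_deviation(values):
    """Average distance of the values from their mean."""
    center = mean(values)
    return mean(abs(v - center) for v in values)


def median_absolute_deviation(values):
    """Median distance of the values from their median (unscaled)."""
    center = median(values)
    return median(abs(v - center) for v in values)


data = [1, 2, 3, 4, 100]  # a single large outlier
print(mean_absolute_deviation(data))    # 31.2
print(median_absolute_deviation(data))  # 1
```

The single outlier inflates the mean-based statistic, while the median-based one stays close to the spread of the bulk of the data.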

👷‍♂️ Internal Improvements

  • The number of progress bars is reduced. The progress bars are now grouped by build phase (e.g. describing dataset, building report structure, rendering report, exporting to file, #421).

  • The progress bars provide more information about the current step to the user (#434).

  • Invalid correlation coefficients no longer cause the complete variable to be dropped; instead, the plot now propagates the NaN (#417).

  • Performance: type inference tests now short-circuit, as visions does by default.

  • Performance: the numerical summary is optimized to use numpy directly, instead of slower methods provided by pandas.

  • Config: dynamic histogram bins are now disabled by default for better computational performance (#441).

  • Config: the type inference used to warn when date variables are processed as categorical is now set to False by default, as it is a bottleneck for larger datasets.

  • Warn: the user is warned that to_widgets does not work in Google Colab, which doesn’t support ipywidgets properly (#462).

  • Cln: Moved ProfileReport out of __init__ to its own class file.

  • Cln: removed the output_file parameter from examples.

  • Cln: the HTML representations of the footer and wrapper are moved out of ProfileReport to the report structure.

  • Cln: the imports are automatically ordered with isort.

⚠️ Deprecated

  • Doc: the pdoc3 documentation will be removed in the future.

  • Config: using the config globally is deprecated. In the future, the configuration will be tied to the ProfileReport.

🚨 Breaking changes

  • Doc: the example HTML reports were removed from the repository (still available in the gh-pages branch and documentation).

  • The recoded “correlation” was removed for not being informative enough to justify its costs.

⬆️ Dependencies

  • Requirements now correctly exclude pandas 1.0.0, 1.0.1 and 1.0.2. Either use pandas <1 or >=1.0.3.

Prior to v2.7.0

Previously, there was no explicit changelog. However, changes were included in the release descriptions on GitHub, which you can find at this page.