Overview

Dataset statistics

Number of variables: 3
Number of observations: 1000
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 23.6 KiB
Average record size in memory: 24.1 B

Variable types

Categorical: 3

Warnings

russian has a high cardinality: 995 distinct values (High cardinality)
english has a high cardinality: 961 distinct values (High cardinality)
russian is uniformly distributed (Uniform)
english is uniformly distributed (Uniform)

Reproduction

Analysis started: 2021-05-11 22:15:16.358820
Analysis finished: 2021-05-11 22:15:16.844134
Duration: 0.49 seconds
Software version: pandas-profiling v3.0.0
Download configuration: config.json

Variables

russian
Categorical

HIGH CARDINALITY
UNIFORM

Distinct: 995
Distinct (%): 99.5%
Missing: 0
Missing (%): 0.0%
Memory size: 7.9 KiB

Value               Count
мало                2
много               2
знать               2
что                 2
пора                2
Other values (990)  990

Length

Max length: 19
Median length: 6
Mean length: 6.117
Min length: 1
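
The length figures above are consistent with simple per-value character counts; the mean of 6.117, for example, matches 6117 total characters over 1000 observations. A minimal sketch of such a summary, assuming lengths are measured in Unicode code points (Python's `len` on `str`). The `length_stats` helper is hypothetical, and the sample holds only the first five rows of this column as listed under Sample, not the full dataset:

```python
from statistics import mean, median


def length_stats(values):
    """Return max, median, mean and min length (in code points) of the strings."""
    lengths = [len(v) for v in values]
    return {
        "max": max(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "min": min(lengths),
    }


# First five rows of the russian column (see the Sample section of the report).
sample = ["и", "в", "не", "он", "на"]
stats = length_stats(sample)
print(stats)
```

Run on the full column, the same function would be expected to reproduce the four values reported above, provided the report also counts code points rather than bytes.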

Characters and Unicode

Total characters: 6117
Distinct characters: 43
Distinct categories: 7
Distinct scripts: 3
Distinct blocks: 2
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
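
The general category of each character can be inspected with Python's standard library. A minimal sketch, assuming only the general category is needed (the standard `unicodedata` module does not expose the script or block properties). The `category_counts` helper and the sample strings are illustrative and are not taken from the report's own implementation:

```python
import unicodedata
from collections import Counter


def category_counts(values):
    """Count characters by Unicode general category across all strings."""
    counts = Counter()
    for value in values:
        for ch in value:
            # unicodedata.category returns two-letter codes, e.g.
            # Ll = Lowercase Letter, Lu = Uppercase Letter, Nd = Decimal Number,
            # Zs = Space Separator, Ps/Pe = Open/Close Punctuation,
            # Po = Other Punctuation.
            counts[unicodedata.category(ch)] += 1
    return counts


# Illustrative strings only: one Cyrillic word and one cross-reference marker.
sample = ["что", "(See #279)"]
counts = category_counts(sample)
print(counts)
```

The two-letter codes map onto the category names used in the tables of this report, so `Ll` corresponds to Lowercase Letter and `Nd` to Decimal Number.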

Unique

Unique: 990
Unique (%): 99.0%
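
Distinct and Unique measure different things here: 995 distinct values of which 990 occur exactly once implies 5 values that repeat, and with each of those occurring twice the total is 990 + 5 × 2 = 1000 observations. A minimal sketch of the distinction, assuming "unique" means "occurs exactly once". The `distinct_and_unique` helper and the miniature column are hypothetical:

```python
from collections import Counter


def distinct_and_unique(values):
    """Return (number of distinct values, number of values occurring exactly once)."""
    counts = Counter(values)
    distinct = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return distinct, unique


# Hypothetical miniature column: two repeated words and three singletons.
sample = ["мало", "мало", "много", "много", "звук", "либо", "смысл"]
distinct, unique = distinct_and_unique(sample)
print(distinct, unique)
```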

Sample

1st row: и
2nd row: в
3rd row: не
4th row: он
5th row: на

Common Values

Value               Count  Frequency (%)
мало                2      0.2%
много               2      0.2%
знать               2      0.2%
что                 2      0.2%
пора                2      0.2%
звук                1      0.1%
либо                1      0.1%
смысл               1      0.1%
мнение              1      0.1%
приходить           1      0.1%
Other values (985)  985    98.5%

Length

Figure: Histogram of lengths of the category
Value               Count  Frequency (%)
мало                2      0.2%
много               2      0.2%
знать               2      0.2%
пора                2      0.2%
что                 2      0.2%
пытаться            1      0.1%
власть              1      0.1%
человеческий        1      0.1%
приходить           1      0.1%
стать               1      0.1%
Other values (987)  987    98.5%

Most occurring characters

Value              Count  Frequency (%)
о                  645    10.5%
т                  526    8.6%
а                  484    7.9%
е                  395    6.5%
с                  364    6.0%
и                  345    5.6%
н                  339    5.5%
ь                  316    5.2%
р                  306    5.0%
в                  263    4.3%
Other values (33)  2134   34.9%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter   6106   99.8%
Uppercase Letter   3      < 0.1%
Decimal Number     3      < 0.1%
Space Separator    2      < 0.1%
Open Punctuation   1      < 0.1%
Other Punctuation  1      < 0.1%
Close Punctuation  1      < 0.1%

Most frequent character per category

Lowercase Letter
Value              Count  Frequency (%)
о                  645    10.6%
т                  526    8.6%
а                  484    7.9%
е                  395    6.5%
с                  364    6.0%
и                  345    5.7%
н                  339    5.6%
ь                  316    5.2%
р                  306    5.0%
в                  263    4.3%
Other values (24)  2123   34.8%

Uppercase Letter
Value              Count  Frequency (%)
М                  1      33.3%
Р                  1      33.3%
S                  1      33.3%

Decimal Number
Value              Count  Frequency (%)
6                  2      66.7%
3                  1      33.3%

Space Separator
Value              Count  Frequency (%)
(space)            2      100.0%

Open Punctuation
Value              Count  Frequency (%)
(                  1      100.0%

Other Punctuation
Value              Count  Frequency (%)
#                  1      100.0%

Close Punctuation
Value              Count  Frequency (%)
)                  1      100.0%

Most occurring scripts

Value     Count  Frequency (%)
Cyrillic  6106   99.8%
Common    8      0.1%
Latin     3      < 0.1%

Most frequent character per script

Cyrillic
Value              Count  Frequency (%)
о                  645    10.6%
т                  526    8.6%
а                  484    7.9%
е                  395    6.5%
с                  364    6.0%
и                  345    5.7%
н                  339    5.6%
ь                  316    5.2%
р                  306    5.0%
в                  263    4.3%
Other values (25)  2123   34.8%

Common
Value              Count  Frequency (%)
(space)            2      25.0%
6                  2      25.0%
(                  1      12.5%
#                  1      12.5%
3                  1      12.5%
)                  1      12.5%

Latin
Value              Count  Frequency (%)
e                  2      66.7%
S                  1      33.3%

Most occurring blocks

Value     Count  Frequency (%)
Cyrillic  6106   99.8%
ASCII     11     0.2%

Most frequent character per block

Cyrillic
Value              Count  Frequency (%)
о                  645    10.6%
т                  526    8.6%
а                  484    7.9%
е                  395    6.5%
с                  364    6.0%
и                  345    5.7%
н                  339    5.6%
ь                  316    5.2%
р                  306    5.0%
в                  263    4.3%
Other values (25)  2123   34.8%

ASCII
Value              Count  Frequency (%)
(space)            2      18.2%
e                  2      18.2%
6                  2      18.2%
(                  1      9.1%
S                  1      9.1%
#                  1      9.1%
3                  1      9.1%
)                  1      9.1%

english
Categorical

HIGH CARDINALITY
UNIFORM

Distinct: 961
Distinct (%): 96.1%
Missing: 0
Missing (%): 0.0%
Memory size: 7.9 KiB

Value                  Count
to ask                 3
to fit, fall; have to  3
to leave, go away      2
necessary              2
long                   2
Other values (956)     988

Length

Max length: 130
Median length: 11
Mean length: 13.169
Min length: 1

Characters and Unicode

Total characters: 13169
Distinct characters: 75
Distinct categories: 9
Distinct scripts: 4
Distinct blocks: 4
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 924
Unique (%): 92.4%

Sample

1st row: and, though
2nd row: in, at
3rd row: not
4th row: he
5th row: on, it, at, to

Common Values

Value                  Count  Frequency (%)
to ask                 3      0.3%
to fit, fall; have to  3      0.3%
to leave, go away      2      0.2%
necessary              2      0.2%
long                   2      0.2%
to write               2      0.2%
order                  2      0.2%
to drink               2      0.2%
Russian                2      0.2%
before, in front of    2      0.2%
Other values (951)     978    97.8%

Length

Figure: Histogram of lengths of the category
Value                Count  Frequency (%)
to                   256    11.0%
see                  33     1.4%
be                   20     0.9%
in                   20     0.9%
as                   18     0.8%
for                  16     0.7%
come                 15     0.6%
of                   14     0.6%
the                  13     0.6%
by                   12     0.5%
Other values (1240)  1914   82.1%
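
The table above appears to count individual words rather than whole cell values: `to` alone occurs 256 times, and the counts sum to well over the 1000 observations. A rough sketch of such a count, assuming lowercasing, whitespace splitting, and stripping of surrounding punctuation. These tokenization rules and the `word_counts` helper are assumptions for illustration and may differ from what pandas-profiling actually does:

```python
from collections import Counter


def word_counts(values):
    """Count lowercase, whitespace-separated words across all strings."""
    counts = Counter()
    for value in values:
        for token in value.lower().split():
            # Assumed cleanup: drop punctuation attached to either end of a token.
            word = token.strip(",;()")
            if word:
                counts[word] += 1
    return counts


# A few values from the english column's Common Values table.
sample = ["to ask", "to leave, go away", "to write", "to drink", "before, in front of"]
counts = word_counts(sample)
print(counts.most_common(3))
```

Because most verbs in the column are glossed in the infinitive, a word-level count of this kind would naturally put `to` far ahead of every other token.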

Most occurring characters

Value              Count  Frequency (%)
e                  1370   10.4%
(space)            1334   10.1%
t                  1048   8.0%
o                  1021   7.8%
a                  788    6.0%
r                  730    5.5%
,                  673    5.1%
n                  656    5.0%
i                  616    4.7%
s                  611    4.6%
Other values (65)  4322   32.8%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter   10588  80.4%
Space Separator    1334   10.1%
Other Punctuation  834    6.3%
Decimal Number     191    1.5%
Open Punctuation   75     0.6%
Close Punctuation  75     0.6%
Uppercase Letter   59     0.4%
Dash Punctuation   8      0.1%
Nonspacing Mark    5      < 0.1%

Most frequent character per category

Lowercase Letter
Value              Count  Frequency (%)
e                  1370   12.9%
t                  1048   9.9%
o                  1021   9.6%
a                  788    7.4%
r                  730    6.9%
n                  656    6.2%
i                  616    5.8%
s                  611    5.8%
l                  534    5.0%
h                  350    3.3%
Other values (34)  2864   27.0%

Other Punctuation
Value              Count  Frequency (%)
,                  673    80.7%
;                  67     8.0%
#                  65     7.8%
'                  11     1.3%
"                  6      0.7%
(not rendered)     4      0.5%
.                  3      0.4%
?                  2      0.2%
!                  2      0.2%
:                  1      0.1%

Decimal Number
Value              Count  Frequency (%)
3                  28     14.7%
1                  24     12.6%
9                  24     12.6%
4                  20     10.5%
7                  20     10.5%
6                  20     10.5%
5                  18     9.4%
8                  15     7.9%
2                  13     6.8%
0                  9      4.7%

Uppercase Letter
Value              Count  Frequency (%)
S                  49     83.1%
M                  3      5.1%
R                  3      5.1%
G                  2      3.4%
I                  1      1.7%
A                  1      1.7%

Space Separator
Value              Count  Frequency (%)
(space)            1334   100.0%

Open Punctuation
Value              Count  Frequency (%)
(                  75     100.0%

Close Punctuation
Value              Count  Frequency (%)
)                  75     100.0%

Dash Punctuation
Value              Count  Frequency (%)
-                  8      100.0%

Nonspacing Mark
Value              Count  Frequency (%)
́                  5      100.0%

Most occurring scripts

Value      Count  Frequency (%)
Latin      10604  80.5%
Common     2517   19.1%
Cyrillic   43     0.3%
Inherited  5      < 0.1%

Most frequent character per script

Latin
Value              Count  Frequency (%)
e                  1370   12.9%
t                  1048   9.9%
o                  1021   9.6%
a                  788    7.4%
r                  730    6.9%
n                  656    6.2%
i                  616    5.8%
s                  611    5.8%
l                  534    5.0%
h                  350    3.3%
Other values (22)  2880   27.2%

Common
Value              Count  Frequency (%)
(space)            1334   53.0%
,                  673    26.7%
(                  75     3.0%
)                  75     3.0%
;                  67     2.7%
#                  65     2.6%
3                  28     1.1%
1                  24     1.0%
9                  24     1.0%
4                  20     0.8%
Other values (14)  132    5.2%

Cyrillic
Value              Count  Frequency (%)
о                  6      14.0%
и                  5      11.6%
к                  5      11.6%
н                  3      7.0%
в                  3      7.0%
е                  3      7.0%
м                  3      7.0%
а                  3      7.0%
ч                  2      4.7%
р                  2      4.7%
Other values (8)   8      18.6%

Inherited
Value              Count  Frequency (%)
́                  5      100.0%

Most occurring blocks

Value         Count  Frequency (%)
ASCII         13117  99.6%
Cyrillic      43     0.3%
Diacriticals  5      < 0.1%
Punctuation   4      < 0.1%

Most frequent character per block

ASCII
Value              Count  Frequency (%)
e                  1370   10.4%
(space)            1334   10.2%
t                  1048   8.0%
o                  1021   7.8%
a                  788    6.0%
r                  730    5.6%
,                  673    5.1%
n                  656    5.0%
i                  616    4.7%
s                  611    4.7%
Other values (45)  4270   32.6%

Punctuation
Value              Count  Frequency (%)
(not rendered)     4      100.0%

Cyrillic
Value              Count  Frequency (%)
о                  6      14.0%
и                  5      11.6%
к                  5      11.6%
н                  3      7.0%
в                  3      7.0%
е                  3      7.0%
м                  3      7.0%
а                  3      7.0%
ч                  2      4.7%
р                  2      4.7%
Other values (8)   8      18.6%

Diacriticals
Value              Count  Frequency (%)
́                  5      100.0%

part of speech
Categorical

Distinct: 37
Distinct (%): 3.7%
Missing: 0
Missing (%): 0.0%
Memory size: 7.9 KiB

Value              Count
noun               374
verb               232
adjective          127
adverb             112
preposition        37
Other values (32)  118

Length

Max length: 26
Median length: 4
Mean length: 5.885
Min length: 3

Characters and Unicode

Total characters: 5885
Distinct characters: 24
Distinct categories: 5
Distinct scripts: 3
Distinct blocks: 2
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 20
Unique (%): 2.0%

Sample

1st row: conjunction
2nd row: preposition
3rd row: particle
4th row: pronoun
5th row: preposition

Common Values

Value              Count  Frequency (%)
noun               374    37.4%
verb               232    23.2%
adjective          127    12.7%
adverb             112    11.2%
preposition        37     3.7%
pronoun            36     3.6%
conjunction        12     1.2%
misc               12     1.2%
cardinal number    11     1.1%
particle           7      0.7%
Other values (27)  40     4.0%

Length

Figure: Histogram of lengths of the category
Value              Count  Frequency (%)
noun               378    36.0%
verb               234    22.3%
adjective          129    12.3%
adverb             118    11.2%
pronoun            40     3.8%
preposition        39     3.7%
particle           19     1.8%
number             18     1.7%
cardinal           16     1.5%
conjunction        15     1.4%
Other values (15)  43     4.1%

Most occurring characters

Value              Count  Frequency (%)
n                  984    16.7%
e                  698    11.9%
o                  588    10.0%
r                  497    8.4%
v                  481    8.2%
u                  456    7.7%
b                  373    6.3%
a                  309    5.3%
i                  283    4.8%
d                  268    4.6%
Other values (14)  948    16.1%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter   5806   98.7%
Space Separator    51     0.9%
Other Punctuation  26     0.4%
Open Punctuation   1      < 0.1%
Close Punctuation  1      < 0.1%

Most frequent character per category

Lowercase Letter
Value              Count  Frequency (%)
n                  984    16.9%
e                  698    12.0%
o                  588    10.1%
r                  497    8.6%
v                  481    8.3%
u                  456    7.9%
b                  373    6.4%
a                  309    5.3%
i                  283    4.9%
d                  268    4.6%
Other values (10)  869    15.0%

Other Punctuation
Value              Count  Frequency (%)
,                  26     100.0%

Space Separator
Value              Count  Frequency (%)
(space)            51     100.0%

Open Punctuation
Value              Count  Frequency (%)
(                  1      100.0%

Close Punctuation
Value              Count  Frequency (%)
)                  1      100.0%

Most occurring scripts

Value     Count  Frequency (%)
Latin     5805   98.6%
Common    79     1.3%
Cyrillic  1      < 0.1%

Most frequent character per script

Latin
Value              Count  Frequency (%)
n                  984    17.0%
e                  698    12.0%
o                  588    10.1%
r                  497    8.6%
v                  481    8.3%
u                  456    7.9%
b                  373    6.4%
a                  309    5.3%
i                  283    4.9%
d                  268    4.6%
Other values (9)   868    15.0%

Common
Value              Count  Frequency (%)
(space)            51     64.6%
,                  26     32.9%
(                  1      1.3%
)                  1      1.3%

Cyrillic
Value              Count  Frequency (%)
с                  1      100.0%

Most occurring blocks

Value     Count  Frequency (%)
ASCII     5884   > 99.9%
Cyrillic  1      < 0.1%

Most frequent character per block

ASCII
Value              Count  Frequency (%)
n                  984    16.7%
e                  698    11.9%
o                  588    10.0%
r                  497    8.4%
v                  481    8.2%
u                  456    7.7%
b                  373    6.3%
a                  309    5.3%
i                  283    4.8%
d                  268    4.6%
Other values (13)  947    16.1%

Cyrillic
Value              Count  Frequency (%)
с                  1      100.0%

Missing values

Figure: A simple visualization of nullity by column.
Figure: Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

Sample

First rows

     russian  english               part of speech
0    и        and, though           conjunction
1    в        in, at                preposition
2    не       not                   particle
3    он       he                    pronoun
4    на       on, it, at, to        preposition
5    я        I                     pronoun
6    что      what, that, why       сonjunction, pronoun
7    тот      that                  adjective, pronoun
8    быть     to be                 verb
9    с        with, and, from, of   preposition

Last rows

     russian       english                                     part of speech
990  художник      painter, artist                             noun
991  знак          sign                                        noun
992  завод         factory                                     noun
993  кулак         fist                                        noun
994  использовать  to use, utilize, make use of                verb
995  стакан        glass                                       noun
996  пахнуть       to smell                                    verb
997  отсюда        from here                                   adverb
998  рот           mouth                                       noun
999  пора          it's time;at times, now and then(See #279)  misc