Cluster Gender Changes

This notebook audits for significant changes in cluster gender annotations, to allow us to detect the significance of shifts over time. It depends on the aligned cluster identities in isbn-version-clusters.parquet.

from pathlib import Path
from functools import reduce

import pandas as pd
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load Data

Define the versions we care about:

versions = ['pgsql', '2022-03-2.0', '2022-07', '2022-10', '2022-11-2.1', '2023-07', '2025-06-3.0', 'current']

Load the aligned ISBNs:

isbn_clusters = pd.read_parquet('isbn-version-clusters.parquet')
isbn_clusters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44866744 entries, 0 to 44866743
Data columns (total 10 columns):
 #   Column       Dtype  
---  ------       -----  
 0   isbn         object 
 1   isbn_id      int32  
 2   current      float64
 3   2025-06-3.0  float64
 4   2023-07      float64
 5   2022-11-2.1  float64
 6   2022-10      float64
 7   2022-07      float64
 8   2022-03-2.0  float64
 9   pgsql        float64
dtypes: float64(8), int32(1), object(1)
memory usage: 3.2+ GB

Different Genders

How many clusters changed gender?

To get started, we need a list of genders in order.

genders = [
    'ambiguous', 'female', 'male', 'unknown',
    'no-author-rec', 'no-book-author', 'no-book', 'absent'
]

Let’s make a function to read gender info:

def read_gender(path, map_file=None):
    cg = pl.scan_parquet(path)
    cg = cg.select([
        pl.col('cluster').cast(pl.Int32),
        pl.when(pl.col('gender') == pl.lit('no-loc-author'))
            .then(pl.lit('no-book-author'))
            .when(pl.col('gender') == pl.lit('no-viaf-author'))
            .then(pl.lit('no-author-rec'))
            .otherwise(pl.col('gender'))
            .cast(pl.Categorical)
            .alias('gender')
    ])
    if map_file is not None:
        map = pl.scan_parquet(map_file)
        cg = cg.join(map, on='cluster', how='left')
        cg = cg.select([
            pl.col('common').alias('cluster'),
            pl.col('gender')
        ])
    return cg

Read each data source’s gender info and map to common cluster IDs:

gender_cc = {
    v: read_gender(f'{v}/cluster-genders.parquet', f'{v}/cluster-map.parquet')
    for v in versions if v != 'current'
}
gender_cc['current'] = read_gender('../book-links/cluster-genders.parquet')

Set up a sequence of frames for merging:

to_merge = [
    gender_cc[v].select([
        pl.col('cluster'),
        pl.col('gender').alias(v)
    ]).unique()
    for v in versions
]

Merge and collect results:

cluster_genders = reduce(lambda df1, df2: df1.join(df2, on='cluster', how='outer_coalesce'), to_merge)
cluster_genders = cluster_genders.collect()

/tmp/ipykernel_4174369/2981784951.py:1: DeprecationWarning: Use of `how='outer_coalesce'` should be replaced with `how='full', coalesce=True`.
  cluster_genders = reduce(lambda df1, df2: df1.join(df2, on='cluster', how='outer_coalesce'), to_merge)

For unclear reasons, a few versions have a null cluster. Drop that.

cluster_genders = cluster_genders.filter(cluster_genders['cluster'].is_not_null())

Now we will convert to Pandas and fix missing values:

cluster_genders = cluster_genders.to_pandas().set_index('cluster')

Now we’ll unify the categories and their orders:

cluster_genders = cluster_genders.apply(lambda vdf: vdf.cat.set_categories(genders, ordered=True))
cluster_genders.fillna('absent', inplace=True)
cluster_genders.head()

	pgsql	2022-03-2.0	2022-07	2022-10	2022-11-2.1	2023-07	2025-06-3.0	current
cluster
101559774	absent	absent	absent	absent	absent	absent	absent	no-book-author
452208798	no-book-author	no-book	no-book	no-book-author	no-book-author	no-book-author	no-book-author	no-book-author
127012979	absent	absent	no-book-author	no-book-author	no-book-author	no-book-author	no-book-author	no-book-author
117407548	no-book-author	no-book-author	no-book-author	no-book-author	no-book-author	no-book-author	no-book-author	no-book-author
102777110	female	female	female	female	female	female	female	female

Let’s save this file for further analysis:

cluster_genders.to_parquet('cluster-version-genders.parquet', compression='zstd')

PostgreSQL to Current

Now we are ready to actually compare cluster genders across categories. Let’s start by comparing original data (PostgreSQL) to current:

ct = cluster_genders[['pgsql', 'current']].value_counts().unstack()
ct = ct.reindex(labels=genders, columns=genders)
ct

current	ambiguous	female	male	unknown	no-author-rec	no-book-author	no-book	absent
pgsql
ambiguous	95825.0	4815.0	11222.0	2398.0	1660.0	3175.0	4.0	5109.0
female	17856.0	1106202.0	1224.0	19534.0	9317.0	256.0	31.0	39691.0
male	30709.0	3094.0	3472004.0	28199.0	30960.0	808.0	155.0	79287.0
unknown	3222.0	130124.0	230286.0	1501820.0	19732.0	21.0	12.0	15158.0
no-author-rec	10753.0	60165.0	332493.0	230073.0	1385651.0	1097.0	125.0	15337.0
no-book-author	9084.0	114768.0	222317.0	132370.0	213811.0	2445284.0	898251.0	278188.0
no-book	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
absent	130365.0	1087795.0	2667206.0	1177309.0	4962030.0	17648931.0	728440.0	358958.0

ctf = ct.divide(ct.sum(axis='columns'), axis='rows')
def style_row(row):
    styles = []
    for col, val in zip(row.index, row.values):
        if col == row.name:
            styles.append('font-weight: bold')
        elif val > 0.1:
            styles.append('color: red')
        else:
            styles.append(None)
    return styles
ctf.style.apply(style_row, 'columns')

current	ambiguous	female	male	unknown	no-author-rec	no-book-author	no-book	absent
pgsql
ambiguous	0.771488	0.038766	0.090348	0.019306	0.013365	0.025562	0.000032	0.041133
female	0.014953	0.926381	0.001025	0.016359	0.007802	0.000214	0.000026	0.033239
male	0.008424	0.000849	0.952482	0.007736	0.008493	0.000222	0.000043	0.021751
unknown	0.001695	0.068473	0.121179	0.790276	0.010383	0.000011	0.000006	0.007976
no-author-rec	0.005282	0.029555	0.163332	0.113019	0.680677	0.000539	0.000061	0.007534
no-book-author	0.002106	0.026603	0.051533	0.030683	0.049561	0.566816	0.208214	0.064484
no-book	nan	nan	nan	nan	nan	nan	nan	nan
absent	0.004533	0.037822	0.092737	0.040934	0.172526	0.613640	0.025327	0.012481

Most of the change is coming from clusters absent in the original but present in the new.

There are also quite a few that had no book author in PGSQL, but no book in the current data - not sure what’s up with that. Let’s look at more crosstabs.

def gender_crosstab(old, new, fractional=True):
    ct = cluster_genders[[old, new]].value_counts().unstack()
    ct = ct.reindex(labels=genders, columns=genders)

    if fractional:
        ctf = ct.divide(ct.sum(axis='columns'), axis='rows')
        return ctf
    else:
        return ct

def plot_gender(set):
    cluster_genders[set].value_counts().sort_index().plot.barh()
    plt.title(f'Gender Distribution in {set}')

PostgreSQL to March 2022 (2.0 release)

This marks the change from PostgreSQL to pure-Rust.

ct = gender_crosstab('pgsql', '2022-03-2.0')
ct.style.apply(style_row, 'columns')

2022-03-2.0	ambiguous	female	male	unknown	no-author-rec	no-book-author	no-book	absent
pgsql
ambiguous	0.977924	0.002955	0.013928	0.000636	0.000878	nan	nan	0.003679
female	0.002190	0.993945	0.000001	0.000301	0.000430	0.000003	0.000005	0.003125
male	0.000591	0.000000	0.995938	0.000528	0.000796	0.000002	0.000014	0.002131
unknown	0.000043	0.002759	0.005304	0.988899	0.001953	0.000001	0.000003	0.001039
no-author-rec	0.000104	0.007919	0.049969	0.031480	0.908599	nan	0.000008	0.001921
no-book-author	0.000002	0.000051	0.000200	0.000108	0.000051	0.649173	0.335017	0.015397
no-book	nan	nan	nan	nan	nan	nan	nan	nan
absent	0.000006	0.000046	0.000239	0.000112	0.000167	0.001884	0.000066	0.997481

This is where we change from no-book-author to no-book for a bunch of books; otherwise things are pretty consistent. This major change is likely a result of changes that count more books and book clusters - we had some inner joins in the PostgreSQL version that were questionable, and in particular we didn’t really cluster solo ISBNs but now we do. But now, if we have a solo ISBN from rating data, it gets a cluster with no book record instead of being excluded from the clustering.

Let’s look at the distribution of statuses for each, starting with PostgreSQL:

plot_gender('pgsql')

And the Rust version:

plot_gender('2022-03-2.0')

March to July 2022

We updated a lot of data files and changed the name and ISBN parsing logic.

ct = gender_crosstab('2022-03-2.0', '2022-07')
ct.style.apply(style_row, 'columns')

2022-07	ambiguous	female	male	unknown	no-author-rec	no-book-author	no-book	absent
2022-03-2.0
ambiguous	0.836228	0.035602	0.083587	0.014814	0.004862	0.000087	0.000016	0.024805
female	0.010065	0.963186	0.000487	0.007911	0.001253	0.000002	0.000012	0.017085
male	0.006706	0.000646	0.974636	0.003702	0.001364	0.000079	0.000014	0.012852
unknown	0.001901	0.040305	0.092947	0.856051	0.003411	0.000200	nan	0.005186
no-author-rec	0.003536	0.020630	0.108694	0.101641	0.762110	0.000009	0.000037	0.003344
no-book-author	0.002056	0.030443	0.057337	0.035301	0.056239	0.809983	0.000007	0.008635
no-book	0.000159	0.002344	0.005280	0.002742	0.004765	0.000634	0.980665	0.003411
absent	0.002109	0.018914	0.040477	0.014926	0.059121	0.003270	0.000001	0.861182

Mostly fine; some more are resolved, existing resolutions are pretty consistent.

plot_gender('2022-07')

July 2022 to Oct. 2022

We changed from DataFusion to Polars and made further ISBN and name parsing changes.

ct = gender_crosstab('2022-07', '2022-10')
ct.style.apply(style_row, 'columns')

2022-10	ambiguous	female	male	unknown	no-author-rec	no-book-author	no-book	absent
2022-07
ambiguous	0.989385	0.004969	0.003626	0.000336	0.001647	nan	nan	0.000036
female	nan	0.995099	nan	0.000361	0.004471	nan	nan	0.000070
male	0.000001	nan	0.994580	0.000431	0.004975	0.000000	nan	0.000012
unknown	nan	nan	nan	0.995467	0.004492	nan	nan	0.000042
no-author-rec	nan	0.000001	0.000003	0.000005	0.999823	0.000131	nan	0.000038
no-book-author	nan	0.000000	0.000001	0.000000	0.000000	0.996612	nan	0.003386
no-book	0.000004	0.000089	0.000035	0.000069	0.000094	0.198066	0.670216	0.131427
absent	nan	nan	nan	nan	nan	nan	nan	1.000000

plot_gender('2022-10')

Oct. 2022 to release 2.1 (Nov. 2022)

We added support for GoodReads CSV data and the Amazon 2018 rating CSV files.

ct = gender_crosstab('2022-10', '2022-11-2.1')
ct.style.apply(style_row, 'columns')

2022-11-2.1	ambiguous	female	male	unknown	no-author-rec	no-book-author	no-book	absent
2022-10
ambiguous	0.999995	nan	nan	nan	nan	nan	nan	0.000005
female	nan	0.999999	nan	nan	nan	nan	nan	0.000001
male	nan	nan	0.999999	nan	nan	nan	nan	0.000001
unknown	nan	nan	nan	0.999998	nan	nan	nan	0.000002
no-author-rec	nan	nan	nan	nan	0.999999	nan	nan	0.000001
no-book-author	nan	nan	nan	0.000000	nan	0.999982	nan	0.000017
no-book	nan	nan	nan	nan	nan	nan	1.000000	nan
absent	nan	0.000000	0.000000	0.000000	0.000000	0.000003	0.031489	0.968508

plot_gender('2022-11-2.1')

Release 2.1 to Jul. 2023

We updated OpenLibrary and VIAF, and made some technical changes.

ct = gender_crosstab('2022-11-2.1', '2023-07')
ct.style.apply(style_row, 'columns')

2023-07	ambiguous	female	male	unknown	no-author-rec	no-book-author	no-book	absent
2022-11-2.1
ambiguous	0.879459	0.006921	0.018915	0.001573	0.000239	0.074744	0.000009	0.018140
female	0.004892	0.975566	0.000307	0.004275	0.000396	0.000482	0.000002	0.014081
male	0.002249	0.000044	0.986211	0.001428	0.000497	0.001463	0.000005	0.008105
unknown	0.000266	0.019105	0.032232	0.945164	0.001086	0.000425	0.000000	0.001721
no-author-rec	0.000277	0.002896	0.007170	0.009781	0.975855	0.000360	0.000004	0.003658
no-book-author	0.000408	0.006223	0.011060	0.004081	0.006726	0.967637	0.000001	0.003864
no-book	0.000363	0.005640	0.011780	0.007411	0.024719	0.003812	0.940827	0.005448
absent	0.002871	0.019474	0.052188	0.024984	0.118197	0.013907	0.000001	0.768377

plot_gender('2023-07')

Jul. 2023 to 3.0 (June 2025)

Mostly technical changes, along with updated OpenLibrary an dVIAF

ct = gender_crosstab('2023-07', '2025-06-3.0')
ct.style.apply(style_row, 'columns')

2025-06-3.0	ambiguous	female	male	unknown	no-author-rec	no-book-author	no-book	absent
2023-07
ambiguous	0.916513	0.015526	0.040841	0.004112	0.000437	0.005387	nan	0.017185
female	0.003798	0.962122	0.000154	0.023668	0.000251	0.000479	0.000000	0.009528
male	0.002358	0.000091	0.980826	0.010716	0.000521	0.000440	0.000001	0.005047
unknown	0.000162	0.021824	0.023425	0.949053	0.002647	0.000014	nan	0.002876
no-author-rec	0.000114	0.001472	0.002637	0.005071	0.952103	0.000162	0.000015	0.038426
no-book-author	0.000084	0.000694	0.001612	0.000504	0.000704	0.994608	0.000000	0.001794
no-book	0.000037	0.000464	0.000953	0.000865	0.003197	0.001035	0.992922	0.000525
absent	0.000476	0.004847	0.013392	0.006382	0.036988	0.003447	0.000001	0.934466

plot_gender('2025-06-3.0')

Release 3.0 (June 2025) to Current

Mostly technical code updates.

ct = gender_crosstab('2025-06-3.0', 'current')
ct.style.apply(style_row, 'columns')

current	ambiguous	female	male	unknown	no-author-rec	no-book-author	no-book	absent
2025-06-3.0
ambiguous	1.000000	nan	nan	nan	nan	nan	nan	nan
female	nan	1.000000	nan	nan	nan	nan	nan	nan
male	nan	nan	1.000000	nan	nan	nan	nan	nan
unknown	nan	nan	nan	1.000000	nan	nan	nan	nan
no-author-rec	nan	nan	nan	nan	1.000000	nan	nan	nan
no-book-author	nan	nan	nan	nan	nan	1.000000	nan	nan
no-book	nan	nan	nan	nan	nan	nan	1.000000	nan
absent	nan	nan	nan	nan	nan	0.955689	nan	0.044311

plot_gender('current')