Academia.eduAcademia.edu
medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 1 1 Subtyping of common complex diseases and disorders by integrating heterogeneous 2 data. Identifying clusters among women with lower urinary tract symptoms in the LURN 3 study 4 5 Victor P. Andreev 1*, Margaret E. Helmuth1, Gang Liu2, Abigail R. Smith1, Robert M. Merion1, 6 Claire C. Yang3, Anne P. Cameron4, J. Eric Jelovsek 5, Cindy L. Amundsen5, Brian T. Helfand6, 7 Catherine S. Bradley 7, John O. L. DeLancey 4, James W. Griffith8, Alexander P. Glaser6, Brenda 8 W. Gillespie9, J. Quentin Clemens4, H. Henry Lai10, and the LURN Study Group 9 10 1 Arbor Research Collaborative for Health, Ann Arbor, Michigan, United States of America 11 2 Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, 12 Michigan, United States of America 13 3 14 America 15 4 Department of Urology, University of Michigan, Ann Arbor, Michigan, United States of America 16 5 Department of Obstetrics and Gynecology, Duke University, Raleigh, North Carolina, United 17 States of America 18 6 Department of Urology, North Shore University, Evanston, Illinois, United States of America 19 7 Department of Obstetrics and Gynecology, University of Iowa, Iowa City, Iowa, United States 20 of America 21 8 22 States of America 23 9 24 America 25 10 Department of Urology, University of Washington, Seattle, Washington, United States of Department of Medical Social Sciences, Northwestern University, Chicago, Illinois, United Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of Department of Surgery, Washington University, St Louis, Missouri, United States of America 26 NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 2 27 *Corresponding Author 28 Email: victor.andreev@arborresearch.org 29 30 31 Short Title: Subtyping complex disorders by clustering heterogeneous data medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 3 32 ABSTRACT 33 We present a novel methodology for subtyping of persons with a common clinical symptom 34 complex by integrating heterogeneous continuous and categorical data. We illustrate it by clustering 35 women with lower urinary tract symptoms (LUTS), who represent a heterogeneous cohort with 36 overlapping symptoms and multifactorial etiology. Identifying subtypes within this group would 37 potentially lead to better diagnosis and treatment decision-making. Data collected in the Symptoms of 38 Lower Urinary Tract Dysfunction Research Network (LURN), a multi-center prospective observational 39 cohort study, included self-reported urinary and non-urinary symptoms, bladder diaries, and physical 40 examination data for 545 women. Heterogeneity in these multidimensional data required thorough and 41 non-trivial preprocessing, including scaling by controls and weighting to mitigate data redundancy, while 42 the various data types (continuous and categorical) required novel methodology using a weighted 43 Tanimoto indices approach. Data domains only available on a subset of the cohort were integrated using 44 a semi-supervised clustering approach. Novel contrast criterion for determination of the optimal 45 number of clusters in consensus clustering was introduced and compared with existing criteria. 46 Distinctiveness of the clusters was confirmed by using multiple criteria for cluster quality, and by testing 47 for significantly different variables in pairwise comparisons of the clusters. Cluster dynamics were 48 explored by analyzing longitudinal data at 3- and 12-month follow-up. Five distinct clusters of women 49 with LUTS were identified using the developed methodology. The clinical relevance of the identified 50 clusters is discussed and compared with the current conventional approaches to the evaluation of LUTS 51 patients. Rationale and thought process are described for selection of procedures for data 52 preprocessing, clustering, and cluster evaluation. Suggestions are provided for minimum reporting 53 requirements in publications utilizing clustering methodology with multiple heterogeneous data 54 domains. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 4 55 INTRODUCTION 56 Complex diseases, such as obesity, diabetes, atherosclerosis, Alzheimer’s disease, major 57 depressive disorder, and cancer result from multiple genetic, epigenetic, and environmental factors, and 58 importantly, interactions between these factors [1-3]. Important differences between patients with 59 these complex diseases and disorders may exist at multiple levels, including: (a) symptoms, subjective 60 experiences, and adaptive behaviors; (b) characteristics of the physical state of the organism, including 61 comorbidities; (c) characteristics of organs and systems; and (d) characteristics at the cellular or 62 molecular level. The extent of these differences suggests that some complex diseases and disorders are 63 better represented by subtypes, each of which can potentially have different etiologies, mechanisms, 64 and outcomes, and require different approaches to treatment. Subtype identification is therefore a 65 potentially important aspect to understanding and treating complex diseases and disorders. Specifically, 66 extracting patient characteristics at each level and grouping patients based on these characteristics 67 allows for comprehensive clinical phenotyping and provides necessary information for discovery and 68 implementation of personalized treatments. Characteristics at each level are represented by variables of 69 different types (continuous, categorical, and binary), scales, and different level of relevance or impact 70 for the disease of interest. This data heterogeneity requires thoughtful preprocessing and novel 71 approaches to data integration. 72 Lower urinary tract symptoms (LUTS) is a general term representing a heterogeneous group of 73 symptoms with sometimes unclear etiology, high economic and social costs, and significant effects on 74 patients’ quality of life. LUTS can include frequent urination during day and night (nocturia), urinary 75 urgency (a sudden urge to urinate), stress and urgency urinary incontinence (UI), and bladder emptying 76 symptoms such as straining, hesitancy (delay to start to urinate), weak urine stream, and post-void 77 dribbling. None of the symptoms is pathognomonic for a particular diagnosis, and many persons have 78 more than one symptom. The prevalence of LUTS in the United States (US) ranges between 45% and medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 5 79 70% and increases with age [4-5]. Given the aging population in the US, the prevalence of LUTS is 80 expected to increase in the coming years [6]. Medical expenditures for LUTS have been reported to be 81 as high as $65 billion per year [7], yet many therapeutic options for the treatment of LUTS do not 82 provide long-term symptom relief. Many patients experience a combination of symptoms, so treatment 83 that focuses on a single symptom may result in suboptimal care. To improve treatment outcomes for 84 patients with LUTS, it is necessary to sort out the heterogeneity of this population, increase the 85 understanding of different subtypes of LUTS and their underlying mechanisms. 86 One way to better understand a complex disease or disorder is to use an unbiased, data driven 87 unsupervised clustering approach to identify subtypes of individuals with the disease. This method uses 88 data to identify groups, or clusters, such that members of each cluster are as similar as possible to 89 others within their cluster, but as different as possible from those in other clusters [8]. Subtypes 90 identified in this manner are based on similarities and differences within the data, remaining agnostic to 91 clinical definitions or diagnostic categories. 92 Clustering methodologies have generated important contributions to the analysis of health- 93 related data and are becoming increasingly valuable tools as the field of precision medicine progresses. 94 These methods represent a burgeoning field of research [9-15]. Many unsupervised classification or 95 clustering methods have been developed, from commonly used k-means clustering, hierarchical 96 clustering, and self-organizing maps (SOM) [16-18] to algorithms developed in specific areas for specific 97 applications [19-21]. An important problem in unsupervised clustering is determining the optimal 98 number of clusters. Currently available criteria include Calinski-Harabasz, Davies-Bouldin, Dunn, Gap, 99 Silhouette indices, and others [22-26]; however, the optimal number of clusters may vary by the 100 criterion applied. Therefore, another critical issue is to validate the clustering results, that is, to gain 101 confidence about the clinical significance of the putative clusters, both in terms of cluster numbers and 102 cluster assignments. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 6 103 Disease subtyping by clustering homogeneous “-omics” data of certain types, mostly gene 104 expression data, has been extensively published [27-30]. More recent studies have subtyped complex 105 diseases by clustering heterogeneous data that included patient health questionnaires and other clinical 106 data. These research studies include subtyping asthma by using questionnaires, physiological tests, and 107 lab tests [31]; subtyping type 2 diabetes by using body mass index (BMI), age at onset of diabetes, 108 homoeostasis model assessment estimates, and insulin resistance [32]; and subtyping sepsis using 109 demographics, vital signs, biomarkers of inflammation, and organ dysfunction or injury [33]. 110 Resampling based consensus clustering initially proposed for clustering gene expression data 111 [34] is gaining popularity and has been used for subtyping complex diseases using heterogeneous data 112 [33, 35]. Multiple random resampling of patients followed by k-means clustering generates probabilities 113 that each pair of patients appear in the same cluster, which can be treated as a pairwise distance 114 between patients and used for determination of cluster membership through hierarchical clustering 115 [34]. This method ensures the clustering results are robust to sampling errors. 116 Previous studies aiming to identify subtypes of LUTS in an unbiased manner include 117 Epidemiology Urinary Incontinence and Comorbidities (EPIC) and Boston Area Community Health (BACH) 118 projects [36-37], which performed clustering of LUTS patients based on a relatively small number of self- 119 reported symptom data in community-dwelling cohorts. Another study used only BMI and bladder diary 120 variables for clustering community-dwelling women with LUTS [38]. 121 The Symptoms of Lower Urinary Tract Dysfunction Research Network (LURN) Observational 122 Cohort Study is a multi-center study that collected self-reported symptoms, 3-day bladder diaries, 123 physical examination, neuroimaging and sensory testing data, and biological samples in over 1000 care- 124 seeking men and women across six tertiary care centers [39-40]. In our previous study [41], we 125 performed clustering of 545 women from the LURN Observational Cohort Study using baseline self- medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 7 126 reported urinary symptom data, captured with the LUTS Tool [42-43] and the American Urological 127 Association Symptom Index (AUA-SI) [44]. Four distinct clusters were identified. Women in cluster F1 128 (n=138) were continent, but reported post-void dribbling, frequency, and voiding symptoms. Cluster F2 129 (n=80) reported urgency urinary incontinence, as well as urinary urgency and frequency, and minimal 130 voiding symptoms. Cluster F3 (n=244) included women reporting all types of urinary incontinence, 131 urgency, frequency, and mild voiding symptoms. Women in cluster F4 (n=83) reported all LUTS at 132 uniformly high levels. These subtypes of LUTS were based solely on the above two questionnaires and 133 require further refinement, followed by clinical verification. 134 The current report describes the methodology and results of refining the female LUTS symptom- 135 based clusters by integrating multiple data domains collected in LURN: demographics, non-urinary 136 symptoms, history, and physical examination data, as well as intake and voiding patterns captured in 3- 137 day bladder diaries. We discuss preprocessing of heterogeneous data and combination of continuous 138 and categorical data using our novel weighted Tanimoto indices approach. We use resampling-based 139 consensus clustering [34] combined with a modified semi-supervised clustering approach [45-46] to 140 make use of data available on only a subset of participants. Then we determine the optimal number of 141 clusters using our novel contrast criterion (CC) developed for consensus clustering, and compare it with 142 other consensus clustering criteria: proportion of ambiguous clustering (PAC) [47], and consensus score 143 (CS) [35], as well as with the established quality of clustering criteria, such as Calinski-Harabasz, Davies- 144 Bouldin, Dunn, and Silhouette [22-26]. We identify distinct clusters of women with LUTS and show 145 superiority of these clusters to our published symptom-based clusters [41], in terms of the percentage 146 of significantly different variables in pairwise comparisons of the clusters and the confidence level in the 147 determined cluster membership. Dynamics of the clusters in 12-month follow-up, as well as clinical 148 relevance of the clusters, are discussed. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 8 149 In the Methods section, we describe the analytical pipeline we developed and used for 150 subtyping women with LUTS. We provide the rationale for our choices of methods for data 151 preprocessing, integration, clustering, and cluster evaluation, as well as review of other available 152 options. We demonstrate that the developed pipeline allowed for identification of distinct and robust 153 refined clusters, with a higher percentage of significantly different variables across the clusters than 154 those previously published, and validate cluster distinctiveness by analyzing biomarker data. Finally, we 155 review the methodological information needed to assess subtyping via clustering and propose a set of 156 reporting requirements that should ideally be included in all clustering reports. We finish by calling for 157 the clustering community effort to develop minimum requirements for clustering publications. We 158 believe this paper would be of interest for clinicians and researchers involved in subtyping of common 159 complex disease and disorders using heterogeneous and multidomain data. 160 MATERIALS AND METHODS 161 LURN data used for subtyping of LUTS 162 Data on women with LUTS 163 Data for LUTS subtyping were obtained from the LURN Observational Cohort Study [39,40], 164 which included 545 women seeking care for LUTS at six tertiary care centers. Baseline data collection 165 included demographic information, medical history, physical examination findings, 3-day bladder diaries 166 [48], and self-report questionnaires of urologic and non-urologic symptoms. Urologic symptoms were 167 collected using the LUTS Tool [42-43] and the AUA-SI [44]. The LUTS Tool contains 44 items, including 168 questions on the frequency of occurrence and degree of bother for each urinary symptom. Possible 169 answers to the LUTS Tool questions were ranked from zero to four, zero indicating absence of the 170 symptom, and four indicating the most severe level of the symptom. The AUA-SI has eight items, 171 including a single overall bother question. Responses to the first seven questions of the AUA-SI range 172 from zero to five, zero indicating “none” or “not at all”, and five indicating “almost always”. The final medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 9 173 question in the AUA-SI ranges from zero (delighted) to six (terrible). Participants also completed patient- 174 reported outcome (PRO) questionnaires from the Patient-Reported Outcomes Measurement 175 Information System (PROMIS). We used questionnaires related to bowel function (PROMIS 176 gastrointestinal constipation, diarrhea, and bowel incontinence subsets) [49], psychological health 177 (PROMIS Depression and Anxiety Short Forms [50], Perceived Stress Scale [51], PROMIS Sleep 178 Disturbance Short Form [52]), urologic pain (Genitourinary Pain Index [GUPI]) [53], and pelvic floor 179 function (Pelvic Floor Distress Inventory [PFDI]) [54]. Demographics included: age, race, ethnicity, 180 employment status, education level, and marital status. Physical examination data included: weight, 181 waist circumference, BMI, post-void residual volume, pelvic organ prolapse (using the Pelvic Organ 182 Prolapse Quantification [POP-Q] system that measures the location of selected landmarks on the vagina 183 and cervix [55]), and presence of pathology findings at the introitus, urethra, vagina, uterus, or rectum. 184 Medical history data included Functional Comorbidity Index (FCI) [56], additional individual 185 comorbidities, as well as information on history of urinary tract infections (UTI), pregnancy, vaginal 186 deliveries, alcohol, smoking, recreational drug use, and medication use. Bladder diaries included data on 187 timing and volume of each beverage intake and urinary void during a 72-hour period. Completeness and 188 accuracy of bladder diaries collected in LURN are described in [57]. Only 193 women (35%) returned 189 bladder diaries deemed complete. For clustering purposes, we used the following five bladder diary 190 variables from those 193 women: number of intakes and voids, total volumes of intakes and voids, and 191 maximum voided volume (serving as a proxy for bladder capacity). 192 In total, 185 variables were used for subtyping women with LUTS; 27 demographic variables, 55 193 medical history variables, 33 physical exam variables, 52 urinary symptoms variables (LUTS Tool and 194 AUA-SI), 13 non-urinary PRO variables, and 5 bladder diary variables. Variables were continuous (n=83) 195 or categorical (n=102). Supplemental Table S1 presents an overview of these variables for 545 women 196 with LUTS used in the analysis. Although all variables were deemed important or possibly important in medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 10 197 subtyping persons with LUTS, not all the variables are of equal importance, relevance, and non- 198 redundancy; therefore, scaling and weighting of the variables was implemented as described below. 199 Data on non-LUTS controls 200 Our preprocessing procedure, described in more detail in the next section, includes scaling of 201 variables by standardizing their values using means and standard deviations (SDs) in non-LUTS controls. 202 The LURN study included 55 control women, who were not necessarily healthy but did not report LUTS. 203 Unfortunately, not all the variables of interest were collected for these non-LUTS controls (e.g., physical 204 examination, bladder diary). As a source of bladder diary data for non-LUTS controls, we used bladder 205 diaries of 32 non-LUTS controls from the Establishing Prevalence of Incontinence (EPI) community study 206 of women in Southeastern Michigan [58]. For other variables of interest not collected for non-LUTS 207 controls, we used population data from literature sources indicated in Supplemental Table S1. 208 Biological samples collected and analyzed in LUTS cases and non-LUTS controls 209 The LURN study collected numerous biological samples, including whole blood, serum, plasma, 210 and urine at baseline and at 3- and 12-month follow-up visits [39-40]. Of these samples, 230 baseline 211 serum samples of women with LUTS and 30 serum samples of non-LUTS controls were analyzed using 212 the targeted proteomics approach – Proximity Extended Assay (PEA) by Olink Proteomics (Uppsala, 213 Sweden). Three Olink panels (cardio metabolic, inflammation, neurology) were used to quantify 214 abundances of 276 proteins. These data were not used for clustering in the current report; however, 215 they served as an additional orthogonal approach for evaluation of the quality of the identified clusters. 216 We compared the abundances of 276 proteins in women with LUTS and in controls and tested for 217 significantly differentially abundant proteins in each of the identified clusters versus controls adjusted 218 for multiple comparison using the false discovery rate (FDR) correction (FDR<0.05) [59]. Note that assays 219 were performed in a subset (n =230, 42%) of women. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 11 220 221 Ethical guidelines and consent The authors confirm all relevant ethical guidelines have been followed, and all research has 222 been conducted according to the principles expressed in the Declaration of Helsinki. Informed consent 223 has been obtained from participants. Institutional Review Board (IRB) approval has been obtained from: 224 Ethical and Independent Review Services (E&I) IRB, an Association for the Accreditation of Human 225 Research Protection Programs (AAHRPP) Accredited Board, Registration #IRB 00007807. 226 Overview of the clustering pipeline 227 The clustering pipeline implemented in this paper contains multiple steps of data preprocessing, 228 integration, clustering, and cluster evaluation. In this subsection, we present an overview of the 229 sequence of steps in the pipeline shown in Figure 1. Details and rationale for each of the steps are 230 provided in the rest of the subsections of the “Materials and Methods” section below. The pipeline is 231 implemented using MATLAB (MathWorks, Natick, MA) and is publicly available through the Dryad 232 repository (https://datadryad.org). 233 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 12 234 235 Figure 1. Flowchart of the pipeline for subtyping of a common complex disease or disorder by integrating heterogeneous data as used for subtyping of LUTS. 236 237 238 Data preprocessing 239 Multiple imputation 240 Due to missing data (up to 10% in self-report questionnaires of urologic and non-urologic 241 symptoms), multiple imputation was performed. The imputation used a sequential regression technique 242 and was implemented using IVEware version 2.0 [60-61]. Ten imputed data sets were constructed, and 243 each was preprocessed as described below. The k-means step of the resampling-based consensus 244 clustering was performed on each of the data sets separately, resulting in ten pairwise probabilities of medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 13 245 being in the same cluster for each pair of participants. Finally, hierarchical clustering on the mean of ten 246 probabilities (pairwise distances) was performed to determine cluster membership. 247 Scaling of continuous variables 248 Scaling variables prior to clustering is an important and often overlooked step. Clustering 249 algorithms group objects in a way that minimizes the sum of the pairwise distances between 250 participants within the cluster, where the distance is composed of the distances between all variables 251 calculated using Euclidian, Manhattan, or other suitable metrics [9]. Since each variable is measured 252 using its own scale, the distances and optimal partition of the objects depend on these scales. As stated 253 in [62], the problem with unscaled, unstandardized data is the inconsistency between cluster solutions 254 when the scale of some of the variables is changed, which is a strong argument in favor of 255 standardization. It is especially important in the case of heterogeneous data, where scales of variables in 256 the raw data can be very different and completely unrelated. A common form of conversion of variables 257 to standard scores (or z-scores) entails subtracting the mean and dividing by the SD for each variable. 258 However, subtracting the cohort mean and dividing by the cohort SD would mask the subtype 259 differences, since it ignores whether the within-cohort variance is caused by the natural biological 260 variability of the subjects or by differences in disease subtypes, which increase the within-cohort 261 variance due to multimodal distributions along certain variables. Using z-scores along such variables will 262 unduly reduce their weight and will mask the presence of the subtypes. Therefore, standardization using 263 z-scores is not suitable for our task of identifying disease subtypes. The solution to this problem is 264 standardization by the mean and SD of a reference population that does not have the disease of 265 interest, in this case, controls without LUTS. Following this approach, we define standardized variables 266 𝑆𝑆𝑖𝑖𝑖𝑖 [63]: 267 268 𝑆𝑆𝑖𝑖𝑖𝑖 = (𝐴𝐴𝑖𝑖𝑖𝑖 − 𝐴𝐴𝑖𝑖𝑖𝑖 )/𝜎𝜎𝑖𝑖𝑖𝑖 (1), where 𝐴𝐴𝑖𝑖𝑖𝑖 –ith unstandardized variable for participant n, 𝐴𝐴𝑖𝑖𝑖𝑖 -mean value of ith variable across control medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 14 269 subjects without disease of interest, 𝜎𝜎𝑖𝑖𝑖𝑖 – SD of ith variable in control subjects without disease of 270 interest. A simulated example illustrating the benefits (substantially lower misclassification error) of 271 clustering using variables standardized according to equation (eq.) 1, versus unstandardized variables 272 and z-scores, is presented in Supplemental Material text. 273 Weighting variables to mitigate the redundancy 274 Clustering results can be skewed by including variables reflecting redundant information. An 275 obvious and extreme case example will be including into the data set the same or highly correlated 276 variables multiple times, which will result in the dominating role of these variables in the overall sum of 277 squared distances, and therefore in the clustering decision. To mitigate this, we used weighting, so that 278 the highest weight was attributed to the least correlated variable (i.e., the variable with the smallest 279 average correlation with all other variables) and the lowest weight to the most correlated variable [41, 280 64]. The weights wi were defined by equations (2-4): 281 wi = 282 ci = 1 1 + ci / cw m ∑r ij j =1, j ≠ i m 283 cw = ∑ (2), /(m − 1) m ∑r ij i =1 j =1, j ≠ i / m(m − 1) (3), (4), 284 where m -number of variables and rij Pearson correlation coefficients of variables i and j . 285 Row normalization for continuous variables 286 Initial attempts to cluster un-normalized urinary symptoms data led to identification of two 287 clusters that differed by overall severity of LUTS, not subtypes of symptoms. To avoid clustering 288 predominantly by the overall severity of LUTS, in [41, 64] and here, we normalized the data by the medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 15 289 participant's overall severity of disease. For each participant, the weighted Euclidean length of the 290 vector composed of all 78 continuous variables used for clustering was calculated as 𝐿𝐿𝑖𝑖 = 291 th th th 2 �∑78 𝑖𝑖=1( 𝑤𝑤𝑖𝑖 ∙ 𝑆𝑆𝑖𝑖𝑖𝑖) , where 𝑆𝑆𝑖𝑖𝑖𝑖 is the scaled i variable for the n participant and 𝑤𝑤𝑖𝑖 is the weight of i 293 variable defined by eq. (2-4). Each continuous variable was then normalized by 𝐿𝐿𝑖𝑖 , the Euclidean length 294 strategy allowed for clustering based on the direction rather than the length of the vector representing 295 each subject. 296 Consensus clustering using continuous variables 292 297 of the participant’s vector, resulting in normalized continuous variables 𝑉𝑉𝑖𝑖𝑖𝑖 = 𝑤𝑤𝑖𝑖 ∙ 𝑆𝑆𝑖𝑖𝑖𝑖 𝐿𝐿𝑖𝑖 . This normalization Clustering was performed using a resampling-based consensus clustering method introduced by 298 Monti et al [34]. We performed 1000 instances of random resampling, each selecting a subset including 299 80% of participants. The same procedure was repeated for each of the ten multiply imputed data sets, 300 resulting in 10,000 subsets. We then partitioned each of the subsets into clusters using a k-means 301 clustering algorithm implemented as k-means MATLAB function (with option ‘number of replicates’ =8, 302 see [63] for the explanation on the need of this option); with number of clusters K scanned from 2 to 12. 303 Let 𝑄𝑄𝑖𝑖𝑛𝑛 denote the number of times participants 𝑛𝑛 and 𝑞𝑞 were assigned by k-means into the same 304 305 306 307 308 cluster. Let 𝐼𝐼𝑖𝑖𝑛𝑛 denote the number of times participants 𝑛𝑛 and 𝑞𝑞 were both selected in the random sampling. The probability of participants 𝑛𝑛 and 𝑞𝑞 belonging to the same cluster could be calculated as 𝑄𝑄𝑖𝑖𝑛𝑛 /𝐼𝐼𝑖𝑖𝑛𝑛. We could thus obtain ten probabilities for a pair of participants from the ten imputed data sets. The average of these probabilities represented the final consensus index 𝑀𝑀𝑖𝑖𝑛𝑛for participants 𝑛𝑛 and 𝑞𝑞. A 545 by 545 consensus matrix 𝑀𝑀 (Figure 2) was constructed to visualize these average 309 probabilities as a heat map. Probability is color-coded: bright yellow represents probability close to one 310 and dark blue probability close to zero. The indices of participants were reordered so that the 311 participants belonging to the same clusters were grouped together. To reorder the indices of medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 16 312 participants in the consensus matrix, we employed hierarchical clustering (using clustergram MATLAB 313 function) with 1 − 𝑀𝑀 as distance matrix so that participants belonging to the same clusters were 314 315 316 317 grouped together, depicted as bright yellow blocks along the diagonal of consensus matrix. Figure 2. Consensus matrix. Consensus matrix is presented as a heat map, where the probabilities 𝑀𝑀𝑖𝑖𝑛𝑛for each pair of participants to be in the same cluster are shown by color-coded elements; bright yellow represents probability close to one and dark blue probability close to zero. 318 319 320 Clustering using categorical variables 321 Of 185 variables used for clustering women with LUTS, 102 (55%) are categorical. K-means is not 322 an appropriate method for categorical variables, so resampling-based consensus clustering with multiple 323 runs of k-means algorithm, which we used for continuous variables, cannot be directly used for 324 categorical variables. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 17 325 K-prototype approach 326 One way to combine continuous and categorical variables is to use the k-prototype algorithm 327 introduced by Z. Huang [65]. According to [65], the distance between two objects 𝑋𝑋, 𝑌𝑌, described by p 328 329 continuous variables and m-p binary variables, is represented as: 𝑝𝑝 𝑑𝑑(𝑋𝑋, 𝑌𝑌) = ∑𝑗𝑗=1(𝑥𝑥𝑗𝑗 − 𝑦𝑦𝑗𝑗 ) 2 + 𝛾𝛾 ∑𝑚𝑚 𝑗𝑗=𝑝𝑝+1 𝛿𝛿(𝑥𝑥𝑗𝑗 , 𝑦𝑦𝑗𝑗 ) (5), 330 where δ is Kronnecker symbol describing simple matching and γ is the weight introduced in [65] to 331 “avoid favoring either type of attribute” (continuous vs. categorical). The goal of the k-prototype 332 algorithm is to minimize the sum of the distances (defined by eq. 5) between objects within the cluster. 333 Limitations of the k-prototype algorithm include lack of scaling of categorical variables and use of the 334 same weight γ for all categorical variables, regardless of their relevance to the disease of interest or 335 their redundancy. A later version of the algorithm [66] attempted to overcome some of these limitations 336 by defining the distance between an object 𝑋𝑋𝑖𝑖 and the centroid 𝑍𝑍𝑘𝑘 of the kth cluster as: 337 338 339 𝑝𝑝 𝑑𝑑(𝑍𝑍𝑘𝑘 ,𝑋𝑋𝑖𝑖 ) = ∑𝑗𝑗=1(𝑥𝑥𝑖𝑖𝑗𝑗 − 𝑍𝑍𝑘𝑘𝑗𝑗 )2 + ∑𝑚𝑚 𝑗𝑗=𝑝𝑝+1 𝜑𝜑 �𝑍𝑍𝑘𝑘𝑗𝑗 ,𝑥𝑥 𝑖𝑖𝑗𝑗 � (6), 1, 𝑖𝑖𝑖𝑖 𝑍𝑍𝑘𝑘𝑗𝑗 ≠ 𝑥𝑥𝑖𝑖𝑗𝑗 � 𝑖𝑖𝑘𝑘𝑘𝑘𝑘𝑘 � , and 𝐶𝐶𝑘𝑘 is the number of objects in cluster k, while 𝐶𝐶𝑘𝑘𝑗𝑗𝑘𝑘 is where 𝜑𝜑 𝑍𝑍𝑘𝑘𝑗𝑗 ,𝑥𝑥 𝑖𝑖𝑗𝑗 = � , 𝑜𝑜𝑜𝑜ℎ𝑒𝑒𝑒𝑒𝑤𝑤𝑖𝑖𝑒𝑒𝑒𝑒 1− 𝑖𝑖𝑘𝑘 the number of objects in this cluster with the categorical value 𝑎𝑎𝑗𝑗𝑘𝑘 of the jth attribute, e.g., participants 340 with blue eyes in cluster k . Such definition of distance makes sense; for instance, if the number of 341 participants in cluster k is 100, and 51 of them have blue eyes, then for a person with brown eyes, 342 distance from the centroid along this dimension is 1, but for a person with blue eyes, it is 1-0.51=0.49. 343 Now, if 99 participants have blue eyes, then for them, the distance from the centroid is 1-0.99=0.01, 344 while for the only one with brown eyes, it is still 1. Thus, such a definition of distance makes certain 345 attributes more important (defining) for the cluster if the majority of objects have the same value of this 346 attribute. An algorithm using such a definition of distance between objects would strive to make clusters medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 18 347 as homogeneous as possible with regard to both its continuous and categorical variables. This approach, 348 however, does not distinguish between categorical variables relevant and irrelevant to the disease of 349 interest. 350 To distinguish between relevant and irrelevant variables, we suggest using the same approach 351 as for continuous variables, i.e., compare them with controls without the disease or symptom complex 352 of interest. For the categorical variables transformed into binary variables, we suggest scaling by 353 comparison of the frequencies of these binary variables in LUTS 𝐹𝐹𝑗𝑗 and in controls 𝐹𝐹𝑗𝑗𝑖𝑖 by using the 354 following function: 355 𝛾𝛾𝑗𝑗 = �log ( 356 357 358 𝐹𝐹𝑘𝑘 𝐹𝐹𝑘𝑘𝑗𝑗 )� (7), where |𝑥𝑥| is absolute value of 𝑥𝑥. If, for a certain binary variable, frequencies in LUTS and non-LUTS controls are equal, e.g., prevalence of blue eyes is the same in LUTS and non-LUTS, then this variable will get weight 𝛾𝛾𝑗𝑗 = 0 and 359 would not affect clustering decision, essentially excluding the variable from clustering. However, if the 360 361 frequency of this variable in LUTS is higher or lower than in controls, then 𝛾𝛾𝑗𝑗 > 0, and this, relevant to 362 the variables based on their correlation with other variables described by eq. (2-4), one needs to modify 363 eq. (6): 364 𝑑𝑑(𝑍𝑍𝑘𝑘 ,𝑋𝑋𝑖𝑖 ) = ∑𝑗𝑗=1 𝑤𝑤𝑗𝑗 (𝑥𝑥𝑖𝑖𝑗𝑗 − 𝑍𝑍𝑗𝑗𝑘𝑘)2 + ∑𝑚𝑚 𝑗𝑗=𝑝𝑝+1 𝑤𝑤𝑗𝑗 𝜑𝜑�𝑍𝑍𝑘𝑘𝑗𝑗 ,𝑥𝑥 𝑖𝑖𝑗𝑗 � 𝛾𝛾𝑗𝑗 365 Unfortunately, none of the available implementations of k-prototype algorithm in standard software (R 366 and SAS [67-68]) easily allows for such modification, and therefore, an alternative simpler approach was 367 used. disease variable, will affect clustering decisions. To accommodate this scaling together with weighting of 𝑝𝑝 (8) medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 19 368 369 Weighted Tanimoto indices approach A simpler approach to clustering categorical variables is based on Tanimoto indices or Tanimoto 370 similarity measure [69]. For two objects a and b described by m binary variables, Tanimoto similarity is 371 defined as: ∑𝑚𝑚 𝑘𝑘=1 𝑎𝑎𝑘𝑘∙𝑏𝑏 𝑘𝑘 372 𝑇𝑇 = 373 For instance, if a and b are 5-dimensional binary vectors a = [1, 1, 0, 0, 0] and b = [1, 0, 1, 0, 0], (9) ∑𝑚𝑚 � 2 2 � 𝑘𝑘=1 𝑎𝑎𝑘𝑘 +𝑏𝑏 𝑘𝑘 −𝑎𝑎𝑘𝑘∙𝑏𝑏 𝑘𝑘 374 then T=1/(2+2-1)=1/3. Note that common “ones” but not common “zeros” are counted in this definition 375 of similarity, which is especially useful in case of multiple binary variables formed from one categorical 376 variable. Think, for example, of the categorical variable ‘eye color’ transformed into several binary 377 variables: ‘blue eyes’ (yes, no), ‘brown eyes’ (yes, no), ‘green eyes’ (yes, no), etc. Tanimoto similarity 378 between two persons with blue eyes will not depend on whether you add ‘hazel eyes’ to the list of 379 options or not. 380 Not all of the categorical variables are equally relevant to the disease of interest, so we want to 381 be able to assign weights reflecting the level of relevance for each variable by comparing its frequency in 382 LUTS with its frequency in non-LUTS controls. We also want to compensate for redundancy in the 383 variables by using weights defined by equations (2-4). Importantly, we want to make sure that no 384 categorical variable, even if it is much more frequent in disease than in controls, has overwhelmingly 385 high weight and makes the role of differences in other variables negligible. To attain this goal, we 386 introduce a weighed Tanimoto similarity measure as: 387 388 389 𝑇𝑇 = 2 2 ∑𝑚𝑚 𝑘𝑘=1 𝑎𝑎𝑘𝑘∙𝑏𝑏 𝑘𝑘∙𝑤𝑤𝑘𝑘 ∙(erf�𝛾𝛾𝑘𝑘 �) 2 ∑𝑚𝑚 � 2 2 � 2 𝑘𝑘=1( 𝑎𝑎𝑘𝑘 +𝑏𝑏𝑘𝑘 −𝑎𝑎𝑘𝑘∙𝑏𝑏 𝑘𝑘 ∙𝑤𝑤𝑘𝑘 ∙�erf�𝛾𝛾𝑘𝑘 �� ) (10), where 𝑤𝑤𝑗𝑗 is the weight defined by eq. (2-4), using the appropriate correlation coefficients of the variables. Coefficient 𝛾𝛾𝑗𝑗 is defined by eq. (7) and minimizes the role of binary variables equally prevalent medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 20 390 391 in disease and controls. Function erf(𝑥𝑥) = 2 𝑥𝑥 ∫ exp( −𝑜𝑜 2 ) 𝑑𝑑𝑜𝑜 √𝜋𝜋 0 ensures that the weight of each binary variable is smaller or equal to one, even for the high values of 𝛾𝛾𝑗𝑗, since |erf (𝑥𝑥)| ≤ 1 . Note that 392 maximum value of T is equal to one when all the categorical variables in a and b are the same, and 393 minimum value is zero when all the categorical variables are different. Now we can use eq. (10) to 394 define distance between any pair of participants described by J binary variables as: 395 396 397 𝑑𝑑�𝑋𝑋𝑖𝑖 , 𝑋𝑋𝑛𝑛 � = 1 − 𝑇𝑇𝑖𝑖𝑛𝑛 = 1 − ∑𝐽𝐽𝑘𝑘=1 𝑋𝑋𝑘𝑘𝑖𝑖 ∙𝑋𝑋𝑘𝑘𝑗𝑗∙𝑤𝑤2𝑘𝑘 ∙(erf�𝛾𝛾𝑘𝑘�) 2 2 2 +𝑋𝑋 2 −𝑋𝑋 ∙𝑋𝑋 �∙𝑤𝑤2 ∙�erf�𝛾𝛾 �� ) ∑𝐽𝐽 ( �𝑋𝑋𝑘𝑘𝑖𝑖 𝑘𝑘𝑖𝑖 𝑘𝑘𝑗𝑗 𝑘𝑘 𝑘𝑘𝑗𝑗 𝑘𝑘 𝑘𝑘=1 (11) Combining continuous and categorical variables Note the similarity of the pairwise distance 1 − 𝑇𝑇𝑖𝑖𝑛𝑛 between two participants based on the 399 categorical variables describing them (eq.11), and pairwise distance 1 − 𝑀𝑀𝑖𝑖𝑛𝑛 between these 400 variables describing these two participants are the same, while the latter is equal to zero when the two 401 participants always were assigned to the same cluster by the 10,000 instances of k-means in the 402 resampling-based consensus clustering. Similarly, the former is equal to one when all the categorical 403 variables describing the two participants are different, while latter is equal to one when these 404 participants always were assigned to the different clusters in resampling-based consensus clustering. 405 This similarity allows combining continuous and categorical pairwise distances into a single distance 406 measure 𝐷𝐷𝑖𝑖𝑛𝑛, with a minimum value of zero and maximum value of one using the weighted Euclidean 398 407 408 409 410 participants based on their continuous variables. The former is equal to zero when all the categorical length approach: 2 𝐷𝐷𝑖𝑖𝑛𝑛 = �((1 − 𝑀𝑀𝑖𝑖𝑛𝑛 )2 + 𝜇𝜇 2 ∙ �1 − 𝑇𝑇𝑖𝑖𝑛𝑛 � )�(1 + 𝜇𝜇 2 ) (12), where 𝜇𝜇 is the weight representing the relative role of the distances based on categorical variables and on continuous variables. We set it equal to the ratio of the number of non-redundant categorical and medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 21 411 ∑78 continuous variables: 𝜇𝜇 = ∑102 𝑗𝑗=1 𝑤𝑤𝑗𝑗𝑗𝑗𝑎𝑎𝑗𝑗� 𝑗𝑗=1 𝑤𝑤𝑗𝑗𝑗𝑗𝑗𝑗𝑖𝑖𝑗𝑗 , where 𝑤𝑤𝑗𝑗 are determined by eq. (2-4) using 412 correlation coefficients appropriate to the distributions of the variables. 413 Semi-supervised clustering using bladder diary data 414 Of 545 women in the Observational Cohort of LURN, 193 (35%) returned complete bladder 415 diaries without missing volumes of voids and intakes. These data were deemed clinically important to 416 the subtyping of women with LUTS. One way to integrate data domains only available on a subset is by 417 using semi-supervised clustering methods [70-71]. If class membership for the members of the subset is 418 known, then the objective function in the clustering of the whole cohort should be modified as follows: 419 420 𝑝𝑝 2 ∑𝑁𝑁𝑁𝑁𝑘𝑘 ∑𝑁𝑁𝑁𝑁𝑘𝑘 ∑𝑁𝑁𝑘𝑘 𝑊𝑊𝐶𝐶𝑆𝑆𝑆𝑆 = ∑𝑗𝑗=1 ∑𝑁𝑁𝑘𝑘 𝑖𝑖=1 𝑛𝑛=1(𝑥𝑥𝑖𝑖𝑗𝑗 − 𝑥𝑥𝑛𝑛𝑗𝑗 ) + 𝑖𝑖=1 𝑛𝑛=1 ℎ 𝑖𝑖𝑛𝑛 (13), where 𝑊𝑊𝐶𝐶𝑆𝑆𝑆𝑆 - within cluster sum of squared distances, p – number of variables, 𝑛𝑛 ≠ 𝑞𝑞, Nk - number of 421 cohort participants in the given cluster k, Nsk - number of the participants in cluster k that were present 422 in the subset. The values of ℎ 𝑖𝑖𝑛𝑛 = −ℎ are negative (reward, decreasing the within-cluster-sum-of- 423 424 squares [WCSS]) if participants n and m belong to the same cluster according to subset classification, and is positive ℎ 𝑖𝑖𝑛𝑛 = ℎ (punishment, increasing WCSS) if participants n and q belong to the different 425 classes of the subset. This approach known as “must-link, cannot-link” allows for using labels known 426 from classification of the subset to influence clustering of the cohort. The limitation of this approach, 427 however, is that it does not allow for different level of confidence in subset cluster membership, i.e., 428 participants n and q are either in the same subset cluster or not. In our case, additional data available 429 for the subset of participants do not provide 100% confidence in cluster membership for this subset; 430 furthermore, the number of participants in the subset is lower than in the whole cohort, making the 431 subset clusters less robust. There is a measure of similarity, however, that is quantitative and reflects 432 the confidence in cluster membership; it is the pairwise distance between members of the subset. We medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 22 433 suggest using this measure to modify the pairwise distance defined by eq. 12 by taking into account the 434 similarity between members of the bladder diary (BD) subset: 435 𝐺𝐺𝑖𝑖𝑛𝑛 = max ((𝐷𝐷𝑖𝑖𝑛𝑛 + 𝜌𝜌 ∙ (𝐵𝐵𝐷𝐷𝑖𝑖𝑛𝑛 − 𝑚𝑚𝐵𝐵𝐷𝐷)), 0) (14) 437 where 𝐷𝐷𝑖𝑖𝑛𝑛 is defined by eq. 12, 𝐵𝐵𝐷𝐷𝑖𝑖𝑛𝑛is the pairwise distance between members of the subset, 𝑚𝑚𝐵𝐵𝐷𝐷 is 438 in the below sections. If either participant n or q, or both of them are not members of the subset, their 439 pairwise distance is not known and is assumed equal to the mean pairwise distance within the subset 440 𝑚𝑚𝐵𝐵𝐷𝐷. For these participants, the second term of eq. 14 is equal to zero; while, for members of the 436 the mean pairwise distance between members of the subset, 𝜌𝜌 is a parameter determined as described 442 subset, it is either negative or positive depending on whether 𝐵𝐵𝐷𝐷𝑖𝑖𝑛𝑛 is smaller or larger than 𝑚𝑚𝐵𝐵𝐷𝐷. Note 443 that, since the second term might be negative, for some large values of 𝜌𝜌, the sum of the two terms is negative as well. However, the pairwise distance between the objects cannot be negative, and is 444 therefore set to zero for these cases. 441 445 We used five variables (number of intakes and voids, total volumes of intakes and voids, and 446 maximum voided volume) from the bladder diaries of the 193 participants to refine the values of their 447 pairwise distances. These five variables were scaled using bladder diary data for controls [58] and then 448 added to the 78 other continuous variables to calculate pairwise distances 𝐵𝐵𝐷𝐷𝑖𝑖𝑛𝑛 refined with bladder 449 450 451 diary variables, as described in the subsection on consensus clustering using continuous variables. Then it was introduced into eq. 14 to get the refined matrix of the pairwise distances 𝐺𝐺𝑖𝑖𝑛𝑛 . The value of the coefficient 𝜌𝜌 was determined by optimizing the quality of clusters, as defined in the following 452 subsections. 453 Determining the number of clusters 454 455 Determining the number of clusters is an important step in any clustering process. In partitioning algorithms like k-means, it is necessary to decide on the number of clusters K prior to medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 23 456 running the algorithm. In agglomerative algorithms like hierarchical clustering, it is possible to decide on 457 the number of clusters when the dendrogram based on the distances between objects is already 458 created. It is common to try several values of K and then to compare the resultant clusters by using 459 various quality of clustering criteria, including Calinski-Harabasz, Davies-Bouldin, Dunn, Gap, and 460 Silhouette indices [22-26]. Quite often, these criteria disagree on the value of K that optimizes the 461 quality of the clusters. 462 Resampling-based consensus clustering, introduced in [34] and subsequently applied to our data 463 set, is a combination of multiple instances of k-means clustering followed by hierarchical clustering on 464 the pairwise distances between objects derived at the first stage. The value of K in k-means is typically 465 scanned (in our case from 2 to 12), and then the optimal value of K is determined using criteria 466 developed specifically for consensus clustering algorithm [34-35, 47]. In both [34] and [47], the 467 determination of the number of clusters is based on analysis of the cumulative distribution function 468 (CDF) of consensus index values 𝑀𝑀𝑖𝑖𝑛𝑛 (defined in the “Consensus clustering using continuous variables” 469 subsection of this paper). The main idea of this analysis is that, in case of ideal clustering, there are only 470 two possible values of consensus index 𝑀𝑀𝑖𝑖𝑛𝑛 = 1, when a pair of objects n and q are in the same cluster, 471 and 𝑀𝑀𝑖𝑖𝑛𝑛 = 0, when they are in the different clusters. Therefore, the ideal CDF should consist of two 472 vertical lines and a horizontal (flat) line between them. The length of the first vertical line will be equal 473 to the number of pairs with 𝑀𝑀𝑖𝑖𝑛𝑛 = 0 and the length of the second vertical line to the number of pairs 474 with 𝑀𝑀𝑖𝑖𝑛𝑛 = 1. However, if the value of K used in clustering is different from the true value of K, then the 475 shape of the CDF curve differs from the idealized curve described above. In [34], the optimal K is defined 476 as the one for which the change in the area under the CDF (relative to area at K-1 and K+1) is the largest 477 (“elbow” of the AUC vs. K curve). Analysis in [47] demonstrates several examples when the criterion of 478 [34] does not work and suggests an alternative criterion named proportion of ambiguously clustered 479 pairs (PAC) equal to the number of pairs with 0 < 𝑀𝑀𝑖𝑖𝑛𝑛 < 1 over the total number of pairs. Obviously, in medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 24 480 the real-world situation of biological variability and noisy measurements, almost all of the pairs will fall 481 482 into this category, so some more liberal lower and upper boundaries for 𝑀𝑀𝑖𝑖𝑛𝑛 need to be introduced, 483 for different values of K, since potential ambiguity increases with the increased number of clusters. 484 e.g., 0.1 and 0.9, as in [47]. It raises a question, however, whether these boundaries should be different A more straightforward approach to determine the number of clusters is used in [35] by 486 introducing mean consensus score (CS) calculated as the mean value of consensus indices 𝑀𝑀𝑖𝑖𝑛𝑛 within 487 that it favors the high number of clusters, e.g., CS could reach its maximum when K=N/2 and each 488 cluster contains only a pair of most similar objects with highest values of 𝑀𝑀𝑖𝑖𝑛𝑛 . 485 489 the clusters. The value of K that results in the highest CS is considered optimal. The problem with CS is Below, we introduce a pair of complementary criteria, i.e., contrast criterion (CC) and 490 proportion of the core clusters (PCCs), which we used in our clustering pipeline. We believe they 491 combine the advantages of PAC and CS and are free of some of their limitations, especially when used as 492 a complementary pair. The below sections introduce the CC and PCCs based on the analysis of 493 consensus matrix derived by k-means clustering of the multiple resampling instances of the data set; 494 however, these criteria can be used with any matrix of pairwise distances between the objects. 495 Contrast criterion 496 The idea of the CC is derived from visual representation of the consensus matrix as a heat map 498 presented in Figure 2. Each pixel of the heat map represents the value of 𝑀𝑀𝑖𝑖𝑛𝑛 probability of two objects 499 a cluster. Next, we compared the “bright yellowness” of this diagonal square with the “color” of the rest 500 of the row in which the diagonal square is located; therefore, the term “contrast criterion” (CC). The 501 larger the difference between these two measures, the further the situation from the case where all 502 𝑀𝑀𝑖𝑖𝑛𝑛 except for n=q are equal, and the heat map is uniformly yellowish (single uniform cluster case), in 497 to be together in the same cluster. Each bright yellow square along the diagonal of the matrix represents medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 25 503 which case the contrast is zero. We consider number of clusters K and cluster membership optimal 504 when CC is maximized. 505 When analyzing the properties and behavior of clustering criteria, it is necessary to compare the 506 clusters identified using certain clustering algorithms and clustering criteria of interest with the “true” 507 clusters. Unfortunately, “true” clusters are not known in real life, and therefore, one needs to simulate 508 them and then evaluate misclassification error resulting from the clustering algorithm and criteria of 509 interest. Such an approach was used in [63] to compare several popular clustering algorithms, and was 510 applied for the case of resampling-based consensus clustering with CC and PAC criteria (see 511 Supplemental Material text and Supplemental Figures S2-S10). Here, we concentrate on the general 512 analysis of CC and its properties. In this analysis, we need to introduce the term “alleged clusters”, which 513 are different from “true clusters” and “identified clusters”. “Alleged clusters” are determined for each 514 value of K tried by clustering algorithm, while “identified clusters” are those maximizing the value of 515 clustering criteria of interest, and “true clusters” are specified by the simulation. 516 To define CC explicitly, let us first look at the most typical case where the number of alleged 517 clusters is not equal to the number of objects, is not equal to one, and none of the clusters includes just 518 one object. Note that we are not making any assumptions or imposing any restrictions on the properties 519 of the “true” clusters. 520 For 𝐾𝐾 ≠ 𝑁𝑁, 𝐾𝐾 ≠ 1, 𝑁𝑁𝑗𝑗 ≠ 1, 𝐶𝐶𝐶𝐶 = ∑𝐾𝐾 𝑘𝑘=1 � 521 522 where K is the number of alleged clusters, N is the number of the clustered objects, 𝑁𝑁𝑘𝑘 and 𝑁𝑁𝑖𝑖 are the 523 represents the average “bright yellowness” of the diagonal square, while the second term in the 𝑁𝑁 𝑁𝑁 𝑘𝑘 ∑ 𝑘𝑘 ∑𝑖𝑖=1 𝑗𝑗=1,𝑗𝑗≠𝑖𝑖 𝑀𝑀𝑖𝑖𝑗𝑗 𝑁𝑁𝑘𝑘 ( 𝑁𝑁𝑘𝑘−1) − 𝑁𝑁 𝑁𝑁 𝑘𝑘 𝑖𝑖 ∑𝐾𝐾 𝑖𝑖=1,𝑖𝑖≠𝑘𝑘∑𝑖𝑖=1 ∑𝑗𝑗=1 𝑀𝑀𝑖𝑖𝑗𝑗 𝑁𝑁𝑘𝑘 (𝑁𝑁−𝑁𝑁𝑘𝑘 ) ��𝐾𝐾 (15), numbers of objects in kth and ith clusters. As intended, the first term in the brackets of eq. (15) medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 26 524 brackets represents the average “yellowness” of the rest of the row in which the diagonal square is 525 located. 526 Note that we defined CC as an averaged contrast across alleged K clusters independently of the 527 size of the clusters. Other approaches are possible, e.g., a weighted average based on the sizes of the 528 clusters, or minimax approach, where the contrast for the “worst” (least contrast cluster) is maximized. 529 Clearly, the best choice of combining contrasts of each of the clusters into one overall depends on many 530 factors, including the goal of the clustering, expected sizes of the clusters, and the number and 531 distribution of the variables. It is an interesting topic; however, it is outside of the scope of this paper. Another choice made in defining CC by eq. (15) is omitting of the diagonal terms 𝑀𝑀𝑖𝑖𝑖𝑖, which are 532 533 always equal to one. The percentage of diagonal terms in each square representing an identified cluster 534 is 535 terms in the definition of CC would favor smaller clusters over larger clusters by assigning higher CC to 536 the smaller clusters, irrespective of similarities of objects within the clusters. Therefore, we do not 537 include diagonal terms in the definition of CC, as shown in eq. 15. Note that eq. (15) does not work if 538 K=1, K=N, or 𝑁𝑁𝑗𝑗 = 1. For these special cases, CC is defined and discussed in Supplemental Material text. 539 𝑁𝑁𝑘𝑘 𝑁𝑁𝑘𝑘2 = 𝑁𝑁𝑘𝑘−1, or 100%, 50%, 33%, 25% for 𝑁𝑁𝑘𝑘 = 1, 2, 3, 4. Since 𝑀𝑀𝑖𝑖𝑖𝑖 = 1, 𝑀𝑀𝑖𝑖𝑛𝑛 < 1, inclusion of diagonal There are certain similarities between our contrast criterion CC and consensus score (CS) of [35]. 540 Although the exact definition of CS is not provided, it appears that CS is similar to the first term of eq. 541 542 (15). However, it is unclear if the diagonal terms 𝑀𝑀𝑖𝑖𝑖𝑖 are omitted or included and how CS for K clusters 543 CC in some simple idealized cases. In case of ideal clustering, both CS and CC reach their maximum 544 possible value of 1. The worst-case scenario for both criteria is the case where K>1 clusters are alleged, 545 when in reality, there are no “true” clusters, and all objects are described by a unimodal random 546 distribution of their attributes (variables). Now, if the number of alleged clusters K >1, there is an equal are combined to derive the overall CS. Nevertheless, it is of interest to compare the behavior of CS and medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 27 547 probability for the object to end up in any of K clusters, so the mean value of 𝑀𝑀𝑖𝑖𝑛𝑛 is 1/K, which makes 548 the minimal possible value for CC equal to zero and minimal possible value for CS equal to 1/K. The 549 rather limited range of values from 1/K to 1, together with the dependence of the minimal value on the 550 number of alleged clusters, are the limitations of the CS criterion, which are absent in case of CC, where 551 the range is from 0 to 1 irrespectively of K. Therefore, we consider the use of CC advantageous for 552 determining the optimal number of clusters. 553 Core clusters 554 When analyzing the quality of alleged clusters, it is important to know the confidence with 555 which cluster membership is determined. Clearly, one would prefer clusters where the probability of 556 objects to belong to a particular cluster is 0.9 rather than 0.3. The knowledge of consensus matrix allows 557 for calculating the probability for each object n belonging to a particular cluster k: 558 𝜋𝜋𝑖𝑖𝑘𝑘 = 𝑁𝑁 𝑘𝑘 ∑𝑗𝑗=1 𝑗𝑗≠𝑖𝑖 𝑀𝑀𝑖𝑖𝑗𝑗 𝑁𝑁𝑘𝑘 −1 (16) 559 Note that it is different from the probability averaged across the cluster that was used to calculate CS 560 and CC in the previous subsection. Importantly, within the same cluster, some objects might have 561 562 probability (confidence) as high as 𝜋𝜋𝑖𝑖𝑘𝑘=0.9999... , or as low as 𝜋𝜋𝑖𝑖𝑘𝑘 = 1/𝐾𝐾+0.001, assuming that it is 563 lower for any other cluster 𝑖𝑖 ≠ 𝑘𝑘. We will call the nth object the core member of cluster k if 𝜋𝜋𝑖𝑖𝑘𝑘 > 0.5, which means that, for this object, the probability to be in cluster k is higher than probability to be in all 564 other clusters combined. The rest of the members of the cluster do not belong to the core; for them, the 565 probability to belong to the kth cluster is just higher than to be in any other given cluster. The number of 566 core members divided by the total number of objects in the cluster provides a useful measure that we 567 named proportion of the core cluster (PCC). As with the contrast criterion CC, one can use several 568 approaches to derive overall PCC from the PCCs for each cluster, i.e., take the average across all clusters, 569 weighted average based on the size of the clusters, or minimax by looking at the PCC in the worst medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 28 570 cluster. PCC provides a measure of overall confidence in the alleged cluster membership and of the 571 uniformity of the alleged clusters. Unlike contrast CC, PCC favors smaller size of the clusters and reaches 572 its maximum when K=N/2 and each cluster contains just a pair of objects. PCC reveals information 573 similar to the proportion of ambiguously clustered pairs (PAC) [47], for which 0.1 < 𝑀𝑀𝑖𝑖𝑛𝑛 < 0.9; 574 however, PCC is easier to interpret and does not include unjustified upper and lower boundaries 0.1 and 575 0.9. 576 Using CC and PCC to determine optimal number of clusters K 577 578 We used a combination of CC and PCC to determine the optimal number of clusters K. Note that eq. 14 contains undefined coefficient 𝜌𝜌. Therefore, we have two parameters 𝜌𝜌 and K to determine and 580 two criteria to meet. We determined 𝜌𝜌 and K as the values maximizing CC and corresponding to an 581 to 1.2 using the single-program multiple data sets (spmd) function of MATLAB Parallel Computing 582 583 Toolbox; K values were scanned from K=2 to K=12. The determined optimal values were 𝜌𝜌 =0.3 and K=5. 584 presented in Figure 3. As seen in Figure 3, contrast criterion CC and percent of core cluster members 585 PCC both have maxima for number of clusters K=5. As shown in Table 1, other quality of clustering 586 criteria also confirm K=5 as an optimal number of clusters for our cohort of women with LUTS. 579 587 elbow (point of diminishing returns) for PCC. A clustering procedure was run for 24 values of 𝜌𝜌 from 0.05 The resultant consensus matrix together with the values of CC and PCC for K scanned from 2 to 12 are medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 29 588 589 590 Figure 3. Determination of the optimal number of refined clusters. (A) Consensus matrix heat map demonstrates five clusters of participants (named W1-W5) grouped together based on the pairwise distances G nq (eq 14). (B) CC eq. 15 for K=2,…,12. (C) PCC eq. 16. Both CC and PCC have maxima at K=5. 591 592 Table 1. Other quality of clustering criteria, confirming K=5 as an optimal number of clusters. K=2 Calinski-Harabasz (↑) K=3 K=4 K=5 K=6 K=7 K=8 K=9 K=10 1382.26 346.13 676.29 1126.25 567.83 398.02 309.93 234.52 180.57 Davies-Bouldin (↓) 0.4328 0.8175 0.6045 0.4720 0.5692 0.9178 1.0809 1.1498 1.1678 Dunn (↑) 0.0167 0.0072 0.0290 0.0485 0.0496 0.0455 0.0273 0.0754 0.0590 Point-Biserial (↓) -4.801 Silhouette (↑) 0.6896 0.4565 0.5731 -2.728 -3.25 -3.849 -2.515 -2.081 -1.855 -1.621 -1.409 0.6857 0.5955 0.4821 0.3723 0.3143 NaN 593 594 Evaluation of the quality of the identified clusters 595 We used multistep procedure to evaluate the quality of identified clusters and compare with the 596 potential alternative clusters. The first step was examination of CC and PCC for each of K clusters (Figure 597 3). Next, we calculated other established quality of clustering criteria, including Calinski, Davies-Bouldin, 598 Dunn, Point-Biserial, Silhouette [22-26], etc. (Table 1). Then, we performed pairwise comparison of the 599 clusters using Wilcoxon rank sum tests and chi square tests, where appropriate, to determine which medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 30 600 variables used for clustering were significantly different in the pairwise comparison. All pairwise 601 comparisons were adjusted using an FDR correction for multi-testing [59]. 602 Visualization of the results 603 We used several tools to visualize the results of our analyses. Heat maps representing the 604 consensus matrices were generated using the clustergram MATLAB function. Properties of the identified 605 clusters were illustrated by radar plots, built-in using SAS statistical graphics panel (sgpanel) scatter, 606 polygon, and vector procedures. Comparison of cluster membership in the refined clusters versus 607 previously identified symptom-based clusters was performed using Sankey diagrams, built with the 608 googleVis package. 609 RESULTS AND DISCUSSION 610 Description of the clusters 611 Five distinct clusters of women with LUTS were identified by clustering 545 participants of the 612 LURN Observational Cohort Study using 185 variables. We call these clusters W1-W5, in order to 613 distinguish them from clusters F1-F4, described previously [41]. Demographic data for each cluster are 614 presented in Table 2. Some demographic characteristics were different across the clusters, including 615 age, ethnicity, menopausal status, obesity, prevalence of hysterectomy, percentage of participants with 616 at least one vaginal birth, education level, and employment status. No significant differences across the 617 clusters were observed for race, smoking, and alcohol use. 618 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 31 619 Table 2. Demographic Data for clusters W1-W5. N Age (median IQR) W1 W2 W3 W4 W5 77 64 144 95 165 53 (36-60) 66 (51-71) 60 (50-70) 51 (42-63) 59 (51-67) P-Value <0.0001 0.221 Race White 66 (86%) 60 (94%) 116 (81%) 80 (84%) 130 (79%) Black 8 (10%) 2 (3%) 22 (15%) 8 (8%) 26 (16%) Asian 3 (4%) 2 (3%) 3 (2%) 5 (5%) 3 (2%) Other 1 (1%) 2 (3%) 3 (2%) 3 (3%) 6 (4%) 0.016 Ethnicity 74 (96%) 57 (89%) 138 (96%) 84 (88%) 159 (96%) Hispanic or Latino 1 (1%) 5 (8%) 3 (2%) 10 (11%) 2 (1%) Unknown 2 (3%) 2 (3%) 3 (2%) 1 (1%) 4 (2%) Obese 23 (30%) 15 (23%) 72 (50%) 38 (40%) 99 (60%) <0.0001 Post-menopausal 35 (46%) 46 (73%) 102 (71%) 47 (49%) 117 (72%) <0.0001 Had a hysterectomy 17 (22%) 9 (14%) 49 (34%) 25 (26%) 64 (39%) 0.0013 At least one vaginal birth 35 (45%) 60 (94%) 91 (63%) 75 (79%) 128 (78%) <0.0001 Non-Hispanic/Latino Alcoholic drinks per week 0.0706 Never 10 (13%) 5(7%) 20 (14%) 15 (17%) 40 (25%) 0-3 drinks per week 55 (72%) 46 (73%) 89 (62%) 65 (68%) 100 (61%) 4-7 drinks per week 8 (11%) 9 (14%) 21 (15%) 13 (14%) 15 (9%) 0 (0%) 3 (4%) 8 (6%) 2 (2%) 5 (3%) More than 7 drinks per week 0.1746 Smoking status Never smoker 50 (67%) 43 (68%) 99 (69%) 64 (67%) 92 (56%) Former smoker 21 (28%) 19 (30%) 33 (23%) 28 (29%) 52 (32%) Current smoker 3 (4%) 1 (2%) 10 (7%) 3 (3%) 18 (11%) medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 32 0.0058 Education Less than Associate degree 40 (30%) 24 (30%) 89 (30%) 27 (28%) 76 (47%) Associates or Bachelor’s degree 32 (43%) 25 (40%) 58 (41%) 45 (47%) 61 (38%) Graduate degree 23 (31%) 19 (30%) 41 (29%) 23 (24%) 24 (15%) 0.0350 Employment Full-time 35 (47%) 27 (42%) 50 (35%) 44 (46%) 51 (31%) Part-time 13 (17%) 5 (8%) 20 (14%) 17 (18%) 21 (13%) Unemployed (looking or not looking for work) 27 (36%) 32 (50%) 73 (51%) 34 (36%) 90 (56%) 620 621 Importantly, all urinary symptoms and many other clinical variables were significantly different 622 across the clusters. Table 3 presents the comparison of urinary symptoms (collected with LUTS Tool and 623 AUA-SI), bladder diary variables, and other 37 significantly different clinical variables across clusters W1- 624 W5. For clarity, we describe these significantly different variables while discussing signatures of the 625 clusters in the text following Table 3. Table 2, and especially Table 3, illustrate the distinctiveness of the 626 identified five clusters of women with LUTS. 627 628 Table 3. Urinary symptoms, bladder diary variables, non-urinary symptoms, and clinical variables across clusters W1-W5. Cluster Cluster Cluster Cluster Cluster P-Value W1 W2 W3 W4 W5 77 64 144 95 165 Frequency 1.92 1.72 2.65 1.67 2.70 <0.001 Daytime frequency 1.67 1.44 2.09 1.37 2.13 <0.001 Nocturia 1.85 1.16 1.89 1.22 2.06 <0.001 N medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 33 Incomplete emptying 1.76 1.16 1.20 0.86 2.37 <0.001 Trickle/dribble 1.69 1.07 1.28 1.11 2.56 <0.001 Urgency 1.10 1.47 2.88 1.76 2.73 <0.001 Hesitancy 1.18 0.72 0.34 0.27 1.26 <0.001 Intermittency 1.12 0.73 0.32 0.28 1.28 <0.001 Straining 1.04 0.49 0.16 0.11 1.09 <0.001 Weak stream 0.84 0.79 0.43 0.22 1.52 <0.001 Spraying 0.87 0.60 0.46 0.45 1.55 <0.001 Urgency with fear of leakage 0.56 1.34 2.85 2.05 2.69 <0.001 Bladder pain 0.94 0.50 0.34 0.34 1.42 <0.001 Burning with urination 0.44 0.24 0.10 0.12 0.72 <0.001 Urinary incontinence(UI) 0.59 1.24 2.39 2.81 2.96 <0.001 Post-void UI 0.74 0.65 0.71 1.44 2.31 <0.001 Urgency UI 0.37 1.06 2.61 2.00 2.71 <0.001 Stress UI (laughter) 0.55 1.03 1.26 2.86 2.57 <0.001 Stress UI (exercise) 0.46 0.78 0.90 2.94 2.35 <0.001 UI with sleep 0.10 0.35 0.73 0.88 1.64 <0.001 UI with sex 0.16 0.33 0.12 0.75 0.80 <0.001 UI no reason 0.21 0.56 0.89 1.66 2.13 <0.001 Nocturia (AUASI) 2.37 1.77 2.18 1.57 2.37 <0.001 Frequency (AUA-SI) 3.15 2.25 2.96 2.26 3.13 <0.001 Intermittency (AUA-SI) 1.74 1.12 0.53 0.41 1.95 <0.001 Weak stream (AUA-SI) 1.25 1.12 0.60 0.36 2.09 <0.001 Straining (AUASI) 1.44 0.57 0.12 0.16 1.08 <0.001 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 34 Incomplete emptying (AUASI) 2.18 1.28 1.04 0.79 2.67 <0.001 Urgency (AUASI) 1.59 2.17 3.63 2.18 3.53 <0.001 QOL (AUA-SI) 3.68 3.57 4.46 4.66 4.88 <0.001 75.58 75.47 82.62 80.58 87.64 <0.001 Waist circumference (Cm) 94.74 97.13 100.59 95.58 106.35 <0.001 131.00 130.92 124.29 130.17 <0.001 75.04 41.29 29.09 39.94 0.012 Weight (Kg) Systolic blood pressure 122.42 Post-void residual volume 55.43 (mL) POP-Q: Ba result -2.65 1.24 -2.23 -1.78 -1.86 <0.001 POP-Q C result -6.85 -2.38 -6.52 -6.54 -5.85 <0.001 POP-Q D result -7.96 -4.43 -7.79 -7.42 -6.79 <0.001 Number of pregnancies 1.80 2.90 2.01 2.48 3.07 <0.001 Number of vaginal births 0.92 2.43 1.47 1.72 2.02 <0.001 Functional Comorbidity Index Total 1.66 1.74 2.48 1.24 3.64 <0.001 GUPI pain 5.09 4.15 3.70 2.23 6.84 <0.001 GUPI urine 4.71 3.40 3.67 2.49 5.67 <0.001 GUPI QOL 5.78 5.68 6.85 7.60 8.70 <0.001 POPDI-6 14.50 20.60 7.57 7.25 30.22 <0.001 Colorectal-anal distress inventory (CRADI-8) 12.98 18.03 14.76 10.99 33.50 <0.001 Urinary distress inventory 24.42 (UDI-6) 25.44 36.27 39.58 63.67 <0.001 Perceived stress scale 12.09 9.29 11.89 11.29 16.44 <0.001 PROMIS constipation Tscore 50.30 48.68 49.00 49.89 55.80 <0.001 PROMIS depression Tscore 48.91 44.63 48.67 46.61 53.93 <0.001 PROMIS anxiety T-score 49.84 46.78 48.61 48.11 54.75 <0.001 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 35 PROMIS sleep disturbance T-score 53.48 49.08 52.17 51.52 56.83 <0.001 PROMIS diarrhea T-score 45.70 46.37 48.43 45.17 53.98 <0.001 PROMIS physical functioning T-score 50.44 49.85 46.84 53.19 42.24 <0.001 Arthritis diagnosis 23 (30%) 29 (46%) 64 (45%) 20 (21%) 94 (58%) <0.001 Asthma diagnosis 12 (16%) 7 (11%) 29 (20%) 7 (7%) 49 (30%) <0.001 Chronic obstructive pulmonary disease (COPD) diagnosis 1 (1%) 5 (8%) 6 (4%) 1 (1%) 19 (12%) 0.001 Diabetes diagnosis 4 (5%) 8 (13%) 22 (15%) 5 (5%) 36 (22%) 0.004 Upper gastrointestinal disease diagnosis 19 (25%) 15 (24%) 34 (24%) 12 (13%) 71 (44%) <0.001 Depression diagnosis 17 (22%) 9 (14%) 53 (37%) 27 (28%) 79 (49%) <0.001 Anxiety or panic disorder diagnosis 13 (17%) 8 (13%) 35 (24%) 20 (21%) 62 (38%) <0.001 Degenerative disc disease 10 (13%) diagnosis 7 (11%) 29 (20%) 10 (11%) 57 (35%) <0.001 History of pelvic pain 15 (20%) 3 (5%) 7 (5%) 11 (12%) 37 (23%) <0.001 Sexual activity with the last month 41 (54%) 27 (43%) 54 (38%) 58 (61%) 56 (34%) <0.001 History of hypertension 20 (26%) 21 (33%) 61 (43%) 23 (24%) 80 (49%) <0.001 History of hyperlipidemia 14 (18%) 24 (38%) 44 (31%) 22 (23%) 67 (41%) 0.009 History of sleep apnea 9 (12%) 10 (16%) 21 (15%) 9 (9%) 45 (28%) 0.005 History of a psychiatric diagnosis 32 (42%) 11 (17%) 55 (38%) 41 (43%) 93 (57%) <0.001 Past surgical procedure for LUTS 4 (5%) 6 (10%) 23 (16%) 11 (12%) 36 (22%) 0.008 No abnormal vaginal 57 (75%) findings on physical exam 35 (56%) 101 (72%) 78 (82%) 124 (77%) 0.004 No abnormal uterus 50 (68%) findings on physical exam 45 (71%) 83 (60%) 69 (73%) 87 (54%) 0.009 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 36 No notation of tenderness on physical exam 44 (59%) 50 (79%) 108 (77%) 84 (88%) 131 (81%) <0.001 Average number of voids in 24 hours 7.6 7.5 8.4 7.1 8.6 0.0023 Average voided volume in 24 hours (mL) 1827.0 1717.5 1786.6 1786.6 1813.5 0.6728 Average number of intakes in 24 hours 6.5 6.4 6.3 6.3 6.4 0.7081 Average volume of intakes in 24 hours 1902.8 1682.9 1813.3 1813.3 1810.1 0.3028 Max voided volume 531.9 543.2 473.4 473.4 519.0 0.2134 629 630 Properties of the five clusters are visualized in Figure 4. Each column represents one of five 631 clusters. Radar plots in the first row illustrate urinary symptoms measured by LUTS Tool and AUA-SI; the 632 second row illustrates demographics, clinical measurements, and non-urinary PROs; the third row shows 633 categorical data on comorbidities and anomalies identified during the physical exam; the fourth row 634 shows intake and voiding pattern variables collected in bladder diaries. Radar plots represent mean 635 values of the raw variables across members of each of the clusters. None of the clusters could be 636 characterized by a single symptom, but rather by a combination of symptoms with various levels of 637 severity. Women in all five clusters reported higher than normal frequency of voiding (with the highest 638 frequency in W3 and W5). Women in all clusters except W1 reported urinary urgency and some level of 639 incontinence. 640 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 37 641 642 643 644 Figure 4. Radar plots illustrating mean values of urinary symptoms, demographics, clinical measurements, non-urinary PROs, comorbidities, and bladder diary variables for identified clusters of women with LUTS. Urinary symptoms are color-coded: green=frequency; blue=post-micturition; purple=urgency; dark blue=voiding; red=pain; orange=incontinence. 645 646 Women in cluster W1 (n=77) reported minimal urinary incontinence, but had mostly voiding and 647 post-micturition symptoms (post-void dribbling, trickling, straining, hesitancy, and incomplete bladder 648 emptying). They were younger than the average across the LURN female cohort, had a lower than 649 average weight, number of pregnancies, and vaginal births. They had less comorbidities and abnormal 650 findings in the physical exam. Women in cluster W2 (n=64) reported mild urinary symptoms, including 651 mild urinary incontinence. They presented clinically significant anterior vaginal wall prolapse (mean 652 POP-Q point B anterior [Ba]=1.24 cm, which is outside the introitus), apical prolapse (mean POP-Q point 653 C=-2.38 cm), and the most severe pelvic organ prolapse symptoms (with the highest Pelvic Organ 654 Prolapse Distress Inventory [POPDI-6] values of 20.60). They were on average older (66 vs. 53 years old), 655 and had a higher number of pregnancies (2.9 vs. 1.8) and vaginal births (1.47 vs. 0.92) than women in 656 cluster W1. They also had the highest post-void residual urine volume (75 mL) across the clusters. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 38 657 Women in cluster W3 (n=144) reported high urinary frequency, urinary urgency, and urgency urinary 658 incontinence. They had increased weight, had larger waist circumference, and higher functional 659 comorbidity index (FCI) than women in W1, W2, and W4. They most frequently reported “urgency with 660 fear of leakage” (2.85), but did not report any substantial post-voiding symptoms. Women in cluster W4 661 (n=95) reported multiple symptoms associated with stress urinary incontinence, as well as urgency 662 urinary incontinence, and some post-void urinary incontinence. They were younger (mean 51 years), 663 had less medical comorbidities (FCI =1.24), and had a higher level of physical functioning (PROMIS T- 664 score=53.2) than others in the cohort. Women in W5 (n=165) reported higher frequencies and severities 665 of LUTS for all symptoms. For 27 out of 30 urinary symptoms, they reported the highest levels across all 666 five clusters. These women were heavier (87.6 Kg), had the lowest level of physical functioning (PROMIS 667 T-score=42.2), had more medical comorbidities (FCI=3.64), and more pregnancies (3.07) than the rest of 668 the cohort. They also reported higher psychosocial difficulties in depression, anxiety, and perceived 669 stress, as well as sleep disturbance. 670 The presence of multiple significantly different variables across the clusters demonstrates that 671 clusters W1-W5 meet the concise definition of clustering given by Liao as: “The goal of clustering is to 672 identify structure in an unlabeled data set by objectively organizing data into homogeneous groups 673 where the within-group-object dissimilarity is minimized and the between-group-object dissimilarity is 674 maximized” [8]. Figure 5 visually illustrates results of pairwise comparisons of the clusters in a matrix 675 form. Figure 5A provides cluster comparisons by LUTS Tool variables. Elements on the diagonal of the 676 matrix present the level of severity for each LUTS Tool question, i.e., the severity urinary symptom 677 signature of the cluster. The triangle of boxes above the diagonal demonstrates variables significantly 678 different in the pairwise comparison of the clusters; each colored bar indicates a significantly different 679 variable. As seen, the majority of symptoms are significantly different in the pairwise comparison of the 680 clusters. Elements in the lower triangle of the matrix present the difference in symptom severity levels; medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 39 681 e.g., the first (upper) element in the triangle represents the difference between symptom severity levels 682 in cluster W2 and cluster W1, indicating that urgency symptoms are more severe in cluster W2, while 683 voiding and pain symptoms are more severe in cluster W1 than in cluster W2. Similarly, Figures 5B-5C 684 provide the results of pairwise comparison of the clusters for other variables from Tables 2-3, 685 demonstrating multiple significantly different non-urologic symptoms, physical examination, clinical and 686 demographic variables. In summary, clusters are distinct and significantly different, not only by their 687 urinary symptom signatures, but by multiple non-urologic variables as well. Importantly, these 688 significant differences are demonstrated by omnibus test (Tables 2-3) and by pairwise comparison of the 689 clusters (Figure 5). 690 691 Figure 5. Results of the pairwise comparison of clusters W1-W5. (A) LUTS Tool variables. (B) Demographic and clinical variables. (C) Physical examination and comorbidities data. 692 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 40 693 694 695 696 697 Differential protein abundance in serum of women with LUTS versus non-LUTS controls Figure 6 presents the volcano plots comparing abundances of 276 proteins in baseline serum samples of women with LUTS versus non-LUTS controls. Figure 6A compares the abundances for all 230 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 41 698 women with LUTS to 30 controls, while Figures 6B-6F provide similar comparisons for members of the 699 identified clusters W1-W5 for whom proteomics data was available (n1=37, n2=38, n3=53, n4=42, 700 n5=60). Supplemental Table S3 provides the lists of significantly differentially abundant proteins for each 701 of the comparisons. Multiple differentially abundant proteins are observed in the serum samples of 702 women with LUTS versus non-LUTS controls, both overall and between each cluster and controls. While 703 some of these have been shown [72] to be associated with LUTS (e.g., tumor necrosis factor [TNF], 704 interleukin-10 [IL-10], monocyte chemotactic protein [MCP], and transforming growth factor [TGF]), the 705 remainders are novel. The highest number of the differentially abundant proteins of 70 (29 after FDR 706 correction for multiple testing) is observed for cluster W5, which demonstrated the highest level of all 707 urinary symptoms and comorbidities. Note that overlap between the lists of differentially abundant 708 proteins is quite low, meaning that clusters W1-W5 are “biochemically” different. The highest overlap of 709 18 differentially abundant proteins is observed for cluster W5 and cluster W3, defined mainly by high 710 urinary frequency, urinary urgency, and urge urinary incontinence. Interestingly, the lowest number of 711 differentially abundant proteins (n=10) are observed in clusters W2 (characterized by the presence of 712 pelvic organ prolapse) and W4 (characterized by the presence of stress urinary incontinence), which are 713 presumably driven by anatomic abnormalities, rather than biochemical changes. Without going into the 714 detailed interpretation of these results, which are outside the scope of this paper, we think the 715 observed differences in the differentially abundant proteins across W1-W5 clusters serve as important 716 independent confirmation of the distinctiveness of the identified clusters. 717 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 42 718 719 720 Figure 6. Volcano plots demonstrating differentially abundant proteins in women with LUTS vs. controls for 230 participants representing: (A) the whole cohort; and (B-F) for each of identified clusters W1-W5. 721 722 723 Comparison of clusters W1-W5 with our previously published urinary symptom-based clusters F1-F4 724 Comparing quality of the clusters 725 Previously, we identified four clusters (F1-F4) by analyzing data on the same 545 women with 726 LUTS using only urinary symptoms data collected via the LUTS Tool and AUA-SI (total of 52 variables) 727 [41]. Since the same resampling procedure was performed when generating W1-W5 and F1-F4, both 728 cluster structures are equally robust to the random variations of the cohort composition. 729 Distinctiveness, as determined by pairwise comparisons, was higher for the refined clusters 730 compared with the previously published clusters (Tables 4 and 5). The proportion of significantly 731 different variables in pairwise comparison of the clusters ranged from 27% to 83% (mean 52%) for the 732 refined clusters, compared with a range of 27% to 58% (mean 43%) for the previous clusters. The 733 proportion of core clusters was also higher for W1-W5, compared with F1-F4; it ranged from 73% to 93% medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 43 734 (mean 87%) for the refined clusters, compared with a range of 40% to 83% (mean 62%) for the previous 735 clusters. Summarizing, there are more significantly different variables across refined clusters (W1-W5) 736 than across our previously published clusters (F1-F4), and the refined clusters contain a higher 737 percentage of core members for whom the probability to be in the given cluster is higher than the 738 probability to be in all other clusters combined. Therefore, the refined clusters identified in the current 739 paper by using additional urinary and non-urinary variables (total of 185 variables) are substantially 740 more distinct than our previously published clusters based only on urinary symptoms measured by the 741 LUTS Tool and AUA-SI (52 variables). 742 Table 4. Proportion of significantly different variables in clusters W1-W5 and F1-F4. W1 W2 W3 W4 W5 F1 F2 W1 27% 46% 58% 64% F1 34% W2 27% 39% 37% 83% F2 34% W3 47% 27% 34% 66% F3 47% 27% W4 58% 37% 34% 69% F4 58% 51% W5 64% 83% 66% 69% Mean 52.3% Mean F3 47% 27% F4 58% 51% 39% 39% 42.7% 743 744 745 Table 5. Proportion of core clusters in W1-W5 and F1-F4. W1 W2 W3 W4 W5 Mean PCC 88% 73% 89% 93% 92% 87% F1 F2 F3 F4 PCC 40% 74% 51% 83% Mean 62% 746 747 748 Comparing cluster membership The Sankey diagram in Figure 7 serves to compare cluster membership in W1-W5 and in F1-F4. 749 Refined cluster W1 is mostly composed of the members of cluster F1 without prolapse. Cluster W2 is 750 formed by the members of cluster F1 with pelvic organ prolapse and includes some members of other 751 clusters having prolapse and moderate urinary symptoms. Cluster W3 is mainly composed of members 752 of F2 and F3 with urgency urinary incontinence. Cluster W4 is predominantly formed by members of F3 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 44 753 with stress urinary incontinence. Cluster W5 includes nearly all members of F4 and approximately 30% 754 of F3 who have both urgency urinary incontinence and stress urinary incontinence symptoms. 755 Figure 7. Sankey diagram comparing cluster memberships in W1-W5 and F1-F4. 756 757 Comparing radar plots 758 Figure 8 provides comparison of radar plots for the urinary symptom signatures of refined 759 clusters W1-W5 and symptom-based clusters F1-F4. There are substantial similarities in the urinary 760 symptom signatures of our previously published clusters, and of the refined clusters. Radar plots for W5 761 and F4 are similar in presenting all urinary symptoms at a uniformly high level. Signatures of W4 and F3 762 are similar in presenting the combination of stress urinary incontinence, urgency, and voiding 763 dysfunction symptoms. W3 and F2 present urinary urgency, urgency urinary incontinence, and mild 764 voiding problems. Symptom signatures of W1 and F1 are similar, presenting mostly voiding and post- 765 micturition problems. The signature of cluster W2 presents mild LUTS and is mostly defined by clinically 766 significant pelvic organ prolapse. The observed similarity of the clusters’ LUTS signatures confirms that 767 additional variables did not result in radical changes, but rather in incremental changes that allowed for medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 45 768 identification of the refined clusters, which are built upon, but are more distinct and uniform than, our 769 previously published ones. We believe this is further evidence of the stability of the identified clusters. 770 Urinary symptom data captured by the LUTS Tool and AUA-SI provided the foundation for data-driven 771 subtyping of LUTS, while the remaining urinary and non-urinary variables allowed for identifying refined 772 clusters that differ not only by urinary symptoms, but by other PRO, demographic, and clinical variables 773 as well (Table 3, Figure 5, and Figures 6B-6C). Importantly, cluster refinement enhanced the 774 distinctiveness and uniformity of the clusters, as well as the confidence in cluster membership, by 775 increasing the overall proportion of core clusters from 62% to 87%. 776 Figure 8. Comparison of radar plots of the urinary symptom signatures for clusters W1-W5 and F1-F4. 777 778 779 780 Evolution of refined clusters W1-W5 in 3- and 12-month follow-up Figure 9 presents radar plots of the urinary symptom signatures of refined clusters W1-W5 at 3- 781 and 12-month follow-up. As seen, the shapes of the radar plots are conserved, while their areas 782 representing overall severity of LUTS are decreased due to improvement in urinary symptoms for some medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 46 783 participants. The percentage of improvers varied across the clusters (Table 6). Improvers were defined 784 as having a ½ SD or greater improvement between baseline and 12 months on the calculated LUTS Tool 785 Summary Score (including all 22 LUTS Tool severity variables). We view stability of the urinary symptom 786 radar plot signatures’ shapes as additional evidence of robustness of the identified clusters. 787 Figure 9. Evolution of the urinary symptom signatures in 3- and 12-month follow-up. 788 789 790 Table 6. Percentage of improvers in cluster W1-W5 in 3- and 12-month follow-up. Cluster N patients with 12-month follow-up N (%) Improvers at 12 months Cluster W1 50 23 (46%) Cluster W2 51 23 (45%) Cluster W3 106 57 (53%) Cluster W4 70 49 (70%) Cluster W5 119 69 (58%) medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 47 791 792 Clinical significance of the identified clusters The current paradigm for managing patients with LUTS is to assign a diagnosis based on a pre- 793 defined symptom complex, such as overactive bladder (OAB), or based on a single predominant 794 symptom, such as nocturia or stress urinary incontinence. Treatments are then administered based on 795 these diagnoses [73-74]. Conventional classification of LUTS includes such partially overlapping groups 796 as OAB wet, OAB dry, continent, stress urinary incontinence, urgency urinary incontinence, mixed 797 urinary incontinence, underactive bladder, and bladder outlet obstruction. As stated in [39-41], there 798 are limitations to this paradigm, as patients frequently present with multiple other urinary symptoms in 799 addition to those being treated, and these combinations of symptoms may be relevant to treatment 800 selection. Diagnosis and treatment based solely on patients’ chief complaints may be unsatisfactory, as 801 they disregard other presenting symptoms. Mechanistic studies reveal that a functional impairment to a 802 specific organ in the urinary tract may cause more than a single symptom. For example, a weak urethral 803 sphincter is associated with both stress and urgency urinary incontinence [58, 75]. This may explain why 804 mixed incontinence is so common. This raises the question of how current diagnostic paradigms 805 correspond with biological changes of the continence system and how symptoms occur in women 806 seeking treatment, which was the main rationale for the unbiased data-driven subtyping of women with 807 LUTS in LURN. As seen, none of the clusters W1-W5 identified in our analysis could be characterized by a 808 single symptom, but rather by a combination of symptoms with various levels of severity, which are in 809 concert with the clinical observations mentioned above [58, 75]. The more detailed comparison of our 810 refined clusters W1-W5 with the conventional LUTS groups and subtypes of LUTS identified by other 811 researchers [36-38,76] is provided in Supplemental Material text. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 48 812 Clustering methodologies reported in studies on subtyping LUTS and other common diseases and 813 disorders 814 Previous research on subtyping of diseases and disorders provides various levels of detail 815 regarding the clustering methodologies that were employed. For instance, Coyne et al [36] provided 816 detailed information on the 14 LUTS questions used for clustering and indicated that all the variables 817 were scaled from 0 to 1. They took care of the robustness of clusters to the variation in composition of 818 the cohort by performing clustering on the random 50% subset of participants first, and then by 819 extending it to the whole cohort. They used k-means algorithm for clustering and scanned the number 820 of clusters K from 3 to 7. They reported that the decision on the number of clusters was made by 821 evaluating “each cluster model based on the clinical relevance and distinctiveness of each cluster, as 822 well as the amount of variance accounted for by the cluster solution.” However, they did not provide the 823 names or values of criteria used for cluster evaluation. They provided detailed and informative 824 descriptive statistics, but for some reason, did not provide any tests on significantly different variables in 825 pairwise comparison of the identified clusters. 826 Similarly, Hall et al [37] provided detailed information on the 14 LUTS questions and scaling. K- 827 means clustering was performed with random 50% split of the cohort (split-half validation) similar to 828 [36]. Detailed descriptive statistics and omnibus tests are provided across identified clusters and 829 asymptomatic controls for significant difference in the variables not used for clustering, including 830 demographics, comorbidities, risk factors, and lifestyle factors. However, no information on significant 831 differences in the 14 LUTS variables used for clustering, and no information on pairwise comparison of 832 the clusters is provided. Summarizing, these two LUTS clustering papers provide a reasonable level of 833 details on scaling of the variables and on the clustering procedure, but unfortunately do not provide 834 enough information on evaluation of the quality of the identified clusters. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 49 835 In contrast, Miller et al [38] provided all necessary information on pairwise comparison of the 836 identified clusters, both for six variables used for clustering, and for eight other variables collected but 837 not used for clustering. Unfortunately, the authors did not perform any scaling of variables used for 838 clustering. As stated in the paper, “the six clustering variables were (a) number of voids during daytime 839 hours, (b) number of voids during nighttime hours, (c) daytime modal output volume in milliliters, (d) 840 total 24-hour output volume in milliliters, (e) total 24-hour beverage intake in milliliters, and (f) BMI (a 841 variable that was speculated to be related to intake).” If the volume of related variables c, d, e were not 842 scaled but entered into clustering in milliliters (values can be as high as 500), then these variables will 843 provide the domineering contribution (compared with the number of daytime voids, typically < 20) to 844 the 6-dimensional Euclidean distance between clustered objects, and will serve as drivers determining 845 cluster membership. The contribution of these variables to the Euclidean distance would be much lower 846 if they were entered in liters instead of milliliters, which would change the cluster membership. This is 847 the problem indicated by Hair et al [62], with unscaled, unstandardized data of the inconsistency 848 between cluster solutions when the scale of some variables is changed, which is a strong argument in 849 favor of standardization. To our mind, the best solution of the problem is the use of the variables scaled 850 by comparison with controls, as we described in the Methods section. Proper scaling is especially 851 important when using heterogeneous data combining dimensionless and dimensional variables 852 measured in different units, as in [38], where frequencies, volumes (mL), and BMI units (kg/m2) are 853 combined. 854 A broader look outside the LUTS domain shows that previous data on subtyping other common 855 complex diseases provide different level of details on the clustering procedure and cluster evaluation as 856 well. Table 7 below summarizes methodological information reported in the clustering papers [31-33] 857 subtyping patients with asthma, diabetes, and sepsis, as well as in the LUTS papers [36-38] discussed medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 50 858 above. More details on methodological information reported in [31-33] are provided in Supplemental 859 Material text. 860 Table 7. Methodological information provided in the clustering papers. Paper Variables Preprocessing Clustering Number of algorithm clusters determination Moore 34 variables No info on Agglomerative Dendrogram et al derived from scaling of 17 hierarchical demonstrates 5-6 [31] initial 726. Only continuous clustering, groups. Five list of 34 variables variables; 17 Wards linkage. clusters selected provided. composite due to small size variables of the sixth ranked 0-10. group. No other criteria. Ahlqvist et al [32] List of six variables used for clustering is provided. Five variables standardized as z-scores. Presence of glutamic acid decarboxylase antibodies (GADA) binary variable. Variables standardized as z-scores. Patients with GADA grouped into separate cluster. Kmeans with resampling for patients without GADA. Coyne et List and al [36] description of 14 EPIC LUTS questions used for clustering is provided. Variables scaled from 0 to 1. Hall et al List and [37] description of 14 BACH LUTS questions used for clustering is provided. Variables scaled from 0 to 1. k-means clustering with split-half randomization. Values of k scanned from 3 to 8. Hierarchical clustering and k-means clustering, split-half validation. Seymour List of 29 et al variables used for [33] clustering is provided. Resamplingbased consensus kmeans clustering. Cluster evaluation Omnibus tests (analysis of variance [ANOVA], Kruskal-Wallis, chisquare) used on demographic, clinical, medication use, health care utilization, and biomarker variables. No pairwise comparisons of clusters. Schwarz’s Box plots comparing 5 Bayesian continuous variables criterion to used for clustering. determine Pairwise comparisons number of of clusters for multiple clusters k=4. variables not used for clustering. Cluster validation in 3 independent cohorts. Consensus matrix Pairwise comparison of heat map. Area variables clusters for under the CDF variables used and not curve [34]. used for clustering. Validation in an independent cohort. No names or Descriptive statistics on values for criteria variables both used provided. and not used for clustering, but no tests for significance. Pseudo F and t2 statistics were used to determine number of clusters. Omnibus tests (ANOVA, chi-square) on variables used and not used for clustering. No pairwise comparisons of clusters. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 51 Miller et al [38] List of six variables used for clustering is provided. No explicit information on scaling. Agglomerative hierarchical clustering, Wards linkage. However, values of statistics are not provided. Dendrogram demonstrates 3-4 groups. Three groups selected based on visual examination of separation on canonical variables plane. No other criteria provided. Pairwise comparison of clusters on six variables used for clustering and 8 variables not used for clustering. ANOVA for continuous and chisquare test for categorical data. 861 862 Thoughts on minimal requirements for clustering publications 863 As shown above, publications using clustering for disease subtyping provide different levels of 864 details on data preprocessing, clustering procedure, and cluster evaluation. In particular, information on 865 scaling and weighting of the variables, values of criteria used for selection of the number of clusters, 866 pairwise comparison of the clusters, and level of confidence in cluster membership are often not 867 provided. This missing or hard-to-find information is important for better understanding of the papers’ 868 results, for comparison of the proposed phenotypes with previous and future classifications, and for 869 potential refinement. With this in mind, and guided by the principles of FAIR data (findability, 870 accessibility, interoperability, and reusability) [77-78], we think it is time for the clustering community to 871 develop minimum requirements for clustering reports (MICRo), similar to minimum information about a 872 proteomics experiment (MIAPE), developed by the proteomics community [79], and minimum 873 information about a microarray experiment (MIAME), developed by the transcriptomics community 874 [80]. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 52 875 Exact guidelines for the minimal requirements for clustering publications should result from the 876 clustering community discussion. Here, we would like to call for such discussion and propose the below 877 items that we believe are important for future guidelines: 878 1. Complete list of variables used for clustering. 879 2. Explicit information on scaling and weighting of clustering variables. 880 3. Clustering algorithm used (name of the function with options and parameter values, or code). 881 4. Exact definition of criteria used to determine the number of clusters. Values of criteria for the 882 selected and alternative number of clusters. Preferably, more than one criterion should be 883 presented. 884 5. Results of pairwise comparison of the clusters, with the indication of clinically meaningful and 885 significantly different variables in the pairwise comparison – for all variables used for clustering, 886 and for selected important variables not used for clustering (e.g., demographics). 887 888 889 6. Information on the level of confidence in cluster membership. Limitations of the current study Our paper carries some limitations. First, there are limitations in terms of the cohort, which 890 included only treatment-seeking and predominantly white participants. Our analysis only contains 891 women. Preliminary data analysis confirmed that sex is the major determinant of LUTS subtypes; 892 therefore, sex-specific clustering was performed. We previously published the results of urinary 893 symptom-based clustering of male participants [64]. Cluster refinement of male clusters, along the same 894 lines as described in the current paper, is underway; the resulting refined subtypes will be compared 895 with those found in the female cohort. Our control group used for scaling of the variables was relatively 896 small (55 participants) but commensurate with the size of identified clusters. Not all data elements medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 53 897 collected for the cases were available for the controls, so we used some literature data for general 898 population and bladder diary data from a different study (EPI). 899 Second, there are limitations in terms of data elements used for clustering. Genomics data were 900 not used so far. We do not expect that genomics data will produce dramatic effect on clustering results 901 since LUTS is a highly prevalent common disease, especially in older age. Nevertheless, genotyping of 902 the LURN participants is underway, which would allow for future cluster refinement by including the 903 binary single nucleotide polymorphism (SNP) data using our novel weighted Tanimoto indices approach. 904 Proteomics data are available for approximately 40% of the cohort and were not yet used for cluster 905 refinement. However, they were used to demonstrate the presence of multiple significantly different 906 proteins, indicating the refined symptom-based clusters are biochemically different. 907 Third, we developed methodology and a pipeline for integrating heterogeneous continuous and 908 categorical data for clustering women with LUTS. We cannot claim that this is the preferable 909 methodology for other data sets since data and research questions are different in different studies. 910 However, we explicitly described our preprocessing and clustering procedures, as well as criteria used 911 for determination of the number of clusters, cluster evaluation, and confidence level in cluster 912 membership. We compared our methodology with alternative approaches and demonstrated that our 913 methodology allows for combining heterogeneous continuous, categorical, and binary data, and that our 914 refined clusters are more distinct than the previous urinary symptom-based clusters. Detailed 915 description of the methodology, and comparison with the alternative approaches, allows interested 916 readers to decide if it is suitable for their data and research questions. Availability of the pipeline source 917 code allows for modifications, if needed. 918 919 Fourth, and most importantly, clinical significance of the identified clusters has yet to be determined. We already demonstrated the distinctness of our clusters, but now we need to establish medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 54 920 their usefulness in clinical practice. This should be done through clinical trials, where treatments and 921 outcomes of patients classified into the identified clusters would be compared with the standard 922 treatment without the knowledge of cluster membership. Our preliminary analysis of multiple 923 differentially abundant serum proteins in women with LUTS corroborated that the identified clusters are 924 biochemically different. Further analysis of the affected biochemical pathways, and their longitudinal 925 dynamics as related to symptom trajectories, will follow with the goal to enhance understanding of 926 different etiologies of the identified subtypes and potentially establish more effective subtype-specific 927 treatments. Further analysis will also include determination of the minimal set of variables sufficient for 928 classification of patients with LUTS into identified subtypes in clinical practice. 929 CONCLUSION 930 A novel clustering pipeline for subtyping of common complex diseases and syndromes using 931 heterogeneous continuous and categorical data was developed. The advantages of scaling variables by 932 comparison with the controls without the disease of interest were discussed and illustrated by the 933 simulated example. The novel weighted Tanimoto indices approach to integrate multiple binary 934 variables into the clustering procedure was developed. A cluster refinement procedure using data 935 available only for the subset of participants through semi-supervised clustering was proposed. A novel 936 contrast criterion (CC) for resampling-based consensus clustering was proposed and compared with 937 existing criteria for consensus clustering, i.e., consensus score (SC) and proportion of ambiguously 938 clustered pairs (PAC). A simulated example demonstrated the advantages of CC over SC and PAC. 939 Information provided in the literature on subtyping common complex diseases and disorders 940 was reviewed and shown to be often incomplete, especially with regard to data preprocessing, 941 clustering procedures, and cluster evaluation. Suggestions for the minimum requirements for clustering 942 publications were formulated, and the community effort to work on creating such requirements 943 following the principles of FAIR data was called for. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 55 944 Five distinct clusters of women with LUTS were identified by using 185 variables, including 945 demographics, physical exam, LUTS and non-LUTS questionnaires, and bladder diary variables. The 946 quality of the clusters was evaluated using established criteria (Calinski-Harabasz, Davies-Bouldin, Dunn, 947 Point-Biserial, and Silhouette [22-26]), as well as novel contrast criterion (CC) and percentage of core 948 members of the clusters (PCC). Distinctiveness of the clusters was confirmed by multiple significantly 949 different variables in pairwise comparison of the clusters. Refined clusters W1-W5 were compared with 950 our previously published urinary symptom-based clusters F1-F4, and were shown to be more distinct by 951 having a higher percentage of significantly different variables and a higher percentage of the core 952 members of the clusters. Importantly, targeted proteomics data confirmed that our refined clusters 953 based on clinical data are biochemically different. Identification of the clinically and biochemically 954 distinct subtypes of LUTS has provided a foundation for studies of subtype-specific etiologies and 955 treatments. 956 957 medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 56 958 959 960 961 962 963 ACKNOWLEDGMENTS Heather Van Doren, Senior Medical Editor with Arbor Research Collaborative for Health, provided editorial assistance on this manuscript. This is publication number 28 of the Symptoms of Lower Urinary Tract Dysfunction Research Network (LURN). This study is supported by the National Institute of Diabetes & Digestive & Kidney Diseases 964 through cooperative agreements (grants DK097780, DK097772, DK097779, DK099932, DK100011, 965 DK100017, DK099879) and for Dr. Andreev’s Biomarker Ancillary LURN R01 (grant 5R01DK125251). 966 Research reported in this publication was supported at Northwestern University, in part, by the 967 National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number 968 UL1TR001422. The content is solely the responsibility of the authors and does not necessarily represent 969 the official views of the National Institutes of Health. 970 The following individuals were instrumental in the planning and conduct of this study at each of 971 the participating institutions: 972 Duke University, Durham, North Carolina (DK097780): PIs: Cindy Amundsen, MD, Eric Jelovsek, MD; Co- 973 Is: Kathryn Flynn, PhD, Todd Harshbarger, PhD, Jim Hokanson, PhD, Aaron Lentz, MD, Michelle O’Shea, 974 MD, David Page, PhD, Nazema Siddiqui, MD, Kevin Weinfurt, PhD Lisa Wruck, PhD; Study Coordinators: 975 Yasmeen Bruton, Paige Green, Folayan Morehead 976 University of Iowa, Iowa City, IA (DK097772): PIs: Catherine S Bradley, MD, Karl Kreder, MD, MBA, MSCE; 977 Co-Is: Bradley A. Erickson, MD, MS, Daniel Fick, MD, Vince Magnotta, PhD, Philip Polgreen, MD, MPH; 978 Study Coordinators: Mary Eno, Sarah Heady, Chelsea Poesch medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 57 979 Northwestern University, Chicago, IL (DK097779): PIs: James W Griffith, PhD, Kimberly Kenton, MD, MS, 980 Brian Helfand, MD, PhD; Co-Is: Carol Bretschneider, MD, David Cella, PhD, Sarah Collins, MD, Julia 981 Geynisman-Tan, MD, Alex Glaser, MD, Christina Lewicky-Gaupp, MD, Margaret Mueller, MD; Study 982 Coordinators: Sylwia Clarke, Melissa Marquez, Pooja Sharma, Michelle Taddeo, Pooja Talaty. Dr. Helfand 983 and Ms. Talaty are at NorthShore University HealthSystem. 984 University of Michigan Health System, Ann Arbor, MI (DK099932): PI: J Quentin Clemens, MD, FACS, 985 MSCI; Co-Is: John DeLancey, MD, Dee Fenner, MD, Rick Harris, MD, Steve Harte, PhD, Anne P. Cameron, 986 MD, Aruna Sarma, PhD, Giulia Lane, MD; Study Coordinators: Ashly Chimner, Linda Drnek, Emma Keer, 987 Marissa Moore, Greg Mowatt, Sarah Richardson 988 University of Washington, Seattle Washington (DK100011): PI: Claire Yang, MD; Co-I: Anna Kirby, MD; 989 Study Coordinators: Lois Meryman, Brenda Vicars, RN 990 Washington University in St. Louis, St. Louis Missouri (DK100017): PI: H. Henry Lai, MD; Co-Is: Gerald L. 991 Andriole, MD, Joshua Shimony, MD, PhD; Study Coordinators: Linda Black, Vivien Gardner, Patricia 992 Hayden, Diana Wolff, Aleksandra Klim, RN, MHS, CCRC 993 Arbor Research Collaborative for Health, Data Coordinating Center (DK099879): PI: Robert Merion, MD, 994 FACS; Co-Is: Victor Andreev, PhD, DSc, Brenda Gillespie, PhD, Abigail Smith, PhD; Project Manager: 995 Melissa Fava, MPA, PMP; Clinical Monitor: Melissa Sexton, BA, CCRP; Research Analysts: Margaret 996 Helmuth, MA, Jon Wiseman, MS, Jane Liu, MPH; Project Associate: Levi Hurley 997 National Institute of Diabetes and Digestive and Kidney Diseases, Division of Kidney, Urology, and 998 Hematology, Bethesda, MD: Project Scientist: Ziya Kirkali MD; Project Officer: Christopher Mullins PhD; 999 Project Advisor: Julie Barthold, MD. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 58 1000 REFERENCES 1001 1002 1. Schadt EE, Lum PY. Reverse engineering gene networks to identify key drivers of complex disease phenotypes. J Lipid Res. 2006;47:2601-2613. 1003 1004 2. Becker KG. The common variants/multiple disease hypothesis of common complex genetic disorders. Medical Hypothesis. 2004;62:309-317. 1005 1006 3. Relton CL, Davey Smith G. Epigenetic epidemiology of common complex disease: prospects for prediction, prevention, and treatment. PLoS Med. 2010;7(10):e1000356. 1007 1008 1009 4. Coyne KS, Sexton CC, Thompson CL, Milsom I, Irwin D, Kopp ZS, et al. The prevalence of lower urinary tract symptoms (LUTS) in the USA, the UK and Sweden: results from the Epidemiology of LUTS (EpiLUTS) study. BJU Int. 2009;104(3):352-360. 1010 1011 1012 5. Irwin DE, Milsom I, Hunskaar S, Reilly K, Kopp Z, Herschorn S, et al. Population-based survey of urinary incontinence, overactive bladder, and other lower urinary tract symptoms in five countries: results of the EPIC study. Eur Urol. 2006;50(6):1306-1314; discussion 1314-1305. 1013 1014 6. Litman HJ, McKinlay JB. The future magnitude of urological symptoms in the USA: projections using the Boston Area Community Health survey. BJU Int. 2007;100(4):820-825. 1015 1016 1017 7. Coyne KS, Wein A, Nicholson S, Kvasz M, Chen CI, Milsom I. Economic burden of urgency urinary incontinence in the United States: a systematic review. J Manag Care Pharm. 2014;20(2):130140. 1018 8. Liao TW. Clustering of time series data -a survey. Pattern Recognit. 2005;38:1857-1874. 1019 9. Duda RO, Hart PE, Stork DG. Pattern classification. 2nd Ed. New York: Wiley; 2001. 1020 1021 1022 10. Robotti E, Manfredi M, Marengo E. Biomarkers discovery through multivariate statistical methods: A review of recently developed methods and applications in proteomics. J Proteomics Bioinform. 2014;S3:003. 1023 1024 11. Dudoit S, Fridlyand J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology. 2002;3(7):1-21. 1025 1026 12. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning, statistics. New York: Springer; 2001. 1027 13. Wang L. Heterogeneous data and big data analytics. Autom Cont Inf Sci. 2017;3(1):8-15. 1028 14. Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs, NJ: Prentice Hall; 1988. 1029 1030 1031 15. Giancarlo R, Scaturro D, Utro F. ValWorkBench: An open source Java library for cluster validation, with applications to microarray data analysis. Comp Meth Prog Biomed. 2015;118:207-217. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 59 1032 1033 16. Hartigan JA, Wong MA. Algorithm AS 136: A K-means clustering algorithm. J Royal Stat Soc Series C (Applied Statistics). 1970;28(1):100-108. 1034 1035 17. Day WHE, Edelsbrunner H. Efficient algorithms for agglomerative hierarchical clustering methods. J Classif. 1984;1:7-24. 1036 18. Kohonen T. Self-organizing maps. Information sciences. Berlin: Springer; 1997. 1037 1038 19. Lum PY, Singh G, Lehman A, Ishkanov T, Vejdemo-Johansson M, Alagappan M, et al. Extracting insights from the shape of complex data using topology. Sci Rep. 2013;3:1236. 1039 1040 1041 20. Bae E, Bailey J. COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. International Conference on Data Mining 2006. Los Alamitos, CA, USA. IEEE Computer Society: 53-62. 1042 1043 21. Ramoni M, Sebastiani P, Cohen P. Multivariate clustering by dynamics. Proceedings of the 2000 National Conference on Artificial Intelligence (AAAI-2000). San Francisco, CA: 633-638. 1044 22. Calinski T, Harabasz J. A dendrite method for cluster analysis. Comm Statistics. 1974;3:1-27. 1045 1046 23. Davies DL, Bouldin DW. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell. 1979;1:224-227. 1047 1048 24. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J Royal Stat Society Series B. 2001;63:411-423. 1049 1050 25. Rouseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Computational Applied Mathematics. 1987;20:53-65. 1051 1052 26. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66(336):846-850. 1053 1054 1055 27. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci. 1999;96:6745-6750. 1056 1057 1058 28. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield C, Lander E. Molecular classification of cancer: Class discovery and class prediction by gene expression. Science. 1999;286(5439):531-537. 1059 1060 1061 29. Hayes DN, Monti S, Parmigiani G, Gilks CB, Naoki K, Bhattacharjee A, et al. Gene expression profiling reveals reproducible human lung adenocarcinoma subtypes in multiple independent patient cohorts. J Clin Oncol. 2006;24:5079-5090. 1062 1063 1064 30. Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, Montgomery K, et al. Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci. USA 2004;101:811816. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 60 1065 1066 1067 31. Moore WC, Meyers DA, Wenzel SE, Teague WG, Li H, Li X, et al. Identification of asthma phenotypes using cluster analysis in the severe asthma research program. Am J Respir Crit Care Med. 2010;181:315-323. 1068 1069 1070 32. Ahlqvist E, Storm P, Käräjämäki A, Martinell M, Dorkhan M, Carlsson A, et al. Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol. 2018;6:361-369. 1071 1072 1073 33. Seymour CW, Kennedy JN, Wang S, Chang CCH, Elliott CF, Xu Z, et al. Derivation, validation, and potential treatment implications of novel clinical phenotypes for sepsis. JAMA. 2019;321(20):2003-2017. 1074 1075 1076 34. Monti S, Tamayo P, Meserov J, Golub T. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 2003;52:91118. 1077 1078 1079 35. Locke Jr K, Lai HH, Pontari MA, Clemens JQ, Kreder KJ, Krieger JN, et al. Discovery, validation, and novel visualization of subgroups in urologic chronic pelvic pain syndrome (UCPPS): Consensus clustering findings from the MAPP Research Network. J Urol. 2020;203(4S):e104. 1080 1081 36. Coyne KS, Matza LS, Kopp ZS, Thompson C, Henry D, Irwin DE, et al. Examining lower urinary tract symptom constellations using cluster analysis. BJU Int. 2008;101(10):1267-1273. 1082 1083 1084 37. Hall SA, Cinar A, Link CL, Kopp ZS, Roehrborn CG, Kaplan SA, et al. Do urological symptoms cluster among women? Results from the Boston Area Community Health Survey. BJU Int. 2008;101(10):1257-1266. 1085 1086 1087 38. Miller JM, Guo Y, Rodseth SB. Diary data subjected to cluster analysis of intake/output/void habits with resulting clusters compared by continence status, age, race. Nurs Res. 2011;60(2):115-123. 1088 1089 39. Yang CC, Weinfurt KP, Merion RM, Kirkali Z, Lurn Study Group. Symptoms of Lower Urinary Tract Dysfunction Research Network. J Urol. 2016;196(1):146-152. 1090 1091 1092 40. Cameron AP, Lewicky-Gaupp C, Smith AR, Helfand BT, Gore JL, Clemens JQ, et al. Baseline lower urinary tract symptoms in patients enrolled in LURN: a prospective, observational cohort study. J Urol. 2018;199(4):1023-1031. 1093 1094 1095 41. Andreev VP, Gang L, Yang CC, Smith AR, Helmuth ME, Wiseman JB, et al. Symptom-based clustering of women in the Symptoms of Lower Urinary Tract Dysfunction Research Network (LURN) observational cohort study. J Urol. 2018;200(6):1323-1331. 1096 1097 1098 42. Coyne KS, Sexton CC, Kopp Z, Chapple CR, Kaplan SA, Aiyer LP, et al. Assessing patients’ descriptions of lower urinary tract symptoms (LUTS) and perspectives on treatment outcomes: results of qualitative research. Int J Clin Pract. 2010;64(9):1260-1278. 1099 1100 1101 43. Coyne KS, Barsdorf AI, Thompson C, Ireland A, Milsom I, Chapple C, et al. Moving towards a comprehensive assessment of lower urinary tract symptoms (LUTS). Neurourol Urodyn. 2012;31(4):448-454. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 61 1102 1103 1104 44. Barry M, Fowler F Jr, O’Leary M, Bruskewitz RC, Holtgrewe HL, Mebust WK, et al. The American Urological Association symptom index for benign prostatic hyperplasia. The Measurement Committee of the American Urological Association. J Urol. 1992;148(5):1549-1557. 1105 1106 1107 45. Cheng J, Leng M, Li L, Zhou H, Chen X. Active semi-supervised community detection based on must-link and cannot-link constraints. PLoS ONE. 2014;9(10): e110088. doi:10.1371/journal.pone.0110088. 1108 1109 1110 46. Li Z, Liu J, Tang X. Pairwise constraint propagation by semidefinite programming for semisupervised classification. Proceedings of the 25th International Conference on Machine Learning; Helsinki, Finland: 2008. 1111 1112 47. Șenbabaoğlu Y, Michailidis G, Li JZ. Critical limitations of consensus clustering in class discovery. Sci Rep. 2014;4:6207. 1113 1114 1115 48. Ku JH, Jeong IG, Lim DJ, Byun S, Paick JS, Oh SJ. Voiding diary for the evaluation of urinary incontinence and lower urinary tract symptoms : Prospective assessment of patient compliance and burden. Neurourol Urodyn. 2004;23(4):331-335. 1116 1117 1118 49. Spiegel BM, Hays RD, Bolus R, Melmed GY, Chang L, Whitman C, et al. Development of the NIH Patient-Reported Outcomes Measurement Information System (PROMIS) gastrointestinal symptom scales. Am J Gastroenterol. 2014;109(11):1804-1814. 1119 1120 1121 50. Pilkonis PA, Choi SW, Reise SP, Stover AM, Riley WT, Cella D, et al. Item banks for measuring emotional distress from the Patient-Reported Outcomes Measurement Information System (PROMIS®): depression, anxiety, and anger. Assessment. 2011;18(3):263-283. 1122 1123 51. Cohen S, Kamarck T, Mermelstein R. A global measure of perceived stress. J Health Soc Behav. 1983;24(4):385-396. 1124 1125 1126 52. Yu L, Buysse DJ, Germain A, Moul DE, Stover A, Dodds NE, et al. Development of short forms from the PROMISTM sleep disturbance and sleep-related impairment item banks. Behav Sleep Med. 2011;10(1):6-24. 1127 1128 1129 53. Clemens JQ, Calhoun EA, Litwin MS, McNaughton-Collins M, Kusek JW, Crowley EM, et al. Validation of a modified National Institutes of Health chronic prostatitis symptom index to assess genitourinary pain in both men and women. Urology. 2009;74(5):983-987. 1130 1131 1132 54. Barber MD, Chen Z, Lukacz E, Markland A, Wai C, Brubaker L, et al. Further validation of the short form versions of the Pelvic Floor Distress Inventory (PFDI) and Pelvic Floor Impact Questionnaire (PFIQ). Neurourol Urodyn. 2011;30(4):541-546. 1133 1134 1135 55. Bump RC, Mattiasson A, Bø K, Brubaker LP, DeLancey JO, Klarskov P, Shull BL, Smith AR. The standardization of terminology of female pelvic organ prolapse and pelvic floor dysfunction. Am J Obstet Gynecol. 1996;175(1):10-17. 1136 1137 56. Groll DL, To T, Bombardier C, Wright JG. The development of a comorbidity index with physical function as the outcome. J Clin Epidemiol. 2005;58(6):595-602. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 62 1138 1139 1140 57. Cameron AP, Wiseman JB, Smith AR, Merion RM, Gillespie BW, Bradley CS, et al. Are three-day voiding diaries feasible and reliable? Results from the Symptoms of Lower Urinary Tract Dysfunction Network (LURN) cohort. Neurourol Urodyn. 2019;38(8):2185-2193. 1141 1142 1143 58. DeLancey JO, Fenner DE, Guire K, Patel DA, Howard D, Miller JM. Differences in continence system between community-dwelling black and white women with and without urinary incontinence in the EPI study. Am J Obstet Gynecol. 2010;202(6):584.e1-584.e12. 1144 1145 59. Benjamini Y, Hochberg Y. Controlling the false discovery rate – a practical and powerful approach to multiple testing. J Royal Statistical Society, Ser B. 1995;57(1):289-300. 1146 1147 1148 60. Raghunathan TE, Lepkowski JM, Van Hoewyk J, Solenberger P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodol. 2001;27(1):85-95. 1149 1150 1151 61. Raghunathan TE, Solenberger PW, Berglund P, Van Hoewyk J. IVEware: Imputation and variance estimation software. Ann Arbor: University of Michigan, Institute for Social Research, Survey Research Center. 2000. 1152 1153 62. Hair JR, Anderson RE, Tatham RL, Black WC. Multivariate data analysis. Prentice- Hall Inc: Upper Saddle River, NJ; 1998. 1154 1155 1156 63. Andreev VP, Gillespie BW, Helfand BT, Merion RM. Misclassification errors in unsupervised classification methods. Comparison based on the simulation of targeted proteomics data. J Proteomics Bioinform. 2016;S14:005. doi:10.4172/jpb.S14-005. 1157 1158 1159 64. Liu G, Andreev VP, Helmuth ME, Yang CC, Lai HH, Smith AR, et al. Symptom-based clustering of men in the Symptoms of Lower Urinary Tract Dysfunction Research Network (LURN) observational cohort study. J Urol. 2019; 202(6):1230-1239. 1160 1161 65. Huang Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery. 1998;2(3):283-304. 1162 1163 66. Ng MK, Li MJ, Huang JZ, He Z. On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2007;29(3):503-507. 1164 1165 1166 67. Szepannek G, Aschenbruck R. k-prototypes clustering for mixed variable-type data. CRAN Repository 2021. Available at: https://cran.rproject.org/web/packages/clustMixType/clustMixType.pdf. Accessed 7/16/21. 1167 1168 1169 1170 68. SAS clustering action set: Clustering with the k-prototypes algorithm. SAS visual statistics programming guide. Available at: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/casactstat/casactstat_clustering_ex amples06.htm. Accessed 7/16/21. 1171 1172 69. Rogers DJ, Tanimoto TT. A computer program for classifying plants. Science. 1960;132(3434):1115-1118. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 63 1173 1174 1175 1176 70. Bilenko M, Basu S, Mooney RJ. Integrating constraints and metric learning in semi-supervised clustering. Proceedings of the 21st International Conference on Machine Learning (ICML). Banff, Canada. July 2004. Available at: https://www.cs.utexas.edu/~ml/papers/semi-icml-04.pdf. Accessed 8/6/21. 1177 1178 71. Huang H, Cheng Y, Zhao R. A semi-supervised clustering algorithm based on must-link set. In C Tang et al (Eds). ADMA 2008;LNAI 5139:492-499. 1179 1180 1181 72. Siddiqui NY, Helfand BT, Andreev VP, Kowalski JT, Bradley MS, Lai HH, et al. Biomarkers implicated in lower urinary tract symptoms: systematic review and pathway analyses. J Urol. 2019;202(5):880-889. 1182 1183 73. Syan R, Brucker BM. Guideline of guidelines: urinary incontinence. BJU International. 2016;117(1):20-33. 1184 1185 74. AUA (American Urological Association) Guidelines. Available at: https://www.auanet.org/guidelines. Accessed 7/22/21. 1186 1187 1188 75. DeLancey JO, Trowbridge ER, Miller JM, Morgan DM, Guire K, Fenner DE, et al. Stress urinary incontinence: relative importance of urethral support and urethral closure pressure. J Urol. 2008;179(6);2286-2290. 1189 1190 1191 76. Rosen RC, Coyne KS, Henry D, Link CL, Cinar A, Aiyer LP, et al. Beyond the cluster: methodological and clinical implications in the Boston Area Community Health survey and EPIC studies. BJU Int. 2008;101(10):1274-1278. 1192 1193 77. Go FAIR. FAIR (Findable, Accessible, Interoperable, Reusable) principals for scientific data. Available at: https://www.go-fair.org/fair-principles/. Accessed 8/18/21. 1194 1195 1196 78. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data. 2016;3:160018. 1197 1198 79. Taylor CF, Paton NW, Lilley KS, Binz PA, Julian RK, Jr., Jones AR, et al. The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol. 2007;25(8):887-893. 1199 1200 1201 80. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, et al. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat Genet. 2001;29(4):365-371. 1202 1203 1204 81. Utomo E, Blok BFM, Steensma AB, Korfage IJ. Validation of the pelvic floor distress inventory (PFDI-20) and pelvic floor impact questionnaire (PFIQ-7) in a Dutch population. Int Urogynecol J. 2014;25(4):531-544. 1205 1206 82. Cohen S. Perceived stress scale. Mind Garden 1994. Available at: http://www.mindgarden.com/documents/PerceivedStressScale.pdf. Accessed 8/6/21. 1207 1208 83. PROMIS scoring manuals. HealthMeasures. Available at: http://www.healthmeasures.net/promis-scoring-manuals. Accessed 8/6/21. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 64 1209 SUPPORTING INFORMATION 1210 See separate Supplemental Materials document. 1211 SUPPLEMENTAL MATERIAL TEXT (Section Headings) 1212 Simple example illustrating the benefits of scaling by controls. 1213 Contrast criterion (CC): special cases. 1214 Contrast criterion and proportion of ambiguous clustering: simulation example. 1215 Details on the targeted proteomics study of serum samples of women with LUTS versus controls. 1216 Comparison of the identified clusters W1-W5 with the conventional classification, subtypes, and clusters 1217 of women with LUTS identified in literature. 1218 Details on the methodological information provided in the clustering papers [31-33]. 1219 SUPPLEMENTAL TABLES (Captions) 1220 Supplemental Table S1. Overview of the variables used for clustering of 545 women with lower urinary 1221 tract symptoms (LUTS) in comparison with the same variables for non-LUTS controls. 1222 Supplemental Table S2. Means, standard deviations, pairwise distances, and misclassification errors 1223 resulting from the use of three scaling approaches (unscaled – U, scaled as cohort’s z-scores – C, and 1224 scaled by healthy controls – N). 1225 Supplemental Table S3. Lists of differentially abundant proteins observed in the serum samples of 230 1226 women with LUTS versus controls. Proteins with false discovery rate (FDR) adjusted p-value<0.05 are 1227 bolded. 1228 Supplemental Table S3A. All 230 women with LUTS. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 65 1229 Supplemental Table S3B. W1 cluster (n=37). 1230 Supplemental Table S3C. W2 cluster (n=38). 1231 Supplemental Table S3D. W3 cluster (n=53). 1232 Supplemental Table S3E. W4 cluster (n=42). 1233 Supplemental Table S3F. W5 cluster (n=60). 1234 SUPPLEMENTAL FIGURES (Captions) 1235 Supplemental Figure S1. Simulated example 1, illustrating distributions of two variables describing 1236 patients in two disease subtypes and healthy controls. Upper row – X1, lower row X2. Left column 1237 unscaled, center z-scored, right scaled by controls. 1238 Supplemental Figure S2. Simulated example A. Non-overlapping transcript signatures. 40,000 1239 transcripts. Five groups of people. Means, correlation matrix, and single instance of simulated 1240 distributions of transcript abundances. 1241 Supplemental Figure S3. Simulated example A. Consensus matrices, K=2,3…8. CC vs. PAC. Effect size=2. 1242 Supplemental Figure S4. Simulated example A. Consensus matrices, K=2,3…8. CC vs. PAC. Effect 1243 size=0.6. 1244 Supplemental Figure S5. Simulated example A. Misclassification error vs. effect size. Comparison of 1245 consensus clustering using contrast criterion with k-means and hierarchical clustering using Calinski- 1246 Harabasz criterion. Three values of correlation coefficient for transcript abundances within transcript 1247 signatures. medRxiv preprint doi: https://doi.org/10.1101/2021.09.17.21263124; this version posted September 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. 66 1248 Supplemental Figure S6. Simulated example A1. Same conditions as in simulated examples A, except 1249 transcript abundances being correlated following 𝑹𝑹𝒊𝒊𝒊𝒊 = 𝒓𝒓| 𝒊𝒊−𝒊𝒊| . 1250 Supplemental Figure S7. Simulated example B. Overlapping transcript signatures. Consensus matrices, 1251 K=2,3…8. CC vs. PAC. Effect size=2. 1252 Supplemental Figure S8. Simulated example B. Overlapping transcript signatures. Consensus matrices, 1253 K=2,3…8. CC vs. PAC. Effect size=0.8. 1254 Supplemental Figure S9. Simulated example B. Misclassification error vs. effect size. Comparison of 1255 consensus clustering using contrast criterion with k-means and hierarchical clustering using Calinski- 1256 Harabasz criterion. Three values of correlation coefficient for transcript abundances within transcript 1257 signatures. 1258 Supplemental Figure S10. Simulated example B1. Same conditions as in simulated examples B, except 1259 transcript abundances being correlated following 𝑹𝑹𝒊𝒊𝒊𝒊 = 𝒓𝒓| 𝒊𝒊−𝒊𝒊| . 1260