Regularized linear and gradient boosted ensemble methods to predict athletes' gender based on a survey of masters athletes

Joe Walsh, Ian Timothy Heazlewood, Mike Climstein

    Research output: Contribution to journal › Article › peer-review


    The aim of this research was to investigate statistical learning techniques to predict gender based upon psychological constructs (measuring motivations to participate in masters sports). Motivations of Marathoners Scales (MOMS) sports psychological data for 3,928 masters athletes (2,010 males) from the World Masters Games (the largest sporting event in the world by participant numbers) were investigated using the regularized linear modelling methodologies ridge regression, the lasso and the elastic net. Comparison was made with previously published research utilizing logistic regression, discriminant function analysis, radial basis functions, multilayer perceptrons and a selection of boosted decision tree based models. It was hypothesized that the regularized linear models would perform better than the other models except the boosted decision trees, but that ensembles of the regularized linear models and gradient boosted machines would result in improved accuracy over any prior models in the literature. Implementing modern regression methods with regularization provided improvements in classification accuracy based on gender compared to non-regularized linear models, but not compared to boosted trees such as a gradient boosted machine (GBM). Models that were solely or partially based on L2 regularization (including a penalty term to reduce the sum of squares of the parameters) performed better than those that relied solely or primarily on L1 regularization (including a penalty term to reduce the sum of the absolute values of the parameters) or subset selection. This finding had implications for analysis of MOMS data in general with respect to using the 56 questions in MOMS as opposed to the underlying nine constructs for analysis in order to compensate for multicollinearity. Ensemble methods stacking ridge regression and GBMs with out-of-sample prediction further improved accuracy, giving a higher accuracy score (0.7236) than obtained in any preceding literature. This demonstrates the potential benefits of such an ensemble approach in terms of developing models with improved accuracy, as well as increasing the likelihood of developing practical applications from predictions using MOMS psychometric data.
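
The L2 and L1 penalties contrasted in the abstract can be written out as follows. This is a sketch in assumed notation, since the abstract does not give the paper's own formulation: the symbols β, 𝓛, λ and α are introduced here for illustration, and the elastic net is shown in one common parameterization among several.

```latex
% Assumed notation (not taken from the paper):
%   \beta = (\beta_1, \dots, \beta_p)  model coefficients
%   \mathcal{L}(\beta)                 unpenalized loss of the linear model
%   \lambda \ge 0                      penalty strength
%   \alpha \in [0, 1]                  mix between the L1 and L2 penalties
\begin{align}
\text{ridge (L2):} \quad
  & \hat{\beta} = \arg\min_{\beta} \; \mathcal{L}(\beta)
    + \lambda \sum_{j=1}^{p} \beta_j^{2} \\
\text{lasso (L1):} \quad
  & \hat{\beta} = \arg\min_{\beta} \; \mathcal{L}(\beta)
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \\
\text{elastic net:} \quad
  & \hat{\beta} = \arg\min_{\beta} \; \mathcal{L}(\beta)
    + \lambda \Bigl( \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
    + (1 - \alpha) \sum_{j=1}^{p} \beta_j^{2} \Bigr)
\end{align}
```

In this parameterization, α = 1 recovers the lasso and α = 0 recovers ridge regression. The L1 term can shrink some coefficients exactly to zero, whereas the L2 term shrinks all coefficients without eliminating any. That difference is one plausible reading of why the L2-based models are discussed alongside multicollinearity among the 56 MOMS items.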

    Original language: English
    Pages (from-to): 47-64
    Number of pages: 18
    Journal: Model Assisted Statistics and Applications
    Issue number: 1
    Publication status: Published - 24 Jan 2019

