Regularized linear and gradient boosted ensemble methods to predict athletes' gender based on a survey of masters athletes

Joe Walsh, Ian Timothy Heazlewood, Mike Climstein

Research output: Contribution to journal > Article > Research > peer-review

Abstract

The aim of this research was to investigate statistical learning techniques to predict gender based upon psychological constructs (measuring motivations to participate in masters sports). Motivations of Marathoners Scales (MOMS) sports psychological data for 3,928 masters athletes (2,010 males) from the World Masters Games (the largest sporting event in the world by participant numbers) were investigated using the regularized linear modelling methodologies ridge regression, the lasso and the elastic net. Results were compared with previously published research that used logistic regression, discriminant function analysis, radial basis functions, multilayer perceptrons and a selection of boosted decision tree based models. It was hypothesized that the regularized linear models would perform better than other models except the boosted decision trees, but that ensembles of the regularized linear models and gradient boosted machines would result in improved accuracy over any prior models in the literature. Implementing modern regression methods with regularization improved classification accuracy based on gender compared to non-regularized linear models, but not compared to boosted trees such as a gradient boosted machine (GBM). Models that were solely or partially based on L2 regularization (including a penalty term to reduce the sum of squares of the parameters) performed better than those that relied solely or primarily on L1 regularization (including a penalty term to reduce the sum of the absolute values of the parameters) or subset selection. This finding had implications for analysis of MOMS data in general with respect to using the 56 questions in MOMS as opposed to the underlying nine constructs for analysis in order to compensate for multicollinearity. Ensemble methods stacking ridge regression and GBMs with out-of-sample prediction further improved accuracy, giving an accuracy score of 0.7236, higher than any obtained in the preceding literature. This demonstrates the potential benefits of such an ensemble approach in terms of developing models with improved accuracy, as well as increasing the likelihood of developing practical applications from predictions using MOMS psychometric data.
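
The L1 and L2 penalties and the out-of-sample stacking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the elastic net weighting follows the glmnet convention (an assumption, since the abstract does not name the software used), and `fit_prior` and `fit_centroid` are hypothetical toy base learners standing in for ridge regression and a GBM. The point of the sketch is only that each row's base-model score comes from a model that never saw that row.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], float]                    # maps one feature row to a score
Fitter = Callable[[List[Vector], List[int]], Model]  # trains a model on rows and labels


def l1_penalty(coefs: Vector) -> float:
    """Lasso-style penalty: sum of the absolute values of the parameters."""
    return sum(abs(b) for b in coefs)


def l2_penalty(coefs: Vector) -> float:
    """Ridge-style penalty: sum of squares of the parameters."""
    return sum(b * b for b in coefs)


def elastic_net_penalty(coefs: Vector, lam: float, alpha: float) -> float:
    """Blend of both penalties: alpha=0 is pure ridge, alpha=1 is pure lasso.

    The (1 - alpha) / 2 weighting is the glmnet convention (assumed here).
    """
    return lam * (alpha * l1_penalty(coefs) + (1 - alpha) / 2 * l2_penalty(coefs))


def out_of_fold_scores(fit: Fitter, X: List[Vector], y: List[int], k: int) -> List[float]:
    """Score every row with a model trained on the other k-1 folds only."""
    n = len(X)
    scores = [0.0] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            scores[i] = model(X[i])
    return scores


def stack_features(fitters: List[Fitter], X: List[Vector], y: List[int], k: int) -> List[List[float]]:
    """Meta-learner inputs: one out-of-fold score column per base model."""
    columns = [out_of_fold_scores(f, X, y, k) for f in fitters]
    return [list(row) for row in zip(*columns)]


def fit_prior(X: List[Vector], y: List[int]) -> Model:
    """Hypothetical toy learner: always predicts the training class rate."""
    rate = sum(y) / len(y)
    return lambda row: rate


def fit_centroid(X: List[Vector], y: List[int]) -> Model:
    """Hypothetical toy learner: nearest class mean on the first feature."""
    pos = [row[0] for row, label in zip(X, y) if label == 1]
    neg = [row[0] for row, label in zip(X, y) if label == 0]
    m1, m0 = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda row: 1.0 if abs(row[0] - m1) < abs(row[0] - m0) else 0.0


# Made-up demonstration data, not MOMS survey data.
X_DEMO = [[0.1], [0.9], [0.2], [0.8], [0.3], [0.7]]
Y_DEMO = [0, 1, 0, 1, 0, 1]

if __name__ == "__main__":
    print(elastic_net_penalty([1.0, -2.0, 3.0], lam=2.0, alpha=0.5))
    print(stack_features([fit_prior, fit_centroid], X_DEMO, Y_DEMO, k=3))
```

In a real analysis the rows returned by `stack_features` would be passed, together with the labels, to a second-stage model; training that model on out-of-fold scores rather than in-sample scores is what keeps the stacked ensemble from simply rewarding the base learner that overfits most.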

Original language: English
Pages (from-to): 47-64
Number of pages: 18
Journal: Model Assisted Statistics and Applications
Volume: 14
Issue number: 1
DOI: 10.3233/MAS-180454
Publication status: Published - 24 Jan 2019

Fingerprint

Ensemble Methods
Gradient Method
Predict
Regularization
Ridge Regression
Decision tree
Penalty
Linear Model
Ensemble
Elastic Net
Gradient
Multicollinearity
Decision trees
Sports
Subset Selection
Statistical Learning
Lasso
Model
Psychometrics
Parameter Selection

Cite this

@article{fda2d87bd3ff43e8b0dc0d1a85d984d4,
title = "Regularized linear and gradient boosted ensemble methods to predict athletes' gender based on a survey of masters athletes",
keywords = "ensemble modelling, linear modelling, masters sport, Sport psychology",
author = "Joe Walsh and Heazlewood, {Ian Timothy} and Mike Climstein",
year = "2019",
month = "1",
day = "24",
doi = "10.3233/MAS-180454",
language = "English",
volume = "14",
pages = "47--64",
journal = "Model Assisted Statistics and Applications: an international journal",
issn = "1574-1699",
publisher = "IOS Press",
number = "1",

}

Regularized linear and gradient boosted ensemble methods to predict athletes' gender based on a survey of masters athletes. / Walsh, Joe; Heazlewood, Ian Timothy; Climstein, Mike.

In: Model Assisted Statistics and Applications, Vol. 14, No. 1, 24.01.2019, p. 47-64.

TY - JOUR

T1 - Regularized linear and gradient boosted ensemble methods to predict athletes' gender based on a survey of masters athletes

AU - Walsh, Joe

AU - Heazlewood, Ian Timothy

AU - Climstein, Mike

PY - 2019/1/24

Y1 - 2019/1/24

KW - ensemble modelling

KW - linear modelling

KW - masters sport

KW - Sport psychology

UR - http://www.scopus.com/inward/record.url?scp=85065664676&partnerID=8YFLogxK

U2 - 10.3233/MAS-180454

DO - 10.3233/MAS-180454

M3 - Article

VL - 14

SP - 47

EP - 64

JO - Model Assisted Statistics and Applications: an international journal

JF - Model Assisted Statistics and Applications: an international journal

SN - 1574-1699

IS - 1

ER -