Coding and Machine-Learning Strategies for Disaggregated Racial/Ethnic Data
Published: 12/15/2020

Machine learning is a subset of artificial intelligence and computer science that uses algorithms and statistical models to analyze and draw inferences from patterns in data. This workshop covers the main types of machine learning methods, including supervised and unsupervised learning; examples and use cases for survey research; and some of the limitations.

Presenter:
Scott Comulada, PhD, Director for UCLA Semel Institute Center for Community Health, and Associate Professor, UCLA Department of Psychiatry and Biobehavioral Sciences

About the National Network of Health Surveys' Advancing Health Equity Through Data Disaggregation Workshop Series
Disaggregated race/ethnicity data is needed to expose gaps in health equity and to inform policies and programs that close those gaps. The National Network of Health Surveys, part of the UCLA Center for Health Policy Research, offers a series of workshops designed to improve the disaggregation of race and ethnicity measures in health data sources. Our goal is to boost the number of subpopulation categories made available to key constituencies working to improve health equity. This is especially important for representing communities that are often “hidden” in large health data sets.


Machine Learning Methods:

Contextual Example of Machine-Learning Methods (1) (5:44)
Contextual Example of Machine-Learning Methods (2) (7:30)

  • Examples of how disaggregated data can help studies identify interventions for those who really need them

Contextual Example of Machine-Learning Methods (3) (9:50)

  • Includes discussion of key statistical considerations when employing machine-learning methods

Contextual Example of Machine-Learning Methods (4) (12:41)
Contextual Example of Machine-Learning Methods (5) (14:15)

  • P-values and machine learning

Contextual Example of Machine-Learning Methods (6) (16:06)

  • Splitting data – training data vs. test data
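
The split between training and test data can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the presenter's code; the function name `train_test_split`, the 25% test fraction, and the `records` list are assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows are held out; the rest are for fitting.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical survey records: (respondent_id, outcome)
records = [(i, i % 2) for i in range(20)]
train, test = train_test_split(records, test_fraction=0.25, seed=42)
```

The model is fit on `train` only, and `test` is set aside until the end so that the reported error reflects data the model has not seen.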

Contextual Example of Machine-Learning Methods (7) (18:02)

  • Regression models: Lasso regression solution
    • Introduces a penalty parameter that shrinks regression coefficients toward zero, setting to exactly zero those that do not adequately reduce error
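
To make the effect of the penalty concrete, here is a small sketch of lasso regression fit by coordinate descent with soft-thresholding. It is one standard way to solve the lasso, not necessarily the approach used in the talk. It assumes the predictors and the outcome are already centered (so no intercept is fit), and the data, the function names, and the penalty value of 0.5 are all made up for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return exactly 0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimize (1/2n) * sum((y - Xb)^2) + penalty * sum(|b|), no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of feature j with the residual that leaves feature j out
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale if scale > 0 else 0.0
    return beta

# Hypothetical centered data: y depends strongly on column 0, barely on column 1
X = [[-2, 1], [-1, -1], [0, 0], [1, -1], [2, 1]]
y = [-6.1, -2.9, 0.0, 3.1, 5.9]

unpenalized = lasso_coordinate_descent(X, y, penalty=0.0)  # ordinary least squares
penalized = lasso_coordinate_descent(X, y, penalty=0.5)
```

With no penalty both coefficients are nonzero. With the penalty, the weak coefficient on the second column is set to exactly zero and the strong coefficient is shrunk, which is how the lasso performs variable selection.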

Key Points of Machine Learning (20:39)

  • Visualization of machine learning framework (22:35)

Definitions (25:36)

  • Machine learning: methods to automate model building (includes lasso, random forests, neural networks)
  • Also includes definitions of AI, computer vision, deep learning, and Natural Language Processing

Two Main Types of Machine Learning (27:35)

  • Supervised Learning (focus of this presentation)
  • Unsupervised Learning

Use Case #1 – Write-In Survey Responses (28:25)

  • Manual Coding vs. Automated Coding: Manual coding can be subjective, time-consuming, and error-prone; automated coding can address these shortcomings.
    • What can Machine Learning offer? (32:53)
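
As a rough illustration of automated coding as supervised learning, the sketch below trains a small naive Bayes text classifier on write-in responses that have already been coded by hand, then applies it to new responses. The talk does not specify this algorithm; it is used here only because it is simple. The training responses, the category labels, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the response and split it into words."""
    return text.lower().replace("-", " ").replace("/", " ").split()

def train_naive_bayes(labeled_responses):
    """Count words per category from manually coded (text, category) pairs."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    vocabulary = set()
    for text, category in labeled_responses:
        category_counts[category] += 1
        for token in tokenize(text):
            word_counts[category][token] += 1
            vocabulary.add(token)
    return word_counts, category_counts, vocabulary

def predict(model, text):
    """Return the most probable category, or None if no word is recognized."""
    word_counts, category_counts, vocabulary = model
    known = [t for t in tokenize(text) if t in vocabulary]
    if not known:
        return None  # nothing recognizable: leave for manual review
    total = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(category_counts):
        score = math.log(category_counts[category] / total)
        denom = sum(word_counts[category].values()) + len(vocabulary)
        for token in known:
            # Add-one smoothing keeps unseen word/category pairs from zeroing out
            score += math.log((word_counts[category][token] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical hand-coded training examples: (write-in text, assigned code)
coded = [
    ("hmong", "Hmong"), ("hmong american", "Hmong"),
    ("samoan", "Samoan"), ("american samoan", "Samoan"),
    ("korean", "Korean"), ("korean american", "Korean"),
]
model = train_naive_bayes(coded)
first = predict(model, "Hmong-American")
second = predict(model, "Samoan")
unknown = predict(model, "qwerty")
```

Returning `None` for unrecognized text is a deliberate choice in this sketch: responses the model cannot place are routed back to a human coder rather than forced into a category.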

Use Case #2 (39:51)

  • Discusses inverse probability sample weights, Generalized Boosted Modeling (GBM), and imbalances in the data
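
The weighting idea can be sketched as follows. This is not the presenter's analysis: no GBM is fit here, and the probabilities in `p_hat` are hard-coded stand-ins for what a fitted model such as a GBM would estimate. The covariate `x`, the `group` indicator, and all function names are hypothetical.

```python
def inverse_probability_weights(in_group, probabilities):
    """Weight each unit by 1 / P(the group it was actually observed in)."""
    weights = []
    for member, p in zip(in_group, probabilities):
        if not 0 < p < 1:
            raise ValueError("probabilities must lie strictly between 0 and 1")
        weights.append(1 / p if member else 1 / (1 - p))
    return weights

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical data: x is a binary covariate, group is 1 or 0
x = [1, 1, 1, 1, 0, 0, 0, 0]
group = [1, 1, 1, 0, 1, 0, 0, 0]
# Stand-in for model output (for example from a GBM): estimated P(group = 1 | x)
p_hat = [0.75 if xi == 1 else 0.25 for xi in x]
w = inverse_probability_weights(group, p_hat)

def group_mean(target, weights=None):
    """Mean of x within one group, optionally weighted."""
    idx = [i for i, g in enumerate(group) if g == target]
    ws = [weights[i] if weights else 1.0 for i in idx]
    return weighted_mean([x[i] for i in idx], ws)

before = (group_mean(1), group_mean(0))       # imbalanced: 0.75 vs. 0.25
after = (group_mean(1, w), group_mean(0, w))  # balanced after weighting
```

Before weighting, the covariate is distributed very differently in the two groups. After weighting, both groups have the same weighted mean of `x`, which is the kind of imbalance check these weights are meant to pass.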

Programming Languages and Software (43:21)

  • Understanding what you need out of software, and suggestions for next steps; includes discussion of Python, R, Stata, and Julia
    • Options for simplifying ML approaches (46:55)
    • Data management (48:19)
    • Computing options (50:37)

Caveats (52:25)

  • Limitations of traditional statistical methods can carry over to ML, with Google as an example (52:30)
  • Model complexity can make problems difficult to diagnose (55:20)
  • Missing data (55:52)
  • AI gone bad (1:00:00)