Leveraging unstructured construction injury reports to predict safety outcomes and model safety risk using natural language processing, machine learning, and probability theory

Tixier, A J-P (2015) Leveraging unstructured construction injury reports to predict safety outcomes and model safety risk using natural language processing, machine learning, and probability theory. Unpublished PhD thesis, University of Colorado at Boulder, USA.

Abstract

Construction is one of the most dangerous industries in the United States and around the globe. Despite abundant research motivated by the very high socio-economic costs of accidents, safety performance has plateaued and injuries still occur at an unacceptable, disproportionate rate. Paradoxically, huge databases of valuable textual injury reports are left mostly unused, both because no conceptual framework exists to readily extract usable knowledge from them and because manual content analysis is very expensive. Not only do these vast amounts of candid narratives represent a wealth of valuable lessons to be learned, but they could also transform the way safety is approached in construction. Rather than being addressed mostly through the analysis of subjective, aggregated, or secondary data, through expert opinion, and from a strictly regulatory and managerial perspective, construction safety could become an empirical, data-driven science in which objective, quantitative techniques such as Machine Learning and statistical modeling play a determining role. As a proof of concept, we (1) developed a Natural Language Processing tool to automatically extract fundamental attributes and outcomes from unstructured textual injury reports and remove the need for manual content analysis; (2) explored the interplay and detected clashes between attributes using unsupervised clustering and network analysis techniques; (3) applied supervised Machine Learning algorithms to capture the mapping between attribute and outcome data and predict various safety outcomes; and (4) proposed a new way to model and simulate construction safety risk using probability theory tools such as Kernel Density Estimates and Copulas. At every level, results are promising and show that, by following this pipeline, it is possible to better understand and predict injuries from raw textual data alone.
We hope this research shows that adopting a data-driven approach could lead to better-informed, safer decision-making, and improve safety performance in construction.
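
As an illustration of step (1) only, the following minimal Python sketch shows one way attributes might be pulled from a free-text injury report with keyword rules and turned into a binary vector for later clustering or supervised learning. It is not the tool developed in the thesis: the attribute names, the regular expressions, and the function names `extract_attributes` and `to_binary_vector` are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical lexicon mapping attribute names to regex patterns.
# These entries are illustrative assumptions, not the thesis's actual rules.
ATTRIBUTE_PATTERNS = {
    "hammer": [r"\bhammers?\b"],
    "ladder": [r"\bladders?\b"],
    "slippery surface": [r"\bslippery\b", r"\bwet (?:floor|surface)\b"],
    "working at height": [r"\bscaffold(?:s|ing)?\b", r"\broof\b"],
}


def extract_attributes(report):
    """Return the set of attribute names whose patterns occur in the report."""
    text = report.lower()
    return {
        name
        for name, patterns in ATTRIBUTE_PATTERNS.items()
        if any(re.search(pattern, text) for pattern in patterns)
    }


def to_binary_vector(report):
    """Encode a report as 0/1 flags, one per attribute, in alphabetical order."""
    found = extract_attributes(report)
    return [1 if name in found else 0 for name in sorted(ATTRIBUTE_PATTERNS)]


if __name__ == "__main__":
    example = "Worker slipped on a wet floor while carrying a ladder."
    print(sorted(extract_attributes(example)))
    print(to_binary_vector(example))
```

Vectors of this kind, paired with an outcome label taken from each report, are the sort of attribute and outcome data that steps (2) and (3) of the abstract would operate on. A real system would need a far larger, validated lexicon and handling of negation and context.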

Item Type: Thesis (Doctoral)
Thesis advisor: Hallowell, M R
Uncontrolled Keywords: construction safety; injury; pipeline; learning; safety; United States; content analysis; network analysis; probability; machine learning
Date Deposited: 16 Apr 2025 19:32
Last Modified: 16 Apr 2025 19:32