Cost-weighted tf-idf: A novel approach for measuring highway project similarity based on pay items' cost composition and term frequency

Do, Q; Moriyani, M A; Le, C and Le, T (2023) Cost-weighted tf-idf: A novel approach for measuring highway project similarity based on pay items' cost composition and term frequency. Journal of Construction Engineering and Management, 149(8), ISSN 0733-9364

Abstract

State highway agencies (SHAs) often need to cluster or bundle projects according to the similarity of their scope for various construction management tasks, including historical data-driven time and cost estimating and project bundling. Conventionally, SHAs categorize similar projects into work types based on subjective judgment about the similarity between major pay items. A few quantitative methods for determining project similarity are found in the literature, but most rely on a single source of information, either the cost contribution of pay items or the keywords of pay item descriptions. This paper presents the first attempt to integrate multiple information sources for project similarity measurement. It proposes a novel cost-weighted term frequency-inverse document frequency (CW-TF-IDF) method that incorporates the cost information of pay items into the traditional TF-IDF word embedding method to measure project similarity. The effectiveness of the proposed method in supporting project clustering and bundling was tested using historical bid data collected from an SHA. The findings showed that the CW-TF-IDF method significantly improves project clustering performance compared to the most recent state-of-the-art method. The CW-TF-IDF method also performed well in project bundling, yielding a cosine similarity of over 0.9 for most of the bundled projects in the testing data. The proposed method is expected to help SHAs accurately identify similar projects and eventually improve their project management effectiveness.
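
The abstract does not give the CW-TF-IDF formula, so the following is only a minimal Python sketch of one way the idea could be realized, not the authors' implementation. It assumes that each term's frequency is replaced by the summed cost share of the pay items whose descriptions contain it, and that a smoothed IDF is applied before projects are compared by cosine similarity. The function names, the whitespace tokenizer, the IDF smoothing, and the sample pay items are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def cost_weighted_tf(pay_items):
    """Map each term to the summed cost share of the pay items that mention it.

    pay_items: list of (description, cost) tuples for one project.
    Assumption: cost share replaces the raw term count of plain TF-IDF.
    """
    total = sum(cost for _, cost in pay_items)
    tf = defaultdict(float)
    for description, cost in pay_items:
        share = cost / total if total else 0.0
        for term in description.lower().split():  # naive whitespace tokenizer
            tf[term] += share
    return dict(tf)


def cw_tfidf_vectors(projects):
    """Return one sparse vector (dict of term -> weight) per project."""
    tfs = [cost_weighted_tf(p) for p in projects]
    n = len(projects)
    # Document frequency: number of projects in which each term appears.
    df = Counter(term for tf in tfs for term in tf)
    # Smoothed IDF (an assumption; the paper may use a different variant).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: w * idf[t] for t, w in tf.items()} for tf in tfs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical projects: each is a list of (pay item description, cost).
projects = [
    [("asphalt pavement", 900_000), ("traffic signs", 100_000)],
    [("asphalt pavement", 800_000), ("guardrail", 200_000)],
    [("bridge deck concrete", 950_000), ("traffic signs", 50_000)],
]
vectors = cw_tfidf_vectors(projects)
print(cosine(vectors[0], vectors[1]))  # paving-dominated pair
print(cosine(vectors[0], vectors[2]))  # paving vs. bridge project
```

Under these assumptions, the two paving-dominated projects score much closer to each other than either does to the bridge project, even though the first and third share the low-cost "traffic signs" item. That is the intuition behind weighting terms by cost rather than by count alone.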

Item Type: Article
Uncontrolled Keywords: natural language processing; project bundling; project clustering; project similarity; term frequency-inverse document frequency
Date Deposited: 11 Apr 2025 19:49
Last Modified: 11 Apr 2025 19:49