Learning Traffic Crashes as Language: Datasets, Benchmarks, and What-if Causal Analyses (2024)

Zhiwen Fan¹†, Pu Wang²,³†, Yang Zhao³, Yibo Zhao³, Boris Ivanovic⁴,
Zhangyang Wang¹, Marco Pavone⁴,⁵, Hao Frank Yang³*
† Equal contribution. * Corresponding author (haofrankyang@jhu.edu)
¹University of Texas at Austin ²New York University
³Johns Hopkins University ⁴NVIDIA Research ⁵Stanford University

Abstract

The increasing rate of road accidents worldwide not only results in significant loss of life but also imposes billions of dollars in financial burdens on societies. Current research in traffic crash frequency modeling and analysis has predominantly approached the problem as a classification task, focusing mainly on learning-based classification or ensemble learning methods. These approaches often overlook the intricate relationships among the complex infrastructure, environmental, human and contextual factors related to traffic crashes and risky situations. In contrast, we first propose a large-scale traffic crash language dataset, named CrashEvent, summarizing 19,340 real-world crash reports and incorporating infrastructure data as well as environmental and traffic information, both textual and visual, from Washington State. Leveraging this rich dataset, we formulate crash event feature learning as a novel text reasoning problem and fine-tune various large language models (LLMs) to predict detailed accident outcomes, such as crash type, severity and number of injuries, based on contextual and environmental factors. The proposed model, CrashLLM, distinguishes itself from existing solutions by leveraging the inherent text reasoning capabilities of LLMs to parse and learn from complex, unstructured data, thereby enabling a more nuanced analysis of contributing factors. Our experimental results show that the LLM-based approach not only predicts the severity of accidents but also classifies different types of accidents and predicts injury outcomes, with the averaged F1 score boosted from 34.9% to 53.8%. Furthermore, CrashLLM can provide valuable insights for numerous open-world what-if situational-awareness traffic safety analyses with learned reasoning features, which existing models cannot offer. We make our benchmark, datasets, and model publicly available for further exploration.

1 Introduction


Road traffic crashes constitute a global public health crisis, resulting in substantial mortality, morbidity, and economic costs. In 2021, a total of 6,103,213 cases were reported in the United States. According to the National Highway Traffic Safety Administration (NHTSA) Fact 2022, on average, one crash occurs roughly every five seconds, and 40.92% of them result in injuries and long-term disabilities. Tragically, 39,785 of these crashes were fatal, resulting in one life lost every 11 minutes (https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813560). In 2019, the total comprehensive loss for the USA was $1.365 trillion, amounting to $4,117 per citizen annually [1]. These startling statistics underscore the urgent need for the research community to develop and implement effective interventions to save lives and reduce the economic burden. Typically, after a crash occurs, the details are summarized in a crash report by traffic agencies, using figures, text and numerical formats to reconstruct the process. However, the causal factors of crashes are multifaceted, heterogeneous, and interconnected, encompassing a complex interplay of infrastructure design, human factors, environmental conditions, alcohol or drug use, vehicle-related factors, and other variables [2]. This inherent complexity presents a longstanding challenge in analyzing these multimodal data and localizing the causal factors to learn from these tragedies.

Existing research typically uses machine learning approaches [3] that formulate traffic accident analysis as a classification task [4, 5, 6], summarizing and predicting crashes using a fixed number of features derived from heterogeneous crash reports. While these methods have provided valuable outputs, they typically oversimplify inputs into numerical categories and cannot offer accurate insights into event-level details. The process of discretizing text descriptions into handcrafted features (e.g., one-hot vectors, categorical levels) often fails to capture the complex inter- and intra-correlations among the diverse human, vehicle, behavior, regulation, environmental, and contextual factors present in textual crash records. There is therefore a clear need for new approaches capable of learning from complex, unstructured crash text records to enable more accurate, reliable and useful prediction and reasoning analysis of crash contributing factors, and thereby to improve traffic safety effectively.

Recently, large language models (LLMs) [7, 8], pretrained on extensive natural language data, have shown exceptional proficiency in contextual text reasoning. The ability of LLMs to understand and generate human-like text suggests their potential for comprehending the complex and unstructured data found in crash reports. This understanding can facilitate case-level analysis and what-if situational comparison, aiding in the identification of the hidden causes of accidents. However, to accurately interpret specific crash records, LLMs require fine-tuning due to the unique and nuanced nature of traffic data, which often includes heterogeneous information not adequately addressed by models trained on generic data.

To fill this gap, we introduce the CrashEvent dataset and CrashLLM, the first event-level traffic crash benchmark for LLMs, and investigate the ability of LLMs to forecast traffic crashes by reasoning over textualized heterogeneous crash records. CrashEvent comprises crash data from the whole of Washington State in 2022 and defines three critical crash prediction tasks as text reasoning problems. Each crash instance in CrashEvent consists of thorough descriptions of multiple aspects (see Figure 1), allowing the integration of the original, rich crash reports without loss of textual information. CrashEvent is curated with a human-machine cooperative approach in two phases: 1) Crash Textual Recategorization and Organization, in which human experts formulate structured and meaningful paragraph-wise forms covering general information, infrastructure information, event information and crash unit information; 2) Machine-guided Crash Report Generation, in which we ask ChatGPT to transform the original unstructured text and fill it into the crash template. This two-phase procedure makes data generation highly efficient, requires minimal human labelling effort, and yields rich and lossless crash contexts. We then conduct experiments using both parameter-efficient fine-tuning of LLMs with LoRA [9] and representative ML methods to perform event-level crash predictions. Our findings indicate that, by formulating traffic crash prediction as a text reasoning problem and leveraging the uncompressed data format, CrashLLM significantly outperforms all traditional methods. To proactively mitigate risks and enhance overall traffic safety, we perform a conditional what-if analysis: we synthesize test cases by perturbing specific attributes and visualize the resulting distribution shift. The analysis allows the identification of factors that improve safety predictions and equips first responders with timely, event-specific insights, ultimately reducing the chances of crashes happening.
We summarize the contributions as follows:

1. We introduce the CrashEvent dataset, comprising 19,340 crash records from 242 cities and 1,973 road segments in Washington State during 2022, totaling approximately 6.32 million words. Each crash event record includes 50 attributes describing the infrastructure, event, environment, and textual descriptions of the vehicles and pedestrians involved.
2. We introduce CrashLLM, the first event-level traffic crash prediction framework. We demonstrate the effectiveness of fine-tuning CrashLLM for traffic crash prediction tasks by formulating crash prediction as text-based reasoning analysis.
3. Experimental results show that CrashLLM achieves an average F1 score 18.89% higher on three tasks (injury, severity, and accident type prediction) compared to existing machine learning models. We further conduct a what-if situational-aware analysis by synthesizing test data to explore hypothetical scenarios and assess their potential impact on traffic safety outcomes.

2 Related Works

Existing Approaches to Learning Traffic Crash Forecasting

Multiple traditional methods have been employed to analyze traffic crashes, focusing primarily on predicting injury severity using machine learning techniques. Some studies frame the problem as binary classification [4, 5, 6]. For instance, Assi (2020) developed a hybrid system using Principal Component Analysis (PCA) with Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM) to predict traffic crash severity, distinguishing between slight injury and serious/fatal [10]. Other studies consider it as multiclass classification for different severity levels [11, 12, 13, 14, 15]. For example, Satter et al. (2023) predicted injury levels using a vanilla MLP, an MLP with embedding layers, and TabNet [16]. Most studies utilize text-based traffic crash reports [17], but machine learning methods require transforming text into numerical representations, which may cause information loss and reduce prediction accuracy. Additionally, these models only learn the distribution of the training data and lack a true understanding of the underlying causality, limiting their ability to generalize and predict accurately when distributions change.

Related Causality and Factors Influencing Traffic Crashes

In some of these studies, researchers have further investigated the relationship between risk factors and crash severity. Various machine learning methods have been widely used to solve the crash injury severity classification problem and to analyze the contributing factors related to injury severity. These methods include Support Vector Machine (SVM) [18, 19], Logistic Regression (LR) [20, 21], AdaBoost [20], Bayesian network [22, 13, 23, 24], K-means Clustering and Latent Class Clustering [25, 26], and SHapley Additive exPlanations (SHAP) [27]. However, the real-world crash data distribution is significantly imbalanced, with incomplete and heterogeneous records, which can introduce bias and inaccuracies in the model's learning process.

Human-Like Reasoning in Large Language Models

Recent advancements in large language models (LLMs), such as GPT-4 [7] and LLaMA 2 [8], have expanded the scope of artificial intelligence from traditional predictive analytics to emulating complex human-like interactions in various systems [28]. One notable feature of LLMs is in-context learning (ICL), where the model performs tasks based on input-output examples without parameter adjustments. Additionally, knowledge extraction from locally deployable LLMs through fine-tuning, particularly using Parameter-Efficient Fine-Tuning (PEFT) techniques [9, 29], offers valuable insights. We adopt the PEFT approach [9] for fine-tuning CrashLLM.

3 Transforming Non-Numeric Crash Events into Textual Format


Describing an accident involves complex and diverse information, including the environment of the accident location, details of the vehicles involved, and a thorough description of the accident process. Providing such detailed information is essential for understanding the accident and analyzing its causes [30]. To obtain a comprehensive accident dataset, we gathered data from a variety of sources and formats, then reorganized and processed it into a unified record for each accident. Finally, we categorized each record into four types of information (general information, infrastructure information, event information and unit information) and converted them into text descriptions of approximately 300 words to serve as the final input dataset. Our datasets, with license description, can be accessed at https://crashllm.github.io/.

Introduction to Raw Data Sources and Types

Our dataset includes crash data from Washington State in 2022, with a total of 19,340 records after filtering out significantly incomplete data. The primary sources of our dataset include Highway Safety Information System (HSIS) crash data (https://highways.dot.gov/), satellite image data, police accident reports, crashworthiness data and state driver licensing data. The HSIS data consists of two components: one describes the physical layout of roads and the associated traffic characteristics within Washington State, and the other consists of crash reports that provide a general description of accidents. Satellite images are obtained based on the location, infrastructure and planning information in the crash reports (e.g., intersection alignments and neighborhood types). By processing these satellite images, we can obtain more detailed descriptions of the road infrastructure. Police accident reports contain detailed descriptions of accidents, with their data elements standardized into a common format. Crashworthiness data includes information collected from crash sites and data related to the vehicles involved in the accidents. State driver licensing data includes basic information about the individuals involved in accidents, such as age and gender. Figure 2 uses an example of crash data to illustrate the process of generating a prompt from the raw data.

Feature Engineering and Textual Organization of Crash Data

For each accident, we associated the crash report with the involved vehicles and individuals using the crash report number, thus obtaining descriptions of the accident and the persons involved. The route ID and milepost were used to identify the specific road segment where the accident occurred, allowing us to gather related roadway data from the existing database. Additionally, to supplement the infrastructure and environmental information, we obtained satellite images based on GPS coordinates and used a vision-capable large language model (VLLM) [7] to describe them. Due to the diverse sources of data, we performed feature cleaning by deleting or merging duplicate features. Given the substantial amount of textual description in the data, we employed an AI-human collaborative approach for dimensionality reduction on certain features. This process generalized the data and reduced redundancy. Based on the five W's (where, when, what, who, why) of crash reports [31], we categorized the data into four types of information (general information, infrastructure information, event information and unit information). Finally, to provide logically coherent and continuous textual data that is amenable to LLM learning, we transformed each category of data into text format using an AI-human cooperative prompt design. This work provides a comprehensive textual database that describes the accident environment, process, and entities involved, forming a solid foundation for LLM learning on accident data. After filtering out data points with significant missing information, the CrashEvent dataset merges the complementary information from multimodal data sources and contains 19,340 crash records with approximately 6.32 million words for further use.
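The linking step described above (crash report number for units, route ID and milepost for the road segment) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the field names `report_no`, `route_id`, `milepost`, `begin_mp` and `end_mp` are hypothetical stand-ins for whatever the HSIS files actually use.

```python
from bisect import bisect_right

def link_crash_record(crash, units, segments):
    """Attach involved units and the matching road segment to one crash report.

    All field names are hypothetical placeholders for the real HSIS columns.
    """
    # Vehicles and persons share the crash report number with the report.
    involved = [u for u in units if u["report_no"] == crash["report_no"]]

    # Candidate segments on the same route, ordered by begin milepost.
    on_route = sorted(
        (s for s in segments if s["route_id"] == crash["route_id"]),
        key=lambda s: s["begin_mp"],
    )

    # Take the last segment that starts at or before the crash milepost,
    # then check that the milepost really falls inside it.
    idx = bisect_right([s["begin_mp"] for s in on_route], crash["milepost"]) - 1
    segment = None
    if idx >= 0 and crash["milepost"] <= on_route[idx]["end_mp"]:
        segment = on_route[idx]

    return {**crash, "units": involved, "segment": segment}
```

A crash whose milepost falls outside every known segment comes back with `segment` set to `None`, which is one way such records could be flagged as incomplete before filtering.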

Variable	No.	Values	Abbr.	Definitions
Severity(S)	1	No Apparent Injury	O	No visible injuries reported at the scene.
	2	Possible Injury	C	Any injury reported to the officer or claimed by the individual.
	3	Minor Injury	B	Any injury other than fatal or disabling at the scene.
	4	Serious Injury	A	Any injury that prevents an individual from walking, driving, or continuing their normal activities.
	5	Fatal	K	Any injury that directly results in the death of a living person within 30 days of a motor vehicle crash.
AccidentType (AT)	1	Single Vehicle With Object	SVO	Collision involving a single vehicle and a stationary object.
	2	Angle Impacts Right	AIR	Vehicles collide at an angle, impacting on the right side.
	3	Other	Oth	Any other types of accidents not classified in specific categories.
	4	Sideswipes Left	SL	Vehicles sideswipe each other on the left side.
	5	Front End Collision	FEC	Collisions where the front ends of vehicles impact each other.
	6	Rear End Collision	REC	Collisions where one vehicle impacts the rear of another.
	7	Overturn	OT	Accidents where a vehicle overturns.
	8	Animal Collision	AC	Collisions involving animals.
	9	Pedestrian Collision	PC	Accidents where a vehicle collides with a pedestrian.
	10	Sideswipes Right	SR	Vehicles sideswipe each other on the right side.
	11	Pedal Cyclist Collision	PCC	Collisions involving cyclists.
	12	Head On Collision	HOC	Head-on collisions between vehicles.
	13	Off Road	OR	Accidents involving vehicles going off the road.
	14	Angle Impact Left	AIL	Vehicles collide at an angle, impacting on the left side.

Defining Inputs and Outputs

In the context of a traffic accident, the outcome and severity are of primary concern. Numerous studies focus on predicting and analyzing the types of accidents, their severity, or the number of injuries involved [32, 33, 34]. To effectively measure these aspects, we selected three variables from the accident reports to describe the outcome of the accident: the bucketed number of people injured $(\mathcal{I}_{t})$ , which is more balanced than the raw number of injuries; the severity of the accident on the KABCO scale (https://highways.dot.gov/media/20141) $(\text{S})$ , which is commonly used in police-recorded accident data [35]; and the accident type $(\text{AT})$ . We use these three variables to describe the crash result $(\text{CR}_{i})$ , where $i$ denotes the unique identifier caseid. The accident outcome can be presented in the following format: $\text{CR}_{i}=\text{AT}_{S}^{\mathcal{I}_{t}}$ . The function $\mathcal{I}_{t}$ describes the number of people injured in an accident as follows: zero if $t=0$ , one if $t=1$ , two if $t=2$ , and more than two if $t\geq 3$ , where $t$ represents the number of people injured. The values for S and AT are provided in Table 1.
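As a small illustration of the bucketing function $\mathcal{I}_{t}$ and the crash result triple defined above, the sketch below maps a raw injury count to its four-way label. The function and key names are our own, chosen only for this example.

```python
def injury_bucket(t: int) -> str:
    """Map the raw number of injured people t to the four-way label I_t."""
    if t < 0:
        raise ValueError("the number of injured people cannot be negative")
    # zero, one and two keep their own class; t >= 3 is merged into one bucket.
    return ("zero", "one", "two")[t] if t <= 2 else "more than two"

def crash_result(accident_type: str, severity: str, injured: int) -> dict:
    """Bundle the three target variables (AT, S, I_t) for one crash record."""
    return {"AT": accident_type, "S": severity, "I_t": injury_bucket(injured)}

# Example: a rear-end collision with minor injuries to four people.
example = crash_result("REC", "B", 4)
```

Merging all counts of three or more into a single class is what keeps the label distribution closer to balanced than the raw count would be.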

The model's input contains four segments of textual information, as shown in Figure 2. Each paragraph consists of approximately 100 words. Together, they provide a comprehensive and detailed description of the accident. The content of each paragraph is outlined below: 1) General Information: this includes specifics about the time and location of the accident, as well as the type of road where it occurred. 2) Infrastructure Information: this covers descriptions of the road infrastructure, including both static elements like the number of lanes and speed limits, and dynamic features such as work zone indicators, lighting, and road surface conditions at the time of the accident. 3) Event Information: this segment provides a detailed account of the accident process and the contributing factors identified in the records. 4) Unit Information: this involves details about the vehicles and individuals involved in the crash.

4 Adapting Language Models for Text-Based Crash Reasoning Analysis

We adapt LLaMA-2 [8] to crash prediction tasks to enhance the LLMs' capabilities in interpreting crash data, identifying critical factors, and conducting causality analysis to offer insights for crash prevention.

Construct Training Data for LLMs

In the training of large language models (LLMs), a single input consists of three components: the system prompt, the user prompt, and the target prompt. Details regarding the system and user prompts are presented in Section 3. The target prompt is formulated using the template: "The answer is: <PREDICTION>", where <PREDICTION> represents the ground truth. These components are structured as follows: "System: <system prompt>, User: <user prompt>, Assistant: <target prompt>". We use LLaMA-2's tokenizer to segment the text inputs into tokens.
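The template above can be assembled with plain string formatting. The sketch below is illustrative only: the segment keys and the way the four paragraphs are joined into the user prompt are assumptions, while the "System / User / Assistant" layout and the "The answer is:" target follow the text.

```python
# Order of the four textual segments described in Section 3 (key names are hypothetical).
SEGMENT_ORDER = ("general", "infrastructure", "event", "unit")

def build_training_example(system_prompt: str, segments: dict, prediction: str):
    """Return (context, target) strings for one crash record.

    The context holds the system and user prompts; the target holds the
    answer. Keeping them separate makes it easy to exclude the context
    from the training loss later on.
    """
    user_prompt = " ".join(segments[name] for name in SEGMENT_ORDER)
    target_prompt = f"The answer is: {prediction}"
    context = f"System: {system_prompt}, User: {user_prompt}, Assistant: "
    return context, target_prompt

context, target = build_training_example(
    "Predict the number of injured people.",
    {"general": "G.", "infrastructure": "I.", "event": "E.", "unit": "U."},
    "<ZERO>",
)
full_text = context + target
```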

Additional Special Tokens for Classification

To adapt the LLM as a crash classifier, additional tokens have been incorporated into the tokenizer’s vocabulary. Specifically, for predicting the total number of injuries, four special tokens have been introduced: [<ZERO>, <ONE>, <TWO>, <THREE OR MORE>], corresponding to zero, one, two, and three or more injured individuals, respectively, in the crash event. This approach has also been applied to the predictions of severity and accident type (see supplementary materials for details). The parameters of the input and output embedding layers are set as trainable, enabling the model to align the representations of these special tokens with the existing embedding space.
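To make the mechanism concrete, the toy sketch below extends a vocabulary and its embedding table with the four injury tokens. It is a pure-Python stand-in rather than the authors' code: initialising the new rows near the mean of the existing embeddings is our assumption, since the text only states that the input and output embedding layers are trainable. In a Hugging Face setup the corresponding calls would typically be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
import random

INJURY_TOKENS = ["<ZERO>", "<ONE>", "<TWO>", "<THREE OR MORE>"]

def add_special_tokens(vocab, embeddings, new_tokens, seed=0):
    """Append unseen tokens to `vocab` and one embedding row per new token.

    vocab: dict mapping token string -> row index (modified in place).
    embeddings: list of equal-length float lists (modified in place).
    Returns the number of tokens actually added.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    # Mean of the existing rows, computed once before anything is appended.
    mean = [sum(row[d] for row in embeddings) / len(embeddings) for d in range(dim)]
    added = 0
    for token in new_tokens:
        if token in vocab:
            continue  # already known, keep its existing row
        vocab[token] = len(vocab)
        # Assumed initialisation: mean embedding plus small Gaussian noise.
        embeddings.append([m + rng.gauss(0.0, 0.01) for m in mean])
        added += 1
    return added

toy_vocab = {"crash": 0, "road": 1}
toy_embeddings = [[0.0, 1.0], [1.0, 0.0]]
n_added = add_special_tokens(toy_vocab, toy_embeddings, INJURY_TOKENS)
```

Because each class is a single token, classification reduces to comparing the model's scores for a handful of token ids at one output position.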

Supervised Finetuning

During the fine-tuning phase, the traffic forecasting task is framed as a next-token generation task. This process can be described as:

$$p_{\theta}(T_{i})=\prod_{j=1}^{|T_{i}|}p_{\theta}\left(t_{j}^{(i)}\mid t_{1}^{(i)},\cdots,t_{j-1}^{(i)}\right)\tag{1}$$

where $T_{i}$ is the $i$ -th item in the training data, $p_{\theta}$ is the LLM, and $t_{j}^{(i)}$ denotes the $j$ -th token in $T_{i}$ . The LLM's parameters are learned by maximizing the likelihood $p_{\theta}(T)=\prod_{i=1}^{N}{p_{\theta}(T_{i})}$ . Both the system prompt and the user prompt are masked out of the loss computation during training, so only the target tokens contribute. Through this process, the model learns to make predictions for a traffic crash. In our experiments, we use LoRA [9] to fine-tune LLaMA-2 models. All the models are loaded in 4-bit precision. We use AdamW [36] as the optimizer and train the models on an Nvidia A100 80GB GPU with DeepSpeed (https://www.deepspeed.ai/).
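The effect of masking the system and user prompts can be shown with a few lines of arithmetic on per-token log-probabilities. This is a schematic sketch under our own assumptions: the numbers are made up, and averaging over the unmasked tokens (rather than summing) is a common convention that the text does not specify.

```python
import math

def masked_nll(token_log_probs, loss_mask):
    """Average negative log-likelihood over tokens whose mask entry is 1.

    token_log_probs[j] stands for log p(t_j | t_1, ..., t_{j-1});
    loss_mask[j] is 0 for system/user prompt tokens and 1 for target tokens.
    """
    if len(token_log_probs) != len(loss_mask):
        raise ValueError("log-probabilities and mask must have the same length")
    kept = [lp for lp, m in zip(token_log_probs, loss_mask) if m]
    if not kept:
        raise ValueError("at least one target token is required")
    return -sum(kept) / len(kept)

# Two prompt tokens (masked) followed by two target tokens (kept).
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8), math.log(0.5)]
mask = [0, 0, 1, 1]
loss = masked_nll(log_probs, mask)

# Without a mask, the summed log-probabilities give log p(T_i) as in Eq. (1).
sequence_log_likelihood = sum(log_probs)
```

Minimising this masked loss is equivalent to maximising the likelihood of the answer tokens given the crash description, which is the part of Eq. (1) the classifier actually needs.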

5 Experiments

We conduct experiments to study the effectiveness of the trained model under two evaluation settings: (1) How well does CrashLLM perform as a traffic crash predictor? (2) How well does it handle conditional analysis, i.e., to what extent have LLMs mastered conditional and modal reasoning?

CrashEvent Dataset Split and Evaluation Metrics

We split the collected data into two parts: data from January, June, and December are used as the testing set (842 data points), with additional resampling based on the total number of injuries to create a uniformly distributed evaluation subset. The remaining data (15,014 data points) are used as the training set. All experiments are configured to cover three prediction tasks: Total Injuries, Crash Severities, and Accident Types. Since each task is evaluated as classification, we employ accuracy, precision, recall, and F1 score as metrics.
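The four metrics can be computed as in the sketch below. The averaging scheme is our assumption, not something the text states: we use support-weighted averages because, in Table 2, recall equals accuracy in every row, which is exactly what weighted averaging produces.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)                 # true count per class
    predicted = Counter(y_pred)               # predicted count per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n                    # class share in the test set
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# A classifier that always answers with the most frequent class.
y_true = ["zero"] * 5 + ["one"] * 3 + ["two"] * 2
y_pred = ["zero"] * 10
majority = weighted_metrics(y_true, y_pred)
```

For such a majority-class predictor with class share $s$, the weighted precision is $s^{2}$ and the weighted F1 is $2s^{2}/(1+s)$. With $s\approx 0.353$ this gives roughly 0.124 and 0.184, the values that several baseline rows in Table 2 report for the injury task, which is consistent with those baselines collapsing onto a single class.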

Adopted Baselines

We follow the recent literature [3] and adopt Random Forest (RF) [37], Decision Trees (DT) [38], Adaptive Boosting (AdaBoost) [39], Bayesian Network (BN) [40], Logistic Regression (LR) [41], and Categorical Boosting (CatBoost) [42] as baselines.

Model	Evaluation Metric (Model Rank)				Rank
	Accuracy $\uparrow$	Precision $\uparrow$	Recall $\uparrow$	F1-score $\uparrow$
	Injury/Severity/Type	Injury/Severity/Type	Injury/Severity/Type	Injury/Severity/Type
RandomForest[43]	0.353 / 0.339 / 0.384	0.124 / 0.115 / 0.543	0.353 / 0.339 / 0.384	0.184 / 0.171 / 0.395	6.58 (9)
AdaBoost[39]	0.353 / 0.339 / 0.579	0.124 / 0.115 / 0.383	0.353 / 0.339 / 0.579	0.184 / 0.171 / 0.447	6.33 (8)
CatBoost[42]	0.353 / 0.339 / 0.702	0.124 / 0.115 / 0.664	0.353 / 0.339 / 0.702	0.184 / 0.171 / 0.667	5.08 (6)
Bayesian Network[44]	0.394 / 0.341 / 0.653	0.485 / 0.306 / 0.563	0.394 / 0.341 / 0.653	0.287 / 0.181 / 0.578	4.67 (4)
DecisionTree[38]	0.353 / 0.347 / 0.677	0.124 / 0.207 / 0.631	0.353 / 0.347 / 0.677	0.184 / 0.190 / 0.640	4.75 (5)
LogisticRegression[41]	0.353 / 0.339 / 0.566	0.124 / 0.115 / 0.471	0.353 / 0.339 / 0.566	0.184 / 0.171 / 0.457	6.25 (7)
Llama-7B	0.399 / 0.382 / 0.740	0.404 / 0.411 / 0.771	0.399 / 0.382 / 0.740	0.401 / 0.379 / 0.744	2.92 (3)
Llama-13B	0.439 / 0.393 / 0.748	0.431 / 0.375 / 0.767	0.439 / 0.393 / 0.748	0.427 / 0.353 / 0.755	2.08 (2)
Llama-70B	0.447 / 0.436 / 0.747	0.451 / 0.446 / 0.775	0.447 / 0.436 / 0.747	0.445 / 0.411 / 0.757	1.25 (1)


Comparisons with SoTA on Crash Prediction.

The quantitative comparisons between CrashLLM and other established machine learning models are shown in Table 2. In this table, "Injury" denotes the task of predicting the total number of people injured, "Severity" the task of predicting the severity, and "Type" the accident type task. Here, we formulate traffic crash prediction as a text classification problem, categorizing injuries into four categories, severity into five categories, and crash type into fourteen categories. A comprehensive examination of existing crash frequency and prediction models shows that none reliably provides useful outcomes at the event level. Most of these models prioritize fitting distributions dominated by head categories rather than learning distinct crash features. While some results, such as the BN on injury prediction, appear promising, the confusion matrix shows its tendency to predict only one head category. Encouragingly, CrashLLM outperforms all other baselines in the averaged metrics, with the 70B model performing best on average across all adopted tasks. This validates CrashLLM's robust capability in following instructions to predict traffic crash properties. A reliable forecasting model that performs accurate reasoning based on crash records is crucial to prevent significant prediction errors. To further evaluate the reliability of all adopted models, we visualize their confusion matrices in Figure 3. Traditional classification ML models tend to predict the dominant categories (e.g., zero injuries for Total Injuries, no apparent injury for Crash Severity), which mostly fall in the first column. In contrast, CrashLLM uses its text reasoning capacity to predict traffic crashes by leveraging complex, heterogeneous text data.
This suggests that CrashLLM can offer valuable insights for making informed operational decisions through detailed model analysis.
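The headline numbers can be re-derived from the F1 columns of Table 2. The short calculation below averages the three task-level F1 scores per model; reading the "34.9% to 53.8%" claim as "best baseline versus Llama-70B" is our interpretation of the table, not something the text spells out.

```python
# F1 scores (Injury, Severity, Type) copied from Table 2.
F1_SCORES = {
    "RandomForest": (0.184, 0.171, 0.395),
    "AdaBoost": (0.184, 0.171, 0.447),
    "CatBoost": (0.184, 0.171, 0.667),
    "Bayesian Network": (0.287, 0.181, 0.578),
    "DecisionTree": (0.184, 0.190, 0.640),
    "LogisticRegression": (0.184, 0.171, 0.457),
    "Llama-7B": (0.401, 0.379, 0.744),
    "Llama-13B": (0.427, 0.353, 0.755),
    "Llama-70B": (0.445, 0.411, 0.757),
}

# Unweighted mean over the three tasks for each model.
mean_f1 = {model: sum(scores) / len(scores) for model, scores in F1_SCORES.items()}

baselines = [m for m in F1_SCORES if not m.startswith("Llama")]
best_baseline = max(baselines, key=mean_f1.get)
best_llm = max((m for m in F1_SCORES if m.startswith("Llama")), key=mean_f1.get)

# Gap between the best LLM and the best baseline, in percentage points.
gap_points = 100 * (mean_f1[best_llm] - mean_f1[best_baseline])
```

With these inputs the best baseline is the Bayesian Network at about 34.9% and Llama-70B reaches about 53.8%, a gap of roughly 18.9 points; the 18.89% quoted in the contributions presumably comes from unrounded scores.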

What-if Situational Analysis for Informed Transportation.

Transportation is a complex system that directly interacts with human beings. For traffic agencies, changing safety policies and conducting analyses typically require considering many scenarios. It is impossible to collect data for all of these scenarios and encode them into traditional machine learning models, as discussed in previous sections. For example, many DoTs have tried to investigate how an increase in people driving under the influence of drugs and alcohol would change the severity of traffic crashes [1, 45], but this cannot be done without more useful samples. Similar questions are raised by numerous agencies and policymakers, to name just a few: What if the weather conditions changed from sunny to snowy? How would this impact traffic safety? What if there are work zones on the road? How would this affect the types of crashes and the severity of injuries? What if the road conditions were icy? How could we estimate the impact on the distribution of traffic crashes? These kinds of "what-if" situational-awareness questions are frequently posed, but no model could answer them until the introduction of CrashLLM. Compared to traditional classification-based ML models, one of the biggest advantages of LLMs is their human-like text reasoning capability. Given their extensive vocabulary and ability to infer real-world human logic, LLMs offer a unique advantage that can be used to explore hypothetical scenarios and assess their potential impact on traffic safety outcomes. We show that the fine-tuned CrashLLM has this capability, can provide valuable conditional comparisons, and can serve as a reference for human decision-making. In this study, we focus on three key factors known to impact traffic crashes [46, 47]: alcohol and drug use, adverse weather conditions (specifically icy roads), and the presence of work zones.
To investigate their effects, we synthesized three scenarios in the test set by incrementally converting a portion of the original conditions (non-alcohol, dry roads, non-work zone) into their corresponding adverse conditions (alcohol, icy roads, work zone). We perturbed the original test set at rates of 100% (doubling the number of affected cases), 200% (tripling it), and finally all cases, simulating increasing levels of these risk factors, and compared the predicted outcomes with the original distribution.
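The perturbation procedure can be sketched on structured records as below. This is a simplified illustration under our own assumptions: the record layout, the field name and the random choice of which cases to convert are hypothetical, and in the actual setting the textual prompt of each converted case would also need to be regenerated so that the model sees the adverse condition.

```python
import random

def perturb_condition(records, field, adverse, rate, seed=0):
    """Return a copy of `records` with extra cases switched to `adverse`.

    rate=1.0 doubles the number of adverse cases, rate=2.0 triples it,
    and rate=None converts every remaining case.
    """
    rng = random.Random(seed)
    out = [dict(r) for r in records]          # leave the originals untouched
    current = sum(1 for r in out if r[field] == adverse)
    candidates = [i for i, r in enumerate(out) if r[field] != adverse]
    if rate is None:
        n_convert = len(candidates)
    else:
        # Cannot convert more cases than there are non-adverse records.
        n_convert = min(len(candidates), round(current * rate))
    for i in rng.sample(candidates, n_convert):
        out[i][field] = adverse
    return out

# Toy test set: 2 icy-road crashes out of 10.
test_set = [{"surface": "icy"}] * 2 + [{"surface": "dry"}] * 8
doubled = perturb_condition(test_set, "surface", "icy", rate=1.0)
tripled = perturb_condition(test_set, "surface", "icy", rate=2.0)
all_icy = perturb_condition(test_set, "surface", "icy", rate=None)
```

Running the fine-tuned model on each perturbed copy and comparing the predicted label histograms with those of the original test set gives the distribution shift that the what-if analysis reports.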


The results, illustrated in Figure 4, demonstrate a clear distribution shift in crash outcomes. Even without additional fine-tuning, our CrashLLM model effectively captures the influence of these factors on crash likelihood and severity. Among the three factors, work zone existence has the most significant impact on crash distributions. Doubling work-zone-related crash test cases triggers a 21% increase in crashes with three or more injuries, leads to a 42% higher percentage of serious crashes, and results in a 20% higher rate of right-side angle impacts due to necessary lane changes. Doubling the number of alcohol-involved crashes leads to a 10% increase in serious injury crashes and a 200% higher rate of fatal crashes. Alcohol use also triggers a 16% increase in crashes involving pedestrians and cyclists, and a 7% increase in rollovers. Icy road conditions have the most substantial impact on changes in serious injury crashes. Doubling these cases leads to a 27% increase in crashes causing serious injury, with a 30% higher rate of rollovers and a 12% increase in angle impacts. More what-if analyses and causal findings can be found in the supplementary materials. These findings and what-if situational comparisons highlight the model's ability to leverage existing knowledge and generate insightful predictions even when faced with data limitations.

6 Conclusion

In this study, we introduced a new traffic crash prediction dataset, CrashEvent, by textualizing heterogeneous crash records into a language-based representation, and developed a traffic crash prediction framework, CrashLLM, that leverages the advanced capabilities of large language models (LLMs) to analyze and predict traffic crash incidents. We demonstrated that CrashLLM outperforms established machine learning models across multiple tasks. By utilizing complex, heterogeneous data through text reasoning, it allows for a deeper understanding of the underlying factors contributing to traffic crashes. This capability is crucial for developing informed operational strategies and making data-driven decisions for enhancing city-level infrastructure.

7 Limitations, Future Works and Societal Impacts

Although our framework demonstrates promising performance in traffic crash prediction, CrashLLM requires separate training for each adopted task. A potential solution would be to incorporate newer LLMs with specifically designed prompts that denote the different prediction tasks within a unified model. Our research enables more accurate event-level crash predictions. This technology is advantageous for crash forecasting and enhances overall roadway safety.

Technical Appendices

These technical appendices provide details that are not included in the main paper due to space limitations. We include a few prompt examples, a detailed description of the added special tokens, an explanation of the what-if analysis, and the generation process of the satellite images. Our organized CrashEvent datasets can be accessed through crashllm.github.io.

Prompt Examples.

In our research, we utilize textualized prompts to facilitate model understanding across different tasks. We illustrate this with examples from three distinct tasks: traffic injury prediction, crash severity classification, and accident type estimation. We show five different prompts below, showcasing how we structure the input data into prompts that the model can process effectively.

Additional Special Tokens for Classification

As shown in the main draft, for predicting the Total Injuries, we have introduced four special tokens: [<ZERO>, <ONE>, <TWO>, <THREE OR MORE>], representing zero, one, two, and three or more injured individuals in a crash event, respectively. Similarly, for predicting the Crash Severities, we use five additional tokens: [<NO APPARENT INJURY>, <POSSIBLE INJURY>, <MINOR INJURY>, <SERIOUS INJURY>, <FATAL>], corresponding to different levels of severity. For the task of identifying Accident Types, we utilize 14 special tokens: [<SINGLE VEHICLE WITH OBJECT>, <ANGLE IMPACTS_RIGHT>, <OTHER>, <SIDESWIPES_LEFT>, <FRONT END COLLISIONS>, <REAR END COLLISIONS>, <OVERTURN>, <ANIMAL COLLISIONS>, <PEDESTRIAN COLLISIONS>, <SIDESWIPES_RIGHT>, <PEDALCYCLIST COLLISIONS>, <HEAD ON COLLISIONS>, <OFF ROAD>, <ANGLE IMPACTS_LEFT>], each representing a specific crash type.
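The token sets above can be written down as plain label vocabularies. The sketch below is a minimal illustration of how raw labels might be mapped to these tokens and back to class indices; the dictionary keys and helper names are hypothetical and do not come from the released code.

```python
# Special tokens per task, copied from the lists above.
INJURY_TOKENS = ["<ZERO>", "<ONE>", "<TWO>", "<THREE OR MORE>"]
SEVERITY_TOKENS = [
    "<NO APPARENT INJURY>", "<POSSIBLE INJURY>", "<MINOR INJURY>",
    "<SERIOUS INJURY>", "<FATAL>",
]
ACCIDENT_TYPE_TOKENS = [
    "<SINGLE VEHICLE WITH OBJECT>", "<ANGLE IMPACTS_RIGHT>", "<OTHER>",
    "<SIDESWIPES_LEFT>", "<FRONT END COLLISIONS>", "<REAR END COLLISIONS>",
    "<OVERTURN>", "<ANIMAL COLLISIONS>", "<PEDESTRIAN COLLISIONS>",
    "<SIDESWIPES_RIGHT>", "<PEDALCYCLIST COLLISIONS>", "<HEAD ON COLLISIONS>",
    "<OFF ROAD>", "<ANGLE IMPACTS_LEFT>",
]

# Hypothetical task names; the paper does not prescribe these keys.
TASK_TOKENS = {
    "total_injuries": INJURY_TOKENS,
    "severity": SEVERITY_TOKENS,
    "accident_type": ACCIDENT_TYPE_TOKENS,
}


def injuries_to_token(n_injured: int) -> str:
    """Map a raw injury count to its token; counts of 3 and above share one token."""
    if n_injured < 0:
        raise ValueError("injury count cannot be negative")
    return INJURY_TOKENS[min(n_injured, 3)]


def token_to_class_index(task: str, token: str) -> int:
    """Turn a generated special token back into a class index for evaluation."""
    return TASK_TOKENS[task].index(token)


print(injuries_to_token(5))
print(token_to_class_index("severity", "<FATAL>"))
```

If a Hugging Face tokenizer is used (an assumption; the appendix does not name the tooling), such tokens are typically registered with `tokenizer.add_special_tokens({"additional_special_tokens": tokens})`, followed by `model.resize_token_embeddings(len(tokenizer))` so that each new token receives its own embedding row.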

The Generation of Satellite Images

HSIS provides the coordinates of crash locations in the Washington State Plane South coordinate system. This system uses the Washington coordinate system of 1983, South Zone, which is a Lambert conformal conic projection based on the GRS 80 spheroid. The standard parallels for this projection are located at north latitudes 45° 50′ and 47° 20′, where the scale is exact. The origin of this coordinate system is defined at the intersection of the meridian 120° 30′ west of Greenwich and the parallel 45° 20′ north latitude, with assigned coordinates E = 500,000 meters and N = 0 meters (see https://business.wsdot.wa.gov/). To obtain the satellite images, we convert these coordinates into GPS coordinates (latitude and longitude). We then use the Google Maps API to request satellite images with a resolution of 512 $\times$ 512 pixels and a zoom level of 19.
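The coordinate conversion described above can be sketched with the standard two-parallel Lambert conformal conic formulas, using only the projection parameters stated in the text. This is a minimal illustration rather than the authors' pipeline: it assumes inputs in meters, treats the resulting NAD 83 latitude and longitude as close enough to GPS coordinates for image retrieval, and assumes the images are requested through the Google Maps Static API. All function names are hypothetical.

```python
import math
from urllib.parse import urlencode

A = 6378137.0                 # GRS 80 semi-major axis (m)
INV_F = 298.257222101         # GRS 80 inverse flattening
ECC = math.sqrt(2 / INV_F - 1 / INV_F**2)  # first eccentricity


def _dm(deg, minutes):
    """Degrees and minutes to radians."""
    return math.radians(deg + minutes / 60)


PHI1, PHI2 = _dm(45, 50), _dm(47, 20)   # standard parallels
PHI0, LAM0 = _dm(45, 20), -_dm(120, 30)  # origin latitude, central meridian
E0, N0 = 500000.0, 0.0                   # coordinates assigned to the origin (m)


def _m(phi):
    return math.cos(phi) / math.sqrt(1 - (ECC * math.sin(phi)) ** 2)


def _t(phi):
    s = ECC * math.sin(phi)
    return math.tan(math.pi / 4 - phi / 2) / ((1 - s) / (1 + s)) ** (ECC / 2)


N_CONE = (math.log(_m(PHI1)) - math.log(_m(PHI2))) / (math.log(_t(PHI1)) - math.log(_t(PHI2)))
F_CONE = _m(PHI1) / (N_CONE * _t(PHI1) ** N_CONE)
RHO0 = A * F_CONE * _t(PHI0) ** N_CONE


def to_state_plane(lat, lon):
    """Forward projection: degrees -> (easting, northing) in meters."""
    rho = A * F_CONE * _t(math.radians(lat)) ** N_CONE
    theta = N_CONE * (math.radians(lon) - LAM0)
    return E0 + rho * math.sin(theta), N0 + RHO0 - rho * math.cos(theta)


def to_lat_lon(easting, northing):
    """Inverse projection: (easting, northing) in meters -> degrees."""
    dx, dy = easting - E0, RHO0 - (northing - N0)
    t = (math.hypot(dx, dy) / (A * F_CONE)) ** (1 / N_CONE)
    phi = math.pi / 2 - 2 * math.atan(t)   # spherical first guess
    for _ in range(10):                    # fixed-point iteration for latitude
        s = ECC * math.sin(phi)
        phi = math.pi / 2 - 2 * math.atan(t * ((1 - s) / (1 + s)) ** (ECC / 2))
    return math.degrees(phi), math.degrees(math.atan2(dx, dy) / N_CONE + LAM0)


def static_map_url(lat, lon, api_key):
    """Build a satellite-image request (assumes the Maps Static API is the API used)."""
    params = {"center": f"{lat:.6f},{lon:.6f}", "zoom": 19,
              "size": "512x512", "maptype": "satellite", "key": api_key}
    return "https://maps.googleapis.com/maps/api/staticmap?" + urlencode(params)


lat, lon = to_lat_lon(500000.0, 0.0)  # the projection origin
print(lat, lon)
```

In practice a maintained projection library would be a safer choice than hand-written formulas; the explicit version is shown only to make the role of each stated parameter visible.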


Explanation about What-if Analysis.

What-if analysis is a powerful technique used to understand the impact of changes in input variables on the output of a model. This method allows researchers and decision-makers to explore various scenarios by modifying input parameters and observing the subsequent changes in model predictions. In practice, what-if analysis involves altering specific features in a dataset to evaluate how these changes affect the model’s output. For instance, in a traffic crash prediction model, we might modify driver conditions to investigate how these factors influence the likelihood of an accident. This approach is instrumental in identifying key factors that significantly impact outcomes and helps in developing more robust and interpretable models. Specifically, in Figure 4 of the main draft, we analyze the effects of three factors: "driving under or without alcohol," "icy or dry road conditions," and "within or outside a work zone." We examine 842 test examples distributed across January, June, and December.

Consider the "driving under or without alcohol" scenario as an example. Among the 842 test cases, there are 63 crashes involving alcohol and 779 cases without alcohol involvement. To perform the what-if analysis for the alcohol variable, we randomly select an additional 63 cases from the 779 non-alcohol cases and convert them into alcohol-involved cases, creating a set of 126 alcohol-involved cases for the analysis, labeled as "alcohol (+100%)." Additionally, we synthesize 126 alcohol-involved cases from the 779 non-alcohol cases, for a total of 189 cases, and denote this set as "alcohol (+200%)." Finally, we transform all 779 non-alcohol cases into alcohol-involved cases and conduct the analysis, labeled as "alcohol (all)." Similarly, there are
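The construction of the alcohol what-if sets can be sketched as follows. This is a minimal illustration under stated assumptions: each test case is represented as a dictionary with a boolean `alcohol` field (a hypothetical schema), and the full test set is returned with the selected cases flipped. In the actual text-based pipeline, flipping a factor would mean editing the corresponding phrase in the prompt.

```python
import random


def make_what_if_set(cases, feature, target_count, seed=0):
    """Return a copy of `cases` in which randomly chosen negative cases have
    `feature` flipped to True, so that `target_count` cases are positive."""
    positives = [i for i, c in enumerate(cases) if c[feature]]
    negatives = [i for i, c in enumerate(cases) if not c[feature]]
    n_flip = target_count - len(positives)
    if not 0 <= n_flip <= len(negatives):
        raise ValueError("target_count is out of range for this dataset")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    flipped = set(rng.sample(negatives, n_flip))
    return [
        {**c, feature: True} if i in flipped else dict(c)
        for i, c in enumerate(cases)
    ]


# Placeholder test set with the counts from the text: 63 alcohol, 779 non-alcohol.
test_cases = [{"alcohol": True}] * 63 + [{"alcohol": False}] * 779

scenarios = {
    "alcohol (+100%)": make_what_if_set(test_cases, "alcohol", 126),
    "alcohol (+200%)": make_what_if_set(test_cases, "alcohol", 189),
    "alcohol (all)": make_what_if_set(test_cases, "alcohol", len(test_cases)),
}
for name, cases in scenarios.items():
    print(name, sum(c["alcohol"] for c in cases))
```

Each scenario would then be passed through the trained model, and the predicted outcome distribution compared against the predictions on the unmodified test set.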


References

[1] Nae Y Won, Andrew J McCabe, and Linda B Cottler. Alcohol-related non-fatal motor vehicle crash injury in the us from 2019 to 2022. The American Journal of Drug and Alcohol Abuse, 50(2):252–260, 2024.
[2] Jianyu Wang, Shuo Ma, Pengpeng Jiao, Lanxin Ji, Xu Sun, and Huapu Lu. Analyzing the risk factors of traffic accident severity using a combination of random forest and association rules. Applied Sciences, 13(14):8559, 2023.
[3] Shakil Ahmed, Md Akbar Hossain, Sayan Kumar Ray, Md Mafijul Islam Bhuiyan, and Saifur Rahman Sabuj. A study on road accident prediction and contributing factors using explainable machine learning models: Analysis and performance. Transportation research interdisciplinary perspectives, 19:100814, 2023.
[4] Anshuman Sharma, Zuduo Zheng, Jiwon Kim, Ashish Bhaskar, and Md Mazharul Haque. Is an informed driver a better decision maker? a grouped random parameters with heterogeneity-in-means approach to investigate the impact of the connected environment on driving behaviour in safety-critical situations. Analytic Methods in Accident Research, 27:100127, 2020.
[5] Chengcheng Xu, Zijian Ding, Chen Wang, and Zhibin Li. Statistical analysis of the patterns and characteristics of connected and autonomous vehicle involved crashes. Journal of safety research, 71:41–47, 2019.
[6] Zhengjing Ma, Gang Mei, and Salvatore Cuomo. An analytic framework using deep learning for prediction of traffic accident injury severity based on contributing factors. Accident Analysis & Prevention, 160:106322, 2021.
[7] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[8] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[9] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[10] Khaled Assi. Traffic crash severity prediction—a synergy by hybrid principal component analysis and machine learning models. International journal of environmental research and public health, 17(20):7598, 2020.
[11] Mohamed Osman, Rajesh Paleti, Sabyasachee Mishra, and Mihalis M. Golias. Analysis of injury severity of large truck crashes in work zones. Accident Analysis & Prevention, 97:261–273, 2016.
[12] Bhaven Naik, Li-Wei Tung, Shanshan Zhao, and Aemal J. Khattak. Weather impacts on single-vehicle truck crash injury severity. Journal of Safety Research, 58:57–65, 2016.
[13] Heejin Jeong, Youngchan Jang, Patrick J. Bowman, and Neda Masoud. Classification of motor vehicle crash injury severity: A hybrid approach for imbalanced data. Accident Analysis & Prevention, 120:250–261, 2018.
[14] Hengrui Chen, Hong Chen, Zhizhen Liu, Xiaoke Sun, and Ruiyu Zhou. Analysis of factors affecting the severity of automated vehicle crashes using xgboost model combining poi data. Journal of advanced transportation, 2020:1–12, 2020.
[15] Arshad Jamal, Muhammad Zahid, Muhammad Tauhidur Rahman, Hassan M Al-Ahmadi, Meshal Almoshaogeh, Danish Farooq, and Mahmood Ahmad. Injury severity prediction of traffic crashes with ensemble machine learning techniques: A comparative study. International journal of injury control and safety promotion, 28(4):408–427, 2021.
[16] Karim Sattar, Feras Chikh Oughali, Khaled Assi, Nedal Ratrout, Arshad Jamal, and Syed Masiur Rahman. Transparent deep machine learning framework for predicting traffic crash severity. Neural Computing and Applications, 35(2):1535–1547, 2023.
[17] Amir Mehdizadeh, Miao Cai, Qiong Hu, Mohammad Ali Alamdar Yazdi, Nasrin Mohabbati-Kalejahi, Alexander Vinel, Steven E. Rigdon, Karen C. Davis, and Fadel M. Megahed. A review of data analytic applications in road traffic safety. part 1: Descriptive and predictive modeling. Sensors, 20(4), 2020.
[18] Zihe Zhang, Qifan Nie, Jun Liu, Alex Hainen, Naima Islam, and Chenxuan Yang. Machine learning based real-time prediction of freeway crash risk using crowdsourced probe vehicle data. Journal of Intelligent Transportation Systems, 28(1):84–102, 2024.
[19] Rongjie Yu and Mohamed Abdel-Aty. Analyzing crash injury severity for a mountainous freeway incorporating real-time traffic and weather data. Safety Science, 63:50–56, 2014.
[20] Rabia Emhamed AlMamlook, Keneth Morgan Kwayu, Maha Reda Alkasisbeh, and Abdulbaset Ali Frefer. Comparison of machine learning algorithms for predicting traffic accident severity. In 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), pages 272–276, 2019.
[21] Stefan Candefjord, Azam Sheikh Muhammad, Pramod Bangalore, and Ruben Buendia. On scene injury severity prediction (osisp) machine learning algorithms for motor vehicle crash occupants in us. Journal of Transport & Health, 22:101124, 2021.
[22] Cong Chen, Guohui Zhang, Rafiqul Tarefder, Jianming Ma, Heng Wei, and Hongzhi Guan. A multinomial logit model-bayesian network hybrid approach for driver injury severity analyses in rear-end crashes. Accident Analysis & Prevention, 80:76–88, 2015.
[23] Juan de Oña, Randa Oqab Mujalli, and Francisco J. Calvo. Analysis of traffic accident injury severity on spanish rural highways using bayesian networks. Accident Analysis & Prevention, 43(1):402–411, 2011.
[24] Cong Chen, Guohui Zhang, Jinfu Yang, John C. Milton, and Adélamar “Dely” Alcántara. An explanatory analysis of driver injury severity in rear-end crashes using a decision table/naïve bayes (dtnb) hybrid classifier. Accident Analysis & Prevention, 90:95–107, 2016.
[25] Amirfarrokh Iranitalab and Aemal Khattak. Comparison of four statistical and machine learning methods for crash severity prediction. Accident Analysis & Prevention, 108:27–36, 2017.
[26] Zhiyuan Sun, Yuxuan Xing, Jianyu Wang, Xin Gu, Huapu Lu, and Yanyan Chen. Exploring injury severity of bicycle-motor vehicle crashes: A two-stage approach integrating latent class analysis and random parameter logit model. Journal of Transportation Safety & Security, 14(11):1838–1864, 2022.
[27] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
[28] Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. S3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984, 2023.
[29] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
[30] Alfonso Montella, David Andreassen, Andrew P Tarko, Shane Turner, Filomena Mauriello, Lella Liana Imbriani, and Mario A Romero. Crash databases in australasia, the european union, and the united states: review and prospects for improvement. Transportation research record, 2386(1):128–136, 2013.
[31] Marianna Imprialou and Mohammed Quddus. Crash data quality for road safety research: Current state and future directions. Accident Analysis & Prevention, 130:84–90, 2019.
[32] Mohamed Abdel-Aty, Joanne Keller, and Patrick A Brady. Analysis of types of crashes at signalized intersections by using complete crash data and tree-based regression. Transportation Research Record, 1908(1):37–45, 2005.
[33] Amirfarrokh Iranitalab and Aemal Khattak. Comparison of four statistical and machine learning methods for crash severity prediction. Accident Analysis & Prevention, 108:27–36, 2017.
[34] Peter T. Savolainen, Fred L. Mannering, Dominique Lord, and Mohammed A. Quddus. The statistical analysis of highway crash-injury severities: A review and assessment of methodological alternatives. Accident Analysis & Prevention, 43(5):1666–1676, 2011.
[35] Randa Oqab Mujalli and Juan de Oña. Injury severity models for motor vehicle accidents: a review. In Proceedings of the Institution of Civil Engineers-Transport, volume 166, pages 255–270. Thomas Telford Ltd, 2013.
[36] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[37] L Breiman. Random forests. Machine Learning, 45:5–32, 10 2001.
[38] J. Ross Quinlan. Induction of decision trees. Machine learning, 1:81–106, 1986.
[39] Yoav Freund, Robert Schapire, and Naoki Abe. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771-780):1612, 1999.
[40] Tristan Deleu, António Góis, Chris Chinenye Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian structure learning with generative flow networks. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022.
[41] David R Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society Series B: Statistical Methodology, 20(2):215–232, 1958.
[42] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features. Advances in neural information processing systems, 31, 2018.
[43] Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
[44] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan kaufmann, 1988.
[45] James C Fell, Geetha Waehrer, Robert B Voas, Amy Auld-Owens, Katie Carr, and Karen Pell. Effects of enforcement intensity on alcohol impaired driving crashes. Accident Analysis & Prevention, 73:181–186, 2014.
[46] Alyssa Ditcharoen, Bunna Chhour, Tunyarat Traikunwaranon, Nalin Aphivongpanya, Kunanon Maneerat, and Veeris Ammarapala. Road traffic accidents severity factors: A review paper. In 2018 5th International Conference on Business and Industrial Research (ICBIR), pages 339–343. IEEE, 2018.
[47] Mohamed Osman, Sabyasachee Mishra, Rajesh Paleti, and Mihalis Golias. Impacts of work zone component areas on driver injury severity. Journal of Transportation Engineering, Part A: Systems, 145(8):04019032, 2019.