Causal factors discovering from Chinese construction accident cases

Zi-jian Ni^{a,∗}, Wei Liu

Faculty of Economics and Management, Dalian University of Technology

2 Ling Gong Rd., Dalian 116024, Liaoning, P. R. China

Abstract

In China, the construction industry has recorded more accident deaths than any other industry since 2012. The factors that lead to an accident interact in complex ways, and real accident data are the key to revealing the mechanism among these factors. However, data collected through questionnaires and interviews have an inherent defect: many behaviors that impact safety are illegal, so the truthfulness of some respondents' answers is doubtful. In China, most accident cases come from accident investigation reports, whose core is to find the cause of the accident and to affirm liability. In this paper, a series of NLP technologies is used to extract and organize the causal factors of construction accidents from Chinese accident case texts. Finally, three kinds of neglected causal factors are discovered through data analysis.

Keywords: Roles mismatch, Natural Language Processing (NLP), Accident cases,

Accident causes

1. Introduction

China is the largest construction market in the world; the value of its construction output was about 24.8 trillion Yuan in 2019. Safety in the construction industry, however, remains a challenge [1]. The death toll reached 1152 in 2003 and then fell for 11 consecutive years. Despite the holistic improvement of the country's occupational health and safety management system, accidents in the construction industry have killed more people than those in coal mines since 2012. In 2019, there were 904 on-the-job construction deaths, the highest number among all types of industrial accidents. Many studies hold that construction is one of the most dangerous industries because accidents on project sites are complicated and multicausal [2, 3].

In [4], accident causation theories were divided into four generations: accident proneness

theories, domino theories, injury epidemiology models, and system theories. In the last

generation, occupational safety is impacted by factors at different levels that have complex interactions. Further, two kinds of elements are analyzed in the construction accident system model. One is the factors influencing safety performance, which are called risk factors. The other is the causal factors which, as the name implies, result in the accident.

∗ Corresponding author
Email address: [email protected] (Zi-jian Ni)
Preprint submitted to May 5, 2021
arXiv:2105.01227v1 [cs.AI] 4 May 2021

Generally, system models of risk factors are built with empirical methods. The research begins with statements or hypotheses. After data are collected through questionnaires and interviews, an appropriate statistical test decides whether each hypothesis is supported. Many specific aspects of construction safety have been discussed with this methodology, and a useful review [5] summarizes thirteen main risk factors from 55 papers. In construction accident analysis, however, this kind of empirical research has an essential weakness. Many behaviors that impact safety are illegal, so the truthfulness of some respondents' answers in questionnaires and interviews is doubtful.

Moreover, an unsafe condition is not the same as an accident. Revealing the causality of accidents is essential to distinguish factors that require action from those that do not [6]. Research shows that the causes of accidents vary substantially between industries [7]. Most causal models of construction accidents [2, 8, 9, 10] originated from systematic and holistic thinking about accidents, but not all of them have been validated with sufficient real accident data. In [10], for example, only a small sample of fatal accidents (26 of 211 accident cases in total) was used to understand the underlying causes. As another example, causal factors were divided into proximal and distal factors in [2], but because of the limitations of the available accident data, only the proximal factors were validated [2]. The ConAC (Construction Accident Causality) framework [8] has been verified [6, 11] and applied [12] several times, but the data analysis was costly: extracting data from 84000 words engaged four analysts [11]. Even if more skilled professionals can be hired, keeping the criteria for extracting information consistent remains problematic. As a result, real data are the key to analyzing construction accidents.

Accident data are hard to collect not only in construction but also in other industries, because it is impossible to conduct reproducible accident experiments as in other disciplines. Past accident analysis and learning (PAL) has always been one of the two pillars on which the edifice of occupational safety research rests [13]. For PAL, accident cases are one of the most important sources [14]. In China, most of the cases come from accident investigation reports [15]. Finding the cause of the accident and affirming liability are the core of such reports [16]. In other words, the causal factors of every accident, including illegal acts, can be found in these documents. Natural Language Processing (NLP) can help people analyze unstructured text more efficiently. In this paper, causal factors of construction accidents are extracted from free text in Chinese with Automatic Keyphrase Extraction (AKE) [17]. AKE comprises a series of NLP technologies and is discussed in section 3. Furthermore, we believe that our extraction framework can be applied to Chinese accident case texts from other industries as well.

To evaluate the necessity and sufficiency of causal factors in a data set, all valid accident data from a short period would have to be fed into various algorithms to obtain correlations. Because our Chinese cases are typical accidents selected over an extended period (more than 25 years), a holistic causal model cannot be proposed in this paper. However, because more accurate information is extracted and summarized, some neglected causal factors are revealed. Meanwhile, these real accident data may also inspire empirical studies. Finally, the organized data are shared online for further studies.

The rest of this paper is structured as follows. The data source and the case text structure are introduced in section 2. A framework for extracting causal factors from texts is proposed in section 3. In section 4, role mismatch and the other two neglected factors are discussed.

2. Accident causes in Chinese accident cases

Case title

Project profile...(optional)

Client

Contractor

Details of the accident and Emergency response

Causes of the accident

Direct cause

Indirect cause

Accident severity

Liabilities

Accident prevention and improvements...(optional)

Figure 1: The structure of Chinese construction accident cases.

In our study, all 267 typical construction accident cases come from esafety.cn, the information platform of the Ministry of Emergency Management of China. The text structure of a Chinese accident case is shown in Fig. 1. Some projects are small and none of their stakeholders are corporations, although the loss is severe; for such cases, the chapters on the project profile and on accident prevention and improvements are sometimes omitted. The accident causes, however, are the core of every document.

Moreover, there are cause-and-effect relationships between the two kinds of causes in the cases. Direct causes have two main factors: unsafe behaviors of people and hazardous statuses of matters, where matters include equipment, material, and surroundings. Indirect causes can lead to direct causes and thus increase the risk of projects, which is similar to the distal causes in [2]. Details of the indirect causes are discussed in section 4.1.

As discussed above, most of the Chinese cases come from accident investigation reports, whose legal basis is the Regulation for the Investigation of Casualty Accidents of China (RICAC) [18]. Professor Sui, one of the counselors of RICAC, proposed an accident model called the cross-track model [19], which illustrates the relation between direct and indirect causes. In Fig. 2, unsafe behaviors and hazardous statuses of matters are understood as consequences of management failures. Moreover, the accident is not an inevitable outcome, but as the project goes on, the expected loss increases until an accident happens.

https://github.com/liuwei-965/Digital-management-of-Chinese-accident-cases

Figure 2: Cross track model: causes-and-eﬀects relationships between direct causes and indirect causes.

3. Extract causal factors for each accident

Figure 3: An example of keyphrases that are indirect causes of an accident. Underlined text marks the keyphrases for the cause, and grey text marks high-frequency words.

Although causal factors are concentrated in the two cause sub-sections of the case texts, not every word there is about a cause. Fig. 3 shows an example from a real case text, in which the underlined parts describe the causes of the accident. Such a description consists of a sequence of words rather than a single word, which is called a phrase [20]. Moreover, Fig. 3 shows that only some phrases are valuable for analyzing causal factors; in this paper, these phrases are called keyphrases. Finally, more than one keyphrase may be needed to express the full meaning of an accident cause. Such a set of keyphrases is called a fact.

Each case text contains more than one fact about the accident. In this section, a framework based on a series of NLP techniques is proposed to extract these facts. Because natural language is complex and ambiguous, there are many ways of expressing the same meaning [21], so it is almost impossible to find every fact in free text. Our study is therefore based on the assumption that people and organizations keep repeating the same mistakes. As a result, if our framework can extract the frequent causes automatically, the manual workload for the rest will be greatly reduced.

3.1. Framework for extraction

Automatic Keyphrase Extraction (AKE) is a natural language processing (NLP) task whose approaches can be divided into two kinds [17]: supervised and unsupervised. Although current supervised AKE approaches deliver promising results, both labeling the data and sorting the facts manually are time-consuming. Unsupervised AKE, which needs no training data, is a recent trend aimed at discovering the underlying structure of a document [22]. The graph-based model is a typical unsupervised AKE method [23, 24], in which the whole text is converted into a network with words as nodes. Each node receives a weight that evaluates its importance according to some standard; the nodes are then ranked by weight, and the top-ranked nodes are selected as keyphrases. However, a graph-based model cannot guarantee that a phrase representing the text theme is ranked at the top if the phrase does not occur frequently in the text. In the text of Fig. 3, some phrases occur much more often than any of the keyphrases. For example, Songyuan appears 3 times, Property Management appears 4 times, and Property Management Co., LTD. appears 3 times.

Figure 4: Keyphrase extraction process stages.

Although the topology of the weighted network cannot be used as a basis for extraction in our data, the following two features are still valuable.

1. The core of a causal factor is usually a verb phrase in Chinese.

2. If people repeat the same mistakes, which is the assumption discussed in section 3, a causal factor in one case will also appear in others.

Based on the above features of the case text, the whole workflow is depicted in Fig. 4. The core parts of this process are candidate identification (step 2) and feature engineering (step 3). In step 2, candidate phrase sets are identified through dependency syntax analysis (DSA) and heuristic rules, which extract the core meaning of every sentence. In step 3, candidate phrase sets with similar semantics are brought together by semantic clustering to form keyphrase sets (facts).

3.2. Case text pre-processing

In the pre-processing stage, text data are converted into a machine-readable format to reduce their complexity. In Chinese, a part of a sentence that provides additional information for the sentence is called a sense group [25], and the sense groups of a sentence are delimited by commas, semicolons, and full stops. We believe that a sense group can retain the whole meaning of a fact. To this end, after noisy symbols are removed, sentences are segmented at these three kinds of punctuation. In our study, the resulting segments are called candidate clauses.
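The segmentation step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `candidate_clauses`, the choice of full-width punctuation marks as delimiters, and the set of "noisy symbols" (whitespace and quotation marks) are all assumptions made for the example.

```python
import re
from typing import List

# Assumed delimiters: full-width comma, semicolon and full stop used in Chinese text.
DELIMITERS = re.compile("[，；。]")
# Assumed "noisy symbols": whitespace and quotation marks (hypothetical choice).
NOISE = re.compile(r"[\s“”‘’\"']+")


def candidate_clauses(text: str) -> List[str]:
    """Remove noisy symbols, then split the text into candidate clauses."""
    cleaned = NOISE.sub("", text)
    # Empty strings appear after a trailing delimiter and are dropped.
    return [part for part in DELIMITERS.split(cleaned) if part]


clauses = candidate_clauses("安全检查 不到位，未设置“警示标志”；现场管理混乱。")
# clauses holds three candidate clauses, one per sense group.
```

Only full-width marks are used as delimiters here so that decimal points and thousands separators inside numbers are not treated as clause boundaries.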

3.3. Identiﬁcation of candidate phrases sets

In this stage, candidate clauses are transformed into candidate phrase sets.

To detect candidate phrase sets, previous studies used three main methods: N-gram based [26, 27], Part-Of-Speech (POS) sequence based [28], and a combination of both [29]. All of these methods fall into lexical analysis. Given the characteristics of our data, a novel method based on syntactic analysis is proposed in this paper. Dependency parsing is a vital grammar analysis tool [30]. Rather than the constituents and structure of single phrases, dependency grammar directly describes binary grammatical relations between words.

Figure 5: Two examples of candidate identiﬁcation. Candidate clauses in the two examples are the same.

Because of diﬀerent segmentations, results are very diﬀerent.

Fig. 5 shows two examples of dependency syntax analysis (DSA) of a sentence. A Chinese sentence is cut into words (or phrases), each of which is shown in the top part of a word box in Fig. 5; the bottom part of the box gives its sequence number and part of speech. Above the word boxes, each directed edge points from a headword to its dependent, and the labels all come from a fixed library of syntax relations [31]. The dependency structure must contain a root node, which is the head of the others. Note that if the sequence number of the headword is less than that of the dependent, the arc is called a reverse syntax relation.

With heuristic rules, candidate phrase sets are extracted from the DSA results of the candidate clauses. Generally, the root of a sentence is a verb phrase, which is the core of a causal factor, so the rules start from the root phrase. Two further rules help to find the remaining candidate phrases, if they exist.

1. Starting from the root, its nearest dependent is extracted.

2. The headword and the dependent of the reverse syntax relation nearest to the root are extracted, provided the relation is of certain types. These types are direct object (dobj), object of preposition (pobj), and adjectival complement (acomp).

In sub-figure b of Fig. 5, the last phrase is the root of the whole sentence, so the sentence contains no reverse syntax relation. Following rule 1, the nearest dependent of the root, safety inspection, is extracted. By the rules, the candidate phrase set of this clause is 'safety inspection not appropriate implementation'.
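The two rules can be sketched on an already parsed clause. The sketch below assumes a hypothetical token format (dictionaries with `id`, `text`, `head` and `rel`, where `head` 0 marks the root) and hypothetical relation labels for the example; it does not call any particular parser. "Nearest" is measured by sequence number, and ties are broken in favor of the earlier token, which is a choice made here and not stated in the text.

```python
from typing import Dict, List

# Relation types named in rule 2.
REVERSE_RELATIONS = {"dobj", "pobj", "acomp"}


def candidate_phrase_set(tokens: List[Dict]) -> List[str]:
    """Apply the root rule and the two heuristic rules to one parsed clause."""
    root = next(t for t in tokens if t["head"] == 0)
    picked = {root["id"]}

    # Rule 1: the dependent of the root that is nearest to it.
    dependents = [t for t in tokens if t["head"] == root["id"]]
    if dependents:
        nearest = min(dependents, key=lambda t: abs(t["id"] - root["id"]))
        picked.add(nearest["id"])

    # Rule 2: among reverse arcs (headword precedes dependent) of the listed
    # types, take the one nearest to the root and keep both of its ends.
    reverse = [t for t in tokens
               if 0 < t["head"] < t["id"] and t["rel"] in REVERSE_RELATIONS]
    if reverse:
        arc = min(reverse, key=lambda t: min(abs(t["id"] - root["id"]),
                                             abs(t["head"] - root["id"])))
        picked.update({arc["id"], arc["head"]})

    # Return the selected phrases in their original order.
    return [t["text"] for t in tokens if t["id"] in picked]


# Phrase-level parse modelled on sub-figure b of Fig. 5 (labels are hypothetical).
clause = [
    {"id": 1, "text": "Songy Tians Property Manag Co., LTD", "head": 3, "rel": "nsubj"},
    {"id": 2, "text": "safety inspection", "head": 3, "rel": "dep"},
    {"id": 3, "text": "not appropriate implementation", "head": 0, "rel": "root"},
]
phrase_set = candidate_phrase_set(clause)
```

For this clause the root is the last phrase, so no reverse arc exists and only rule 1 contributes.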

Moreover, an important observation in Fig. 5 is that the results differ although the candidate clauses and the rules are the same. The reason is the different ways of segmenting the sentence. In sub-figure a of Fig. 5, the clause is divided into words with classical methods, and the rules cannot find the complete fact. In sub-figure b, the same sentence is cut into phrases rather than words. In Chinese, a phrase is a single word or a group of words that acts as a single unit in the grammar of a sentence. In the example above, a group of words is combined into Songy Tians Property Manag Co., LTD, which is a noun phrase, and the phrase in the last box is an adverbial phrase.

A few kinds of noun phrases, such as organizations and locations, can be found by an NLP technology called named-entity recognition (NER) [32]. Other kinds of phrases, including some noun phrases, need a novel method. Phrase extraction is essentially the task of identifying combinations of words that show some idiosyncrasy in a given corpus [33]. In this paper, this idiosyncrasy is evaluated by a mixed index [34], defined as follows:

$$\mathrm{Score}(b) = \mathrm{PMI}(w_i, w_j) + \min\big(H_{LC}(b),\, H_{RC}(b)\big) \qquad (1)$$

In phrase extraction, two sequential words in the text are called a bigram. Let $w_i, w_j$ be a bigram in the corpus, denoted by $b$. The score of bigram $b$ in Eq. (1) is composed of two parts, which together evaluate whether $b$ can be a phrase. Specifically, $\mathrm{PMI}(w_i, w_j)$ is the inner connection index and $\min(H_{LC}(b), H_{RC}(b))$ is the outer independence index.

Pointwise mutual information (PMI) is one of the standard connection measures in the

phrase extraction, which was introduced into NLP by Church and Hanks [35].

$$\mathrm{PMI}(w_i, w_j) = \log\left(\frac{P(w_i, w_j)}{P(w_i) \times P(w_j)}\right) \qquad (2)$$

$P(w_i, w_j)$ is the probability of the bigram $w_i, w_j$, which can be obtained by maximum likelihood estimation: $P(w_i, w_j) = C(w_i, w_j)/N$, where $C(w_i, w_j)$ is the number of occurrences of the bigram and $N$ is the number of words in the corpus. $P(w_i)$ and $P(w_j)$ can be estimated in the same way.
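As a worked illustration of Eq. (2), consider hypothetical counts that are not taken from the case corpus: a corpus of N = 1024 words in which the bigram occurs 4 times, its first word 4 times and its second word 8 times. The two words are written here as $w_i$ and $w_j$. The base of the logarithm is not stated in the text; base 2 is assumed for this example.

```latex
\begin{aligned}
P(w_i, w_j) &= \tfrac{4}{1024}, \qquad P(w_i) = \tfrac{4}{1024}, \qquad P(w_j) = \tfrac{8}{1024} \\
\mathrm{PMI}(w_i, w_j) &= \log_2 \frac{4/1024}{(4/1024)\,(8/1024)}
  = \log_2 \frac{4 \times 1024}{4 \times 8}
  = \log_2 128 = 7
\end{aligned}
```

A positive value means that the two words occur together more often than they would if they were independent.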

PMI, as an inner connection index, cannot by itself tell whether a bigram is a complete phrase. $\mathrm{PMI}(\text{Songyuan}, \text{Tianshan})$, for example, may be high, yet the bigram is only part of the whole noun phrase 'Songyuan Tianshan Property Manag Co., LTD'. The outer index is therefore used to check whether a bigram is independent of its contextual words.

If the contextual words of a bigram vary constantly, we believe that the bigram may well be a complete semantic unit [36] (phrase). Information entropy can be used to measure the chaos and unpredictability of a random variable. Let $LC(b) = \{w_1, \ldots, w_n\}$ be the set of left context words of the bigram $b$. The left entropy of the bigram can then be defined as:

$$H_{LC}(b) = -\sum_{w_k \in LC(b)} P(w_k) \log P(w_k) \qquad (3)$$

By maximum likelihood estimation, $P(w_k) = C(w_k)/N$, where $C(w_k)$ is the number of occurrences of word $w_k$ immediately to the left of $b$, and $N$ is the total number of occurrences of all adjacent words to the left of $b$. The right entropy of $b$, $H_{RC}(b)$, is obtained in the same way.

Finally, the bigrams are ranked by these scores, and the top-ranked ones are returned as phrases. Note that phrase extraction can be run repeatedly until as many whole semantic units as possible are returned.
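The mixed index of Eq. (1) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `bigram_scores` is hypothetical, the corpus is assumed to be already segmented into lists of words, context words are only counted within the same sentence, and logarithms are taken to base 2 because the text does not state the base.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def bigram_scores(sentences: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Score every bigram as PMI + min(left entropy, right entropy)."""
    words: Counter = Counter()
    bigrams: Counter = Counter()
    left = defaultdict(Counter)   # bigram -> counts of words seen just before it
    right = defaultdict(Counter)  # bigram -> counts of words seen just after it
    for sent in sentences:
        words.update(sent)
        for i in range(len(sent) - 1):
            b = (sent[i], sent[i + 1])
            bigrams[b] += 1
            if i > 0:
                left[b][sent[i - 1]] += 1
            if i + 2 < len(sent):
                right[b][sent[i + 2]] += 1
    n = sum(words.values())  # number of words in the corpus

    def entropy(context: Counter) -> float:
        total = sum(context.values())
        if total == 0:
            return 0.0  # no context observed: treated as zero entropy (assumption)
        return -sum(c / total * math.log2(c / total) for c in context.values())

    scores = {}
    for b, count in bigrams.items():
        pmi = math.log2((count / n) / ((words[b[0]] / n) * (words[b[1]] / n)))
        scores[b] = pmi + min(entropy(left[b]), entropy(right[b]))
    return scores


corpus = [["x", "A", "B", "y"], ["z", "A", "B", "w"], ["x", "A", "C"]]
scores = bigram_scores(corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus the bigram ("A", "B") has varied neighbors on both sides, so it scores higher than ("x", "A"), which has the same PMI but never has a left neighbor.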

3.4. Feature engineering

In this step, accident facts are identified. In AKE, the characteristics that distinguish keyphrases from the other candidates are called features. TF-IDF (Term Frequency - Inverse Document Frequency) is the most popular feature [37, 38]; it selects candidates that are frequent in a given document but infrequent in the whole corpus. As shown in Fig. 3, however, facts cannot be identified in this way because they occur too rarely within a document. Based on the assumption that people always repeat the same mistakes, a novel feature is used to pick keyphrases in our study.

Repeating the same mistakes means that facts with similar semantics appear in many different candidate phrase sets. As a result, clustering based on semantic similarity can separate keyphrase sets from the others. Edit distance evaluates the semantic similarity [39] between two candidate phrase sets by counting the minimum number of operations required to transform one string into the other. In our work, the operations are the insertion, removal, or substitution of a character in the string. This kind of distance is called the Levenshtein distance [40], which is defined as follows.

$$\mathrm{sem}(a, b) = \mathrm{lev}(a, b) = \begin{cases} |a| & \text{if } |b| = 0 \\ |b| & \text{if } |a| = 0 \\ \mathrm{lev}(\mathrm{tail}(a), \mathrm{tail}(b)) & \text{if } a[0] = b[0] \\ 1 + \min \begin{cases} \mathrm{lev}(\mathrm{tail}(a), b) \\ \mathrm{lev}(a, \mathrm{tail}(b)) \\ \mathrm{lev}(\mathrm{tail}(a), \mathrm{tail}(b)) \end{cases} & \text{otherwise} \end{cases} \qquad (4)$$

Songyuan is the name of a city. Tianshan is a mountain.

Here $\mathrm{lev}(a, b)$ is the Levenshtein distance between the two strings $a$ and $b$, and $|a|$, $|b|$ are their lengths. The tail of a string $a$, $\mathrm{tail}(a)$, is the string of all but the first character of $a$, and $a[n]$ is the $n$th character of $a$, counting from character 0. For two non-empty strings $a$ and $b$ that are exactly the same, the value reported in this paper is $\mathrm{lev}(a, b) = 1$, although the recursion in Eq. (4) on its own returns 0 for identical strings; the distances quoted below (for example 1.9 in Fig. 6) are therefore not raw character-edit counts. Further, the larger the difference between $a$ and $b$, the higher the Levenshtein distance, so $\mathrm{lev}(a, b)$ can be used to evaluate semantic similarity. Let $T$ be the set of candidate phrase sets. The Levenshtein distance is computed for each pair of phrase sets in $T$, and the result is a similarity matrix of size $|T| \times |T|$, denoted by $SC$.
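A minimal sketch of the distance in Eq. (4) and of the pairwise matrix follows. It computes the standard integer Levenshtein distance bottom-up, which gives the same result as the recursion; under this definition two identical strings have distance 0. The function names `lev` and `distance_matrix` are chosen for the example, and each candidate phrase set is assumed to be represented as a single string.

```python
from typing import List


def lev(a: str, b: str) -> int:
    """Levenshtein distance with insertion, removal and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1])  # characters match: no extra cost
            else:
                curr.append(1 + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def distance_matrix(phrase_sets: List[str]) -> List[List[int]]:
    """Pairwise distances between all candidate phrase sets."""
    return [[lev(p, q) for q in phrase_sets] for p in phrase_sets]


matrix = distance_matrix(["ab", "abc", "xyz"])
```

The matrix is symmetric with zeros on the diagonal, so only the upper triangle needs to be computed when the number of phrase sets is large.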

Then, SC is clustered. Many algorithms can cluster data efficiently, but not all of them can work on a distance matrix. DBSCAN [41] is a robust algorithm that does not need the number of clusters to be specified. It requires two parameters: the radius of the neighborhood of a point, denoted by ε, and the minimum number of points (minPts) required to form a dense region. A point is a core point if at least minPts points (including the point itself) are within distance ε of it. Starting from a core point, DBSCAN clusters all points (core or non-core) that are reachable from it.

Parameters influence the result of every algorithm, so choosing them is key to every mining task. For DBSCAN, the user needs to specify ε and minPts.

• minPts is the desired minimum cluster size. Because people always repeat the same mistakes, minPts can be set a little higher. Generally, higher values are better for data sets with noise and yield more significant clusters. Here a noise set is a phrase set whose content has nothing to do with the cause of the accident. In the clustering process of our study, minPts is always 5.

• It is hard to estimate ε because there are many ways to express the same meaning in free text, but a minimum value of ε is much easier to obtain than a maximum. If two candidate phrase sets are the same, which is very common in SC, the Levenshtein distance between them is 1, so the lower bound of ε is 1. If ε is chosen much too small, a large part of the data will not be clustered. An example is given in Fig. 6: both candidate phrase sets are about warning signs being ignored, yet the Levenshtein distance between them, 1.9, is not small. Hence, if ε < 1.9, DBSCAN will quite possibly treat them as noise sets. If ε is too high, clusters will merge and most nodes will end up in the same cluster.

Figure 6: The two sentences have similar semantics, which can be classiﬁed into one causal factor. However,

the Levenshtein distance between them is not small, which is 1.9.

In our work, a succinct multi-density clustering is applied to our candidate phrase sets. The algorithm is as follows:

1. For the candidate phrase sets in SC, ε is determined by comparison.

2. With ε, clusters are mined from SC.

3. If any two phrases in one cluster satisfy lev(a, b) = Max(|a|, |b|), the algorithm stops. All clusters mined by the algorithm are the result.

4. If not, the candidate phrase sets belonging to any cluster are deleted from SC to form a new SC, and the algorithm returns to step 1.

The whole process is depicted in Fig. 7.

Figure 7: The flow of the multi-density algorithm, with $\varepsilon_1 < \varepsilon_2 < \ldots < \varepsilon_n$. SC is first clustered with $\varepsilon_1$. The phrases in any cluster are then removed from SC, and the rest are clustered with $\varepsilon_2$. These two steps are repeated until the stopping condition is met.

The subgraph named Round1 in Fig. 8 depicts the relationship between ε and the number of clusters. The whole SC is clustered with different values of ε from 1.1 to 1.5. The number of clusters peaks at ε = 1.32, which is chosen as the radius of the neighborhood in round 1. The same pattern in the number of clusters appears for the rest of the data until the stopping rule is satisfied. Note that the stopping rule is lev(a, b) = Max(|a|, |b|), which means that the strings a and b have no character in common. In our data set, round 6 is the last clustering, with $\varepsilon_6 = 3$. The radii from $\varepsilon_1$ to $\varepsilon_6$ are depicted in Fig. 8.
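The multi-density procedure can be sketched in pure Python as below. Several points are assumptions made for the sketch and are not confirmed by the text: ε is chosen in each round as the candidate value that yields the most clusters (one reading of "determined by comparison"), the candidate values per round are supplied by the caller, the clusters of the final round are kept in the result, and the plain integer Levenshtein distance is used, so the ε values are not on the same scale as those reported in Fig. 8. All function names are hypothetical.

```python
from typing import List, Sequence


def lev(a: str, b: str) -> int:
    """Levenshtein distance (insertion, removal, substitution)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def dbscan(items: Sequence[str], eps: float, min_pts: int) -> List[List[str]]:
    """Plain DBSCAN on Levenshtein distances; unclustered items are noise."""
    n = len(items)
    neigh = [[j for j in range(n) if lev(items[i], items[j]) <= eps] for i in range(n)]
    assigned = [False] * n
    clusters = []
    for i in range(n):
        if assigned[i] or len(neigh[i]) < min_pts:
            continue  # already clustered, or not a core point
        cluster, stack = [], [i]
        assigned[i] = True
        while stack:
            p = stack.pop()
            cluster.append(items[p])
            if len(neigh[p]) >= min_pts:  # only core points expand the cluster
                for q in neigh[p]:
                    if not assigned[q]:
                        assigned[q] = True
                        stack.append(q)
        clusters.append(cluster)
    return clusters


def multi_density(phrases: Sequence[str], eps_rounds: Sequence[Sequence[float]],
                  min_pts: int) -> List[List[str]]:
    """Cluster round by round, removing clustered phrases after each round."""
    remaining, result = list(phrases), []
    for grid in eps_rounds:
        # Step 1: pick the candidate eps that gives the most clusters.
        eps = max(grid, key=lambda e: len(dbscan(remaining, e, min_pts)))
        clusters = dbscan(remaining, eps, min_pts)  # step 2
        result.extend(clusters)
        # Step 3: stop when a cluster holds two phrases with lev = max length.
        if any(lev(a, b) == max(len(a), len(b))
               for c in clusters for a in c for b in c if a != b):
            break
        clustered = {p for c in clusters for p in c}
        remaining = [p for p in remaining if p not in clustered]  # step 4
    return result


demo = ["abcd", "abce", "abcf", "wxyz12", "wxyz34", "wxyz56", "q"]
found = multi_density(demo, eps_rounds=[[1, 1.5], [2, 3]], min_pts=2)
```

In the demo, the tight group is found in the first round with a small radius, and the looser group is found in the second round with a larger one, which is the purpose of using several densities. The study itself uses minPts = 5; the smaller value here only keeps the example short.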

Figure 8: The value of ε used in the multi-density clustering.

3.5. Summary for extracting causal factors

The 267 accident case texts, which cover accidents that happened from 1998 to 2018, are input into our extraction framework. From these texts, 5598 candidate clauses are formed, and DSA with heuristic rule extraction turns them into 5598 candidate phrase sets ready for clustering analysis. After six rounds of multi-density DBSCAN, the final result is 355 clusters, and 664 phrase sets are not contained in any cluster. Of these 664 sets, only 3 are not noise sets.

Note that only 40 of the 355 clusters are noise sets. After duplicates are removed, 1669 phrase sets about accident causation remain as the keyphrase sets. Each case text is then searched for these keyphrase sets to obtain the recall. More specifically, if a clause in the text includes a whole keyphrase set, the causal factor is identified. The recall of our framework is 87%.

4. New causal factors discovery

As discussed above, risk factors in construction are far more numerous than causal factors. An excellent review [5] investigated 55 previous papers and summarized 95 sub-factors into 13 main factors. In contrast, ConAC, which is a causal model, considers only four main factors and 19 sub-factors [11]. Therefore, to reveal new causal factors, we try to classify the 1669 facts into the 95 sub-factors until some facts cannot be placed. If some of these neglected facts share common characteristics, we can say that a novel causal factor is discovered.

4.1. Role mismatch

The first fact that caught our attention is 'fake many times to defraud franchise'. Not only can this fact not be classified into any of the 13 main factors, but it also made us wonder what had happened in that accident. We went back to the case text and found that it was a complicated accident. In brief, to save money, a big project was first disguised as a small one by lying to the government. The client then did the jobs of the contractor, the supervisor, and the engineering designer. Because of the improper plan, the insufficient strength of the columns led to the collapse of the concrete formwork. Seven people died and over ten were injured in this accident. It is impossible for respondents to a questionnaire or interview to admit such a severe crime.

Stakeholders are the organizations that are actively involved in the project's work or have something to either gain or lose from the project [42]. In China, construction involves more kinds of directly involved organizations than other industries, five in total: the government, the client, project supervision, the contractor, and others (land survey, design, equipment leasing, etc.). Accidents caused by a stakeholder failing to fulfil its responsibilities have drawn attention in previous studies [43, 44]. However, few have noted that a stakeholder may do something beyond its scope of duties and thereby cause an accident. In this paper, this is called role mismatch. One example is the client in the previous paragraph.

From the 267 case texts, six kinds of role mismatch are summarized and listed in Tab. 1. Except for supervisors, the other five kinds of stakeholders are included. The second column of Tab. 1 gives the number of occurrences of each sub-factor in the 267 cases. If two factors frequently appear in the same accident, there may be a strong correlation between them. Before discussing the relations between role mismatch and other causal factors, a classification of causal factors is proposed first.

A characteristic of the case data is that each accident fact has a stakeholder who must be held accountable. As a result, the six main factors in our classification correspond to six different kinds of stakeholders in the construction industry. Each stakeholder's responsibilities for construction safety are defined in two laws of China [45, 46], the Construction Law and the Regulations on construction engineering quality management, respectively, so the sub-factors all come from these two laws. Compared with open interpretation [12], the definitions of these factors in the law are stricter. The main factors and their sub-factors are listed in Tab. 3 of Appendix A. Note that the number in brackets is the code of the sub-factor, and these codes correspond to the numbers in the last column of Tab. 1.

As noted above, two factors that frequently appear in the same accident may be strongly correlated. Moreover, causal diagrams [47] of construction accidents can be deduced from these correlations. The factors that appear in the same accident as a role mismatch are listed in the last column of Tab. 1, with the number of co-occurrences in brackets after the factor code. Based on causal diagrams [47], the mechanism of role mismatch will be discussed in our future work. Here we only offer some preliminary observations. Except for the government appointing a sub-contractor, reducing costs and saving time may be the common purpose of the other five sub-factors.
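The co-occurrence counts in the last column of Tab. 1 can be obtained with a simple tally. The sketch below assumes that each accident case has already been reduced to a set of factor codes; the function name `co_occurrence`, the label "RM-client" for a role-mismatch sub-factor, and the three example cases are hypothetical and are not data from the study.

```python
from collections import Counter
from typing import Iterable, Set


def co_occurrence(cases: Iterable[Set[str]], target: str) -> Counter:
    """Count how often each other factor appears in the cases that contain `target`."""
    counts: Counter = Counter()
    for factors in cases:
        if target in factors:
            counts.update(f for f in factors if f != target)
    return counts


# Hypothetical cases: each set holds the factor codes identified in one accident.
cases = [
    {"RM-client", "6-2", "2-7"},
    {"2-7", "2-3"},
    {"RM-client", "2-7"},
]
counts = co_occurrence(cases, "RM-client")
```

Sorting `counts.most_common()` gives the factors in descending order of co-occurrence, which is the form in which they would be compared for each role mismatch.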

Table 1: Relations between role mismatch and other causal factors

Role mismatch | Occurrence number | Other factors in the same accident
Client: making construction plan | 1 | 6-2(1) 2-7(1) 2-12(1) 2-3(1)
Government: appoint sub-contractor | 1 | 2-12(1) 2-7(1) 3-1(1) 2-3(1)
Contractor: construction without competency | 41 | 2-7(33) 2-12(12) 3-4(5) 4-1(16) 2-13(26) 1-3(11) 5-2(3) 2-11(6) 2-3(23) 3-1(11) 4-3(2) 3-5(1) 1-1(23) 2-4(11) 6-1(2) 2-5(15) 5-1(22) 2-8(9) 2-9(2) 6-2(6) 1-2(18) 2-1(9) 4-4(2) 2-10(13) 3-2(16) 2-6(8) 4-2(1) 2-2(5)
Contractor: illegal transfer | 5 | 2-13(5) 1-1(3) 1-2(2) 3-2(3) 2-7(5) 2-10(3) 3-1(2) 4-3(1) 2-3(5) 2-5(2) 2-6(1) 4-1(2) 5-1(4) 2-11(2) 1-3(1) 2-12(2) 2-8(3) 5-2(2) 2-2(1) 2-4(1) 2-1(3)
Worker: labour without competency | 57 | 1-2(48) 2-6(22) 4-3(7) 2-12(27) 2-3(43) 1-3(19) 3-4(6) 2-1(12) 2-7(40) 2-5(19) 2-11(5) 4-2(2) 5-1(37) 2-10(18) 6-2(5) 2-4(25) 4-1(35) 2-13(17) 6-1(5) 3-1(12) 1-1(31) 3-2(16) 5-2(4) 4-4(2) 2-8(30) 2-2(14) 2-9(3)
Designer: without competency | 4 | 2-7(4) 1-2(3) 6-1(1) 4-4(1) 5-1(4) 1-1(3) 2-2(1) 2-10(1) 6-2(4) 3-2(3) 2-12(1) 3-4(1) 2-13(4) 2-4(2) 2-1(1) 3-5(1) 2-3(4) 3-1(2) 4-1(1)

Table 2: The other two neglected causal factors

Main factor | Sub-factors | Accident case title
Engineer contract management | Supplier: Failure to fully perform the contract | 2003-9-20 Lift cage falling
 | No engineer contract | 1996-3-14 The earth collapsed
 | No labor contract | 2003-5-15 The car crane collided with the high voltage line
 | Inappropriate contract management | 2002-3-15 Crane boom overturned
 | | 2003-9-12 Pipe network trench collapse
 | | 2002-11-6 Falling
Response for the accident | Delayed response | 2001-6-20 The outer cornice collapsed
 | | 2003-7-24 The building collapsed
 | No contingency plan | 2002-5-12 Explosion
 | | 2003-11-20 Construction collapse
 | Contingency plan has not been implemented | 2014-9-1 Poisoning in a sewage pumping station project
 | Inappropriate rescue | 2003-3-29 Poisoning in a sewerage project
 | | 2012-12-23 Carbon monoxide poisoning

4.2. More than one neglected factor

According to the case text data, two other main factors have been relatively little studied in construction accidents. One is engineering contract management, and the other is the response to the accident. We believe that they have been neglected because of data problems as well: it is hard to collect enough samples because very few people have participated in rescue or contract management.

With the case texts, other scholars may be inspired by these two factors and their sub-factors. In Tab. 2, the sub-factors are listed in the second column. Note that all of these sub-factors are summarized from real accident cases, whose dates and titles are given in the last column. The data we share on GitHub include these case texts.

5. Conclusion

Accident data are valuable: once the whole process of past accidents is revealed, future losses can be reduced. Because very few people have ever experienced an accident, such data are hard to obtain. Typical accident cases should be studied carefully, because most of these texts were paid for with human lives. To go beyond the limitations of manual analysis, this paper proposes a framework, based on a series of NLP technologies, for organizing data about accident causes. With it, some neglected causal factors are discovered. Role mismatch will be discussed further in our future studies.

http://www.safehoo.com/item/157796.aspx (last accessed 2021.01.15)

We believe that our framework can also be used to analyze Chinese case texts from other industries, and that it may inspire research on texts in other languages. Moreover, the social and economic climate can also affect the occupational incident system [11, 43]. As a result, other developing countries may also benefit from our study.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 71501022, 71901047, 71874020 and 71774021).

Appendix A

Table 3: Causal factors categorized by stakeholders

Stakeholder Causal factors (ID)

1. Worker and Work group
• Unsafe operation (1-1)
• Without competency (1-2)
• Tacit knowledge: ability, experience, knowledge, safety awareness (1-3)

2. Contractor
• Responsibilities of contractor are not fulfilled (2-1)
• Construction plan (2-2)
• Safety, quality supervision and control (2-3)
• Rules and regulation (2-4)
• Safety culture and climate (2-5)
• Safeguard procedures, equipment and sign (2-6)
• Inappropriate construction operation (2-7)
• Training and education (2-8)
• Site condition (2-9)
• Command (2-10)
• Verification of competency (2-11)
• Response to the accident (2-12)
• Competency of itself (2-13)

3. Client
• Safety management (3-1)
• Illegal construction (3-2)
• Supervising contractors (3-3)
• Project acceptance (3-4)
• Archives management (3-5)

4. Supervisor
• Supervising contractors (4-1)
• Communication with client (4-2)
• Competency of itself (4-3)
• Tacit knowledge: ability, experience, knowledge, safety awareness (4-4)

5. Government
• Guide and supervise (5-1)
• Inappropriate punishment (punishment is too light or laws are not strictly enforced) (5-2)
• Organization, mechanism, system (5-3)

6. Others
• Supplier: Material and equipment quality (6-1)
• Designer: Survey and design (6-2)
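
The factor IDs in Table 3 share a simple structure: the number before the hyphen identifies the stakeholder group. A minimal sketch of how an ID list can be grouped by stakeholder is given below. The dictionary is copied from the stakeholder labels in Table 3, while the function names are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict

# Stakeholder groups as numbered in Table 3.
STAKEHOLDERS = {
    1: "Worker and Work group",
    2: "Contractor",
    3: "Client",
    4: "Supervisor",
    5: "Government",
    6: "Others",
}


def stakeholder_of(factor_id):
    """Map a factor ID such as '2-7' to its stakeholder group name."""
    prefix = int(factor_id.split("-", 1)[0])
    return STAKEHOLDERS[prefix]


def group_by_stakeholder(factor_ids):
    """Group factor IDs by stakeholder, keeping the input order in each group."""
    groups = defaultdict(list)
    for fid in factor_ids:
        groups[stakeholder_of(fid)].append(fid)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative input only: a handful of IDs that appear in Table 3.
    example = ["2-7", "2-13", "2-3", "1-1", "5-1"]
    for name, ids in group_by_stakeholder(example).items():
        print(f"{name}: {', '.join(ids)}")
```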

References

[1] C. Tam, S. Zeng, Z. Deng, Identifying elements of poor construction safety management in China, Safety Science 42 (7) (2004) 569–586.

[2] A. Suraji, A. R. Duff, S. J. Peckitt, Development of causal model of construction accident causation, Journal of Construction Engineering and Management 127 (4) (2001) 337–344.

[3] P. H. Mohseni, A. A. Farshad, R. Mirkazemi, R. J. Orak, Assessment of the living and workplace health and safety conditions of site-resident construction workers in Tehran, Iran, International Journal of Occupational Safety and Ergonomics 21 (4) (2015) 568–573.

[4] V. V. Khanzode, J. Maiti, P. K. Ray, Occupational injury and accident research: A comprehensive review, Safety Science 50 (5) (2012) 1355–1367.

[5] A. Mohammadi, M. Tavakolan, Y. Khosravi, Factors influencing safety performance on construction projects: A review, Safety Science 109 (2018) 382–397.

[6] A. Gibb, H. Lingard, M. Behm, T. Cooke, Construction accident causality: learning from different countries and differing consequences, Construction Management and Economics 32 (5) (2014) 446–459.

[7] A. M. Williamson, A.-M. Feyer, D. R. Cairns, Industry differences in accident causation, Safety Science 24 (1) (1996) 1–12.

[8] S. Hide, S. Atkinson, T. C. Pavitt, R. Haslam, A. G. Gibb, D. E. Gyi, Causal factors in construction

[9] P. Mitropoulos, T. S. Abdelhamid, G. A. Howell, Systems model of construction accident causation, Journal of Construction Engineering and Management 131 (7) (2005) 816–825.

[10] A. Hale, D. Walker, N. Walters, H. Bolt, Developing the understanding of underlying causes of construction fatal accidents, Safety Science 50 (10) (2012) 2020–2027.

[11] S. Winge, E. Albrechtsen, B. A. Mostue, Causal factors and connections in construction accidents, Safety Science 112 (2019) 130–141.

[12] M. Behm, A. Schneller, Application of the Loughborough construction accident causation model: a framework for organizational learning, Construction Management and Economics 31 (6) (2013) 580–595.

[13] B. Abdolhamidzadeh, T. Abbasi, D. Rashtchian, S. A. Abbasi, Domino effect in process-industry accidents–an inventory of past events and identification of some patterns, Journal of Loss Prevention in the Process Industries 24 (5) (2011) 575–593.

[14] S. Tauseef, T. Abbasi, S. A. Abbasi, Development of a new chemical process-industry accident database to assist in past accident analysis, Journal of Loss Prevention in the Process Industries 24 (4) (2011) 426–431.

[15] Ministry of Housing and Urban-Rural Development, Case analysis of construction safety accidents, China Architecture Press, 2019.

[16] State Council of the People’s Republic of China, Regulations on the reporting, investigation and handling of production safety accidents (2007).
URL http://www.gov.cn/zwgk/2007-04/19/content_588577.htm

[17] Z. A. Merrouni, B. Frikh, B. Ouhbi, Automatic keyphrase extraction: An overview of the state of the art, in: 2016 4th IEEE International Colloquium on Information Science and Technology (CiSt), IEEE, 2016, pp. 306–313.

[18] S. A. of China, Regulation for the investigation and analysis of accidents involving casualties of enterprise employees.

[19] S. ChengPeng, Casualty accident analysis and prevention principle, Industrial Safety and Environmental Protection 05 (1982) 1–8.

[20] Z. A. Merrouni, B. Frikh, B. Ouhbi, Automatic keyphrase extraction: a survey and trends, Journal of Intelligent Information Systems (2019) 1–34.

[21] J. Piskorski, R. Yangarber, Information extraction: Past, present and future, in: Multi-source, multilingual information extraction and summarization, Springer, 2013, pp. 23–49.

[22] H. H. Alrehamy, C. Walker, Semcluster: unsupervised automatic keyphrase extraction using affinity propagation, in: UK Workshop on Computational Intelligence, Springer, 2017, pp. 222–235.

[23] T. Washio, H. Motoda, State of the art of graph-based data mining, ACM SIGKDD Explorations Newsletter 5 (1) (2003) 59–68.

[24] S. S. Sonawane, P. A. Kulkarni, Graph based representation and analysis of text document: A survey of techniques, International Journal of Computer Applications 96 (19).

[25] D. X. Zhou C., Difficulties and counter measures for machine understanding of Chinese: A viewpoint of the sense-group dynamics, Modern Foreign Languages (Quarterly) 23 (2).

[26] C. Huang, Y. Tian, Z. Zhou, C. X. Ling, T. Huang, Keyphrase extraction using semantic networks structure analysis, in: Sixth International Conference on Data Mining (ICDM’06), IEEE, 2006, pp. 275–284.

[27] Z. Liu, P. Li, Y. Zheng, M. Sun, Clustering to find exemplar terms for keyphrase extraction, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 257–266.

[28] K. Barker, N. Cornacchia, Using noun phrase heads to extract document keyphrases, in: Conference of the Canadian Society for Computational Studies of Intelligence, Springer, 2000, pp. 40–52.

[29] M. Grineva, M. Grinev, D. Lizorkin, Extracting key terms from noisy and multitheme documents, in: Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 661–670.

[30] H. Calvo, O. J. Gambino, A. Gelbukh, K. Inui, Dependency syntax analysis using grammar induction and a lexical categories precedence system, in: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2011, pp. 109–120.

[31] J. Nivre, M.-C. De Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, et al., Universal dependencies v1: A multilingual treebank collection, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 1659–1666.

[32] D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30 (1) (2007) 3–26.

[33] G. Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL (2009) 31–40.

[34] Hankcs, Introduction to natural language processing, People’s Posts and Telecommunications Press, 2019.

[35] K. Church, P. Hanks, Word association norms, mutual information, and lexicography, Computational Linguistics 16 (1) (1990) 22–29.

[36] C.-W. Lee, Y.-L. Wu, L.-C. Yu, Combining mutual information and entropy for unknown word extraction from multilingual code-switching sentences, Journal of Information Science & Engineering 35 (3).

[37] T. D. Nguyen, M.-Y. Kan, Keyphrase extraction in scientific publications, in: International Conference on Asian Digital Libraries, Springer, 2007, pp. 317–326.

[38] F. Liu, D. Pennell, F. Liu, Y. Liu, Unsupervised approaches for automatic keyword extraction using meeting transcripts, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 620–628.

[39] M. Baroni, J. Matiasek, H. Trost, Unsupervised discovery of morphologically related words based on orthographic and semantic similarity, arXiv preprint cs/0205006.

[40] G. Navarro, A guided tour to approximate string matching, ACM Computing Surveys 33 (1) (2001) 31–88. doi:10.1145/375360.375365.

[41] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, Vol. 96, 1996, pp. 226–231.

[42] R. Newcombe, From client to project stakeholders: a stakeholder mapping approach, Construction Management and Economics 21 (8) (2003) 841–848.

[43] Q. Chen, R. Jin, Multilevel safety culture and climate survey for assessing new safety program, Journal of Construction Engineering and Management 139 (7) (2013) 805–817.

[44] A. Pinto, I. L. Nunes, R. A. Ribeiro, Occupational risk assessment in construction industry–overview and reflection, Safety Science 49 (5) (2011) 616–624.

[45] State Council of the People’s Republic of China, Regulations on construction engineering quality management (1997).
URL http://www.gov.cn/flfg/2005-08/06/content_20998.htm

[46] Standing Committee of the National People’s Congress, Construction law of People’s Republic of China (2019).
URL http://www.npc.gov.cn/npc/c30834/201905/0b21ae7bd82343dead2c5cdb2b65ea4f.shtml

[47] J. Pearl, D. Mackenzie, The book of why: the new science of cause and effect, Basic Books, 2018.