%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Background
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% In a thesis, every section starts a new page, hence \clearpage
\clearpage
\section{Background and related work}
\label{sec:background}
\subsection{Process mining}
To understand process mining it is necessary to start with the concept of \emph{event logs}.
An event log is a collection of data recorded from an information system.
Activities executed in the system leave records which are stored in the event log.
\emph{Process mining} describes methods of using the event log data to extract useful information.
The information can be used to discover previously unknown processes, to find bottlenecks and inefficiencies, to detect and understand anomalies, and to support redesign actions \cite{van2015extracting}.
In this thesis I look at \emph{process discovery}, which means finding the underlying process model
by reading the event log.
In process discovery, the model should be generated without any a priori information, meaning that the algorithm should be unsupervised \cite{van2013discovering}.
The process model can be expressed as a directed graph or a \emph{Petri net}.
These models can be used for business intelligence (BI) solutions, or they can be compared to the real life
known models to check conformance \cite{van2013discovering}.
Discovering strictly sequential processes from event logs is straightforward. However, modern systems are increasingly concurrent, which complicates the issue.
Parallel systems generate events in an undetermined order based on how the processes are interleaved.
When the events are logged, one particular ordering is realized, resulting in a totally ordered sequential log \cite{van2004workflow}.
The challenge is having to discover and construct the parallelism from the event log.
Furthermore, the event log may be \emph{incomplete}, which means not all possible behaviour is present in the logs \cite{van2013discovering}.
Additionally, the event logs can be \emph{noisy} and contain random infrequent behaviour \cite{van2013discovering}.
It may be undesirable to present the infrequent behaviour in the models.
These characteristics make process discovery a task of challenging trade-offs.
\subsubsection{Event logs and traces}
\label{sec:eventtheory}
\begin{definition}
Let $A$ be the set of all \emph{events}.
$\sigma \in A^*$ is a \emph{trace} describing a sequence of events.
The event log $L \in \mathbb{B}(A^*)$ is a \emph{multi-set} of \emph{traces}.
\end{definition}
A process consists of a set of distinct \emph{activities}.
An activity is a well-defined step belonging to the process (for example, ``resize an image'').
Each event in the event log is a log entry documenting an execution of an activity.
In other words, $A$ is the vocabulary of events.
The event may also hold additional information about the process.
A workflow \emph{trace} contains a sequence of events documenting a single execution of the process (a \emph{process instance}).
This kind of single process execution is called a \emph{case} (later in this thesis also: \emph{submission}).
The event log can be seen as a \emph{multi-set} of \emph{traces}.
A multi-set (a \emph{bag}) is a set that allows multiple instances of its elements \cite{van2015extracting}.
In this thesis it is assumed that any concurrent execution of events can be recorded into
an ordered and sequential event log.
Concurrent execution produces sequences whose ordering varies between process instances.
The parallelisms can be discovered by observing the different traces and their frequencies.
For example, for a process consisting of activities $a,b,c,d,e \in A$, one trace could be
$\langle a,b,c,d,e \rangle$ and another $\langle a,c,d,b,e \rangle$.
These different traces could be found in the event log a number of times, some more frequent than others.
In process discovery the task is to construct a model with sequential and parallel steps
that describes a process that can generate these traces \cite{van2013discovering}.
The frequency of a trace matters in process discovery, and this is why the event log is a multi-set.
Figure \ref{tab:multisettrace} illustrates the mathematical abstraction of an event log as a multi-set of traces, while figure \ref{tab:interleaved} shows a real-life event log where traces are interleaved in a physical log file.
\begin{figure}[htb]
\centering
% footprint table example
\begin{subfigure}[h]{0.4\linewidth}
\begin{center}
\begin{tabular}{r | l}
ID & trace\\
\hline
1 & $\langle a,b,c,d,e \rangle$ \\
2 & $\langle a,c,d,b,e \rangle$ \\
3 & $\langle a,b,c,d,e \rangle$ \\
\end{tabular}
\end{center}
\caption{Event log as multi-set of traces}
\label{tab:multisettrace}
\end{subfigure}
% directed graph example
\begin{subfigure}[h]{0.4\linewidth}
\begin{center}
\begin{tabular}{r | l}
ID & event\\
\hline
1 & a \\
2 & a \\
1 & b \\
1 & c \\
2 & c \\
3 & a \\
\end{tabular}
\end{center}
\caption{Interleaved event log (truncated)}
\label{tab:interleaved}
\end{subfigure}
\caption{Event log illustrated}
\end{figure}
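To make this concrete, the following Python sketch (purely illustrative, not part of the thesis tooling; the case identifiers and activity names are hypothetical) groups an interleaved log by case identifier and represents the resulting event log as a multi-set of traces.
\begin{verbatim}
from collections import Counter

# An interleaved event log: (case id, activity) pairs in logging order.
raw_log = [(1, "a"), (2, "a"), (1, "b"), (1, "c"), (2, "c"),
           (3, "a"), (2, "d"), (1, "d"), (2, "b"), (1, "e"),
           (3, "b"), (2, "e"), (3, "c"), (3, "d"), (3, "e")]

# Group events by case id to recover one trace per process instance.
traces = {}
for case_id, activity in raw_log:
    traces.setdefault(case_id, []).append(activity)

# The event log L as a multi-set (bag) of traces; Counter keeps frequencies.
event_log = Counter(tuple(trace) for trace in traces.values())
print(event_log)
# e.g. Counter({('a','b','c','d','e'): 2, ('a','c','d','b','e'): 1})
\end{verbatim}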
When dealing with discovering processes from event logs, there are two clear challenges: \emph{incompleteness} and \emph{noise}.
Discovering the model from the event log is difficult since one cannot assume that all the possible traces are present \cite{van2013discovering}.
This is because the order in which parallel processes finish is random.
For a set of $n$ concurrent processes, there are $n!$ possible orderings in the trace.
The number of sequences observed in the logs is often much smaller than the number of all possible sequences \cite{van2007business}.
Moreover, some sequences can be inherently more frequent than others.
Such infrequent behaviour can be seen as undesirable noise, but infrequent traces may also carry important information.
This makes decisions about trade-offs necessary.
The model should describe the process accurately even when the event log is incomplete or noisy.
Van der Aalst et al. \cite{van2013discovering} describe four criteria that need to be balanced when generating a model from a log: fitness, precision, generalization, and simplicity.
\emph{Fitness} measures whether the model fits the log, meaning that the traces in the event log can be generated by the discovered model.
\emph{Precision} measures whether the model allows only the behaviour observed in the event log. The model should describe the traces in the log, but should exclude completely unrelated behaviour not seen in the traces.
\emph{Generalization} relates to the unseen behaviour. The model should be general enough that it describes the process while allowing the cases that were not observed in the specific set of sequences.
Lastly, \emph{simplicity} relates to the noisiness. The models should be as simple as possible and undesirable, rare, or exceptional behaviour should not be included if it increases the model complexity.
\subsubsection{Process models and discovery}
\label{sec:processmodelsdsc}
A simple way to describe a process model is a \emph{directed graph}. A directed graph is a set of \emph{nodes} that are connected together by \emph{edges}.
\begin{definition}
A directed graph $G = (\mathcal{N}, \mathcal{E})$ consists of the set $\mathcal{N}$ of nodes and the set $\mathcal{E} \subseteq \{ (x,y) | x,y \in \mathcal{N} \} $ of edges, which are ordered pairs of nodes.
\end{definition}
A process can be described as a directed graph by having a graph node for each activity of the process ($\mathcal{N} = A$). The graph edges describe the possible transitions between the activities.
In this model, any two parallel activities $a, b \in A$ will have directed arcs $(a,b)$ and $(b,a)$ between them.
A trace is generated by traversing the graph and visiting every node \textit{exactly once}.
This type of path is known as a \textit{Hamiltonian path} or a \textit{traceable path} \cite{garey1979computers}.
Figure \ref{fig:directedgraph} illustrates an example of a directed graph.
The graph shown has two nodes $c$ and $d$ which are parallel.
By following the directed arcs and visiting each node exactly once, the graph shown can generate exactly two distinct traces which are also Hamiltonian paths: $\langle a,b,c,d,e \rangle$ and $\langle a,b,d,c,e \rangle$.
In other words, a trace conforms to the model if and only if the trace is a Hamiltonian path over the directed graph.
\begin{figure}[htb]
\centering \includegraphics[width=0.7\linewidth]{gfx/figures/directedgraph.pdf}
\caption{An example of a directed graph}
\label{fig:directedgraph}
\end{figure}
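As a rough illustration of this conformance criterion, the sketch below checks whether a trace is a Hamiltonian path over a directed graph. The edge set is an assumption reconstructing the figure (nodes $a$ to $e$ with $c$ and $d$ parallel), not taken verbatim from it.
\begin{verbatim}
# Assumed directed graph with parallel nodes c and d.
nodes = {"a", "b", "c", "d", "e"}
edges = {("a", "b"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("d", "c"), ("c", "e"), ("d", "e")}

def conforms(trace, nodes, edges):
    """A trace conforms iff it is a Hamiltonian path: every node is
    visited exactly once and consecutive events are connected by arcs."""
    if sorted(trace) != sorted(nodes):
        return False
    return all((x, y) in edges for x, y in zip(trace, trace[1:]))

print(conforms(["a", "b", "c", "d", "e"], nodes, edges))  # True
print(conforms(["a", "b", "d", "c", "e"], nodes, edges))  # True
print(conforms(["a", "c", "b", "d", "e"], nodes, edges))  # False: (a, c) is no arc
\end{verbatim}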
A \emph{Petri net} \cite{rozenberg1998lectures} can be used as a more detailed process model. A Petri net consists of \emph{places}, \emph{transitions}, and directed arcs connecting them. In this thesis I focus on \emph{finite nets} where the sets forming a Petri net are finite.
\begin{definition}
A Petri net is a tuple $N = (S, T, F)$ which consists of:
\begin{enumerate}
\item a set $S$ of \emph{places},
\item a set $T$ of \emph{transitions} such that $S \cap T = \emptyset$,
\item a set $F$ (flow relation) of directed arcs: $F \subseteq (S \times T) \cup (T \times S)$
\end{enumerate}
\end{definition}
In a Petri net, the places and transitions $S \cup T$ are called the \emph{elements} of $N$.
For any element $x \in S \cup T$ its \emph{pre-set} ${}^\bullet x$ is all the elements that have a directed arc to $x$, that is ${}^\bullet x = \{y \in S \cup T | (y,x) \in F\}$. Similarly its \emph{post-set} $x^\bullet$ is defined as
$x^\bullet = \{y \in S \cup T | (x,y) \in F\}$.
A Petri net can be mapped to a directed graph by creating equivalent nodes for each place and directed edges for each transition.
Van der Aalst et al. \cite{van2013discovering, van2004workflow} define a subclass of Petri nets called a \emph{workflow net} to describe processes. A workflow net (WF-net) is a Petri net with some further restrictions that make it more suited for process discovery. In their description, a WF-net needs a single starting point (an input place) and a single ending point (an output place). Furthermore, they require that every node lies on a directed path from the input place to the output place (equivalently, the net extended with a transition from the output place back to the input place is \emph{strongly connected}).
\begin{definition}
Let $N = (S, T, F)$ be a Petri net. $N$ is a workflow net \emph{iff}:
\begin{enumerate}
\item $S$ contains an input place $i$ such that ${}^\bullet i = \emptyset$ (starting point)
\item $S$ contains an output place $o$ such that $o^\bullet = \emptyset$ (ending point)
\item every node of $N$ is on a directed path from $i$ to $o$
\end{enumerate}
\end{definition}
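For concreteness, the definitions above can be sketched in a few lines of Python (hypothetical element names, illustration only), including the pre-set and post-set helpers and the two conditions on the input and output places.
\begin{verbatim}
# A Petri net N = (S, T, F) as plain Python sets; names are illustrative.
S = {"i", "p1", "o"}                    # places
T = {"t1", "t2"}                        # transitions, disjoint from S
F = {("i", "t1"), ("t1", "p1"),         # flow relation in (S x T) u (T x S)
     ("p1", "t2"), ("t2", "o")}

def preset(x, F):
    """All elements with a directed arc into x."""
    return {y for (y, z) in F if z == x}

def postset(x, F):
    """All elements with a directed arc out of x."""
    return {z for (y, z) in F if y == x}

# WF-net conditions on the input and output places:
print(preset("i", F) == set())   # True: no arcs into the input place i
print(postset("o", F) == set())  # True: no arcs out of the output place o
\end{verbatim}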
With the definitions for event logs, traces and WF-nets, we can describe \textbf{process discovery} as an algorithm that maps any event log $L$ into a model such as a directed graph or a Petri net that describes the underlying process that generated the event log.
\subsubsection{\textalpha-algorithm}
\label{sec:alphaalgorithm}
Constructing concurrent process models from an ordered event log is a challenging task.
The event log is a linear and ordered list of events and does not directly express the dependencies between these events.
To find the dependencies, and thus the shape of the process model, they must be extracted from the log with an algorithm.
The \textalpha-algorithm is a process discovery algorithm that reads ordered event logs.
The algorithm discovers parallelism by comparing the order of events in different traces.
Within a trace, an activity that depends on another activity always comes after it.
However, parallel activities with no such dependency can appear in any order relative to each other.
Thus, in different traces the parallel activities are ordered differently.
This is the main idea behind the \textalpha-algorithm.
A detailed mathematical description of the \textalpha-algorithm can be found in \cite{van2004workflow}.
The idea of the algorithm is:
\begin{enumerate}
\item Find all the events that appear in event log $L$. I call this the \emph{vocabulary}. This corresponds to all the transitions $T_L$ in the WF-net.
\item Find all the events that appear as the first event in any trace (the \emph{start transitions}).
\item Find all the events that appear as the last event in any trace (the \emph{end transitions}).
\item Find all pairs of sets of events $(A,B)$ that have a (causal) dependency in all the traces. This means that all events in $A$ always come before all the events in $B$ in all the traces.
\item Reduce the set of pairs discovered in the previous step to only include the ``maximal pairs'' (see below).
\item Create a place for each pair $(A,B)$ from the previous step. Additionally, create places for a starting point $i$ and an ending point $o$. This will be the set of places $S_L$.
\item Connect the arcs (the flow relation $F_L$). All start transitions from step 2 will have $i$ as their input place and all end transitions from step 3 will have $o$ as their output place. All the places for each pair $(A,B)$ will have transitions in $A$ as their input node and transitions in $B$ as their output node.
\item The result is a WF-net $\alpha(L) = ( S_L, T_L, F_L )$.
\end{enumerate}
In step 5, the set of ``maximal pairs'' means that there are no pairs which include subsets of another pair. For example, the pairs $\{(\{a\},\{b,c\}),(\{a\},\{b\}),(\{a\},\{c\})\}$ can be reduced into just $\{(\{a\},\{b,c\})\}$.
In essence, the \textalpha-algorithm tries to find arcs that form a WF-net describing the causal dependencies observed over all the traces in the event log.
For example, if activity $b$ comes after activity $a$ in all the traces, then it suggests a causal dependency $a \rightarrow b$.
On the other hand, if in some traces $a$ comes after $b$ and some others before, then it suggests that the activities do not depend on each other and are parallel.
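The following sketch illustrates only this relation-building step (the footprint-style ordering relations), not the full place construction of the \textalpha-algorithm; the two traces are the hypothetical ones generated by the directed graph example above.
\begin{verbatim}
from itertools import product

# Two hypothetical traces, e.g. generated by the directed graph example.
log = [("a", "b", "c", "d", "e"), ("a", "b", "d", "c", "e")]

# Directly-follows relation: (x, y) iff y directly follows x in some trace.
follows = {(x, y) for trace in log for x, y in zip(trace, trace[1:])}
events = {e for trace in log for e in trace}

# Classify ordered pairs of events into causal and parallel relations.
for x, y in product(sorted(events), repeat=2):
    if (x, y) in follows and (y, x) not in follows:
        print(x, "->", y, " causal dependency")
    elif (x, y) in follows and (y, x) in follows:
        print(x, "||", y, " parallel")
\end{verbatim}
On this log the sketch reports, for instance, $a \rightarrow b$ and $c \parallel d$, matching the intuition described above.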
There are some limitations to this algorithm.
The \textalpha-algorithm assumes that the event log is complete and has accurate ordering \cite{van2013discovering}. This means that all the events are present in the logs, and that the time-ordering in the traces corresponds to causal relations accurately. Furthermore, there can be different WF-nets that create similar traces \cite{van2013discovering}. In other words, different process models can create the same traces.
The algorithm is also not suited to dealing with non-local dependencies \cite{van2013discovering}.
For example, if $a \rightarrow b$, $b \rightarrow c$, and $a \rightarrow c$, the dependency $a \rightarrow c$ is said to be ``non-local'', since in the logs $c$ is always directly preceded by $b$ and only the dependency $b \rightarrow c$ is visible.
% ################################
\subsection{Machine learning}
The term \textit{machine learning} means the study and development of algorithms that give computers the ability to learn from data without being explicitly programmed for that data.
The traditional approach in programming is to explicitly define the steps the computer should take to solve a problem.
However, sometimes the input to a problem and the desired output are known, but not the steps for getting from one to the other.
Figures \ref{fig:traditionalvs} and \ref{fig:mlvs} illustrate this fundamental difference.
For example, we can look at the problem of computer vision.
As an input we could have a thousand photos of birds and a thousand photos without birds.
For us humans it is easy to recognize a bird and decide which of these two groups an arbitrary photo belongs to.
What we do not know are the exact rules that make a photo a ``bird photo''.
It would be very difficult to program an algorithm that examines the pixels of a new photo and decides whether there is a bird in the image.
This situation is a good example of what machine learning algorithms excel at.
\begin{figure}[htb!]
\centering
\begin{subfigure}[h]{0.4\linewidth}
\centering \includegraphics[width=\linewidth]{gfx/figures/traditionalvs.pdf}
\caption{Traditional programming}
\label{fig:traditionalvs}
\end{subfigure}
\begin{subfigure}[h]{0.4\linewidth}
\centering \includegraphics[width=\linewidth]{gfx/figures/machinevs.pdf}
\caption{Machine learning}
\label{fig:mlvs}
\end{subfigure}
\caption{Illustration of machine learning}
\end{figure}
Machine learning is data-driven: the focus is less on explicitly programmed instructions.
Machine learning algorithms focus on finding patterns and models in data,
which can then be used to process new data or make predictions.
Supervised learning is a subset of machine learning.
In supervised learning, the input for the computer is a set of \textit{training data} with predefined labels.
This training data constitutes the ``known'', previously seen data points.
Usually, these labels come from real world measurements or manual (human) labeling.
For example, in the computer vision problem described before, the training photos could be manually labeled by humans as ``bird'' and ``no bird''.
The supervised machine learning algorithm uses this labeled data to learn patterns in the \emph{training phase}. After the training, the computer can then use the learned information to label new (previously unseen) data.
A separate set of data points called the \emph{test set} is used to evaluate the model performance.
The assumption is that the performance over the test set is an estimate of the performance over new data.
In classification tasks, the models try to predict a \emph{class} for an unknown datapoint. The classes are discrete and the number of classes is finite. In binary classification there are two possible classes, often labelled 1 and 0, or true and false, as seen in the bird photo example above. In regression tasks the predictions result in a real number instead of discrete values, such as when predicting a length of time.
Different models are used for these different tasks.
The model best suited for an application depends on the distribution of the possible values and other characteristics of the phenomenon being modeled.
In conclusion, machine learning is a set of algorithms and models that leverage a large dataset to estimate some feature of new datapoints.
In \textit{classification} tasks, this feature is chosen from a finite set of possible classes, whereas in \textit{regression} tasks the value estimated is a real number.
\subsubsection{Regression}
Regression algorithms use a set of training data to learn a model.
The training data is a set of labelled data points.
The data set needed is often large, consisting of thousands of points.
Each data point consists of a set of \emph{features} (the inputs) and a \emph{label} (the output).
The features are properties that are available for all the data points.
The label is the property that is unknown for new data and that we want to estimate.
In regression problems the label being estimated is a continuous value (a real number).
For example, regression could be used to estimate wine quality.
The features could be some measurable aspects of the wine such as the pH-value, the amount of sugar, or the year the wine was bottled.
The label would be a number ranging from 1 to 10 corresponding to the quality of the wine.
A training dataset could be generated by measuring the features of many wines and then having (human) wine tasters manually label them with their understanding of the wine quality.
This set of features and labels could then be used to train a regression model.
The model could then be used to estimate the quality of a previously unseen wine by measuring the same features and providing them as an input for the model.
The model would then output a number from 1 to 10 as the estimate for the wine quality.
Expressed mathematically, regression estimates an output value $r \in \mathbb{R}$, which is a real number \cite{alpaydin}.
The training set consists of $N \in \mathbb{N}$ feature vectors (inputs) $\mathbf{x}^t \in \mathbb{R}^d$ and $N$ output values $r^t$.
Here $d \in \mathbb{N}$ is the dimensionality of the data, that is, the number of features being used as inputs. Thus, the training data $\mathcal{D}$ can be expressed as follows:
$$\mathcal{D} = \{(\mathbf{x}^t, r^t)\}_{t=1}^N, \mathbf{x}^t \in \mathbb{R}^d, r^t \in \mathbb{R}$$
The regression algorithm (the learner) processes the training data and outputs a model $g(\mathbf{x})$ which estimates the output parameter $r$ for a new unseen feature vector $\mathbf{x} \in \mathbb{R}^d$.
$$g(\mathbf{x}) = r$$
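As a minimal illustration of this setup, the sketch below fits a simple linear model on synthetic data (the data, the choice of scikit-learn, and the linear model are assumptions made only for this example).
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data D = {(x^t, r^t)}: N = 100 points, d = 3 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
r_train = 2.0 * X_train[:, 0] - X_train[:, 1] + rng.normal(scale=0.1, size=100)

# The learner produces the model g from the training data.
g = LinearRegression().fit(X_train, r_train)

# Estimate the output r for a new, unseen feature vector x.
x_new = np.array([[0.5, -1.0, 0.2]])
print(g.predict(x_new))   # a real-valued estimate of r
\end{verbatim}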
The model $g(\mathbf{x})$ is chosen based on the data and domain knowledge.
The models can range from simple linear models to complex models such as decision forests.
In theory any model $g: \mathbb{R}^d \rightarrow \mathbb{R}$ can work,
as long as a learning algorithm for the model exists.
In the case of features that are not real numbers, a mapping function is used that takes a different type of feature (such as a string or a boolean) and converts it into a real number.
In the ideal case, the model should fit the phenomenon (physical or other) being modeled in the machine learning application.
However, sometimes the underlying model is not known.
In these cases, multiple models can be used and compared based on the dataset in order to find the best performing one.
The model is evaluated over a separate test set.
Commonly, the test set is created by separating the training set into two distinct sets before the training phase.
$$\mathcal{D}_{test} = \{ (\mathbf{x}_*^t , r_*^t) \}_{t=1}^{N_{test}}$$
In essence, the model is tested with the inputs of a data point in the test set. The output value from the model is compared to the ``correct'' label in the test set data point.
The difference between the values is the \emph{error}.
The aim of machine learning is to minimize this error by choosing a good model for the application and the data, and tuning the model parameters until the desired (low) error is reached.
The error measures I use in this thesis are \textit{mean absolute error} (MAE) and \textit{root mean square error} (RMSE).
The mean absolute error is the average of the absolute values of the individual errors over the test set data points.
Root mean square error is the square root of the average of the squared errors.
$$E_{MAE}(g | \mathcal{D}_{test}) = \frac{1}{N_{test}} \sum_{t=1}^{N_{test}} \left| r_*^t - g(\mathbf{x}_*^t) \right|$$
$$E_{RMSE}(g | \mathcal{D}_{test}) = \sqrt{ \frac{1}{N_{test}} \sum_{t=1}^{N_{test}} \left( r_*^t - g(\mathbf{x}_*^t) \right)^2 }$$
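These error measures can be computed directly from the test-set predictions, for example as in the following sketch (the numbers are made up for illustration).
\begin{verbatim}
import numpy as np

# Hypothetical test-set labels r_* and model predictions g(x_*).
r_test = np.array([3.0, 1.5, 4.0, 2.0])
r_pred = np.array([2.6, 1.8, 4.4, 2.1])

errors = r_test - r_pred
mae = np.mean(np.abs(errors))           # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))    # root mean square error
print(mae, rmse)
\end{verbatim}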
All this is based on the assumption that the samples are \textit{independent and identically distributed} (iid) \cite{alpaydin}.
This means that the order of the samples in the sets does not matter and that the distribution of the features and the labels is the same in the training set, the test set, and for any unseen data point outside these sets.
If there is enough information in the training set, the model should be able to use the information to label new data points with the same performance as over the test set.
Thus, the performance over the test set can be used to estimate the performance of the model for unseen data.
There are many different regression learners that are based on different theories. The two models used in this thesis are Poisson Regression \cite{azurepoisson} and Boosted Decision Trees \cite{azurebdt}.
\subsubsection{Poisson regression}
The Poisson distribution model was derived to analyze counts of events within a specific time span \cite{osgood2000poisson}.
It gives the probability of observing a given discrete number of events when the ``mean rate'' of events is known.
The mean rate $\lambda$ describes the average number of events happening during the time span.
The Poisson model assumes that the events are independent of each other and happen at random.
An example could be to model the probability of how many phone calls happen at a call center in a single day.
To model this with a Poisson distribution, the average number of calls per day ($\lambda$) would need to be known.
When the mean rate $\lambda$ is low, the distribution is skewed, with most of its probability mass near zero.
As $\lambda$ increases, the distribution approaches the normal distribution \cite{osgood2000poisson}.
Figure \ref{fig:poissongraphs} illustrates the different shapes of Poisson distributions.
Poisson regression makes an assumption that the underlying variable follows a Poisson distribution.
It works in cases where the data is labelled by counts or probabilities of events.
\begin{figure}[htb]
\centering \includegraphics[width=0.8\linewidth]{gfx/figures/poissons.pdf}
\caption{Poisson distributions with different values of $\lambda$}
\label{fig:poissongraphs}
\end{figure}
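As a sketch of how such a model can be fitted in practice, the example below uses scikit-learn's \texttt{PoissonRegressor} on synthetic count data; this is only one possible implementation, not the Azure ML module used later in the thesis, and the features are hypothetical.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Synthetic count data: e.g. calls per day as a function of two
# hypothetical features (such as day of week and staffing level).
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
lam = np.exp(0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1])   # mean rate per sample
y = rng.poisson(lam)                                 # observed counts

# Poisson regression assumes the label follows a Poisson distribution
# whose mean rate depends on the features.
model = PoissonRegressor().fit(X, y)
print(model.predict(X[:3]))   # estimated mean counts for the first samples
\end{verbatim}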
\subsubsection{Boosted decision trees}
The second model used in this thesis is boosted decision trees (BDT).
A \textit{decision tree} is a model for supervised learning that is based on a sequence of splits which form a hierarchical tree \cite{alpaydin}.
The tree consists of internal decision nodes and the terminal leaves containing values.
To find the result of the regression model, the algorithm steps through the decision nodes starting from the root.
At each node a component of the feature vector is compared against a predetermined threshold value.
If the value is lower (or higher) than the threshold value, a decision is made and either the left or the right child node is chosen.
This is repeated until a terminal leaf node is reached at the bottom of the tree.
The terminal nodes contain the label values.
The value in the leaf reached by the iterative algorithm is the result of the regression model.
The boosted decision tree learner in Azure ML is based on LambdaMART \cite{azurebdt}.
LambdaMART is a set of algorithms that implement a \textit{forest} of regression trees \cite{lambdamart2010}.
Gradient boosting \cite{friedman2001greedy} is used to combine several weaker models to form a stronger model.
The forest consists of multiple decision trees.
The result of the regression given by the LambdaMART forest is a linear combination of the outputs of the trees.
The learner trains the thresholds at the decision nodes and the values at the terminal leaves of each tree, as well as the weights for the linear combination to minimize error over the training set.
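The thesis experiments rely on the Azure ML implementation, but the general idea of a boosted forest of regression trees can be sketched with scikit-learn's gradient-boosted trees (an illustrative stand-in; it is not LambdaMART and the data is synthetic).
\begin{verbatim}
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training data: 200 points with 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
r = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# An ensemble of shallow regression trees combined by gradient boosting;
# each tree is a weak model and their weighted sum forms the final model.
model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
model.fit(X, r)
print(model.predict(X[:3]))   # real-valued regression estimates
\end{verbatim}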
% Talk about practical usage (strengths, limitations)
One strength of the boosted decision tree model is that it is highly interpretable and can be used for rule extraction \cite{alpaydin}. Each tree can be converted into a set of IF-THEN rules, and the weights of the linear combination correspond to the importance of each tree.
Because of these reasons, the domain expert can analyze the model to see which features are being used and whether the model seems to make sense.
Furthermore, the model can even provide information about the real world ruleset for the measured phenomenon.
In general, boosted decision trees work well when the features are related to each other. With completely independent features, a BDT model will not be as effective \cite{azurebdt}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Related work
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\clearpage
\subsection{Related work}
\label{sec:relatedwork}
Process mining has gained interest in recent years, since it offers possibilities for streamlining business processes and gaining business intelligence (BI) by finding bottlenecks and inefficiencies in processes.
Process mining has many uses, such as process discovery \cite{van2016discovery, van2013discovering}, trace clustering \cite{de2016general}, and conformance checking \cite{chomyat2016process}. Furthermore, there is some research on detecting process anomalies automatically \cite{bezerra2009anomaly}.
Process discovery is at the heart of process mining, since most process mining methods rely on modeling the process and exploiting the model in some manner.
Because of this, process discovery is discussed in many papers related to process mining \cite{chomyat2016process, de2016general, van2013discovering, van2015extracting, van2016process, bezerra2009anomaly}.
Conformance checking means making sure the current process conforms to a previously determined set of rules or a model \cite{chomyat2016process}.
Conformance checking can be especially useful in sectors where conforming to regulations or other such rules is critical for the work.
Examples of such sectors are the financial sector, medical business, and government institutions.
Process mining can be used to ensure the regulations are being followed.
The main purpose of my thesis is to help visualize the processes and to show anomalies to the users or even send notifications about the anomalies automatically.
The definition of \textit{anomaly} is not consistent across research \cite{bezerra2009anomaly}.
An anomaly can be anything exceptional or abnormal in the behaviour of a system.
It can also describe an error, noise, or fraudulent behaviour.
Usual definitions require that an anomaly is infrequent, unexpected, and deviates from normal behaviour.
A precise definition is difficult, since what is considered ``normal'' differs between cases.
Bezerra et al. \cite{bezerra2009anomaly} present how the established process mining tool ProM can be used to discover models and check their conformance.
Each trace gets assigned a conformance value (fitness) describing how well it fits the previously discovered model.
With this method, a threshold value for required model fitness can be assigned to find the anomalous traces.
However, there is little focus on detecting anomalies in real time.
The past event logs can also be used to build workload histograms \cite{getta2014mining}.
This method reads event logs with periodic patterns (such as the recurring submissions in the Microsoft store).
The workload information can be used to predict incoming workloads and to improve system performance by preparing against peak loads.
This method is labeled \textit{process enhancement}.
An example of process enhancement is detecting periods of low activity and using them to run background tasks.
This type of balancing decreases the idle time of a system and improves its capability to respond to peak loads \cite{getta2014mining}.