Mining Your Business

Python in Process Mining, PM4Py, with Sebastiaan van Zelst, head of process mining at Fraunhofer FIT

April 27, 2022 Mining Your Business Episode 34

To say that Sebastiaan van Zelst, head of the process mining group at the Fraunhofer FIT research institute, knows Python quite well would be an understatement. He is an author of PM4Py, the most comprehensive and popular Python library for process mining. And he did not stop there! He also helps develop the process mining tool Cortado, spearheads process mining research and leads a Center for Process Intelligence!

Learn more at the Processand website!

Follow us on our LinkedIn page here: LinkedIn
Learn more about what we do at Processand here: Processand

00:00

Patrick:

We are back, Patrick and Jakub, the Dynamic Duo with the Mining Your Business podcast, the show all about process mining, data science, and advanced business analytics. How are you doing today, Jakub?

 

00:09

Jakub:

I'm doing quite well, Patrick, hi.

 

00:12

Patrick:

Joining us today on the podcast is Sebastiaan van Zelst, head of the process mining group at Fraunhofer FIT, the founder of the most downloaded process mining Python library, PM4Py, and co-creator of Cortado, a process mining software that unifies the worlds of manual process modelling and automated process discovery. Let's get into it.

 

00:38

Jakub:

Hi there, Process Mining Community. Welcome to yet another episode of the Mining Your Business podcast. Did you think that with our previous guests, such as Wil van der Aalst or Marlon Dumas, we were getting technical? Well, wait for today's guest, Sebastiaan van Zelst, head of the Process Mining Group and a deputy head of the FIT department at Fraunhofer, a German applied research organisation. Oh, and also a researcher and a Python expert. Sebastiaan, welcome to the Mining Your Business podcast. It's a huge pleasure to have you on our show.

 

01:11

Sebastiaan van Zelst:

Thank you and happy to be on the show.

 

01:14

Jakub:

First things first, Sebastiaan, congratulations on your recently born baby daughter.

 

01:20

Sebastiaan van Zelst:

Thank you very much. Thank you.

 

01:23

Jakub:

Yeah, I guess family business is always exciting stuff. However, today we have a process mining topic in mind. And the first question I have for you is: how is it that every process mining researcher seems to be from the Netherlands?

 

01:41

Sebastiaan van Zelst:

Good question. I think it was inevitable. Well, you had Wil van der Aalst on the show, as you already mentioned, and he's been a driving force for the field for many years. I guess he sort of naturally attracted some people who were studying there and were interested in doing a Ph.D., and it kind of spiralled out of control. But these days, it's getting very international at the same time. But I guess it's Wil who is the major source of this phenomenon.

 

02:22

Jakub:

Yeah, I guess those are also the reasons why he's called the Godfather of Process Mining. It's exciting that he sparks this young generation of process mining, I would say, members or researchers across the globe. But it's also cool, whenever I see the resumes of the guests that we have, that most of them worked with him at some point, be it on a paper or at university in Eindhoven or Aachen. And it's very exciting how this community is actually still revolving in a relatively small circle.

 

02:54

Sebastiaan van Zelst:

Yeah, indeed, indeed. But getting bigger and bigger by the day.

 

03:01

Jakub:

Absolutely. So, Sebastiaan, first I would like to discuss your career. You work at Fraunhofer, which is a research institute in Germany. You are the head of process mining there. And since it's a research institute, does it mean that you are kind of a bridge between pure academia and industry, or do I understand it wrong?

 

03:28

Sebastiaan van Zelst:

No, that's actually exactly what it is. Fraunhofer is actually quite a large scientific organisation, established after the Second World War, that has exactly this mission. So it's a non-profit organisation with the sole purpose of transferring results from academia into industrial applications. Yeah, that's exactly what we do, so spot on.

 

04:01

Patrick:

So in general, what do you do as the deputy head of data science and artificial intelligence?

 

04:08

Sebastiaan van Zelst:

Well, as the deputy head, I support our general head of the department. So it's a mixture of certain leadership activities as well as administrative tasks. In that sense, it's a minor role that I play. My main responsibility is leading a group that focuses on process mining, which I actually inherited from Wil van der Aalst some time ago.

 

04:43

Patrick:

So what do you do in this Process Mining department?

 

04:45

Sebastiaan van Zelst:

Our main goal is twofold. On the one hand, most of the team members in my group are also Ph.D. students, whom I supervise. We are strongly related to the research group of Wil van der Aalst, so he's often acting as a promoter and I'm taking care of the day-to-day supervision of these people. So from that perspective, we do research, we try to develop new things. What's important for the work of our group is that, specifically for the Ph.D. students, it has to have a bit of an applied nature. The problems they work on somehow have to be real, or there should really be a strong tool or software component that flows out of this research. And secondly, we try to do projects with industry. We try to do all kinds of process mining related stuff, ranging from applying it to developing or co-developing technology.

 

06:02

Patrick:

So I've heard that as a researcher, it's hard to get your hands on real-world data. So is this kind of industry approach like one of those golden nuggets, where you have to go out and beg these companies to have your researchers work on this topic, or is it more the other way around, that they really want to work with you? Like, how does that come about?

 

06:21

Sebastiaan van Zelst:

Yes, that's a good question. On the one hand, we don't have that issue so much when you compare it to the more classical researchers because, of course, we're a non-profit organisation, but in the end we just do projects together with the industry partners, trying to help them achieve a business goal or understand their processes better. So whenever we apply process mining, we definitely get data. We do not always necessarily use the data in our scientific outputs. So how you should look at those projects is that often, when we do these projects, we observe certain problems that we then later try to generalise a bit and then in the end write a paper about. If we're lucky, we can then use an anonymized form of the data that we've been working with, but that's not the main aim of these projects. It's really basically helping our partners achieve their goals, and whatever scientific ideas we get out of it, that's the synergetic side effect.

 

07:34

Jakub:

What would also be interesting for me is what kind of projects you are actually working on, because I can imagine what academia works on. They have a lot of theories or problems that they're trying to solve. And then on the other side, it's us, it's people who are implementing, let's say, the standard processes, and then there's Fraunhofer. So what does a typical project look like, and how do you even distinguish whether it's already too business oriented or already too academic?

 

08:11

Sebastiaan van Zelst:

I would distinguish what we do in roughly three categories. There's actually a bit more, but to try to categorise it a bit: on one end, we are active in trying to apply process mining. So trying to effectively analyse event data that originates from a process. That's something we do together with colleagues who are more from business informatics. So we tend to cover the technical aspects of these projects, they tend to cover the business questions, and it's a very nice collaboration. So those are projects where eventually a project partner provides us with data of a process. We analyse that data and come to some sort of conclusion, often a report or, when preferred, a presentation, trying to highlight what the main issues are, where the improvement potential is, etc. And then in certain cases we go into a larger scale project where we, for example, try to implement the analysis that we did in the commercial tool of choice of the industry partner. That's one type of project. So that's the typical thing that we do. We start small with more custom tooling. Then we try to look: OK, what did we get out of it? Can we adopt this in any commercial solution?

 

09:44

Patrick:

So that sounds awfully similar to what you and I do, Jakub, doesn't it? We do something very, very similar. What do you think is the main difference between doing it from an academic point of view and doing it from a more industry point of view? Because, you know, we're on the industry side. We implement it directly with customers. What do you think makes a big difference when doing it with that academic aspect?

 

10:08

Sebastiaan van Zelst:

Um, I think in the end, for us, the goal is always to learn something or to transfer knowledge. Also, for every project that I do, I have to make the case, right? So of course there's different chasing factors there. Another thing is that we tend not to focus so much on highly standardised processes. Of course, if we collaborate with an industry partner that says, would you like to look at the purchase-to-pay process or the order-to-cash process, we won't necessarily say, no, we're not going to do that. As you guys are probably well aware, every process and corresponding data set has a number of challenges. So the learning aspect is actually easy to cover there. But in general, we have to justify this, and the moment a certain type of analysis gets very repetitive, I should actually try to teach the industry partner how to do that themselves. But obviously there will be overlaps with what industry does. I mean, it's not a black and white thing, I guess. So yeah, there will definitely be overlaps with what you guys do, but I think that's the main difference. And I prefer to focus on the core processes of companies, because that has more potential for improvement.

 

11:45

Patrick:

You already touched a little bit on some of the research that you guys do. So this is secondary to, or at least separate from, the industry implementations. Can you give us some examples or at least tell us a little bit about what kinds of problems you guys are focusing on in your research?

 

12:03

Sebastiaan van Zelst:

Yeah, so there's a bunch of research topics that we're working on. Of course, when you look at my publications list, you'll find also certain, what we would say, exotic theoretical things.

 

12:18

Jakub:

I have to say I was trying to read a bit into it. It was very challenging, very interesting things that I've never heard of.

 

12:26

Sebastiaan van Zelst:

Yeah, it depends. When I have time, which these days, specifically with a young daughter, is not so much the case, I tend to write more theoretical papers, because that's something I like personally. When you look at the more strategic research lines, one big research line that we have been pushing with one of my team members focuses on event abstraction, where the idea is that you try to lift the recorded event data to a higher-level notion. There's another strong research line that focuses on trying to integrate domain knowledge into process discovery. So trying to leverage the knowledge that is inside an organisation regarding the execution of a process, and trying to embed that in the algorithms that discover process models. A third, more recent direction that we are pushing, although we haven't published about it yet, we're still in the, let's say, start-up phase, is to try to exploit historical process event data for the purpose of scheduling. Trying to make the schedules of processes more realistic, or trying to dynamically counteract if certain events occur that would violate the schedule, trying to smartly compute new schedules. And in the longer run, and it's also a bit related to some of our outings to industry that we haven't really discussed yet, there's the software development side: I'm still very interested in trying to really push process mining algorithms at scale. So to really design distributed approaches to be able to handle billions of events, essentially. That is something that has my personal interest, which I also want to push in the upcoming time.
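To make the idea of event abstraction concrete, here is a minimal Python sketch. It is an illustration only, not the group's research method: the activity names and the `ABSTRACTION` mapping are hypothetical, and real abstraction techniques are considerably more involved than a fixed lookup table.

```python
from itertools import groupby

# Hypothetical mapping from low-level recorded events to higher-level activities.
ABSTRACTION = {
    "open_form": "Register claim",
    "fill_field": "Register claim",
    "submit_form": "Register claim",
    "open_document": "Review claim",
    "approve": "Review claim",
}


def abstract_trace(trace, mapping=ABSTRACTION):
    """Lift each low-level event to its high-level activity and merge consecutive repeats."""
    lifted = [mapping.get(event, event) for event in trace]  # unknown events pass through
    return [activity for activity, _ in groupby(lifted)]


low_level = ["open_form", "fill_field", "fill_field", "submit_form", "open_document", "approve"]
high_level = abstract_trace(low_level)
print(high_level)
```

Six recorded events collapse into two business-level activities, which is the kind of higher-level notion a process model is easier to read at.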

 

14:30

Jakub:

So Sebastiaan, what can companies actually do to come to Fraunhofer and say, ok, we have this very specific problem in the process. We want to work with you guys to solve it.

 

14:43

Sebastiaan van Zelst:

Well, ideally, they contact me, not necessarily directly. They can contact us over various channels. We also have what we call the Centre for Process Intelligence, which is the joint collaboration that I referred to earlier. What I prefer to do in such cases is to first sit together with the industry partner and try to get a feeling for, first of all, the process, what the potential is there, and whether the data footprint is really an accurate representation of that process. That could in itself be a short project. Not necessarily, though; it could also be a shorter workshop. Sometimes you can easily decide it in one to two hours. And usually what we prefer to do is a shorter term project, say two to three months, where we try to analyse the data and also support the extraction of the data, if that's necessary. There we typically would try to discover a normative process model, we would do some conformance checking on that normative model, and we'd also do some performance analysis. And then usually time runs out. And as I said, this is a non-profit organisation, so you cannot go too much over time in that sense. Then you get issues. And then, if we see improvement potential, we typically try to look at further projects together with the partner. And then it depends on what they would like to do with their results and how they would like to embed them in their organisation, etc.

 

16:27

Patrick:

So you mentioned that these implementations are typically short, two to three months, but do you also do longer term cooperations and collaborations, and how long can they last?

 

16:43

Sebastiaan van Zelst:

Yeah, we do that. So with an industry partner, and obviously I'm never allowed to say which exact partner that is, we basically have one of the members of the Centre for Process Intelligence inside the organisation, largely responsible for their process mining endeavours and supporting their implementation of a process mining ecosystem. We have another longer-running collaboration with a large insurance company, in which one of the daughter companies is using our software as a process mining solution. And the idea there is actually that we also collaborate with them explicitly on new features that are specific to them or specific to the insurance domain. And with that company, we're also trying to enter a sort of relationship in which we are the first go-to group when they have process mining related questions or projects that they would like to set up. And then, of course, there are the larger R&D types of projects where you have a multi-year agreement with the goal to really develop novel technology specific to the industry partner. These are hard to obtain, though, because such a project is often a longer term collaboration. It's very difficult for the average smaller company to do these types of projects.

 

18:35

Jakub:

Well, what you've said raises a lot of questions for me, actually, because I was just about to ask for some real-life examples of what kind of problems you are trying to solve, because I know our listeners just love examples. And when you mentioned insurance, that also rang a bell for us, because we also had an insurance customer where we did some insurance related processes. And it's pretty cool for us as a partner to look at process mining from different perspectives. You know, not everything is order-to-cash or purchase-to-pay. There are also different processes, and it's exciting. So, not to name the companies, obviously, could you at least go a bit more into examples of what kind of, let's say, insurance specific issues you were trying to solve?

 

19:23

Sebastiaan van Zelst:

Um, yes. At some point, we just analysed the general handling of claims for such a company. At the moment the company is using the process mining software to improve certain KPIs, and the discussions we have are primarily on the technical side of things. So which charts would they like to see? Which filtering functionality would they want to have? Recently, I think, they've been focussing more on the customer experience and the processes related to that. I think they recently showed us that as part of their process mining endeavour, because it's of course not only our software that they use to analyse things, but they also actively take the results and implement them in the systems that they actually use to drive that process. Certain parts were also, I think, exported to Power BI or something. But in the end, they managed to improve one of their KPIs significantly when they finished that whole cycle. I don't know exactly what specific process it was, but I remember it was something related to the customer journey and, specifically, I think, the time required to respond to something. In another parallel universe, I would say, I also supervise, of course, some master's students and bachelor's students in their graduation process. And there is also a very interesting project at the moment with another insurance company, where a student is basically trying to look at exploiting process mining technology for fraud, so basically to be able to better characterise fraudulent behaviour, rather than simply having a sort of black-box predictor that says this is fraud and this is not, without trying to understand fraud.

 

21:35

Jakub:

Do you also sometimes have scenarios with your customers and with the research topics where you basically draw a big fat line and say, ok, this is just a no-go because it's simply not realistic to do? Or are you always trying to, let's say, deliver at least some kind of a result?

 

21:54

Sebastiaan van Zelst:

We have always been trying to deliver results. But yeah, I think we learned the hard way that it's not always possible. For me, the red line should be that the data should really be there. What I mean by that is that sometimes organisations say the data is there, but I haven't seen it yet. And then we embark on a project and it turns out that it is not really there. Or in other cases, I mean, we like challenges: for example, we've been doing several projects where the data is not stored as events at all, but, for example, as time series of sensor values or measurements, where you then first try to transform certain sensor measurements into certain actions. Those are very challenging projects, not necessarily problematic, but of course what you get out of that is not always something that is directly translatable into meaningful insights. So the data should be there.
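As a rough illustration of turning sensor measurements into actions, the Python sketch below emits an event whenever a reading crosses a threshold. It is a deliberately simple stand-in, not the approach used in those projects: the event names `machine_start` and `machine_stop`, the threshold, and the readings are all made up for the example.

```python
def readings_to_events(readings, threshold):
    """Turn (timestamp, value) sensor readings into start/stop events at threshold crossings."""
    events = []
    running = False
    for timestamp, value in readings:
        if value >= threshold and not running:
            events.append((timestamp, "machine_start"))  # hypothetical activity name
            running = True
        elif value < threshold and running:
            events.append((timestamp, "machine_stop"))  # hypothetical activity name
            running = False
    return events


# Made-up readings: (timestamp, measured power draw)
readings = [(0, 0.1), (1, 0.2), (2, 5.3), (3, 5.1), (4, 0.3), (5, 4.8)]
events = readings_to_events(readings, threshold=1.0)
print(events)
```

Even in this toy form the difficulty is visible: the resulting events depend entirely on the chosen threshold, so the insights are only as meaningful as that interpretation of the raw signal.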

 

23:11

Jakub:

Yeah, I remember we used to have a customer, it's been a while, it was a couple of years ago, who wanted us to analyse emails, like the text content of emails. So we were thinking, uh, yeah, maybe, maybe let's not do that.

 

23:26

Sebastiaan van Zelst:

Yeah. That's another thing. I would like to add that there should actually be a notion of a process. Sometimes you have data in a form where you could say, ok, you could look at this as events, but if you have, for example, a system where humans are free to do whatever they want, then the truth is that the behaviour will be, not necessarily random, but more pseudo-random. This large freedom basically comes at the cost of a very chaotic footprint of the process, and things also tend to get more difficult.

 

24:13

Jakub:

Before we jump to the next topic, and I know Patrick's already excited about that because it's going to be Python related, I also wanted to ask, as I know that there are a lot of listeners amongst our ranks who are students, attending universities, working on their theses, thinking about applying for Ph.D.s and so on: what would a student have to do to start, let's say, applying to you and working with you on some of these very interesting research topics that you're working on?

 

24:44

Sebastiaan van Zelst:

If you work in our group, you would work three days per week basically on projects, and two days per week are devoted to scientific work. It is important to mention that, usually, when you look at Ph.D. positions at universities in Europe, it's the other way round: 60% is research and 40% is education. So people who would like to do a Ph.D. in our group really need to have an interest in applying this in practice. It's very important. I mean, if you don't have that interest, then you shouldn't do this. People should also realise that if you apply this in practice, there's always a gap between what you research, the assumptions and abstractions that we have in research, and what you do in practice. This is simply a big gap. So this is something that is very important. And secondly, if you want to do a Ph.D., you have to be intrinsically motivated to do that, because otherwise it can be a bumpy and also challenging road.

 

26:10

Jakub:

For everyone involved.

 

26:12

Sebastiaan van Zelst:

Yeah, that also affects the supervisory team. Usually, it's true.

 

26:17

Jakub:

OK, yeah. So if you want to work at Fraunhofer, you have to be motivated!

 

26:22

Sebastiaan van Zelst:

I guess it holds for most jobs. But specifically, motivated for doing things in practice and also being interested in scientific challenges. Usually people are either of the two, but not both. So yeah, it's not always easy for us to find people.

 

26:42

Jakub:

All right then, Sebastiaan, moving on to the next topic. You already mentioned that with some of your customers, you're actually working on your own process mining tool, which is called Cortado. And just to read a description that I found on your website: Cortado basically enables the users to incrementally add new process behaviour to the process model under construction in a visual and interactive manner. What does that mean? How does it differ from a standard process mining tool and explorer that, I would say, most of the listeners are already familiar with from other commercial vendors?

 

27:20

Sebastiaan van Zelst:

That's a lovely sentence, by the way. I don't know if I thought of that or Daniel, who is the main developer. Um, I think Daniel did. The idea is that, first of all, most commercial tools usually show process maps, which are basically a representation of which activities can follow other activities in the process. Cortado is trying to discover process models. So those are models that are, for example, BPMN types of models, which have choice constructs but also parallelism constructs, so the tool basically allows one to abstract a bit more from the data. I think most commercial tools don't really discover process models. I know Celonis does offer the functionality, but at the same time it's a bit hidden. I think you have to go to the conformance checking step and then somewhere you can still find the process model. So what the tool allows you to do, in a nutshell: first of all, it's actually a process model editor. So you can always edit your model like you can also do in Signavio or something like that. The interaction with the data is that Cortado will show you the most frequently occurring executions. Well, it actually shows you all executions, but it sorts them by frequency. So what I'm trying to say is, if in 4000 instances of the process you always first see that a request is being sent in a system, and then a check is done, and then another activity, so if for 1000 different customers the same set of activities is happening in the same order, it will basically say, ok, this is a very important execution. So you can view all these different executions of the process and you can actually select which of these you would like to add to the model. So you make a selection, you press enter or you say discover model, and what the technique will do is take the model you have and try to augment the model with the new behaviour that you have selected.
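The two views described here, a process map of which activity follows which and a frequency-sorted list of executions, can be sketched in a few lines of plain Python. This is a generic illustration with a made-up four-case log, not Cortado's or any vendor's implementation.

```python
from collections import Counter


def variants_by_frequency(log):
    """Group cases by their exact activity sequence and sort the variants by case count."""
    counts = Counter(tuple(trace) for trace in log.values())
    return counts.most_common()


def directly_follows(log):
    """Count how often one activity is immediately followed by another (the process map edges)."""
    pairs = Counter()
    for trace in log.values():
        pairs.update(zip(trace, trace[1:]))
    return pairs


# Made-up event log: case id -> ordered list of activities
log = {
    "case-1": ["send request", "check", "decide"],
    "case-2": ["send request", "check", "decide"],
    "case-3": ["send request", "check", "recheck", "decide"],
    "case-4": ["send request", "check", "decide"],
}

for variant, count in variants_by_frequency(log):
    print(count, " -> ".join(variant))
print(directly_follows(log))
```

The variant list is what a user would select from, while the directly-follows counts only say which steps can follow each other and carry no notion of choice or parallelism.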

 

30:00

Patrick:

So it's not just on an activity basis, where you have one activity and you say, oh, that's nonconforming, I don't want that in my model, but it's taking the whole process variants as they are and saying, yes, this is a legitimate one, or this is not.

 

30:12

Sebastiaan van Zelst:

And the question that the tool sort of answers is: ok, give me the best possible model that still represents all behaviour that the previous version of the model represented, plus all the executions you would also like to describe. That's what it does. So it's sort of an incremental approach, because you also realise that when you add ten executions, or say at least ten executions, it can be that, if there are 2000 executions in the dataset, by trying to learn a model that just describes these ten executions, you also describe 50 other executions out of the box, because they actually fit the same model that has been learned. That happens.
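The effect of describing more executions than you selected can be shown with a toy model. The Python sketch below uses a set of allowed directly-follows relations as a stand-in "model"; this is an assumption made for brevity and is not the algorithm Cortado uses. The variants and their counts are made up.

```python
START, END = "<start>", "<end>"


def relations(variant):
    """All directly-follows pairs of a variant, including artificial start and end markers."""
    path = [START, *variant, END]
    return set(zip(path, path[1:]))


def extend_model(model, selected_variants):
    """Keep every relation the old model allowed and add those of the newly selected variants."""
    extended = set(model)
    for variant in selected_variants:
        extended |= relations(variant)
    return extended


def fits(model, variant):
    """A variant fits if every step it takes is allowed by the model."""
    return relations(variant) <= model


# Made-up variants with case counts
variants = {
    ("receive", "check", "assess", "pay"): 50,
    ("receive", "assess", "check", "pay"): 30,
    ("receive", "check", "assess", "check", "pay"): 15,
    ("receive", "reject"): 5,
}

selected = list(variants)[:2]  # the user picks only the two most frequent variants
model = extend_model(set(), selected)
covered = sum(n for v, n in variants.items() if fits(model, v))
print(f"covered {covered} of {sum(variants.values())} cases")
```

Only two variants were selected, yet the third one fits as well because all of its steps are already allowed, so the model covers 95 of the 100 cases. Extending the model again never removes a relation, which mirrors the requirement that earlier behaviour stays represented.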

 

31:00

Patrick:

So you then just go back and say, well, it didn't cover these points, or maybe I hadn't even considered that this could be an option. And you say, well, I guess this is also ok, and add them to your model.

 

31:09

Sebastiaan van Zelst:

And specifically, we've also been looking a bit at trying to explicitly block behaviour, but that's actually much more difficult than the other way around. I think we did all kinds of improvements. One thing that you can do in the tool is freeze, as we call it, parts of the model. So if you have a certain part of the model of which you would say, ok, whatever you do, don't touch this, then that can actually also be handled by the algorithm.

 

31:45

Patrick:

So I've kind of wanted to ask in general why develop this tool in the first place? Right. Did you see that there was a distinct lack of this type of functionality that you're building in the market and you just didn't find a proper tool to do it for you? So you decided to build it yourself or generally how did this tool come about even?

 

32:03

Sebastiaan van Zelst:

There have been more attempts to foster interactivity in these types of algorithms. What drove me personally was that, when I was doing my Ph.D. research, I had been studying process discovery in a more classical sense. So taking an event log and trying to learn a process model that then describes the behaviour. I think I spent one and a half years extending an existing algorithm, not achieving many results, in all honesty. And then I was trying to analyse an event dataset myself, trying to use, you know, all the tools that I had available in my toolbox, but I realised that when I really wanted to discover a process model, I would actually inspect the most frequent cases and I would make a mental image. Okay, if I would take these three, four, five, six executions that are very common and I would combine them in a model, how would the model then look? And then I would be iteratively modelling that by hand, then again investigating: okay, if I want to add this new behaviour, would it fit? So I noticed that when I was applying this, I was doing something completely different from what, you know, we always pretend that we want to do with these process discovery algorithms. Another thing I've noticed in practice: there is an open, publicly available data set that describes the administrative process of road traffic fines. There, I think, the second or third most frequently executed process variant is actually people who don't pay the fine. And any process discovery algorithm will, of course, always consider that as normative behaviour, because it happens maybe 20 or 30,000 times, a significant number of times. People don't pay, and any process discovery algorithm will learn that behaviour. And it's questionable whether you want to integrate that behaviour, if you want to have a view on your process that is supported as much as possible by the data, or whether you want to have a normative model that is backed by data, but that you can also use to quantify unwanted behaviour.
And I think what we want to do is the second. I would not want to have a model that completely describes everything. I want to have a model that describes my ideal world, but that's also realistic at the same time. Yeah, and then I realised that by fostering interaction, I expect that the models we can learn are of much higher quality, plus some other benefits that also come naturally by doing this. It's a bit technical to explain, so if you want I can do that. If not, I can leave it.
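The difference between a model that describes everything and a normative model used to quantify unwanted behaviour can be sketched as follows. The activity names loosely echo a traffic-fine process, but the variants, the case counts and the choice of what counts as normative are all invented for this Python illustration; they are not figures from the public data set.

```python
def deviation_report(variant_counts, normative):
    """Return the share of deviating cases and the deviating variants, most frequent first."""
    total = sum(variant_counts.values())
    deviating = {v: n for v, n in variant_counts.items() if v not in normative}
    share = sum(deviating.values()) / total if total else 0.0
    ranked = sorted(deviating.items(), key=lambda item: item[1], reverse=True)
    return share, ranked


# Invented variants and counts, for illustration only
variant_counts = {
    ("Create Fine", "Payment"): 500,
    ("Create Fine", "Send Fine", "Send for Credit Collection"): 300,
    ("Create Fine", "Send Fine", "Payment"): 200,
}

# The normative model, reduced here to a set of accepted variants (an assumption)
normative = {
    ("Create Fine", "Payment"),
    ("Create Fine", "Send Fine", "Payment"),
}

share, ranked = deviation_report(variant_counts, normative)
print(f"{share:.0%} of cases deviate; most frequent deviation: {ranked[0][0]}")
```

A purely frequency-driven discovery would absorb the unpaid-fine variant into the model because it is common. Keeping it out of the normative set is what turns its frequency into a measurable problem.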

 

35:19

Jakub:

Yeah, maybe. So essentially you don't want your process, or let's say what we call the happy path, when you see the most frequently executed path of the process, to be already nonconforming, basically, to a certain degree.

 

35:32

Sebastiaan van Zelst:

I've noticed that happy-path-like process executions, maybe not the ultimate happy path but some other execution that is also very frequent, are not necessarily always what you would really want the process to be. And the problem is how to come to that observation. I doubt whether having a process model that simply describes that this can happen is the best possible way to come to the observation that something is happening that you would not want to happen.

 

36:08

Jakub:

Maybe a question, would you have a specific use case in mind? You already mentioned one with the fines where basically this functionality would be superior to standard process discovery available in other commercial tools.

 

36:24

Sebastiaan van Zelst:

How I look at it, this is just a prototype. How I look at the technology we are developing is that I think it is bridging the two worlds of modelling and mining, right? I mean, SAP recently bought Signavio for a significant amount of money. So that signifies that somehow modelling processes is very important within organisations. I personally believe that one future avenue would definitely be that tools like Signavio, but it can also be that Celonis moves in that direction, get more and more backed by data. It's nice to have a process modelling tool which also allows you to collaborate with your colleagues. But if the model you make is a completely unrealistic view of the process you are operating, the question is what the model's value is. At the same time, having a model that is an appropriate representation of the process can be very helpful in various dimensions.

 

37:41

Patrick:

So is that a very typical kind of problem that your clients face in that regard where they say, well, we have a modelling tool and we have a process discovery tool, but there's just nothing kind of that meets in the middle and kind of gives me an accurate depiction of what my actual process is. And that's where you go, well, I've got the tool for you.

 

38:03

Sebastiaan van Zelst:

I'm not sure whether the awareness in industry is already at that level. Because if you look at the process maps, I mean, these artifacts already reveal all kinds of unexpected connections between activities or potential repetition of activities. But I think at some point, if you want to really go one step deeper and you would like to really try to express, OK, this is really what we want it to be, and are we doing that? Or if legislation is dictating that you need to do something in a specific way? For all these types of questions, having accurate models of the process is of vital importance. And I noticed that next week you're going to release a talk with Boudewijn van Dongen, and I can imagine he has a somewhat similar opinion on this.

 

39:03

Patrick:

No spoilers.

 

39:08

Jakub:

There we actually discussed this suitcase use case, where we are basically looking at where the suitcase will go and trying to map it at a certain part of the process, so that you are not showing the whole process but, you know, incrementally showing these steps. So I think this could be a very good use case for your tool as well. What you also wrote in the description, and I will read this again, is that there are feedback mechanisms implemented that notify the user of the quality of the discovered process models. What does that mean?

 

39:41

Sebastiaan van Zelst:

Well, that basically means that whenever you learn a process model, it will automatically do a full conformance check of the event data with that model. So you get instant feedback: OK, this new model that you've derived, what percentage of cases does it cover, and so on.
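
A minimal sketch of this kind of instant feedback, in plain Python rather than Cortado's or PM4Py's actual implementation. It assumes the model can be represented simply as the set of traces it allows; the log, the activity names, and the functions `case_coverage` and `uncovered_variants` are hypothetical and only illustrate the "what percentage of cases does it cover" number.

```python
from collections import Counter

# Hypothetical event log: one tuple of activities per case.
log = [
    ("create fine", "send fine", "payment"),
    ("create fine", "send fine", "payment"),
    ("create fine", "payment"),
    ("create fine", "send fine", "add penalty", "payment"),
]

# Hypothetical model, represented simply as the set of traces it allows.
model_traces = {
    ("create fine", "send fine", "payment"),
    ("create fine", "payment"),
}


def case_coverage(log, model_traces):
    """Return the fraction of cases whose trace is allowed by the model."""
    if not log:
        return 0.0
    fitting = sum(1 for trace in log if trace in model_traces)
    return fitting / len(log)


def uncovered_variants(log, model_traces):
    """Count how often each variant that the model does not allow occurs."""
    return Counter(trace for trace in log if trace not in model_traces)


coverage = case_coverage(log, model_traces)
print(f"{coverage:.0%} of cases are covered by the model")
print(uncovered_variants(log, model_traces))
```

Real conformance checking works against a process model rather than an explicit set of traces, so the membership test here stands in for a much more involved computation.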

 

40:04

Jakub:

And maybe another question would be, if let's say we also decide to work with your tool, is it publicly available? Is it free or is there some licensing fee? Is this more like a library that you just download and then use?

 

40:19

Sebastiaan van Zelst:

Yeah. So it's a web-based tool which is packaged as a standalone tool. The same holds, by the way, for another, more analysis-oriented tool we have, PMTK. Both of these tools are web applications, meaning you can run them in an organisation internally and you can reach them from anywhere with the web client. But both of them are packaged as standalone tools for any operating system, and you can use them for academic or non-commercial purposes and of course evaluation purposes. But our main aim is not necessarily to sell the technology. Our main aim would be to do a project where we exploit that technology together with an industry partner, and also extend it in a direction that is interesting for the partner.

 

41:18

Patrick:

Now, if it's a web based tool, I must ask, how efficient is this tool? How much data can I throw into it before it gives up?

 

41:29

Sebastiaan van Zelst:

So that depends a bit. Both tools actually have PM4Py in the back end, so it's a simple architecture where you have a web service at the front end: the front end calls the web service, and in the back end PM4Py is running. Of course, the amount of data you can upload depends a lot on the specs of the hardware that is running in the back end. We have not yet tried, but are very interested in trying to see, OK, can we, I don't know, hook up a Spark cluster and then go use some specific library that allows us to do distributed computations. Then it will basically scale until the very last server. So yeah, I think at the moment of course there's a limitation. It's hard to put a number on it. What I think Daniel does in Cortado is very smart, and I would always advise every industrial entity to also do that: when he computes these conformance checking artifacts, which generally can be a fairly computationally expensive thing to do, he always does it in an asynchronous way. So he just pushes a number of calls. So it can be that over time the statistics you see change, because it's asynchronously updating the results.
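
The asynchronous pattern described here can be sketched roughly as follows. This is not Cortado's code: `check_variant` is a hypothetical stand-in for an expensive conformance computation, the variants and counts are made up, and Python's standard `concurrent.futures` is just one possible way to push several calls and update a statistic as results arrive.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_variant(variant, model_traces):
    """Hypothetical stand-in for an expensive conformance check of one variant."""
    return variant, variant in model_traces


def conformance_updates(variant_counts, model_traces, max_workers=4):
    """Yield the running share of fitting cases among the cases checked so far."""
    checked = 0
    fitting = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Push one call per variant; results may arrive in any order.
        futures = [
            pool.submit(check_variant, variant, model_traces)
            for variant in variant_counts
        ]
        for future in as_completed(futures):
            variant, fits = future.result()
            checked += variant_counts[variant]
            if fits:
                fitting += variant_counts[variant]
            # The statistic a user would see changes as more results come in.
            yield fitting / checked


# Hypothetical variants with the number of cases that follow each one.
variant_counts = {
    ("a", "b", "c"): 6,
    ("a", "c"): 3,
    ("a", "b", "b", "c"): 1,
}
model_traces = {("a", "b", "c"), ("a", "c")}

updates = list(conformance_updates(variant_counts, model_traces))
final_share = updates[-1]
print(f"final share of fitting cases: {final_share:.0%}")
```

The intermediate values depend on which call finishes first, which is exactly why the displayed statistics can change over time; only the final value is stable.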

 

43:13

Jakub:

Now, Sebastiaan, you already mentioned it a little and that's moving on a bit from the Cortado tool and that would be the already mentioned Python libraries. And as a preparation for the episode, I was reading up a bit on your previous career successes and achievements and you actually developed one of the, well actually the largest Python library on the market when it comes to process mining. It's called PM4Py. I know Patrick is a big Python nerd, and he loves developing stuff himself, although we did not go as far as to develop our own process mining library. What could we use the library for?

 

43:56

Sebastiaan van Zelst:

In my view the strength of PM4Py is that it is for people that are more interested in really developing novel things or custom things, to do rapid prototyping, for sure. Both scientists, but also, I think, people that apply process mining in practice and have a bit more of a technical focus; that's a very good use case. I think in general the strengths are more on the data engineering side of things. Of course you can compute statistics, you can learn process models. We have a standardised feature extraction for event logs that you can then feed into neural networks for prediction purposes. But I personally use it mostly for, like, if I have an event log and I need to manipulate the data or I need to filter, then I'm using certain parts of it. For discovery, you can also use it, but I think a tool that has a bit of interaction is simply easier to use for that.
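
As a rough illustration of the data-engineering tasks mentioned here (manipulating, filtering, and extracting features from an event log), the sketch below uses only the Python standard library. It does not use PM4Py's API; the event log and the helper names `group_by_case`, `filter_cases_containing`, and `activity_count_features` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical flat event log: one dict per event.
events = [
    {"case": "c1", "activity": "register", "time": 1},
    {"case": "c1", "activity": "check", "time": 2},
    {"case": "c1", "activity": "pay", "time": 3},
    {"case": "c2", "activity": "register", "time": 1},
    {"case": "c2", "activity": "pay", "time": 4},
    {"case": "c3", "activity": "register", "time": 2},
    {"case": "c3", "activity": "check", "time": 3},
    {"case": "c3", "activity": "check", "time": 5},
]


def group_by_case(events):
    """Turn a flat list of events into one time-ordered trace per case."""
    cases = defaultdict(list)
    for event in sorted(events, key=lambda e: (e["case"], e["time"])):
        cases[event["case"]].append(event["activity"])
    return dict(cases)


def filter_cases_containing(cases, activity):
    """Keep only the cases in which the given activity occurs."""
    return {cid: trace for cid, trace in cases.items() if activity in trace}


def activity_count_features(cases, activities):
    """Encode each case as a vector of activity counts, e.g. for a predictor."""
    return {
        cid: [trace.count(activity) for activity in activities]
        for cid, trace in cases.items()
    }


cases = group_by_case(events)
paid = filter_cases_containing(cases, "pay")
features = activity_count_features(cases, ["register", "check", "pay"])
print(paid)
print(features)
```

A count vector per case is only one simple encoding; which features are useful for prediction depends on the question being asked.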

 

45:17

Patrick:

So it seems to be covering a lot, you know, from the data engineering point all the way to, like, the process discovery part. There's a whole bunch that goes on in between. So does it cover all these things, and how does that work? Is it fairly intuitive, and how complex are the operations you can do in this tool?

 

45:40

Sebastiaan van Zelst:

So in the tool we have implemented a bunch of process discovery algorithms, various filters that are also very specific to process mining and processes, conformance checking, and performance analysis of various sorts, also on these process maps. We actually have a lot of more theoretical work as well, like detecting whether a certain model is of a certain class or whether it can be transformed to a certain class. If you have process models in any format, you can generate synthetic data out of them; those are the things you can do now. There are a lot of these types of functionalities, probably too many to mention. When you look at the ease of use: in the initial versions, I came from a Java background and the architecture is a bit Java-esque actually, and we've been trying more and more to change the usage of the tool to be more Pythonic, by simply looking at how other libraries do that. We're also planning to do a major release in the upcoming weeks or months, where we are really trying to change a few things under the hood, primarily using pandas everywhere inside, rather than the event log objects which we custom coded ourselves some time ago.
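
Generating synthetic data from a process model can be illustrated with a toy playout. The sketch below is a simplification under stated assumptions, not PM4Py's implementation: the model is a hypothetical nested-tuple process tree with only sequence and exclusive-choice operators, and `playout` enumerates every trace the model allows.

```python
from itertools import product

# Hypothetical process-tree-like model as nested tuples:
# ("seq", ...) runs its children in order, ("xor", ...) picks exactly one child,
# and a plain string is a single activity.
model = (
    "seq",
    "register",
    ("xor", "approve", ("seq", "reject", "notify")),
    "archive",
)


def playout(node):
    """Return every trace the model allows, as a list of activity tuples."""
    if isinstance(node, str):
        return [(node,)]
    operator, *children = node
    child_traces = [playout(child) for child in children]
    if operator == "xor":
        # Exclusive choice: any trace of any single child.
        return [trace for traces in child_traces for trace in traces]
    if operator == "seq":
        # Sequence: concatenate one trace from each child, in order.
        return [sum(combo, ()) for combo in product(*child_traces)]
    raise ValueError(f"unknown operator: {operator}")


synthetic_log = playout(model)
for trace in synthetic_log:
    print(" -> ".join(trace))
```

Models with loops or parallelism allow unboundedly many or interleaved traces, so a realistic playout would sample traces instead of enumerating them all.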

 

47:24

Patrick:

I have so many questions, one of them being: how does it work when you decide on, you know, what bugs to fix or what features to implement, what to change? Is it more that the users of the tool kind of submit bug reports or feature requests and things like that, and you take that in? Or do you have, like, your own milestones, your own agenda that you're trying to push?

 

47:46

Sebastiaan van Zelst:

It's a bit of a mixture. For me, the agenda has lately been more about trying to actually cut down some of the functionalities from time to time. Actually, one of my colleagues did the main development. I did more of the steering. I did more development in the beginning, and he took over gradually; for me it's been more coordinating recently. My personal focus is that I would like to simplify and I would like to document better at the moment. That's my personal goal for the library. As I said, it's a bit of a mixture. So of course on GitHub we get requests or issues. Also, as we use it ourselves in some of our software, that reveals issues and bugs from time to time, etc. What is relatively new is that my colleague Alessandro, who is doing a lot of coding for PM4Py, has also been involved in the development of this more recent object-centric process mining idea. So he's also been implementing support to handle these types of datasets and to discover models from these types of datasets, and I expect that more algorithms will eventually come in that direction.

 

49:13

Patrick:

Ok, it's very interesting. So is this something that you could just install on a laptop and have it run things? Or is this also something that companies can implement, you know, on a larger architecture in their kind of server clusters?

 

49:26

Sebastiaan van Zelst:

They can definitely do that. At the moment, we ourselves don't have any sort of support for distributed computation yet. There are certain companies using PM4Py in a proprietary fashion, for which we of course provide licences so that they can do that, because the licence under which PM4Py is released does not allow you to do that. But this is definitely possible and there are companies doing that.

 

49:57

Patrick:

So what is the goal? What are you trying to aim for with this project? Just kind of Pythonic versions of some of the mainstream tools and functions that we already see in commercial tools, or do you have some sort of a bigger goal that you're trying to achieve with this Python project?

 

50:17

Sebastiaan van Zelst:

Yeah. So I think the goal would be to try to enable people to use, I would say, the major process mining algorithms that, you know, have been around for a while and have proven useful, by making them available in Python. If you look at the ProM framework, which sort of still exists and is also still used, it of course contains a lot of algorithms that are very hard to use or just simply don't work. So the goal is definitely to make an assessment from time to time of what is out there in the literature and what seems to also remain, to some degree.

 

51:12

Jakub:

Speaking of goals, Sebastiaan, what gets you excited for the next couple of months and maybe even years when it comes to new research topics and new areas of process mining that you're trying to solve together with your team at Fraunhofer, that you could maybe share with us?

 

51:33

Sebastiaan van Zelst:

So what excites me a lot at the moment is the new research direction that we're trying to push. What I noticed, and this is actually based on interactions with industry, is that in the process mining field, which originates from BPM, the business process management field, we somehow tend to simply assume that the process is executed as if some oracle does that. We don't really take into account that there are people involved, and that people have working hours and usually need a schedule or something to do their thing. So we ignore the whole idea of planning ahead. What I also noticed in the operations research community, which is the community that looks to some degree, or largely, at automated planning systems, is that they often ignore the dynamic aspect of the real world. So they sort of have this assumption that, OK, you know exactly what you want to do, and in certain cases this is true. If you look at production, yeah, you probably know a bit better in advance how many cars you want to produce, but still so many things can go wrong. And I look at event data as sort of the missing link between those two worlds. So on the one hand, operations research can exploit the data and learn from the data, and at some point counteract and revise planning completely in an automated fashion. On the other hand, processes can be run much more smoothly if we improve the schedules and plans that we have. So to me it feels like a very big gap. It can be that if you talk with an operations research guy, he says, no, we've already done that, right? That's also what scientists sometimes tend to do. But now I've seen, at least in industry, cases where such technology would be extremely helpful.

 

53:48

Jakub:

Interesting. Well, Sebastiaan, time is running by fast. And before we wrap up the episode, I would like to ask you, where could our listeners and our other process mining nerds, Python nerds, whoever listens to this episode. Where can they go and find out more about you, what you've been doing and eventually also about the Python libraries and the tool Cortado?

 

54:13

Sebastiaan van Zelst:

Well, there's a bunch of websites that I would need to list. The best thing to do, if you are interested in reading some of my scientific work: I usually have the papers freely available on my personal website, that's sebastiaanvanzelst.com. If you want to know more about applying process mining, you can go to the Centre for process intelligence website, so it's cpi.fit.fraunhofer.de, very difficult. And finally we have of course the page of our research group, which is fit.fraunhofer.de/process-mining, and I'm also available on LinkedIn. So people can send me a message there.

 

55:16

Jakub:

Yeah, you will find all the links in our post when it is released. So no worries. If you didn't get the long website of Fraunhofer, it'll be there. Sebastiaan, thank you very, very much for coming to the show. It's been a real pleasure. As I keep saying, it's always about exploring new ways of working, and I already have a lot of ideas, especially for this exploratory approach that you're using in Cortado, so we will definitely catch up on this together with Patrick. So thank you very much.

 

55:48

Sebastiaan van Zelst:

Yep. Was fun to be in the show.

 

55:51

Jakub:

All right. For everyone else who's listening, also thank you for that. As you know, you can find us on LinkedIn. If you have any questions, just hit us up there. We will be sure to answer you, or eventually also answer the question in the podcast. If you have any ideas for a future guest or someone you would really like to hear on our show, let us know. You can also find the show notes on the website, miningyourbusinesspodcast.com. If you like us, leave us a review, give us some comments; we always appreciate that. And thank you for your time and thank you for being with us. Patrick, Sebastiaan, bye bye.