Security

Safety and security incident handling

Lassie

Problem

Incident reporting in many organizations is still old-fashioned and focuses only on major accidents. Traditionally, an incident is reported after the fact with pen and paper, which is time-consuming and not very motivating for employees. Incident management also consumes a lot of resources and relies on heavy processes. The result is little learning, a lot of work, and the feeling that nobody actually cares.

Learning from incidents is, however, important. Near-misses, minor incidents and other unsafe acts and conditions precede and predict bigger risks and accidents. These types of incidents are very common, but they are usually not reported, because reporting is inconvenient and perceived as useless within the organization.

Solution

The solution to these problems, created during the D2I program, is called Lassie. Lassie is a low-barrier security and safety incident reporting and analysis system that enables staff to report incidents easily and immediately, on location, via a mobile device.

For the informant, Lassie is a mobile application that makes it easy to report an incident, follow up on its processing, receive important security information via bulletins and communicate with the handler. For the handlers, it is a web app for incident processing, further analysis (root cause, impact), communications and reporting.

Benefits

Lassie offers many benefits to users and the organization. It makes incident reporting easy and time-saving for users while helping the organization avoid major accidents and save money.

It also helps to keep all incident-related data in one place, to share reports and to build a better picture of the safety and security situation.

Lassie also helps to create and promote a safety culture inside the organization. Motivated and safety-oriented employees are efficient and work safely.

Future

In the future, components and technologies created in the Lassie project will be used in various business concepts in other areas, such as cyber security. One example is the LiveNERs concept demonstration (video below), which utilizes technologies created in the Lassie project (Infinity Drive).

More about the magic, science and technology behind Lassie:

Screenshots and promotional videos:

Lassie on mobile
Lassie in a web browser

Web Content Analysis

Objective and applications

The goal is to build a system for high-quality classification of web content, with two primary applications: Parental Control and Productivity Control.

Here we focus mainly on the first application, Parental Control, which gives parents a way to prevent their children from visiting undesirable web sites. Parental Control is an important component of "family protection" solutions and often a requirement from operators providing service packages to their subscribers. The second application, Productivity Control, limits access to content that distracts employees from work; its use is somewhat controversial and the business demand for it is often unclear.

Categories to support

Naturally, most of the content categories that we need to support are those undesirable for children, covering such content as adult, alcohol, drugs, violence, etc. At the same time, we need to be able to recognize web sites such as news and blogs, since those are hard to handle correctly for most machine learning-based classifiers. Examples of categories relevant for Productivity Control solutions are Sport, Games and Dating, but those were outside the project scope.

Challenges

The challenges are many and diverse. To start with, many categories have no clear and commonly accepted definitions. Training (labeled) sets are hard to obtain and maintain, and high quality is difficult to achieve. How should we sample to avoid bias and achieve good representation of the real web content "out there"? How can we minimize the subjectivity of labeling? How can we minimize the cost of keeping the sets up to date?

We debated many times whether we should train the classifiers on more specific (fine-grained) or more general categories. If we had large, representative, and clean training sets, teaching our classifiers to recognize more specific (sub-) categories would likely be a better strategy. For instance, recognizing beer-related content is probably easier than (more general) alcohol-related content. The reality, however, makes this choice non-obvious, as we rarely have good and sufficiently large sets for each natural subcategory within the top-level categories.

There are sites with very diverse or frequently changing content, and it probably makes little sense to send those to our classifiers as that will likely lead to confusion and false positives. So, we need a logic for selecting URLs to classify. In particular, it makes sense to avoid such highly dynamic sites as news, forums, etc., handling those in other ways.
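
One possible shape for this selection logic, under the assumption that a maintained list of dynamic domains is available, is sketched below in Python; the domain names are illustrative placeholders.

from urllib.parse import urlparse

# Illustrative skip list; a real deployment would maintain this from site metadata.
DYNAMIC_DOMAINS = {"news.example.com", "forum.example.org"}

def should_classify(url):
    """Send a URL to the content classifiers only if its host is not on the skip list."""
    return urlparse(url).netloc.lower() not in DYNAMIC_DOMAINS

print(should_classify("https://news.example.com/today"))  # False: handled in other ways
print(should_classify("https://shop.example.net/wine"))   # True: classify the content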

It is obvious that even with good training sets the classifiers cannot be perfect, so we need a mechanism for mitigating mistakes, e.g., rules based on human expertise or feedback from experts (or even from the users). It is important to take the severity of mistakes into account, as some are much more costly than others (for instance, misclassifying very popular web resources).
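
As an illustration, the Python sketch below shows one simple mitigation mechanism of this kind: expert-maintained category overrides applied on top of the classifier output, with conflicts on very popular domains flagged as high-severity. The domains, ranks and threshold are assumptions made for the example.

# Expert-curated overrides and popularity ranks are made up for the example.
OVERRIDES = {"encyclopedia.example.org": "reference"}
POPULARITY_RANK = {"encyclopedia.example.org": 12}   # lower rank = more traffic

def final_category(domain, predicted):
    """Apply expert overrides on top of the classifier; flag high-severity conflicts."""
    if domain in OVERRIDES and OVERRIDES[domain] != predicted:
        if POPULARITY_RANK.get(domain, 10**6) < 1000:  # mistake on a very popular site
            print(f"high-severity conflict on {domain}: {predicted!r} overridden")
        return OVERRIDES[domain]
    return predicted

print(final_category("encyclopedia.example.org", "adult"))  # override wins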

Handling multiple languages is important and hard; this concerns Asian languages in particular. Machine translation seems to give good results, but using it in production can be very costly. Non-text-based features are certainly an avenue for exploration.

Web content keeps changing and evolving, and performance of classifiers inevitably degrades sooner or later. So, we need strategies for detecting that and efficiently re-training the classifiers. This requires other supporting systems, processes, and human expertise.
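
One simple way to detect such degradation, sketched below, is to periodically score the classifier on a small, freshly labeled audit sample and flag it for re-training when accuracy drops below a threshold; the threshold and the sample here are illustrative.

def needs_retraining(predictions, audit_labels, threshold=0.85):
    """True if accuracy on the latest hand-labeled audit batch falls below the threshold."""
    correct = sum(p == y for p, y in zip(predictions, audit_labels))
    return correct / len(audit_labels) < threshold

print(needs_retraining(["adult", "news", "news"], ["adult", "news", "alcohol"]))  # True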

What content to look at?

There are several natural and good sources of information that web content classifiers can be based on: the text of the page (extracted from various HTML components), the URL itself and the images on the page.

Our classifiers are currently based on text analysis, using a number of HTML components as the sources, including "a", "h1", "h2", "linktitle", "metacontent", "span", etc. For each category-source pair, we count the occurrences of each word from a carefully selected, pair-specific word set in the corresponding HTML component of a web page; these counts, appropriately normalized, form the feature space. A natural extension would be to look at pairs or triples of words, but the first experiments did not show any noticeable added value. This probably means we need to try other classification approaches, since analyzing words in context should, in principle, improve classification accuracy.
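
To make the feature construction concrete, the Python sketch below computes normalized per-tag word counts for a toy vocabulary. The tag list and the word set are illustrative placeholders, not the actual category-source pair-specific sets used in the project.

from collections import Counter
from html.parser import HTMLParser

TAGS = ("a", "h1", "h2", "span")      # assumed subset of the HTML sources
VOCAB = {"wine", "beer", "vodka"}     # hypothetical word set for one category-source pair

class TagTextCollector(HTMLParser):
    """Collects the text found directly inside each HTML tag of interest."""
    def __init__(self):
        super().__init__()
        self._stack = []
        self.text_by_tag = {t: [] for t in TAGS}

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and self._stack[-1] in self.text_by_tag:
            self.text_by_tag[self._stack[-1]].append(data.lower())

def features(html):
    """Normalized counts of vocabulary words per tag (one feature per tag)."""
    parser = TagTextCollector()
    parser.feed(html)
    feats = []
    for tag in TAGS:
        words = " ".join(parser.text_by_tag[tag]).split()
        hits = Counter(w for w in words if w in VOCAB)
        feats.append(sum(hits.values()) / max(len(words), 1))  # normalize by tag word count
    return feats

print(features("<h1>Beer and wine shop</h1><a href='/'>buy beer</a>"))  # [0.5, 0.5, 0.0, 0.0]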

Overall approach

The current text-based classifier uses a two-layer model, with Support Vector Machines (SVMs) in the first layer and an Extreme Learning Machine (ELM) in the second. Each source-category pair has its own binary SVM classifier, which outputs the probability that a given sample belongs to the specified category. The output of the first layer is a vector of real numbers, where each number is the probability that one specific source of the analyzed web page belongs to one specific category.
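
A toy Python sketch of this first layer is shown below: one probabilistic binary SVM per (source, category) pair, whose outputs are concatenated into the probability vector. The sources, categories and training data are synthetic placeholders.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
sources = ["a", "h1", "metacontent"]   # assumed HTML sources
categories = ["adult", "alcohol"]      # example categories

# Hypothetical training data: one feature matrix per source, binary labels per pair.
X_train = {s: rng.random((200, 10)) for s in sources}
y_train = {(s, c): rng.integers(0, 2, 200) for s in sources for c in categories}

# One probabilistic binary SVM per (source, category) pair.
svms = {(s, c): SVC(probability=True).fit(X_train[s], y_train[(s, c)])
        for s in sources for c in categories}

def first_layer_vector(features_by_source):
    """Concatenate P(category | source features) over all (source, category) pairs."""
    return np.array([
        svms[(s, c)].predict_proba(features_by_source[s].reshape(1, -1))[0, 1]
        for s in sources for c in categories
    ])

sample = {s: rng.random(10) for s in sources}
print(first_layer_vector(sample))      # one probability per (source, category) pair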

The second layer is constructed as a single multi-class ELM. The input to this layer is the vector of probabilities described above. To avoid potential source-specific bias, we normalize the components of the vector over the training set. The output of the second layer is a single label corresponding to the "most likely" category of a given web page.
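
The following minimal numpy sketch illustrates such a second layer: the probability vectors are normalized component-wise over the training set, passed through a fixed random hidden layer, and the output weights are solved by least squares, which is the standard ELM recipe. The sizes and data are illustrative, not those used in the project.

import numpy as np

rng = np.random.default_rng(1)
n_pairs, n_categories, n_hidden = 6, 4, 50    # assumed sizes

# Hypothetical training data: first-layer probability vectors and category ids.
P_train = rng.random((500, n_pairs))
y_train = rng.integers(0, n_categories, 500)

# Normalize each component over the training set to reduce source-specific bias.
mean, std = P_train.mean(axis=0), P_train.std(axis=0) + 1e-9
Z_train = (P_train - mean) / std

# ELM: fixed random hidden layer, output weights solved by least squares.
W_in = rng.normal(size=(n_pairs, n_hidden))
b = rng.normal(size=n_hidden)
H = np.tanh(Z_train @ W_in + b)
T = np.eye(n_categories)[y_train]             # one-hot targets
W_out, *_ = np.linalg.lstsq(H, T, rcond=None)

def predict_category(p_vector):
    """Return the index of the 'most likely' category for one probability vector."""
    h = np.tanh(((p_vector - mean) / std) @ W_in + b)
    return int(np.argmax(h @ W_out))

print(predict_category(rng.random(n_pairs)))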

Sentiment analysis-based classifiers were prototyped for such categories as Hate and Violence. We hope that combining these with the above classifier will improve the classification results for these two challenging categories, but better training sets and extensive experiments are required.

The URL-based classifier is naturally fast and generates a small number of false positives, but has low recall.
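
For illustration, a URL-based classifier of this kind can be as simple as the sketch below, which only fires on unambiguous URL tokens; the keyword lists are invented for the example, which also illustrates why recall stays low.

import re
from urllib.parse import urlparse

URL_KEYWORDS = {                      # assumed, unambiguous tokens only
    "gambling": {"casino", "poker"},
    "adult": {"xxx"},
}

def classify_url(url):
    """Return a category if an unambiguous URL token matches, else None."""
    parsed = urlparse(url)
    tokens = set(re.split(r"[./\-_?=&]+", (parsed.netloc + parsed.path).lower()))
    for category, keywords in URL_KEYWORDS.items():
        if tokens & keywords:
            return category
    return None                        # no match: fall back to content-based classifiers

print(classify_url("https://play.example.net/poker/tables"))  # gambling
print(classify_url("https://news.example.com/economy"))       # None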

Finally, an image-based classifier was prototyped but requires further work. If we achieve good classification precision and recall based on image information, that will also help with categorizing non-English-language web sites.

Moving forward

While one can easily identify many directions to explore, prioritizing the list is non-trivial. We mention here a number of clear action points and ideas to investigate that were brought up in the project.

Further details: