Structured or unstructured data. What is the story?

Everyone who works with data tries to express that digital information has many forms. To distinguish the different kinds of challenges that different information structures throw at us, we have come up with words such as ‘structured’, ‘unstructured’ and ‘raw’ data.

We use these words a lot, but their meaning is ambiguous. This simple question on LinkedIn has a comment thread which illustrates the problem.

The problem illustrated with words

I think the confusion has its origins in mixing up differences in information structure with differences in data format. We do not distinguish between what is human-understandable and what is machine-processable.

To illustrate the problem, I will use this sentence: ‘When walking in the woods, most mushrooms I saw are of the type figuring in children’s stories as the home of the leprechauns’.

Most people would identify the mushroom as a fly agaric, at least if they live in North-West Europe. A computer program would not know what to do if we fed it the sentence and asked the question ‘what mushroom is this?’

A lot of people would say that a file with a sentence is unstructured data. But that is a dubious thing to say. It is a challenge to combine the information in the sentence with other information by using a computer, but both the sentence and the file have a structure.

Grammar is a structure which enables humans to process this information without a problem. Machines can read the data file without a hitch and do something with it, such as displaying it on the screen for you, the reader.

Machines can process the data, but it becomes difficult if the machine has to process the information contained in the sentence. You need different types of software processors depending on the intended use. You might want to index the sentence with a search engine, use a natural language algorithm to extract information from it, or classify it.

What we do know is that analysing sentences with computers takes more effort than analysing information structured in a tabular format, as found in most databases or in your Excel sheet.

So, if a tabular format is easier to process, how would you store the sentence in a tabular format? The answer is that you don’t. It does not make sense to store a fully contextual sentence in tabular format to communicate the information. Instead, you would design a tabular structure to count observed mushrooms, as reported by different users in different data sources, like Instagram photo tags, blogs, or the reports of foresters.
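To make the idea of a tabular count concrete, here is a minimal sketch in Python. The column names (`source`, `reporter`, `species`, `count`) and the sample rows are invented for illustration; they are assumptions, not a schema taken from any real system.

```python
from collections import Counter

# Hypothetical observations: every row has the same fixed, named columns.
observations = [
    {"source": "instagram_tag", "reporter": "user_a", "species": "fly agaric", "count": 3},
    {"source": "blog", "reporter": "user_b", "species": "fly agaric", "count": 1},
    {"source": "forester_report", "reporter": "forester_c", "species": "chanterelle", "count": 5},
    {"source": "forester_report", "reporter": "forester_c", "species": "fly agaric", "count": 2},
]


def count_by_species(rows):
    """Sum the reported counts per species across all sources."""
    totals = Counter()
    for row in rows:
        totals[row["species"]] += row["count"]
    return dict(totals)


totals = count_by_species(observations)
print(totals)
```

Because every row has the same columns, the aggregation is a few lines of code. The original sentence carries far more context, but none of it is needed to answer the question of how many mushrooms of each kind were reported.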

The labels ‘structured’ and ‘unstructured’ indicate how information is organised: in a fixed tabular format, or not.

Information or data? The age-old discussion

As can be observed in the LinkedIn discussion, professionals start to debate ‘data’ and ‘information’ without a consensus on what information is and what data is.

The debate over the finer semantics of the words does not help users of digital information who are mystified by technological capabilities, and it certainly does not help in setting expectations about what it takes to process different types of information with computers and what you can do with the results.

I believe it is more useful to think in terms of measures, transactions and messages. Barry Devlin has defined a modern architecture, called REAL, to distinguish these three types of information structures. Each information structure needs a different kind of computer processing to make the information meaningful to humans.

  • Measures are very close to information produced by equipment and machines. Think of a thermometer which produces an endless stream of measures. Measures are easy for computers to digest, process and aggregate. Humans need a bit more context to know that ‘32’ is a temperature measured in Fahrenheit and originates from a thermometer in freezer no. 5, situated in building B. We need to add a lot of structural elements to the measures to be able to monitor what the temperature curve of all our freezers looked like over the last 24 hours.
  • Transactions are precisely defined, coherent pieces of information that are often represented in tabular structures, or in more flexible structures like XML or JSON. Think about a sales order record. Every attribute of a transaction is named and the nature of each attribute, like textual or numeric, is predefined. Depending on the rigidity or the flexibility of the chosen data format, you need different levels of sophistication in software to process this and produce a result that is consistent and readable for humans.
  • Messages are what we refer to as natural language. This article is an example of a message. For humans, the information is easy to understand and interpret, because it is rich in context. Machines can read the data format, like a text processor file or an HTML file, but have a hard time aggregating the information captured in a message. We need sophisticated algorithms to extract the most important information out of it.
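The three structures can be sketched as simple Python types. This is only an illustration: the class and field names below are assumptions made up for this example and are not part of Barry Devlin's REAL architecture.

```python
from dataclasses import dataclass


@dataclass
class Measure:
    """A raw reading plus the context a human needs to interpret it."""
    value: float
    unit: str       # e.g. "F" for Fahrenheit
    sensor: str     # e.g. "thermometer"
    location: str   # e.g. "freezer no. 5, building B"


@dataclass
class Transaction:
    """A record in which every attribute is named and typed in advance."""
    order_id: str
    product: str
    quantity: int
    unit_price: float

    def total(self) -> float:
        return self.quantity * self.unit_price


@dataclass
class Message:
    """Natural language: easy to store and display, hard to aggregate."""
    text: str


def to_celsius(measure: Measure) -> float:
    """Convert a Fahrenheit measure to Celsius; other units are rejected."""
    if measure.unit != "F":
        raise ValueError("this sketch only handles Fahrenheit")
    return (measure.value - 32) * 5 / 9


reading = Measure(32, "F", "thermometer", "freezer no. 5, building B")
order = Transaction("SO-1", "mushroom guide", 3, 2.5)
note = Message("When walking in the woods, most mushrooms I saw are of the "
               "type figuring in children's stories.")

print(to_celsius(reading))     # the bare '32' only makes sense with its unit
print(order.total())           # predefined attributes make arithmetic trivial
print(len(note.text.split()))  # a machine can count words, not grasp meaning
```

The measure and the transaction can be converted and totalled in a line each, because their structure says what each value means. For the message, the only cheap operation is a surface one such as counting words.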

So what about analytics?

If you want to do analytics across differently structured information, you need to pre-process the messages, transactions or measures into a structure which is compatible with the others.

In practice, you can either index measures and transactions and add them to the index of messages, or restructure messages and measures into a tabular form, so that calculations and different cross sections can be made.
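The second route, restructuring messages into a tabular form, can be sketched as follows. This is a deliberately naive illustration: the species list, the sample messages and the column names are all assumptions chosen for the example, and a plain keyword match stands in for the far more sophisticated language processing a real system would need.

```python
import re
from collections import Counter

# Hypothetical vocabulary, chosen by a human before any processing starts.
KNOWN_SPECIES = ("fly agaric", "chanterelle")

# Hypothetical messages from different sources.
messages = [
    {"source": "blog",
     "text": "We saw a fly agaric near the path, and another fly agaric by the oak."},
    {"source": "forester_report",
     "text": "Chanterelle patch found in the north plot."},
    {"source": "blog",
     "text": "A lovely walk, no mushrooms today."},
]


def message_to_rows(message):
    """Turn one free-text message into zero or more fixed-column rows."""
    text = message["text"].lower()
    rows = []
    for species in KNOWN_SPECIES:
        pattern = r"\b" + re.escape(species) + r"\b"
        mentions = len(re.findall(pattern, text))
        if mentions:
            rows.append({"source": message["source"],
                         "species": species,
                         "mentions": mentions})
    return rows


rows = [row for message in messages for row in message_to_rows(message)]

# Once the messages are rows, they can be aggregated like any other table.
mentions_by_species = Counter()
for row in rows:
    mentions_by_species[row["species"]] += row["mentions"]

print(rows)
print(dict(mentions_by_species))
```

Note what this sketch cannot do: the example sentence about the mushrooms from children's stories never names a species, so a keyword match would produce no row for it at all. Choosing the vocabulary and deciding what counts as an observation are human decisions made before the software runs.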

The progress in technology is that software will increasingly take care of that tedious process for you. It will not do so by itself though, or at least not yet. Human intervention and decisions are needed in the pre-processing part, so that we can use the information downstream in our digitisation efforts.