May
17

Big Data – A Small Example: Analysing States of Jersey Hansard Transcripts in HDInsight

This post forms part of our “Big Data – A Small Example” series and describes how we used Microsoft HDInsight to extract data from States of Jersey Hansard transcripts previously collected.

Setup

As Buck Woody explains here, you have several options for using HDInsight: as an Azure Service; as an HDInsight Server; or installing it on as Azure virtual machine, which is the approach we took. M’colleague Jonathan Holmes recently blogged about how to set up an Azure VM.

Once you have your VM up and running, it’s easy to download the Web Platform Installer, and install the HDInsight developer preview from there. However, there is one gotcha: while it claims to download and install all dependencies, make sure you install Sql Server first as the web UI wont start without one of the DLLs that comes with it!

Loading data

So that the MapReduce job we’re going to write can access our scraped transcripts we need to load them into the HDFS file system. From the Start menu we open the ‘Hadoop Command Line’ where we can execute Hadoop commands of the form hadoop <command> <arguments>. Executing hadoop without any commands will list the commands available, and executing hadoop <command> without any arguments will list the arguments for the specified command.

First we create a project folder:

> hadoop fs -mkdir myProjectFolder

This will create a folder with the absolute path of /user/<username>/myProjectFolder. The current working directory is /user/<username> so you can check this folder has been created with hadoop fs -ls . (that’s a single period to indicate the current folder.)

We then use the -put command to load the transcripts file into that folder:

> hadoop fs -put hansard.json myProjectFolder

MapReduce jobs

The standard way to implement a MapReduce job in Hadoop is in Java. the Hadoop command-line has as a jar command for launching Java MapReduce jobs, which are simply Java programs implementing Mapper and Reducer base classes in the org.apache.hadoop.mapred library. However, if, like me, you’re more used to programming in C#, help is at hand! Over at CodePlex is the Microsoft .NET Hadoop SDK, which can be installed via the NuGet package manager.

The SDK provides a set of classes for you to inherit from – a Mapper, Reducer, Combiner and Job class. Create a Console application with the SDK packages installed and you can easily create a MapReduce job to run on HDInsight. The functionality supplied by the SDK will create the necessary Java to execute your code as a MapReduce job – all you have to do is run the .exe created when you build the solution, and your job runs! (Gotcha: make sure there is no whitespace in the folder path to your .exe – this will cause dependency issues as it won’t be able to find the SDK dlls.)

Mapper

The data input to the job will be split up and passed line by line into Mappers. The number of Mappers operating will depend on your setup, but in this HDInsight Developer’s Preview you get three. The basic structure of the Mapper  code looks like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hadoop.MapReduce;

namespace MyMapReduceJob.Map
{
    public class MyMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            //code to analyse the inputLine
            context.EmitKeyValue("key", "value");
        }
    }
}

What the keys and values are is totally up to you.

In our case the array of JSON objects representing the scraped transcripts were stored one per line, meaning that each Mapper was handed a single transcript at every execution. Since our primary aim was to measure the contribution of each States Member we had to traverse the transcript identifying who was speaking at each point. Fortunately, every paragraph and header comes wrapped in its own HTML tags, so we were able to use this to put some structure around the text. I shall spare you the code, but the algorithm looked like this:

  1. Deserialise the JSON object.
  2. Extract the transcript and clean the HTML (which was not well-formed, so some tags had to be stripped out.)
  3. Load the cleaned transcript into an XDocument, allowing us to traverse the HTML.
  4. Iterate over the XDocument loading elements that contained text into ‘Paragraph’ objects via a ‘TranscriptManager’ object which handled things like tracking section numbers. The ‘Paragraph’ object also performed functions such as stripping out the remaining HTML to provide clear text.
  5. Iterate over the paragraphs in the TranscriptManager, using regular expressions to identify new speakers and proposition codes and also checking for Oral or Written Question headers.
  6. Finally, for each paragraph two key/value pairs were output: (1) a key in the form “count|<date>|<subject>|<states member>” and the transcript text, (2) a key in the form “text|<date>|<subject>|<states member>” and the transcript text annotated with its location in the transcripts (this is used later to show the text making up the numbers.)

Reducer

The job of the Reducer is to perform the work on the data provider by the Mapper. The Reducer is handed a key and a list of values emitted for that key by the Mappers. In our case we implemented the Reducer thus:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hadoop.MapReduce;
using System.Diagnostics;

namespace Map.Reduce
{
    public class StatesMemberReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            Trace.WriteLine(DateTime.Now + " - Reduce called.");

            string[] keyParts = key.Split('|');

            if (keyParts[0].Equals("count"))
            {
                ReduceCount(key, values, context, keyParts);
            }
            else if (keyParts[0].Equals("text"))
            {
                ReduceText(key, values, context);
            }
        }

        private void ReduceCount(string key, IEnumerable<string> values, ReducerCombinerContext context, string[] keyParts)
        {
            if (keyParts[2].Equals("Written Questions"))
            {
                context.EmitKeyValue(key, values.Count().ToString());
            }
            else
            {
                context.EmitKeyValue(key, GetWordCount(values));
            }
        }

        private string GetWordCount(IEnumerable<string> values)
        {
            int wordCount = 0;

            foreach(string paragraph in values)
            {
                string[] words = paragraph.Split(' ');

                wordCount += words.Length;
            }

            return wordCount.ToString();
        }

        private void ReduceText(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            StringBuilder builder = new StringBuilder();

            bool first = true;

            foreach (string paragraph in values)
            {
                if (!first)
                {
                    builder.Append("|");
                }
                else
                {
                    first = false;
                }

                builder.Append(paragraph);
            }

            context.EmitKeyValue(key, builder.ToString());
        }
    }
}

(I’m including the actual Reducer rather than an algorithm and template code as the Reducer is much simpler than the Mapper.)

Here, depending on the first part of the key we either iterate over the values counting the number of words/questions (in the case of ‘count’ values) or append all the transcript text together (in the case of ‘text’ values.)

Job

The final piece is the Job class, which ties everything together. Ours looks like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hadoop.MapReduce;
using Map.Map;
using Map.Reduce;
using System.Diagnostics;

namespace Map.Job
{
    public class StatesMemberCounterJob : HadoopJob<StatesMemberMapper, StatesMemberReducer, StatesMemberCombiner>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            Trace.WriteLine(DateTime.Now + " - Configure called.");
            var config = new HadoopJobConfiguration();
            config.InputPath = "hansard";
            config.OutputFolder = "hansard/output";
            return config;
        }
    }
}

The key parts here are the Generic Types in the base class, which define the Mapper, Reducer and Combiner (more on that in a moment) and the configuration provided to the ExecutionContext. The generic typing of the base class is overloaded, so you must at minimum provide a Mapper, but can also provide a Reducer and also a Combiner. The Combiner is used to collate the output from Reducers who have reduced a subset of key values. In many cases the Combiner can be another instance of the Reducer class, however, in this case we had to create our own to ensure the counting of words/questions was handled properly (i.e. the values were added together rather than counted.)

Execution

Once the program is compiled and (if necessary) moved to the machine your HDInsight cluster is installed on, you only need to execute the Exe for it to load itself into HDInisght and execute as a MapReduce job. A console window will open describing the dependencies being loaded

However, some more Gotchas:

  • HDInight is built using .Net 4, so make sure that is the operating version of .Net (installing Sql Server can involve installing .Net 3.5 components causing that to become the registered version of .Net.)
  • It is very likely your HDInsight cluster will be on a 64-bit server. Ensure your C# program is set to target this platform (Project Properties -> Build) otherwise you will get an obscure ‘Bad Image’ exception.
  • Your Mappers and Reducers do not run in the same application context as the Exe that launches them, logging and debugging becomes… challenging.
  • Use Console.Read() at the end of your Program.cs Main method so that the Console window does not close before you’ve read any success or error messages.

Debugging

As the C# code creates a Hadoop job that executes Java to run the C# code, the execution of your Mappers and Reducers is outside the context of the launch application. If your job fails you can look in the job logs, but don’t expect too much. If you have access to the Server HDInsight is running on you can navigate to http://localhost:50030/ to access the Hadoop Administration page. From here you can access the logs, but the stack traces available are for the Java, so make your errors descriptive!

An alternative, however, is to home-brew your own logging and write directly to the filesystem. This provides a lot more context than what you’ll get out of the Hadoop logs.

Summary

Being able to code a MapReduce job in C# definitely speeds up development and makes for a good entry point to coding jobs in HDInsight. However, it does come with some challenges, specifically around debugging. I also noticed that the performance suffered once I started making use of other .NET libraries, so there is definitely an overhead and it may be worth considering coding-your-own. Otherwise it’s pretty easy to set up and get going. I like. Even better, as this was on an Azure VM I was able to beef up the power of the machine when necessary (outputting huge chunks of transcript caused quite a performance hit!)

Next week: pulling the output out of HDInsight and loading it into a data warehouse.

Share:
  • Print
  • Twitter
  • LinkedIn
  • Facebook
  • Google Bookmarks
  • del.icio.us
  • StumbleUpon
  • Digg

May
10

Big Data – A Small Example: Screen-scraping the States of Jersey Hansard

This post forms part of our “Big Data – A Small Example” series and describes how we collected our data.

The States of Jersey Assembly maintains a Hansard record of States sittings, available online here. To scrape the transcripts we used a Python tool called Scrapy, which is relatively easy to set up, and ridiculously easy to use.

The Scrapy framework works by providing a lot of functionality “out of the box.” All you have to do is tell it where to start, where to crawl and how to parse what it finds. You then also have further options for what you do with the output. This is a fairly simple case for Scrapy, so we didn’t need to dive too deeply into what it can do.

Setup

The Scrapy Documentation provides setup guides and a tutorial, and Google can point you in the direction of plenty of tutorials.  This post picks up from the point of having installed Scrapy and created a new project.

The Spider

Here’s the Python spider we wrote to collect the Hansard transcripts:

#!/usr/bin/env python
# encoding=utf-8

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import log
from sojhansard.items import SojhansardItem
import sys
### Kludge to set default encoding to utf-8
reload(sys)
sys.setdefaultencoding('utf-8')

class SoJHansardSpider(CrawlSpider):
    name = "sojhansard"
    allowed_domains = ["gov.je"]
    start_urls = [
        "http://www.statesassembly.gov.je/Pages/Hansard.aspx?page=1"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('page=*',),deny=('fontSize=*', ),unique=True)),

        #When we get to specific doc do this
        Rule(SgmlLinkExtractor(allow=('docid=*', )), callback='parse_hansard_item'),

    	)

    def parse_hansard_item(self, response):
    	self.log('parse_hansard_item called for: %s' % response.url, level=log.INFO)
        hxs = HtmlXPathSelector(response)
        item = SojhansardItem()
        item['url'] = response.url
        item['body'] = hxs.select('//*[@id="mddocument"]/div').extract()[0]
        item['title'] = hxs.select('//*[@id="ctl00_PlaceHolderMain_EditModePanelintroview_AssemblySearch1_documentControl_lblTitle"]/text()').extract()[0]
        item['PdfUrl'] = hxs.select('//*[@id="ctl00_PlaceHolderMain_EditModePanelintroview_AssemblySearch1_documentControl_panPDF"]/div/a/@href').extract()[0]

        return item

The significant parts are:

class SoJHansardSpider(CrawlSpider):
    name = "sojhansard"
    allowed_domains = ["gov.je"]
    start_urls = [
        "http://www.statesassembly.gov.je/Pages/Hansard.aspx?page=1"
    ]

Here, we create a new Spider class called ‘SoJHansardSpider’ which inherits from the CrawlSpider base class. The spider is given a name, ‘sojhansard’, which is how it will be referred to at the command prompt, and told which domains to restrict itself to and given a URL to start from.

Rules

rules = (
        Rule(SgmlLinkExtractor(allow=('page=*',),deny=('fontSize=*', ),unique=True)),

        #When we get to specific doc do this
        Rule(SgmlLinkExtractor(allow=('docid=*', )), callback='parse_hansard_item'),

    	)

Rules are how you tell the Spider how to behave. Here we define two rules, both to extract and follow links. The first Rule tells the spider to follow any link where the URL includes ‘page=’, this allows the spider to crawl through the history of Hansard transcripts. However, since we don’t want to crawl duplicated listings, we deny ‘fontsize=’ URLs (these provide access to the same transcripts, just at different font sizes for those who need them) and specify that each followed URL must be unique, i.e. not one we’ve visited before. This prevents us, for example, going back to page one after moving to page two.

The second Rule tells the crawler to follow links with ‘docid=’ in the URL. These are the ones we’re interested in, that hold a Hansard transcript for a given day. Here we specify a callback function to handle the retrieved page.

Parsing the response

Here’s the callback function again:

    def parse_hansard_item(self, response):
    	self.log('parse_hansard_item called for: %s' % response.url, level=log.INFO)
        hxs = HtmlXPathSelector(response)
        item = SojhansardItem()
        item['url'] = response.url
        item['body'] = hxs.select('//*[@id="mddocument"]/div').extract()[0]
        item['title'] = hxs.select('//*[@id="ctl00_PlaceHolderMain_EditModePanelintroview_AssemblySearch1_documentControl_lblTitle"]/text()').extract()[0]
        item['PdfUrl'] = hxs.select('//*[@id="ctl00_PlaceHolderMain_EditModePanelintroview_AssemblySearch1_documentControl_panPDF"]/div/a/@href').extract()[0]

        return item

First, we do a little logging to help in debugging, then we get down to the business of extracting the transcript. To do this, we use an HtmlXPathSelector to pull the transcript out from the HTML of the webpage and store the information we want in an ‘SojhansardItem’ object. The HtmlXPathSelector allows us to query the HTML returned in the Response object using XPaths, using the ‘select’ method. This returns an array of selected items which we can then access using extract() (to get the values) and specifying the first item in the array with [0]. In this example we know we only want the first item in each case because it only appears once on each page. Most browsers will allow you to inspect the document model to discover the XPath to the element you are interested in. In Chrome you can simply right-click and choose ‘Inspect element’, details for using Firefox can be found here. Gotcha: remove any ‘tbody’ elements from the supplied XPaths, as the HtmlXPathSelector won’t recognise them.

Result items

The SojhansardItem object we are using is defined in the items.py file which looks like this:

from scrapy.item import Item, Field

class SojhansardItem(Item):
    # define the fields for your item here like:
    # name = Field()
    url = Field()
    title = Field()
    body = Field()
    PdfUrl = Field()

class SojpropositionItem(Item):
	title = Field()
	code = Field()

Exporting results

So now we have a functioning spider which will crawl the States Assembly website and scrape all the Hansard transcripts. The question now, is, what do we do with it? The default behaviour of Scrapy is to dump the output to Stdout (aka the Console window) which isn’t too helpful.

It is possible to do processing on your scraped data before outputting it, using Item Pipelines, although this wasn’t necessary in this example. We used Feed Exports to format and output the data as JSON objects. You can specify the output location and format either in the settings.py file or as arguments at the command prompt. Interestingly, Scrapy has support (requiring the ‘boto’ library to be installed) for pushing your scraped data straight to an Amazon S3 bucket – although if you do this, make sure you also run your crawler in the cloud, otherwise your upload stream will be the bottleneck that slows everything down.

Saving your Items in JSON format will output the scraped items as serialised JSON objects, allowing us to deserialise them again later. Scraping 350+ transcripts created a 180mb file.

Running the spider

The spider is run from the command prompt. Navigate to the directory containing the python files and execute the spider using the command: scrapy crawl <spidername> <arguments>:

> scrapy crawl sojhansard --output=hansard.json --output-format=JSON

Use scrapy -help and scrapy crawl -help to access help on the command line commands, or refer to the documentation.

Also…

We also scraped the published list of Propositions, as these are the matters debated by the States. The code for this spider does contain one interesting feature:

#!/usr/bin/env python
# encoding=utf-8

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import log
from sojhansard.items import SojpropositionItem
import sys
### Kludge to set default encoding to utf-8
reload(sys)
sys.setdefaultencoding('utf-8')

class SoJPropositionsSpider(CrawlSpider):
    name = "sojpropositions"
    allowed_domains = ["gov.je"]
    start_urls = [
        "http://www.statesassembly.gov.je/Pages/Propositions.aspx"
    ]
    rules = (
    	Rule(SgmlLinkExtractor(allow=('page=*',),deny=('fontSize=*','infoid=*','SortBy=*', ),unique=True), callback='parse_proposition_page', follow=True),

    	)

    def parse_proposition_page(self, response):
    	self.log('parse_proposition_page called for: %s' % response.url, level=log.INFO)
        hxs = HtmlXPathSelector(response)

        propositions = hxs.select('//*[@id="bodycontent"]/div[2]/ul[1]/li')
        items = []

        for prop in propositions:
            item = SojpropositionItem()
            item['title'] = prop.select('table/tr[1]/td[2]/a/@title').extract()[0].strip()
            item['code'] = prop.select('table/tr[1]/td[1]/text()').extract()[0].strip()
            items.append(item)

        return items

This is all very similar to the Hansard spider, with the exception that each Propositions.aspx webpage had multiple propositions on it, therefore we had to iterate over the HTML list items to access each one – notice the ‘for’ structure looping over the result of the ‘select’ method of the HtmlXPathSelector with no extract() call.

The propositions were outputted using the CSV feed exporter, ready to be loaded by the ETL into our data mart. The file of JSON objects representing the transcripts, on the other hand, was loaded into HDInsight to be analysed.

More on that next week.

Share:
  • Print
  • Twitter
  • LinkedIn
  • Facebook
  • Google Bookmarks
  • del.icio.us
  • StumbleUpon
  • Digg

May
08

Building Virtual Machines with Microsoft Azure

With the advent and integration of cloud computing seen in modern  business environments today, the Microsoft Azure platform is one notable example promoting its usage. Using this infrastructure, businesses can rapidly deploy services without the need of hosting or maintaining physical devices within the company; this blog post will demonstrate how a virtual machine running Windows Server 2012 may be deployed within a company using the Azure platform.

For the purposes of this tutorial, a trial subscription is in use. Continued usage requires an active Microsoft Azure subscription.

Firstly, navigate to http://manage.windowsazure.com/ and log-in with your associated Microsoft Account (e.g. john.smith@business.co.uk). You will now be presented with the Microsoft Azure Dashboard as seen below.

Azure 1

Creating a new virtual machine can be accomplished by pressing the ‘New’ button available in the bottom left corner of the dashboard. A new interface will slide up with numerous selections.

Azure 2.5

Select ‘Compute’ then ‘Virtual Machine’ and select the option ‘From Gallery’; the ‘Select the Virtual Machine Operating System’ pop-up will be displayed.

Azure 3

Numerous software packages are available including Windows Server 2012, SQL Server 2012, SharePoint Server 2012 and even Ubuntu packages depending on the business’s needs. For this tutorial, Windows Server 2012 will be installed on the Virtual Machine, click the arrow to continue to the Virtual Machine Configuration page seen below.

Azure 4

Using this section of the virtual machine deployment allows a user to configure the machine’s Name, its User Account and associated password as well as the size of the machine (which can vary from a shared core with 768MB of Memory to an 8-core with 56GB of Memory). Click the arrow to continue.

Azure5

Seen above is the Virtual Machine Mode configuration page; here a user can designate the new virtual machine to be stand-alone or for it to be connected to an existing virtual machine (connecting virtual machines enables load and traffic balancing); for the purposes of this tutorial a stand-alone machine will be created. Provide a DNS Name (e.g. AzureTest001.cloudapp.net), select ‘Use an automatically generated storage account’ under Storage Account and select a region appropriate to your location. Click the arrow to continue.

The Virtual Machine options page will now appear; use the default selections and press the tick button to complete the setup. Your new virtual machine will be displayed in the dashboard and will be made available after it has completed installation.

Azure7.5

Connecting to your new virtual machine can be accomplished by selecting the machine and pressing the ‘Connect’ button displayed on the bottom area of the screen, a remote desktop protocol (RDP) file associated with your virtual machine will be downloaded.

Azure8

Opening the file will start a remote desktop session connecting you to your new Virtual Machine hosted through Microsoft Azure.

Azure11

Once the virtual machine is active, management options become available detailing the status of the current machine as well as additional configuration options and can be accessed through the Azure dashboard.

Azure 13

Share:
  • Print
  • Twitter
  • LinkedIn
  • Facebook
  • Google Bookmarks
  • del.icio.us
  • StumbleUpon
  • Digg

May
03

Big Data – A Small Example: Introduction

Having written a lot on this blog in recent times about Big Data we thought it was time to code up or shut up. So we set ourselves a challenge, one that encapsulates a number of the problems that can be faced when working outside the confines of the traditional data warehouse.

Our task: using the Hansard transcripts of the States of Jersey Assembly sittings, find out how active each States Member has been.

There’s a number of challenges here. First, the source data is outside of our control. It is publically available on the States website, but we have to take it as we find it. Second, the transcripts are free-text. They are, at best, only semi-structured – and in practice we found them to be so inconsistent in style and layout (even within the same document) that they may as well have been totally unstructured. Third, they’re written by humans and so there are typos, abbreviations, alternative titles and forms in use and other variations that contrive to make our lives difficult. Finally, with over 350 transcripts available online there’s a fair few to analyse and, frankly, I’m far too lazy to do them one by one.

So, how did we approach this?

The first stage was data collection. We used a Python web-crawling and screen-scraping framework called Scrapy to collect and save the transcripts. Next, the (rather large) output was loaded into a Microsoft HDInsight cluster where a MapReduce job (written in C#) was run to extract the data from the transcripts. This stage analysed each document to determine how many Written Questions each Member submitted, and how many words they spoke on each subject on each day. This output was then cleaned and loaded into a MS Sql Server data mart where an Analysis Services cube provides the ability to ‘slice and dice’ the data.

We’ve split this into several parts:

Part One: Data Collection – Screen-scraping

Part Two: Data Extraction – MapReduce

Part Three: Data Loading – ETL, Warehouse and Cube

Part Four: Summary & Evaluation

Part One coming up next week…

Share:
  • Print
  • Twitter
  • LinkedIn
  • Facebook
  • Google Bookmarks
  • del.icio.us
  • StumbleUpon
  • Digg

Apr
14

using geoflow to find out where neutrals should watch football

Cross post from www.cibis.org.je

GeoFlow is a new Excel 2013 add-in that has this week been released into public preview. You can download it from here http://office.microsoft.com/en-us/download-geoflow-for-excel-FX104036784.aspx

It was the big announcement at the recent #passbac

GeoFlow is a tool that allows you to easily create 3D maps and visualisations to ‘tell a story’ about your data.

Now that I have nice and clean football data I thought it might be cool to use GeoFlow to see in what region of the country a neutral should watch football matches (e.g. where they are likely to get the best sort of matches to watch).

To tackle this task the first thing I need to do is find out the locations of football grounds, so I can combine it with the football match data I already have. This site http://www.doogal.co.uk/FootballStadiums.php had a pretty definitive list. They have even helpfully added a csv version of the data, but we all know we don’t need that anymore as we can also use Data Explorer (see my previous post on using data explorer).

I quickly fire up data explorer, connect it to http://www.doogal.co.uk/FootballStadiums.php and merge it with my clean football data so I am left with something like this:

I have the results of matches along with where they took place.

So lets map it. Insert->Map

This opens the GeoFlow application

I immediately start by adding my latitude and longitude stadium data to the map and then zoom into the map.

Once your points are mapped you can start adding your measures, and change your chart types. You can choose from, columns chart, heat map, and bubble chart. From here you can create your ‘story’ by adding layers, visualisations, annotations and charts.

GeoFlow is really easy to use, and I suggest you should just ‘have a play’.

Here’s a story about where to watch football matches if you’re a neutral I made.

Rather frustratingly GeoFlow doesn’t provide the option to export a video, so I had to use CamStudio to record my screen. I think the story telling abilities of GeoFlow is good and I think it would be useful to add the ability to export to video so stories can be shared. Hopefully they’ll add this.

If you don’t want to watch the video, here are some screenshots.

 

 

Share:
  • Print
  • Twitter
  • LinkedIn
  • Facebook
  • Google Bookmarks
  • del.icio.us
  • StumbleUpon
  • Digg

Apr
12

cleaning football team names with DQS

Cross post from www.cibis.org.je

As a little side project of mine, I want to play about with a whole bunch of football data (soccer to you American’s).

After some ‘googling’ I came across this site http://www.football-data.co.uk/englandm.php which has a whole bunch of data. (All English divisions scores since 1993-1994 seasons, amongst other goodies).  All the data is in separate csv files

So, I download all of them and then go about loading them into a single database table – to do this I just used SSIS, a foreach loop and a bulk insert task.

Yay, I now have all English football league match results from 1993-1994 until approx now. So, all excited, I fire up Excel, connect to my table and start to pivot.

So the team who has scored the most home goals is… Man United. Dammit! (I’m a Liverpool fan, but I guess it is no real suprise.)

I was surprised by Tranmere being second though. But wait what is that? Peterboro? Hmm I think it should be Peterborough. Also, Man United should really be Man Utd or Manchester United too. What else is ‘weird’?

Looking further at the data I have things like:

Bristol Rvs
Dag and Red
Middlesboro
Middlesbrough
….

Looks like before I get to play with this data properly, I am going to have to clean it up *sigh*.

As part of me using this football data, I hadn’t planned on having to clean it up, but needs must. It does give me the chance to use SQL Server Data Quality Services, which is part of SQL Server 2012 and is used to cleanse ones data.

Having ‘messy’ data is a big thing that is unappreciated by a lot of clients wanting to build BI solutions. I’m an advocate for fixing this data as early as possible e.g. at source by adding business validation in source input systems. I am also pragmatic enough to understand that this is not always possible or viable to do this, so tools like SQL Server DQS are interesting to help solve this problem that I suspect every business has.

So, this is an intro into using DQS.

First of all I install DQS, by clicking the ‘Data Quality Services Installer’ app.

This goes off, does it thing and you are left with 3 new databases.

 

So now I start the DQS Client

Here you are presented with the DQS Client dashboard where you can administer the server, create or maintain knowledge bases and create or maintain projects.

You need to start with a knowledge base. The installer will create a sample geographic knowledge base, but we’re going to create a new football team one. We will use our football team knowledge bases to clean up our football teams.

So we go ahead and click New Knowledge Base

Enter the name of the knowledge base, then select None as your Create Knowledge Base From (we haven’t added any data yet), then hit Next.

Now we need to add a domain. Your knowledge base can have multiple domains. It is easiest to consider a domain as a field within your data. e.g. if you had an address Knowledge Base, your domains might be, Post Code and County.

So our domain is simply teams. So we add our domain.

 

Next we need to ‘discover’ our data. To me this is initially loading in ‘correct’ values that we want as our end game. You can discover data from SQL or from Excel. I created an Excel workbook with a list of all valid football teams.

I upload it and map the Excel football team name column to my teamName domain.

When I clicked Next I get the following error:

This seems to be a slight bug and you just need to re-select your data source.

Now your data will be loaded into your knowledge base and you can view your results.

 

Now we are ready to start cleaning our data. So we create a new Data Quality Project and select the knowledge base you just created.

Select your data you want to clean up as per above. One thing that was frustrating was that the table I have has both home teams and away teams in it. I need to clean both the columns. It seems like you cannot clean both the columns using the same domain teamName. To get around this you can either parse your data twice, or add a duplicate domain (e.g. HomeTeam domain and AwayTeam domain). Neither of which seems that appealing, but for now let’s just clean up the home team and see how we go.

Start the cleanse, and we will see the output.

So, we have 100 correct items, 8 new and 2 suggested cleaned. DQS has told me that Peterboro is probably Perterborough and Sheffield Wed is probably Sheffield Wednesday. DQS is correct, so I’ll approve those suggestions.

Let’s have a look at the new records.

So you can see here that DQS has found values that are not in my knowledge base. e.g. Accrington. This is because I have change the name to Accrington Stanley in my knowledge base. So I go through this list and correct and approve them. Later I can load these rules back into my Knowledge base domain, so I don’t have to re-do the changes in the future.

I can now see a summary of the changes and export the changed data.

Here I can export my clean data to another SQL table, CSV or Excel.

Now I want to ‘commit’ my rules I have made in the project back to my Knowledge base, so next time I wont need to change anything. Your knowledge base evolves over time, ‘learning’ new cleansing rules.

So open up your Knowledge Base and got to Domain Management.

Now go to Domain Values and Import from project.

Select your project that you just loaded.

Now this will show you the rule you have created, and click finish and publish.

If we re-open our knowledge base, and go to domain value we can see the cleaning rules in our knowledge base as per below.

Here you can see Accrington is cleaned to Accrington Stanley.

So now if we create a new project to clean the away team, we shouldn’t need to do any cleaning as we have committed the changes to the Knowledge Base.

So you can see, everything was cleaned for me as the rules exist in the Knowledge Base. Pretty cool.

 

 

So, that is all well and good, but is a little bit manual in terms of having to clean the data each time. You can however utilise DQS in SSIS. So lets build a simple package to clean the data.

Fire up SSIS and add a Data Flow task. Add a source of your ‘messy’ data table.

Now add a DQS Cleansing transformation and call it clean teams.

Choose your DQS connection and the footyTeams knowledge base.

Now map your columns to your domains

Again, like in the DQS Project you can only seems to use a domain once during a parse. So to clean the away team in a single Data Flow task we just add another DQS Cleaning transformation and map the away team.

Then we just use the destination transformation and create a cleaned version of our results and execute the package.

We are now left with our table all nice and clean with corrected team names.

 

 

 

 

Share:
  • Print
  • Twitter
  • LinkedIn
  • Facebook
  • Google Bookmarks
  • del.icio.us
  • StumbleUpon
  • Digg

Apr
11

Big Data – a Glossary of Terms

The Channel Islands Business Intelligence team here at Altius has been getting excited about the promise of Big Data for some time now.  My colleague Charles has written a series of blogs on it detailing the rise of Big Data, why visualisation is going to be ever important, and also a look at the risks involved.  It is also worth noting that the enhanced visualisation technologies have made an impact to the latest Gartner Magic Quadrant for Business Intelligence.

No doubt you’ve all heard of the term Big Data for a while now, and with all technologies come a whole load of new terms, technologies and other phrases which we’ll have to get to know.  So on the back of the recent announcement that the next version of Hortonworks’ Hadoop platform 1.1 for Windows is ready for beta testing, and that Microsoft’s recent Data Explorer plugin for Excel 2013 has a built in connection to Hadoop based data sources I thought it worth outlining some of the common terms involved. A beginners glossary if you like.  These are in an order, which I’ll explain further down.

1)      Distributed Computing

Basically using lots of machines or nodes to process your queries instead of single expensive servers.  Queries or tasks are distributed across as many nodes as are appropriate or available and executed in parallel, each focusing on part of the problem. So rather than a single server crunching through a single data set, the data set is spread over lots of machines each of which runs a small part of the query. There is a high level of redundancy as data is replicated across nodes and machine failure is handled by adding or removing nodes.  The theory is that lots of smaller less powerful machines when combined together will actually be quicker and certainly more resilient than a single expensive server.

2)      MapReduce

MapReduce is a programming framework designed work with a distributed computing (DC) environment.  If in a DC environment the data set is spread over multiple machines, or nodes, then MapReduce is the framework which allows this data to be queried.  It was designed by Google to work with the massive amounts of data their search algorithms were producing and each MapReduce query has two parts to it.  These are:

  1. The first part is a Map() function.  This uses key-value pairs to isolate data (e.g. customer = 123456, or country=Jersey, or user=John).  Now imagine we have a vast amount of web based analytics data and we’ve associated parts of it as being generated by particular users.  In our distributed computing environment this data may be spread over hundreds of nodes.  Each node then runs its own query and returns part of the answer.
  2. The second part is the Reduce() function.  When all the Maps() have returned the Reduce() part of the query brings them all together and returns our single dataset of results.

In a DC environment a Controlling node distributes the Map() part to worker nodes, then combines the results in the Reduce() part when they all return.

3)      Hadoop

The MapReduce framework as produced and used by Google was designed to work on the Google File System.  This is a distributed file system designed specifically by Google for themselves and consequently isn’t suitable for all organisations.  Hadoop  is a distributed file system which addresses this.  It was created by Doug Cutting and Michael J. Cafarella, who were working for Yahoo! at the time, as a distributed file system which implemented MapReduce and can be installed and run by virtually any organisation.  It is written in Java, runs primarily on Linux and comprises of a set of standards and protocols which allow anyone to use it.  The distribution is now controlled by the not-for-profit Apache organisation.

BTW- Hadoop was the name of Doug Cuttings son’s toy elephant.  HDInsight is Microsoft’s implementation designed to run on Windows or Azure platforms.

4)      Big Data

Now we get Big Data.  Distributed Computing, MapReduce() and Hadoop are the main technologies that enable vast data sets to be stored and more importantly queried quickly and efficiently.  Big Data itself is just a term used to describe these data sets which, thanks to be above now become usable.  Typically these data sets are identified by three characteristics known as the “three-V’s”.  These are data sets which have:

  • Volume. By definition these data sets are huge.  Vast even.  We go way outside the normally amounts of data we all normally deal with day to day. Think data sets which are tera-, peta- or exabytes in size.  (1 exabyte = 1 billion gigabytes).
  • Variety. Data is coming for a variety of sources, and often from data sources outside of your organisation.  Altius was recently working a hotel chain which was noticing that in some weeks booking revenue in certain areas was dramatically down.  When analysed the team found that one factor was that these hotels were mostly by the coast and that the weather was poor for the days /weekends in question.  That makes sense, who goes away for a short-notice weekend by the coast when they know the weather is going to be rubbish?  Now this hotel chain looks forward in it’s planning meetings with weather forecast data added to the mix.  This allows them to predict dips in sales and lower their prices to potentially attract more people.  Better to get some revenue in for a room than nothing at all.
  • Velocity.  This final characteristic is perhaps the most interesting.  Big Data sets are often characterised by the speed they are updated, often if not at real time, then very close to it.  This enables real-time monitoring and alerting to any changes in behaviour.  The F1 team McLaren are well known users of Big Data technology enabling them to monitor cars during races.

All of this has the potential to produce some vast amounts of data.

5)      Hive

Now this wouldn’t be a Business Intelligence blog without throwing some database technology in.  Hive is a data warehouse application that runs on Hadoop.  Hive provides mechanisms to create familiar SQL-style tables to build structured data warehouses and provides a query language similar to SQL.  Hive can also hold unstructured data as well (e.g. images, documents) as well as the more formal structured tables that we are all used to.  We can connect Hive to OLAP databases such as Essbase for ease of querying.

You’ll notice that in the above we move from a hardware environment (distributed computing) through to an application that we can use (Hive). In doing so it’s tempting to consider MapReduce and|or Hadoop as operating systems on which applications such as Hive can run, but really MapReduce and Hadoop are applications in their own right.  In both cases the operating system running on each node is still Linux or Windows.  MapReduce is a framework for handling queries and tasks over a distributed file system.  Hadoop is, amongst other things, a distributed file system which implements MapReduce functionality.  By layering applications like Hive on top of Hadoop then Hive gets access to the MapReduce functionality meaning that the processing and querying time of vast datasets becomes reasonable.

So where do I see all this in action?  Well, all over the place.  Facebook, Google, Twitter – all of these services all usable due to this technology.  For those of you are interested, here is the McLaren dashboard, no doubt updated for 2013 season.  Clear and concise with data coming from all of the 3200 data points collected in almost real time during a race.

McLaren dashboard

McLaren dashboard

Share:
  • Print
  • Twitter
  • LinkedIn
  • Facebook
  • Google Bookmarks
  • del.icio.us
  • StumbleUpon
  • Digg

Mar
08

A Look at Microsoft PowerView

Introduction

Following on from John’s blog two weeks ago, which touched upon data visualisation using Tableau, this week I’ll be taking a look at Microsoft’s new offering, PowerView.

PowerView is a data exploration and self-service BI tool that comes as part of Excel 2013. It’s aimed primarily towards end-users and is excellent for ad-hoc reporting and data visualisation.

PowerView’s inclusion as part of Excel 2013 is both advantageous and disadvantageous when compared to other offerings such as Tableau or Qlikview – on the one hand, Excel is probably the most widely used program in the office environment and the vast majority of users will be familiar with it; but on the other, PowerView is only available in Excel 2013, whereas both Tableau and Qlikview are capable of using earlier versions of Excel.

 

Dataset

The dataset I used – European Union Greenhouse Gas emissions between 1985 and 2010 – was obtained for free from the Microsoft Azure Marketplace.

After being downloaded from the Marketplace, it was imported into Microsoft Excel and staged in Microsoft PowerPivot. PowerView is launched directly from the data model – be it a PowerPivot table or an SSAS server – and as such there is no need to add data sources or server names. As a result, it’s extremely easy to begin – it requires all of two clicks from your initial spreadsheet.

 

Visualisations

As expected, PowerView offers all the usual suspects – line, bar, scatter and pie charts are easy to create and manipulate. I would go as far as to say that it’s easier to do this in PowerView than it is in Excel, due to the “drag and drop” functionality. However, it also suffers slightly in comparison to Tableau in that the range of visualisations is far more limited.

In effect, creating a visualisation in PowerView is exactly the same as creating and manipulating a pivot table in Excel – dragging fields into a horizontal/vertical/values area, applying filters and selecting the style of visualisation.

For a more detailed guide of how to actually create visualisations, have a look at the Microsoft Technet page:

http://technet.microsoft.com/en-us/library/hh213577.aspx

Whilst the ease of initially building visualisations is great, it’s the ability to adapt them instantaneously that makes PowerView shine. Because of its similarity to a pivot table, changing the axes, filters or values takes seconds.

The three visualisations I’ve prepared illustrate some of what I consider to be the more interesting features of PowerView – the ability to tile or create multiples; the mapping function and the ability to link multiple visualisations, allowing them to act as filters.

 

1. Tiling and multiples

Tiling and multiples (for which there are two options – horizontal and vertical) are PowerView’s method of adding an additional dimension to your analysis. In the examples below, the data being displayed is the tonnage of emissions over time in a single country, with each greenhouse gas being plotted.

As plotting all gases in all countries over time would leave the graph an incomprehensible mess, PowerView gives you two options:

-          Tiling displays the graph at full size, with the ability to scroll between (in this case) countries. Whilst this doesn’t give quite the same level of comparison as the multiples option, it does leave the visualisation larger and clearer. This option is more suitable when the “tile by/multiples of” field contains numerous different values, as is the case here.

-          Multiples does exactly as it says on the tin – multiple graphs within a single visualisation. In this case, there’s one graph per country. As you can see, this shrinks the graphs so far as to be virtually unreadable, so my suggestion would be to only use multiples where the “multiples” field contains a low number of values.


Whilst the example above doesn’t display the multiples option very well, it takes all of 30 seconds to swap the fields around and have the “pollutant name” field act as the multiples field. As I’ve only selected 4 pollutant gases via the view filter, this shows off the multiples option nicely.

 

2. Mapping


PowerView introduces the ability to create maps from your data and to link these maps to Bing maps. Unlike Microsoft’s previous offerings in SSRS, you can actually zoom and pan around the maps in PowerView, as seen below.

As my dataset contained a “country” field, it was extremely easy to set up the map – simply hit the “map” button from the list of possible visualisations, allow the data to be sent to Bing and away you go!

The connection with Bing maps means that there’s no need to include a shapefile or set any latitude/longitude co-ordinates – it will automatically work these out.

Powerview handled the geographical element of my dataset extremely well – marginally better than Tableau handled a similar dataset – but as it’s a very clearly defined geographical set (names of countries) I was interested to see how it dealt with something a little more obscure such as UK police forces – a dataset which Tableau had previously been unable to handle without a great deal of effort on my behalf. The dataset was obtained from the Guardian’s Data Journalism blog:

https://docs.google.com/spreadsheet/ccc?key=0AonYZs4MzlZbdFpWcW80ZU9LSDZvajlNS0JXX0UzVmc#gid=24


Surprisingly, PowerView actually manages to handle the data (Tableau simply came up blank) despite its lack of structure. However, it seems that the British police forces are somewhat more global than anticipated…:

Most of these errors are caused by identical town/area names – hence why most of the incorrect locations are in the USA. The others are caused by the data set’s poorly structured attempt at a hierarchy – attempting to roll up individual police forces into larger regions such as “The North East”. Unsurprisingly, PowerView doesn’t recognise these and places them almost randomly around the globe.

Adding in a hierarchy as detailed in section II) of the blog linked below and categorising the police force names as counties improves the accuracy, but there are still some errors – police forces which are not county names (such as London Metropolitan) go missing. Ultimately, obtaining 100% accuracy still involves a fair bit of work/adapting the data, so in this respect PowerView isn’t much better than its competitors.

http://www.sqljason.com/2012/07/creating-maps-in-excel-2013-using-power.html

 

3. Multiple Charts in a single Visualisation

 

PowerView allows you to place multiple charts into a single view and link them together, allowing one chart (in this case, the “Pollutant Name” chart, referred to as Chart A) to act as a filter on another (the “Emissions by Country and Sector” chart, referred to as Chart B) and thereby completely change the appearance of Chart B as shown below:

Here you can see the changes made to Chart B by selecting the CH4 in Chart A – several sectors are removed entirely and the emissions profiles for each country are changed drastically.

Of course, this also works in reverse – different sectors or countries can be selected on Chart B in order to give a breakdown of emissions by pollutant name:

 Whilst I sadly lacked the data to explore this properly, this cross-chart filtering could be extremely interesting if two or more datasets are combined – leading to powerful and flexible ad-hoc analysis.

 

Conclusion

PowerView’s capabilities are pretty much in line with those of similar offerings such as Tableau or QlikView. It’s simple and intuitive to use and its ability to integrate into PowerPoint slides, allowing visualisations to be adapted mid-presentation, is a huge positive.

In addition to this, the fact that it’s (from an end-user perspective, at least) effectively an Excel add-on is a big advantage and could lead to a much higher end-user uptake – a perpetual challenge for BI programs!

In short – from my brief play around with it, PowerView doesn’t seem to have any particularly groundbreaking USP, but it’s very easy to use, well put together and offer comparable functionality to more established BI visualisation tools. Given that this is the initial release, it has great potential to improve and expand in the future.

- Matt

 

 

 

 

Share:
  • Print
  • Twitter
  • LinkedIn
  • Facebook
  • Google Bookmarks
  • del.icio.us
  • StumbleUpon
  • Digg

Mar
01

Best Oscar winning Film? My first Data Explorer adventure…

With the official review release of the Excel plugin Data Explorer this week, I thought I would give it a little go to see what kind of thing you can do.

Microsoft Data Explorer is an Excel plugin that allows end users do their own self service data collection and data transformations. You can download it from here: http://www.microsoft.com/en-us/download/details.aspx?id=36803 you’ll need Excel 2010 or Excel 2013 though.

Saturday night in the Smillie household is film night, and with the recent Oscars in mind I thought it might be kinda cool to watch an Oscar winner. But which one?

I’m fond of IMDB and I like to watch as many of the top 250 ranked films as possible. (If you don’t know about IMDB’s top 250 then check it out here: http://www.imdb.com/chart/top , seriously, check it out)

So, what better way to play with Data Explorer than to combine the past Oscar Best Film  (from here: http://en.wikipedia.org/wiki/Academy_Award) winners with the IMDB Top 250 to see which film me and Mrs Smillie should watch this Saturday.

Wikipedia Page

Academy Award - Wikipedia, the free encyclopedia

IMDB Page

IMDB Top 250

 

So, here is how I did it.

First of all, I hooked up Data Explorer to the award winners by adding in the URL above as a Web data source. This presented this screen, where I removed the fields I didn’t want anymore by hiding them (ones in green are the ones I want to keep)

WebSource

This pulled in the Oscar Best Picture data into my Excel table. Pretty painless and very simple.

Now then, lets repeat this for the IMDB Top 250 (here I did the exact same thing,  but just used the different URL). This gave me the IMDB Top 250 films list, which looks likethis.

Now to merge them.

Lets click the ‘Merge’ button. This opens up a screen where I can select my primary data source, what I want to merge with, and  what columns (as per below).

MergeDialogue

So I choose, ‘Best Picture Winner’ from my Oscars data source and I choose ‘Title’ from my IMDB data source.

Click apply and watch it merge…

Oops, no matches. Ah, of course, the IMDB Title has the year on the end. Lets see if Data Explorer can clean that up.

Yup, it sure can. I just edit  my IMDB data source (by clicking on my IMDB Table and hitting ‘Filter & Shape’ button) and split the Title on a fixed length from the right.

EditSource

TitleSplit1

TitleSplit2

Then retry the merge

MergeDialague2

MergeSelection

So now the merge works and I am left with 35 Oscar winners that appear in the IMDB top 250 and their IMDB rank. Pretty cool huh. I guess I’ll be watching Godfather this weekend!

I appreciate all this could have been achieved by Copy and Paste, VLOOKUPs, formula,… but the beauty of doing this in Data Explorer is that if the sources change e.g. new Oscar winners or movement in the top 250 films, all I have to do is click ‘Refresh’.

All in all, I am pretty impressed with Data Explorer. It definitely a tool I’m going to use more of.

P.S. Of course I have seen the Godfather already!

 

Share:
  • Print
  • Twitter
  • LinkedIn
  • Facebook
  • Google Bookmarks
  • del.icio.us
  • StumbleUpon
  • Digg

Feb
20

Data Discovery highlighted by Gartner in latest Magic Quadrant for Business Intelligence and Analytics Platforms

On Friday last  (15th Feb) Gartner released the latest version of it’s Magic Quadrant for Business Intelligence and Analytic Platforms.  Here’s the grid for 2013.

You can read the full report here, but there were a couple of stand-out items:

  1. Gartner’s interpretation of what a Business Intelligence system should be able to do as assessment criteria is interesting.  Broadly it evaluates the vendor offerings in three key areas:  Integration, Delivery (including mobile and search) and Analysis.
  2. Of the above Analysis now includes interactive visualisations.
  3. In 2012, the main BI activities were organisations trying to mature from descriptive reporting, to diagnostic analysis (i.e. the ability to visualise and interact with data).
  4. Interactive visualisation or Data Discovery tools became a key component of “mega vendors” (e.g. Microsoft, Oracle, SAP) who have launched tools to counteract the growing maturity of vendors such as Tableau and QlikTech.
  5. Amongst the “mega vendors” Microsoft has overtaken Oracle for the first time.  Gartner states that SQL 2012 and PowerView have increased Microsoft’s capability in this area, but cautions that Microsoft components require extensive implementation.
  6. The best vision for Business Intelligence is still provided by IBM (Cognos)
  7. Some of the previous niche players, such as Tibco Spotfire , Tableau Software and QlikTech have matured and have now moved into the Leaders part of the Quadrant.  See point  4 above.

Gartner concludes that the emphasis from the mega vendors on providing Data Discovery tools will accelerate the trend away from the decentralisation of a corporate BI environment and will further empower end users in the years to come.  Personally, I agree on this point.  I think the role of a BI Consultant in a few years time will be quite different to what it is now.

For the record, with regard to data discovery tools this is what happened in 2012:

  • MicroStrategy significantly improved Visual Insight.
  • SAP launched Visual Intelligence.
  • SAS launched Visual Analytics.
  • Microsoft bolstered PowerPivot with Power View.
  • IBM launched Cognos Insight.
  • Oracle acquired Endeca.
  • Actuate acquired Quiterian.

 

Data Discovery

Some more on this data discovery then.  If the above new tools have been launched in response to the likes of Tableau and QlikTech improving can we see this at all? A limitation of the report is that it doesn’t show last year’s positions, so it’s difficult to see which vendors have an advancing product set.  We can’t see what’s changed easily.  So step forward Google images.

   
   
   
   

But again, these don’t really tell the whole story, so how can we make this better?  Well, one of the new movers into the Leader space is Tableau and I’d been looking for an opportunity to try out Tableau Public.  Now this isn’t 100% scientific, but from the images we can create a data set mapping out the scores for Gartner’s two metrics, Completeness of Vision and Ability to Execute.

For example:

Vendor Year Vision Execute
arcplan 2007 5 6
arcplan 2008 5 8
arcplan 2009 5 8
arcplan 2010 6 8
arcplan 2011 5 7
arcplan 2012 6 6
arcplan 2013 5 5
Applix 2007 7 7
Tibco Spotfire 2007 7 6
Tibco Spotfire 2008 11 7
Tibco Spotfire 2009 14 8
Tibco Spotfire 2010 10 11

From this we can feed this into Tableau and produce some animated visualisations of the change.  Very easily too I might add.

So from this we can create a visualisation showing how different organisations have been assessed by Gartner over the years, the interface allows us to interact with the data too.  Next I just need to filter out all those I’m not interested in for now, and we see:

… and there we have it, Tableau, QlikTech and Tibco Spotfire all maturing and progressing into the Leaders space in the quadrant.  They are also all showing an increased ability to execute too, so no wonder the bigger vendors have responded.

A couple more filters now:

1) So how have Gartner now assessed the new Data Discovery offerings from some of the “mega vendors”?

We see that Microsoft, Cognos / IBM, and SAP have all moved to the right.  Only Oracle appears to have stood still. Gartner explains this by stating that Oracle’s has been slower than other vendors to deliver enhanced Data Discovery tools (Endeca)  in OBIEE and it’s mobile offering is currently limited only to iOS.  NB. On the mobile aspect Microsoft are currently worse than Oracle, but Gartner seem to be rating them higher due to improvements in SQL Server 2012 and Excel 2013, plus the integration of BI within tools most organisations already have a licence for, namely SQL Server, Office and SharePoint.  IBM (Cognos) retains the top spot due to the overall quality of the applications, plus the release of Cognos Insight … (must try that out soon!)

2) Who have been solid performers that we should also keep an eye on ..

 

The point here is that being able to visualise and interact with the data allows you to understand it and let it tell it’s story to you.  This is the essence of Data Discovery and is what these tools allow you to do.

Tableau

Tableau is just one of a number of vendors highlighted here that offer this type of capability.  I could have used QlikTech or Spotfire and seen similar patterns emerge.  I chose Tableau as it was the one I personally knew the least about and wanted to try it out.  This was all done in Tableau Public (which is free), was remarkably easy to use and I should add handled the data extremely well.

Overall this has been quite an interesting couple of evenings putting this together.  The report is an interesting read and it’s been fun to try out one of the Data Discovery tools the larger organisations have brought out tools to compete against.  The “mega-vendors” will have to keep an eye on these guys for some time yet I think.

If you want to see for yourself, the Tableau Visualisation can be found here:  http://public.tableausoftware.com/views/Gartner_Altius_Visualisation/Sheet1?:embed=y

 

 

Share:
  • Print
  • Twitter
  • LinkedIn
  • Facebook
  • Google Bookmarks
  • del.icio.us
  • StumbleUpon
  • Digg

Older posts «