This post forms part of our “Big Data – A Small Example” series and describes how we used Microsoft HDInsight to extract data from the States of Jersey Hansard transcripts we collected previously.
Setup
As Buck Woody explains here, you have several options for using HDInsight: as an Azure Service; as an HDInsight Server; or installing it on an Azure virtual machine, which is the approach we took. M’colleague Jonathan Holmes recently blogged about how to set up an Azure VM.
Once you have your VM up and running, it’s easy to download the Web Platform Installer and install the HDInsight developer preview from there. However, there is one gotcha: while it claims to download and install all dependencies, make sure you install SQL Server first, as the web UI won’t start without one of the DLLs that comes with it!
Loading data
So that the MapReduce job we’re going to write can access our scraped transcripts we need to load them into the HDFS file system. From the Start menu we open the ‘Hadoop Command Line’ where we can execute Hadoop commands of the form hadoop <command> <arguments>. Executing hadoop without any commands will list the commands available, and executing hadoop <command> without any arguments will list the arguments for the specified command.
First we create a project folder:
> hadoop fs -mkdir myProjectFolder
This will create a folder with the absolute path of /user/<username>/myProjectFolder. The current working directory is /user/<username> so you can check this folder has been created with hadoop fs -ls . (that’s a single period to indicate the current folder.)
We then use the -put command to load the transcripts file into that folder:
> hadoop fs -put hansard.json myProjectFolder
MapReduce jobs
The standard way to implement a MapReduce job in Hadoop is in Java. The Hadoop command line has a jar command for launching Java MapReduce jobs, which are simply Java programs that implement the Mapper and Reducer interfaces in the org.apache.hadoop.mapred package. However, if, like me, you’re more used to programming in C#, help is at hand! Over at CodePlex is the Microsoft .NET Hadoop SDK, which can be installed via the NuGet package manager.
The SDK provides a set of classes for you to inherit from – a Mapper, Reducer, Combiner and Job class. Create a Console application with the SDK packages installed and you can easily create a MapReduce job to run on HDInsight. The functionality supplied by the SDK will create the necessary Java to execute your code as a MapReduce job – all you have to do is run the .exe created when you build the solution, and your job runs! (Gotcha: make sure there is no whitespace in the folder path to your .exe – otherwise it won’t be able to find the SDK DLLs, which causes dependency errors.)
Mapper
The data input to the job will be split up and passed line by line into Mappers. The number of Mappers operating will depend on your setup, but in this HDInsight Developer’s Preview you get three. The basic structure of the Mapper code looks like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hadoop.MapReduce;

namespace MyMapReduceJob.Map
{
    public class MyMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            // Code to analyse the inputLine goes here.
            context.EmitKeyValue("key", "value");
        }
    }
}
What the keys and values are is totally up to you.
In our case the JSON objects representing the scraped transcripts were stored one per line, meaning that each Mapper was handed a single transcript at every execution. Since our primary aim was to measure the contribution of each States Member, we had to traverse the transcript identifying who was speaking at each point. Fortunately, every paragraph and header comes wrapped in its own HTML tags, so we were able to use these to put some structure around the text. I shall spare you the code, but the algorithm looked like this:
- Deserialise the JSON object.
- Extract the transcript and clean the HTML (which was not well-formed, so some tags had to be stripped out.)
- Load the cleaned transcript into an XDocument, allowing us to traverse the HTML.
- Iterate over the XDocument loading elements that contained text into ‘Paragraph’ objects via a ‘TranscriptManager’ object which handled things like tracking section numbers. The ‘Paragraph’ object also performed functions such as stripping out the remaining HTML to provide clear text.
- Iterate over the paragraphs in the TranscriptManager, using regular expressions to identify new speakers and proposition codes and also checking for Oral or Written Question headers.
- Finally, for each paragraph two key/value pairs were output: (1) a key in the form “count|<date>|<subject>|<states member>” and the transcript text, (2) a key in the form “text|<date>|<subject>|<states member>” and the transcript text annotated with its location in the transcripts (this is used later to show the text making up the numbers.)
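To make the last few steps a little more concrete, here is a minimal sketch of pulling text out of the cleaned transcript and building the two composite keys. It is not the code from the project: the class and method names, the "leaf elements only" rule and the "location: text" annotation format are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class TranscriptSketch
{
    // Returns the trimmed text of every leaf element that contains non-blank text.
    // Assumption: the HTML has already been cleaned enough to parse as XML.
    // Elements with child elements (mixed content) are skipped in this sketch.
    public static List<string> ExtractParagraphs(string cleanedHtml)
    {
        XDocument doc = XDocument.Parse(cleanedHtml);
        return doc.Descendants()
                  .Where(e => !e.HasElements && !string.IsNullOrWhiteSpace(e.Value))
                  .Select(e => e.Value.Trim())
                  .ToList();
    }

    // Builds a key of the form "<kind>|<date>|<subject>|<states member>".
    public static string BuildKey(string kind, string date, string subject, string member)
    {
        return string.Join("|", kind, date, subject, member);
    }

    // Produces the two key/value pairs for one paragraph: a "count" pair carrying
    // the text, and a "text" pair carrying the text annotated with its location.
    public static KeyValuePair<string, string>[] PairsFor(
        string date, string subject, string member, string text, string location)
    {
        return new[]
        {
            new KeyValuePair<string, string>(BuildKey("count", date, subject, member), text),
            new KeyValuePair<string, string>(BuildKey("text", date, subject, member), location + ": " + text)
        };
    }
}
```

In a real Mapper each pair would be passed to context.EmitKeyValue. Note that the pipe-delimited key only works if dates, subjects and member names never contain a pipe themselves.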
Reducer
The job of the Reducer is to perform the work on the data provided by the Mapper. The Reducer is handed a key and a list of values emitted for that key by the Mappers. In our case we implemented the Reducer thus:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hadoop.MapReduce;
using System.Diagnostics;

namespace Map.Reduce
{
    public class StatesMemberReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            Trace.WriteLine(DateTime.Now + " - Reduce called.");

            // Keys look like "<count|text>|<date>|<subject>|<states member>".
            string[] keyParts = key.Split('|');
            if (keyParts[0].Equals("count"))
            {
                ReduceCount(key, values, context, keyParts);
            }
            else if (keyParts[0].Equals("text"))
            {
                ReduceText(key, values, context);
            }
        }

        private void ReduceCount(string key, IEnumerable<string> values, ReducerCombinerContext context, string[] keyParts)
        {
            // Written Questions are counted per question; everything else per word.
            if (keyParts[2].Equals("Written Questions"))
            {
                context.EmitKeyValue(key, values.Count().ToString());
            }
            else
            {
                context.EmitKeyValue(key, GetWordCount(values));
            }
        }

        private string GetWordCount(IEnumerable<string> values)
        {
            int wordCount = 0;
            foreach (string paragraph in values)
            {
                string[] words = paragraph.Split(' ');
                wordCount += words.Length;
            }
            return wordCount.ToString();
        }

        private void ReduceText(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            // Concatenate all paragraphs for the key, separated by '|'.
            StringBuilder builder = new StringBuilder();
            bool first = true;
            foreach (string paragraph in values)
            {
                if (!first)
                {
                    builder.Append("|");
                }
                else
                {
                    first = false;
                }
                builder.Append(paragraph);
            }
            context.EmitKeyValue(key, builder.ToString());
        }
    }
}
(I’m including the actual Reducer rather than an algorithm and template code as the Reducer is much simpler than the Mapper.)
Here, depending on the first part of the key we either iterate over the values counting the number of words/questions (in the case of ‘count’ values) or append all the transcript text together (in the case of ‘text’ values.)
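One thing to be aware of in GetWordCount: Split(' ') returns an empty entry for every extra space, so a paragraph containing doubled, leading or trailing spaces is slightly over-counted. The sketch below shows one possible way round that. It is an illustrative variant, not the code that produced the figures in this series, and the class and method names are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical alternative to GetWordCount, for illustration only.
public static class WordCountSketch
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Counts words across all paragraphs, ignoring the empty entries that
    // repeated, leading or trailing whitespace would otherwise produce.
    public static int CountWords(IEnumerable<string> paragraphs)
    {
        return paragraphs.Sum(
            p => p.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries).Length);
    }
}
```

For example, "one  two" (with two spaces) splits into three entries with Split(' ') but only two with RemoveEmptyEntries.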
Job
The final piece is the Job class, which ties everything together. Ours looks like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hadoop.MapReduce;
using Map.Map;
using Map.Reduce;
using System.Diagnostics;

namespace Map.Job
{
    public class StatesMemberCounterJob : HadoopJob<StatesMemberMapper, StatesMemberReducer, StatesMemberCombiner>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            Trace.WriteLine(DateTime.Now + " - Configure called.");
            var config = new HadoopJobConfiguration();
            config.InputPath = "hansard";
            config.OutputFolder = "hansard/output";
            return config;
        }
    }
}
The key parts here are the generic type arguments on the base class, which define the Mapper, Reducer and Combiner (more on that in a moment), and the configuration returned from the Configure method. The base class comes in several generic overloads, so you must at minimum provide a Mapper, but can also provide a Reducer and a Combiner. The Combiner is used to collate the output from Reducers that have each reduced a subset of the values for a key. In many cases the Combiner can be another instance of the Reducer class; however, in this case we had to create our own to ensure the counting of words/questions was handled properly (i.e. the values were added together rather than counted.)
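The difference between adding and counting is easiest to see with numbers. If two partial results of "120" and "45" arrive for the same ‘count’ key, reusing the Reducer logic would count two values (or two words) rather than report 165. The StatesMemberCombiner itself is not shown in this post, so the following is only a sketch of the idea with made-up names, written as plain functions rather than against the SDK base class.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical sketch of the combining logic, for illustration only.
public static class CombineSketch
{
    // Adds up partial counts that arrive as strings, e.g. "120" and "45" give "165".
    public static string SumCounts(IEnumerable<string> partialCounts)
    {
        long total = 0;
        foreach (string count in partialCounts)
        {
            total += long.Parse(count, CultureInfo.InvariantCulture);
        }
        return total.ToString(CultureInfo.InvariantCulture);
    }

    // 'count' keys have their values added together;
    // any other key (i.e. 'text') has its values joined with '|'.
    public static string Combine(string key, IEnumerable<string> values)
    {
        return key.StartsWith("count|", StringComparison.Ordinal)
            ? SumCounts(values)
            : string.Join("|", values);
    }
}
```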
Execution
Once the program is compiled and (if necessary) moved to the machine your HDInsight cluster is installed on, you only need to execute the .exe for it to load itself into HDInsight and execute as a MapReduce job. A console window will open describing the dependencies being loaded.
However, there are some more gotchas:
- HDInsight is built using .NET 4, so make sure that is the operating version of .NET (installing SQL Server can involve installing .NET 3.5 components, causing that to become the registered version of .NET.)
- It is very likely your HDInsight cluster will be on a 64-bit server. Ensure your C# program is set to target this platform (Project Properties -> Build) otherwise you will get an obscure ‘Bad Image’ exception.
- Your Mappers and Reducers do not run in the same application context as the .exe that launches them, so logging and debugging become… challenging.
- Use Console.Read() at the end of your Program.cs Main method so that the Console window does not close before you’ve read any success or error messages.
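A cheap way to catch the first two gotchas early is to print the runtime version and bitness when the launcher starts. The helper below is a sketch with a made-up name. It only reports on the launching process, so it will not tell you anything about the environment the Mappers and Reducers run in.

```csharp
using System;

// Hypothetical start-up check, for illustration only.
public static class RuntimeCheck
{
    // Describes the CLR version and whether this process is running as 64-bit.
    public static string Describe()
    {
        return string.Format(
            "CLR {0}, 64-bit process: {1}",
            Environment.Version,
            Environment.Is64BitProcess);
    }
}
```

Calling Console.WriteLine(RuntimeCheck.Describe()) as the first line of Main, and keeping the Console.Read() as the last, puts this information in the console window alongside any success or error messages.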
Debugging
As the C# code creates a Hadoop job that executes Java to run the C# code, the execution of your Mappers and Reducers is outside the context of the launch application. If your job fails you can look in the job logs, but don’t expect too much. If you have access to the server HDInsight is running on, you can navigate to http://localhost:50030/ to access the Hadoop Administration page. From here you can access the logs, but the stack traces available are for the Java, so make your errors descriptive!
An alternative, however, is to home-brew your own logging and write directly to the filesystem. This provides a lot more context than what you’ll get out of the Hadoop logs.
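As a rough idea of what such home-brew logging might look like, here is a minimal sketch. The class name, the file name and the choice of the temp folder are all assumptions. It writes one file per process, on the assumption that Mappers and Reducers may run as separate processes and that two processes appending to the same file could collide.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical file logger, for illustration only.
public static class FileLog
{
    private static readonly object Gate = new object();

    // Assumed location: one log file per process in the temp folder.
    // Pick a folder that the account running the Hadoop tasks can write to.
    public static readonly string LogPath = Path.Combine(
        Path.GetTempPath(),
        "mapreduce-debug-" + Process.GetCurrentProcess().Id + ".log");

    // Appends a timestamped line to the log file.
    public static void Write(string message)
    {
        string line = string.Format(
            "{0:o} {1}{2}", DateTime.UtcNow, message, Environment.NewLine);

        // The lock guards against concurrent writes from threads in this process.
        lock (Gate)
        {
            File.AppendAllText(LogPath, line);
        }
    }
}
```

A call such as FileLog.Write("Reduce called for key " + key) can then sit inside Map or Reduce, with the inputs that caused a failure written alongside it.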
Summary
Being able to code a MapReduce job in C# definitely speeds up development and makes for a good entry point to coding jobs in HDInsight. However, it does come with some challenges, specifically around debugging. I also noticed that the performance suffered once I started making use of other .NET libraries, so there is definitely an overhead and it may be worth considering coding-your-own. Otherwise it’s pretty easy to set up and get going. I like. Even better, as this was on an Azure VM I was able to beef up the power of the machine when necessary (outputting huge chunks of transcript caused quite a performance hit!)
Next week: pulling the output out of HDInsight and loading it into a data warehouse.