Simple Unit Tests For HDInsight C# SDK

October 3 2014

I have been working on a project using the .NET SDK for Hadoop.  I wanted to add some unit tests to the project, so I ended up writing some fakes for HDInsightClient, JobSubmissionClientFactory and JobSubmissionClient. I was hoping I might be able to reuse some fakes from the SDK git repo, but it seems like their unit tests actually stand up an instance of Hadoop. I didn’t want to actually stand up an instance; I’m treating Hadoop like a black box and I’m more interested in getting code coverage on all the C# code around the calls to Hadoop.

For my fake of IHDInsightClient, I only implemented CreateCluster() and DeleteCluster(), nothing fancy.
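A minimal version of that fake can be sketched like this. Note that the interface below is a simplified stand-in for illustration: the real IHDInsightClient takes ClusterCreateParameters and exposes many more members.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for the SDK interface: the real IHDInsightClient
// takes ClusterCreateParameters and has many more members.
public interface IClusterClient
{
    void CreateCluster(string clusterName);
    void DeleteCluster(string clusterName);
}

public class FakeHDInsightClient : IClusterClient
{
    // Record cluster names so tests can assert on what was created or deleted.
    public HashSet<string> Clusters { get; } = new HashSet<string>();

    public void CreateCluster(string clusterName)
    {
        Clusters.Add(clusterName);
    }

    public void DeleteCluster(string clusterName)
    {
        Clusters.Remove(clusterName);
    }
}
```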

I had to make my own interface and wrapper to have a factory that would make a JobSubmissionClient (which is the same thing that the SDK did for its cmdlets):

public interface IAzureHDInsightJobSubmissionClientFactory
{
    IJobSubmissionClient Create(IJobSubmissionClientCredential credentials);
}

Then, for the service itself, I implement this interface using the static JobSubmissionClientFactory:

public class AzureHDInsightJobSubmissionClientFactory : IAzureHDInsightJobSubmissionClientFactory
{
    public IJobSubmissionClient Create(IJobSubmissionClientCredential credentials)
    {
        return JobSubmissionClientFactory.Connect(credentials);
    }
}

Whenever I need a JobSubmissionClient, I get one using my wrapper.

In the case of my fake, I have the factory return a new fake job submission client:

public class FakeJobSubmissionClientFactory : IAzureHDInsightJobSubmissionClientFactory
{
    public Microsoft.Hadoop.Client.IJobSubmissionClient Create(Microsoft.Hadoop.Client.IJobSubmissionClientCredential credentials)
    {
        return new FakeJobSubmissionClient();
    }
}

Finally, for my FakeJobSubmissionClient, I do need to fake the work that the job does in Hadoop. In this case, it writes a file to blob storage as a result of the Hive query it runs. So, since my fixture has a static reference to a fake blobClient, I was able to fake the work that Hadoop would do in my implementation of CreateHiveJob(HiveJobCreateParameters hiveJobCreateParameters).
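Sketched out, the fake client's Hive method just performs the write that Hadoop would have done. Everything here (FakeBlobClient, the output path, the simplified string-based signature) is a stand-in for the fixture's real fakes; the actual SDK method takes HiveJobCreateParameters and returns JobCreationResults.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the fixture's fake blob client.
public class FakeBlobClient
{
    public Dictionary<string, string> Blobs { get; } = new Dictionary<string, string>();

    public void UploadText(string path, string contents)
    {
        Blobs[path] = contents;
    }
}

public class FakeJobSubmissionClient
{
    // The fixture holds a static reference so the test and the fake share one client.
    public static FakeBlobClient BlobClient = new FakeBlobClient();

    // Simplified signature for illustration: the real method takes
    // HiveJobCreateParameters and returns JobCreationResults.
    public string CreateHiveJob(string hiveQuery)
    {
        // Fake the side effect of the Hive query: write the result file
        // the job would have produced in blob storage.
        BlobClient.UploadText("temptable/results.csv", "fake,hive,output");
        return "job-0001";
    }
}
```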

With all these fakes, I then wired up dependency injection in my UnityContainer and I was good to go. And now I have much more confidence that future changes to this codebase won’t cause regressions.
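The Unity wiring itself is the usual register-and-resolve pattern. A sketch, assuming the test fixture swaps in the fake factory (production registration would map the interface to AzureHDInsightJobSubmissionClientFactory instead):

```csharp
using Microsoft.Practices.Unity;

var container = new UnityContainer();

// Test wiring: resolve the factory interface to the fake.
container.RegisterType<IAzureHDInsightJobSubmissionClientFactory, FakeJobSubmissionClientFactory>();
// Production wiring would be:
// container.RegisterType<IAzureHDInsightJobSubmissionClientFactory, AzureHDInsightJobSubmissionClientFactory>();

// Anything the container builds now gets the fake factory injected.
var factory = container.Resolve<IAzureHDInsightJobSubmissionClientFactory>();
```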

Beware: Hadoop C# SDK Inserts Line Breaks, Tabs That Break Your Queries

August 22 2014

After banging my head against the wall for many hours, I finally figured out that .NET was adding escaped carriage returns and line feeds (\r\n) to the queries when they were sent to HDInsight, which caused the queries to fail. My code was loading the queries from files on disk like this:

string query = string.Empty;
using (var fs = new StreamReader("CreateTempTable.hql"))
{
    query = fs.ReadToEnd();
}

I figured this out by looking at the userArgs file in the templeton-hadoop directory to see what the submitted jobs looked like; they appear like this:

"ADD JAR wasb:///user/jars/csv-serde-1.1.2-0.11.0-all.jar;
(viewerId string, asset string, device_os string, country string, state 
string, city string, asn string, isp string, start_time_unix_time bigint,
startup_time_ms int) \r\nROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde'
\r\nSTORED AS TEXTFILE LOCATION 'wasb:///temptable';\r\n\r\n               "

As you can see, the query is littered with escape characters, which cause the Hive query to fail. These same queries can be submitted via PowerShell with no problem.

So, basically, I removed all the line breaks from my HQL in Notepad and everything worked.
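A less manual fix than editing in Notepad is to normalize the whitespace after reading the file, so the .hql on disk can keep its line breaks. A sketch along those lines (LoadQuery is my own helper name, not an SDK call):

```csharp
using System.IO;
using System.Text.RegularExpressions;

static string LoadQuery(string path)
{
    // Collapse runs of whitespace (including \r\n and tabs) into single
    // spaces so the query reaches Templeton as one clean line.
    string raw = File.ReadAllText(path);
    return Regex.Replace(raw, @"\s+", " ").Trim();
}
```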

Submitting HDInsight Jobs From An Azure Webjob or WorkerRole Using the C# Hadoop SDK

August 22 2014

All the samples for submitting jobs programmatically to HDInsight assume that you are doing so from a desktop working station that has been set up with a management certificate. The code gets your cert out of the cert store and creates a JobSubmissionCertificateCredential as such:

// Get the certificate object from the certificate store, using the friendly name to identify it
X509Store store = new X509Store();
store.Open(OpenFlags.ReadOnly); // the store must be opened before its Certificates collection can be read
X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>().First(item => item.FriendlyName == certFriendlyName);
JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(new Guid(subscriptionID), cert, clusterName);
// Submit the Hive job
var jobClient = JobSubmissionClientFactory.Connect(creds);
JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJobDefinition);

This is all well and good, but what if you need to submit jobs programmatically from, say, an Azure WebJob or a worker role?

The way I solved this was to generate my own management cert with a private key and upload it alongside the executable, placing the cert in the bin next to the .exe. Here's the code to generate a cert (tip of the hat to David Hardin's post):

makecert -r -pe -a sha1 -n "CN=Windows Azure Authentication Certificate" -ss my -len 2048 -sp "Microsoft Enhanced RSA and AES Cryptographic Provider" -sy 24 -sv ManagementApiCert.pvk ManagementApiCert.cer
pvk2pfx -pvk ManagementApiCert.pvk -spc ManagementApiCert.cer -pfx ManagementApiCert.pfx -po password

Then, after uploading the .cer to the Azure Management Certificate store (see here for how to do that) and adding the .pfx to your project (be sure to set it to copy to the output directory so it lands in the bin), you can use the following code to create a JobSubmissionCertificateCredential:

var cert = new X509Certificate2("ManagementApiCert.pfx","your_password",X509KeyStorageFlags.MachineKeySet);
JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(new Guid(subscriptionID), cert, clusterName);

Tip of the hat to Tyler Doerksen, whose post led me to setting the MachineKeySet flag.

And there you go: the ability to submit Hadoop jobs programmatically from a WebJob or WorkerRole.

HDInsight Hadoop Hive Job Decompresses CSV GZIP Files By Default

August 8 2014

I've been working with Hadoop (2.4.0) and Hive (0.13.0) on HDInsight (3.1), and it decompresses GZIP files into CSV by default. Nice! So, loading data with a Hive query in PowerShell:

$response = Invoke-Hive -Query @"
LOAD DATA INPATH 'wasb://$container@$' 
"@
No additional work or arguments needed. I thought I had to do something like what's specified in this post, but apparently not.


UPDATE: Just found a link that goes into keeping compressed data in Hive; it recommends creating a SequenceFile.
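The gist of that recommendation, as a HiveQL sketch (the table and column names here are made up for illustration):

```sql
-- Keep the data compressed inside Hive by rewriting it as a SequenceFile.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;

CREATE TABLE viewdata_seq (viewerId STRING, asset STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE viewdata_seq
SELECT viewerId, asset FROM viewdata;
```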