BlogEngine.NET Provider Migration

December 11 2014

I recently migrated this blog (which runs on BlogEngine.NET 2.8) from the XMLProvider to the DBProvider. I followed the instructions here: http://www.nyveldt.com/blog/page/blogenginenet-provider-migration, which almost worked, but I had to make a couple of changes:

BlogService.Provider.FillCategories and BlogService.Provider.LoadSettings require that you pass the current blog, so those lines changed to:

BlogService.Provider.FillCategories(Blog.CurrentInstance)

BlogService.Provider.LoadSettings(Blog.CurrentInstance)

Then, I had to manually update the GUID of the blog itself in SQL. Basically, after you run the DB create script and the migration page, you grab the ID of your old blog and update the be_Blogs table:

UPDATE [dbo].[be_Blogs]
   SET [BlogId] = 'your new id'
WHERE [BlogId] = '27604F05-86AD-47EF-9E05-950BB762570C'
GO

And, voilà, now I’m running on SQL Server!

Simple Unit Tests For HDInsight C# SDK

October 3 2014

I have been working on a project using the .NET SDK for Hadoop. I wanted to add some unit tests to the project, so I ended up writing fakes for HDInsightClient, JobSubmissionClientFactory and JobSubmissionClient. I was hoping I might be able to reuse some fakes from the SDK Git repo, but it seems their unit tests actually stand up an instance of Hadoop. I didn’t want to do that; I’m treating Hadoop as a black box, and I’m more interested in getting code coverage on all the C# code around the calls to Hadoop.

For my fake of IHDInsightClient, I only implemented CreateCluster() and DeleteCluster(), nothing fancy.
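
Mine looked something like the sketch below (signatures approximated from the SDK; the real IHDInsightClient has more members, which a compiling fake would stub out with NotImplementedException):

public class FakeHDInsightClient : IHDInsightClient
{
    // Records create/delete requests so tests can assert on them; no cluster is touched.
    public readonly List<string> CreatedClusters = new List<string>();

    public ClusterDetails CreateCluster(ClusterCreateParameters clusterCreateParameters)
    {
        CreatedClusters.Add(clusterCreateParameters.Name);
        return null; // or a stubbed ClusterDetails, if the code under test inspects it
    }

    public void DeleteCluster(string name)
    {
        CreatedClusters.Remove(name);
    }

    // ...remaining IHDInsightClient members throw NotImplementedException.
}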

I had to make my own interface and wrapper to have a factory that would make a JobSubmissionClient (which is the same thing that the SDK did for its cmdlets):

public interface IAzureHDInsightJobSubmissionClientFactory
{
    IJobSubmissionClient Create(IJobSubmissionClientCredential credentials);
}

Then, for the service itself, I implement this interface using the static JobSubmissionClientFactory:

public class AzureHDInsightJobSubmissionClientFactory : IAzureHDInsightJobSubmissionClientFactory
{
    public IJobSubmissionClient Create(IJobSubmissionClientCredential credentials)
    {
        return JobSubmissionClientFactory.Connect(credentials);
    }
}

Whenever I need a JobSubmissionClient, I get one using my wrapper.
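
A consumer then takes the factory as a constructor dependency. HiveJobRunner below is a hypothetical example, not something from the SDK; CreateHiveJob is the same SDK call shown later in this post:

public class HiveJobRunner
{
    private readonly IAzureHDInsightJobSubmissionClientFactory _clientFactory;

    public HiveJobRunner(IAzureHDInsightJobSubmissionClientFactory clientFactory)
    {
        _clientFactory = clientFactory;
    }

    public JobCreationResults RunHiveJob(IJobSubmissionClientCredential credentials, HiveJobCreateParameters job)
    {
        // The factory decides whether this is a real client or a fake.
        IJobSubmissionClient client = _clientFactory.Create(credentials);
        return client.CreateHiveJob(job);
    }
}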

In the case of my fake, I have the factory return a new fake job submission client:

public class FakeJobSubmissionClientFactory : IAzureHDInsightJobSubmissionClientFactory
{
    public Microsoft.Hadoop.Client.IJobSubmissionClient Create(Microsoft.Hadoop.Client.IJobSubmissionClientCredential credentials)
    {
        return new FakeJobSubmissionClient();
    }
}

Finally, my FakeJobSubmissionClient does need to fake the work that the job performs in Hadoop. In this case, the real job writes a file to blob storage as the result of the Hive query it runs. Since my fixture holds a static reference to a fake blob client, I was able to simulate that work in my implementation of CreateHiveJob(HiveJobCreateParameters hiveJobCreateParameters).
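
Roughly like this sketch. FakeBlobClient and WriteBlob are stand-ins for my fixture’s fake blob client (not real SDK types), and the JobCreationResults initializer assumes a settable JobId:

public class FakeJobSubmissionClient : IJobSubmissionClient
{
    public JobCreationResults CreateHiveJob(HiveJobCreateParameters hiveJobCreateParameters)
    {
        // Simulate the side effect of the real Hive job: write the output blob
        // that downstream code expects, via the fixture's fake blob client.
        FakeBlobClient.WriteBlob("temptable/output.csv", "fake,hive,results");
        return new JobCreationResults { JobId = "fake-job-1" };
    }

    // ...remaining IJobSubmissionClient members throw NotImplementedException.
}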

With all these fakes in hand, I wired up dependency injection in my UnityContainer and I was good to go; the test-side wiring is just a couple of RegisterType calls, as sketched below. And now I have much more confidence that future changes to this codebase won’t cause regressions.
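
A rough sketch of that registration (the fakes are the classes above; the production composition root registers the real implementations instead):

var container = new UnityContainer();
container.RegisterType<IHDInsightClient, FakeHDInsightClient>();
container.RegisterType<IAzureHDInsightJobSubmissionClientFactory, FakeJobSubmissionClientFactory>();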

Prototyping With Hive Using HDInsight

September 19 2014

I’ve been doing a lot of prototyping with Hive lately, and I really wanted to use an emulator to do the work. For a while, I was trying to do the prototyping using PowerShell but kept getting NotSupported exceptions.

I finally decided to simply program against Hive itself, which has turned out to work just great. I’m using Visual Studio to write my HQL as .sql files, which gives me a nice color-coded editing experience, since HQL and SQL share most of the same keywords.

Then, I keep a command line open with a Hive prompt and run my HQL scripts as follows:

hive> source c:\hdp\temp\test2.sql;

Working great! Fast and free – I like it…

Connecting To The Azure Storage Emulator From The HDInsight Emulator

September 18 2014

I was following the Getting Started instructions on using the HDInsight emulator and got stuck trying to connect to the Azure Storage Emulator:

hadoop fs -ls wasb://temp@storageemulator
ls: `wasb://temp@storageemulator': No such file or directory

Turns out that you must have a trailing slash, like this:

hadoop fs -ls wasb://temp@storageemulator/ 

Maybe that’ll help someone out there…

A Simple Box.Com C# API Wrapper

August 25 2014

I had a need to access Box.com programmatically to do a daily pull of log files that were posted to Box from a third-party service. At first I thought the Box .NET SDK would be helpful, but I quickly realized it is entirely oriented toward apps with a UI, not headless apps like this one. So, I dove into the documentation.

My first stumble was that the Box developer token expires in one hour. I was hoping I’d be able to use it as a const for my headless server application, but no luck.

So, I needed to actually generate an access_token and a refresh_token. The only way to do that is to go through their UI. Thanks to a very helpful post on Stack Overflow, I was able to generate an authorization code that can be exchanged for both an access_token and a refresh_token (which lasts for 60 days).

By persisting the refresh_token, you can write code that gets a fresh access_token programmatically. So, basically, my wrapper has a method called Bootstrap, to which you pass the code that you copy/paste out of the querystring. It also has RefreshBoxToken, which gets a new access_token when the current one has expired.
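
Usage is a one-time bootstrap, after which the helper keeps itself refreshed (the folder ID here is a made-up placeholder; the class itself appears at the end of this post):

// One time only: paste the authorization code from the OAuth redirect querystring.
BoxAPIHelper.Bootstrap("code_from_querystring");

// From then on, calls refresh the access_token themselves when it expires.
string folderContents = BoxAPIHelper.GetFolderById("12345");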

Then, there are two additional methods that do the actual work, which I called GetFolderById and GetFileAsStream. GetFolderById assumes you have the folder ID, which you can figure out from the Box UI itself. Then, with a little JSON.NET, you can parse the response and get the list of files as a JArray:

JObject jObject = JObject.Parse(folderContents);
JArray jArray = jObject["item_collection"]["entries"] as JArray;

Then, you’ve got the power to download files!
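
For example, something like this loops over the entries and pulls each file down (a sketch that assumes each Box entry carries "type" and "id" fields; GetFileAsStream is defined in the class below):

foreach (JObject entry in jArray)
{
    if ((string)entry["type"] == "file")
    {
        // Download each file in the folder via the wrapper below.
        using (Stream fileStream = BoxAPIHelper.GetFileAsStream((string)entry["id"]))
        {
            // ...write the log file to disk, push it to blob storage, etc.
        }
    }
}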

I wrapped both calls in a generic DoBoxCall method. Below is the entire class that encapsulates the logic:

using ExportConvivaLogsToHadoopWebJob.Properties;
using Newtonsoft.Json.Linq;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

namespace ExportConvivaLogsToHadoopWebJob
{

    public static class BoxAPIHelper
    {
        private const string boxApiUrl = "https://api.box.com/2.0/";
        private const string boxClientId = "YOUR_ID";
        private const string boxClientSecret = "YOUR_SECRET";
        private static readonly HttpClient _httpClient = new HttpClient();
        private static int retryCount = 0;
        private static Stream DoBoxCall(string url, HttpMethod httpMethod)
        {
            var request = new HttpRequestMessage() { RequestUri = new Uri(url), Method = httpMethod };
            // The scheme is "Bearer"; the token itself is the parameter.
            request.Headers.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", Settings.Default.boxAccessToken);
            var response = _httpClient.SendAsync(request).Result;
            if (response.StatusCode == System.Net.HttpStatusCode.Unauthorized)
            {
                // The access_token has likely expired: refresh it and retry, at most twice.
                if (retryCount < 2)
                {
                    RefreshBoxToken();
                    retryCount++;
                    return DoBoxCall(url, httpMethod);
                }
                throw new Exception("Failed to connect to Box.");
            }
            retryCount = 0;
            return response.Content.ReadAsStreamAsync().Result;
        }
        private static void RefreshBoxToken()
        {
            using (var request = new HttpRequestMessage() { RequestUri = new Uri("https://www.box.com/api/oauth2/token"), Method = HttpMethod.Post })
            {
                // Exchange the persisted refresh_token for a new token pair.
                request.Content = new FormUrlEncodedContent(new[]
                {
                    new KeyValuePair<string, string>("grant_type", "refresh_token"),
                    new KeyValuePair<string, string>("refresh_token", Settings.Default.boxRefreshToken),
                    new KeyValuePair<string, string>("client_id", boxClientId),
                    new KeyValuePair<string, string>("client_secret", boxClientSecret)
                });
                using (var response = _httpClient.SendAsync(request).Result)
                {
                    if (!response.IsSuccessStatusCode)
                    {
                        throw new Exception("Box refresh token failed. A human needs to go to a browser and generate a fresh authorization code.");
                    }
                    JObject jObject = JObject.Parse(response.Content.ReadAsStringAsync().Result);
                    Settings.Default.boxAccessToken = (string)jObject["access_token"];
                    Settings.Default.boxRefreshToken = (string)jObject["refresh_token"];
                    Settings.Default.Save();
                }
            }
        }
        public static string GetFolderById(string folderId)
        {
            string url = string.Format("{0}folders/{1}", boxApiUrl, folderId);
            using (var reader = new StreamReader(DoBoxCall(url, HttpMethod.Get)))
            {
                return reader.ReadToEnd();
            }
        }
        public static void Bootstrap(string boxAccessCode)
        {
            using (var request = new HttpRequestMessage() { RequestUri = new Uri("https://www.box.com/api/oauth2/token"), Method = HttpMethod.Post })
            {
                // One-time exchange of the authorization code for the initial token pair.
                request.Content = new FormUrlEncodedContent(new[]
                {
                    new KeyValuePair<string, string>("grant_type", "authorization_code"),
                    new KeyValuePair<string, string>("code", boxAccessCode),
                    new KeyValuePair<string, string>("client_id", boxClientId),
                    new KeyValuePair<string, string>("client_secret", boxClientSecret)
                });
                var response = _httpClient.SendAsync(request).Result;
                if (response.IsSuccessStatusCode)
                {
                    JObject jObject = JObject.Parse(response.Content.ReadAsStringAsync().Result);
                    Settings.Default.boxAccessToken = (string)jObject["access_token"];
                    Settings.Default.boxRefreshToken = (string)jObject["refresh_token"];
                    Settings.Default.Save();
                }
            }
        }
        public static Stream GetFileAsStream(string fileId)
        {
            string url = string.Format("{0}files/{1}/content", boxApiUrl, fileId);
            return DoBoxCall(url, HttpMethod.Get);
        }
    }
}

Maybe that’ll help someone out there…

Beware: Hadoop C# SDK Inserts Line Breaks, Tabs That Break Your Queries

August 22 2014

After banging my head against the wall for many hours, I finally figured out that .NET is adding escaped carriage returns, aka \r\n, when the queries are sent to HDInsight, which is causing the queries to fail. My code was loading the queries from files on disk like this:

string query = string.Empty;
using (var fs = new StreamReader("CreateTempTable.hql"))
{
    query = fs.ReadToEnd();
}

I figured this out by looking at the userArgs file in the templeton-hadoop directory to see what the jobs looked like, and they appear like this:

"ADD JAR wasb:///user/jars/csv-serde-1.1.2-0.11.0-all.jar;
\r\nDROP TABLE IF EXISTS temp;\r\nCREATE EXTERNAL TABLE temp
(viewerId string, asset string, device_os string, country string, state 
string, city string, asn string, isp string, start_time_unix_time bigint,
startup_time_ms int) \r\nROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde'
\r\nSTORED AS TEXTFILE LOCATION 'wasb:///temptable';\r\n\r\n               "

As you can see, the query is littered with escape characters, which causes the Hive query to fail. The same queries can be submitted via PowerShell without a problem.

So, basically, I removed all the line breaks from my HQL in Notepad and everything worked.
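
An alternative to hand-editing is to flatten the whitespace at load time. A small sketch (collapsing to single spaces keeps the HQL valid, as long as your queries don’t rely on embedded newlines or tabs inside string literals):

string query;
using (var fs = new StreamReader("CreateTempTable.hql"))
{
    // Flatten CR/LF and tabs to spaces before handing the query to the SDK.
    query = fs.ReadToEnd().Replace("\r\n", " ").Replace('\n', ' ').Replace('\t', ' ');
}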

Submitting HDInsight Jobs From An Azure Webjob or WorkerRole Using the C# Hadoop SDK

August 22 2014

All the samples for submitting jobs programmatically to HDInsight assume that you are doing so from a desktop workstation that has been set up with a management certificate. The code gets your cert out of the cert store and creates a JobSubmissionCertificateCredential, like so:

// Get the certificate object from certificate store using the friendly name to identify it
X509Store store = new X509Store();
store.Open(OpenFlags.ReadOnly);
X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>().First(item => item.FriendlyName == certFriendlyName);
JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(new Guid(subscriptionID), cert, clusterName);
// Submit the Hive job
var jobClient = JobSubmissionClientFactory.Connect(creds);
JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJobDefinition);

This is all well and good, but what if you need to submit jobs programmatically from, say, an Azure WebJob or a worker role?

The way I solved this was to generate my own management cert with a private key and deploy the cert alongside the .exe in the bin directory. Here are the commands to generate the cert (tip of the hat to David Hardin’s post):

makecert -r -pe -a sha1 -n "CN=Windows Azure Authentication Certificate" -ss my -len 2048 -sp "Microsoft Enhanced RSA and AES Cryptographic Provider" -sy 24 -sv ManagementApiCert.pvk ManagementApiCert.cer
pvk2pfx -pvk ManagementApiCert.pvk -spc ManagementApiCert.cer -pfx ManagementApiCert.pfx -po password

Then, after uploading the .cer to the Azure management certificate store (see here for how to do that) and adding the .pfx to your project (be sure to set it to copy to the output directory), you can use the following code to create a JobSubmissionCertificateCredential:

var cert = new X509Certificate2("ManagementApiCert.pfx","your_password",X509KeyStorageFlags.MachineKeySet);
JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(new Guid(subscriptionID), cert, clusterName);

Tip of the hat to Tyler Doerksen, whose post led me to setting the MachineKeySet flag.

And, there you go: the ability to submit Hadoop jobs programmatically from a WebJob or WorkerRole.

JAR Files Referenced Via WASB In HDInsight Hive Queries Can’t Be At The Root Of The Container

August 13 2014

Just discovered that if you want to add a JAR file to an HQL statement, the JAR file can’t be at the root of your container. It has to be in a virtual directory. So, for example, this code will not work:

ADD JAR wasb:///csv-serde-1.1.2-0.11.0-all.jar;

But, this code will:

ADD JAR wasb:///user/hdp/share/lib/hive/csv-serde-1.1.2-0.11.0-all.jar;

And, annoyingly, the blob storage browser in Visual Studio doesn’t allow you to create directories, so you’ll need to download something like ClumsyLeaf CloudXplorer or a similar tool.

HDInsight Hadoop Hive Job Decompresses CSV GZIP Files By Default

August 8 2014

I’ve been working with Hadoop (2.4.0) and Hive (0.13.0) on HDInsight (3.1), and it decompresses gzipped CSV files by default. Nice! So, loading data with a Hive query in PowerShell:

$response = Invoke-Hive -Query @"
LOAD DATA INPATH 'wasb://$container@$storageAccountName.blob.core.windows.net/file.csv.gz'
INTO TABLE logs;
"@

No additional work or arguments needed. I thought I would have to do something like what’s specified in this post, setting io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec, but apparently not.

UPDATE: Just found this link: https://cwiki.apache.org/confluence/display/Hive/CompressedStorage which goes into keeping data compressed in Hive and recommends using a SequenceFile.

“Why We Need The Indie Web” by Tantek Celik

June 25 2014

On the open web…