Prototyping With Hive Using HDInsight

September 19 2014

I’ve been doing a lot of prototyping with Hive lately and I really wanted to use an emulator to do the work. For a while, I was trying to do the prototyping using PowerShell but kept getting NotSupported exceptions.

I finally decided to simply program against Hive itself, which has turned out to work great. I’m using Visual Studio to write my HQL as .sql files, which gives me a nice color-coded editing experience, since HQL and SQL share most of the same keywords.

Then, I have a cmd line open with a Hive prompt and I run my HQL queries as follows:

hive> source c:\hdp\temp\test2.sql;

Working great! Fast and free – I like it…

Connecting To The Azure Storage Emulator From The HDInsight Emulator

September 18 2014

I was following the Getting Started instructions on using the HDInsight emulator and got stuck trying to connect to the Azure Storage Emulator:

 

hadoop fs -ls wasb://temp@storageemulator 
ls: `wasb://temp@storageemulator': No such file or directory

Turns out that you must have a trailing slash, like this:

hadoop fs -ls wasb://temp@storageemulator/ 

Maybe that’ll help someone out there…

A Simple Box.Com C# API Wrapper

August 25 2014

I had a need to access Box.com programmatically to do a daily pull of log files that were posted to Box from a third party service. At first I thought the Box .NET SDK would be helpful, but I quickly realized it is entirely oriented toward apps with a UI, not headless apps like this one. So, I dove into the documentation.

My first stumble was that the Box developer token expires in one hour. I was hoping I’d be able to use it as a const for my headless server application, but no luck.

So, I needed to actually generate an access_token and a refresh_token. The only way to do that is to go through their UI. Thanks to a very helpful post on StackOverflow, I was able to generate an authorization code that could be used to generate both an access_token and a refresh_token (which lasts for 60 days).

By persisting the refresh_token, you can write code that gets a fresh access_token programmatically. So, basically, my wrapper has a method called Bootstrap, to which you pass the code you get by copy/pasting it out of the querystring. And then it has RefreshBoxToken, which gets a new access_token when the current one has expired.

Then, there are two additional wrappers that actually do work, which I called GetFolderById and GetFileAsStream. GetFolderById assumes you have the folder ID, which you can figure out from the Box UI itself. Then, with a little JSON.NET, you can parse the response and get the list of files as a JArray:

JObject jObject = JObject.Parse(folderContents);
JArray jArray = jObject["item_collection"]["entries"] as JArray;

Then, you’ve got the power to download files!

I wrapped both calls in a generic DoBoxCall method. Below is the entire class that encapsulates the logic:

using ExportConvivaLogsToHadoopWebJob.Properties;
using Newtonsoft.Json.Linq;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

namespace ExportConvivaLogsToHadoopWebJob
{

    public static class BoxAPIHelper
    {
        private const string boxApiUrl = "https://api.box.com/2.0/";
        private const string boxClientId = "YOUR_ID";
        private const string boxClientSecret = "YOUR_SECRET";
        private static readonly HttpClient _httpClient = new HttpClient();
        private static int retryCount = 0;
        // Makes a Box API call; on a 401, refreshes the access token and retries.
        private static Stream DoBoxCall(string url, HttpMethod httpMethod)
        {
            var request = new HttpRequestMessage() { RequestUri = new Uri(url), Method = httpMethod };
            request.Headers.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", Settings.Default.boxAccessToken);
            var response = _httpClient.SendAsync(request).Result;
            if (response.StatusCode == System.Net.HttpStatusCode.Unauthorized)
            {
                if (retryCount < 2)
                {
                    RefreshBoxToken();
                    retryCount++;
                    return DoBoxCall(url, httpMethod);
                }
                throw new Exception("Failed to connect to Box.");
            }
            retryCount = 0;
            return response.Content.ReadAsStreamAsync().Result;
        }
        // Exchanges the persisted refresh_token for a new access_token and saves both new tokens.
        private static void RefreshBoxToken()
        {
            using (var request = new HttpRequestMessage() { RequestUri = new Uri("https://www.box.com/api/oauth2/token"), Method = HttpMethod.Post })
            {
                HttpContent content = new FormUrlEncodedContent(new[]
                {
                    new KeyValuePair<string, string>("grant_type", "refresh_token"),
                    new KeyValuePair<string, string>("refresh_token", Settings.Default.boxRefreshToken),
                    new KeyValuePair<string, string>("client_id", boxClientId),
                    new KeyValuePair<string, string>("client_secret", boxClientSecret)
                });
                request.Content = content;
                using (var response = _httpClient.SendAsync(request).Result)
                {
                    if (!response.IsSuccessStatusCode)
                    {
                        throw new Exception("Box refresh token failed. A human needs to go to a browser and generate a fresh authorization code.");
                    }
                    JObject jObject = JObject.Parse(response.Content.ReadAsStringAsync().Result);
                    Settings.Default.boxAccessToken = (string)jObject["access_token"];
                    Settings.Default.boxRefreshToken = (string)jObject["refresh_token"];
                    Settings.Default.Save();
                }
            }
        }
        // Returns the JSON describing a folder; get the folder ID from the Box UI.
        public static string GetFolderById(string folderId)
        {
            string url = string.Format("{0}folders/{1}", boxApiUrl, folderId);
            Stream stream = DoBoxCall(url, HttpMethod.Get);
            using (var reader = new StreamReader(stream))
            {
                return reader.ReadToEnd();
            }
        }
        // One-time setup: exchanges the authorization code (copy/pasted from the
        // OAuth redirect querystring) for the initial access_token and refresh_token.
        public static void Bootstrap(string boxAccessCode)
        {
            using (var request = new HttpRequestMessage() { RequestUri = new Uri("https://www.box.com/api/oauth2/token"), Method = HttpMethod.Post })
            {
                HttpContent content = new FormUrlEncodedContent(new[]
                {
                    new KeyValuePair<string, string>("grant_type", "authorization_code"),
                    new KeyValuePair<string, string>("code", boxAccessCode),
                    new KeyValuePair<string, string>("client_id", boxClientId),
                    new KeyValuePair<string, string>("client_secret", boxClientSecret)
                });
                request.Content = content;
                var response = _httpClient.SendAsync(request).Result;
                if (response.IsSuccessStatusCode)
                {
                    JObject jObject = JObject.Parse(response.Content.ReadAsStringAsync().Result);
                    Settings.Default.boxAccessToken = (string)jObject["access_token"];
                    Settings.Default.boxRefreshToken = (string)jObject["refresh_token"];
                    Settings.Default.Save();
                }
            }
        }
        // Streams a file's content; get the file ID from a folder listing.
        public static Stream GetFileAsStream(string fileId)
        {
            string url = string.Format("{0}files/{1}/content", boxApiUrl, fileId);
            return DoBoxCall(url, HttpMethod.Get);
        }
    }
}
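
For what it’s worth, here’s a rough sketch of how the wrapper gets called. The folder ID is a placeholder (grab the real one from the Box UI), and Bootstrap only needs to run once with a freshly generated authorization code:

// One-time setup: paste in the authorization code from the OAuth redirect querystring.
// BoxAPIHelper.Bootstrap("PASTE_AUTHORIZATION_CODE_HERE");

// "12345" is a placeholder folder ID.
string folderContents = BoxAPIHelper.GetFolderById("12345");
JObject jObject = JObject.Parse(folderContents);
JArray entries = jObject["item_collection"]["entries"] as JArray;

foreach (JToken entry in entries)
{
    if ((string)entry["type"] == "file")
    {
        // Pull each log file down from Box and save it locally.
        using (Stream boxStream = BoxAPIHelper.GetFileAsStream((string)entry["id"]))
        using (FileStream fileStream = File.Create((string)entry["name"]))
        {
            boxStream.CopyTo(fileStream);
        }
    }
}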

Maybe that’ll help someone out there…

Beware: Hadoop C# SDK Inserts Line Breaks, Tabs That Break Your Queries

August 22 2014

After banging my head against the wall for many hours, I finally figured out that .NET is adding escaped carriage returns, aka \r\n, when the queries are sent to HDInsight, which is causing the queries to fail. My code was loading the queries from files on disk like this:

string query = string.Empty;
using (var fs = new StreamReader("CreateTempTable.hql"))
{
    query = fs.ReadToEnd();
}

I figured this out by looking at the userArgs file in the templeton-hadoop directory to see what the jobs looked like; they appear like this:

"ADD JAR wasb:///user/jars/csv-serde-1.1.2-0.11.0-all.jar;
\r\nDROP TABLE IF EXISTS temp;\r\nCREATE EXTERNAL TABLE temp
(viewerId string, asset string, device_os string, country string, state 
string, city string, asn string, isp string, start_time_unix_time bigint,
startup_time_ms int) \r\nROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde'
\r\nSTORED AS TEXTFILE LOCATION 'wasb:///temptable';\r\n\r\n               "

As you can see, the query is littered with escaped characters, which causes the Hive query to fail. These same queries can be submitted via PowerShell with no problem.

So, basically, I removed all the line breaks from my HQL in Notepad and everything worked.
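
If you’d rather not hand-edit the files, you can also strip the offending characters in code right after loading the query. A minimal sketch, using the same file as above:

string query;
using (var fs = new StreamReader("CreateTempTable.hql"))
{
    // Replace the line breaks and tabs that break job submission with spaces.
    query = fs.ReadToEnd()
              .Replace("\r\n", " ")
              .Replace("\n", " ")
              .Replace("\t", " ")
              .Trim();
}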

Submitting HDInsight Jobs From An Azure Webjob or WorkerRole Using the C# Hadoop SDK

August 22 2014

All the samples for submitting jobs programmatically to HDInsight assume that you are doing so from a desktop working station that has been set up with a management certificate. The code gets your cert out of the cert store and creates a JobSubmissionCertificateCredential as such:

// Get the certificate object from certificate store using the friendly name to identify it
X509Store store = new X509Store();
store.Open(OpenFlags.ReadOnly);
X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>().First(item => item.FriendlyName == certFriendlyName);
JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(new Guid(subscriptionID), cert, clusterName);
// Submit the Hive job
var jobClient = JobSubmissionClientFactory.Connect(creds);
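// The hiveJobDefinition used below follows the standard SDK samples;
// a minimal example (the job name, status folder, and query are illustrative):
HiveJobCreateParameters hiveJobDefinition = new HiveJobCreateParameters()
{
    JobName = "show tables",
    StatusFolder = "/ShowTablesStatusFolder",
    Query = "show tables;"
};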
JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJobDefinition);

This is all well and good, but what if you need to submit jobs programmatically from, say, an Azure WebJob or a WorkerRole?

The way I solved this was by generating my own management cert with a private key and then uploading it alongside the .exe, placing the cert in the bin directory. Here’s the code to generate the cert (tip of the hat to David Hardin’s post):

makecert -r -pe -a sha1 -n "CN=Windows Azure Authentication Certificate" -ss my -len 2048 -sp "Microsoft Enhanced RSA and AES Cryptographic Provider" -sy 24 -sv ManagementApiCert.pvk ManagementApiCert.cer
pvk2pfx -pvk ManagementApiCert.pvk -spc ManagementApiCert.cer -pfx ManagementApiCert.pfx -po password

Then, after uploading the .cer to the Azure Management Certificate store (see here for how to do that) and adding the .pfx to your project (be sure to set Copy Local to true), you can use the following code to create a JobSubmissionCertificateCredential:

var cert = new X509Certificate2("ManagementApiCert.pfx","your_password",X509KeyStorageFlags.MachineKeySet);
JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(new Guid(subscriptionID), cert, clusterName);

Tip of the hat to Tyler Doerksen, whose post led me to setting the MachineKeySet flag.

And there you go: the ability to submit Hadoop jobs programmatically from a WebJob or WorkerRole.

Adding JAR Files To Hive Queries In HDInsight: WASB References Can’t Be At The Root Of The Container

August 13 2014

Just discovered that if you want to add a JAR file to an HQL statement, the JAR file can’t be at the root of your container. It has to be in a virtual directory. So, for example, this code will not work:

ADD JAR wasb:///csv-serde-1.1.2-0.11.0-all.jar;

But, this code will:

ADD JAR wasb:///user/hdp/share/lib/hive/csv-serde-1.1.2-0.11.0-all.jar;

And, annoyingly, the blob storage browser in Visual Studio doesn’t let you create directories, so you’ll need to download something like ClumsyLeaf CloudXplorer.

HDInsight Hadoop Hive Job Decompresses CSV GZIP Files By Default

August 8 2014

I’ve been working with Hadoop (2.4.0) and Hive (0.13.0) on HDInsight (3.1), and it decompresses GZIP files into CSV by default. Nice! So, loading data with a Hive query in PowerShell:

$response = Invoke-Hive -Query @"
LOAD DATA INPATH 'wasb://$container@$storageAccountName.blob.core.windows.net/file.csv.gz'
INTO TABLE logs;
"@

No additional work or arguments required. I thought I’d have to do something like the approach described in this post with io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec, but apparently not.

UPDATE: Just found this link: https://cwiki.apache.org/confluence/display/Hive/CompressedStorage, which goes into keeping data compressed in Hive and recommends creating a SequenceFile.

“Why We Need The Indie Web” by Tantek Celik

June 25 2014

On the open web…

Null Value When Calling CloudQueue ApproximateMessageCount With Windows Azure Storage Client

June 25 2014

I was recently perplexed as to why the nullable ApproximateMessageCount property was always null. Then I discovered that you have to call FetchAttributes() before accessing this property. Solved!
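
A minimal sketch of the fix, assuming a placeholder queue name and a connectionString you already have:

// Uses the Azure Storage Client library (Microsoft.WindowsAzure.Storage
// and Microsoft.WindowsAzure.Storage.Queue namespaces).
CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
CloudQueueClient queueClient = account.CreateCloudQueueClient();
CloudQueue queue = queueClient.GetQueueReference("myqueue");

queue.FetchAttributes();                    // fetches the queue's metadata from the service
int? count = queue.ApproximateMessageCount; // now non-null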

Microsoft Azure and Visual Studio Online Server Builds – Tips & Tricks

April 22 2014

I’m all about deploying web sites and cloud services via server builds. Say goodbye to deployments from a developer’s box that can’t be reproduced on another box. Say goodbye to deploying code that isn’t checked in. Say goodbye to deployments that aren’t fully tested.  It is indeed super cool.

So how to get it set up? There are some good tutorials out there, but it can get a little tricky. Here’s how I did it.

First, make sure that you’ve linked your Azure account to your Visual Studio Online repository as explained in Step 3 here: http://azure.microsoft.com/en-us/documentation/articles/cloud-services-continuous-delivery-use-vso/

Then, if you open Visual Studio from the Azure website, it will generate a build definition for you, named after your Azure deployment with _CD appended. You’ll have to tweak some things to make it happy, though.

First, go to the Build section of the Team Explorer window.


Then, right click on the Build definition file and click Edit Build Definition.

There’s a bunch of things you will need to change.

In the General tab, make sure you enable the build definition. By default it is disabled.


In the Source Settings tab, make sure it is pointing to the right repository.


In the Trigger tab, you may want to tweak when the deployments happen.


And, in the Process tab, make sure you point the Project setting to the right .sln to build, as well as the Configuration you want, e.g. Release | Any CPU.

And, in the deployment settings, make sure that the Path To Deployment Settings points to your .pubxml file and the Windows Azure Deployment Environment points to the name of the Cloud Service in Windows Azure.


With that all set, you can now build and deploy using server builds!
