Using Parallel.ForEach To Aggregate Results From JSON Files Stored in Windows Azure Blob Storage

April 17 2012

I love NoSQL, except when it comes to reporting.  Then I miss those handy SQL aggregation calls. I recently had a situation where I needed to look at a whole bunch of JSON files stored in Windows Azure Blob Storage and aggregate values from within those JSON files. The exact scenario was to get a count of how many people had achieved each achievement as part of the Visual Studio Achievements project.

Turned out to be an ideal use case for using some of the parallel programming features in .NET 4.0. Rather than download and process N JSON blobs linearly, I could throw the loop in a Parallel.ForEach block and gain speed from the multi-core machine I was using. The code was pretty straightforward, especially once I discovered the System.Collections.Concurrent namespace with its handy ConcurrentDictionary.

And, most significantly, it increased performance by 400 percent!

Below is the code in it’s entirety; I’ll walk through it here:

For JSON deserialization, I’m using a library provided by the WCF team up on CodePlex which is now part of the ASP.NET Web API. It provides some nifty features for turning JSON into dynamic objects.

The first thing I do is download a JSON file that has all the achievements, which is publically available (line 25).

I then put all the achievements in a ConcurrentDictionary<string, int> which I’ll use to build my report (line 33).

Then, I get the blobs and start my Parallel.ForEach loop (line 41). Inside the Action(TSource), I walk the list of the user’s earned achievements, incrementing the count in the ConcurrentDictionary of each achievement (line 57).

Finally, once I exit the loop, I turn the dictionary into an Excel spreadsheet (line 69) – folks tend to like that format.

You’ll notice the remmed out code; that’s the old non-parallelized code if you’d like to compare.

1 using System; 2 using System.Collections.Concurrent; 3 using System.IO; 4 using System.Json; 5 using System.Net; 6 using System.Threading.Tasks; 7 using Microsoft.WindowsAzure; 8 using Microsoft.WindowsAzure.StorageClient; 9 10 namespace AchievementsReporting 11 { 12 class Program 13 { 14 15 static void Main(string[] args) 16 { 17 var cloudBlobClient = new CloudBlobClient(new Uri("https://---.blob.core.windows.net", UriKind.Absolute), 18 new StorageCredentialsAccountAndKey("---", 19 "---")); 20 var container = cloudBlobClient.GetContainerReference("users"); 21 22 string masterJson = string.Empty; 23 using (var webClient = new WebClient()) 24 { 25 masterJson = 26 webClient.DownloadString(new Uri("http://channel9.msdn.com/achievements/visualstudio?json=true")); 27 } 28 dynamic masterList = JsonValue.Parse(masterJson); 29 var statisticsDictionary = new ConcurrentDictionary<string, int>(); 30 //var statisticsDictionary = new Dictionary<string, int>(); 31 foreach (var achieve in masterList.Achievements) 32 { 33 statisticsDictionary.GetOrAdd(achieve.Name.ToString(), 0); 34 //statisticsDictionary.Add(achieve.Name.ToString(), 0); 35 } 36 BlobRequestOptions options = new BlobRequestOptions(); 37 options.UseFlatBlobListing = true; 38 options.BlobListingDetails = BlobListingDetails.Snapshots; 39 Console.WriteLine("Starting..."); 40 DateTime start = DateTime.Now; 41 Parallel.ForEach(container.ListBlobs(options), blobListItem => 42 //foreach (var blobListItem in container.ListBlobs(options)) 43 { 44 CloudBlob blob = 45 container.GetBlobReference( 46 blobListItem.Uri.AbsoluteUri); 47 string json = blob.DownloadText(); 48 if (!string.IsNullOrEmpty(json)) 49 { 50 dynamic achievementsDynamic = 51 JsonValue.Parse(json) as dynamic; 52 foreach ( 53 var achieve in achievementsDynamic.Achievements) 54 { 55 if (achieve.DateEarned != null) 56 { 57 statisticsDictionary[achieve.Name.ToString()] = 58 statisticsDictionary[achieve.Name.ToString()] + 1; 59 } 60 } 61 } 62 } 63 ); 64 65 using (StreamWriter writer = new StreamWriter("report.xls")) 66 { 67 foreach (var key in statisticsDictionary.Keys) 68 { 69 writer.WriteLine(key + "\t" + statisticsDictionary[key].ToString()); 70 } 71 } 72 TimeSpan diff = DateTime.Now - start; 73 Console.WriteLine("done - took: "); 74 Console.WriteLine(diff.TotalMinutes); 75 } 76 } 77 } 78