I recently started investigating Apache Spark as a framework for data mining. Spark builds upon Apache Hadoop and supports many more operations than plain map-reduce. It also supports streaming data and iterative algorithms.
Since Spark builds upon Hadoop and HDFS, it is compatible with any HDFS data source. Our server uses MongoDB, so we naturally turned to the mongo-hadoop connector, which allows reading and writing directly from a Mongo database.
However, it was far from obvious (at least for a beginner with Spark) how to use and configure mongo-hadoop together with Spark. After a lot of experimentation, frustration, and a few emails to the Spark user mailing list, I got it working in both Java and Scala. I wrote this tutorial to save others the exasperation.
Read below for details. The impatient can just grab the example application code.
I’ve recently started listening to audiobooks. They’re a convenient way to enjoy books on your way to work or while driving. After listening to Mika Waltari’s The Egyptian, I took on The Hunger Games, read by Carolyn McCormick.
Like many who have reviewed the audiobook, I took an immediate dislike to the narration. It was not so much her voice as her pacing. She does not give the words time to sink in. It was a constant, mild irritation: the book could be so much better if the reader took just a few more pauses. Rather than giving up on the book, I started coding.
I wrote a Ruby script, Audiobook Pacer, that can change the pace of reading of an audiobook. (I first tried writing a LADSPA plugin, but it seems they cannot modify the length of the audio.) The script works by adjusting the length of pauses the reader takes between sentences and paragraphs. All pauses longer than a specified time are lengthened or shortened by a set percentage. Breaks between words shouldn’t be modified, as this may break the flow of the sentence.
In the case of The Hunger Games, I increased by 25% the length of all pauses longer than 0.6 seconds. The change is subtle, but it makes all the difference between constant irritation and enjoyment.
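The idea can be sketched in a few lines of JavaScript. This is not the Audiobook Pacer source (which is a Ruby script); it is a minimal illustration that assumes the audio is already decoded into an array of mono samples, and that a "pause" is a run of samples quieter than a hypothetical `silenceLevel` threshold. The function name and all parameters are made up for this example.

```javascript
// Illustrative sketch only; names and the silence heuristic are assumptions.
// samples: mono amplitudes in [-1, 1]; sampleRate in Hz.
// Pauses longer than minPauseSec are scaled by `factor` (1.25 = 25% longer).
function stretchPauses(samples, sampleRate, minPauseSec, factor, silenceLevel = 0.01) {
  const minRun = Math.round(minPauseSec * sampleRate);
  const out = [];
  let i = 0;
  while (i < samples.length) {
    if (Math.abs(samples[i]) >= silenceLevel) {
      out.push(samples[i]); // speech: copy unchanged
      i += 1;
      continue;
    }
    // Find the end of this run of quiet samples.
    let end = i;
    while (end < samples.length && Math.abs(samples[end]) < silenceLevel) {
      end += 1;
    }
    const runLength = end - i;
    // Short runs (breaks between words) keep their length; long ones are scaled.
    const newLength = runLength > minRun ? Math.round(runLength * factor) : runLength;
    for (let k = 0; k < newLength; k++) {
      // Reuse the original quiet samples, then pad with digital silence.
      out.push(k < runLength ? samples[i + k] : 0);
    }
    i = end;
  }
  return out;
}
```

With these assumptions, the Hunger Games setting would correspond to something like `stretchPauses(samples, 44100, 0.6, 1.25)`.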
Update: After listening to the Games for a few hours, I started getting irritated by the narration once more. It turned out I had converted only half of the files. After I modified the pace of the remaining files, enjoyment prevailed.
One of the great things about Cucumber and Watir is that they allow you to write functional tests that are decoupled from the UI. By using page objects, the definition of how the UI works is kept separate from the tests themselves. If the UI changes, you only need to update the corresponding page object, and all of your tests still run.
Such tests provide an excellent safety harness in which changes can be made with the confidence of not breaking other features. The only problem is that the tests verify the functionality, but not the visuals of the pages. We were missing the safety harness for CSS changes.
For this purpose I implemented a set of tests that verify the visual appearance of certain core pages. This prevents someone from accidentally making a CSS change that affects other pages as well.
Since these tests are very brittle by definition, I do not recommend having a lot of them. You need to identify a few core pages from your application that rarely change their visual appearance, but which still cover the most important parts of your CSS.
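To give a feel for what such a check involves, here is a minimal sketch of a screenshot comparison. It is not the code from the example repository; it assumes both screenshots have already been decoded into RGBA byte arrays of the same dimensions, and the function name and tolerance parameter are hypothetical.

```javascript
// Illustrative sketch only: compares two decoded RGBA screenshots.
// Returns the fraction of pixels whose channels differ by more than
// channelTolerance (0 = exact match required).
function diffRatio(baseline, current, channelTolerance = 0) {
  if (baseline.length !== current.length) {
    throw new Error("Screenshots differ in size");
  }
  const pixelCount = baseline.length / 4;
  let differing = 0;
  for (let p = 0; p < baseline.length; p += 4) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(baseline[p + c] - current[p + c]) > channelTolerance) {
        differing += 1; // count each pixel at most once
        break;
      }
    }
  }
  return pixelCount === 0 ? 0 : differing / pixelCount;
}
```

A test would then fail when `diffRatio(baseline, current)` exceeds some small threshold. Allowing a little tolerance is one way to reduce the brittleness mentioned above, at the cost of missing very subtle changes.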
For the impatient, the example code is on GitHub.
Many applications use the current time in their functionality. For example, they can show data for a certain period of time or show the current date within the application. Writing functional tests for such applications can be tedious. How do you write a repeatable test for functionality that only occurs on Thursdays?
For this purpose I wrote TimeShift.js. It is a mock implementation of the normal Date constructor which allows you to set the current time and time zone.
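TimeShift.setTimezoneOffset(-300);   // GMT+0500 (note the sign); value inferred from the output below
TimeShift.setTime(1275393600000);    // 2010-06-01 12:00:00 UTC; value inferred from the output below
Date = TimeShift.Date;               // replace the global Date with the mock
new Date().toString();               // => "Tue Jun 01 2010 17:00:00 GMT+0500"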
This way you can write repeatable test cases that still depend on the current time.
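For the Thursday question above, the underlying idea can also be shown without any library. The sketch below is not how TimeShift.js is implemented; it uses a simpler technique, passing the Date constructor into the code under test, and both `frozenDate` and `isThursdaySpecial` are hypothetical names invented for this example.

```javascript
// Illustrative sketch only: a Date subclass whose "now" is frozen.
function frozenDate(fixedMs) {
  const RealDate = Date;
  return class FrozenDate extends RealDate {
    constructor(...args) {
      if (args.length === 0) {
        super(fixedMs); // no arguments means "now": use the frozen instant
      } else {
        super(...args); // explicit dates behave as usual
      }
    }
    static now() {
      return fixedMs;
    }
  };
}

// Code under test receives the Date constructor, so a test can inject the fake.
function isThursdaySpecial(DateCtor = Date) {
  return new DateCtor().getUTCDay() === 4; // 4 = Thursday
}

// A repeatable check: 2010-06-03 was a Thursday, 2010-06-01 a Tuesday.
const Thursday = frozenDate(Date.UTC(2010, 5, 3, 12));
const Tuesday = frozenDate(Date.UTC(2010, 5, 1, 12));
console.log(isThursdaySpecial(Thursday), isThursdaySpecial(Tuesday));
```

Injecting the constructor avoids touching global state, but it only works for code you control. Replacing the global `Date`, as TimeShift.js does, also covers code that calls `new Date()` directly.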
Smartphone cameras are now so common that they are often used instead of scanners as a quick way to digitize documents. Unfortunately this leads to excessive vignetting (darkened areas at the edges), which makes it hard to print the document legibly.
For simple text documents and line drawings, however, it takes just four simple steps to correct the image. The following describes the steps in GIMP, but the same should apply to Photoshop and probably other image manipulation software as well.