Categories

apache spark

Setting up Scala for Spark App Development

4 January 2016

Apache Spark is a popular framework for distributed computing, both within and without the Hadoop ecosystem. Spark of...

A Quickie on Spark Actions, Laziness, and Caching

1 April 2016

Time is important when thinking about what happens when executing operations on Spark’s RDDs. The documentation on ac...

Shell Scripts to Ease Spark Application Development

25 April 2016

When creating Apache Spark applications the basic structure is pretty much the same: for sbt you need the same build....

A Quickie on Batch-Updating Non-Key Columns in Apache Phoenix

15 July 2016

Apache Phoenix is a SQL skin for HBase, a distributed key-value store. Phoenix offers two flavours of UPSERT, but fro...

A Quickie on Playing with Spark in sbt

13 January 2017

Unless you have a cluster with Apache Spark installed on it at your disposal, you may want to play a bit with Spark o...

A Quickie on Reading JSON Resource Files in Apache Spark

17 February 2017

If you need to read a JSON file from a resources directory and have these contents available as a basic String or an ...

big data

Data ≠ Information

3 August 2014

Big Data. Every time I hear that buzz phrase whizzing around I have the inexplicable urge to smack it back into the ...

An Overview of File and Serialization Formats in Hadoop

7 December 2015

Apache Hadoop is the de facto standard in Big Data platforms. It’s open source, it’s free, and its ecosystem is garga...

Shell Scripts to Ease Life with Hadoop

10 February 2017

Interaction with HDFS via the file system shell commands or YARN’s commands is cumbersome. I have collected several h...

Shell Scripts to Check Data Integrity in Hive

11 February 2017

Apache Hive does not come with an out-of-the-box way to check tables for duplicate entries or a ready-made method to ...

blockchain

Blockchains and Music

3 March 2017

There are two crucial pieces of information for each song: who performed it (performing rights) and who owns the righ...

bytesized

A Quickie on Connecting to Oracle Database VM From A Host

26 April 2015

Although the pre-built Oracle Database 12c VMs come with Oracle SQL Developer and APEX, you may not want to leave the...

A Quickie on Oracle Date Arithmetic Weirdness

8 July 2015

Although the date arithmetic in Oracle Database is well documented, it is not always as clear as it could be. In this...

A Quickie on Spark Actions, Laziness, and Caching

1 April 2016

Time is important when thinking about what happens when executing operations on Spark’s RDDs. The documentation on ac...

A Quickie on Batch-Updating Non-Key Columns in Apache Phoenix

15 July 2016

Apache Phoenix is a SQL skin for HBase, a distributed key-value store. Phoenix offers two flavours of UPSERT, but fro...

A Quickie on Playing with Spark in sbt

13 January 2017

Unless you have a cluster with Apache Spark installed on it at your disposal, you may want to play a bit with Spark o...

A Quickie on Collecting Table and Column Statistics in MemSQL

5 February 2017

MemSQL is a distributed in-memory database that is based on MySQL. As of the latest version (5.7), MemSQL does not au...

A Quickie on Reading JSON Resource Files in Apache Spark

17 February 2017

If you need to read a JSON file from a resources directory and have these contents available as a basic String or an ...

data engineering

An Overview of Apache Streaming Technologies

12 March 2016

There are many technologies for streaming data: simple event processors, stream processors, and complex event process...

The Problems with Visual Programming Languages in Data Engineering

22 September 2016

Recently I had a conversation about the value proposition of visual programming languages, especially in data enginee...

data governance

The Challenges of Data Integration: Data Governance (Part 4/4)

6 July 2014

The fourth and final part of the series on the challenges of data integration is about data governance. Data governan...

The Data-Quality Doctor

31 August 2014

When it comes to data quality, many companies are still stuck in firefighting mode: whenever a problem arises, a quic...

Frustration: The Data Governance Sinkhole

23 November 2014

Plentiful are the companies that revel in fancy descriptions of data-driven decision-making cultures on their corpora...

Why Govern Your Data?

1 February 2015

The way a company looks at its data is indicative of its readiness to embrace a data governance programme: is data a ...

data integration

The Challenges of Data Integration: Technical (Part 1/4)

25 May 2014

Data integration is a formidable challenge. For one, data integration is never the goal of an organization. Similarly...

The Challenges of Data Integration: Project Management (Part 2/4)

8 June 2014

In this second post of a four-part series on the challenges of data integration I want to talk about project manageme...

The Challenges of Data Integration: People (Part 3/4)

22 June 2014

The third part of the four-part series on the challenges of data integration deals with people. I have already hinted...

The Challenges of Data Integration: Data Governance (Part 4/4)

6 July 2014

The fourth and final part of the series on the challenges of data integration is about data governance. Data governan...

data science

The Football Fortune Tellers

13 July 2014

It was of course a public relations gimmick by Goldman Sachs and PricewaterhouseCoopers to ‘predict’ the outcome of the ...

Homelessness: A Problem with Data but Few Solutions

14 September 2014

A while ago I had the pleasure of visiting Strasbourg, the capital of the Alsace region and home of the European parliam...

What Makes A New York Times Article Popular? (Part 2/2)

22 May 2015

In the previous post I talked about the details of the data set for the Kaggle challenge to build a model that predic...

What Makes A New York Times Article Popular? (Part 1/2)

22 May 2015

I completed the MITx course The Analytics Edge on edX, which I can wholeheartedly recommend to anyone who is interest...

The Case for Industrial Data Science

1 February 2016

It has – perhaps somewhat prematurely – been called the sexiest job of the twenty-first century, but whether you buy ...

memsql

A Quickie on Collecting Table and Column Statistics in MemSQL

5 February 2017

MemSQL is a distributed in-memory database that is based on MySQL. As of the latest version (5.7), MemSQL does not au...

music

How Do Musicians Make Money? (Part 2/2)

12 February 2017

2016 was a good year for the music business: revenue from streaming and digital downloads exceeded that of physical s...

How Do Musicians Make Money? (Part 1/2)

12 February 2017

For most of human history, music has been composed and performed by amateurs, enslaved individuals, and professionals...

Blockchains and Music

3 March 2017

There are two crucial pieces of information for each song: who performed it (performing rights) and who owns the righ...

oracle

Oracle Database Optimization for Developers

17 August 2014

Let’s talk about my goings-on over at Read the Docs (RTD).

How to Multiply Across a Hierarchy in Oracle: SQL Statements (Part 1/2)

28 September 2014

Hierarchical and recursive queries pop up from time to time in the life of a database developer. Complex hierarchies ...

How to Multiply Across a Hierarchy in Oracle: Performance (Part 2/2)

12 October 2014

In the first part I showed several options to calculate aggregation functions along various branches of a hierarc...

Searching The Oracle Data Dictionary

9 November 2014

Databases and especially data warehouses typically consist of many dozens of tables and views. Good documentation is ...

Tuning Distributed Queries in Oracle

7 December 2014

When it comes to SQL statements and optimizing queries on relational databases, probably the first thing developers (...

Unit Testing PL/SQL Code?

21 December 2014

In almost all areas of software development, unit testing is not only common sense but also common practice. After al...

Oracle SQL and PL/SQL Coding Guidelines

4 January 2015

Coding standards are important because they reduce the cost of maintenance. To enable database developers on the same...

An Overview of PL/SQL Collection Types

1 March 2015

Collections are core components in Oracle PL/SQL programs. You can (temporarily) store data from the database or loca...

Checking Data Type Consistency in Oracle

29 March 2015

In large databases it can be a challenge to have data type consistency across many tables and views, especially since...

A Quickie on Connecting to Oracle Database VM From A Host

26 April 2015

Although the pre-built Oracle Database 12c VMs come with Oracle SQL Developer and APEX, you may not want to leave the...

A Quickie on Oracle Date Arithmetic Weirdness

8 July 2015

Although the date arithmetic in Oracle Database is well documented, it is not always as clear as it could be. In this...

ETL: A Simple Package to Load Data from Views

29 February 2016

A common, native way to load data into tables in Oracle is to create a view to load the data from. Depending on how t...

round-up

Weekend Reading Round-Up

15 September 2017

Below I have collected an initial batch of recent research articles and posts on various topics, such as deep learnin...

Weekend Reading Round-Up

13 October 2017

In this month’s reading round-up I look at businesses in Africa, category theory and Scala, fancy copy-pasting of cod...

Weekend Reading Round-Up

17 November 2017

How easy is it to deceive a deep neural network? Does gender of leaders affect team cohesion? Can music be classified...

Weekend Reading Round-Up

21 December 2017

Hot off the press! Articles on deep learning, early-warning signs of critical transitions, recommendation engines, s...

software

Numerical Algorithms: Variational Integrators

26 October 2014

When I was rummaging through my digital attic I found code that I had worked on a few years ago. It is not related to...

Mapping a Value Stream in Neo4j

16 August 2015

The canonical use cases of a graph database such as Neo4j are social networks. In logistics and manufacturing network...

The Fast Track to Julia

13 September 2015

Julia is a fairly new and promising programming language that is designed for technical computing. Here I present a c...

A Glossary of Some Software Terminology

8 December 2016

Continuous integration, Docker, Jenkins, Vagrant, DevOps, PaaS, serverless… If you are sometimes confused as to what ...

How to Stay Up to Date with Trends in Tech

23 January 2017

As technology evolves at a rapid rate, it may sometimes be difficult to keep up. While I am certainly not the world’s...