Posts from Engineering on topic: Java

Dremel made simple with Parquet

Columnar storage is a popular technique for optimizing analytical workloads in parallel RDBMSs. The performance and compression benefits for storing and processing large amounts of data are well documented in academic literature as well as in several commercial analytical databases.

Drinking from the Streaming API

Today we’re open-sourcing the Hosebird Client (hbc) under the ALv2 license to provide a robust Java HTTP library for consuming Twitter’s Streaming API. The client is full-featured: it offers support for GZip, OAuth, and partitioning; automatic reconnections with appropriate backfill counts; access to the raw bytes payload; proper retry schemes; and relevant statistics.

Visualizing Hadoop with HDFS-DU

We are heavy adopters of Apache Hadoop, with a large set of data residing in its clusters, so it’s important for us to understand how these resources are utilized. At our July Hack Week, we experimented with developing HDFS-DU to give us an interactive visualization of the underlying Hadoop Distributed File System (HDFS).

Introducing the Open Source Twitter Text libraries

Over time, Tweets have acquired a language all their own. Some of its conventions have been around a long time (like @username at the beginning of a Tweet) and some are relatively recent (such as lists), but all of them make the language of Tweets unique. Extracting these Tweet-specific components is relatively simple for the majority of Tweets but, as with most text-parsing problems, the devil is in the details.