The Houzz Blog

Houzzer Profile: Amanda Gentle, Account Manager

Thursday, August 31, 2017

Based in our Orange County office, Amanda is part of the Houzz Pro Plus team, which helps home professionals to build their brands and connect with homeowners. When she’s not at work, Amanda can be found at the beach with her husband and their golden retriever, named Scout.

What’s a typical workday like for you?
My days are diverse and busy; I love it! As an Account Manager, I talk to my clients about their goals and use data to help them maximize their presence on Houzz. It’s really rewarding to see how Pro+ helps them drive brand awareness and grow their business. They look to me for advice, knowledge, and problem-solving. Knowing that I’ve helped them achieve their goals is probably my favorite part of the day.

Why did you decide to join Houzz?
I knew I wanted to work in the tech industry and after doing research I saw how much Houzz was paving the way and wanted to align myself with this company. More importantly, though, I decided to join Houzz because when I walked in the office, it felt like home. I love the culture.

What’s it like to be a part of the sales team?
The sales culture at Houzz is fast-paced and collaborative. Hard work is valued in this role and we are given the tools to succeed from encouraging sales managers and teammates. We are always pushing one another to get better and celebrate one another’s achievements.

How has your career evolved at Houzz?
Houzz has provided many opportunities to grow in my career. I first started as an Account Coordinator helping new professionals create their Houzz profiles. I was promoted to Account Executive and began helping some of those same professionals expand their presence through the Pro+ program. I am now in a blended role as an Account Manager where I work with both current pros in our network and help new pros discover the benefits we can offer. It’s a great mix. I get to see the success of my professionals, brainstorm with them, and build new relationships.

What’s your favorite part of working at Houzz?
The lifelong friendships I’ve formed while working at Houzz make it such an enjoyable place to work, which is important to me because I moved from Texas to take this job. I knew very few people in California and Houzz has an amazing team that makes California feel like home. We work hard but I love waking up and coming to this environment.

Beyond my role as an Account Manager, I am also the Chair of the Philanthropy Committee at the Orange County office. We recently put on a very successful philanthropy week, raising money, collecting food items, and gathering volunteers to remodel a children’s home in the area. I appreciate being part of a company that cares about the community we live in. The Philanthropy Committee has broadened my purpose at work and I am extremely grateful.

What do you do when you’re not working?
I LOVE to travel. My husband and I have been able to take some great vacations this past year including our most recent trip to Italy. I enjoy working out, playing with my golden retriever, and going to the beach. We are still exploring California, so you’ll often find us wandering through different neighborhoods to find delicious new restaurants.

Engineering Interns Share What They’ve Learned at Houzz

Friday, August 25, 2017

Houzz welcomed a talented group of interns to the engineering team at our HQ in Palo Alto this summer. Here is what a few of them had to say about what they learned during their internship and their favorite thing about working at Houzz.

Ziran Ling (Back-End Engineer)
What I learned: I’ve learned many things from my experience with Houzz, including a new programming language and different aspects of web servers and APIs that I hadn’t been exposed to previously.

Favorite thing about Houzz: I really like the working environment. People work closely together here. If you talk with the project leads and show your interest in a project, they will likely let you work on it, which is a great way to learn. You have more opportunities to be exposed to different projects and acquire new skill sets.

Theresa Nguy (3D Artist)
What I learned: I learned how to use different tools to work on 3D projects, like making dynamic furniture models. My team is very supportive and helps me whenever I have questions.

Favorite thing about Houzz: I really enjoy the company culture and focus on teamwork at Houzz. As a recent graduate, the work environment made me feel very comfortable.

Wenqin Wang (Growth Engineer)
What I learned: I’m working on the growth team as a full-stack web developer to help users get the most out of their Houzz experience. As an intern, I’ve learned a lot about web development (JavaScript, Ajax, HTTP, PHP), and since growth is very consumer-facing, I’ve gained some experience using marketing and design tools. I also now know more about Game of Thrones than I thought possible because our team is so into the show that we often have Q&A sessions about it!

Favorite thing about Houzz: People and culture! I definitely love all of the friendly people who offer help, support and happiness. I also love the food (different dinner caterers every day and birthday desserts every Friday), the location (downtown Palo Alto), and our office (it feels very homey, and I wear Houzz slippers every day!).

Chang Liu (Front-End Engineer)
What I learned: This is my first time working as a front-end engineer in an industry setting. Prior to this, my web projects have been personal ones and small in scope. With that, it’s been a meaningful experience learning how to familiarize myself with and write professional client-side code. Coming into this internship, I had little experience with React and Flux patterns, but I definitely feel comfortable using them now.

Favorite thing about Houzz: I think Houzz is the ideal size company for me. It is still growing at a rapid rate, so the scope of the projects we take on is large and ambitious. That, plus the size of the engineering department, really makes you feel like an integral part of the company. It also helps that we receive mentorship from some of the nicest engineers I’ve ever met and work in an aesthetically pleasing office (of course, since Houzz is an interior design platform!).

Parents Draw on Nature Themes to Infuse Creativity in Kids’ Room Design

Wednesday, August 23, 2017

With kids heading back to school, we surveyed* the Houzz community on how they design their kids’ rooms and tackle child-related home design challenges. More than two-thirds of respondents said their kids’ rooms have a decor theme (69%). Of those, a nature theme led the pack (30%), followed by animals (23%), sports (17%) and princesses (15%). Other notable themes include nautical, superheroes and geometric patterns.

Colors are another form of expression, with more than half of kids’ bedrooms painted blue (59%), followed by white (31%), gray (30%), green (25%) and pink (20%). We found that kids also get a vote in home decor, with one in five respondents allowing children younger than six years old to decorate their own rooms (20%).

While the design elements are important, parents also prioritize functionality in their kids’ rooms. Top considerations include creating a space that is easy to clean/maintain (71%), having a functional setup (64%), and using durable materials (47%). One in five respondents also place significance on non-toxic products (20%).

These priorities follow the top two child-related challenges in the home, which are keeping things clean (74%) and managing clutter (69%). In fact, only 8% feel their children’s things are nearly always organized, while 22% feel their children’s things are mostly or completely disorganized. As with many things in life, 42% say it totally depends on the day.

The most popular solutions are bookshelves and toy boxes for the storage of things not related to clothing in children’s rooms (71% and 70%, respectively). Beyond cleanliness and clutter, other child-related challenges in the home include overflowing laundry (35%), tripping or stepping over child-related stuff (30%), limited play space (29%) and damage to furniture (25%), carpeting (24%) and walls (22%).

These considerations extend to other areas outside of bedrooms that are also dedicated to children, including outdoor space (49%), playrooms (44%), reading nooks (22%), homework stations (21%), craft stations (20%) and game rooms (20%).

Most respondents enlist the help of their kids to bring order to the home. More than two-thirds of children over three years old clean their own room (68%). Children also help out at home by picking up after themselves (76%), feeding the pets (56%), making their bed (53%) and setting the table (53%).

For storage solutions that fit perfectly within creative kids’ decor, check out this collection of drawer chests, bins and cubbies from the Houzz Shop.

*The Houzz survey of more than 200 members of the U.S. Houzz community who have recently completed, are currently working on, or are planning a home project with children in mind was fielded in July 2017.

Houzz Scholarship Program: Open for Entries!

Wednesday, August 16, 2017

We’re now accepting entries for the Houzz Scholarship Program, which supports the next generation of residential design and architecture pros: students studying architecture, interior design, and landscape architecture.

Houzz will award four $2,500 scholarships in the categories of ‘Women in Architecture’, ‘Sustainable Design’, ‘Residential Interior Design’ and ‘Residential Construction Management’. Houzz awards these scholarships biannually in the spring and fall.

High school seniors, undergraduates, and graduate students 17 years of age or older are invited to apply at houzz.com/scholarships. In addition to submitting a brief essay on their design and architecture influences, students are invited to create a Houzz professional student profile, where they can showcase their portfolio of work and network with more than a million professionals around the world.

Fall 2017 winners included:

Women in Architecture: Angelica Gallegos (studying architecture at the Pratt Institute)
Sustainable Residential Design: Sinna Pang (studying interior design at Fullerton College)
Residential Design: Mike Lidwin (studying architecture at the University of Tennessee, Knoxville)
Residential Construction Management: Jace Weber (studying construction management at Louisiana State University)

The deadline for the spring 2018 scholarships is December 15, 2017. Good luck to all applicants!

Helping Houzz Run Smoothly with TAP

Wednesday, August 9, 2017

At Houzz, we’re constantly identifying new ways to ensure the best possible experience for our community. One way to do that is by testing our systems for hidden bugs. Recently, we developed our own tool called TAP (Type Analyzer for PHP) to add static analysis to our dynamic language, so that we can catch issues before they are released to the public.

Static vs. dynamic language
There are two main categories of programming languages: statically typed languages like C++ and Java, and dynamically typed languages like Python, PHP and JavaScript. Dynamic languages are beneficial in that they offer fast development speed, do not require compilation and allow flexible design patterns. However, this flexibility leads to vulnerability. In a large codebase, it’s very easy to make mistakes like passing a map to a function that expects a list, or trying to call a member function $a->foo() while $a is actually null. These types of bugs typically hide in a particular code path and go undetected until that path breaks in production, causing a negative experience for users.
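The null case described above can be sketched in Python, one of the dynamically typed languages named here. This is an illustration only, not TAP output; the User class and the function names are hypothetical.

```python
from typing import Dict, Optional


class User:
    def __init__(self, name: str) -> None:
        self.name = name

    def greeting(self) -> str:
        return f"Hello, {self.name}"


def find_user(users: Dict[str, User], key: str) -> Optional[User]:
    # Returns None when the key is missing; callers must handle that case.
    return users.get(key)


def greet_unchecked(users: Dict[str, User], key: str) -> str:
    # Bug: find_user may return None, so this raises AttributeError at run
    # time, but only on the code path where the key is missing.
    return find_user(users, key).greeting()


def greet_checked(users: Dict[str, User], key: str) -> str:
    user = find_user(users, key)
    if user is None:  # the nullable value is handled before use
        return "Hello, guest"
    return user.greeting()
```

Both functions behave identically for known keys, which is why the bug can survive testing. A static checker that reads the Optional annotation can flag greet_unchecked without ever running it.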

Since a number of components of the Houzz website are written in PHP (a dynamic language), we sought a tool that could identify these types of bugs before new versions of code are released. The only available tools we found either didn’t have the sophistication to catch these bugs or required us to completely re-write our codebase in a new syntax and re-install local and production environments. So, we developed TAP.

The solution
TAP is a C++ tool that scans a PHP codebase, including PHPDoc comments, to identify irregularities. It is less demanding than other available tools: you don’t have to reinstall your local and production environments to account for new syntax, and it can run checks without impacting daily user activities. TAP can scan the entire Houzz PHP codebase and provide a report in eight minutes.

How it works
Static analysis combines the flexibility of dynamic languages with the rigor of static languages. By checking the code statically during the development stage, TAP deduces the types of local variables, tests the compatibility of function arguments and checks whether a variable is nullable in order to identify existing bugs.

TAP uses PHPDoc for type declarations. PHPDoc comments are multi-line comments between /** and */ that contain annotations starting with @. PHPDoc is also used by IDEs like PhpStorm for a similar purpose.

The most useful annotations are @param (for function param type), @return (for function return type) and @var (for object property type). Here is an example:

This is the resulting example report:

TAP supports four more precise types than PHP itself. To learn more about the types of errors TAP can detect, review the test PHP files at this link.

TAP Usage

TAP supports three running modes, including:

Single: This mode is typically used to demonstrate TAP’s basic functionality and for TAP’s self-inspection test. Use -f to specify the single PHP file you want to check. Please note that if this file uses any classes/functions/consts defined elsewhere, TAP won’t know about them and will report errors like DEFINITION_NOT_FOUND.

Batch: This mode does a full scan of the whole repository. If the -s argument is specified, TAP will take it as the source root, and check the .tap config file there. If -s isn’t specified, TAP will try to find the first .tap file at or above the current directory, and check all PHP files under the location where .tap resides. Here is an example of a .tap config file:

You can specify which directories to skip, and which directories to be scan-only. “ScanOnly” means that TAP will only check the functions’ signature but not the implementation, which is much faster than a full scan. It can be applied to third-party libraries and auto-generated classes.

You can use -r to specify the human-readable error report file, and -d to specify the sqlite db file, which is intended to be read by a web UI tool.

Daemon: Daemon is an experimental mode. It will run interactively during PHP development. After it is started, it will do a quick scan for the whole repository, only recording the types of class properties and function signatures, and skipping the function implementation. It will continue watching for any file changes and updating the recorded signatures in real time.

If, for example, you think your change is ready and want TAP to do a full scan before you commit, you can explicitly tell TAP to do it on the files/directories you touched.
This mode will ultimately become much faster than Batch mode, and more suitable for the development process.
Note: At the time this blog post was published, Daemon mode was under development. Errors may be reported when using this mode.

Open Sourcing TAP
Once we completed TAP and tested it in our environment, we decided to open source the application. Other companies can use TAP on top of their PHP to check their own code for bugs before releasing updates to their customers. Follow this link to access the open source code.

Getting TAP
Developers can use the pre-built binary directly, by downloading the Mac OSX version here. Before running the tool, make it executable with chmod +x tap_server.

To build TAP, use cmake. Developers may need to install the dependencies beforehand (using brew, apt-get, yum or another package manager), including:

Boost
Folly
Glog
Gflags
Sqlite3
Fswatch

Then create a directory for the build:

Assuming the code is at /houzz/tap, run cmake to generate the makefile, then build:

Maximizing TAP
To make full use of TAP’s capabilities, it’s important to annotate classes as much as possible. TAP takes these annotations as a source of truth, deducing all local variables inside functions, and reporting type incompatibilities. The more annotations you provide for your methods and properties, the better TAP will be!

We’re always on the hunt for engineers to help us make the Houzz experience even better. Check out opportunities on our team at houzz.com/jobs.

Tour Our New APAC HQ

Tuesday, July 25, 2017

Welcome to our new Asia Pacific headquarters located in Sydney’s artsy suburb, Chippendale. I’m Khanh Nguyen, a member of the community team with Houzz Australia, and I’m excited to share our new home with you!

This office hosts the teams supporting our communities in Australia, New Zealand and Singapore, as well as our APAC operations. The space was designed to highlight the unique design aesthetic of each of the APAC countries where we’ve launched a localized platform, and to create a warm and collaborative space for our growing team.

Let’s start the quick tour, with a focus on the design, of course!

As soon as you open the door, you’ll find a bright Houzz-green chair and photos of team members with their families. You’ll find photos of employees with their families at any Houzz office you visit around the world as we want people to feel like they’re at home when they’re at Houzz.

As we have an open floor plan, our meeting rooms are always coveted spaces. Down the hall you’ll see the first and largest of our meeting rooms, the Aussie Backyard. Inspired by Aussies’ inherent love of outdoor entertaining, this room features a floor-to-ceiling decal of a house designed by Aussie architect Matt Gibson Architecture + Design.

The next meeting room you’ll see is our Bollywood Movie Theatre, a homage to the launch earlier this year of our dedicated India site. This room features jewel-toned cushions, draped traditional saris and retro Bollywood movie posters from the 70s and 80s.

Our bathroom meeting room is next – styled clean white with timber accents. The bathroom decal is from one of our Houzz professionals, Green Country Developments.

Another favourite space to gather is our wine cellar meeting room, which has an industrial barn door and wine barrel. This room is dedicated to our New Zealand site (fun fact: New Zealand has the second largest wine region in the world!).

As you walk through the office, you may notice the black beams in the ceiling. These get their colour from being charred in a fire 100 years ago. To offset the dark beams we have a network of exposed Houzz-green painted pipes.

On the second floor you’ll find healthy (and sweet) treats in the hub of our office: the kitchen. This is where the team spends most of its time together, whether it be starting the week over breakfast on Mondays or wrapping up with lunches on Fridays. The internal exposed besser brick is offset by illuminated black shelves, white cabinetry and an L-shaped marble white Caesarstone benchtop, tying into the overall industrial aesthetic.

You’ll see three more meeting rooms that we love, starting with the Japanese zen room. Lined with ‘tatami’ style carpet and Japanese zabuton cushions this room emulates traditional Japanese ‘washitsu’ rooms. Surround yourself in ‘fusuma’ - traditional sliding doors which flank the walls and channel positive thoughts.

You’ll next find yourself in the Singapore balcony room. The stylistic use of the balcony resonates with the majority (85%) of Singaporeans who live in apartments and depend on their balconies as their private outdoor space. On one wall you’ll see a floor-to-ceiling decal of the Singapore skyline including the famous Marina Bay Sands, home to the infinity pool.

And lastly, the cigar room. This UK-inspired room pays homage to Australia being a British Commonwealth nation. Complete with a whiskey jar set and automated lights triggered when you take a seat on the quilted leather chair, this room is perfect for making those important calls.

Onwards and upwards to the space occupied by our sales team. What started off as a clean slate with house-shaped partitions has been decorated by our teams with a colourful array of floral garlands and fairy lights to personalise their workspace and channel some creative energy.

To wrap up our tour, nothing screams Australia more than our open-air rooftop terrace equipped with a barbeque for afternoon sausage sangas (sausage sandwiches). Take a seat under the big umbrella on the outdoor lounge set. This is where you’ll find the team enjoying sunkissed lunches and winding down on Friday night with the skyline backdrop across neighbouring rooftops.

Thanks for stopping by - don’t forget to say bye to our office dog, Crumpet, on your way out!

All images credited to Ute Wegmann Photography.

Home Remodelers Confident in Industry; Backlogs of Over a Month on Average

Thursday, July 20, 2017

The Houzz Renovation Barometer, which tracks confidence in the home renovation market among industry professionals, has shown strong quarter-over-quarter confidence for all sectors for the last 10 quarters. With firms in high demand, we dug into the impact on renovation timelines.

This week, we introduced the Houzz Renovation Barometer Backlog Index, which shows the number of weeks before an average firm can start work on a new mid-sized home renovation or design project given its current project commitments. The index revealed an average backlog of four to seven weeks across industry sectors. General contractors (GCs)/remodelers and design-build companies have the longest average backlogs nationally at seven weeks, while designers show four-week backlogs on average.

Backlogs vary significantly among the top 20 metro areas. In the Boston-Cambridge-Quincy, MA-NH metro, Seattle-Tacoma-Bellevue, WA metro, Portland-Vancouver-Beaverton, OR-WA metro, and San Jose-Sunnyvale-Santa Clara, CA metro, GCs and design-build companies face backlogs of more than three months (13-14 weeks). In contrast, remodelers in Detroit-Warren-Livonia, MI, Chicago-Naperville-Joliet, IL-IN-WI, and Houston-Sugar Land-Baytown, TX metro areas have the shortest delays and are able to start work on a new mid-sized project within a month (three to four weeks), on average.

Perhaps long project backlogs are why GCs/remodelers are the most bullish when looking forward to the third quarter of 2017, along with speciality building/renovation professionals. While confident, landscape/outdoor professionals are the most cautious about the upcoming quarter.

For more information from the Q2 2017 Houzz Renovation Barometer, see the full report here.

Team Building at the 2017 Houzz Retreat

Tuesday, July 18, 2017

Every year, Houzzers at our HQ in Palo Alto take a break for an offsite retreat. With more than 200 people and growing in the Palo Alto office, it’s a tradition that helps to foster our sense of family.

The offsite this year was all about friendly and fun competition with Houzzers randomly assigned to 14 teams competing in games like beach volleyball (with your wrists tied to another person!), speed pictionary, costume races and lawn skiing.

Here are a few of our favorite photos from the day:

Team blue at the ready!

Team pink coordinating steps to “ski” to the finish line before their opponents.

This rowdy game of “Wizards, Giants and Elves” takes inspiration from “Rock, Paper, Scissors” and tag.

Houzzers take competition seriously, but know how to have fun, too!

“Speed dating” to get to know more about fellow Houzzers.

Bonding with old friends and new at a family-style lunch.

Houzzers also bonded over activities like jenga, ping pong, flower crowns and corn hole.

Showing off flower crowns.

Team selfie

The winning team (by 2 points!)

Migration to Redis Cluster

Friday, July 14, 2017

At Houzz, we use Redis as the de facto in-memory data store for our applications, including the web servers, mobile API servers and batch jobs. In order to support the growing demands of our applications, we migrated from an ad hoc collection of single Redis servers to Redis Cluster during the first half of the year.

To date, we have gained the following benefits from the migration:

Ability to scale up without the need to modify applications.
No additional proxies between clients and servers.
Lower capacity requirement and lower operational cost.
Built-in master/slave replication.
Greater resilience to single points of failure.
Functional parity with single Redis servers, including support for multi-key queries under certain circumstances.

Redis Cluster also has limitations. It does not support environments where IP addresses or TCP ports are remapped. Although it has built-in replication, as we ultimately discovered, few client libraries, if any, have support for it. For certain operations such as new connection creation and multi-key operations, Redis Cluster has longer latencies than single servers.

In this post, we will share our experiences with the migration, the lessons we learned, the hurdles we encountered, and the solutions we proposed.

Functional Sharding

Some of our applications use Redis as a permanent data store, while others use it as a cache. In a typical setting, there is a Redis master that processes write requests and propagates the changes to a number of slaves. The slaves serve only read requests. One of the slaves is configured to dump the memory to disk periodically. The dumps are backed up into the cloud. We use Redis Sentinel to do automatic failover from failed masters to slaves. Our applications access Redis through HAProxy.

Historically we scaled up the Redis servers by “functional” sharding. We started with a single shard. When we were about to run out of capacity, we added another shard and moved a subset of the keys from the existing shard to the new one. The new shard is typically dedicated to keys for a specific application or feature, e.g., ads or user data. Code that accessed the moved keys was modified to access the new servers after the move. For example, the marketplace application would access shards that store data about the products and merchants in the marketplace, while the consumer-oriented applications would access shards that store user data such as activities and followed topics. The same process was repeated for several years and the number of servers grew to several dozens. The process remained mostly manual due to the need to modify the applications.
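The routing that functional sharding pushes into application code can be sketched as follows. This is a minimal illustration, assuming keys carry a feature prefix such as "ads:"; the prefix convention, shard names and function name are all hypothetical and not taken from our codebase.

```python
# Hypothetical shard names; a real deployment would use host:port endpoints.
DEFAULT_SHARD = "redis-main"
FUNCTIONAL_SHARDS = {
    "ads": "redis-ads",    # keys moved out when the ads feature grew
    "user": "redis-user",  # keys moved out for user data
}


def shard_for_key(key: str) -> str:
    # Route by the feature prefix before the first ":"; anything that has
    # not been moved to a dedicated shard stays on the original one.
    prefix, _, _ = key.partition(":")
    return FUNCTIONAL_SHARDS.get(prefix, DEFAULT_SHARD)
```

Adding a shard means editing this table, moving the affected keys and redeploying the applications that read them, which is the manual step described above.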

The Redis servers ran on high-end hosts that had a large memory capacity. Such hosts would typically also have a large number of processors. Since each Redis server is a single process, only a small fraction of the processors are utilized on each host. In addition, there is an imbalance in memory and CPU usage across the shards due to the manual partitioning. Some shards have a large memory footprint and/or serve a high number of requests per second.

The large memory footprints are problematic for operations such as restart and master-slave synchronization. It can take more than 30 minutes for a large shard to restart or to do a full master-slave sync. Since all our client requests depend on Redis accesses, this poses a risk of a severe site-wide outage should all replicas of the large shard go down.

In the beginning of the year, we evaluated options to scale up the Redis servers with fewer manual processes and a shorter time to production.

Redis Cluster vs. Twemproxy

One option we considered was Redis Cluster. It was released by the Redis community on April 1, 2015. It automatically shards data across multiple servers based on hashes of keys. The server selection for each query is done in the client libraries. If the contacted server does not have the queried shard, the client will be redirected to the right server.

There are several advantages with Redis Cluster. It is well documented and well integrated with Redis core. It does not require an additional server between clients and Redis servers, hence has a lower capacity requirement and a lower operational cost. It does not have a single point of failure. It has the ability to continue read/write operations when a subset of the servers are down. It supports multi-key queries as long as all the keys belong to the same hash slot. Multiple keys can be forced into the same slot with “hash tags”, i.e., sub-key hashing. It has built-in master-slave replication.
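The key hashing and hash tags mentioned above can be sketched as follows. Redis Cluster maps each key to one of 16384 hash slots using CRC16 of the key, and when a key contains a non-empty {tag}, only the tag is hashed. The function names are ours, and the key names in the usage note are made up for illustration.

```python
import binascii

NUM_SLOTS = 16384  # Redis Cluster divides the key space into 16384 hash slots


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is hashed, honoring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" falls back to the whole key.
        if end != -1 and end > start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    # binascii.crc_hqx with an initial value of 0 is the CRC16 (XMODEM)
    # variant that the Redis Cluster specification uses.
    return binascii.crc_hqx(hash_tag(key.encode("utf-8")), 0) % NUM_SLOTS
```

With this scheme, keys such as "{user1000}.following" and "{user1000}.followers" hash only "user1000", so they land in the same slot and can be used together in one multi-key command.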

As mentioned above, Redis Cluster does not support NAT’ed environments or, more generally, environments where IP addresses or TCP ports are remapped. This limitation makes it incompatible with our existing settings, in which we use Redis Sentinel to do automatic failover, and the clients access Redis through HAProxy. HAProxy provides two functions in this case. It does health checks on the Redis servers so that the client will not access unresponsive or otherwise faulty servers. It also detects the failover that is triggered by Redis Sentinel, so that write requests will be routed to the latest masters. Although Redis Cluster has built-in replication, as we discovered later, few client libraries, if any, have support for it. The open source client libraries we use, e.g., Predis and Jedis, would ignore the slaves in the cluster and send all requests to the masters.

The other option we evaluated was Twemproxy. Twitter developed and launched Twemproxy before Redis Cluster was available. Like Redis Cluster, Twemproxy automatically shards data across multiple servers based on hashes of keys. The clients send queries to the proxy as if it is a single Redis server that owns all the data. The proxy then relays the query to the Redis server that has the shard, and relays the response back to the client.

Like Redis Cluster, there is no single point of failure in Twemproxy if multiple proxies are running for redundancy. Twemproxy also has an option to enable/disable server ejection, which can mask individual server failures when Redis is used as a cache vs. a data store.

One disadvantage of Twemproxy is that it adds an extra hop between clients and Redis servers, which may add up to 20% latency according to prior studies. It also has extra capacity requirement and operational cost for monitoring the proxies. It does not support multi-key queries. It may not be well integrated with Redis Sentinel.

Based on the above comparison, we decided to use Redis Cluster as the scale-up method going forward.

Building the cluster

Before we could migrate to Redis Cluster, we needed to bring Redis Cluster to functional parity with the functional shards. We implemented most of the improvements in our client libraries (e.g., the Predis library for PHP clients). We also built automation tools for cluster management in Salt, our infrastructure management toolkit.

As mentioned earlier, the main missing features in the client libraries for Redis Cluster are master-slave replication, health checks and master-slave failover.

We replaced the active health checks in HAProxy with passive mark down and retries in the PHP client library. When the client gets an error from a Redis server, e.g., connection timeout or unavailability due to loading, the client marks the server down in APC and retries another server. Since APC is shared by all PHP processes in the same web server, the marked down Redis server will not be accessed by another client until it expires from APC a few seconds later.
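The passive mark-down and retry logic can be sketched as follows. This is a simplified Python illustration rather than our PHP client code: MarkDownCache is a hypothetical in-process stand-in for the shared APC cache, and send is an assumed callable that performs one request against a named server.

```python
import time
from typing import Callable, Dict, Iterable, Optional


class MarkDownCache:
    """Stand-in for a per-host shared cache such as APC (hypothetical API)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._down_until: Dict[str, float] = {}

    def mark_down(self, server: str) -> None:
        self._down_until[server] = self._clock() + self._ttl

    def is_down(self, server: str) -> bool:
        expiry = self._down_until.get(server)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._down_until[server]  # expired: server is eligible again
            return False
        return True


def query_with_retries(servers: Iterable[str],
                       send: Callable[[str], str],
                       cache: MarkDownCache) -> str:
    last_error: Optional[Exception] = None
    for server in servers:
        if cache.is_down(server):
            continue  # another request already saw this server fail
        try:
            return send(server)
        except ConnectionError as exc:
            cache.mark_down(server)  # passive check: learn from the failure
            last_error = exc
    raise ConnectionError("no healthy server available") from last_error
```

The time-to-live is what makes the check passive: no background probe is needed, because a marked-down server is simply tried again by the next request after its entry expires.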

We also added support for cluster master-slave replication in the client library. It started out as a straightforward refactoring, but ended up with considerable complexity as it interacted with the passive health checks and retries in partial failure modes, especially in pipeline execution. I will discuss this in more detail later in the post.

Other improvements we made to the Predis library include:

Support for multi-server commands such as mset and mget
Reduced memory usage for cluster configuration such as slot-to-server maps
Bug fixes, including memory leak fixes

We also added pipeline support to the Java Redis library Jedis.
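To illustrate what supporting a multi-server command such as mget involves, here is a minimal Python sketch of splitting a list of keys by the server that owns each key's hash slot. The slot computation (CRC16 of the key modulo 16384, with hash-tag handling) follows the Redis Cluster specification; `group_keys_by_server` and the `slot_owner` callable are hypothetical names for this example and are not part of Predis.

```python
import binascii
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

NUM_SLOTS = 16384  # number of hash slots in Redis Cluster


def key_slot(key: bytes) -> int:
    """Return the cluster hash slot for a key, honoring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 (XMODEM) variant.
    return binascii.crc_hqx(key, 0) % NUM_SLOTS


def group_keys_by_server(keys: Iterable[bytes],
                         slot_owner: Callable[[int], str]) -> Dict[str, List[bytes]]:
    """Group keys by owning server so one sub-request can be sent per server."""
    groups: Dict[str, List[bytes]] = defaultdict(list)
    for key in keys:
        groups[slot_owner(key_slot(key))].append(key)
    return dict(groups)
```

A client would then issue one mget per group and merge the replies back into the original key order.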

Figures 1 and 2 show the Redis system architectures before and after the migration to Redis Cluster, respectively.

Figure 1. Functional shards

Figure 2. Redis Cluster

In addition to the client library improvements, we built tools to further automate the creation and maintenance of Redis Cluster.

There is an existing tool (redis-trib.rb) to create a cluster from a set of Redis servers. We built tools to place the masters and slaves in a more deterministic way than what redis-trib.rb does.

For example, we place the servers of the same shard across availability zones for better fault tolerance.

For data persistence, we enable one server per shard to dump its memory to disk and upload the dumps to cloud storage periodically. A memory dump is a resource-intensive operation, so we chose a slave rather than the master for dumping and distributed the masters and dumping slaves evenly across hosts for load balancing.

Another desirable feature of the layout is for all servers in the same shard to share the same port number, which eases manual operations during debugging.

We built a toolkit to implement the desired layout during cluster creation as well as subsequent additions of new cluster nodes. Figure 3 shows an example layout of our Redis Cluster.
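The layout goals above (servers of one shard spread across availability zones, one port per shard, masters and dumping slaves spread evenly across hosts) can be expressed as a simple rotation. The following Python sketch is only an illustration of that idea under stated assumptions: `plan_layout`, the `Placement` record, the role names, and the base port are hypothetical, and the sketch assumes one server per shard in each zone.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Placement:
    shard: int
    role: str  # "master", "dumping-slave", or "slave"
    zone: str
    host: str
    port: int


def plan_layout(hosts_by_zone: Dict[str, List[str]], num_shards: int,
                base_port: int = 7000) -> List[Placement]:
    """Place one server per shard in every zone, rotating roles across zones."""
    zones = sorted(hosts_by_zone)
    if len(zones) < 2:
        raise ValueError("need at least two zones for a master and a slave")
    roles = ["master", "dumping-slave"] + ["slave"] * (len(zones) - 2)
    placements: List[Placement] = []
    for shard in range(num_shards):
        for offset, role in enumerate(roles):
            # Rotating by shard number spreads each role evenly over the zones.
            zone = zones[(shard + offset) % len(zones)]
            hosts = hosts_by_zone[zone]
            host = hosts[(shard // len(zones)) % len(hosts)]
            # Every server of a shard listens on the same port.
            placements.append(Placement(shard, role, zone, host, base_port + shard))
    return placements
```

Since the servers of a shard always land in different zones, they are never on the same host, so sharing a port within a shard cannot cause a conflict.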

Since automatic failover can happen in the cluster from time to time, our toolkit periodically collects the cluster status and reconfigures the servers when necessary.

Figure 3. Example layout of Redis Cluster

Before the migration, we ran a set of performance tests to compare the latency of Redis Cluster vs. functional shard under various conditions.

Figure 4 shows the latencies with non-persistent connections on cluster vs. functional shard. While there is a significant difference for the multi-key commands, we do not expect such cases to be common in practice for Redis Cluster since all Redis accesses in the same client session will be able to share the same connection to the cluster.

Figure 4. Latencies with non-persistent connections

Figure 5 shows the latencies with persistent connections on cluster vs. functional shard. The latencies are measured when there is only one client accessing Redis. In practice, there will be multiple clients, and the actual latency per request will be increased by the processing time of the requests that are queued in front of it. The queues will be shorter in Redis Cluster since the workload is distributed across a larger number of processes.

Figure 5. Latencies with persistent connections

The live migration

According to the Redis Cluster documentation, no automatic live migration to Redis Cluster is currently possible and we have to stop the clients during the migration. Stopping the clients is not an option for us because it means shutting down the whole site.

We built our own live migration tool based on the append-only-file (AOF) feature in Redis. The AOF, when initially created, consists of a sequence of commands that can be replayed on a second server to reconstruct the data set in the first server that creates the file. Commands that the first server receives after the AOF is created will be appended to the file, hence the name “append only file”. Redis can be configured to rewrite the file when it gets too big. After the rewrite, the commands at the end of the old file may be re-ordered.

Our automatic live migration process involves the following steps:

1. Pick a slave in the functional shard that has dumping disabled.
2. Enable AOF in the picked slave but disable AOF rewrite, so that subsequent commands will not be re-ordered in the file.
3. Write a certain key, e.g., “redis:migration:timestamp”, to the functional shard to serve as a bookmark for later use.
4. Copy the AOF from the functional shard slave to all the hosts in the cluster.
5. Replay the AOF on each master in the cluster, using the “redis-cli --pipe” command.
6. Extract the new commands in the functional shard AOF that were added after the last bookmark, and store them in a delta file.
7. Repeat steps 3 - 5 with the new delta file instead of the full AOF.
8. When the number of new commands in the delta file drops below a certain threshold, we make a live configuration change to the clients so that they will start to access the cluster instead of the functional shard.
9. We continue to repeat steps 3 - 7 after the configuration change, until the number of new commands in the delta file drops to a lower threshold.
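The delta-extraction step above can be sketched in Python as follows. This is an illustration rather than our migration tool: it assumes the AOF is a plain sequence of commands in the Redis protocol format (`*<argc>`, then `$<length>` and the bytes of each argument, each terminated by CRLF) and that the bookmark is written with a SET command. The function names are hypothetical.

```python
from typing import Iterator, List

BOOKMARK_KEY = b"redis:migration:timestamp"  # example key name from the steps above


def parse_commands(aof: bytes) -> Iterator[List[bytes]]:
    """Yield each command in a protocol-encoded AOF as a list of arguments."""
    pos = 0
    while pos < len(aof):
        if aof[pos:pos + 1] != b"*":
            raise ValueError(f"expected '*' at offset {pos}")
        line_end = aof.index(b"\r\n", pos)
        argc = int(aof[pos + 1:line_end])
        pos = line_end + 2
        args: List[bytes] = []
        for _ in range(argc):
            if aof[pos:pos + 1] != b"$":
                raise ValueError(f"expected '$' at offset {pos}")
            line_end = aof.index(b"\r\n", pos)
            length = int(aof[pos + 1:line_end])
            start = line_end + 2
            args.append(aof[start:start + length])
            pos = start + length + 2  # skip the argument and its trailing CRLF
        yield args


def encode_command(args: List[bytes]) -> bytes:
    """Encode one command back into the protocol format."""
    parts = [b"*%d\r\n" % len(args)]
    parts.extend(b"$%d\r\n%s\r\n" % (len(arg), arg) for arg in args)
    return b"".join(parts)


def extract_delta(aof: bytes, last_bookmark: bytes) -> bytes:
    """Return the commands appended after the SET that wrote last_bookmark."""
    commands = list(parse_commands(aof))
    for index, cmd in enumerate(commands):
        if (len(cmd) >= 3 and cmd[0].upper() == b"SET"
                and cmd[1] == BOOKMARK_KEY and cmd[2] == last_bookmark):
            return b"".join(encode_command(c) for c in commands[index + 1:])
    raise ValueError("bookmark not found in AOF")
```

The returned bytes are in the same format as the input, so a delta file produced this way could be replayed with the same pipe-mode command as the full AOF. A real AOF may also contain commands such as SELECT that a production tool would need to handle.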

The live migration of each functional shard took from a few minutes to a few hours, depending on the size of the shard. The process went smoothly, apart from a few errors from the functional shard slave due to overloading from the AOF writes.

Post migration outage

After we migrated about ¾ of the functional shards to the cluster, something unexpected happened. We created the cluster with the same capacity as the functional shards. However, the size of our data grew faster than we expected.

The Redis cluster was overcommitted in memory and started to swap in early May. An issue with a backup script resulted in high volumes of disk reads and writes, and triggered the first failover when a Redis master on the same host tried to access the swap space. The failover then triggered a sequence of cascading events.

Slaves were unresponsive during failover, triggering cross-shard slave migrations, i.e., healthy slaves changed their masters. The failovers and cross-shard migrations resulted in an imbalanced distribution of masters and dumping slaves across hosts. The hosts with more dumping slaves were overloaded when the slaves started to dump, and more failover/cross-shard migrations followed. Redis clients repeatedly retried when servers were unresponsive, and eventually held up all web server processes and caused site outages.

Recovery and resolution

It took three weeks to fully recover from the outage. During the process, we performed many operations such as resizing and resharding the cluster for the first time in production. We learned lots of lessons and made quite a few improvements to the error handling code in our client library.

The first step to recovery was to rebalance the cluster, i.e., to bring it back to the balanced layout as illustrated in Figure 3. Next, we added 50% more hosts to the cluster.

We ran into several issues while resharding the cluster, i.e., migrating data from existing servers to new servers. For a short period of time during the migration of a key, clients kept receiving “MOVED” responses from both the source and the destination servers of the migration, and therefore kept retrying between the two servers until they eventually overflowed the stack.

To contain the issue to the affected processes only, we applied a limit on the number of retries in this situation so that the affected processes would not generate too much load in the clients or servers. We also made the passive health checks in the client library more robust. The entire resharding process took 12 hours, during which a small percentage of requests failed while the site was functioning overall.
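A bounded redirect loop of the kind described above might look like the following Python sketch. It is an assumption-laden illustration, not our client code: `execute_with_redirects`, the `send` callable, and the default limit of 5 are hypothetical. The only detail taken from Redis itself is the shape of the redirect reply, "MOVED <slot> <host>:<port>".

```python
from typing import Callable, List


class TooManyRedirects(Exception):
    """Raised when a command keeps being redirected between servers."""


def execute_with_redirects(send: Callable[[str, List[str]], object],
                           node: str,
                           command: List[str],
                           max_redirects: int = 5) -> object:
    """Send a command, following MOVED replies at most max_redirects times."""
    # A loop rather than recursion keeps stack depth constant.
    for _ in range(max_redirects + 1):
        reply = send(node, command)
        if isinstance(reply, str) and reply.startswith("MOVED "):
            # Reply format: "MOVED <slot> <host>:<port>"
            _, _slot, node = reply.split(" ", 2)
            continue
        return reply
    raise TooManyRedirects(
        f"gave up on {command!r} after {max_redirects} redirects")
```

With the limit in place, a process caught between two servers that each redirect to the other fails that one request quickly instead of retrying without bound.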

After the resharding, the old Redis servers reported a drop in their data size, but their memory usage did not drop, nor did their swap usage. Since they still had data stored in the swap space, they could have spikes in latencies and client timeouts. We learned that the unchanged memory usage was a result of fragmentation in jemalloc, the memory allocator used in Redis, and that the only way to defragment the memory was to restart the servers.

The last step in our recovery process was a rolling restart of all the old servers. For each shard, we first restarted a slave, then forced a failover to the restarted slave, causing the old master and the other slaves to resynchronize data from the new master. Resynchronization has the same effect as a restart on memory usage. After the resynchronization was completed, all servers in the shard had their memory defragmented and their swap space freed.

The rolling restart was a rigorous stress test on the clients and servers, and prompted us to make more improvements in the error handling of the client library. One improvement is to have different timeout values for masters vs. slaves, and to never mark down a master in the passive health checks. Masters are more heavily loaded than slaves during failover and are a single point of failure for write operations. Therefore, we would rather try a bit more when a master is slow, than give up too soon.
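The role-specific handling described above can be captured as a small policy table. The Python sketch below is hypothetical: the `RolePolicy` type, the function name, and the timeout values are placeholders chosen for illustration, not the values used in our client library.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RolePolicy:
    timeout_seconds: float    # how long to wait before treating a request as failed
    mark_down_on_error: bool  # whether passive health checks may mark the server down


# Placeholder values: masters get a longer timeout and are never marked down.
POLICIES: Dict[str, RolePolicy] = {
    "master": RolePolicy(timeout_seconds=1.0, mark_down_on_error=False),
    "slave": RolePolicy(timeout_seconds=0.25, mark_down_on_error=True),
}


def policy_for(role: str) -> RolePolicy:
    """Look up the error-handling policy for a server role."""
    return POLICIES[role]
```

Keeping the policy in one table makes the asymmetry explicit: a slow slave can be skipped in favor of another replica, while a slow master has no substitute for writes.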

Open Questions

Clustering is a relatively new technology in Redis. Through the experience of building, migrating, and resharding Redis Cluster, we learned about its limitations as well as its potential.

While the ability to automate scaling in Redis Cluster opens up many new opportunities, it also brings up new questions. Will we be able to scale infinitely? What’s the best way of supporting a mixture of permanent store use case and cache use case? How can we minimize the impact of resharding and rolling restart on production traffic? We look forward to experimenting and learning more about Redis Cluster.

If you’re interested in joining us, we’re hiring! Check out opportunities on our team at houzz.com/jobs.

Houzzer Profile: Ella Zhang, Quantitative Analyst

Thursday, July 6, 2017

As a member of the data analytics team, Ella uses data to inform and direct business decisions. When she’s away from work, Ella enjoys exploring local parks with her family.

Why did you choose to become a data scientist?
I actually studied environmental engineering, but when I graduated, data science was just beginning to gain momentum as a career path. Prior to that point, there weren’t many statistician positions available. Through my schooling, I’d had the opportunity to play with data and use different models to answer business-related questions and I realized that it was the perfect blend of my education and interests.

What benefits do data scientists bring to businesses?
All business decisions should be based on a clear understanding of how customers are using their products and services. Data provides the confidence to chart new paths and create greater efficiencies of resources, while evaluating the impact of those decisions. For a company like Houzz, which provides a bounty of products and services to our community, it’s important to understand how all decisions impact the overall dynamic of the experience from the minor ripples to the major waves.

What brought you to Houzz?
I was a big fan of Houzz long before I began working here and used to spend 20-30 minutes per night looking at photos and getting inspiration for my own home. In fact, one of my bathrooms looks very similar to a photo I found on Houzz!

Professionally, I craved the startup experience, which would allow me to broaden my focus and solve new challenges every day.

What do you most enjoy about your role?
It’s exciting! There’s an opportunity to blaze a trail, tap into my own creativity and provide useful information based on non-defined parameters. The many aspects of the Houzz platform mean that I am continuously learning and expanding my skills.

What project are you most proud of at Houzz?
I helped to develop the “Lifetime Value Model” for our marketing team, which is utilized for all campaigns to evaluate performance. I also analyze SEO-optimization and content partnerships to understand and educate the team on how and why people visit Houzz.com. It’s fascinating to learn what piques visitors’ interest about Houzz.

What’s something that has surprised you about working at Houzz?
I work with cross-functional teams across marketing, SEO and data operations, and have found that across the board, people are very data-driven at Houzz. It’s so refreshing. Everyone uses analysis to drive marketing efforts and business growth decisions to achieve greater, more collaborative results.