real time issues when working with Amazon Redshift?
I’ve seen that competitors of Redshift (like Oracle) have picked up on the survey chart below, to use it in their marketing materials.
They use it as a “kill sheet” when competing in deals against Amazon Redshift. Even Larry Ellison even used it in one of his keynotes. I guess I should be flattered. So much for unintended consequences.
But don’t believe them. Oracle is way more expensive and way less powerful than Redshift. We’ve helped many Oracle users migrate to Redshift, and it’s a life-changing event for them.
The main cause for any “problems” is a misconfiguration of your Amazon Redshift cluster. You need to set it up for your workloads. Just like with any other database.
In November 2017, we hosted an Amazon Redshift training in San Francisco. 16 attendees, top engineers from very well-known tech companies. We asked participants to fill out a pre-training survey. One of the questions is:
“Which of these issues with Redshift have you experienced at least once?”
You can see the results in the screenshot below, based on 11 answers. We’ve hosted this training in other cities, and the results are identical.
There are solutions to these problems – and once you’ve tackled them, Redshift is an incredibly powerful tool for your data infrastructure.
Figure 1: Survey results
You can see from the chart that the top three problems when working with Redshift are:
- slow queries
- slow dashboards
- Redshift is a “black box”
A few explanations of these problems, and then suggestions for the reasons and how to solve them.
Slow queries / slow dashboards
Sometimes we’re talking orders of magnitude of “slow”, e.g. what should take seconds takes minutes.
“Slow queries” and “slow dashboards” are somewhat the same thing, let’s say “two sides of the same coin”. Just different audiences. Slow queries are more on the data engineer / data scientist side. They are running transformations / aggregations inside of Redshift, i.e. they process raw data that sits in a source table. And the results don’t necessarily feed a dashboard. Rather, they’re part of some model, algorithm or scoring model, and are stored in a destination table. And that destination table might power some sort of data product.
Analysts are typically the ones who are exposed to slow dashboards. They are looking at their charts in a tool like Chartio, Looker, Mode Analytics or Periscope Data, and all they get are spinning wheels.
The #1 reason for slow queries that we see is improper use of something called “workload management”. Redshift operates in a queuing model. You can assign queries to a specific queue, and then allocate a certain concurrency and memory percentage to that queue.
Problem is that most people use the default configuration for the WLM, which is five concurrent queries (aka “slots”) and 100% of memory, giving each query slot 20% of total memory.
If you’re using an average, you’ll be on average wrong.
There’s a 99% chance that the default configuration will not work for you! Redshift is “greedy”, and a query will try to consume all available resources. So it’s important to protect your important queries (e.g. loads, transforms), from poorly written queries that consume a lot of resources.
Figure 2: Redshift Workload Management Configuration
The key to do is the proper WLM configuration. You can use WLM to define the separation of business concerns (e.g. loads vs. ad-hoc queries) and then optimize their execution by setting the right concurrency and providing the right amount memory per slot.
If you want to understand how to set up your WLM,
Redshift is a “black box”
Redshift is a managed service. So when there’s a problem with Redshift, it’s hard to get visibility into what’s going on. Often, this starts with “Redshift is slow”, and then the next question is “why is Redshift slow?”.
The Redshift consoles provides you with “traditional” database metrics like CPU utilization and network throughput.
Figure 3: Redshift console metrics
These metrics are helpful to understand your resource utilization. And where you might have bottlenecks.
What they don’t tell you is if your data is fresh / recent, or if your transformations are executing the way they should. Which are things that matter if you’re running critical reports. These metrics don’t tell you when a query fails, or a table is locked. Or if you have defined the right distribution and sort keys for your tables, for fast execution of complex joins.
In general, you can avoid “black box” issues if you do proper maintenance of your cluster, e.g. regular vacuuming, table hygiene, user set-up, workload management – then you know where / what to look for when things don’t work.
Or, you use a service like DIH that provides with you detailed query insights for your data pipelines.
And if you’re read all the way to here, I also recommend to check out our top 14 tuning techniques for Amazon Redshift. If you follow the best practices in that post, you will find that Amazon Redshift is one of the most powerful products you can have in your data stack.
I’m a founder at www.divinityindiahosting.com, we provide real user monitoring for your AWS data lake, and your Amazon Redshift clusters. Amazon Redshift is incredibly powerful, and if you configure it the right way, no other warehouse is faster.