Best practices for minimizing database load with Site Streams?

GeneLamarEllis
@GeneLamarEllis Gene Ellis

Hello,

We have successfully implemented Twitter Site Streams; however, our database load is extremely high. We have optimized queries, along with a ton of other MySQL configuration settings, but nothing seems to have dented the database load. The next option is to upgrade our CPUs, but before we do that, we would like to make sure there aren't other options available.

Have any of you faced a similar issue? If so, what did you do and did it work?

Thanks!

1 year 41 weeks ago

Replies

tapbot_paul
@tapbot_paul Paul Haddad

I don't think you've provided enough information about what you're doing with the data, or where it's slow, for anyone to give many concrete answers.

That being said, have you looked at alternatives to MySQL? Again, depending on exactly what you are doing, something like Redis (http://redis.io/) might be a much better fit for a Site Streams consumer.

1 year 39 weeks ago
GeneLamarEllis
@GeneLamarEllis Gene Ellis

Thanks for the response, Paul. Basically, our website is a group-tweeting website. A (non-personal) Twitter handle is set up, and when an authorized Twitter account sends a tweet (or DM) to it, it simply resends the tweet out to its own followers. Since we are on Site Streams, we put a bunch of accounts on each connection. When a tweet (to be resent back out) arrives, we store it in the database, then fire off another PHP process (passing it an ID) to process it and send it back out. We are using prepared statements for the INSERTs. However, our server load is huge, and MySQL commonly complains about not having enough connections, even though I raised the limit to 2000 (far more than a site with our traffic level should ever need). I just can't figure out why the load is so high and why the connections are being eaten up.
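One pattern that produces exactly this symptom is opening a fresh connection and issuing a single-row INSERT for every incoming event. A minimal sketch of the alternative, buffering events and flushing them in batches over one persistent connection (Python with sqlite3 as a stand-in for MySQL; the table and column names are made up for illustration):

```python
import sqlite3

class BatchedWriter:
    """Buffers incoming tweets and flushes them in one multi-row INSERT."""

    def __init__(self, conn, batch_size=100):
        self.conn = conn            # one persistent connection, reused
        self.batch_size = batch_size
        self.buffer = []

    def add(self, tweet_id, text):
        self.buffer.append((tweet_id, text))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # one transaction for the whole batch instead of one per tweet
        self.conn.executemany(
            "INSERT INTO tweets (tweet_id, text) VALUES (?, ?)", self.buffer
        )
        self.conn.commit()
        self.buffer.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER, text TEXT)")
writer = BatchedWriter(conn, batch_size=3)
for i in range(7):
    writer.add(i, "tweet %d" % i)
writer.flush()  # drain whatever is left in the buffer
print(conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0])  # prints 7
```

The same idea applies in PHP: keep one long-lived connection per stream consumer rather than spawning a new process (and a new MySQL connection) per tweet, which is the usual cause of connection exhaustion at a modest traffic level.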

1 year 39 weeks ago
episod
@episod Taylor Singletary

Are you queuing up your writes at all before sending them to MySQL?

A rough, fairly common pattern in this scenario would be to:

  • Receive a tweet or event from the Streaming API
  • Add that tweet/event to a queue for timely, measured storage in the database, while also writing a cached copy of the data (likely to some form of memcache)
  • Have your service pull from the cache instead of directly from the database. If the data is unavailable in the cache, it can re-populate the cache from the database (after the queuing service has actually written to the database).
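The steps above can be sketched as a toy in-process version (here `queue.Queue` stands in for a real queuing service, and plain dicts stand in for memcache and MySQL; all names are illustrative):

```python
import queue

write_queue = queue.Queue()   # stands in for a real queuing service
cache = {}                    # stands in for memcache
database = {}                 # stands in for MySQL

def receive_event(event):
    """Steps 1-2: receive a tweet/event, enqueue it for storage, cache it."""
    write_queue.put(event)
    cache[event["id"]] = event

def drain_queue():
    """The queue worker: writes to the database at its own measured pace."""
    while not write_queue.empty():
        event = write_queue.get()
        database[event["id"]] = event

def read_event(event_id):
    """Step 3: serve from cache; on a miss, re-populate from the database."""
    if event_id in cache:
        return cache[event_id]
    event = database.get(event_id)  # only present once the worker has written it
    if event is not None:
        cache[event_id] = event
    return event

receive_event({"id": 1, "text": "hello"})
drain_queue()
cache.clear()                 # simulate a cache eviction
print(read_event(1)["text"])  # prints "hello", re-populated from the database
```

The point of the queue is that the stream consumer never blocks on the database; the worker absorbs write bursts, and reads are served from cache.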
1 year 39 weeks ago
GeneLamarEllis
@GeneLamarEllis Gene Ellis

No, we are writing the data to the database as it comes in: a tweet arrives, and we store it immediately, so the rows go into the database one by one. The owner of the site wants these tweets to come in and be processed for re-sending as quickly as possible.

What do you use for a queuing service?

1 year 39 weeks ago
episod
@episod Taylor Singletary

Here at Twitter, we've often used the open-source Kestrel: https://github.com/robey/kestrel

1 year 39 weeks ago
GeneLamarEllis
@GeneLamarEllis Gene Ellis

Gotcha. Thanks a lot for the info Taylor.

1 year 39 weeks ago
IDisposable
@IDisposable Marc Brooks

We're easily handling 140,000+ tweet insertions daily on a single EC2 instance, using .NET via C# against a SQL Server machine. We DO insert each tweet singly for timeliness reasons, but that transaction rate is nowhere near a big enough load for the SQL Server box. If you want a simplified schema of what we're doing, I can write it up... but FYI, it IS easily possible to handle decent tweet rates. You can see the output of what we're doing at http://stltweets.com and http://stlindex.com (both running off the same database server, each with about 0.6 TB of data these days).

1 year 39 weeks ago
maxf3r
@maxf3r massimo ferrari

Here is my personal experience with real-time processing of the Twitter stream:

http://twittaculous.com

I prefer to handle everything with a few Redis instances (and a lot of batch code). I use PHP-CLI, but I'm sure almost any scripting language would do the same.

Below are one-day figures for what we read from the Twitter stream (the 1% firehose sample plus one stream tracking 400 terms):

all_tweets_reads : 5,376,835
real_tweets_reads : 5,138,543

tweet_parsed_with_entity : 5,138,543
tweet_parsed_with_place : 3,163,113
tweet_parsed_with_coord : 1,711,822
tweet_parsed_with_coord_and_place : 1,554,161

For our service logic we do post-processing (in near real time).

Here are one-day internal figures for producing what you see online:

parsed_tweet : 25,692,531
new_instant_word : 32,574,175
(these figures are greater than real_tweets_reads because I divide the post-processing workload across different batch jobs)

All of our processing uses a few asynchronous queues (Redis lists) to absorb peaks in the Twitter stream.

In any case, the queues are always drained in a reasonable time. For more than 90% of the day the queue length tends to zero, so I can assume we have near-real-time information to publish online.
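The Redis-list queue pattern described here can be sketched as follows (a `collections.deque` stands in for a Redis list; in production the two marked calls would be LPUSH/RPOP against a Redis server, and the consumer would be a separate batch process):

```python
from collections import deque

stream_queue = deque()  # stands in for a Redis list

def on_stream_event(tweet):
    """Producer: the stream reader only enqueues, so it keeps up with peaks."""
    stream_queue.appendleft(tweet)  # like LPUSH in Redis

def process_batch(max_items=1000):
    """Consumer: a batch worker drains the queue at its own pace."""
    processed = []
    while stream_queue and len(processed) < max_items:
        processed.append(stream_queue.pop())  # like RPOP in Redis
    return processed

for i in range(5):
    on_stream_event({"id": i})
batch = process_batch()
print(len(batch), len(stream_queue))  # prints: 5 0
```

Decoupling the stream reader from the workers this way is what keeps a firehose peak from ever backing up into the stream connection itself.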

massimo

1 year 39 weeks ago
GeneLamarEllis
@GeneLamarEllis Gene Ellis

Thank you so much, everyone, for all the information. This is a big help. I am going to start looking into Memcached and Redis as a way to reduce our load even further. Thanks so much.

1 year 39 weeks ago
Se7enNgo
@Se7enNgo Seven Ngo

Hi Gene. Can you help me get data from Site Streams using C# (e.g., a console application)? I've tried, but it only works with User Streams; Site Streams still doesn't work for me.

1 year 30 weeks ago
episod
@episod Taylor Singletary

Seven,

Have you applied for and been approved for the limited Site Streams beta?

1 year 30 weeks ago
yolandasilvarom
@yolandasilvarom yolanda silva romero

Thank you, everyone, for guiding and assisting with all the support outlets.

1 year 28 weeks ago
Xisbi_jamhuuri
@Xisbi_jamhuuri Xisbi_jamhuuri

Xisbiga Midnimada Jamhuuriga Soomaaliyeed (the Somali Republic Unity Party) is the people's party of Somalia.

1 year 17 weeks ago