Hello,
We have successfully implemented Twitter Site Streams, however our database load is so high. We have optimized queries, as well as a ton of other MySQL configurations but it still doesn't seem to have dented the database load. The next option is to upgrade our CPUs, but before we do that, we would like to make sure there are not some other options available.
Have any of you faced a similar issue? If so, what did you do and did it work?
Thanks!
Replies
I don't think you've put enough information in on what you are doing with the data, where it's slow and such to be able to get many concrete answers.
That being said have you looked at alternatives to MySQL? Again depending on exactly what you are doing something like Redis (http://redis.io/) might be a much better fit for a site streams consumer.
Thanks for the response Paul. Basically our website is a group tweeting website. An (non-personal) twitter handle is set up. And when an authorized twitter account sends a tweet (or DM) to it, it simply resends the tweet out to its own followers. Since we are on Site Streams, we put a bunch of accounts on each connection. When a tweet (to be resent back out) arrives, we store it in the database, then fire off another PHP process (passing it an ID) to process it and then send it back out. We are using prepared statements for the INSERTS. However, our server load is HUGE and MySql is commonly complaining about not having enough connections, even though I upped the value to 2000 (which is way more than a site with our traffic level would even need). Just can't figure out why the load is so high and they connections are being eaten up.
Are you queuing up your writes at all before sending them to MySQL?
A rough somewhat common pattern in this scenario could be to:
No, we are writing the data to the database as it comes in. Such as a Tweet comes in, then we store it. So they are going into the database one by one. The owner of the site wants these tweets to come in and get processed to go back out as quickly as possible.
What do you use for a queuing service?
Here at Twitter, we've often used the open-source Kestrel: https://github.com/robey/kestrel
Gotcha. Thanks a lot for the info Taylor.
We're easily handling 140000+ tweet insertions daily on our single EC2 instance. We're using .Net via C# to a SQL Server machine. We DO insert each tweet singly for timeliness reasons, but that transaction rate is nowhere near a big enough load on the SQL Server box. If you want a simplified schema of what we're doing, I can craft it up... but FYI it IS easily possible to handle decent tweet rates. You can see the outputs of what we're doing on http://stltweets.com and http://stlindex.com (both running off the same database server and each about 0.6TB of data these days)
Here my personal experience on a real time processing of Twitter Stream.
http://twittaculous.com
I prefer to handle all with a few Redis instance (and a lot of batch code). I use PHP-CLI .. but I'm sure quite every scripting language will do the same
Here below a one-day figures of what we read from twitter stream (1% firehose stream + 1 stream for 400 tracks)
all_tweets_reads : 5.376.835
real_tweets_reads : 5.138.543
tweet_parsed_with_entity : 5.138.543
tweet_parsed_with_place : 3.163.113
tweet_parsed_with_coord : 1.711.822
tweet_parsed_with_coord_and_place : 1.554.161
For our service logic we do post elaboration (quite in near real time)
Here one day internal figures to gain what you can see online.
parsed_tweet : 25.692.531
new_instant_word : 32.574.175
(this figure are greater than real_tweets_reads cause for post processing I use to divide workload over different batch)
All of our processing use a few of asyncronous queue (redis list) to lower the impact on Twitter stream peak.
By the way, queue are always processed in a reasonable time. In more than 90% of the day, queue tend to zero len, so I can assume to have a near-real-time information to publish online
massimo
Thank you so much everyone for all the information. This is a big help. I am going to start looking into Memcached and Reddit as a way to reduce our load even further. Thanks so much.
Hi Gene. Can you help me to implement get data from Site Streams by C# (ex: Console Application). I tried but it've worked with User Stream, Site Stream still have not worked.
Seven,
Have you applied for and been approved for the limited Site Streams beta?
thank you everyone for guiding,assist ing with all the support outlets.
Xisbiga midnimadda jamhuuriga soomaaliyeed waa xisbiga shacabka somalia.