How to use Twitter Streaming API (Site Streams?) with PHP?

@BenPaster Ben Paster

Sorry if my lack of experience is showing, I'm pretty good with PHP and have a solid understanding of PHP object orientation, but this streaming API is making me tear my hair out! I really appreciate any help!

Here's what I want to do: I'm creating a social media management app of sorts, and I want to stream all @mentions and search terms my users have selected into a MySQL database. I think I most likely need Site Streams to do this, as the app will have many users accessing many different Twitter accounts.

1) In the writing I have been able to find, it seems PHP is not the ideal language for this. I have used the Phirehose lib and have had some success, but it still leaves me with many questions...

2) How can I use the account's oauth token and secret to connect to their account and retrieve @mentions and whatever search terms they've defined?

3) How can I run this process all the time? I've tested Phirehose from the browser and had some success with it printing tweets from the people I'm following (via basic auth), but how do I keep it running permanently? I'm comfortable with the command line, but not ultra-experienced. I've seen things about job queuing and stuff like that, but have no idea what to do there.

Thanks so much for any help in advance!

EDIT: I wanted to add that I'm not against using something other than PHP, but it's the server-side language I know, so I would prefer to use it.

2 years 3 weeks ago

Replies

@Info_Kredit Kredit Info Online

use a library (php class) with oauth support
https://dev.twitter.com/docs/twitter-libraries#php
:)

2 years 3 weeks ago

@kurrik Arne Roomann-Kurrik

1.) There's nothing wrong with using PHP for this, but what most people think of when they hear PHP is a set of pages running in response to web requests, probably on IIS or an Apache server. In the case of the streaming API, you need to keep a long-running request open to Twitter, so launching that connection in response to an HTTP request is not really a good idea. Most likely, you would write a script which would be run from the command line (see http://www.php.net/manual/en/features.commandline.usage.php) and would just run for hours or days at a time. In fact, the Phirehose documentation states that it is explicitly not intended for use on a webserver.

2.) That's a pretty complex question. See the documentation for OAuth: https://dev.twitter.com/docs/auth/oauth and consider using a library to sign your requests, like the excellent tmhOAuth: https://github.com/themattharris/tmhOAuth

3) See #1
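For points 1 and 3, a minimal sketch of such a long-running command-line script might look like the following. It assumes a hypothetical $connect callable that blocks while the stream is open and returns or throws when the connection drops. The function names and the backoff values are illustrative only; they are not taken from Phirehose or from Twitter's documentation.

```php
<?php
// Hypothetical helper: seconds to wait before reconnecting after
// $failures consecutive failed connections (1, 2, 4, 8, ... capped at $max).
function backoffDelay(int $failures, int $base = 1, int $max = 240): int
{
    // Clamp the exponent so the shift cannot overflow during a long outage.
    return min($max, $base << min($failures, 20));
}

// Calls $connect repeatedly, sleeping between attempts.
// $connect: assumed to block while the stream is open, then return or throw.
// $sleep:   injected so the loop can be exercised without really sleeping.
// $maxRuns: pass PHP_INT_MAX for a process that should effectively never stop.
function runWithReconnect(callable $connect, callable $sleep, int $maxRuns): void
{
    $failures = 0;
    for ($run = 0; $run < $maxRuns; $run++) {
        try {
            $connect();
            $failures = 0; // clean disconnect: reset the backoff
        } catch (Exception $e) {
            $failures++;
            error_log('Stream connection failed: ' . $e->getMessage());
        }
        $sleep(backoffDelay($failures));
    }
}

// Example wiring (commented out so that including this file has no side effects):
// runWithReconnect($openStreamAndConsume, 'sleep', PHP_INT_MAX);
```

A script like this would be started with the php command-line binary rather than through the web server, and a process supervisor would restart it if the PHP process itself dies.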

2 years 3 weeks ago

@BenPaster Ben Paster

Thanks for your responses! I'm familiar with developing on the Twitter API, so the OAuth flow and making requests with the API aren't difficult for me to understand. I'm really struggling with the concept of how to leave the process running all the time, efficiently manage the tweets, and get them into a database. The streaming API format looks really similar to the format of the REST API, so my biggest problem is understanding how to make it run, keep it running after failures, manage all the tweets without crashing my server, etc. Also, I don't understand how to get streams for all of my app's users and insert their individual tweet streams into the db.

Thanks for the help so far and for bearing with me. I really think a simplified tutorial on this would help a lot of people, I can't be the only one :)

2 years 3 weeks ago

@kurrik Arne Roomann-Kurrik

Keeping a process running can be difficult, but luckily there are at least a few tools that help manage this kind of thing. If you're on some sort of *nix system, I'd suggest checking out Monit or upstart:
http://mmonit.com/monit/download/
http://upstart.ubuntu.com/

Managing a large stream of data without running out of memory or running into threading issues or otherwise crashing your system is one of those difficult problems of programming - it's a bit out of scope for this forum, unfortunately. My advice would be to write until you hit a performance or stability problem and then figure out how to solve it.

As for multiplexing a bunch of streams into your database, this is a use case that is not really well suited for user streams. You can get pretty far by using the statuses/filter streaming API and track/follow parameters, though: https://dev.twitter.com/docs/streaming-api/methods - once you hit your parameter count limits, then look into the site streams API.
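As a rough sketch of that idea, the terms and account IDs of all your users can be merged into a single pair of comma-separated track and follow values for one statuses/filter connection, and each incoming tweet can then be matched back to the users who asked for it. The $users array shape and both function names below are hypothetical, and the substring check is a simplification that does not reproduce Twitter's own matching rules.

```php
<?php
// Merge every user's terms and followed account IDs into one parameter set,
// removing duplicates so shared terms count only once against the limits.
// Assumed shape: [appUserId => ['track' => string[], 'follow' => (int|string)[]]]
function buildFilterParams(array $users): array
{
    $track = [];
    $follow = [];
    foreach ($users as $user) {
        foreach ($user['track'] as $term) {
            $track[strtolower(trim($term))] = true;   // keys give us de-duplication
        }
        foreach ($user['follow'] as $id) {
            $follow[(string) $id] = true;
        }
    }
    return [
        'track'  => implode(',', array_keys($track)),
        'follow' => implode(',', array_keys($follow)),
    ];
}

// Return the IDs of the app users whose terms appear in the tweet text.
// Simplified: case-insensitive substring match only.
function usersMatching(array $users, string $tweetText): array
{
    $matched = [];
    foreach ($users as $userId => $user) {
        foreach ($user['track'] as $term) {
            if (stripos($tweetText, $term) !== false) {
                $matched[] = $userId;
                break; // one matching term is enough for this user
            }
        }
    }
    return $matched;
}
```

With this layout, one connection serves every user, and the per-user fan-out happens on your side when the tweet is written to the database.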

2 years 3 weeks ago

@BenPaster Ben Paster

I'm on a CentOS system. What is your opinion of beanstalkd? I saw a number of resources suggesting I use that.

2 years 3 weeks ago

@kurrik Arne Roomann-Kurrik

Haven't ever used it, but in general work queues are pretty well suited for this kind of thing.

2 years 3 weeks ago

@BenPaster Ben Paster

Okay, I'm going to work towards getting this going, but I really think a step by step "for dummies" tutorial would really help a lot of people, or a streaming API SDK with a tutorial about how to set it up. If twitter isn't willing to do this, maybe someone else will? I will if/when I figure it out.

2 years 3 weeks ago

@kurrik Arne Roomann-Kurrik

We have streaming clients on such a variety of platforms that I think maintaining a set of SDKs would not be an effective use of effort. I do agree about improving the starter documentation, though - we've been having a lot of discussions about that internally and will be working on projects in this area. So hopefully things will get better in the future!

2 years 3 weeks ago

@BenPaster Ben Paster

Yeah, I guess that might be what I'm really looking for. The streaming API docs are pretty bad. I guess they're trying to say it's similar to the format of the REST API, but there are big differences, and those are what confuse me.

Do you know of a good explanation of data streams and processing them in general?

2 years 3 weeks ago

@kurrik Arne Roomann-Kurrik

I don't agree that the docs are bad; they're pretty good quality but assume the reader is familiar with the subject. This is a gap we want to fill, like I was saying.

Message queue systems seem to have only gotten popular with enthusiasts in the past few years - a lot of the reading materials I've seen are pretty dense. However, as libraries become more available and accessible, it seems like finding good introductory blogs is possible. Here's a post linked to from the beanstalk FAQ: http://nubyonrails.com/articles/about-this-blog-beanstalk-messaging-queue

2 years 3 weeks ago

@BenPaster Ben Paster

Well, to compare it to the REST docs, I didn't find those difficult to pick up. I originally learned the OAuth flow from the Twitter API docs, and the exact parameters needed, along with examples of what's returned, are included. I just find them much easier to understand. But I digress; I imagine I'll be able to pick up the streaming API with a little more thought.

2 years 3 weeks ago

@maxf3r massimo ferrari

I've implemented my own data collector for the Twitter Stream (statuses/sample and statuses/filter) in PHP.

After Googling around, I didn't find the perfect library, so I started writing my own code.

I wrote some PHP-CLI scripts that run forever, plus a daemonized parent process that keeps all the child scripts alive. I control the parent with a "silly watchdog" that tries to restart it every minute via cron. So, at worst, I lose one minute of data!

The connection script manages the raw socket connection to the Twitter Stream directly (I haven't implemented OAuth yet because Basic Auth is still available) and does nothing more than store tweets in a Redis queue: http://redis.io

I adopted a fast key-value engine rather than an RDBMS for performance reasons, and to honor Twitter's request to decouple collection from processing: https://dev.twitter.com/docs/streaming-api/concepts#collecting-processing

My tweet decoder scripts (Twitter JSON -> my internal structure) retrieve data from the Redis queue and do the "dirty work" of extracting the information my service needs, and finally write that information to MySQL (though I'm looking at a migration to MongoDB).
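The collector/decoder split described above can be sketched roughly as follows. This is not my production code: an in-memory SplQueue stands in for the Redis queue so the example runs on its own, and the class and function names are made up for illustration. It assumes streamed messages are JSON documents separated by \r\n, with blank lines sent as keep-alives.

```php
<?php
// Collector side: turn arbitrary socket chunks into complete raw lines.
// It does no decoding at all, so it stays fast and simple.
final class StreamLineBuffer
{
    private $buffer = '';

    /** @return string[] complete, non-empty lines found so far */
    public function feed(string $chunk): array
    {
        $this->buffer .= $chunk;
        $lines = explode("\r\n", $this->buffer);
        // The last piece is either an unfinished message or an empty string.
        $this->buffer = array_pop($lines);
        // Drop blank keep-alive lines.
        return array_values(array_filter($lines, function ($line) {
            return trim($line) !== '';
        }));
    }
}

// Decoder side: drain the queue and keep only what the service needs.
// In a real setup this would run as a separate process reading from Redis.
function decodeQueued(SplQueue $queue): array
{
    $rows = [];
    while (!$queue->isEmpty()) {
        $message = json_decode($queue->dequeue(), true);
        if (!is_array($message) || !isset($message['id_str'], $message['text'])) {
            continue; // malformed, or a non-tweet message such as a delete notice
        }
        $rows[] = ['id' => $message['id_str'], 'text' => $message['text']];
    }
    return $rows; // ready to be inserted into MySQL by the caller
}
```

The point of the split is that a slow database or a bug in the decoder never blocks the socket read loop; the queue absorbs the difference.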

Some problems to keep in mind with this approach:
- Catching ALL system notifications (handling POSIX signals)
- Capping memory (Redis does this beautifully)
- Tuning PHP-CLI values
- Writing a useful application log and some tools to monitor your scripts :)

I know I wrote a lot of code without reusing third-party implementations, but I have complete control over what's going on in my application.

2 years 3 weeks ago

@BenPaster Ben Paster

Thank you for the fantastic response. I don't have time to completely digest it now, as I'm traveling, but I will try it out when I'm home and post any questions I have. Reading through the "collecting processing" doc you linked to, I saw Twitter mentioned the possibility of inserting raw responses from the API into a database, then processing those in a secondary process (run by a cron job?). That would be kind of like making my own job manager. What are your thoughts on that?

2 years 3 weeks ago

@maxf3r massimo ferrari

Yes, I didn't highlight the difference between using the Twitter Stream and using Site Streams. You speak about "my users", so I suppose you intend to use Site Streams. My service tries to analyze all users, so I need the Twitter Stream. Maybe (I don't really know Site Streams in depth) the amount of data you'll retrieve from Site Streams will be lower than in my case :). If so, it should be enough to store the data in an RDBMS rather than in a fast storage engine like Redis.

2 years 3 weeks ago