RSS aggregator background notes
Content aggregation by RSS in and out - background notes
Project manager is Richard Hering.
Technical advice from Hamish Campbell, and, if needed, his brother Tom.
Email for both: info (AT) vision (DOT) tv
skype: richarddirecttv, hamishcampbell
mob: Richard 07894350478, Hamish 07931165452
visionOntv's mission is to distribute video for social change as widely as possible, and to create social media toolkits so that real communities of action can be built using that content.
"Conversation is king. Content is just something to talk about." (Cory Doctorow) To that we add "and then take action about". That way leads to social change.
But first of all, we need to enhance the profile of video for social change and to distribute it to as many places as possible. To this end, we are a "golden ladle in the data-soup" of video, collating the best social change films from around the world by pulling in both rss feeds of streams of quality video content and individual films. To enhance this content, we also create our own shows to put news in context and distribute these by media rss as well. We apply metadata (tags) to all of these films so that they can be sent back out via rss, and thus found and enjoyed easily by anyone who wants to. This facility should be available to any trusted member of the project, enabling anyone to build their own flow of video on their own site with customised strands of content, based on tags.
In this way visionOntv aims to build flows of lasting and automatically-updating video content, not the "here today, gone tomorrow" kind.
The scripting challenge
Our liferay cms is already able to create an RSS feed from a tag query rather than from a whole site.
The aggregator takes the articles from an rss feed, injects them into the liferay database as web content, and deals with any errors this may bring up (eg mismatching fields).
(This will lead to the capability of reading sites by tag queries, rather than whole sites.)
The tag-based rss feeds will be used as source feeds for open source media players on our site and third-party sites.
Thus, by applying boolean logic to tags, it should be possible to automate feeds in and out of each site's articles: a site can be queried for an RSS feed based on a tag rather than the whole site. This feature is key for empowerment of the content producer or individual website owner in the network. Rather than leaving the decision to the techie at the centre of the network, it enables the content producer or website editor at any point in the network to influence which content ends up where.
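As an illustration only, a boolean tag query could be sketched like this. The function names, the query shape (all_of / any_of / none_of) and the sample articles are assumptions made for the example; they are not part of liferay or of any existing visionOntv code.

```python
from typing import Iterable


def matches(tags: Iterable[str], all_of=(), any_of=(), none_of=()) -> bool:
    """Return True if an article's tags satisfy a simple boolean tag query."""
    have = {t.strip().lower() for t in tags}
    want_all = {t.lower() for t in all_of}
    want_any = {t.lower() for t in any_of}
    banned = {t.lower() for t in none_of}
    if not want_all <= have:            # AND: every required tag is present
        return False
    if want_any and not (want_any & have):  # OR: at least one optional tag
        return False
    return not (banned & have)          # NOT: no excluded tag is present


def select(articles, **query):
    """Return the titles of the articles that match the tag query."""
    return [a["title"] for a in articles if matches(a["tags"], **query)]


# Hypothetical articles, for illustration only.
articles = [
    {"title": "Climate camp report", "tags": ["climate", "uk", "news"]},
    {"title": "Squat eviction", "tags": ["housing", "uk"]},
    {"title": "Tar sands film", "tags": ["climate", "canada"]},
]

print(select(articles, all_of=["climate"], none_of=["canada"]))
```

A website editor could then describe a strand of content purely as a tag query, without touching the aggregator itself.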
We also need to build a stand-alone application which polls the rss feeds and de-dupes them before injecting them into the liferay cms. This needs to be error-robust and to deal with different rss formats.
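A minimal sketch of the "different rss formats" part, assuming only two formats (RSS 2.0 and Atom) and using nothing but the Python standard library. The function names and the sample feeds are made up for the example; a real version would need to cope with more variants.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Return a list of {guid, title, link} dicts from RSS 2.0 or Atom XML.

    Raises ValueError if the XML is not well formed or the format is unknown.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc

    items = []
    if root.tag == "rss":
        for item in root.iter("item"):
            link = (item.findtext("link") or "").strip()
            guid = (item.findtext("guid") or "").strip() or link
            title = (item.findtext("title") or "").strip()
            items.append({"guid": guid, "title": title, "link": link})
    elif root.tag == ATOM + "feed":
        for entry in root.iter(ATOM + "entry"):
            link_el = entry.find(ATOM + "link")
            link = link_el.get("href", "") if link_el is not None else ""
            guid = (entry.findtext(ATOM + "id") or "").strip() or link
            title = (entry.findtext(ATOM + "title") or "").strip()
            items.append({"guid": guid, "title": title, "link": link})
    else:
        raise ValueError(f"unsupported feed format: {root.tag}")
    return items


def dedupe(items):
    """Keep the first item seen for each GUID; drop items without a GUID."""
    seen, unique = set(), []
    for item in items:
        if item["guid"] and item["guid"] not in seen:
            seen.add(item["guid"])
            unique.append(item)
    return unique


# Hypothetical sample feeds, for illustration only.
RSS_SAMPLE = """<rss version="2.0"><channel>
  <item><title>Film one</title><link>http://example.org/1</link>
        <guid>http://example.org/1</guid></item>
  <item><title>Film one again</title><link>http://example.org/1</link>
        <guid>http://example.org/1</guid></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Film two</title><id>http://example.org/2</id>
         <link href="http://example.org/2"/></entry>
</feed>"""

combined = dedupe(parse_feed(RSS_SAMPLE) + parse_feed(ATOM_SAMPLE))
print([item["title"] for item in combined])
```

Raising one kind of error for every bad feed keeps the calling code simple: it can log the problem and move on to the next feed.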
And remember: "The enemy of progress is complexity". (Dave Winer)
Tech notes - suggestions for things to be added later
These features should not be implemented in the first version, but in subsequent versions they will be necessary to deal with scaling issues.
Real-time scaling with PubSubHubbub - which does realtime notification, rather than us having to poll the whole feed repeatedly after a set time period.
Fat pings - the notification carries the changed feed content itself, not just a signal that something has changed.
RSS feed timed caching (queries are only updated once every set time period).
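Timed caching could look roughly like the sketch below. It is an assumption about how it might be done, not a description of anything liferay provides: the class name, the cache key and the five-minute period are all invented for the example.

```python
import time


class TimedCache:
    """Cache a generated feed per query and rebuild it at most once per period."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self._store = {}            # key -> (expires_at, value)

    def get(self, key, build):
        """Return the cached value for key, calling build() only when stale."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = build()
        self._store[key] = (now + self.ttl, value)
        return value


# Hypothetical usage: rebuild the feed for a tag query at most every 5 minutes.
cache = TimedCache(ttl_seconds=300)
feed_xml = cache.get("tag=climate", lambda: "<rss version='2.0'/>")
```

However many players request the same tag feed, the expensive query then runs only once per period.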
Tech advice on rss into liferay
Test Media RSS 2.0 feed: http://visionontv.blip.tv/rss
Each video has a title, a description and a video file link enclosure. So if you look at this feed in an XML editor, we should be able to map each field to a field in a new liferay article and add the tags to that article.
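The field mapping could be sketched as follows. This assumes tags arrive either as plain category elements or as a comma-separated media:keywords element; the real blip feed should be checked in an XML editor before relying on that. The article field names (title, description, video_url, tags) and the sample item are invented for the example and are not liferay's own field names.

```python
import xml.etree.ElementTree as ET

MEDIA = "{http://search.yahoo.com/mrss/}"  # Media RSS namespace


def item_to_article(item):
    """Map one RSS <item> element to a plain dict of article fields."""
    enclosure = item.find("enclosure")
    tags = [c.text.strip() for c in item.findall("category") if c.text]
    keywords = item.findtext(MEDIA + "keywords") or ""
    tags += [k.strip() for k in keywords.split(",") if k.strip()]

    # Keep the first spelling of each tag, ignoring case.
    seen, unique = set(), []
    for tag in tags:
        if tag.lower() not in seen:
            seen.add(tag.lower())
            unique.append(tag)

    return {
        "guid": (item.findtext("guid") or "").strip(),
        "title": (item.findtext("title") or "").strip(),
        "description": (item.findtext("description") or "").strip(),
        "video_url": enclosure.get("url") if enclosure is not None else None,
        "tags": unique,
    }


# Hypothetical item, for illustration only.
SAMPLE = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example film</title>
      <description>A short example description.</description>
      <guid>http://example.org/films/1</guid>
      <category>climate</category>
      <media:keywords>protest, Climate, uk</media:keywords>
      <enclosure url="http://example.org/films/1.mp4" type="video/mp4" length="0"/>
    </item>
  </channel>
</rss>"""

article = item_to_article(ET.fromstring(SAMPLE).find("channel/item"))
print(article["title"], article["video_url"], article["tags"])
```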
Q: How do you parse the feed and create articles?
A: We can add tags directly in Liferay articles - there is an API for creating articles and adding tags. The asset publisher portlet then allows you to publish articles based on the tags and generate RSS feeds based on the tags.
Q: Must every article be stored somewhere first, in the database?
A: Yes, as a web content article in the liferay CMS.
1. Take RSS feed as input
2. Add tags by article using the API
3. Save to database
4. Publish it by portlet
You can then use Velocity templates to publish and format the articles using the asset publisher.
You create templates that match the media RSS2 format and structures using the structure editor.
Q: How about the syncing and time management?
A: You would need an application that polled all the RSS and kept a list of GUIDs that had been imported. If it found a new GUID it would create a new liferay article with the RSS data. You could configure it by creating a list of default articles with the default tags and the URLs of the feeds. The app would request a list of these articles and then poll each RSS.
The whole system would then be driven by the tagging.
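One polling pass over the configured feeds might be sketched like this. The config shape (a list of dicts with url and default_tags), the function name and the injected fetch_items callable are all assumptions for the example; in the real app the config would come from the liferay articles described above.

```python
def poll_once(feeds, fetch_items, seen_guids):
    """Poll every configured feed once.

    feeds       -- list of {"url": ..., "default_tags": [...]} (hypothetical shape)
    fetch_items -- callable(url) returning a list of item dicts with a "guid"
    seen_guids  -- set of GUIDs already imported; updated in place
    Returns (new_items, errors).
    """
    new_items, errors = [], []
    for feed in feeds:
        try:
            items = fetch_items(feed["url"])
        except Exception as exc:   # one broken feed must not stop the others
            errors.append((feed["url"], str(exc)))
            continue
        for item in items:
            guid = item.get("guid")
            if not guid or guid in seen_guids:
                continue
            seen_guids.add(guid)
            tagged = dict(item)
            # The feed's default tags are added to whatever the item carries.
            tagged["tags"] = sorted(set(item.get("tags", [])) | set(feed["default_tags"]))
            new_items.append(tagged)
    return new_items, errors


# Hypothetical configuration; only the blip URL comes from these notes.
FEEDS = [
    {"url": "http://visionontv.blip.tv/rss", "default_tags": ["visionontv"]},
]
```

Passing the fetch function in keeps the polling logic testable without touching the network.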
To call the HTTP API to add an article, you need to tell liferay which server the API calls are coming from, ie add the IP of the server running the RSS app to the allowed hosts. If the app runs on the same server as liferay, this would be that server's own IP address.
The best way to build this is to get an end-to-end prototype working first (= REQ-1)
You need to store the RSS items so the next time you read them you know which ones have already been added as liferay web articles.
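A minimal sketch of such a store, using SQLite from the Python standard library. The class, table and column names are made up for the example, and any small database would do.

```python
import sqlite3


class GuidStore:
    """Remember which RSS GUIDs were imported and which article each became."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" so the list survives restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS imported ("
            "guid TEXT PRIMARY KEY, article_id TEXT NOT NULL)"
        )
        self.db.commit()

    def article_id(self, guid):
        """Return the stored liferay article ID for a GUID, or None if unseen."""
        row = self.db.execute(
            "SELECT article_id FROM imported WHERE guid = ?", (guid,)
        ).fetchone()
        return row[0] if row else None

    def remember(self, guid, article_id):
        """Record that the item with this GUID was saved as article_id."""
        self.db.execute(
            "INSERT OR REPLACE INTO imported (guid, article_id) VALUES (?, ?)",
            (guid, article_id),
        )
        self.db.commit()


store = GuidStore()
store.remember("http://example.org/films/1", "10101")  # hypothetical values
print(store.article_id("http://example.org/films/1"))
```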
The next step is to integrate the output of this code into liferay articles and store them in the database.
You will need to find the Liferay api docs for remote API calls, using help in the complete book for Liferay 6.0, as well as the api docs.
The Liferay 6.0 manual is here: http://visionon.tv/resources/-/document_library/view/29352
You need to join and be logged in to http://visionon.tv to access it.
It uses tunnel web and you need to enable the IP that is calling the api.
1) Use the API to read an article that just has a list of RSS feeds and default articles.
2) Read the feed XML into memory using the first RSS URL.
3) Convert to media RSS.
4) Loop through the items and, for each item, check whether its GUID is in your db.
5) If it is not found, load the default article for the feed, merge in the new data and write it back as a new article using the liferay API. Save the GUID and the liferay article ID in your db.
6) If it was found, load the liferay version using the stored ID, update it and write the article back to liferay. Done!
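Steps 4 to 6 could be sketched as below. The FakeLiferay class is only a stand-in so the example runs: its method names (add_article, get_article, update_article) are hypothetical and are not the real liferay remote API, which still has to be looked up in the API docs. The article fields are likewise invented for the example.

```python
import copy


class FakeLiferay:
    """In-memory stand-in for the remote API; method names are hypothetical."""

    def __init__(self):
        self.articles = {}
        self._next_id = 1

    def add_article(self, article):
        article_id = str(self._next_id)
        self._next_id += 1
        self.articles[article_id] = copy.deepcopy(article)
        return article_id

    def get_article(self, article_id):
        return copy.deepcopy(self.articles[article_id])

    def update_article(self, article_id, article):
        self.articles[article_id] = copy.deepcopy(article)


def merge(article, item):
    """Copy the RSS item's fields over a (default or existing) article."""
    merged = copy.deepcopy(article)
    for field in ("title", "description", "video_url"):
        if field in item:
            merged[field] = item[field]
    return merged


def sync_items(items, default_article, guid_to_id, liferay):
    """Create an article for each unseen GUID and update the ones already seen."""
    created, updated = [], []
    for item in items:
        article_id = guid_to_id.get(item["guid"])
        if article_id is None:
            # Step 5: start from the feed's default article (default tags etc).
            article_id = liferay.add_article(merge(default_article, item))
            guid_to_id[item["guid"]] = article_id
            created.append(article_id)
        else:
            # Step 6: load the stored article, update it and write it back.
            current = liferay.get_article(article_id)
            liferay.update_article(article_id, merge(current, item))
            updated.append(article_id)
    return created, updated
```

Here guid_to_id is a plain dict for brevity; in practice it would be the db that records GUIDs and liferay article IDs.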
In order to communicate with the remote server and, moreover, to protect the HTTP connection, we need to set up tunnel web in portal-ext.properties. That is, we need to add the following lines at the end of portal-ext.properties:
tunnel.servlet.hosts.allowed=18.104.22.168,22.214.171.124,126.96.36.199
tunnel.servlet.https.required=false
The above code shows a property tunnel.servlet.hosts.allowed with a list of allowed hosts, e.g., 18.104.22.168, 22.214.171.124, 126.96.36.199.
These hosts are examples only; use your own real hosts. The code also sets the property tunnel.servlet.https.required, which is false by default. Set it to true if you want to require HTTPS. This just opens the APIs so that you can use them remotely. Tom is searching for a doc about the APIs you need.
Install liferay locally, unlock the API, then try this URL. It should list all the APIs.
Q: As long as the original tags from the video are added - can the user use other parts of liferay to update and change them?
A: Yes. If the video comes from a server which supports media rss 2.0, the tag information in the feed will automatically be stored in our system too.
Q: Will it be stored in the tag category?
A: Yes, it will be stored in the tag category. The key thing is to get the rss items as web content items in liferay - the rest is for later, and we should not think about it now.
Q: So what are the problems with getting the rss into liferay as web content automatically?
A: Well, the code must split the rss into its items, process each one and write it into an article. This needs some tuning.
The next step is for the code to sync automatically with other servers, so that it can retrieve the rss live as items are uploaded or updated.
For now it would be very useful just to get the RSS import working. Then we could run it once to import all the films from blip into our database. (REQ-1)
After running once - we would then have to make sure the de-duping was working before running it again.
At first we need to use the rss aggregator to import all the rss from blip and convert them to articles.
Later we should update the code such that it could do all of this on its own.
Syncing with other servers and de-duping are both important, but the essential thing is to get the tuning done so that the items end up in articles.
It would be good if the whole system worked in real time by supporting http://code.google.com/p/pubsubhubbub/ and http://en.wikipedia.org/wiki/RSS_Cloud
This is important so that we can scale up.
We have set out:
How the rss items go into liferay articles.
How we need to use the asset publisher to get them out.
How to map the rss fields into liferay article fields.
How to add default tags per feed.
How to add the rss items into liferay as liferay articles, and the necessity of doing this.
How to configure which feeds are imported and how often they are read.
What needs to happen for scalability.