This is the long-winded tale of the adventures of one Hasnain Lakhani as he tried to extract all the links he'd ever shared on his facebook wall and tried to move them to a medium more suitable to browsing and searching.
The first thing to do when creating pages full of links is to, naturally, get those links from somewhere. If someone wants to get all their personal data from Facebook, one would naturally try to download your information from facebook.
The data dump looked to be quite promising, as it seemed to contain all the needed posts, despite what this url said. Those included the ones in which links were shared, however the actual URLs were not present. That turned out to be a dead-end.
The next attempt was to use Facebook's Graph API which is intended for developers. This API comes with a very handy Graph API Explorer tool, which lets one run queries for data right there in the browser. With that tool, getting the data out was surprisingly easy, as it was just a few repeated API calls to:
With all that data out and safely saved to a file, the database was almost ready. All that was needed was a small python script to only save posts that had the "link" attribute, and then pretty-print it out to a links.txt file.
Getting the data out was the easy part, the hard part was taking all that data and getting it into an easily browsable form. For the uninitiated, mhlakhani-com is a set of static files that is run through Halwa, a static site generator.
Halwa works by taking data from various sources, running it through various processors, and then producing content based on templates. The data was all ready, the next step was to write a processor and some content.
Writing the ReadersCorner processor was fairly simple. It first loads a data file and keeps entries that pass a certain filter. Then the entries are sorted into different dictionaries based on their timestamps, and these dictionaries are kept around for later use.
The ReadersCornerPage content item was also fairly simple to write. It just loops over all the month dictionaries created by the processor and renders the user-defined template for that month.
Now that the content and processor were ready, it was time to actually run them. That was not difficult at all, since it was just a matter of adding some routes and telling Halwa to run the processor and generate content.. Along with adding a template here and there.
That was all fair and dandy, all the links were up and ready to go online; aside from one teeny problem: a privacy leak. The problem, dear reader, was that the facebook posts sometimes contained tags naming friends. Those had to be scrubbed to remove personally identifying information. And, of course, Reader's corner was no place for silly youtube links. The filter function came to the rescue, removing all URLs that contained a forbidden url/domain component, and replacing any occurrences of names with placeholder names. It's not perfect, since some names within quotes get replaced, but it gets the job done.
The data was easily browsable in HTML format, however it was still quite tough to search the data to find that one link that was shared a while ago that talked about that thing ...
An Inverted Index is one of the standard ways to allow quick full-text searches across a database. At a high level, an inverted index stores each word, and, for each word, a list of documents that contain the given word. Searching is fairly simple, one simply needs to look up the word in the index and then display the appropriate documents (if any).
Generating an inverted index out of the links.txt file and transmitting it to the client seemed to be the way to go. The index wasn't too large either, about 210KB before compression for 784 links.
On a side-note, the Aho-Corasick string matching algorithm is a data structure which allows fast substring searches. However, the generated index takes about three times the space as compared to an inverted index, so it was deemed infeasible.
The first thing to do was update the ReadersCorner Processor so that it would generate the inverted index. The index generation was fairly straightforward. All the text in each entry was concatenated, then tokenized, stopwords (such as "a", "an") were removed, and then each word was associated with the document ID of the current.
A new content type, the ReadersCornerJSONItem was also added, which created one JSON file per entry in the database. These files were later used by the UI to display the search results (since the index only contained IDs).
Once an inverted index is ready, supporting a search query containing multiple terms is fairly straightforward. To get the results of an AND query, one simply just does a set intersection over the results of the document sets for each keyword. For OR, one simply does a set union for those sets. However, the user shouldn't have to type AND or OR between every keyword, so the search UI does two queries; one AND-ing all the keywords, and the other OR-ing all the keywords, and displays the AND results first. This was fairly straightforward to implement, by adding a complexSearch function.
Another problem with the search was that searches weren't being done when the enter key was hit, only when a term was selected from the UI. This was easy to fix by handling the keypress event.
The solution was all ready to be deployed, except for one small problem; the lack of an automatic update. Given the average frequency of posts, poor Hasnain would have to manually update the database multiple times a day, which was clearly not an acceptable solution. Somehow, somewhere, somewhen, ReadersCorner would have to be automatically updated.
The first order of data was to automatically get data from Facebook, programmatically, using the API, and then updating the links.txt file with new data. This was fairly straightforward:
def load(): with open('links.txt') as input: links = json.load(input) return links def get_new_links(): graph = facebook.GraphAPI(ACCESS_TOKEN) query = graph.get_connections("me", "posts", limit=500, fields=','.join(['caption', 'description', 'message', 'id', 'link', 'name', 'type', 'created_time'])) new_links = query.get('data', ) new_links = [l for l in new_links if 'link' in l] return new_links def main(): links = load() new_links = get_new_links() ids = set(l['id'] for l in links) count = 0 for link in new_links: if link['id'] not in ids: links.append(link) count = count + 1 links = sorted(links, key = lambda l: l['created_time'], reverse = True) if count > 0: with open('links.txt.new', 'w') as output: output.write(json.dumps(links, sort_keys=True, indent=4)) print('%s new links!' % count) else: print('No new data!')
What is that magical access token, you may ask?
The wizards of Facebook-land do not give up their data to any warm-blooded adventurer. To access their locked box of data, one requires a mythical key, the fabled "access token". Access tokens obtained through the Graph API explorer are short lived, only lasting a few hours. Clearly, a longer access token was needed.
The solution, unfortunately, is to register for a developer account and create a Facebook app. Once that's done, one can get the two ingredients needed to create the mythical access token: an APP_ID and a SECRET.
The basic access token is received by visiting the following url:
REDIRECT_URI should be a URL you control, the result will be appended as a query string parameter. This token should be copied and fed into the following URL to get a long lived access token:
With that update program in place, all that's needed is to call it repeatedly, and then call halwa to generate the new website. This is easily done with a shell script:
#!/bin/bash FILE=/tmp/update_link_log cd /path/to/folder . env/bin/activate rm -f $FILE python update.py >$FILE 2>&1 if [ $? != 0 ]; then echo "Error updating data!" exit 1; fi if grep --quiet "No new data!" $FILE; then true else now=$(date +"%Y_%m_%d_%H_%M_%S") diff links.txt links.txt.new >>$FILE 2>&1 mv links.txt links.txt.old_$now mv links.txt.new links.txt python -m halwa config.py | grep -v "json" >>$FILE 2>&1 cat $FILE fi
This script is then run every hour by cron, which emails the output using msmtp to provide a notification whenever there is new data, or whenever an error occurs.
And there, dear reader, is the story of ReadersCorner.
PS: At least I got this post out before the 2 year anniversary of the last post.