Introduction

In the final part of Assignment 1, we will extract a set of real friends and top 10 celebrities from the Twitter dataset. For this, we implement our own version of self-join. We will compare the performance of our implementation to the implementation of Join operator in a real database.

The relationships between users of Twitter can be either symmetric or asymmetric.

The symmetric relationship means that if user A followers user B, then user B followers user A back. Let's call such pair of users "true friends": if I follow you, you follow me.

On the other hand, a vertex in a directed social graph with a large in-degree, and a small out-degree may represent a "celebrity": everyone knows the celebrity, but the celebrity does not know anyone. We will measure the degree of fame by taking a difference between the in-degree and the out-degree of each graph vertex.

We have the following two research tasks:

Producing the list and the count of user ID pairs which represent "true friends".
Producing the list of top 10 "celebrities", sorted by (in-degree - out-degree)-difference in descending order (from more famous to less famous).

For both tasks we need to implement a self-join of the Twitter table.

Step 1: Expressing queries in SQL

The first thing to do is to express the two queries in terms of relational algebra operators, convert them to SQL, and run them against a real database system.

Start a new chapter in your research report on Twitter dataset. Describe the meaning of each query, write the relational algebra expressions, and the SQL equivalents for each research question.

Next, use the RDBMS of your choice to create a simple table with 2 columns, import data from an original csv file into this table, and execute both queries. Record the results of each query - the total number of true friends and top 10 celebrities - to be used for checking the correctness of your implementation.

In addition, measure the time taken by the database system to run each query. The simplest way to accomplish this is to use SQLite database, which is very easy to install and to use and which is already installed on the lab machines. Here is an example:

sqlite> .open twitter
sqlite> CREATE TABLE followers (uid1 integer, uid2 integer);
sqlite> .separator ,
sqlite> .import edges.csv followers
sqlite> .timer on
sqlite> select count (*) from followers;
85331845
Run Time: real 1.125 user 0.156250 sys 0.968750
sqlite>

Step 2: Implementing queries

Implement each query as a separate program. Both programs have a large code overlap, so structure your code accordingly.

To simulate a real-life situation, you are constrained by 200MB of the available main memory, as in A1.2. You have to use block reading and writing (with the optimal block size), the buffering techniques and the external-memory algorithms - demonstrating full mastery of the material learned in this course. The input for both programs is a binary file of records which you created in A1.1.

You can choose any algorithm presented in class to achieve your goal: Block Nested-Loop Join, Sort-Merge join, or Hash-join. Of course you could also implement Index Join, but for that you first need to build the index (in fact, two indexes), and building an index from a given static data requires exactly the same techniques of sorting or hashing. If you choose to implement Index Join, you need to include the time for building the index into your total running time.

Note that the original file turned out not to be completely sorted by UID1. For any algorithm other than Sort-Merge-Join (SMJ), you do not need to sort this file. However if you are implementing SMJ, remember that you need to sort your records also by the first column.

This is the time to demonstrate your proficiency in EM algorithms, so everything is left up to you: the choice of the algorithm, the program design, and the implementation. No starter code is provided, however you may reuse the code you wrote for 2PMMS.

Step 3: Reports

Run your programs and produce the list of real friends and top 10 celebrities. Record in your Research report the total number of friends vs. total number of distinct users, and the list of top 10 celebrities with the corresponding in- and out- degrees. Conclude your research report with a brief summary of what you have learned about the users of the social media during this assignment. Are there any other interesting questions you could ask about this data?

Time your programs and compare the running time to the running time of the equivalent database queries which you executed in Step 1. Also do not forget to mention which database system you are comparing to. Plot the comparative performance at the end of the performance report you started in A1.1. Is your implementation faster? If yes, why do you think it is faster? If not, why not? Conclude your performance report with a brief summary of all the performance results. Reflect about and list everything you learned about working with large inputs.

Deliverables

Submit a tar ball in .tar.gz named ass1-3.tar.gz that includes all C/C++ files, Makefile, scripts, and reports. Make sure that you can tar/untar your file, and run your programs on the machine where you are planning to present a demo.

The final Research report should contain the results of all your experiments: both from A1.2 and from A1.3. The final Performance report should contain the numeric results of all your performance experiments (A1.1, A1.2 and A1.3) and the reasoning about the differences (or lack of differences) in performance and memory consumption.

Marking scheme

4 points per implementation of each query. Another 2 points for the performance experiments and the reports - for a total of 10 points for this assignment.

DB Assignments

A1. Part 3. Real friends and celebrities. Join.