A2. Thinking in MapReduce

In this assignment, you will be designing and implementing MapReduce algorithms for a variety of common data processing tasks.

The MapReduce programming model (and a corresponding system) was proposed in a 2004 paper from a team at Google as a simpler abstraction for processing very large datasets in parallel.

The goal of this assignment is to give you experience “thinking in MapReduce”. We will use small datasets that you can inspect directly to determine the correctness of your results and to internalize how MapReduce works.

As always, the first thing to do is to review the lectures to make sure you understand the programming model.

The code, inputs and solutions are provided here: https://github.com/mgbarsky/csc443_winter_2017/tree/master/a2_mapreduce

Python MapReduce Framework

You are provided with a python library called MapReduce.py that implements the MapReduce programming model. The framework faithfully implements the MapReduce programming model, but it executes entirely on a single machine: it does not involve parallel computation.
Here is the word count example discussed in class implemented as a MapReduce program using the framework:

# Part 1
mr = MapReduce.MapReduce()

# Part 2
def mapper(record):
    # key: document identifier
    # value: document contents
    key = record[0]
    value = record[1]
    words = value.split()
    for w in words:
      mr.emit_intermediate(w, 1)

# Part 3
def reducer(key, list_of_values):
    # key: word
    # value: list of occurrence counts
    total = 0
    for v in list_of_values:
      total += v
    mr.emit((key, total))

# Part 4
inputdata = open(sys.argv[1])
mr.execute(inputdata, mapper, reducer)

In Part 1, we create a MapReduce object that is used to pass data between the map function and the reduce function; you won’t need to use this object directly.

In Part 2, the mapper function tokenizes each document and emits a key-value pair. The key is a word formatted as a string and the value is the integer 1 to indicate an occurrence of the word.

In Part 3, the reducer function sums up the list of occurrence counts and emits a count for word. Since the mapper function emits the integer 1 for each word, each element in the list_of_values is the integer 1.
The list of occurrence counts is summed and a (word, total) tuple is emitted where word is a string and total is an integer.

In Part 4, the code loads the json file and executes the MapReduce query which prints the result to stdout.

 

 

Problems

For each problem, you will write a python script, similar to wordcount.py, that solves the problem using the supplied MapReduce framework.

Input data and solutions for each problem are provided in the data and solutions folders respectively.
Your python submission scripts are required to have a mapper function that accepts at least 1 argument and a reducer function that accepts at least 2 arguments. Your submission is also required to have a global variable named mr which points to a MapReduce object. If you solve the problems by simply replacing the mapper and reducer functions in wordcount.py, then this condition will be satisfied automatically.

1. Inverted index

Create an Inverted index. Given a set of documents, an inverted index is a dictionary where each word is associated with a list of the document identifiers in which that word appears.

Mapper Input

The input is a 2 element list: [document_id, text], where document_id is a string representing a document identifier and text is a string representing the text of the document. The document text may have words in upper or lower case and may contain punctuation. You should treat each token as if it was a valid word; that is, you can just use value.split() to tokenize the string.

Reducer Output

The output should be a (word, document ID list) tuple where word is a String and document ID list is a list of Strings.
You can test your solution to this problem using books.json:
python inverted_index.py input/books.json
You can verify your solution against solutions/inverted_index.json.

2. Join

Implement a relational join as a MapReduce query
Consider the following query:
SELECT * 
FROM Orders, LineItem 
WHERE Order.order_id = LineItem.order_id

Your MapReduce query should produce the same result as this SQL query executed against an appropriate database.
You can consider the two input tables, Order and LineItem, as one big concatenated bag of records that will be processed by the mapper function record-by-record.

Mapper Input

Each input record is a list of strings representing a tuple in the database. Each list element corresponds to a different attribute of the table
The first item (index 0) in each record is a string that identifies the table the record originates from. This field has two possible values:
  • "line_item" indicates that the record is a line item.
  • "order" indicates that the record is an order.
The second element (index 1) in each record is the order_id.
LineItem records have 17 attributes including the identifier string.
Order records have 10 elements including the identifier string.

Reducer Output

The output should be a joined record: a single list of length 27 that contains the attributes from the order record followed by the fields from the line item record. Each list element should be a string.
You can test your solution to this problem using records.json:
$ python join.py input/records.json
You can compare your solution with solutions/join.json.

3. Symmetric friends

Consider a simple social network dataset consisting of a set of key-value pairs (personA, personB), representing a friend relationship between two people, where personA considers personB their friend. We consider the relationship "friend" to be symmetric, meaning that if I am your friend, you are my friend. Think of a MapReduce algorithm to generate a list of all mutual friends per person. Implement a MapReduce algorithm to generate a count of all true friends per person.

Mapper Input

Each input record is a 2-element list [personA, personB] where personA is a string representing the name of a person and personB is a string representing the name of one of personA's friends.

Reducer Output

The output should be a pair (person, friend_count) where person is a string and friend_count is an integer indicating the number of true friends associated with this person.
You can test your solution to this problem using friends.json:
$ python friend_count.py input/friends.json

You can verify your solution by comparing your result with the file solutions/friend_count.json.

4. Cleaning DNA sequences

Consider a set of key-value pairs where each key is sequence id and each value is a string of nucleotides, e.g., GCTTCCGAAATGCTCGAA....
Write a MapReduce query to remove the last 10 characters from each string of nucleotides, then remove any duplicates generated.

Mapper Input

Each input record is a 2-element list [sequence id, nucleotides] where sequence id is a string representing a unique identifier for the sequence and nucleotides is a string representing a sequence of nucleotides.

Reducer Output

The output from the reduce function should be the unique trimmed nucleotide strings.
You can test your solution to this problem using dna.json:
$ python unique_trims.py input/dna.json

You can verify your solution by comparing your result with the file solutions/unique_trims.json.

5. Matrix multiplication

Assume you have two matrices A and B in a sparse matrix format, where each record is of the form i, j, value. Design a MapReduce algorithm to compute the matrix multiplication A x B.

Mapper Input

The input to the map function will be a line representing a single entry of the matrix. Each entry will be of the form [matrix, i, j, value] where matrix is a string and i, j, and value are integers.
The first item, matrix, is a string that identifies which matrix the record originates from. This field has two possible values: "a" indicates that the record is from matrix A and "b" indicates that the record is from matrix B

Reducer Output

The output from the reduce function will also be a non-zero entry of the result matrix represented as a tuple. Each tuple will be of the form (i, j, value) where each element is an integer.
You can test your solution to this problem using matrix.json:
$ python multiply.py input/matrix.json
You can verify your solution by comparing your result with the file solutions/multiply.json.



Marking scheme

Submit your python files in a single zip file. 3 points for each correct solution, for a total of 15 points for this assignment.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.