1. Introduction
In this assignment, we investigate the data access characteristics of primary and secondary storage devices. We explore the speed of data transfer between main memory and disks, and the performance differences between sequential and random I/Os. This part consists of two main tasks. For the first task, you are going to implement all the required modules using the C or C++ programming language. After implementing each module, you may test its correctness against a small dataset and add a new compilation unit to the Makefile. After making sure that each program works properly, you perform experiments with a larger input file (the experiment dataset), record your observations, and discuss the results.
Both datasets represent the social network between people. Each line consists of a pair of IDs, delimited by a comma:

1,2

This means the user with ID 1 follows the user with ID 2.
2. Coding tasks
2.1. Reading and parsing input
Write a routine that parses each line and returns an array of strings. Implement a function that reads a file line-by-line, parses each line, and converts each string array into a Record. A Record can be implemented as a struct containing an array, a vector (for C++), or, for this simple case, just two integers:

typedef struct record { int uid1; int uid2; } Record;

Test that the conversion works properly by printing each line and each Record of a small data file.
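As a minimal sketch of the conversion (the function name `parse_line` and the single `sscanf` call are our own choices, collapsing the split-into-strings and convert steps into one):

```c
#include <stdio.h>

typedef struct record { int uid1; int uid2; } Record;

/* Parses one "uid1,uid2" line into a Record.
   Returns 0 on success, -1 if the line is malformed. */
int parse_line(const char *line, Record *out) {
    if (sscanf(line, "%d,%d", &out->uid1, &out->uid2) != 2)
        return -1;
    return 0;
}
```

For example, parse_line("12,34", &r) fills r.uid1 = 12 and r.uid2 = 34, while a line without a comma-separated pair of integers is rejected.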
2.2. Block vs. line writing
In this part we convert the entire input text file into a table of records with the following simple schema: Following(UID1:int, UID2:int). For simplicity, we assume that the attribute names are stored as part of the schema and not included in each page. To simplify matters even further, we are going to pack the maximum number of records into a single page, without keeping any page header information. Use the functions developed in 2.1 to write a program that reads the input file line-by-line, converts each line to a record, fills in the output page buffer (the size of the buffer corresponds to the block size, provided as a program parameter), and, when the buffer is full, appends the full page of serialized records to a binary file.
Call this program write_blocks_seq. It should accept two command-line arguments: the input file name and the block size:

$ write_blocks_seq <input filename> <block size>

A suggested block_size for initial testing of this part is 1 MB (const MB = 1024 * 1024). Note that the block size has to be a multiple of sizeof(Record).

Write a script that runs write_blocks_seq with different block-size parameters; the program outputs the processing rate in Bytes Per Second (BPS) to stdout (which can be redirected to a file). Suggested block sizes for the experiments:

KB = 1024
MB = 1024 * 1024
sizes = ( 512, 1 * KB, 4 * KB, 8 * KB, 16 * KB, 32 * KB, 1 * MB, 2 * MB, 4 * MB )

For comparison, also implement a program that writes each record to the output as a text line, one line per write, without block buffering. Call this program write_lines.
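The buffer-filling loop described above can be sketched as follows (a sketch under assumed names; the write_blocks function and its inline sscanf parse are our own simplifications, and command-line handling is elided):

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct record { int uid1; int uid2; } Record;

/* Reads "uid1,uid2" lines from in_name and appends full pages of
   serialized records to the binary file out_name.
   Returns 0 on success, -1 on error or if block_size is not a
   multiple of sizeof(Record). */
int write_blocks(const char *in_name, const char *out_name, int block_size) {
    if (block_size <= 0 || block_size % (int)sizeof(Record) != 0)
        return -1;                                   /* must be a multiple */
    int records_per_block = block_size / (int)sizeof(Record);
    Record *buffer = calloc(records_per_block, sizeof(Record));
    FILE *in = fopen(in_name, "r");
    FILE *out = fopen(out_name, "wb");
    if (!buffer || !in || !out) return -1;

    char line[64];
    int filled = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        if (sscanf(line, "%d,%d",
                   &buffer[filled].uid1, &buffer[filled].uid2) != 2)
            continue;                                /* skip malformed line */
        if (++filled == records_per_block) {         /* page full: append it */
            fwrite(buffer, sizeof(Record), filled, out);
            filled = 0;
        }
    }
    if (filled > 0)                                  /* partial last page */
        fwrite(buffer, sizeof(Record), filled, out);
    fflush(out);
    fclose(in); fclose(out); free(buffer);
    return 0;
}
```

In the real program the timing and BPS reporting from Section 6 would surround this loop.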
2.3. Sequential reads
In this part you implement a query that reads the binary record file in blocks and computes how many people each user follows, at most and on average. The original records are sorted by uid1, which allows implementing this program with a single sequential pass over the binary file. The program outputs two values: max and average. The timing code should surround the entire computation, including reading each block from disk. Call this program read_blocks_seq. The program parameters are the binary input file name and the block size in bytes (you have to check that the block size provided as a parameter is a multiple of sizeof(Record)).
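Because the records are sorted by uid1, a run of equal uid1 values corresponds to one user's followees, so one pass suffices. A sketch of the computation over an in-memory run (the function name is our own; in read_blocks_seq the run state must be carried across block boundaries):

```c
#include <stdio.h>

typedef struct record { int uid1; int uid2; } Record;

/* Scans records sorted by uid1 and reports the largest and the average
   number of users followed per distinct uid1. */
void max_avg(const Record *recs, long n, long *max_out, double *avg_out) {
    long max = 0, run = 0, users = 0;
    for (long i = 0; i < n; i++) {
        run++;
        /* a run ends when uid1 changes or at the last record */
        if (i == n - 1 || recs[i].uid1 != recs[i + 1].uid1) {
            if (run > max) max = run;
            users++;
            run = 0;
        }
    }
    *max_out = max;
    *avg_out = users ? (double)n / users : 0.0;
}
```

For example, the records (1,2), (1,3), (1,4), (2,5) give max = 3 and average = 2.0.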
Now, modify your code to read the entire binary file into a large in-memory buffer and perform the same query. You will need to determine the file size in order to allocate a buffer of the appropriate size in your code. Time the query in bytes/second after all the records have been loaded into RAM. Call this program read_ram_seq. It has only one command-line parameter: the binary input file name.

2.4. Random reads
Implement the max and average query that computes the max and average by sampling data at X random positions in the data file (X is provided as an additional command-line parameter). The program performs a loop with X iterations. In each iteration it reads a block of records into a single-page buffer, starting from a random position in the file, and examines all the records in this block for the max and average. Call this program read_blocks_rand.

Implement the same random-sampling query after the entire binary file has been loaded into RAM. Call this program read_ram_rand.
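One iteration of the random-read loop can be sketched like this (names are our own; note the seek is aligned to a record boundary so a whole number of records is read):

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct record { int uid1; int uid2; } Record;

/* Reads up to records_per_block records starting at a random record
   boundary of an open binary file; returns the number of records read
   (fewer than records_per_block near the end of the file). */
long read_random_block(FILE *fp, Record *buffer, long records_per_block,
                       long total_records) {
    long pos = rand() % total_records;            /* random record index */
    fseek(fp, pos * (long)sizeof(Record), SEEK_SET);
    return (long)fread(buffer, sizeof(Record), records_per_block, fp);
}
```

In the full program this call sits inside the X-iteration loop, with the max/average scan applied to the records actually read.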
2.5. Random writes
Implement two more programs (write_blocks_rand, write_ram_rand) that seek to X random locations in the file records.dat (X is provided as the second command-line argument) and overwrite one record at each location with a new record, let's say with the identical record (1,2).
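A single random in-place update can be sketched as follows (a sketch with our own naming; the file must be opened with "r+b" so existing contents are preserved while allowing writes):

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct record { int uid1; int uid2; } Record;

/* Overwrites one record at a random record boundary with (1,2).
   Returns 0 on success, -1 on a seek or write error. */
int overwrite_random(FILE *fp, long total_records) {
    Record fresh = {1, 2};
    long pos = rand() % total_records;
    if (fseek(fp, pos * (long)sizeof(Record), SEEK_SET) != 0) return -1;
    if (fwrite(&fresh, sizeof(Record), 1, fp) != 1) return -1;
    fflush(fp);   /* push the update toward the disk before the next seek */
    return 0;
}
```

The full write_blocks_rand program calls this X times inside the timed loop.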
Implement the same random updates after the entire dataset has been loaded into RAM.

3. Experiments
3.1. Experiment 1: optimal block size
Before running the experiments to determine the optimal block size, ask your operating system for the system disk block size using the df command, and note it in your report. Run the write_blocks_seq program with different block sizes, using the script you wrote in 2.2, and record the respective write data rate in bytes/second. Plot BPS as a function of block size. Run the program write_lines and compare the writing rates for text lines and blocks.

3.2. Experiment 2: sequential vs. random read rate
Run read_blocks_seq and read_ram_seq to determine how many people each user follows, at most and on average, and plot the performance results (logarithm of BPS) on a single graph. If you want to explore all your storage options, run the same experiment using read_blocks_seq for files located on an optical disk, an SSD, and a rotational hard drive. Plot the results on the same graph and discuss the differences in reading rate.

To be able to perform a random-access reading experiment, we have to use an input file which is larger than the available main memory. Determine the amount of available memory with free -m. If the memory is larger than 4 GB (for a lab machine), please use another machine. Make a large input file (at least 1.5 times larger than the available main memory) by concatenating the data file several times:
$ cat records.dat records.dat records.dat > big.dat
Note: in order to compile your program to work with big files you might need to add two compilation flags to your Makefile:

CC = g++
FLAGS = -D_LARGEFILE_SOURCE
FLAGS += -D_FILE_OFFSET_BITS=64
...
$(CC) $(CFLAGOPT) ...
...
Write the second section of your report summarizing the reading experiments. Include plots and the discussion.
3.3. Experiment 3: sequential vs. random write rate
You already know the rate of sequential writing to disk from Experiment 1. Now run write_blocks_rand and write_ram_rand and measure the write rate for sequential and random access for at least two memory types: primary and secondary memory. Plot all the results of reading and writing on the same plot.

Finally, write a summary for your report and discuss what you have learned about access patterns for different memory types. Did these experiments persuade you that we need to design different algorithms for primary and for secondary storage?
4. Deliverables
Submit a tar ball named ass1.1.tar.gz that includes all C/C++ files, the Makefile, and your scripts, and separately submit report1.pdf with plots and discussions. Make sure that you can tar/untar your file on the machine where you are planning to present your work.

The programs are: write_blocks_seq, write_lines, read_blocks_seq, read_ram_seq, read_blocks_rand, read_ram_rand, write_blocks_rand, write_ram_rand (8 source files in total). Submit a Makefile which generates all the above executables. The compilation should work without warnings.

5. Marking scheme
Full 5 (3-code, 2-report) points can be earned for this part of Assignment 1.

6. C syntax reminders
Reading text file

/* returns: -1 if there is an error. */
#include <stdio.h>
#include <string.h>   /* for strcspn */
...
char current_line[MAX_CHARS_PER_LINE];
FILE *fp_read;
...
/* open text file for reading */
if (!(fp_read = fopen(file_name, "r"))) {
    printf("Could not open file \"%s\" for reading\n", file_name);
    return (-1);
}
...
/* reading lines */
while (fgets(current_line, MAX_CHARS_PER_LINE, fp_read) != NULL) {
    current_line[strcspn(current_line, "\r\n")] = '\0'; /* remove end-of-line characters */
    ...
}
...
fclose(fp_read);

Writing binary pages
#include <stdio.h>
#include <stdlib.h>   /* for calloc/free */
...
/* Our structure */
typedef struct record { int uid1, uid2; } Record;
...
int block_size = ...;
int records_per_block = ...;
FILE *fp_write;
...
/* allocate buffer for 1 block */
Record *buffer = (Record *) calloc(records_per_block, sizeof(Record));
...
if (!(fp_write = fopen(file_name, "wb"))) ...
...
/* flush buffer when full; total_records is the number of records
   currently in the buffer, not in the whole file */
fwrite(buffer, sizeof(Record), total_records, fp_write);
...
/* force data to disk */
fflush(fp_write);
...
fclose(fp_write);
...
free(buffer);

Reading binary pages
#include <stdio.h>
#include <stdlib.h>   /* for calloc/free */
...
int block_size = ...;
int records_per_block = ...;
FILE *fp_read;
...
/* allocate buffer for 1 block */
Record *buffer = (Record *) calloc(records_per_block, sizeof(Record));
...
if (!(fp_read = fopen(file_name, "rb"))) ...
...
/* read records into buffer */
int result = fread(buffer, sizeof(Record), records_per_block, fp_read);
if (result != records_per_block) ...
...
fclose(fp_read);
free(buffer);

Measuring processing rate
#include <sys/timeb.h>
...
struct timeb t_begin, t_end;
long time_spent_ms;
long total_records = 0;

ftime(&t_begin);
/* code to be timed */
total_records++;
ftime(&t_end);

/* time elapsed in milliseconds */
time_spent_ms = (long) (1000 * (t_end.time - t_begin.time)
                      + (t_end.millitm - t_begin.millitm));

/* result in MB per second */
printf("Data rate: %.3f MBPS\n",
       ((total_records * sizeof(Record)) / (float) time_spent_ms * 1000) / MB);
...

Makefile
CC = gcc
CFLAGS = -O3 -Wall
CFLAGS += -D_LARGEFILE_SOURCE
CFLAGS += -fno-exceptions
CFLAGS += -finline-functions
CFLAGS += -funroll-loops
CFLAGS += -D_FILE_OFFSET_BITS=64

# Source files
WRITE_BLOCKS_SRC = utils.c write_blocks_seq.c

# Binaries
all: write_blocks_seq write_lines ...

# sequential writing in blocks
write_blocks_seq: $(WRITE_BLOCKS_SRC)
	$(CC) $(CFLAGS) $^ -o $@
...
clean:
	rm write_blocks_seq write_lines ...
...
For those who choose to do A1 on a lab machine - here is some info:
Each of you got "unlimited" storage for the rest of the term in your own directory /s/csc443/(username). Here you can put and access the input file, and store all intermediate files created during processing. Do not store your code in this directory - it does not have a backup. The content of this directory will be removed after the end of the semester.
In addition, in order to measure the performance of a local disk, each of you has access to a local directory /data/scratch/(username). The content of this directory is cleared after you end your lab session. This directory can only be accessed if you are physically present in the lab - not through ssh. Run your performance experiments in this directory. The disks are SSDs.
Note: the variable total_records in the "Writing binary pages" code example refers to the total number of records currently in the buffer, not to the total number of records in the entire file.
The order of user IDs in the small dataset was not correct - they were not sorted by uid1. I have updated the file on GitHub; sorry for the confusion.