A Course in Data Intensive Computing

[The Course Poster]

[Updated Version of the Course is Available Here]

Welcome to a course in Data Intensive Computing ...

In this course, we will cover a wide variety of advanced topics in big data analytics and data intensive computing, including:
  • Distributed Filesystems, e.g., HDFS
  • NoSQL databases, e.g., HBase
  • Execution engines, e.g., MapReduce, Spark, and Stratosphere
  • Query/scripting languages, e.g., Hive and Shark
  • Stream processing, e.g., Spark Streaming
  • Graph processing, e.g., Pregel, GraphLab, PowerGraph and GraphX
  • Machine learning, e.g., MLlib
  • Resource management, e.g., YARN and Mesos
We will introduce three main big data frameworks (Hadoop, Spark, and Stratosphere) and present the Spark framework in detail.

Course Materials

The course uses the Canvas platform to host its content. You do not need to log in to view the slides and papers on Canvas; just follow this link: here

Distributed Filesystems
  • The Google File System [pdf]
  • The Hadoop Distributed File System [pdf]

NoSQL databases
  • Bigtable: A Distributed Storage System for Structured Data [pdf]
  • NoSQL Databases [pdf]

Execution engines
  • MapReduce: Simplified Data Processing on Large Clusters [pdf]
  • Nephele: Efficient Parallel Data Processing in the Cloud [pdf]
  • Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing [pdf]
  • Spark: Cluster Computing with Working Sets [pdf]
  • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing [pdf]
  • A Survey of Large-Scale Analytical Query Processing in MapReduce [pdf]

Query/scripting languages
  • Hive - A Petabyte Scale Data Warehouse Using Hadoop [pdf]
  • Shark: SQL and Rich Analytics at Scale [pdf]

Stream processing
  • Aurora: A New Model and Architecture for Data Stream Management [pdf]
  • The Design of the Borealis Stream Processing Engine [pdf]
  • S4: Distributed Stream Computing Platform [pdf]
  • Processing Flows of Information: From Data Stream to Complex Event Processing [pdf]
  • Discretized Streams: Fault-Tolerant Streaming Computation at Scale [pdf]
  • Survey of Distributed Stream Processing for Large Stream Sources [pdf]

Graph processing
  • Challenges in Parallel Graph Processing [pdf]
  • Pregel: A System for Large-Scale Graph Processing [pdf]
  • GraphLab: A New Framework for Parallel Machine Learning [pdf]
  • Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud [pdf]
  • PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs [pdf]
  • GraphX: Unifying Data-Parallel and Graph-Parallel Analytics [pdf]

Machine learning
  • MLbase: A Distributed Machine-Learning System [pdf]
  • MLI: An API for Distributed Machine Learning [pdf]

Resource management
  • Dominant Resource Fairness: Fair Allocation of Multiple Resource Types [pdf]
  • Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center [pdf]
  • Apache Hadoop YARN: Yet Another Resource Negotiator [pdf]

Course Assignments

We provide a customized Ubuntu virtual machine (VM) for the course, with Java, Scala, Eclipse, Hadoop, Spark/Shark, and Stratosphere already installed. You can use VirtualBox to run this VM on any Windows, Mac OS, or Linux platform. You can download the VM from here (sudo password: sics).

Assignment 1 - HDFS

  • Questions [pdf]
  • Solutions [pdf]

Assignment 2 - HBase
  • Questions [pdf]
  • Solutions [pdf]

Assignment 3 - Spark
  • Questions [pdf]
  • Solutions [pdf]
  • Input file [txt]
  • Word Count - Hadoop [src]
  • Word Count - Stratosphere [src]
  • Word Count - Spark [src]
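
For orientation, the word-count pattern that the three source files above implement on Hadoop, Stratosphere, and Spark can be sketched without any framework. The following is a minimal, single-machine illustration in plain Python, not the course's source code; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this sketch, and splitting on whitespace after lowercasing is an assumption about tokenization.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, standing in for the framework's shuffle step."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values to produce one count per word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases over an in-memory collection of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

In a real execution engine the map and reduce phases run in parallel across machines and the shuffle moves data over the network; this sketch only shows the data flow between the phases.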

Assignment 4 - Shark
  • Questions [pdf]
  • Solutions [pdf]

Assignment 5 - Spark Streaming
  • Questions [pdf]
  • Solutions [pdf]
  • Word Count - Spark Streaming [src]

Assignment 6 - GraphX
  • Questions [pdf]
  • Solutions [pdf]
  • PageRank - GraphX [src]

Assignment 7 - MLlib
  • Questions [pdf]
  • Solutions [pdf]
  • Input file - sample_svm_data [txt]
  • Input file - lpsa.data [txt]
  • Input file - kmeans_data [txt]
  • Classification [src]
  • Regression [src]
  • Clustering [src]

Course Staff

Amir H. Payberah, PhD, SICS
Seif Haridi, Professor of Computer Systems, SICS/KTH

Course Schedule

Date        Time         Place  Subject
08-04-2014  10:00-11:00  Knuth  Introduction [pdf] [latex]
08-04-2014  11:00-12:00  Knuth  Distributed Filesystems - HDFS [pdf] [latex]
10-04-2014  09:00-11:00  Knuth  NoSQL Databases - HBase [pdf] [latex]
15-04-2014  10:00-12:00  Knuth  A Crash Course in Scala [pdf] [latex]
22-04-2014  10:00-11:00  Knuth  Execution Engine - MapReduce [pdf] [latex]
22-04-2014  11:00-12:00  Knuth  Execution Engine - Stratosphere [pdf] [latex]
24-04-2014  10:00-12:00  Knuth  Execution Engine - Spark [pdf] [latex]
06-05-2014  10:00-12:00  Knuth  Scripting Languages - Shark [pdf] [latex]
08-05-2014  10:00-12:00  Knuth  Data Stream Processing - Spark Streaming [pdf] [latex]
13-05-2014  10:00-12:00  Knuth  Graph Processing - Pregel and GraphLab [pdf] [latex]
15-05-2014  10:00-12:00  Knuth  Graph Processing - PowerGraph and GraphX [pdf] [latex]
20-05-2014  10:00-12:00  Knuth  Machine Learning - MLlib [pdf] [latex]
30-05-2014  10:00-12:00  Knuth  Resource Management - Mesos and YARN [pdf] [latex]