Parallel and Distributed Programming 2018
Kenjiro Taura
(the page is encoded in UTF-8)
What's New (newest first)
Periodically reload this page.
Dates in parentheses indicate when they are posted.
- (Jan. 21)
I got messages from several of you that
your SpMV report submitted to ITC-LMS disappeared.
Please let me know your student ID if it happened to you.
I will look into what happened and keep you updated here.
Stay tuned.
- (Jan. 21)
... And this ends the lecture.
I am sorry that
I was not able to finish the GPU part of the lecture
(I have been terribly busy this term, and
preparing the material for the new topics (VGG and GPU)
was tougher than I anticipated, even though I anticipated it would be tough).
Thank you to those who stuck with it until the end.
- (Jan. 21)
We do not have a class today
(today is the scheduled exam day for
lectures that have exams; this lecture has no
exam) ...
- (Jan. 6)
In response to some questions about what you
are supposed to put
in the final report, I have written
a guideline.
- (Jan. 6)
Be sure to send a brief abstract of your final report
before the deadline.
- (Jan. 6)
The almost-complete baseline code is now ready (sorry for the delay; getting
everything ready was much harder than I anticipated).
Merge the diff with
the usual procedure.
See the updated README.md to learn how you can work efficiently.
Documentation of the source code
is also available.
- (Dec. 29)
I updated the baseline code of VGG. Please fetch and merge the fix
by following the usual procedure.
I'll keep working on this until everything is ready.
- (Dec. 17)
Updated the How to Get the Credit
- (Nov. 27)
IMPORTANT: I thank Hiroka Ihara
for reporting a bug in matrix generation,
which may have been causing irreproducible performance
problems in your experiments.
Please fetch and merge the fix by
following the usual procedure
and check that the function coo_elem_cmp looks like this:
static int coo_elem_cmp(const void * a_, const void * b_) {
  coo_elem_t * a = (coo_elem_t *)a_;
  coo_elem_t * b = (coo_elem_t *)b_;
  /* order by row index, then column index, then value */
  if (a->i < b->i) return -1;
  if (a->i > b->i) return 1;
  if (a->j < b->j) return -1;
  if (a->j > b->j) return 1;
  if (a->a < b->a) return -1;
  if (a->a > b->a) return 1;
  return 0;
}
Before the fix, the last comparison was this orz:
if (a->a > b->a) return -1; // it should be 1
For obvious reasons,
I cannot predict how much this affected your
experiments. The safe bet is to redo your experiments m(_ _)m
- (Nov. 26)
Since a non-negligible number of students do not like
the idea of canceling the class on Dec 26,
I will withdraw the proposal.
We will have a class on Dec 26, as formally scheduled.
- (Nov. 26)
A quick poll on what to do with another
pseudo-Monday at the end of this year (Dec 26th),
which will be terribly unpopular.
- (Nov. 19)
I updated
SpMV hands-on page
with the exact spec of the report,
including
- the deadline (Dec 7) and
- how to submit it via
ITC-LMS.
- (Nov. 11)
As I said in the class, the class on Nov. 12 is canceled.
- (Nov. 5)
See
this section and
this section
to retrieve changes to my repository. I updated these sections
with more details.
- (Oct. 29)
Once you have forked my original repository, please paste the URL of the
top level page of your fork, like
https://doss-gitlab.eidos.ic.i.u-tokyo.ac.jp/YOUR_GITLAB_USER_ID/parallel-distributed-handson
into the same spreadsheet.
- (Oct. 29)
The login-issues spreadsheet is no longer world-editable.
If you want to edit it to report your status, please request permission from me.
If you have an issue but do not have a Google account, please drop me an email
(I will temporarily make the page world-editable).
- (Oct. 29)
Detailed and updated instructions
for the first hands-on.
- (Oct. 15)
Please
read this page and report your status regarding
login, no matter what your status is.
- (Oct. 15) hands-on material
- (Oct. 1)
Before the hands-on in the next class (Oct 15),
go to the IST cluster manual and
create your account.
- (Sep. 14) Site up
Slides
Course Objectives
The main objectives of this course are to gain hands-on experience with parallel programming and a good understanding of how to solve problems in parallel and what determines the performance of parallel programs.
Lecture Plan
I wish to cover the topics below, ranging from a very gentle introduction to parallel programming, to the performance of parallel programs, to fundamental topics that may be difficult for first-time learners (due to time constraints, it is unlikely that I can fully cover them all).
Languages
- All written materials (slides, home pages, etc.)
will be in English.
- Lectures will be in English
Programming Exercises Week(s)
- You will have access to the latest CPU and GPU machines
NEW Announcement: How to Get the Credit
Here are the requirements for getting the credit.
- Participate in the classes often enough
and generally follow what is covered in
class (this is a prerequisite for understanding
what you are really asked to do in what
follows). As you know, I have not been
keeping track of attendance, but I am
pretty sure I can tell if somebody who has
almost never been in class suddenly sends
a report.
- Finish the SpMV hands-on exercises we had during the class.
- Write and submit a final report (term paper).
- abstract deadline: January 18th (Sat),
23:59.
- final deadline: February 9th (Sat),
23:59.
Both are to be submitted through ITC-LMS.
The abstract can be plain text
(detailed instructions will be announced later)
of a paragraph or two describing
what you plan to do for the final report.
The final report must be a logical, consistent,
and sufficiently self-contained document, in a PDF file.
The topic of the final report can be chosen from
the following.
-
Option 1: Optimize Neural Network:
I will provide a dataset and a baseline serial
program (C++) of a neural network for image recognition
(VGG,
published at ICLR 2015)
on the CIFAR-10 dataset. The baseline code is a C++ translation of
the VGG code distributed with Chainer.
You optimize the baseline either on CPU, GPU or both.
You are allowed to make a team of up to two.
A team of two must work on both CPU and GPU.
The most time-consuming operations are matrix multiplications.
I will provide a tool to submit records
and display them in real time through a page similar to
this page,
which I made last year.
You are asked to keep it updated with
your records, so that we can collectively know
how fast they are on each machine.
I will provide the toolkit (baseline code, dataset,
submission tool, and a more detailed spec of the problem).
I will try to have it ready by the next lecture (Dec 26th).
- Option 2: Solve your problem:
Solve a problem you want to get high
performance for. The problem may be
one that arises in your research or
one that interests you. It must be
one for which you have a good prospect
for applying parallelization or
vectorization. Think of good
parallel algorithms to get good performance.
Apply parallelization and/or
vectorization, understand the
maximum achievable performance, and
investigate how close your
implementation comes to it.
Topics
Parallel Programming in Practice
- It's easy! --- a quick and gentle introduction to parallel problem solving
- Some examples of parallel problem solving and programming
Taxonomy of parallel machines and programming models
- What today's machines look like --- parallel computer architecture
- Distributed memory machines
- Multi-core / multi-socket nodes
- SIMD instructions
- Taxonomy of parallel programming models
- Finding and expressing parallelism
- Mapping computation onto compute resources
- Coordination and communication
- Popular programming APIs
- OpenMP
- Intel Threading Building Blocks
- Cilk
- OpenCL
- OpenACC
- MPI
- UPC
- Chapel
- X10
Understanding performance of parallel programs (and achieving high performance)
- Do you know the maximum performance of your CPU?
- Do you know the maximum performance of memory?
- How to reason about memory traffic of your programs
- Provable bounds of greedy schedulers
- Provable bounds of work-stealing schedulers
- Cache miss bounds of work-stealing schedulers
Fundamental Topics (as time permits)
- Concurrent data structures
- Atomic updates without locks
- Linearizability: Specifying correctness of concurrent data structures
- Example lock-free concurrent data structures
- Transactional memory
- Memory consistency models
- Sequential consistency
- How costly is it to maintain sequential consistency?
- Optimizations that violate sequential consistency
- Relaxed consistency
- Implementing shared memory in software
Links and References