CLOUD COMPUTING AND BIG DATA
Academic Year 2024/2025 - Lecturer: Giuseppe PAPPALARDO
Expected Learning Outcomes
- Knowledge and understanding. Students will acquire a precise knowledge and understanding of fundamental concepts in the field of cloud computing, chiefly through a guided exploration of the main technological solutions available from the public Cloud, focusing on resources and services oriented to data storage, analysis, visualization and machine learning.
- Applying knowledge and understanding. Based on the operating knowledge acquired, students will develop an effective "toolset" of practical, application-oriented skills in leveraging the Cloud to cater for the typical needs of a data scientist, i.e. processing large datasets with a view to revealing meaningful patterns and relationships. Cloud implementations of state-of-the-art tools and frameworks such as MapReduce/Hadoop or TensorFlow will be employed.
- Making judgements. The student will develop the ability to choose the suitable Cloud-based resource for the Data Science scenario of interest, properly estimating the ensuing costs and performance gains, as well as consciously assessing the tradeoffs involved.
- Communication skills. The student will acquire the communication skills required to express and discuss, at a rigorous technical level, the benefits and (mostly cost-related) downsides of the Cloud for Data Science applications. In addition, the student will gain the ability, for presentation purposes, to effectively highlight the features of very large datasets by means of cloud-based visualization services.
- Learning skills. Students will become capable of profitably consulting technical documentation concerning Data Science-oriented Cloud services, in order to put them to effective use.
Course Structure
Lectures will mainly consist of live sessions on using the Cloud for data analysis and machine learning. These sessions will be carried out by the lecturer and replicated, with suggested variations, by the students, on available equipment. Laboratory practice aims at enabling students to refine their understanding of the technologies presented and acquire autonomous operating skills. As a framework and guidance, lecture notes will be displayed during lectures and shared with students. Notes will provide a precise record of the material presented, as well as pointers to the required reference technical documentation.
Should teaching be carried out in mixed mode or remotely, it may be necessary to introduce changes with respect to previous statements, in line with the programme planned and outlined in the syllabus. Learning assessment may also be carried out online, should the conditions require it.
Required Prerequisites
Fundamentals of data analysis and machine learning. Basic skills in using a desktop computing environment and the Web.
Attendance of Lessons
Attending classes is not mandatory but strongly recommended.
Detailed Course Content
This course aims at enabling the data scientist to put into practice, on the public Cloud, principles and methodologies learnt in courses concerned with data storage, processing, analysis, and machine learning. Indeed, in these areas, present-day industrial and enterprise applications typically require storage volumes, computing power and bandwidth at a scale impossible or (even for large organizations) impractical to attain with proprietary equipment on premises. In realistic Data Science scenarios, the data scientist can therefore hardly avoid resorting to the Cloud, i.e. storage and computing services offered by third-party providers over the public Internet, with a pay-per-use cost model.
In a nutshell, quoting reference [2], we may say that: “The Cloud turbocharges Data Science”.
Google Cloud Platform (GCP) is the platform of choice, for its ease of use and free availability to students.
A list of the main topics treated in the course follows.
- SQL on Google Cloud and BigQuery: performing structured queries on BigQuery and Cloud SQL. Importing data from CSV files.
- Processing big data with Google Cloud Shell: installing and using a Unix-based shell.
- Data acquisition into Google Cloud: downloading selected data from a large public data set over the internet, and processing it on GCP.
- Google Cloud Dataflow: processing a real-time, real-world data set, and storing the results on the cloud. Case study: real-time geospatial data.
- Visualization with Google Looker Studio: visualizing data stored in Google Cloud; visualizing real-time geospatial data.
- Data Analysis on Google Cloud: analysis with BigQuery versus exploratory data analysis with cloud notebooks.
- Evaluating a Data Model: partitioning a data set into a training set and a test set; developing and evaluating prediction models.
- Machine Learning with Spark on the cloud.
- MapReduce and Hadoop on the cloud: exploiting parallelism and machine clusters.
- Machine Learning and Data Discovery using Databricks Notebooks.
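As an illustration of the "Evaluating a Data Model" topic above, the following is a minimal sketch of partitioning a data set into a training set and a test set and evaluating a prediction model on the held-out part. It is not taken from the course material: the toy data, the function names, the 80/20 split and the mean-value baseline model are all assumptions chosen for brevity, and it uses only the Python standard library rather than any Cloud service.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_mean_predictor(train):
    """Baseline model (assumption): always predict the mean training target."""
    targets = [y for _, y in train]
    return sum(targets) / len(targets)


def mean_absolute_error(prediction, test):
    """Average absolute difference between the constant prediction and test targets."""
    return sum(abs(y - prediction) for _, y in test) / len(test)


# Hypothetical toy data: (feature, target) pairs.
data = [(x, 2.0 * x + 1.0) for x in range(100)]

train, test = train_test_split(data)
model = fit_mean_predictor(train)          # fitted on the training set only
mae = mean_absolute_error(model, test)     # evaluated on unseen test rows
print(f"train={len(train)} test={len(test)} MAE={mae:.2f}")
```

The point of the sketch is that the model never sees the test rows during fitting, so the error it reports is an estimate of performance on new data.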
Textbook Information
- Google Inc. Student Training: Kick-Start Your Cloud Trainings. https://edu.google.com/programs/students/training.
- Lakshmanan, V. Data Science on the Google Cloud Platform. O'Reilly Media, Inc. 2018.
- Lecture notes, to be made available through the Studium portal or the University's Teams platform.
Course Planning
No. | Subjects | Text References |
---|---|---|
1 | Google Cloud (GC): Performing structured queries on BigQuery | Lecture notes |
2 | GC: Performing structured queries on Cloud SQL | Lecture notes |
3 | Processing big data with a cloud (Unix) shell | Lecture notes |
4 | Processing big data with a cloud (Unix) shell | Lecture notes |
5 | GC: Importing big data from CSV files | Lecture notes |
6 | Downloading large public data sets to GC | Lecture notes |
7 | GC: processing a real-time, real-world data set | Lecture notes |
8 | Case study: real-time geospatial data on GC | Lecture notes |
9 | GC Looker Studio: Visualizing data from Google Cloud SQL | Lecture notes |
10 | Data Analysis and Google BigQuery | Lecture notes |
11 | GC notebooks for rapid exploratory data analysis | Lecture notes |
12 | Machine Learning (ML) with Spark on the Cloud | Lecture notes |
13 | ML with Spark on GC | Lecture notes |
14 | MapReduce and Hadoop on the Cloud: exploiting parallelism and machine clusters | Lecture notes |
15 | Machine Learning and Data Discovery using Databricks Notebooks | Lecture notes |
Learning Assessment
Learning Assessment Procedures
Laboratory session performed individually by the student in the presence of the lecturer. The student will be required to carry out the Cloud-based procedures demonstrated during the lectures, as well as to discuss their significance, and critically assess their outcomes. Learning assessment may also be carried out online, upon indication of the Academic Senate, should the conditions require it.
Students with disabilities and/or specific learning disorders (DSA) must contact the lecturer and the DMI CInAP contact person sufficiently in advance of the exam date to communicate that they intend to take the exam with the appropriate compensatory measures.
Grades will normally be given using the following criteria:
- not passed: the student has not acquired basic notions and cannot solve simple practical exercises
- 18-20: the student barely possesses the basic notions, and has difficulties in tackling practical exercises
- 21-24: the student understands the basic notions and explains them acceptably, but correctly solves only simple practical exercises
- 25-27: the student has acquired satisfactorily most course contents, is able to establish relationships among them, and solves most practical exercises with few mistakes
- 28-30 cum laude: the student masters all course contents and is able to fully and critically establish relationships among them, and manages to thoroughly solve all practical exercises with no significant mistakes.
Examples of frequently asked questions and/or exercises
The student will choose one or more datasets, and prepare a project demonstrating the technologies presented in the course. Typically, queries for BigQuery, notebooks, and data ingestion procedures are expected. Datasets and the project contents should be agreed in advance with the course instructor.