# MSB105 Data Science

## Course description for academic year 2024/2025

### Contents and structure

This course prepares students for quantitative empirical research. It does not focus on econometric models, statistical tests and the like, but on how to get raw data into a format to which such models and tests can be applied. Courses in statistics and econometrics routinely expose students to a variety of datasets, but these are highly structured and ready to use. Students who undertake their own independent quantitative research soon discover that real-world datasets are "dirty" and a far cry from the well-structured datasets they have previously worked with. The main objective of the course is to give students the skills and tools needed to get from a "dirty" raw dataset to a well-structured "tidy" dataset in a reproducible and well-documented manner. Such tidy datasets are well suited for use in statistical models and tests.

The course is also concerned with the well-known principle that all empirical research should be reproducible. New tools have made it easier to live up to this principle. The course introduces students to these tools and gives them hands-on training in writing structured documents that contain all data manipulation, modelling and statistical tests. Everything should be contained in one document (or several connected documents), but hidden so as not to clutter the final paper. Given the right tools, other researchers should be able to generate the same paper and results from the document and the raw data files. The hidden code in the document performs all the required manipulations of the raw data, runs all models and tests, and generates exactly the same results as in the original. Mastering these tools will be of great value to students who decide to write an empirical master's thesis.

In addition to the above, the course covers topics such as the use of a distributed version control system, some R programming, the presentation of results via tables, graphics and maps, and simple manipulations of geographical structures.

### Learning Outcome

**Knowledge**

Upon completion, the students should have:

- knowledge of the principles governing a tidy dataset and the general strategies to get from dirty raw data to a highly structured tidy dataset
- knowledge of the advantages of reproducibility in research and the repercussions of ignoring it
- knowledge of the principles of version control systems
- knowledge of some basic principles of the R statistical programming language
- knowledge of some of the principles for informative and attractive presentation of data and results by graphics

**Skills**

Upon completion of the course, students should:

- be familiar with the RStudio IDE
- be able to solve simple R programming tasks
- be able to read and understand (some) R error messages
- be able to use the integrated R help system
- be able to write structured documents containing R code (Quarto documents)
- be able to generate different output formats from structured documents (HTML, Microsoft Word, PDF via LaTeX)
- be able to write mathematical symbols and equations in Quarto (LaTeX math syntax)
- be able to control, through code, the visibility of programming code, plots, tables and results in documents
- be able to present the results of regression models in dynamic tables
- be able to use the "tidyverse" tools to generate tidy data from "dirty" raw data
- be able to use the concept of "pipes" to write clear and compact R code
- be able to use the R package ggplot2 to generate graphical representations of data and results
- be able to use the git version control system
- be able to use git together with online resources as a distributed version control system. Distributed version control systems have the potential to make the writing of multi-author papers and theses both safer and more convenient.
- be able to use digital tools to simplify citations and the construction of a list of references

**General Competence**

Upon completion of the course, the students will be able to use a distributed version control system, do some R programming, present results via tables, graphics and simple maps, and use the citation system built into RStudio.

### Entry requirements

None

### Recommended previous knowledge

None

### Teaching methods

The teaching will be a combination of lectures and more hands-on problem-solving seminars. The students will be required to write a number of short term papers as Quarto documents in which the tidying of raw data is a main topic.

If deemed expedient the course will be taught in English.

### Compulsory learning activities

None

### Assessment

During the course, the students will build a portfolio of short papers and other exercises on GitHub. The portfolio will be graded pass or fail.

### Examination support material

All study aids are allowed
