The aim of the assignment is to introduce you to the analysis of routine data sets (“wild datasets”). You will need to explore issues such as writing data dictionaries; assessing data quality; exploring the data using visual tools and performing some data wrangling; considering and performing data analysis; and writing a comprehensive report that includes an account of your findings and summarises your recommendations.
For the assignment, you will be given a general scenario and a suggested raw dataset. You will need to explore the given problem in more depth – this includes finding more data (datasets) relevant to the task.
You will be working in groups to produce both group and individual deliverables.
Project methodology
Data can be the product of a meticulously planned study, or it can be a side-product of practice (“wild datasets”). While planned studies typically yield well-defined, clean data, they are expensive in terms of money and other resources. Such effort is not sustainable in the long term.
Data produced as part of routine activities or observation are, on the other hand, readily available at minimal cost. However, such data are typically incomplete, may contain errors, and require cleansing and transformation before they can be used beyond their primary purpose.
The framework we will be using for this assignment was developed by industry as the Cross-Industry Standard Process for Data Mining (CRISP-DM). This process has six phases:
Business understanding
Before any attempt to collect or analyse data, you need a clear idea of why you are doing the exercise – understand the purpose. The main components are:
• Determine business objectives
– Initial situation/problem etc.
– Explore the context of the problem and context of the data collection (…types of organisations generating the data; processes involved in the data creation...)
• Assess situation
– Inventory of resources (personnel, data, software)
– Requirements (e.g. deadline), constraints (e.g. legal issues), risks
Understanding your business will help you determine the scope of the project, the timeframe, the budget, etc.
NB: The direction of your analysis is determined by your business needs. An attempt to analyse a dataset without prior identification of the main directions would lead to extensive exploration. While this may be justified in some cases, in real business it is seldom required. You are NOT doing academic research aiming to create new knowledge – you are trying to get answers to drive your business decisions!
Data understanding
The next step is to look at what data is needed (and available) and to write data definitions, so that we know exactly what we are talking about. This is especially important when aggregating apparently identical data: the definitions may not be the same. Blood pressure readings may look exactly alike, yet it matters whether a value was acquired at the ICU via an intra-arterial cannula or is a casual self-monitoring measurement the patient took at home. Nailing down the date format is equally important, especially when aggregating data from different sources: 02/03/12 can mean 2nd of March 2012, 3rd of February 2012, or 12th of March 2002. Explicitly describe any coding schemas, etc.
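As a small illustration of the date ambiguity, a Python sketch (pandas assumed); the same string parses to three different dates depending on the convention, so the format must always be stated explicitly:

    import pandas as pd

    s = pd.Series(["02/03/12"])

    # The convention must be pinned down in the data definition:
    print(pd.to_datetime(s, format="%d/%m/%y"))  # 2012-03-02 (day first)
    print(pd.to_datetime(s, format="%m/%d/%y"))  # 2012-02-03 (month first)
    print(pd.to_datetime(s, format="%y/%m/%d"))  # 2002-03-12 (year first)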
• Collect initial data
– Acquire data listed in project resources
– Report locations of data, methods used to acquire them, ...
• Describe data
– Examine "surface" properties
– Report for example format, quantity of data, ... → Data dictionary
– NB: the data dictionary summarises your knowledge of each piece of data. This description can be considered part of the dataset: each piece of data comes with metadata describing its meaning, coding, context of collection, etc. In many cases you will be given these descriptions along with the dataset.
• Explore data
– Examine central tendencies, distributions, look for patterns (visualisations etc.)
– Report insights suggesting examination of particular data subsets (data selection)
• Determine data quality (consider the dimensions of data quality; a sketch illustrating some of these checks follows this list)
– Completeness
– Uniqueness
– Timeliness
– Validity
– Accuracy
– Consistency
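By way of illustration, a minimal pandas sketch of the describe/explore/quality steps above; the file name jobs.csv and the columns salary and posted_date are invented for the example (plotting assumes matplotlib):

    import pandas as pd

    df = pd.read_csv("jobs.csv")  # hypothetical extract

    # Describe data: surface properties that feed the data dictionary
    print(df.shape)                  # quantity of data (rows, columns)
    print(df.dtypes)                 # storage format per column

    # Explore data: central tendencies and distributions
    print(df.describe(include="all"))
    df["salary"].plot(kind="hist")   # quick look at one distribution

    # Data quality: a few illustrative checks per dimension
    print(df.isna().mean())          # completeness: share missing per column
    print(df.duplicated().sum())     # uniqueness: fully duplicated rows
    valid = df["posted_date"].between("2024-01-01", "2024-12-31")
    print(valid.mean())              # validity, assuming ISO date strings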
NB: this is an initial exploration – scouting the problem space. It helps you understand what data is available and align your approach with the business objectives and the data at hand. At the same time, this phase can help verify whether the project is viable (feasibility) and refine the project scope, budget, resources, etc.
This phase is very different from a typical prospective research approach, where you design the study so that you always know what you are getting…
Data preparation
Typically, the data you get is not in the right format for analysis (it was collected for other purposes) and needs to be pre-processed.
• Select data
– Relevance to the data mining goals
– Quality of data
– Technical constraints, e.g. limits on data volume
• Clean data
– Raise data quality if possible
– Selection of clean subsets
– Insertion of defaults
• Construct data
– Derived attributes (e.g. age = NOW – DOB, possibly followed by coding of age into buckets; see the sketch after this list) – do not forget to add these attributes to your data dictionary!
• Integrate data
– Merge data from different sources
– Merge data within source (tuple merging)
• Format data
– Data must conform to the requirements of the initially selected mining tools (e.g. Weka and Disco require different input formats).
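A small pandas sketch of the construct/clean steps, assuming a toy dob column (the real attributes and formats will come from your own dataset):

    import pandas as pd

    df = pd.DataFrame({"dob": ["1990-05-01", "1985-11-23", None]})  # toy records
    df["dob"] = pd.to_datetime(df["dob"])

    # Construct data: derived attribute age = NOW - DOB (approximate years)...
    df["age"] = (pd.Timestamp.now() - df["dob"]).dt.days // 365

    # ...possibly followed by coding into buckets; add both to the data dictionary
    df["age_band"] = pd.cut(df["age"], bins=[0, 25, 35, 50, 120],
                            labels=["<25", "25-34", "35-49", "50+"])

    # Clean data: select the clean subset (rows with a known DOB)
    clean = df.dropna(subset=["dob"])
    print(clean)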
Modelling
This phase goes hand in hand with data preparation. Here you select which analytic techniques you plan to use, in which sequence, etc. Once you have the analysis design, you execute it.
• Select modelling technique
– Finalise the methods selection with respect to the characteristics of the data and purpose of the analysis
– E.g., linear regression, correlation, association detection, decision tree construction…
• Generate test design
– Define your testing plan – what needs to be done to verify the results from analysis (verify the validity of your model). E.g.:
• Separate test data from training data (in case of supervised learning)
• Define quality measures for the model
• Build model
– List parameters and chosen values
– Assess model
At the end of the Data preparation/Modelling phase you have a set of results coming from the analysis (you have a model).
NB: this needs to be assessed and evaluated from a technical point of view (to mitigate issues such as overfitting); a minimal sketch of this workflow follows.
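A minimal scikit-learn sketch of the train/test workflow; the generated dataset stands in for your prepared features and labels:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=500, random_state=42)  # stand-in data

    # Test design: hold test data back from the training data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Build model: list the parameters and chosen values (here: max_depth=5)
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_train, y_train)

    # Assess model: a large gap between training and test accuracy
    # suggests overfitting
    print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
    print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))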
Evaluation
Here you evaluate the results (model) from the business perspective (Did we learn something new? How do the results fit into knowledge we already have? Does the predictive model work? etc.).
• Evaluate results from business perspective
– Test models on test applications if possible
• Review process
– Determine if there are any important factors or tasks that have been overlooked
• Determine next steps (Recommendations)
– Depending on your analysis (results, interpretations) you need to recommend what the next step will be. In general, the next step can be:
• Deploy the solution (you reached a stage where you have a viable solution)
• Kill the project (you exhausted all meaningful options and decided that continuing the project is not viable/feasible from the business point of view)
• Go into the next iteration – improve the model, or build an entirely new one.
NB: Do not jump to decisions without the analytic evidence to support such decisions (recommendations).
Deployment
In this phase you conclude the project.
• Plan deployment
– Determine how results (discovered knowledge) are effectively used to reach business objectives
• Plan monitoring and maintenance
– Results become part of day-to-day business and therefore need to be monitored and maintained.
• Final report
• Project review
– Assess what went right and what went wrong, debriefing
NB: Deployment can be the launch of a new project with its own problems. E.g. you have a static data extract you can use to develop a solution. Once you have a viable solution, deploying it will require connecting to live data input feeds. This opens a whole new set of issues to be solved:
• Automate data extraction
• Automate semantic interoperability and data linkage
• Automate data quality monitoring (see the sketch after this list)
• Design, develop and deploy security context
• Etc.
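For instance, a hypothetical pandas sketch of an automated quality check that could run on each incoming data feed (the metrics and thresholds are assumptions, not a prescribed design):

    import pandas as pd

    def quality_report(df: pd.DataFrame) -> dict:
        """Recomputed on every new feed; alert when a metric drifts."""
        return {
            "rows": len(df),
            "missing_share": float(df.isna().mean().mean()),  # completeness
            "duplicate_rows": int(df.duplicated().sum()),     # uniqueness
        }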
Caveats
The CRISP-DM framework describes the phases in a rather linear (cyclic) fashion. In theory, it can be done that way. In reality, however, this is an exploratory process that frequently proceeds by trial and error. You will work with the data and use frequent visualisation to “see” the patterns; then you confirm what you “see” with more formal statistics.
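A toy illustration of that loop, with generated data standing in for yours (numpy, matplotlib and scipy assumed):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 0.5 * x + rng.normal(size=200)  # toy data with a weak linear pattern

    plt.scatter(x, y)                   # "see" the pattern first...
    plt.show()

    r, p = pearsonr(x, y)               # ...then confirm it formally
    print(f"r = {r:.2f}, p = {p:.3g}")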
General scenario
A US consulting company was engaged to analyse the job market for people with business analyst qualifications. They were able to scrape data from LinkedIn on job listings posted in 2024.
Your task is to look at patterns related to jobs requiring a Business Analyst qualification (such as what jobs require this skill, which employers look for people with this skill, where the jobs are located, what other skills are listed alongside the Business Analyst skill, etc.).
You will have to deal with several challenges, such as the size of the datasets, decomposing skills listed in a single field, matching skills to job descriptions, etc.
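To make the skills-decomposition challenge concrete, a small pandas sketch; the column names job_id and skills are invented, and the real extract will have its own schema:

    import pandas as pd

    jobs = pd.DataFrame({
        "job_id": [1, 2],
        "skills": ["Business Analyst, SQL, Excel", "Business Analyst, Agile"],
    })

    # Decompose the single skills field into one row per (job, skill) pair
    skills = jobs.assign(skill=jobs["skills"].str.split(",")).explode("skill")
    skills["skill"] = skills["skill"].str.strip()

    # Which skills are listed alongside Business Analyst?
    print(skills["skill"].value_counts())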
You will need to explore the problem space (reading and mind maps) and declare a narrowed-down focus (the time and resource limitations do not allow a complete study). You will need to decide how to work with a collection of large data sources, extract the relevant parts and possibly find additional data (from public sources). In this course you are expected to do the first iteration and recommend next steps at the end of it – this typically leads to planning the 2nd iteration of the project. NB: you may not be able to reach a stage where you have a business solution, so do not jump to conclusions!
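One common way to cope with the dataset size is to read the file in chunks and keep only the relevant rows; a sketch, with a hypothetical file name and column name:

    import pandas as pd

    relevant = []
    for chunk in pd.read_csv("linkedin_jobs_2024.csv", chunksize=100_000):
        mask = chunk["skills"].str.contains("Business Analyst",
                                            case=False, na=False)
        relevant.append(chunk[mask])

    subset = pd.concat(relevant, ignore_index=True)
    subset.to_csv("ba_jobs.csv", index=False)  # smaller extract for analysis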
Business understanding
Explore the dataset and the source of this data. You will discuss this in your group and document the discussion by drawing mind maps (individual maps as preparation for the group discussion, then a final group mind map representing your understanding of the problem).
Annotate (CRAAP) relevant publications (you annotate 2 publications, but you read as many as necessary). Brainstorm and summarise your findings in the group. Decide on the focus for your analysis – what factors you expect to go into the model and why.
Write a brief justification of the project – make your decisions explicit.