Data Warehousing Overview

data-warehouse2

A data warehouse is an analytical database that is used as the foundation of a decision support system. It is designed for large volumes of read-only data, providing intuitive access to information that will be used in making decisions.

A data warehouse is created as an ongoing commitment by the organization to ensure that the appropriate data is available to the appropriate end user at the appropriate time.

Very large databases and systems require special capabilities and tools:

  • To analyse large quantities of data
  • To access data from multiple systems

 

Three key techniques

1. Data warehousing

2. Data mining

3. Tools for accessing internal databases through the Web


 

Data warehouse:

Stores current and historical data from many core operational transaction systems

Consolidates and standardizes information for use across the enterprise; the data cannot be altered

A data warehouse system provides query, analysis, and reporting tools

Data marts:

Subset of data warehouse

Summarized or highly focused portion of firm’s data for use by specific population of users

Typically focuses on single subject or line of business


 

Online Transaction Processing (OLTP)

OLTP – Online Transaction Processing – is the gathering of input data, processing that data, and updating existing information to reflect the gathered and processed data.

Databases which support these functions are often called operational databases (e.g. customer db, product db).

These are the systems that support the day-to-day running of the organization.

Examples:

  • Point of sale in stores
  • ATM and bank operations
  • e-commerce (Amazon, iTunes, etc.)

Some characteristics:

  • Thousands of operations per second
  • Repeated operations dealing with small amounts of data (insert, update, remove)
  • Real-time processing

Online Analytical Processing (OLAP)

This is the manipulation of information to support decision making.

Used in conjunction with data warehouses.

A data warehouse is a special form of database that contains information gathered from operational databases for the purpose of supporting decision making.

Data warehouses do not support OLTP.

Supports multidimensional data analysis

Viewing data using multiple dimensions

Each aspect of information (product, pricing, cost, region, time period) is a different dimension

E.g., how many washers were sold in the East in June compared with other regions?

OLAP enables rapid, online answers to ad hoc queries
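
As an illustration, the washer question above can be answered with a single aggregation over a made-up sales table; the data frame and column names below are purely illustrative, a minimal sketch in R rather than a real OLAP engine:

# Toy sales data: product, region and month are the "dimensions", units is the measure
sales <- data.frame(
  product = c("washer", "washer", "washer", "dryer"),
  region  = c("East", "West", "East", "East"),
  month   = c("June", "June", "May", "June"),
  units   = c(120, 80, 95, 40)
)

# Slice the cube on product = "washer" and month = "June", then total units by region
aggregate(units ~ region,
          data = subset(sales, product == "washer" & month == "June"),
          FUN = sum)

A real OLAP engine pre-aggregates cubes like this so that the same kind of answer comes back quickly over millions of rows.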

OLAP systems provide users with the capabilities they need to analyse many different aspects of an organization's activities and performance.

Examples

  • How well is a certain product selling in different regions?
  • How has the product evolved in the market since its introduction?
  • Which are the top ten selling products in each region, and globally?

Some characteristics:

  • A small number of queries per day, compared with OLTP systems
  • A large amount of data processed in each query, in order to produce a small output
  • Queries are hard to predict and, in general, much more diverse than in OLTP systems


 

Data warehouses are optimised for analytic access patterns. Analytic queries generally select a small number of specific fields and rarely, if ever, use ‘select *’, which is more common against operational databases. Because of these differences in access patterns, operational databases (loosely, OLTP) benefit from a row-oriented DBMS, whereas analytics databases (loosely, OLAP) benefit from a column-oriented DBMS. Unlike operational systems, which maintain only a current snapshot of the business, data warehouses generally maintain full history, implemented through ETL processes that periodically migrate data from the operational systems into the data warehouse.
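
To make the last point concrete, here is a minimal sketch in R of that periodic ETL step; the table and column names are invented for illustration, and a real process would of course run against the operational and warehouse databases themselves:

# Operational table: holds only the current snapshot of the business
operational <- data.frame(customer_id = c(1, 2),
                          balance     = c(150, 75))

# Warehouse table: starts empty and only ever grows
warehouse <- data.frame()

# One ETL load: stamp the snapshot with a load date and append it, never overwrite
etl_load <- function(warehouse, operational) {
  snapshot <- transform(operational, load_date = Sys.Date())
  rbind(warehouse, snapshot)
}

warehouse <- etl_load(warehouse, operational)   # run periodically; history accumulates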

Data Governance Challenges, Solutions and Products


Data governance is a combination of people, processes and technology that drives high-quality, high-value information. The technology portion of data governance combines data quality, data integration and master data management to ensure that data, processes, and people can be trusted and accountable, and that accurate information flows through the enterprise driving business efficiency.

Some common data governance goals are: improving customer segmentation and customer view; improving reporting and decision making confidence; minimizing the costs associated with low quality data re-work; and addressing government regulations.


Obstacles: Data Governance Challenges

Implementing a successful data governance program has several technical challenges. First, companies need to reliably integrate data sources and implement data quality measurement and improvement features. Second, companies need to establish, monitor and improve key quality and performance metrics that support data governance principles. Finally, collaboration is required so that teams of people from different departments can work together on, and be accountable for, data quality issues.

All of these disciplines are required, otherwise there is a risk of low quality data that can impede business performance.


Solution for Data Governance

A powerful data management solution is needed, so that data governance is improved across the information assets. You should address data governance challenges through a unified data management platform, key performance indicator tracking, and data quality issue collaboration.

1. Unified Data Management Platform

Improve data governance by making it easy to start and manage data management processes:

  • Access and improve data in hours rather than months with an intuitive application.
  • Upgrade your technology as a company matures its data governance strategy. Data integration, data quality and MDM are controlled from one familiar user interface, not a series of unrelated ones.
  • Deploy data quality processes and services seamlessly across teams of people, lines of business, applications, and systems.

2. Key Performance Indicator Tracking

Tracking data quality metrics is a standard data governance practice in many corporations. Providing feedback on the business impact of data management is the only way to change the hearts and minds of those who enter data into the system. Showing the impact of poor data quality is the best way to improve data quality at the source, by:

  • Publishing web-based reports that track data quality metrics. Users can log in with a web browser and instantly follow key indicators of data quality.
  • Creating custom Adobe Acrobat, XML and Excel reports that allow the data governance team to shine a light on the data governance process.
  • Tracking only the metrics you want to track. Reports that are as custom as your data.

3. Data Quality Issue Collaboration

Once issues are identified, business and technical users can immediately begin to put processes in place to cleanse data, by:

  • Cleaning and normalising all sorts of data, including supply chain, ERP, CRM and transactional data. Diverse data quality processes will help companies meet compliance mandates and reduce corporate risk.
  • Standardising data using a wider variety of resources than any other vendor. Built-in data integration opens up a world of reference data from commercially available, government agency or public domain sources.

Turn to the Talend community for processes, regular expressions, and other shortcuts to achieving data quality, and decrease the time spent managing data anomalies and their impact on the organization.


Related Products

For Data Quality

These products provide a powerful data quality solution that delivers end-to-end profiling, cleansing, matching and monitoring capabilities with the ability to identify anomalies, standardise data, resolve duplicates and monitor data quality over time. Data consistency is improved as you migrate to new systems.

Where you need to integrate products to make a solution, you can improve your productivity through a unified platform – a common code repository and tooling for scheduling, metadata management, data processing and service enablement.

For Data Management

These products turn disparate, duplicate sources of data into trusted stores of consolidated information, so a business can be more responsive and confident in daily decisions.

For Master Data Management

These products provide a powerful and flexible master data management (MDM) solution to model and master any domain, turning disparate, inconsistent information across a business into a single, unified view of information.

Business Intelligence: Project Management within a company

 


 

What is Business Intelligence ?

Business Intelligence (BI) is a broad category of skills, technologies, applications and practices that enhance the use of information for better decision-making. The purpose of BI is to transform raw business data into meaningful and useful information that enables people to make informed, fact-based decisions.


 

The Company’s Project Goals & Objectives

  • Deliver high-quality, relevant, and accessible data and analytics to the company's senior-level management, according to their informational requirements
  • Provide access to meaningful information to enable more effective strategic, tactical, and operational insights and fact-based decision-making
  • Optimise company data for easy comprehension and fast query performance
  • Define and publish consistent definitions for company data assets

 


The Company’s Project Scope

Justification

  • Enable fact-based decision-making by providing the company with access to a concise, trusted repository of company information
  • Provide access to easy-to-use reporting and analysis tools that can help users gain better insights, discover issues, and spot trends quickly

Deliverables

  • High level enterprise data warehouse bus matrix
  • Business-process oriented dimensional data models
  • Detailed dimension and fact table design
  • Extraction, transformation, and loading of source system data into dimensional enterprise data warehouse data marts
  • Web-based reporting tools to develop reports, dashboards, and scorecards

Milestones

  • Conduct interviews and gather project requirements
  • Create enterprise data warehouse bus matrix
  • Design dimensional models for key high-priority business processes
  • Populate data warehouse dimension and fact tables
  • Create a business representation of the data warehouse to facilitate easier user interaction with company data
  • Create web-based reports, dashboards, and scorecards

 


To sum up, the successful completion of such a project should include the following actions:

  • Create Enterprise bus matrix for high-priority company business processes
  • Complete Logical dimensional models for key company business processes
  • Create Data definitions in the Data Repository
  • Create Dimension tables and fact tables for the company business processes
  • Gather Business requirements and create logical dimensional models for key financials business processes
  • Form Report Writing Implementation Team to develop and deliver company business intelligence reports and dashboards
  • Complete Physical design of financial ledger dimensions and fact tables
  • Deliver financial ledger dimensions and fact table
  • Deliver key company business intelligence reports and dashboards

Statistical Analysis – An example

CA 3 – Statistical Analysis



Q1: Lift Analysis

Please calculate the following lift values for the table correlating burger and chips below:

  • Lift(Burger, Chips)
  • Lift(Burgers, ^Chips)
  • Lift(^Burgers, Chips)
  • Lift(^Burgers, ^Chips)

Please also indicate whether each of your answers suggests independence, positive correlation, or negative correlation.

Table 1 – Burger/Chips contingency table (transaction counts, as used in the answer below): Burger & Chips = 600, Burger & ^Chips = 400, ^Burger & Chips = 200, ^Burger & ^Chips = 200; total = 1400.

Answer Q1:

Lift(Burger,Chips) = c(Burger -> Chips)/s(Chips) = s(Burger u Chips)/(s(Burger) x s(Chips))

Lift(Burger,Chips) = (600/1400)/((1000/1400)*(800/1400)) = (3/7)/((5/7)*(4/7)) = (3*7)/(5*4) = 21/20 = 1.05

Burger and Chips are positively correlated since Lift(Burger, Chips) > 1

 

Lift(Burger, ^Chips) = c(Burger -> ^Chips)/s(^Chips) = s(Burger u ^Chips)/(s(Burger) x s(^Chips))

Lift(Burger, ^Chips) = (400/1400)/((1000/1400)*(600/1400)) = (2/7)/((5/7)*(3/7)) = (2*7)/(5*3) = 14/15 = 0.93

Burger and ^Chips are negatively correlated since Lift(Burger, ^Chips) < 1

 

Lift(^Burgers,Chips) = c(^Burgers -> Chips)/s(Chips) = s(^Burgers u Chips)/(s(^Burgers) x s(Chips))

Lift(^Burgers,Chips) = (200/1400)/((400/1400)*(800/1400)) = (1/7)/((2/7)*(4/7)) = (1*7)/(2*4) = 7/8 = 0.875

^Burgers and Chips are negatively correlated since Lift(^Burgers, Chips) < 1

 

Lift(^Burgers, ^Chips) = c(^Burgers -> ^Chips)/s(^Chips) = s(^Burgers u ^Chips)/(s(^Burgers) x s(^Chips))

Lift(^Burgers, ^Chips) = (200/1400)/((400/1400)*(600/1400)) = (1/7)/((2/7)*(3/7)) = (1*7)/(2*3) = 7/6 ≈ 1.17

^Burgers and ^Chips are positively correlated since Lift(^Burgers, ^Chips) > 1
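
As a quick cross-check, the four lift values can be reproduced in R; this is just a sketch using the counts recovered from the table above:

# lift(A, B) = s(A u B) / (s(A) * s(B)), computed from raw transaction counts
lift <- function(n_ab, n_a, n_b, n_total) {
  (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))
}

n <- 1400
lift(600, 1000, 800, n)   # Lift(Burger, Chips)    = 1.05
lift(400, 1000, 600, n)   # Lift(Burger, ^Chips)   ≈ 0.93
lift(200, 400, 800, n)    # Lift(^Burgers, Chips)  = 0.875
lift(200, 400, 600, n)    # Lift(^Burgers, ^Chips) ≈ 1.17

The same function reproduces the Q2 results below: lift(100, 300, 300, 900), lift(200, 300, 600, 900), lift(200, 600, 300, 900) and lift(400, 600, 600, 900) all return 1.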

 

Q2: Lift Analysis

Please calculate the following lift values for the table correlating shampoo and ketchup below:

  • Lift(Ketchup, Shampoo)
  • Lift(Ketchup, ^Shampoo)
  • Lift(^Ketchup, Shampoo)
  • Lift(^Ketchup, ^Shampoo)

Please also indicate whether each of your answers suggests independence, positive correlation, or negative correlation.

Table 2 – Ketchup/Shampoo contingency table (transaction counts, as used in the answer below): Ketchup & Shampoo = 100, Ketchup & ^Shampoo = 200, ^Ketchup & Shampoo = 200, ^Ketchup & ^Shampoo = 400; total = 900.

Answer Q2:

Lift(Ketchup,Shampoo) = c(Ketchup -> Shampoo)/s(Shampoo) = s(Ketchup u Shampoo)/(s(Ketchup) x s(Shampoo))

Lift(Ketchup,Shampoo) = (100/900)/((300/900)*(300/900)) = (1/9)/((3/9)*(3/9)) = (1*9)/(3*3) = 9/9 = 1

Ketchup and Shampoo are independent, since Lift(Ketchup,Shampoo) = 1

 

Lift(Ketchup, ^Shampoo) = c(Ketchup -> ^Shampoo)/s(^Shampoo) = s(Ketchup u ^Shampoo)/(s(Ketchup) x s(^Shampoo))

Lift(Ketchup, ^Shampoo) = (200/900)/((300/900)*(600/900)) = (2/9)/((3/9)*(6/9)) = (2*9)/(3*6) = 18/18 = 1

Ketchup and ^Shampoo are independent, since Lift(Ketchup, ^Shampoo) = 1

 

Lift(^Ketchup,Shampoo) = c(^Ketchup -> Shampoo)/s(Shampoo) = s(^Ketchup u Shampoo)/(s(^Ketchup) x s(Shampoo))

Lift(^Ketchup,Shampoo) = (200/900)/((600/900)*(300/900)) = (2/9)/((6/9)*(3/9)) = (2*9)/(6*3) = 18/18 = 1

^Ketchup and Shampoo are independent, since Lift(^Ketchup, Shampoo) = 1

 

Lift(^Ketchup, ^Shampoo) = c(^Ketchup -> ^Shampoo)/s(^Shampoo) = s(^Ketchup u ^Shampoo)/(s(^Ketchup) x s(^Shampoo))

Lift(^Ketchup, ^Shampoo) = (400/900)/((600/900)*(600/900)) = (4/9)/((6/9)*(6/9)) = (4*9)/(6*6) = 36/36 = 1

^Ketchup and ^Shampoo are independent, since Lift(^Ketchup, ^Shampoo) = 1

 

Q3: Chi Squared Analysis

Please calculate the following chi squared values for the table correlating burger and chips below (Expected values in brackets).

  • Burgers & Chips
  • Burgers & Not Chips
  • Chips & Not Burgers
  • Not Burgers and Not Chips

For the above options, please also indicate whether each of your answers suggests independence, positive correlation, or negative correlation.

Table 3 – Burger/Chips observed counts, with expected values in brackets: Burgers & Chips = 900 (800), Burgers & Not Chips = 100 (200), Chips & Not Burgers = 300 (400), Not Burgers & Not Chips = 200 (100).

Answer Q3:

Chi² = Σ (Observed – Expected)² / Expected

For Burgers & Chips, Chi² = (900 – 800)² / 800 = 10000/800 = 12.5

For Burgers & Not Chips, Chi² = (100 – 200)² / 200 = 10000/200 = 50

For Chips & Not Burgers, Chi² = (300 – 400)² / 400 = 10000/400 = 25

For Not Burgers and Not Chips, Chi² = (200 – 100)² / 100 = 10000/100 = 100

 

Chi² = Σ (Observed – Expected)² / Expected = 12.5 + 50 + 25 + 100 = 187.5

 

The Chi² value of 187.5 shows that Burgers and Chips are correlated, since it is well above 0.

We can test whether this correlation is positive or negative very simply by comparing the observed value with the expected one: if the observed value is greater than the expected value there is a positive correlation, and vice versa.

 

For Burgers & Chips, as expected value is 800 but 900 is observed we can say that Burgers & Chips are positively correlated.

For Burgers & Not Chips, as expected value is 200 but 100 is observed we can say that Burgers & Not Chips are negatively correlated.

For Chips & Not Burgers, as expected value is 400 but 300 is observed we can say that Chips & Not Burgers are negatively correlated.

For Not Burgers and Not Chips, as expected value is 100 but 200 is observed we can say that Not Burgers and Not Chips are strongly positively correlated.
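
The whole calculation can also be checked in R with the built-in chisq.test function; this is a sketch using the observed counts from the table above, where correct = FALSE turns off Yates' continuity correction so the statistic matches the manual sum:

observed <- matrix(c(900, 100,
                     300, 200),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Burgers", "Not Burgers"),
                                   c("Chips", "Not Chips")))
chisq.test(observed, correct = FALSE)   # X-squared = 187.5, df = 1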

 

Q4: Chi Squared Analysis

Please calculate the following chi squared values for the table correlating burger and sausages below (Expected values in brackets).

  • Burgers & Sausages
  • Burgers & Not Sausages
  • Sausages & Not Burgers
  • Not Burgers and Not Sausages

For the above options, please also indicate whether each of your answers suggests independence, positive correlation, or negative correlation.

Table 4 – Burger/Sausages observed counts, with expected values in brackets: Burgers & Sausages = 800 (800), Burgers & Not Sausages = 200 (200), Sausages & Not Burgers = 400 (400), Not Burgers & Not Sausages = 100 (100).

Answer Q4:

Chi² = Σ (Observed – Expected)² / Expected

For Burgers & Sausages, Chi² = (800 – 800)² / 800 = 0/800 = 0

For Burgers & Not Sausages, Chi² = (200 – 200)² / 200 = 0/200 = 0

For Sausages & Not Burgers, Chi² = (400 – 400)² / 400 = 0/400 = 0

For Not Burgers and Not Sausages, Chi² = (100 – 100)² / 100 = 0/100 = 0

 

Chi² = Σ (Observed – Expected)² / Expected = 0

 

The Chi² values show that Burgers & Sausages, Burgers & Not Sausages, Sausages & Not Burgers, and Not Burgers & Not Sausages are all independent, because every value equals 0.
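
The same check in R confirms the independent case (a sketch using the observed counts from the table above):

observed4 <- matrix(c(800, 200,
                      400, 100),
                    nrow = 2, byrow = TRUE)
chisq.test(observed4, correct = FALSE)   # X-squared = 0, p-value = 1: independent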

 

Q5: Lift and Chi Squared analysis and others

Under what conditions would Lift and Chi-squared analysis prove to be poor measures of the correlation/dependency between two events?

Please suggest another algorithm that could be used to rectify the flaw in Lift and Chi Squared?

 

Answer Q5:

Lift and Chi-squared analysis prove to be poor measures of the correlation/dependency between two events when there are too many null transactions, i.e. too many transactions that contain neither of the two events. There are, however, several null-invariant measures that can help with datasets that have this problem.

 

To rectify the flaw in Lift and Chi-squared analysis, we could use other, null-invariant measures:

  • AllConf(A,B)
  • Jaccard(A,B)
  • Cosine(A,B)
  • Kulczynski(A,B)
  • MaxConf(A,B)

In class, we discussed the Kulczynski measure, as it is the most widely used in industry and is generally the most accurate.
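
As an illustration, these null-invariant measures are simple to compute from the support counts; the helper below is a sketch, not a standard library function:

# n_ab = transactions containing both A and B; n_a, n_b = transactions containing each item
null_invariant <- function(n_ab, n_a, n_b) {
  c(all_conf = n_ab / max(n_a, n_b),
    jaccard  = n_ab / (n_a + n_b - n_ab),
    cosine   = n_ab / sqrt(n_a * n_b),
    kulc     = 0.5 * (n_ab / n_a + n_ab / n_b),
    max_conf = max(n_ab / n_a, n_ab / n_b))
}

null_invariant(600, 1000, 800)   # Burger/Chips counts from Q1
null_invariant(100, 300, 300)    # Ketchup/Shampoo counts from Q2

None of these values change if extra transactions containing neither item are added to the dataset, which is exactly the null-invariance property that Lift and Chi-squared lack.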

R Programming and R Graphics to analyse a dataset

This assessment is to assist in learning the R language for Statistical Analysis on data.

Using the Try R course from Code School – http://tryr.codeschool.com/

Please submit an image showing the completion of this R course.

Based on this course please create an example use case based on some data that you have created. Use the R Graphics to visualise the data.

For more information on R Graphics that could be used, check out the R Graphics Cookbook – http://it-ebooks.info/book/1316/

http://www.cookbook-r.com/Graphs/

http://www.r-bloggers.com/getting-going-importing-data-and-plotting-a-simple-graphic/

Please ensure that the image of your R Graphics use case appears on your dbsdataprojects.com blog and that it is accompanied by a commentary blog post – roughly 500 words describing –

  • how you achieved the graphics based on your dataset
  • what information could be gleaned from the dataset
  • what other ideas/concepts could be represented via R Graphics if you had more time

This assignment was based on the Try R online course from Code School. This course was completed as shown below:

[Image: Try R course completion screenshot]

 

First, I installed the following R packages:

install.packages('Rcpp', dependencies = TRUE)
install.packages('ggplot2', dependencies = TRUE)
install.packages('data.table', dependencies = TRUE)
install.packages('stats', dependencies = TRUE)
install.packages('moments', dependencies = TRUE)
install.packages('design', dependencies = TRUE)

Commonly Used Packages in R.

 

Setting working directory:

setwd(file.path("D:", "Data", "Springboard", "Data Management and Analytics", "Data Management and Analytics", "Assignments", "CA2"))

I created a dataset in CSV format representing the last ten results of my football team ASNL (Association Sportive Nancy Lorraine) against the last remaining teams to play before the end of the League 2 championship. I collected the information from the ASNL official website (www.asnl.net) and the Sportstats website (http://www.sportstats.com/soccer/france/ligue-2/results).

Where:

RCL:Lens

DFCO: Dijon

USCL: Créteil Lusitanos

HAC: Le Havre

FCSM: Sochaux Montbéliard

VAFC: Valenciennes

ETG: Evian Thonon

TFC: Tours

 

The following is the set of commands/scripts used to create an R graph (PLOT) to visualise the dataset:

 

I read the data file in the working directory and stored the contents in data1:

data1 <- read.csv(file="ASNL.csv", header=TRUE, sep=",")

To display the data file:

data1

According to the above result, I created the following four vectors covering the dataset info:

L2Teams <- c("RCL", "DFCO", "USCL", "HAC", "FCSM", "VAFC", "ETG", "TFC")

Losses <- c(4,4,0,3,5,2,4,1)

Draws <- c(3,2,0,5,0,4,2,3)

Wins <- c(3,4,10,2,5,4,4,6)


Then, I calculated the range of values of the football games, from 0 to max value of 10:

Game_range <- range(0,Wins, Draws, Losses)

And I drew the Wins line (in green), using the plot function:

plot(Wins, type="o", col="green", ylim=Game_range, axes=FALSE, ann=FALSE)


I defined the X axis, the Y axis and created the box around the Wins line:

axis(1, at=1:8, lab= L2Teams)

axis(2, las=1, at=1*0:Game_range[2])

box()


Then, I added the lines for the draws (in yellow) and losses (in red):

lines(Losses, type="o", pch=23, lty=4, col="red")
lines(Draws, type="o", pch=22, lty=2, col="yellow")


Finally, I gave a name to the X axis and Y axis (in light purple colour):
title(xlab="League 2 remaining football teams to play", col.lab=rgb(1,0.17,1))
title(ylab="Number of games played against ASNL previously", col.lab=rgb(1,0.17,1))

gave a title (in blue) at the top of the chart:

title(main="French League 2 ASNL: Last 10 results against the last remaining teams to play", col.main="blue", font.main=3)

 

and created the legend in the top right corner:
legend(7, Game_range[2], c("Wins", "Draws", "Losses"), cex=0.8, col=c("green", "yellow", "red"), pch=21:23, lty=1:1)

So, the visual representation of my dataset using R programming and R Graphics, i.e. the plot function, is as follows:

[Image: final R plot of the ASNL results]

 

Before making sense of the above graphical representation, and to see whether Association Sportive Nancy Lorraine (ASNL) has a good chance of playing in League 1 next year, let's have a look at the current table before the 8 remaining games of the championship:

[Image: current League 2 table before the last 8 games]

Taking into account that only the top 3 League 2 clubs will be promoted to League 1, we can analyse the R visualisation to assess ASNL's chances of promotion.

We can see that there is a gap of 7 points between Nancy and the fourth team in the table, i.e. Le Havre (HAC), with 8 games remaining against the teams shown in the R graph.

Another major aspect to consider is that Nancy will play Le Havre with a 70 percent chance of a win or a draw.

Also, over the last 10 years, a team with 65 points has always been promoted.

Having 55 points, Nancy is 10 points short of a guaranteed promotion. If we look at the graph, Nancy should take the 3 points against Créteil Lusitanos (USCL), having won all 10 previous games; the same goes, on home ground, against Lens (RCL) and Evian Thonon (ETG), with 4 wins and 2 or 3 draws.

With 5 home wins against Sochaux Montbéliard (FCSM), the 3 points are looking good as Nancy is playing this game at home.

As for the rest of the teams, we have a 90 percent chance of not losing (one point minimum) against Tours (TFC), an 80 percent chance against Valenciennes (VAFC) and a 60 percent chance against Dijon (DFCO).

To sum up, this R plot shows that Association Sportive Nancy Lorraine (ASNL) is in a very strong position to be promoted to League 1, where it would play games against Paris Saint-Germain, Lyon or Marseille. As an ASNL fan, let's hope so!

Apart from interpreting this R visualisation, we could have looked into the statistical functions of R, in particular regression analysis (generalised linear models or non-linear regression) and analysis of variance.

Predictive modelling would be another interesting area of R to explore, perhaps with a more comprehensive dataset.
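
For instance, a first step in that direction could be a simple linear model relating points won to goals scored and conceded; the data frame below is entirely made up, just to sketch the syntax:

games <- data.frame(goals_for     = c(2, 0, 1, 3, 1, 0),
                    goals_against = c(1, 1, 1, 0, 2, 0),
                    points        = c(3, 0, 1, 3, 0, 1))

model <- lm(points ~ goals_for + goals_against, data = games)
summary(model)   # coefficients hint at how much each goal scored or conceded is "worth" in points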

 

An Irish population and density heatmap based on the 2011 census data


Assignment 1 – Using Fusion Tables to create an interactive heatmap

This assignment is to submit an image of a fusion table outlining an Irish population heatmap based on the 2011 census data, accompanied by a commentary blog post – roughly 500 words describing –

  • how the heat map was achieved
  • what information could be gleaned from the heat map
  • what other ideas/concepts could be represented in the heat map

Optional: creation of a random distribution of counties based on population density.


 

The first step is to find and import the relevant raw data, i.e. the 2011 census raw data representing the Irish population per county, found on the CSO website:

http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/

This is a table which contains 4 columns: “Province, County”, “Males”, “Females” and “Total Persons”, in raw format.

 

Also, the Irish KMZ data file, found on the Independent website:

http://www.independent.ie/editorial/test/map_lead.kml

was used in order to be able to create a heatmap.

 

 

The second step is to prepare data for Fusion Tables.

The 2011 census raw data needs to conform to a specific format for use with Fusion Tables; as a result, before importing the data into Fusion Tables, I “cleaned up” the 2011 census raw data as follows:

  • Header rows were merged into one
  • Empty rows were deleted
  • Rows mentioning provinces and detailed info of counties, like Dublin City, Finglas, Dun Laoghaire for Dublin, were deleted to keep only 26 rows corresponding to each county.
  • Incorrect spelling of Laois was fixed.
  • North and South Tipperary counties were merged.
  • Wrong figures were corrected, i.e. number of males for Kerry should be 72629 instead of 72628 (one missing) and number of males for Waterford should be 56464 instead of 54464 (2000 missing).
  • The last row, with the total Irish population, was deleted.

Finally, the data table looked like this:


Figure.1 : 2011 Irish Population data

 

The original data contains 4 columns, namely province/county, males, females and total persons. This was filtered and cleaned to retain two columns, namely county and total persons.

Next, by combining this data table with the geographic KML/KMZ information, Fusion Tables can plot the locations and associated data on an interactive map.

The typical workflow is to upload the county-by-county population table (the cleaned data) and the KML file containing the geographic information for each county (i.e. shape and boundary) to Fusion Tables, merge the two tables, visualise the new table on a map, style the map, then share it if needed.

 

The cleaned data and the KMZ file were imported into Fusion Tables using the following Google Drive support page:

https://support.google.com/fusiontables/answer/2571232

The following process was followed:

  • Uploaded the cleaned data table, hiding the Males and Females columns before creating the first fusion table (not strictly necessary).
  • Uploaded the second KMZ table in a new tab of the browser to create the second fusion table.
  • Merged the table with the KMZ file, matching the county names in the “Counties” column of the first table with the county names in the “name” column of the second table. I also selected the columns “Counties”, “Total persons” and “Geometry” to be merged (a rough R equivalent of this merge step is sketched after this list).
  • Converted the data into a heatmap of Ireland.
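
For readers more comfortable with R than with Fusion Tables, the merge step can be sketched as a plain join; the file names below are hypothetical and assume both tables were first exported to CSV:

population <- read.csv("census_2011_cleaned.csv")   # has the "Counties" column
boundaries <- read.csv("county_boundaries.csv")     # has the "name" column

# Join the population figures to the county geometry on the county name
merged <- merge(population, boundaries, by.x = "Counties", by.y = "name")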

 

I had to change the map styles to display a clearer and more meaningful view of the population in the Republic of Ireland, by clicking on “Change feature styles” and “Fill color”, and adapting the colours and population number ranges in “Buckets” to show a proper heatmap. I also included the legend.


Figure.2 : 2011 Irish Population Heatmap

 

In order to create a random distribution of counties based on population density, I used the same process as above for the Irish population heatmap and adapted the population data spreadsheet (Fig.1) by adding an extra column for the population density.

The population density was calculated by dividing the total number of inhabitants by the area in square km, for each of the 26 counties. The area of each county was found on the website:

http://www.wesleyjohnston.com/users/ireland/geography/counties.html

and the following spreadsheet for fusion tables was produced:


Figure.3 : 2011 Irish Population Density data
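
The extra density column itself is a one-line calculation, sketched in R below with hypothetical file and column names:

counties <- read.csv("census_2011_with_areas.csv")   # columns: County, TotalPersons, AreaSqKm
counties$Density <- counties$TotalPersons / counties$AreaSqKm   # persons per square km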

Finally, using the cleaned Irish population data (represented in Figure 3) and the KML/KMZ geographic data file, the heatmap of Irish population density per county was created in the same way as the Irish population one (Figure 2):


Figure.4 : 2011 Irish Population Density Heatmap

 

The Irish population and Irish population density heatmaps (Fig.2 and Fig.4) show that County Dublin is by far the most populated and most densely populated county in Ireland; consequently, it has a huge influence and is a big driver in terms of employment and of commercial, business and cultural impact, compared to the rest of the country.

Cork and then Galway come next, even though they are not very densely populated compared to counties like Kildare and Louth, which have lower populations. This higher density is probably due to the fact that they are both situated near Dublin.

On the other hand, Leitrim is the least populated and least densely populated county in all of Ireland, followed by counties like Longford and Roscommon.

Other ideas would be to use the spreadsheets to produce pivot tables and charts to look at the proportion of males and females per county. Also, when zooming in on the heatmaps, we can notice, for example, that an end-to-end motorway network (M18 and M20) is needed between Galway and Cork, as some parts, like Gort to Galway, around Bunratty, and from Patrickswell (Limerick) to Cork, have not been built yet…