本文转载翻译自[1] [How to create Fact and Dimension tables from denormalized raw data]参考资料

如何基于非规范化的原始数据创建事实表和维度表

样本数据预览

Name Gender Profession state asOfDate temperature pulse
Rachel Green Female Designer NY 2020-11-01 98.4 60
Sheldon Cooper Male Physicist CA 2020-11-05 98.5 70
Penny Female Waitress CA 2020-11-15 99.2 75
George Costanza Male Agent NJ 2020-05-05 98.7 90
Jerry Seinfeld Male Comedian NY 2020-01-01 98.6 65

数据库、维度和表格创建

作者主要是创建一个sample数据库,和三个模式表(landing着陆?、dim维度表和fact事实表),以及与前两者相关的众多的表格。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
## create database
postgres=> create database sample
postgres=> \c sample
## it will prompt you to enter Postgres login password
## create schemas
sample=> create schema landing;
sample=> create schema dim;
sample=> create schema fact;
## create tables
sample=> create table landing.rawdata (
name varchar(100)
,gender varchar(20)
,profession varchar(50)
,state varchar(2)
,asofdate date
,temperature float
,pulse int
);
## Import CSV data into Postgres table
sample=> \COPY landing.rawdata FROM '/yourpath/sampleData.csv' DELIMITER ',' CSV HEADER;
## Add a surrogate ID column (surrogate column means, column with sequence of numbers, generally auto generated).
sample=> alter table landing.rawdata add id serial;
## create dimension tables
## Fisrt Query
create table dim.gender as
select
row_number() Over(order by gender) as genderid
,gender
from
(select distinct gender from landing.rawdata) t;
## Second Query
create table dim.person as
select
row_number() Over(order by name) as personid
,name
from
(select distinct name from landing.rawdata) t;
## Third Query
create table dim.profession as
select
row_number() Over(order by profession) as professionid
,profession
from
(select distinct profession from landing.rawdata) t;
## Fourth Query
create table dim.state as
select
row_number() Over(order by state) as stateid
,state
from
(select distinct state from landing.rawdata) t;
## Build Fact table based on IDs from Dimension Table.
create table fact.user
as
select
r.id
,p.personid
,g.genderid
,pr.professionID
,s.stateID
,r.asofdate
,r.temperature
,r.pulse
from
landing.rawdata r
JOIN dim.person as p on r.name = p.name
JOIN dim.gender as g on r.gender = g.gender
JOIN dim.profession as pr on r.profession = pr.profession
JOIN dim.state as s on r.state = s.state;

个人理解

这里基于我贫弱的数据库理解去试图解释一下为什么要这么做,它的目的是什么。

做了什么(星座模型)?

sample database content
landing 创建一个landing.rawdata储存全部的数据
dim 创建各个维度表:
- dim.person
- dim.gender
- dim.profession
- dim.state
fact 创建事实表来分别关联到各个维度表,例如创建一个user事实表可以这样:
- fact.user存储
- - landing.rawdata的id
- - dim.gender的gender代码
- - dim.profession的profession代码
- - dim.state的state代码

为什么这么做?

优势

  • 查询效率高
  • 维度多,数据解耦
  • 关系明朗
  • 用户友好

劣势:

  • 表格繁多
  • 数据冗余多

关系可视化

mermaid-diagram-2023-02-14-154702

mermaid图地址

参考资料