SQL Complete Mastery Part 1: What is Data Modeling — Complete Overview from Definition of Modeling to ERD

Introduction

When first studying databases, the first thing that comes up is data modeling. But when you open a book, words like "abstraction, specification, schema..." pour out and you get stuck.

This article starts from that point of confusion. It breaks down difficult concepts as simply as possible so that after reading, you'll think "Oh, so that's what it is."

1. What is Modeling

One-line Definition

Modeling is organizing complex reality in a clear and understandable way.

Just as architects draw blueprints before building a structure, data modeling is expressing data structure visually before creating a database. If you build without a blueprint and the column placement is wrong, you have to tear everything down. The reason for doing data modeling first is exactly the same.

Remember three things: reality isn't captured as it is, but only what's needed is selected; a commonly agreed notation that everyone reads the same way is used; and there's a purpose of communication and design.

Model = A compressed representation of the real world tailored to a specific purpose

Types of Modeling

Modeling isn't a concept used only for data. It's used in various fields depending on how reality is expressed.

Information Systems Modeling — Diagrams business workflows and data structures. ERD and UML belong here.
Mathematical Modeling — Expresses real phenomena with formulas. Physics formulas, economic equations, etc.
Statistical (Probabilistic) Modeling — Expresses data patterns with probabilities. Regression analysis and machine learning models apply here.
Circuit Modeling — Expresses electrical and electronic circuits with diagrams. Circuit diagrams and logic gates belong here.

What we handle is information systems modeling, specifically data modeling that expresses data structures.

Three Characteristics of Modeling

Good models commonly have three characteristics.

Abstraction

Reality isn't captured in its entirety. When modeling customer data, we don't include blood type or hobbies. Selecting only what's necessary for business—like name, contact information, and order history—is abstraction. The ability to judge what to exclude and what to keep is exactly the skill of modeling.

Simplification

Complex reality is expressed using agreed-upon notation so anyone can understand it. Workflow structures that would take dozens of documents to explain can be expressed in one ERD, allowing everyone in the meeting room to see the same picture and discuss it.

Clarity

If person A and person B interpret the same model differently, that's a failed model. Clarity means removing ambiguity and writing precisely so that everyone reads it identically.

Three Perspectives of Information Systems Modeling

It's difficult to understand the whole system by looking at it from just one perspective. You need to see it from three viewpoints together.

Data Perspective (What) — What are we handling?

What data exists and how it connects with each other. ERD is the representative tool of this perspective.

Process Perspective (How) — How is it processed?

How business flows in sequence. When certain conditions occur, what processing happens, and how data is transformed. DFD (Data Flow Diagram) and flowcharts are used here.

Correlation Perspective Between Data and Process (Interaction) — What data does what for?

What data is used in what business, what business creates, changes, and deletes what data. CRUD Matrix is a representative analysis tool.

A good information systems model has a structure where these three perspectives fit together in balance.

2. What is Data Modeling

Definition

Data modeling is the entire process of analyzing data needed in business and defining it as a database structure.

It's not just creating tables. It includes analyzing what data is needed in business, how those data connect, and expressing them. It's also creating a common language so developers, planners, and DBAs can communicate while seeing the same picture.

Six Functions Provided by Data Models

A data model serves more than as a blueprint.

① System Visualization — Makes the invisible system structure visible.

② Specification of Structure and Behavior — Clearly defines what structure the system has and how it operates. Becomes the standard for development.

③ Provision of Structured Framework — Provides criteria for maintaining consistent structure even when multiple people collaborate.

④ Documentation — Becomes core reference material for maintenance and feature additions after development is complete.

⑤ Multiple Perspectives — Planners can view with conceptual models and developers with physical models, each seeing the same system at their own level.

⑥ Concretization of Detail Level — Vague requirements are progressively concretized through conceptual → logical → physical stages.

Three Reasons Why Data Modeling is Important

Ripple Effects Cannot Be Ignored

A data model is the foundation of a system. If the foundation is wrong, everything built on top shakes. If you change the data model when development is halfway through, you have to rewrite all queries, APIs, and screen logic that reference that table. Putting effort into initial design is ultimately the most cost-effective approach.

Expresses Complex Requirements Concisely

Real business requirements are complex. Hundreds of rules and exceptions are mixed together. Data modeling compresses that complexity into one ERD. It replaces dozens of documents with one model and makes everyone share the same understanding.

Data Quality Improves

Well-designed models become a defense line where the structure itself prevents incorrect data from entering. Through constraints and relationship definitions, incorrectly formed data is blocked from the start. Data quality isn't managed after the fact—it's secured at the design stage.

Points to Consider — Core is Reducing Risk

Minimize Duplication

If the same data is stored in multiple places, problems inevitably occur. If customer addresses are stored separately in orders, shipping, and customer tables, when the address changes, all three places must be corrected. If even one is missed, data becomes inconsistent. This is called data anomaly.

Insertion Anomaly — Problem of having to input unnecessary information when adding data
Update Anomaly — Problem of inconsistency when only some duplicated data is updated
Deletion Anomaly — Problem of unintended information disappearing when deleting data

Normalization is the key technique for removing this duplication.

Lower Inflexibility

A model designed to fit only today's business becomes a shackle the moment business changes. A good model meets current requirements while leaving room for future changes. Don't let today's perfect model become tomorrow's constraint.

3. Three Stages of Data Modeling

Data modeling doesn't finish in one go. It descends stage by stage from abstract to concrete.

        ┌──────────────────────┐
        │   Conceptual Data      │  ← Most abstract (business-focused)
        │      Modeling          │
        └──────────────────────┘
                  ↓
        ┌────────────────────────────┐
        │    Logical Data Modeling   │  ← Confirm structure and rules
        └────────────────────────────┘
                  ↓
        ┌──────────────────────────────────┐
        │    Physical Data Modeling        │  ← Actual DB implementation
        └──────────────────────────────────┘

Stage 1. Conceptual Data Modeling

This is the stage of grasping "what data exists and how it connects" in a big picture. Created with planners and business stakeholders, it's not tied to any specific technology. It's at the level of "there are customers, there are products, customers order products."

Stage 2. Logical Data Modeling

This is the stage of concretizing the conceptual model. It confirms what attributes each entity has, what the identifier (PK) is, and how relationships work. Normalization is also applied at this stage. It's still not tied to a specific DBMS.

Stage 3. Physical Data Modeling

This is the stage of implementing the logical model in an actual database. Tables are created to match specific DBMSes like Oracle, MySQL, or PostgreSQL, and indexes and partitioning are considered. DDL (CREATE TABLE, etc.) is the output of this stage.

Connection with Waterfall Methodology

Waterfall is a development methodology where processes flow sequentially from top to bottom like a waterfall. The three stages of data modeling fit perfectly with this flow.

Waterfall Stage          Data Modeling
────────────────────────────────────────
Requirements Analysis →  Conceptual Data Modeling
System Design        →   Logical Data Modeling
Implementation/      →   Physical Data Modeling
Development
Testing
Operations/
Maintenance

The later an error from the previous stage is discovered, the exponentially higher the correction cost. Mistakes in conceptual design mean redoing all of logical design; mistakes in logical design mean redoing all of physical implementation. This is the ripple effect.

4. Data Independence

Why It's Needed

If changing storage from HDD to SSD meant developers had to rewrite queries, there's a problem with the system. Data independence is the property that changes in lower-level structure don't affect upper levels. With independence, even when changes occur, the impact scope is limited and maintenance costs drop.

Three-Schema Architecture

An architecture proposed by ANSI/SPARC to realize data independence. It divides into three layers so each layer doesn't affect the others.

┌─────────────────────────────────────────┐
│           External Schema               │  ← User perspective
│     User A's View  │  User B's View │...│
└──────────────────────┬──────────────────┘
                       ↕  Logical Independence
┌──────────────────────┴──────────────────┐
│          Conceptual Schema               │  ← Integrated perspective
│      Entire Data Structure and           │
│           Relationship Definition        │
└──────────────────────┬──────────────────┘
                       ↕  Physical Independence
┌──────────────────────┴──────────────────┐
│          Internal Schema                 │  ← Physical stage
│    Actual Data Storage Structure          │
│           and Methodology                │
└─────────────────────────────────────────┘

External Schema — The View Users See

Each user sees only the data they need. The marketing team uses a customer analysis view, the logistics team uses a shipping status view. Multiple external schemas can exist for one database.

Conceptual Schema — Integrated Map of All Data

The organization's overall data structure defined by the DBA. It becomes the basis for all external schemas and doesn't worry about physical storage methods. One database must have exactly one conceptual schema.

Internal Schema — Actual Storage Method

Defines how data is actually stored on disk. Index structure, file composition, compression method, etc. It's the layer closest to the DBMS and hardware.

Two Types of Independence

Independence	Meaning	Example
Logical	Conceptual schema changes → External schema unaffected	Table structure change → User screen unchanged
Physical	Internal schema changes → Upper layers unaffected	Storage device replacement → Queries and logic unchanged

The reason for having layers is one: to prevent changes from rippling upward.

5. Three Core Elements of Data Modeling

Data modeling is ultimately answering three questions.

What does business handle? → Entity
What characteristics does it have? → Attribute
How connected are they? → Relationship

Entity — Something Business Manages

An entity is something that must be managed in business. It can be something tangible (customer, product) as well as conceptual (order, contract).

An important distinction exists here.

Category	Concept	Example
Entity Type (plural)	A collection of the same nature	The "customer" set
Instance (singular)	An individual belonging to the set	Specific customer "Hong Gildong"

The boxes drawn in ERD are entity types (plural), while each actual row stored in the database is an instance (singular).

Attribute — Characteristics Something Has

Attributes are concrete characteristics an entity possesses.

Category	Concept	Example
Attribute Type (plural)	Name of a characteristic	"Name" as an attribute item
Attribute Value (singular)	Actual value	Hong Gildong's name = "Hong Gildong"

The attribute type is the table column name, while the attribute value is the actual data in that column.

Relationship — Connection Between Things

Relationship is a business association between two entities. It's not just drawing a line; it includes cardinality (what kind of relationship) and participation (mandatory or optional).

Category	Concept	Example
Relationship Type (plural)	Association rule between entities	"Customers order products"
Relationship Instance (singular)	Individual association that occurred	Hong Gildong ordering MacBook that time

 Entity Type (Customer)
    ├── Attribute Type: Name, Email, Phone
    └── Relationship: Customer ──orders──▶ Product

 Instance (Hong Gildong)
    ├── Attribute Value: "Hong Gildong", "hong@example.com"
    └── Relationship Instance: That specific time Hong Gildong ordered MacBook

Entity is what, attribute is characteristic, relationship is connection. The data model is complete when all three mesh together.

6. ERD — The Map of the Data World

What is ERD

ERD (Entity-Relationship Diagram) is a diagram that visually expresses entities and relationships between entities. Proposed by Peter Chen in 1976, it became the international standard for expressing data models.

With one ERD, you can understand at a glance what data exists in the system, what attributes it has, and how it connects. This is why an ERD is drawn before creating a database.

Order of Modeling with ERD

Step 1. Entity Identification and Placement
        Find things to manage in business and draw them as boxes
               ↓
Step 2. Setting Relationships Between Entities
        Connect entities with business association using lines
               ↓
Step 3. Describing Relationship Names
        Express relationships with verbs ("orders", "belongs to")
               ↓
Step 4. Describing Relationship Participation
        Indicate whether participation is mandatory or optional
               ↓
Step 5. Indicating Cardinality
        Mark 1:1 / 1:N / M:N relationships
               ↓
Step 6. Defining Attributes
        Define attributes and identifiers (PK) for each entity

There's a reason for the sequence. Relationship structure and entities must be confirmed before attributes, or the entire structure will shake. Changing relationship structure later has major ripple effects.

IE Notation vs Barker Notation

IE Notation (Crow's Foot)

Called Crow's Foot because the relationship line endpoints look like crow's feet. It's the most widely used notation today.

Symbol   Meaning
──────────────────────
  |      Exactly 1 (mandatory)
  O      0 (optional)
  <      Multiple (N, many)

Combination Examples
──────────────────────────────────────
  |──|    1 to 1 (mandatory)
  |──O    1 to 0 or 1 (optional)
  |──<    1 to many (1:N mandatory)
  O──<    0 or 1 to many (1:N optional)

Example: Customer and Order

[Customer] |───O< [Order]

→ Customer is exactly 1, orders can be 0 or more

Barker Notation

Developed by Richard Barker and adopted by Oracle CASE tools. It uses visual language more intuitive than IE.

Item	IE Notation	Barker Notation
Entity Shape	Sharp rectangle	Rounded rectangle
Mandatory Relationship	Vertical line `\|`	Solid line `——`
Optional Relationship	Circle `O`	Dotted line `- -`
Relationship Name	One direction	Two-way verb phrase

Barker's greatest feature is that relationships can be read bidirectionally.

[Customer] ──orders──▶ [Order]
          ◀──belongs to──

"Customer orders" / "Order belongs to customer" — both directions read naturally.

ERD is a language. Notation is the grammar of that language. Grammar may differ, but the content is the same.

7. Six Elements of a Good Data Model

A model isn't made just by drawing. The difference between a well-made model and a poorly made one slowly reveals itself as the system operates.

Element	Core Question	Why Important
Completeness	Is there no missing necessary data?	Data not in the model doesn't exist in the system
No Duplication	Is the same data stored in multiple places?	Duplication causes inconsistency and anomalies
Business Rules	Are business constraints in the model?	Rules only in code can always be bypassed
Data Reuse	Can multiple systems use the same data?	Common data design makes system integration easy
Communication	Can both technical and non-technical people read it?	Model is a communication tool, not technical docs
Integration	Does data mean the same everywhere in the organization?	If definitions differ, numbers won't match even when combined

A model with these six elements is one that survives long-term.

Conclusion

Data modeling is not a preparatory step for creating a database. It's core work for understanding business, structuring data, and designing the foundation of systems.

The three characteristics of modeling (abstraction, simplification, clarity), three stages (conceptual, logical, physical), three core elements (entity, attribute, relationship), and six conditions for good models — when you understand how these concepts connect, data modeling becomes not a memorization subject but a thinking subject.

The next article covers an overview of relational databases.