SQL Complete Mastery Part 1: What is Data Modeling — Complete Overview from Definition of Modeling to ERD
Definition of Data Modeling, 3 Stages, Independence, and ERD Notation — A Complete Guide Written in Easy-to-Understand Language Even for Beginners
Introduction
When first studying databases, the first thing that comes up is data modeling. But when you open a book, words like "abstraction, specification, schema..." pour out and you get stuck.
This article starts from that point of confusion. It breaks down difficult concepts as simply as possible so that after reading, you'll think "Oh, so that's what it is."
1. What is Modeling
One-line Definition
Modeling is organizing complex reality in a clear and understandable way.
Just as architects draw blueprints before building a structure, data modeling is expressing data structure visually before creating a database. If you build without a blueprint and the column placement is wrong, you have to tear everything down. The reason for doing data modeling first is exactly the same.
Remember three things: reality isn't captured as it is, but only what's needed is selected; a commonly agreed notation that everyone reads the same way is used; and there's a purpose of communication and design.
Model = A compressed representation of the real world tailored to a specific purpose
Types of Modeling
Modeling isn't a concept used only for data. It's used in various fields depending on how reality is expressed.
- Information Systems Modeling — Diagrams business workflows and data structures. ERD and UML belong here.
- Mathematical Modeling — Expresses real phenomena with formulas. Physics formulas, economic equations, etc.
- Statistical (Probabilistic) Modeling — Expresses data patterns with probabilities. Regression analysis and machine learning models apply here.
- Circuit Modeling — Expresses electrical and electronic circuits with diagrams. Circuit diagrams and logic gates belong here.
What we handle is information systems modeling, specifically data modeling that expresses data structures.
Three Characteristics of Modeling
Good models commonly have three characteristics.
Abstraction
Reality isn't captured in its entirety. When modeling customer data, we don't include blood type or hobbies. Selecting only what's necessary for business—like name, contact information, and order history—is abstraction. The ability to judge what to exclude and what to keep is exactly the skill of modeling.
Simplification
Complex reality is expressed using agreed-upon notation so anyone can understand it. Workflow structures that would take dozens of documents to explain can be expressed in one ERD, allowing everyone in the meeting room to see the same picture and discuss it.
Clarity
If person A and person B interpret the same model differently, that's a failed model. Clarity means removing ambiguity and writing precisely so that everyone reads it identically.
Three Perspectives of Information Systems Modeling
It's difficult to understand the whole system by looking at it from just one perspective. You need to see it from three viewpoints together.
Data Perspective (What) — What are we handling?
What data exists and how it connects with each other. ERD is the representative tool of this perspective.
Process Perspective (How) — How is it processed?
How business flows in sequence. When certain conditions occur, what processing happens, and how data is transformed. DFD (Data Flow Diagram) and flowcharts are used here.
Correlation Perspective Between Data and Process (Interaction) — What data does what for?
What data is used in what business, what business creates, changes, and deletes what data. CRUD Matrix is a representative analysis tool.
A good information systems model has a structure where these three perspectives fit together in balance.
2. What is Data Modeling
Definition
Data modeling is the entire process of analyzing data needed in business and defining it as a database structure.
It's not just creating tables. It includes analyzing what data is needed in business, how those data connect, and expressing them. It's also creating a common language so developers, planners, and DBAs can communicate while seeing the same picture.
Six Functions Provided by Data Models
A data model serves more than as a blueprint.
① System Visualization — Makes the invisible system structure visible.
② Specification of Structure and Behavior — Clearly defines what structure the system has and how it operates. Becomes the standard for development.
③ Provision of Structured Framework — Provides criteria for maintaining consistent structure even when multiple people collaborate.
④ Documentation — Becomes core reference material for maintenance and feature additions after development is complete.
⑤ Multiple Perspectives — Planners can view with conceptual models and developers with physical models, each seeing the same system at their own level.
⑥ Concretization of Detail Level — Vague requirements are progressively concretized through conceptual → logical → physical stages.
Three Reasons Why Data Modeling is Important
Ripple Effects Cannot Be Ignored
A data model is the foundation of a system. If the foundation is wrong, everything built on top shakes. If you change the data model when development is halfway through, you have to rewrite all queries, APIs, and screen logic that reference that table. Putting effort into initial design is ultimately the most cost-effective approach.
Expresses Complex Requirements Concisely
Real business requirements are complex. Hundreds of rules and exceptions are mixed together. Data modeling compresses that complexity into one ERD. It replaces dozens of documents with one model and makes everyone share the same understanding.
Data Quality Improves
Well-designed models become a defense line where the structure itself prevents incorrect data from entering. Through constraints and relationship definitions, incorrectly formed data is blocked from the start. Data quality isn't managed after the fact—it's secured at the design stage.
Points to Consider — Core is Reducing Risk
Minimize Duplication
If the same data is stored in multiple places, problems inevitably occur. If customer addresses are stored separately in orders, shipping, and customer tables, when the address changes, all three places must be corrected. If even one is missed, data becomes inconsistent. This is called data anomaly.
- Insertion Anomaly — Problem of having to input unnecessary information when adding data
- Update Anomaly — Problem of inconsistency when only some duplicated data is updated
- Deletion Anomaly — Problem of unintended information disappearing when deleting data
Normalization is the key technique for removing this duplication.
Lower Inflexibility
A model designed to fit only today's business becomes a shackle the moment business changes. A good model meets current requirements while leaving room for future changes. Don't let today's perfect model become tomorrow's constraint.
3. Three Stages of Data Modeling
Data modeling doesn't finish in one go. It descends stage by stage from abstract to concrete.
┌──────────────────────┐
│ Conceptual Data │ ← Most abstract (business-focused)
│ Modeling │
└──────────────────────┘
↓
┌────────────────────────────┐
│ Logical Data Modeling │ ← Confirm structure and rules
└────────────────────────────┘
↓
┌──────────────────────────────────┐
│ Physical Data Modeling │ ← Actual DB implementation
└──────────────────────────────────┘
Stage 1. Conceptual Data Modeling
This is the stage of grasping "what data exists and how it connects" in a big picture. Created with planners and business stakeholders, it's not tied to any specific technology. It's at the level of "there are customers, there are products, customers order products."
Stage 2. Logical Data Modeling
This is the stage of concretizing the conceptual model. It confirms what attributes each entity has, what the identifier (PK) is, and how relationships work. Normalization is also applied at this stage. It's still not tied to a specific DBMS.
Stage 3. Physical Data Modeling
This is the stage of implementing the logical model in an actual database. Tables are created to match specific DBMSes like Oracle, MySQL, or PostgreSQL, and indexes and partitioning are considered. DDL (CREATE TABLE, etc.) is the output of this stage.
Connection with Waterfall Methodology
Waterfall is a development methodology where processes flow sequentially from top to bottom like a waterfall. The three stages of data modeling fit perfectly with this flow.
Waterfall Stage Data Modeling
────────────────────────────────────────
Requirements Analysis → Conceptual Data Modeling
System Design → Logical Data Modeling
Implementation/ → Physical Data Modeling
Development
Testing
Operations/
Maintenance
The later an error from the previous stage is discovered, the exponentially higher the correction cost. Mistakes in conceptual design mean redoing all of logical design; mistakes in logical design mean redoing all of physical implementation. This is the ripple effect.
4. Data Independence
Why It's Needed
If changing storage from HDD to SSD meant developers had to rewrite queries, there's a problem with the system. Data independence is the property that changes in lower-level structure don't affect upper levels. With independence, even when changes occur, the impact scope is limited and maintenance costs drop.
Three-Schema Architecture
An architecture proposed by ANSI/SPARC to realize data independence. It divides into three layers so each layer doesn't affect the others.
┌─────────────────────────────────────────┐
│ External Schema │ ← User perspective
│ User A's View │ User B's View │...│
└──────────────────────┬──────────────────┘
↕ Logical Independence
┌──────────────────────┴──────────────────┐
│ Conceptual Schema │ ← Integrated perspective
│ Entire Data Structure and │
│ Relationship Definition │
└──────────────────────┬──────────────────┘
↕ Physical Independence
┌──────────────────────┴──────────────────┐
│ Internal Schema │ ← Physical stage
│ Actual Data Storage Structure │
│ and Methodology │
└─────────────────────────────────────────┘
External Schema — The View Users See
Each user sees only the data they need. The marketing team uses a customer analysis view, the logistics team uses a shipping status view. Multiple external schemas can exist for one database.
Conceptual Schema — Integrated Map of All Data
The organization's overall data structure defined by the DBA. It becomes the basis for all external schemas and doesn't worry about physical storage methods. One database must have exactly one conceptual schema.
Internal Schema — Actual Storage Method
Defines how data is actually stored on disk. Index structure, file composition, compression method, etc. It's the layer closest to the DBMS and hardware.
Two Types of Independence
| Independence | Meaning | Example |
|---|---|---|
| Logical | Conceptual schema changes → External schema unaffected | Table structure change → User screen unchanged |
| Physical | Internal schema changes → Upper layers unaffected | Storage device replacement → Queries and logic unchanged |
The reason for having layers is one: to prevent changes from rippling upward.
5. Three Core Elements of Data Modeling
Data modeling is ultimately answering three questions.
- What does business handle? → Entity
- What characteristics does it have? → Attribute
- How connected are they? → Relationship
Entity — Something Business Manages
An entity is something that must be managed in business. It can be something tangible (customer, product) as well as conceptual (order, contract).
An important distinction exists here.
| Category | Concept | Example |
|---|---|---|
| Entity Type (plural) | A collection of the same nature | The "customer" set |
| Instance (singular) | An individual belonging to the set | Specific customer "Hong Gildong" |
The boxes drawn in ERD are entity types (plural), while each actual row stored in the database is an instance (singular).
Attribute — Characteristics Something Has
Attributes are concrete characteristics an entity possesses.
| Category | Concept | Example |
|---|---|---|
| Attribute Type (plural) | Name of a characteristic | "Name" as an attribute item |
| Attribute Value (singular) | Actual value | Hong Gildong's name = "Hong Gildong" |
The attribute type is the table column name, while the attribute value is the actual data in that column.
Relationship — Connection Between Things
Relationship is a business association between two entities. It's not just drawing a line; it includes cardinality (what kind of relationship) and participation (mandatory or optional).
| Category | Concept | Example |
|---|---|---|
| Relationship Type (plural) | Association rule between entities | "Customers order products" |
| Relationship Instance (singular) | Individual association that occurred | Hong Gildong ordering MacBook that time |
Entity Type (Customer)
├── Attribute Type: Name, Email, Phone
└── Relationship: Customer ──orders──▶ Product
Instance (Hong Gildong)
├── Attribute Value: "Hong Gildong", "hong@example.com"
└── Relationship Instance: That specific time Hong Gildong ordered MacBook
Entity is what, attribute is characteristic, relationship is connection. The data model is complete when all three mesh together.
6. ERD — The Map of the Data World
What is ERD
ERD (Entity-Relationship Diagram) is a diagram that visually expresses entities and relationships between entities. Proposed by Peter Chen in 1976, it became the international standard for expressing data models.
With one ERD, you can understand at a glance what data exists in the system, what attributes it has, and how it connects. This is why an ERD is drawn before creating a database.
Order of Modeling with ERD
Step 1. Entity Identification and Placement
Find things to manage in business and draw them as boxes
↓
Step 2. Setting Relationships Between Entities
Connect entities with business association using lines
↓
Step 3. Describing Relationship Names
Express relationships with verbs ("orders", "belongs to")
↓
Step 4. Describing Relationship Participation
Indicate whether participation is mandatory or optional
↓
Step 5. Indicating Cardinality
Mark 1:1 / 1:N / M:N relationships
↓
Step 6. Defining Attributes
Define attributes and identifiers (PK) for each entity
There's a reason for the sequence. Relationship structure and entities must be confirmed before attributes, or the entire structure will shake. Changing relationship structure later has major ripple effects.
IE Notation vs Barker Notation
IE Notation (Crow's Foot)
Called Crow's Foot because the relationship line endpoints look like crow's feet. It's the most widely used notation today.
Symbol Meaning
──────────────────────
| Exactly 1 (mandatory)
O 0 (optional)
< Multiple (N, many)
Combination Examples
──────────────────────────────────────
|──| 1 to 1 (mandatory)
|──O 1 to 0 or 1 (optional)
|──< 1 to many (1:N mandatory)
O──< 0 or 1 to many (1:N optional)
Example: Customer and Order
[Customer] |───O< [Order]
→ Customer is exactly 1, orders can be 0 or more
Barker Notation
Developed by Richard Barker and adopted by Oracle CASE tools. It uses visual language more intuitive than IE.
| Item | IE Notation | Barker Notation |
|---|---|---|
| Entity Shape | Sharp rectangle | Rounded rectangle |
| Mandatory Relationship | Vertical line | | Solid line —— |
| Optional Relationship | Circle O | Dotted line - - |
| Relationship Name | One direction | Two-way verb phrase |
Barker's greatest feature is that relationships can be read bidirectionally.
[Customer] ──orders──▶ [Order]
◀──belongs to──
"Customer orders" / "Order belongs to customer" — both directions read naturally.
ERD is a language. Notation is the grammar of that language. Grammar may differ, but the content is the same.
7. Six Elements of a Good Data Model
A model isn't made just by drawing. The difference between a well-made model and a poorly made one slowly reveals itself as the system operates.
| Element | Core Question | Why Important |
|---|---|---|
| Completeness | Is there no missing necessary data? | Data not in the model doesn't exist in the system |
| No Duplication | Is the same data stored in multiple places? | Duplication causes inconsistency and anomalies |
| Business Rules | Are business constraints in the model? | Rules only in code can always be bypassed |
| Data Reuse | Can multiple systems use the same data? | Common data design makes system integration easy |
| Communication | Can both technical and non-technical people read it? | Model is a communication tool, not technical docs |
| Integration | Does data mean the same everywhere in the organization? | If definitions differ, numbers won't match even when combined |
A model with these six elements is one that survives long-term.
Conclusion
Data modeling is not a preparatory step for creating a database. It's core work for understanding business, structuring data, and designing the foundation of systems.
The three characteristics of modeling (abstraction, simplification, clarity), three stages (conceptual, logical, physical), three core elements (entity, attribute, relationship), and six conditions for good models — when you understand how these concepts connect, data modeling becomes not a memorization subject but a thinking subject.
The next article covers an overview of relational databases.
