[공도한] W2 Mission (I ended up closing the W1 PR, so I would appreciate it if you reviewed based on this one.) #35

Open
wants to merge 131 commits into
base: main
131 commits
e9a2862
Initial commit
reudekx Jan 2, 2025
794c2dd
Update README.md
reudekx Jan 2, 2025
fc5f8f7
docs: add sample.md
Jan 2, 2025
1fde3fd
chore: add .gitignore
reudekx Jan 2, 2025
a162e66
chore: update .gitignore
reudekx Jan 2, 2025
c087ffb
chore: add .python-version
reudekx Jan 2, 2025
c5a4e07
chore: add pyproject.toml & poetry.lock
reudekx Jan 2, 2025
dc09607
docs: add retrospective template
reudekx Jan 2, 2025
8f3d85e
docs: update README.md
reudekx Jan 2, 2025
274ba6a
refactor: move week 0 retrospective
reudekx Jan 2, 2025
397abd0
chore: add black & isort packages
reudekx Jan 2, 2025
cab9aa4
docs: update 2025-01-02.md
reudekx Jan 2, 2025
ec8a388
docs: update 2025-01-02.md
reudekx Jan 2, 2025
202d77b
chore: add .ipynb_checkpoints at .gitignore
reudekx Jan 3, 2025
abd14fa
chore: add pandas and matplotlib package
reudekx Jan 3, 2025
dc9c76b
feat: add W1/M1
reudekx Jan 3, 2025
11d1ebe
feat: save studyings
reudekx Jan 3, 2025
d460053
docs: add 2025-01-03.md
reudekx Jan 3, 2025
00a97e1
docs: update 2025-01-03.md
reudekx Jan 3, 2025
57c7428
refactor: rename directory names to english
reudekx Jan 3, 2025
8b7d657
chore: add package-mode = false at tool.poetry section
reudekx Jan 3, 2025
8b43ba5
feat: update team activity requirement 1
reudekx Jan 3, 2025
e9076ac
refactor: rename to column_info.md
reudekx Jan 3, 2025
622059f
docs: update README.md
reudekx Jan 3, 2025
a8c9494
docs: update README.md
reudekx Jan 3, 2025
566aaca
feat: improve contents
reudekx Jan 5, 2025
fd409bb
feat: update contents
reudekx Jan 5, 2025
f2c961b
chore: update black package to black[jupyter]
reudekx Jan 6, 2025
ee17799
feat: add W1/M2
reudekx Jan 6, 2025
4421420
feat: update contents
reudekx Jan 6, 2025
73e80aa
docs: add 2025-01-06.md
reudekx Jan 6, 2025
99619de
chore: add beautifulsoup4 package
reudekx Jan 7, 2025
b4e1725
fix: fix typo
reudekx Jan 7, 2025
fb3341e
docs: update 2025-01-06.md
reudekx Jan 7, 2025
1874238
docs: add 2025-01-07.md
reudekx Jan 7, 2025
0d010c8
feat: add W1/M3
reudekx Jan 7, 2025
750bd16
docs: update 2025-01-07.md
reudekx Jan 7, 2025
db1fe45
docs: update 2025-01-07.md
reudekx Jan 7, 2025
6abf980
feat: add testing_parallellism.ipynb
reudekx Jan 7, 2025
73955f3
feat: update testing_parallellism.ipynb
reudekx Jan 7, 2025
4df7c9f
chore: remove .ipynb from black.exclude and isort.skip
reudekx Jan 8, 2025
bd4a9ac
feat: complete basic requirements
reudekx Jan 8, 2025
c3976d2
feat: update for additional requirements
reudekx Jan 8, 2025
1a90339
chore: update log and database
reudekx Jan 8, 2025
7a2a147
docs: update README.md
reudekx Jan 8, 2025
93e98b1
docs: update README.md
reudekx Jan 8, 2025
35a7302
docs: update README.md
reudekx Jan 8, 2025
cc603b2
feat: add code for error handling
reudekx Jan 8, 2025
8a0d291
chore: update log and database
reudekx Jan 8, 2025
e37a235
chore: update log message content
reudekx Jan 8, 2025
72db277
chore: update log and database
reudekx Jan 8, 2025
b561ef1
refactor: rename extractor.py to wiki_extractor.py
reudekx Jan 8, 2025
e672a11
refactor: rename wiki_extractor.py to extractor.py
reudekx Jan 8, 2025
d85eca9
refactor: rename wiki_extractor.py to extractor.py
reudekx Jan 8, 2025
531026b
chore: add pyarrow package
reudekx Jan 8, 2025
a925fc5
docs: fix typo in comment
reudekx Jan 8, 2025
460def3
feat: add etl processor using imf api
reudekx Jan 8, 2025
5aa4fac
docs: add 2025-01-08.md
reudekx Jan 8, 2025
5e73178
docs: update 2025-01-08.md
reudekx Jan 8, 2025
f550578
docs: add python multiprocessing learning content
reudekx Jan 9, 2025
230fccd
feat: update codes using imf api
reudekx Jan 9, 2025
135375f
clean before migration
reudekx Jan 9, 2025
55109aa
Merge branch 'main' of ../softeer/orig.bundle
reudekx Jan 9, 2025
156f727
feat: add interactive command-line interface for ETL process
reudekx Jan 9, 2025
32b4de9
feat: change console logging level to 'debug'
reudekx Jan 9, 2025
e3465f0
chore: update log and database
reudekx Jan 9, 2025
cf9a1a3
refactor: update code to reflect database column name changes
reudekx Jan 9, 2025
82c62fd
chore: update log and database
reudekx Jan 9, 2025
5d6849b
docs: add 2025-01-09.md
reudekx Jan 9, 2025
d66ca6f
chore: update log
reudekx Jan 9, 2025
6342dc8
feat: update etl processing code using api
reudekx Jan 9, 2025
f989758
docs: update README.md
reudekx Jan 9, 2025
9ef291a
docs: remove slide files
reudekx Jan 10, 2025
bd9ad0b
docs: fix typo in file name
reudekx Jan 10, 2025
d82432e
docs: save unsaved content
reudekx Jan 10, 2025
95a6b39
feat: update to use regex pattern object
reudekx Jan 10, 2025
ca19b17
feat: improve logics
reudekx Jan 10, 2025
dad7f30
style: apply black and isort
reudekx Jan 10, 2025
15f55a2
chore: add pre-commit file for auto formatting
reudekx Jan 10, 2025
d86b037
chore: update log and data
reudekx Jan 10, 2025
60bd6e0
feat: update etl processing code using api
reudekx Jan 10, 2025
e68e255
feat: update self-study contents
reudekx Jan 10, 2025
9307fff
docs: remove unnecessary comment
reudekx Jan 10, 2025
f5c9cd7
docs: add 2025-01-10.md
reudekx Jan 10, 2025
7381eca
feat: add W2' M1, M2, M3 and M4
reudekx Jan 13, 2025
b2d3963
fix: fix wrong import
reudekx Jan 13, 2025
842a5d4
feat: display result
reudekx Jan 13, 2025
fd1eb9f
docs: remove unnecessary blank line
reudekx Jan 13, 2025
cc3eb89
docs: add 2025-01-13.md
reudekx Jan 13, 2025
399c17d
docs: update 2025-01-13.md
reudekx Jan 13, 2025
e6400f3
refactor: rename to multiprocessing_all_in_one.py
reudekx Jan 14, 2025
1166ef6
merge: resolve conflict
reudekx Jan 14, 2025
2ab1a79
fix: move to correct path
reudekx Jan 14, 2025
5572710
chore: add wordcloud package
reudekx Jan 14, 2025
7e43204
chore: add temp directory at .gitignore
reudekx Jan 14, 2025
03b0108
feat: add W2/M5
reudekx Jan 14, 2025
efd2733
feat: modify to show intermediate output results
reudekx Jan 14, 2025
64aa480
feat: add packages for web scraping
reudekx Jan 14, 2025
94ad390
chore: add some directories at .gitignore
reudekx Jan 14, 2025
6260f8d
feat: add codes for youtube comment scraping and using this
reudekx Jan 14, 2025
6a00c2e
docs: add 2025-01-14.md
reudekx Jan 14, 2025
7922a3c
feat: add Dockerfile and package list for image building
reudekx Jan 15, 2025
82a2c23
docs: add 2025-01-15.md
reudekx Jan 15, 2025
15de5d2
docs: update 2025-01-15.md
reudekx Jan 15, 2025
8e4ff89
chore: add packages to be installed by Dockerfile
reudekx Jan 16, 2025
42e1b8f
chore: add .DS_Store to .gitignore
reudekx Jan 16, 2025
bd068ec
fix: add missing commands
reudekx Jan 16, 2025
666ed5f
feat: add files for build docker image
reudekx Jan 16, 2025
09c1e1a
chore: add data, log at .gitignore
reudekx Jan 16, 2025
27e6a11
chore: remove data files and logs
reudekx Jan 16, 2025
c751630
feat: modify to improve code readability
reudekx Jan 16, 2025
77bc75c
fix: fix wrong import
reudekx Jan 16, 2025
6669fc8
docs: fix typo in command example
reudekx Jan 16, 2025
d678c46
feat: update Dockerfile
reudekx Jan 16, 2025
bbbeef5
fix: fix to ignore correctly
reudekx Jan 16, 2025
0820880
docs: add 2025-01-16.md
reudekx Jan 16, 2025
b738a90
feat: update W2/M5
reudekx Jan 16, 2025
2e12c72
chore: add pytest package
reudekx Jan 17, 2025
03cf6b9
feat: add codes for W2/M5
reudekx Jan 17, 2025
65ec70f
refactor: move old files
reudekx Jan 17, 2025
b6e63c7
chore: add selenium & openpyxl package
reudekx Jan 17, 2025
94a89a9
feat: add codes for scraping news article links
reudekx Jan 17, 2025
5da0620
docs: add 2025-01-17.md
reudekx Jan 17, 2025
2cc1754
docs: fix typo in date
reudekx Jan 17, 2025
d547d23
refactor: move directory
reudekx Jan 17, 2025
2ab6190
fix: fix wrong variable name
reudekx Jan 17, 2025
c1c4238
docs: add w2m5 wiki
reudekx Jan 20, 2025
15ad157
feat: add temporary setup files
reudekx Jan 20, 2025
e364898
docs: add comments
reudekx Jan 20, 2025
c714ad4
docs: add 2025-01-20.md
reudekx Jan 20, 2025
e913f7b
docs: update 2025-01-20.md
reudekx Jan 20, 2025
28 changes: 28 additions & 0 deletions .githooks/pre-commit
@@ -0,0 +1,28 @@
#!/bin/bash

# Collect only the staged Python files
files=$(git diff --cached --name-only --diff-filter=d | grep '\.py$')

if [ -z "$files" ]; then
    echo "No Python files to check"
    exit 0
fi

echo "Starting code checks..."

# Check the exit status of each command
if ! poetry run isort $files; then
    echo "isort failed"
    exit 1
fi

if ! poetry run black $files; then
    echo "black failed"
    exit 1
fi

# Re-stage the files that were modified
git add $files
echo "Code checks completed successfully"

exit 0
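
For readers who want to follow the hook's flow outside of shell, the same steps can be sketched in Python. This is only an illustration under stated assumptions: the function names `select_python_files`, `build_commands`, and `run_hook` are hypothetical and not part of this PR, and the sketch assumes `git` and `poetry` are on the PATH, as the hook itself does.

```python
import subprocess


def select_python_files(name_only_output: str) -> list[str]:
    """Keep only paths ending in .py, like `grep '\\.py$'` in the hook."""
    return [
        line for line in name_only_output.splitlines() if line.endswith(".py")
    ]


def build_commands(files: list[str]) -> list[list[str]]:
    """Commands the hook runs, in order: isort, black, then re-stage."""
    if not files:
        # Mirrors the hook's early exit when nothing is staged.
        return []
    return [
        ["poetry", "run", "isort", *files],
        ["poetry", "run", "black", *files],
        ["git", "add", *files],
    ]


def run_hook() -> int:
    """Run the checks; return a shell-style exit code (0 = success)."""
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=d"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    for command in build_commands(select_python_files(staged)):
        if subprocess.run(command).returncode != 0:
            return 1
    return 0
```

One difference from the shell version: each path is passed as its own list element, so a path containing a space stays a single argument, whereas the unquoted `$files` in the hook is split on whitespace.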
167 changes: 9 additions & 158 deletions .gitignore
@@ -1,160 +1,11 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.git
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/
__pycache__
.ipynb_checkpoints

.DS_Store

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
temp
private
data
log
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.12
90 changes: 90 additions & 0 deletions README.md
@@ -0,0 +1,90 @@
# Softeer Data Engineering WIKI

## Overview

This repository collects the assignments I completed and
the material I studied during the Softeer Data Engineering course.

## Directory Structure

```
softeer
├─ .python-version
├─ README.md
├─ missions
│  └─ W1
│     ├─ M1
│     │  ├─ column_info.md
│     │  ├─ data_exploration.ipynb
│     │  └─ mtcars.csv
│     ├─ M2
│     │  ├─ learning_sql.ipynb
│     │  └─ test.db
│     └─ M3
│        ├─ data
│        │  ├─ Countries_by_GDP.json
│        │  ├─ Countries_by_GDP_etl_processed.json
│        │  ├─ World_Economies.db
│        │  ├─ api
│        │  │  └─ extracted
│        │  │     ├─ gdp_2019.parquet
│        │  │     ├─ gdp_2020.parquet
│        │  │     ├─ gdp_2021.parquet
│        │  │     ├─ gdp_2022.parquet
│        │  │     ├─ gdp_2023.parquet
│        │  │     └─ gdp_2024.parquet
│        │  └─ country_regions.json
│        ├─ etl_project_gdp.py
│        ├─ etl_project_gdp_using_api.py
│        ├─ etl_project_gdp_visualization.ipynb
│        ├─ etl_project_gdp_with_sql.py
│        ├─ etl_project_gdp_with_sql_visualization.ipynb
│        ├─ log
│        │  ├─ etl_project_log.txt
│        │  ├─ etl_project_log_etl_processed.txt
│        │  └─ etl_project_log_using_api.txt
│        ├─ processor
│        │  ├─ __init__.py
│        │  ├─ api
│        │  │  ├─ __init__.py
│        │  │  ├─ extractor.py
│        │  │  └─ transformer.py
│        │  ├─ extractor.py
│        │  ├─ io_handler.py
│        │  ├─ json_loader.py
│        │  ├─ sqlite_loader.py
│        │  └─ transformer.py
│        └─ utils
│           ├─ __init__.py
│           └─ logging.py
├─ poetry.lock
├─ pyproject.toml
├─ retrospect
│  ├─ 2025-01-02.md
│  ├─ 2025-01-03.md
│  ├─ 2025-01-06.md
│  ├─ 2025-01-08.md
│  ├─ 2025-01-09.md
│  └─ 2525-01-07.md
├─ self_study
│  ├─ README.md
│  ├─ learning_pandas
│  │  ├─ data.csv
│  │  └─ pandas.ipynb
│  └─ parallellism
│     ├─ files
│     │  ├─ pool_test.py
│     │  └─ queue_test.py
│     ├─ testing_does_pandas_use_multicore.ipynb
│     └─ testing_multiprocessing.ipynb
└─ slides
   ├─ W1 Introduction to Data Engineering.pdf
   ├─ W2 Introduction to Big Data.pdf
   ├─ W3 Introduction to Apache Hadoop.pdf
   ├─ W4 Introduction to Apache Spark.pdf
   ├─ W5 How Spark Works Internally - RDD and DAG.pdf
   ├─ W6 Optimizing Spark Job.pdf
   ├─ W7 Monitoring and Optimizing Spark Job.pdf
   └─ W8 Adaptive Query Execution in Spark.pdf

```
16 changes: 16 additions & 0 deletions missions/W1/Dockerfile
@@ -0,0 +1,16 @@
FROM quay.io/jupyter/base-notebook:latest

# Set the working directory
WORKDIR /app

# Copy the Python package list
COPY requirements.txt .

# Install the Python packages
RUN pip install -r requirements.txt

# Copy the project files
COPY . .

# Run Jupyter Notebook
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
19 changes: 19 additions & 0 deletions missions/W1/M1/column_info.md
@@ -0,0 +1,19 @@
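# Understanding the Dataset

* **mpg (fuel economy)**: Distance in miles that can be driven on one gallon of fuel. Higher values mean better fuel economy.
* **cyl (number of cylinders)**: Number of cylinders in the engine. Typically 4, 6, or 8.
  * 4 ~ 8 (4, 6, 8)
* **disp (displacement)**: Total volume of all cylinders, in cubic inches.
* **hp (horsepower)**: Maximum power the engine can produce.
* **drat (rear axle ratio)**: Gear ratio used when transmitting engine power to the wheels.
* **wt (weight)**: Vehicle weight in units of 1,000 pounds.
* **qsec (quarter-mile time)**: Time taken to cover 1/4 mile (about 400 m); indicates the car's acceleration performance.
* **vs (engine shape)**: Whether the engine is V-shaped or straight (inline).
  * 0 or 1
* **am (transmission type)**: Whether the transmission is automatic (0) or manual (1).
  * Transmission
  * 0 or 1
* **gear (number of forward gears)**: Number of forward gears.
  * 3 ~ 5
* **carb (number of carburetors)**: Number of carburetors fitted to the engine. A carburetor mixes fuel and air.
  * 1 ~ 8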
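
The value ranges listed above can be turned into a quick sanity check. The following is a minimal sketch, not code from this PR: `ALLOWED`, `SAMPLE_CSV`, and `find_violations` are hypothetical names, the two inline rows are only an illustrative sample in the mtcars column layout (not the repository's `mtcars.csv`), and the ranges `3 ~ 5` and `1 ~ 8` are assumed to mean inclusive integer ranges.

```python
import csv
import io

# Allowed values per column, taken from the notes above.
# Assumption: "3 ~ 5" and "1 ~ 8" are inclusive integer ranges.
ALLOWED = {
    "cyl": {4, 6, 8},
    "vs": {0, 1},
    "am": {0, 1},
    "gear": set(range(3, 6)),
    "carb": set(range(1, 9)),
}

# Illustrative inline sample in the mtcars column layout (hypothetical
# stand-in for the real file, so the sketch runs without any data on disk).
SAMPLE_CSV = """mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
21.0,6,160.0,110,3.90,2.620,16.46,0,1,4,4
22.8,4,108.0,93,3.85,2.320,18.61,1,1,4,1
"""


def find_violations(csv_text: str) -> list[tuple[int, str, int]]:
    """Return (row_number, column, value) for every out-of-range value."""
    violations = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column, allowed in ALLOWED.items():
            # Assumes these columns are stored as plain integers.
            value = int(row[column])
            if value not in allowed:
                violations.append((row_number, column, value))
    return violations


if __name__ == "__main__":
    print(find_violations(SAMPLE_CSV))
```

Because `csv.DictReader` looks columns up by name, extra columns such as a model-name column would not affect the check.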