[공도한] W2 mission (I accidentally closed the W1 PR, so please review based on this one) #35

Open
wants to merge 196 commits into
base: main
196 commits
e9a2862
Initial commit
reudekx Jan 2, 2025
794c2dd
Update README.md
reudekx Jan 2, 2025
fc5f8f7
docs: add sample.md
Jan 2, 2025
1fde3fd
chore: add .gitignore
reudekx Jan 2, 2025
a162e66
chore: update .gitignore
reudekx Jan 2, 2025
c087ffb
chore: add .python-version
reudekx Jan 2, 2025
c5a4e07
chore: add pyproject.toml & poetry.lock
reudekx Jan 2, 2025
dc09607
docs: add retrospective template
reudekx Jan 2, 2025
8f3d85e
docs: update README.md
reudekx Jan 2, 2025
274ba6a
refactor: move week 0 retrospective
reudekx Jan 2, 2025
397abd0
chore: add black & isort packages
reudekx Jan 2, 2025
cab9aa4
docs: update 2025-01-02.md
reudekx Jan 2, 2025
ec8a388
docs: update 2025-01-02.md
reudekx Jan 2, 2025
202d77b
chore: add .ipynb_checkpoints at .gitignore
reudekx Jan 3, 2025
abd14fa
chore: add pandas and matplotlib package
reudekx Jan 3, 2025
dc9c76b
feat: add W1/M1
reudekx Jan 3, 2025
11d1ebe
feat: save studyings
reudekx Jan 3, 2025
d460053
docs: add 2025-01-03.md
reudekx Jan 3, 2025
00a97e1
docs: update 2025-01-03.md
reudekx Jan 3, 2025
57c7428
refactor: rename directory names to english
reudekx Jan 3, 2025
8b7d657
chore: add package-mode = false at tool.poetry section
reudekx Jan 3, 2025
8b43ba5
feat: update team activity requirement 1
reudekx Jan 3, 2025
e9076ac
refactor: rename to column_info.md
reudekx Jan 3, 2025
622059f
docs: update README.md
reudekx Jan 3, 2025
a8c9494
docs: update README.md
reudekx Jan 3, 2025
566aaca
feat: improve contents
reudekx Jan 5, 2025
fd409bb
feat: update contents
reudekx Jan 5, 2025
f2c961b
chore: update black package to black[jupyter]
reudekx Jan 6, 2025
ee17799
feat: add W1/M2
reudekx Jan 6, 2025
4421420
feat: update contents
reudekx Jan 6, 2025
73e80aa
docs: add 2025-01-06.md
reudekx Jan 6, 2025
99619de
chore: add beautifulsoup4 package
reudekx Jan 7, 2025
b4e1725
fix: fix typo
reudekx Jan 7, 2025
fb3341e
docs: update 2025-01-06.md
reudekx Jan 7, 2025
1874238
docs: add 2025-01-07.md
reudekx Jan 7, 2025
0d010c8
feat: add W1/M3
reudekx Jan 7, 2025
750bd16
docs: update 2025-01-07.md
reudekx Jan 7, 2025
db1fe45
docs: update 2025-01-07.md
reudekx Jan 7, 2025
6abf980
feat: add testing_parallellism.ipynb
reudekx Jan 7, 2025
73955f3
feat: update testing_parallellism.ipynb
reudekx Jan 7, 2025
4df7c9f
chore: remove .ipynb from black.exclude and isort.skip
reudekx Jan 8, 2025
bd4a9ac
feat: complete basic requirements
reudekx Jan 8, 2025
c3976d2
feat: update for additional requirements
reudekx Jan 8, 2025
1a90339
chore: update log and database
reudekx Jan 8, 2025
7a2a147
docs: update README.md
reudekx Jan 8, 2025
93e98b1
docs: update README.md
reudekx Jan 8, 2025
35a7302
docs: update README.md
reudekx Jan 8, 2025
cc603b2
feat: add code for error handling
reudekx Jan 8, 2025
8a0d291
chore: update log and database
reudekx Jan 8, 2025
e37a235
chore: update log message content
reudekx Jan 8, 2025
72db277
chore: update log and database
reudekx Jan 8, 2025
b561ef1
refactor: rename extractor.py to wiki_extractor.py
reudekx Jan 8, 2025
e672a11
refactor: rename wiki_extractor.py to extractor.py
reudekx Jan 8, 2025
d85eca9
refactor: rename wiki_extractor.py to extractor.py
reudekx Jan 8, 2025
531026b
chore: add pyarrow package
reudekx Jan 8, 2025
a925fc5
docs: fix typo in comment
reudekx Jan 8, 2025
460def3
feat: add etl processor using imf api
reudekx Jan 8, 2025
5aa4fac
docs: add 2025-01-08.md
reudekx Jan 8, 2025
5e73178
docs: update 2025-01-08.md
reudekx Jan 8, 2025
f550578
docs: add python multiprocessing learning content
reudekx Jan 9, 2025
230fccd
feat: update codes using imf api
reudekx Jan 9, 2025
135375f
clean before migration
reudekx Jan 9, 2025
55109aa
Merge branch 'main' of ../softeer/orig.bundle
reudekx Jan 9, 2025
156f727
feat: add interactive command-line interface for ETL process
reudekx Jan 9, 2025
32b4de9
feat: change console logging level to 'debug'
reudekx Jan 9, 2025
e3465f0
chore: update log and database
reudekx Jan 9, 2025
cf9a1a3
refactor: update code to reflect database column name changes
reudekx Jan 9, 2025
82c62fd
chore: update log and database
reudekx Jan 9, 2025
5d6849b
docs: add 2025-01-09.md
reudekx Jan 9, 2025
d66ca6f
chore: update log
reudekx Jan 9, 2025
6342dc8
feat: update etl processing code using api
reudekx Jan 9, 2025
f989758
docs: update README.md
reudekx Jan 9, 2025
9ef291a
docs: remove slide files
reudekx Jan 10, 2025
bd9ad0b
docs: fix typo in file name
reudekx Jan 10, 2025
d82432e
docs: save unsaved content
reudekx Jan 10, 2025
95a6b39
feat: update to use regex pattern object
reudekx Jan 10, 2025
ca19b17
feat: improve logics
reudekx Jan 10, 2025
dad7f30
style: apply black and isort
reudekx Jan 10, 2025
15f55a2
chore: add pre-commit file for auto formatting
reudekx Jan 10, 2025
d86b037
chore: update log and data
reudekx Jan 10, 2025
60bd6e0
feat: update etl processing code using api
reudekx Jan 10, 2025
e68e255
feat: update self-study contents
reudekx Jan 10, 2025
9307fff
docs: remove unnecessary comment
reudekx Jan 10, 2025
f5c9cd7
docs: add 2025-01-10.md
reudekx Jan 10, 2025
7381eca
feat: add W2' M1, M2, M3 and M4
reudekx Jan 13, 2025
b2d3963
fix: fix wrong import
reudekx Jan 13, 2025
842a5d4
feat: display result
reudekx Jan 13, 2025
fd1eb9f
docs: remove unnecessary blank line
reudekx Jan 13, 2025
cc3eb89
docs: add 2025-01-13.md
reudekx Jan 13, 2025
399c17d
docs: update 2025-01-13.md
reudekx Jan 13, 2025
e6400f3
refactor: rename to multiprocessing_all_in_one.py
reudekx Jan 14, 2025
1166ef6
merge: resolve conflict
reudekx Jan 14, 2025
2ab1a79
fix: move to correct path
reudekx Jan 14, 2025
5572710
chore: add wordcloud package
reudekx Jan 14, 2025
7e43204
chore: add temp directory at .gitignore
reudekx Jan 14, 2025
03b0108
feat: add W2/M5
reudekx Jan 14, 2025
efd2733
feat: modify to show intermediate output results
reudekx Jan 14, 2025
64aa480
feat: add packages for web scraping
reudekx Jan 14, 2025
94ad390
chore: add some directories at .gitignore
reudekx Jan 14, 2025
6260f8d
feat: add codes for youtube comment scraping and using this
reudekx Jan 14, 2025
6a00c2e
docs: add 2025-01-14.md
reudekx Jan 14, 2025
7922a3c
feat: add Dockerfile and package list for image building
reudekx Jan 15, 2025
82a2c23
docs: add 2025-01-15.md
reudekx Jan 15, 2025
15de5d2
docs: update 2025-01-15.md
reudekx Jan 15, 2025
8e4ff89
chore: add packages to be installed by Dockerfile
reudekx Jan 16, 2025
42e1b8f
chore: add .DS_Store to .gitignore
reudekx Jan 16, 2025
bd068ec
fix: add missing commands
reudekx Jan 16, 2025
666ed5f
feat: add files for build docker image
reudekx Jan 16, 2025
09c1e1a
chore: add data, log at .gitignore
reudekx Jan 16, 2025
27e6a11
chore: remove data files and logs
reudekx Jan 16, 2025
c751630
feat: modify to improve code readability
reudekx Jan 16, 2025
77bc75c
fix: fix wrong import
reudekx Jan 16, 2025
6669fc8
docs: fix typo in command example
reudekx Jan 16, 2025
d678c46
feat: update Dockerfile
reudekx Jan 16, 2025
bbbeef5
fix: fix to ignore correctly
reudekx Jan 16, 2025
0820880
docs: add 2025-01-16.md
reudekx Jan 16, 2025
b738a90
feat: update W2/M5
reudekx Jan 16, 2025
2e12c72
chore: add pytest package
reudekx Jan 17, 2025
03cf6b9
feat: add codes for W2/M5
reudekx Jan 17, 2025
65ec70f
refactor: move old files
reudekx Jan 17, 2025
b6e63c7
chore: add selenium & openpyxl package
reudekx Jan 17, 2025
94a89a9
feat: add codes for scraping news article links
reudekx Jan 17, 2025
5da0620
docs: add 2025-01-17.md
reudekx Jan 17, 2025
2cc1754
docs: fix typo in date
reudekx Jan 17, 2025
d547d23
refactor: move directory
reudekx Jan 17, 2025
2ab6190
fix: fix wrong variable name
reudekx Jan 17, 2025
c1c4238
docs: add w2m5 wiki
reudekx Jan 20, 2025
15ad157
feat: add temporary setup files
reudekx Jan 20, 2025
e364898
docs: add comments
reudekx Jan 20, 2025
c714ad4
docs: add 2025-01-20.md
reudekx Jan 20, 2025
e913f7b
docs: update 2025-01-20.md
reudekx Jan 20, 2025
306ed08
docs: add studying_hadoop.md
reudekx Jan 21, 2025
2988778
docs: update studying_hadoop.md
reudekx Jan 21, 2025
155b863
docs: update studying_hadoop.md
reudekx Jan 21, 2025
cb8fff3
docs: update studying_hadoop.md
reudekx Jan 21, 2025
6eb683a
feat: add W3/M2
reudekx Jan 21, 2025
b43a37f
docs: add 2025-01-21.md
reudekx Jan 21, 2025
5077e6c
feat: update W3/M2
reudekx Jan 21, 2025
f0b89d4
chore: add mrjob package
reudekx Jan 22, 2025
bdc16f8
feat: add hadoop-setup files
reudekx Jan 22, 2025
41cd375
feat: add W3/M3
reudekx Jan 22, 2025
da53005
docs: update studying_hadoop.md
reudekx Jan 22, 2025
291bdeb
feat: remove unnecessary files
reudekx Jan 22, 2025
27e5c65
docs: add 2025-01-22.md
reudekx Jan 22, 2025
90d5e3c
docs: add command for print
reudekx Jan 22, 2025
2ca39cd
feat: update hadoop setup files
reudekx Jan 23, 2025
1b92cdc
docs: update README.md
reudekx Jan 23, 2025
a29fa54
feat: update hadoop setup files
reudekx Jan 23, 2025
f49ccc2
docs: add README.md
reudekx Jan 23, 2025
6caee02
docs: add README.md
reudekx Jan 23, 2025
834e47d
docs: update README.md
reudekx Jan 23, 2025
f31ae3a
docs: update cmd_example.md
reudekx Jan 23, 2025
4fb44ec
docs: modify comment
reudekx Jan 23, 2025
4643083
feat: modify to install python3-pip package
reudekx Jan 23, 2025
92c5212
docs: add 2025-01-23.md
reudekx Jan 23, 2025
058f338
feat: add openai, soynlp, networkx, nltk package
reudekx Jan 25, 2025
1530841
feat: prototype code for mini project
reudekx Jan 25, 2025
656c3e6
docs: add 2025-01-24.md
reudekx Jan 25, 2025
74002d7
docs: fix typo in 2025-01-24.md
reudekx Jan 25, 2025
caac0a9
Update 2025-01-24.md
reudekx Jan 25, 2025
d3c0876
chore: add pyspark package
reudekx Jan 29, 2025
fa4f3bc
feat: add W4M1
reudekx Jan 29, 2025
636684e
docs: add 2025-01-29.md
reudekx Jan 29, 2025
a3659e0
docs: update 2025-01-29.md
reudekx Jan 29, 2025
ec683cb
feat: add W4M2
reudekx Jan 30, 2025
4719283
docs: temporarily save file in progress
reudekx Jan 30, 2025
33e65c4
feat: add sample code
reudekx Jan 30, 2025
effca16
docs: add 2025-01-30.md
reudekx Jan 30, 2025
992d84e
docs: update 2025-01-30.md
reudekx Jan 30, 2025
40cb7d5
docs: add studying_spark.md
reudekx Feb 2, 2025
7f47bd7
feat: temporarily save files in progress
reudekx Feb 2, 2025
5c74451
feat: change memory usage
reudekx Feb 2, 2025
9b16d33
feat: update spark setup files
reudekx Feb 3, 2025
5e25b2d
feat: add W5/M1
reudekx Feb 3, 2025
ab36ac6
docs: add 2025-02-03.md
reudekx Feb 3, 2025
b23c142
feat: update W5/M1
reudekx Feb 3, 2025
fdd4791
feat: update spark app codes
reudekx Feb 3, 2025
9595fa1
chore: add some packages
reudekx Feb 4, 2025
8a33619
feat: update spark missions
reudekx Feb 4, 2025
35f01d7
docs: add 2025-02-04.md
reudekx Feb 4, 2025
6ee9732
docs: update 2025-02-04.md
reudekx Feb 4, 2025
2cb03dc
feat: update spark missions
reudekx Feb 5, 2025
b3e9359
docs: add 2025-02-05.md
reudekx Feb 5, 2025
4f85010
docs: update 2025-02-04.md
reudekx Feb 6, 2025
78b14d5
docs: 2025-02-06.md
reudekx Feb 6, 2025
2302c32
docs: 2025-02-07.md
reudekx Feb 7, 2025
961ad46
docs: add 2025-02-10.md
reudekx Feb 10, 2025
2abbd28
docs: update 2025-02-10.md
reudekx Feb 10, 2025
cfd7189
docs: add 2025-02-11.md
reudekx Feb 12, 2025
e9e4fb4
feat: add ipynb file for studying spark optimization
reudekx Feb 12, 2025
113c870
docs: add 2025-02-12.md
reudekx Feb 12, 2025
5add9d1
docs: add 2025-02-13.md (in progress)
reudekx Feb 13, 2025
212269d
feat: update 2025-02-13.md
reudekx Feb 13, 2025
48c57ed
docs: add 2025-02-14.md
reudekx Feb 15, 2025
fc64db5
docs: update 2025-02-14.md
reudekx Feb 15, 2025
9df062e
docs: add 2025-02-18.md
reudekx Feb 18, 2025
docs: update studying_hadoop.md
reudekx committed Jan 21, 2025
commit 298877816f700c2640fe89b6d53465b473bfd0db
14 changes: 5 additions & 9 deletions wiki/studying_hadoop.md
@@ -36,12 +36,6 @@ HDFS > YARN > MapReduce

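MapReduce is presumably the computation that runs when Hadoop is ultimately put to use, so the control center of a Hadoop cluster is, in the end, YARN.

-I looked around and found something called Kerberos. Does it manage accounts and permissions at the service layer?
-Network-layer security should presumably be handled by TLS, SSH, and the like, but I doubt a Hadoop administrator exchanges keys node by node by hand.
--> So is YARN what controls this in the end?
-
-I asked an AI this question, and it said yes.
-
> Let's think through the flow.
```
1. First, install hadoop.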
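@@ -51,7 +45,7 @@ MapReduce is presumably the computation that runs when Hadoop is ultimately
* When the containers are first created, ssh keys must be generated and exchanged between the containers. (Automate this with a script..)
* This is obviously needed, since not just any server should be allowed to connect.
* The host OS also needs to connect (at least to the ResourceManager container?), so exchange keys with it as well.
-5. In addition, an HDFS user must be created on each node (a Docker container if Docker is used, otherwise a physical server with a single OS installed).
+4. In addition, an HDFS user must be created on each node (a Docker container if Docker is used, otherwise a physical server with a single OS installed).
* When writing the Dockerfile, set hadoop as the superuser that runs the hadoop daemons.
* The reason for not using root as-is is that root also holds permissions over components other than hadoop.
* Also, creating the superuser under the name hadoop seems to be the convention.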
@@ -71,5 +65,7 @@ MapReduce is presumably the computation that runs when Hadoop is ultimately
RUN useradd -m -s /bin/bash hadoop && \
echo "hadoop ALL=(ALL) NOPASSWD: /usr/sbin/useradd, /usr/sbin/usermod" >> /etc/sudoers
```
-6. Once this setup is complete, write the application code and ask hadoop to run the appropriate Map/Reduce operations.
-```
+5. Once this setup is complete, write the application code and ask hadoop to run the appropriate Map/Reduce operations.
+```
+
+Apparently something called Kerberos can handle account management at the YARN layer (probably..?), so look into this as well.
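
The last step of the flow above (writing application code that asks Hadoop to run Map/Reduce operations) can be illustrated with a word count. This is only a minimal local sketch in plain Python, not the code from these notes: the names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are hypothetical, and no Hadoop API is called. On a real cluster the framework performs the grouping between the map and reduce phases; here `shuffle` stands in for it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group values by key (a stand-in for what the framework does)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum all counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map -> shuffle -> reduce locally over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = ["hadoop yarn hdfs", "yarn schedules hadoop jobs"]
    print(word_count(sample))
```

Keeping the map and reduce steps as separate pure functions mirrors how the two phases are written independently for a cluster job, which is why the sketch avoids a single combined counting loop.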