normalizer

unicode?

  • A table that maps numbers to characters one-to-one (number as key, character as value).

  • Just as ASCII maps 0x41 to A, Unicode assigns a number to every character, including those ASCII cannot represent (it covers the world's writing systems).

  • For Hangul, it includes both the jamo used to compose syllables and the precomposed syllables.

  • A number written with the U+ prefix is a Unicode code point.

    • U+0041 = A
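The mapping above can be checked directly in Java. This is a minimal sketch: the class name CodePointDemo and the helper toUPlus are hypothetical names chosen for illustration, and the two Hangul code points are just one precomposed syllable and one jamo picked as examples.

```java
class CodePointDemo {
    // Formats the first code point of the string in U+XXXX notation.
    static String toUPlus(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Latin capital letter A
        System.out.println(toUPlus("A"));
        // A precomposed Hangul syllable (code point D55C)
        System.out.println(toUPlus("\uD55C"));
        // A conjoining Hangul jamo, the leading consonant of that syllable (code point 1112)
        System.out.println(toUPlus("\u1112"));
    }
}
```

Running it prints U+0041, U+D55C, and U+1112, which shows that both the precomposed syllable and the individual jamo have their own code points.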

UTF-8, UTF-16 ?

  • Encoding schemes that decide how a code point (the numeric key) is represented as bytes.

  • Take the character A:

    • In UTF-8 it is represented as 0x41.

      • UTF-8 is variable-width, using 1 to 4 bytes per character, and A fits in 1 byte.

      • That is, U+0041 is encoded as 0x41.

    • In UTF-16 it is represented as 0x0041.

      • UTF-16 is also variable-width, using 2 or 4 bytes per character.
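The byte-level difference between the two encodings can be seen with String.getBytes. This is a minimal sketch: EncodingDemo and hex are hypothetical names, UTF_16BE is chosen so that no byte order mark is emitted, and the Hangul syllable and emoji are extra examples added here to show the wider widths.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Renders a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String a = "A";                                        // U+0041
        String syllable = "\uD55C";                            // precomposed Hangul syllable
        String emoji = new String(Character.toChars(0x1F600)); // outside the BMP

        for (String s : new String[] {a, syllable, emoji}) {
            System.out.println("UTF-8 : " + hex(s.getBytes(StandardCharsets.UTF_8)));
            System.out.println("UTF-16: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        }
    }
}
```

A takes 1 byte in UTF-8 (41) and 2 bytes in UTF-16 (00 41); the Hangul syllable takes 3 bytes in UTF-8 (ED 95 9C) and 2 in UTF-16 (D5 5C); the emoji takes 4 bytes in both (F0 9F 98 80 and the surrogate pair D8 3D DE 00).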

References

  • https://norux.me/31

  • https://namu.wiki/w/UTF-8

normalizer?

  • Normalization unifies characters (or character sequences) that can be represented in more than one way into a single, consistent form.

  • Removing emoji or special characters, or replacing them with spaces, is a separate cleanup step; Unicode normalization itself does not do that.

  • Why is it needed?

    • Operating systems can use different Unicode normalization forms (e.g., a file with a Hangul name created on macOS can show up on Windows with its jamo separated).

    • So text needs to be unified into a single form.

  • There are four Unicode normalization forms.

    1. NFC: Normalization Form Canonical Composition

      • Canonical decomposition, followed by canonical composition.

    2. NFD: Normalization Form Canonical Decomposition

      • Canonical decomposition.

      • That is, a character with a diacritical mark that is stored as a single code point is split into a base character plus a combining mark.

    3. NFKC: Normalization Form Compatibility Composition

    4. NFKD: Normalization Form Compatibility Decomposition

  • In the screenshot attached above, the letter o has a mark over it.

  • It shows how such a character is normalized under each of the four forms.

  • In Java, the java.text.Normalizer class provides this functionality.

    • This class provides the method normalize which transforms Unicode text into an equivalent composed or decomposed form
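The notes above can be tried out with java.text.Normalizer. This is a minimal sketch under a few assumptions: NormalizerDemo and codePoints are hypothetical names, the letter o with a diaeresis (code point 00F6) stands in for the character in the screenshot, and the decomposed Hangul string imitates the separated-jamo case described for files coming from macOS.

```java
import java.text.Normalizer;

class NormalizerDemo {
    // Lists the code points of a string in U+XXXX notation, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The same syllable written as three conjoining jamo and as one precomposed code point.
        String decomposed = "\u1112\u1161\u11AB";
        String composed = "\uD55C";

        // Without normalization the two strings are not equal.
        System.out.println(decomposed.equals(composed));

        // After NFC both sides are the single precomposed code point.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed));

        // Assumed stand-in for the screenshot character: o with diaeresis.
        String o = "\u00F6";
        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form + ": " + codePoints(Normalizer.normalize(o, form)));
        }

        // Compatibility forms also fold characters such as the fi ligature (code point FB01).
        System.out.println(Normalizer.normalize("\uFB01", Normalizer.Form.NFKC));
    }
}
```

For this input, NFC and NFKC keep the single code point U+00F6, while NFD and NFKD produce U+006F followed by the combining diaeresis U+0308. The ligature is left alone by NFC and NFD but becomes the two letters "fi" under NFKC and NFKD, which is the practical difference between the canonical and compatibility forms.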

References

  • https://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html#:~:text=This%20class%20provides%20the%20method,%2315%20%E2%80%94%20Unicode%20Normalization%20Forms.

  • https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html

  • https://velog.io/@leejh3224/%EB%B2%88%EC%97%AD-%EC%9C%A0%EB%8B%88%EC%BD%94%EB%93%9C-%EC%8A%A4%ED%8A%B8%EB%A7%81%EC%9D%84-%EB%85%B8%EB%A9%80%EB%9D%BC%EC%9D%B4%EC%A7%95-%ED%95%B4%EC%95%BC%ED%95%98%EB%8A%94-%EC%9D%B4%EC%9C%A0

  • https://www.hungrydiver.co.kr/bbs/detail/develop?id=68&scroll=comment
