Migrate languages.py to spec + code generation #186

@rtibbles

This issue is not open for contribution. Visit Contributing guidelines to learn about the contributing process and how to find suitable issues.

Overview

Migrate le_utils/constants/languages.py from the legacy JSON-as-data approach to the modern spec + code generation system. This is the most complex module to migrate: it carries 1,141 lines of language data, custom namedtuple properties, and multiple helper functions.

Context

Currently, le_utils/constants/languages.py uses the legacy approach:

  • Loads resources/languagelookup.json (22,820 bytes, 1,141 lines!)
  • Custom Language namedtuple with code, id, and first_native_name properties
  • Multiple helper functions: getlang(), getlang_by_name(), getlang_by_native_name(), getlang_by_alpha2()
  • RTL language list: RTL_LANG_CODES
  • No JavaScript export available

Current Structure

File: le_utils/resources/languagelookup.json (1,141 language entries)

{
  "aa": {
    "name": "Afar",
    "native_name": "Afaraf"
  },
  "en": {
    "name": "English",
    "native_name": "English"
  },
  "es-MX": {
    "name": "Spanish (Mexico)",
    "native_name": "Español (México)"
  },
  ...
}

The Python module has:

  • Custom Language namedtuple with properties:
    • code property: combines primary_code and subcode (e.g., "en-US")
    • id property: alias for code
    • first_native_name property: first name from comma-separated list
  • Helper functions for lookups by various criteria
  • RTL language codes list

Target Spec Format

Create spec/constants-languages.json with all language data:

{
  "namedtuple": {
    "name": "Language",
    "fields": ["native_name", "primary_code", "subcode", "name"],
    "properties": {
      "code": "return '{}-{}'.format(self.primary_code, self.subcode) if self.subcode else self.primary_code",
      "id": "return self.code",
      "first_native_name": "return self.native_name.split(',')[0]"
    }
  },
  "rtl_codes": ["ar", "arq", "dv", "he", "fa", "ps", "ur", "yi"],
  "constants": {
    "aa": {
      "name": "Afar",
      "native_name": "Afaraf"
    },
    "en": {
      "name": "English",
      "native_name": "English"
    },
    "es-MX": {
      "name": "Spanish (Mexico)",
      "native_name": "Español (México)"
    }
  }
}

Copy all 1,141 entries from languagelookup.json. The generation script will parse language codes (e.g., "es-MX") into primary_code="es" and subcode="MX".

Note: The properties metadata tells the generation script to add @property methods to the namedtuple class.
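
The code parsing described above could look roughly like the sketch below. The helper name split_language_code is hypothetical, and the rule it applies (split on the first hyphen, no hyphen means no subcode) is an assumption about how the generation script should treat keys; the existing module may handle edge cases differently.

```python
def split_language_code(code):
    """Split a spec key such as "es-MX" into (primary_code, subcode).

    Assumption: the first hyphen separates the two parts, and a key
    without a hyphen has no subcode (represented as None).
    """
    primary_code, _, subcode = code.partition("-")
    return primary_code, subcode or None


print(split_language_code("es-MX"))  # ('es', 'MX')
print(split_language_code("en"))     # ('en', None)
```

Using None for a missing subcode matches the subcode=None entries in the generated output example.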

Generation Script Enhancement

Update scripts/generate_from_specs.py to handle:

  1. Namedtuple properties from properties metadata
  2. RTL codes list from rtl_codes metadata
  3. Helper functions for language lookups:
    • getlang(code) - lookup by code
    • getlang_by_name(name) - case-insensitive lookup by English name
    • getlang_by_native_name(native_name) - case-insensitive lookup by native name
    • getlang_by_alpha2(alpha2) - lookup by 2-letter code
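
As a minimal sketch of item 1, the function below renders the source of a namedtuple subclass from the namedtuple metadata in the spec. The function name render_namedtuple_class is hypothetical and is not part of the current scripts/generate_from_specs.py; the exec call is only there to show that the rendered source is valid Python.

```python
from collections import namedtuple


def render_namedtuple_class(meta):
    """Return Python source for the namedtuple subclass described by `meta`."""
    header = "class {name}(namedtuple({name!r}, {fields!r})):".format(
        name=meta["name"], fields=meta["fields"]
    )
    lines = [header]
    properties = meta.get("properties", {})
    if not properties:
        # An empty class body would be a syntax error.
        lines.append("    pass")
    for prop_name, body in properties.items():
        lines.extend([
            "    @property",
            "    def {}(self):".format(prop_name),
            "        " + body,  # the spec stores the body as one statement
            "",
        ])
    return "\n".join(lines)


meta = {
    "name": "Language",
    "fields": ["native_name", "primary_code", "subcode", "name"],
    "properties": {
        "code": "return '{}-{}'.format(self.primary_code, self.subcode) if self.subcode else self.primary_code",
        "id": "return self.code",
        "first_native_name": "return self.native_name.split(',')[0]",
    },
}

source = render_namedtuple_class(meta)
namespace = {"namedtuple": namedtuple}
exec(source, namespace)  # demonstration only: load the rendered class
Language = namespace["Language"]
lang = Language(
    native_name="Español (México)", primary_code="es", subcode="MX", name="Spanish (Mexico)"
)
print(lang.code)  # es-MX
```

Because each property body is stored as a single statement, the renderer only has to indent it under a def line; multi-line bodies would need extra handling.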

Generated Output Example

Python (le_utils/constants/languages.py):

# Generated by scripts/generate_from_specs.py
from collections import namedtuple

class Language(namedtuple("Language", ["native_name", "primary_code", "subcode", "name"])):
    @property
    def code(self):
        return "{}-{}".format(self.primary_code, self.subcode) if self.subcode else self.primary_code
    
    @property
    def id(self):
        return self.code
    
    @property
    def first_native_name(self):
        return self.native_name.split(",")[0]

RTL_LANG_CODES = ["ar", "arq", "dv", "he", "fa", "ps", "ur", "yi"]

LANGUAGELIST = [
    Language(native_name="Afaraf", primary_code="aa", subcode=None, name="Afar"),
    Language(native_name="English", primary_code="en", subcode=None, name="English"),
    Language(native_name="Español (México)", primary_code="es", subcode="MX", name="Spanish (Mexico)"),
    # ... (1,141 total)
]

_LANGUAGELOOKUP = {lang.code: lang for lang in LANGUAGELIST}
_LANGUAGELOOKUP_BY_NAME = {lang.name.lower(): lang for lang in LANGUAGELIST}
_LANGUAGELOOKUP_BY_NATIVE_NAME = {lang.native_name.lower(): lang for lang in LANGUAGELIST}
_LANGUAGELOOKUP_BY_ALPHA2 = {lang.primary_code: lang for lang in LANGUAGELIST if not lang.subcode}

def getlang(code, default=None):
    return _LANGUAGELOOKUP.get(code) or default

def getlang_by_name(name, default=None):
    return _LANGUAGELOOKUP_BY_NAME.get(name.lower()) or default

def getlang_by_native_name(native_name, default=None):
    return _LANGUAGELOOKUP_BY_NATIVE_NAME.get(native_name.lower()) or default

def getlang_by_alpha2(alpha2, default=None):
    return _LANGUAGELOOKUP_BY_ALPHA2.get(alpha2) or default
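
One thing to keep in mind with the lookup dicts above: a dict comprehension keeps only the last entry for a repeated key, so two languages that share a lowercased name would silently shadow each other. The sketch below shows one way the generation script or the tests might surface such collisions. The helper find_duplicate_keys and the sample entries are hypothetical; whether the real data contains any duplicates is not stated in this issue.

```python
from collections import Counter, namedtuple

Language = namedtuple("Language", ["native_name", "primary_code", "subcode", "name"])


def find_duplicate_keys(languages, key):
    """Return the sorted lookup keys that more than one language maps to."""
    counts = Counter(key(lang) for lang in languages)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical sample data, chosen only to trigger a collision.
sample = [
    Language("English", "en", None, "English"),
    Language("English", "en", "GB", "English"),
    Language("Afaraf", "aa", None, "Afar"),
]

duplicates = find_duplicate_keys(sample, lambda lang: lang.name.lower())
print(duplicates)  # ['english']
```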

JavaScript (js/Languages.js):

// Generated by scripts/generate_from_specs.py

export const RTL_LANG_CODES = ["ar", "arq", "dv", "he", "fa", "ps", "ur", "yi"];

export const LanguagesList = [
    { native_name: "Afaraf", primary_code: "aa", subcode: null, name: "Afar", code: "aa", first_native_name: "Afaraf" },
    { native_name: "English", primary_code: "en", subcode: null, name: "English", code: "en", first_native_name: "English" },
    { native_name: "Español (México)", primary_code: "es", subcode: "MX", name: "Spanish (Mexico)", code: "es-MX", first_native_name: "Español (México)" },
    // ...
];

export const LanguagesMap = new Map(
    LanguagesList.map(lang => [lang.code, lang])
);

export function getLanguage(code) {
    return LanguagesMap.get(code) || null;
}

export function getLanguageByName(name) {
    return LanguagesList.find(lang => lang.name.toLowerCase() === name.toLowerCase()) || null;
}

export function getLanguageByNativeName(nativeName) {
    return LanguagesList.find(lang => lang.native_name.toLowerCase() === nativeName.toLowerCase()) || null;
}

export function getLanguageByAlpha2(alpha2) {
    return LanguagesList.find(lang => lang.primary_code === alpha2 && !lang.subcode) || null;
}

Testing Updates

Files: tests/test_languages.py and tests/test_getlangs.py

Update the tests to load their expected data from the spec instead of languagelookup.json:

import json
import os

spec_path = os.path.join(os.path.dirname(__file__), "..", "spec", "constants-languages.json")
with open(spec_path) as f:
    spec = json.load(f)

languagelookup = spec["constants"]

# Verify all 1,141 languages
# Test helper functions
# Test Language properties (code, id, first_native_name)
# Test RTL_LANG_CODES list
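
A spec-level check along these lines could back the first and last of those comments. This is a sketch under assumptions: check_language_spec is a hypothetical helper, the inline spec stands in for the real file, and the specific rules (non-empty name and native_name, no repeated RTL codes) are suggestions rather than requirements stated in this issue.

```python
def check_language_spec(spec):
    """Return a list of human-readable problems found in a languages spec."""
    problems = []
    for code, entry in spec["constants"].items():
        for field in ("name", "native_name"):
            if not entry.get(field):
                problems.append("{}: missing {}".format(code, field))
    rtl_codes = spec["rtl_codes"]
    if len(set(rtl_codes)) != len(rtl_codes):
        problems.append("rtl_codes contains duplicates")
    return problems


# Small stand-in for spec/constants-languages.json.
good_spec = {
    "rtl_codes": ["ar", "he"],
    "constants": {
        "aa": {"name": "Afar", "native_name": "Afaraf"},
        "en": {"name": "English", "native_name": "English"},
    },
}
# Hypothetical broken entry, used only to show a reported problem.
bad_spec = {
    "rtl_codes": ["ar", "ar"],
    "constants": {"xx": {"name": "Example"}},
}

print(check_language_spec(good_spec))  # []
print(check_language_spec(bad_spec))
```

In a pytest test the helper's result could simply be asserted to equal an empty list for the real spec.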

How to Run Tests

pytest tests/test_languages.py -v
pytest tests/test_getlangs.py -v
pytest tests/ -v

Acceptance Criteria

  • spec/constants-languages.json created with all 1,141 language entries
  • Added properties metadata for code, id, first_native_name
  • Added rtl_codes metadata
  • scripts/generate_from_specs.py enhanced to generate namedtuple properties
  • make build successfully generates Python and JavaScript files
  • Generated le_utils/constants/languages.py has:
    • Language namedtuple with 4 fields and 3 properties
    • RTL_LANG_CODES list
    • LANGUAGELIST with all 1,141 languages
    • Helper functions (getlang, getlang_by_name, etc.)
    • Lookup dicts
  • Generated js/Languages.js has:
    • RTL_LANG_CODES export
    • LanguagesList with computed properties (code, first_native_name)
    • LanguagesMap for lookups
    • Helper functions (getLanguage, getLanguageByName, etc.)
  • Tests updated to test against spec
  • All tests pass
  • resources/languagelookup.json deleted

Disclosure

🤖 This issue was written by Claude Code, under supervision, review and final edits by @rtibbles 🤖
