sohelshekhIn/AWS_ShiftPick

Automated Job Application System – Comprehensive Technical Documentation

Table of Contents

  1. Introduction
    1.1. Document Purpose
    1.2. System Purpose & Goals
    1.3. Scope
    1.4. Definitions & Acronyms
  Recent Changes & Architectural Enhancements
    • Pre-created Applications System
    • GlobalJobIdMap System
    • Optimized User-Job Matching Logic
    • Enhanced Error Handling
    • Performance Monitoring & Logging
  2. System Architecture
    2.1. Architectural Overview
    2.2. Core Components & Deployment Strategy
    2.3. Technology Stack & Justifications
  3. Data Model (Google Cloud Firestore)
    3.1. Collection: UserJobProfiles
    3.2. Collection: ApplicationLogs
    3.3. Collection: SystemConfiguration
    3.4. Indexing Considerations
  4. Component Deep Dive: LoginManager.py
    4.1. Overview & Responsibilities
    4.2. Deployment & Execution Environment
    4.3. Detailed Workflow & Logic
    4.4. Selenium Configuration & CAPTCHA Handling
    4.5. OTP Retrieval Mechanism
    4.6. Session Data Management
    4.7. Pre-creation of Applications
    4.8. Logging & Error Handling
  5. Component Deep Dive: AppIdApplicationOrchestrator.py
    5.1. Overview & Responsibilities
    5.2. Deployment & Execution Environment
    5.3. Optimized In-Memory User Cache
    5.4. Real-time Firestore Listener
    5.5. Job Polling & Discovery
    5.6. Schedule Retrieval
    5.7. Enhanced User-Job-Schedule Matching Logic
    5.8. Application Submission Workflow (API & WebSocket)
    5.9. GlobalJobIdMap Management
    5.10. Concurrency Model
    5.11. Recently Processed Job-Schedule Cache
    5.12. Proactive Token Expiry Management
    5.13. Logging & Error Handling
  6. Workflow Diagrams
    6.1. System Interaction Diagram
    6.2. LoginManager Detailed Sequence
    6.3. ApplicationOrchestrator Job Processing Flow
  7. Configuration & Environment Variables
  8. Setup, Running & Maintenance
    8.1. Prerequisites
    8.2. Running LoginManager.py
    8.3. Running AppIdApplicationOrchestrator.py
    8.4. Maintenance Notes
  9. Security Considerations
  10. Future Enhancements & Roadmap

Recent Changes & Architectural Enhancements

The system has undergone significant architectural improvements to enhance speed, reliability, and scalability. The major enhancements are:

Pre-created Applications System

Purpose: Dramatically reduce application submission latency by pre-creating application IDs during user login rather than during the critical job discovery phase.

Key Changes:

  • LoginManager Enhancement: After successful login, the system now analyzes the user's preferences against a GlobalJobIdMap and pre-creates applications for all matching job types using the create-application API with scheduleId: null.
  • Firestore Storage: Pre-created application IDs are stored in the pre_created_applications field within each user's UserJobProfiles document.
  • Orchestrator Optimization: AppIdApplicationOrchestrator.py now checks for pre-existing application IDs before attempting to create new ones, skipping the create step and directly proceeding to the update step.
  • Performance Impact: Reduces job-to-application latency by 50-200ms per application by eliminating the create API call during the critical path.
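
The orchestrator-side check described above could look roughly like the following. This is a minimal sketch, not the actual implementation: `resolve_application_id` and the `create_application` callable are hypothetical names standing in for the real API wrappers, and the profile is assumed to be a plain dict shaped like a UserJobProfiles document.

```python
from typing import Callable


def resolve_application_id(
    profile: dict,
    job_id: str,
    create_application: Callable[[str], str],
) -> tuple[str, bool]:
    """Return (application_id, was_pre_created) for one user and job.

    The pre_created_applications map is consulted first; the create call
    (hypothetical callable) runs only when no stored ID exists.
    """
    entry = (profile.get("pre_created_applications") or {}).get(job_id)
    if entry and entry.get("application_id"):
        # Fast path: skip the create call and go straight to the update step.
        return entry["application_id"], True
    # Fallback: original create-then-update flow.
    return create_application(job_id), False
```

Returning the `was_pre_created` flag alongside the ID makes it easy to log how often the fast path is actually taken.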

GlobalJobIdMap System

Purpose: Centralized mapping system that enables pre-creation and allows the system to dynamically learn about new job types.

Architecture:

  • New Firestore Collection: SystemConfiguration, with a GlobalJobIdMap_CA document containing:
    • job_map: Maps structured keys (LOCATION_UPPERCASE-HOUR_TYPE_UPPERCASE) to job IDs
    • reverse_job_map: Maps job IDs to structured keys
    • last_updated_timestamp: Tracking updates
  • Dynamic Learning: When the orchestrator encounters unknown job IDs, it derives the structured key from job schedules and updates the global map
  • Consistency: Transactional updates ensure data consistency across multiple orchestrator instances
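
As an illustration of the structured key and the paired maps, the sketch below keeps both directions in sync on an in-memory dict. The function names are hypothetical, and the real system is described as performing this update inside a Firestore transaction, which is omitted here.

```python
def make_structured_key(location: str, hour_type: str) -> str:
    """Build a LOCATION_UPPERCASE-HOUR_TYPE_UPPERCASE key."""
    return f"{location.strip().upper()}-{hour_type.strip().upper()}"


def learn_job(global_map: dict, job_id: str, location: str, hour_type: str) -> bool:
    """Record an unknown job ID in both maps; return True if anything changed.

    global_map mirrors the document layout: a job_map and a reverse_job_map.
    """
    job_map = global_map.setdefault("job_map", {})
    reverse_map = global_map.setdefault("reverse_job_map", {})
    if job_id in reverse_map:
        return False  # already known, so no write is needed
    key = make_structured_key(location, hour_type)
    job_map[key] = job_id
    reverse_map[job_id] = key
    return True
```

Returning a boolean lets the caller skip the Firestore write entirely when the job ID is already mapped.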

Optimized User-Job Matching Logic

Purpose: Replace the O(n×m) scan of every user against every job with O(1) average-case index lookups per job, for a substantial reduction in matching time.

Implementation:

  • Indexed Caches:
    • location_to_users_index: Maps locations to sets of interested user IDs
    • hours_to_users_index: Maps hour preferences to sets of user IDs
  • Algorithm Enhancement: Instead of checking every user against every job, the system now:
    1. Gets candidate users by location lookup: location_to_users_index[job_location]
    2. Gets users by hour preference: hours_to_users_index[schedule_type]
    3. Computes intersection for final candidates
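  • Performance Impact: For each job, candidate selection drops from an O(n) scan over all users to two O(1) average-case dictionary lookups plus a set intersection whose cost is bounded by the smaller of the two candidate sets.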
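
The indexed lookup can be sketched with two dictionaries of sets. This is an illustrative version only: the helper names `build_indexes` and `candidates_for` are assumptions, and the special preference values ("ANY", "PART_TIME_OR_FLEX") are deliberately left out to keep the example small.

```python
from collections import defaultdict


def build_indexes(users: dict) -> tuple[dict, dict]:
    """Build location -> user IDs and hours -> user IDs indexes from cached profiles."""
    location_to_users_index = defaultdict(set)
    hours_to_users_index = defaultdict(set)
    for user_id, profile in users.items():
        for location in profile.get("location_preferences", []):
            location_to_users_index[location.upper()].add(user_id)
        hours = profile.get("job_hours_preferences", "")
        if hours:
            hours_to_users_index[hours.upper()].add(user_id)
    return location_to_users_index, hours_to_users_index


def candidates_for(job_location: str, schedule_type: str,
                   location_index: dict, hours_index: dict) -> set:
    """Intersect the two index lookups to get the final candidate users."""
    # .get() avoids creating empty entries in the defaultdicts on a miss.
    by_location = location_index.get(job_location.upper(), set())
    by_hours = hours_index.get(schedule_type.upper(), set())
    return by_location & by_hours
```

Keys are upper-cased on both the write and the read side, which gives the case-insensitive matching the preference fields call for.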

Enhanced Error Handling

Purpose: Provide more granular and context-aware error handling, especially for application conflicts.

Key Improvements:

  • APPLICATION_ALREADY_EXIST_CAN_BE_RESET Handling:
    • LoginManager Context: Sets user status to LOGIN_FAILED with descriptive error message
    • Orchestrator Context: Sets user status to APPLICATION_FAILED and logs the conflict
  • Auth Failure Differentiation: Clear distinction between token expiry vs. other API failures
  • Contextual Logging: Error messages now include source context (LoginManager vs. Orchestrator)
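
One way to express the context-dependent handling is a small lookup keyed by the component that hit the conflict. The sketch below is an assumption about structure, not the project's code: `status_update_for_error` and the exact message format are hypothetical, while the status values and error code come from the list above.

```python
CONFLICT_CODE = "APPLICATION_ALREADY_EXIST_CAN_BE_RESET"

# Status each component assigns when it encounters the conflict.
CONFLICT_STATUS_BY_SOURCE = {
    "LoginManager": "LOGIN_FAILED",
    "Orchestrator": "APPLICATION_FAILED",
}


def status_update_for_error(error_code: str, source: str):
    """Return the profile field updates for the conflict error, else None."""
    if error_code != CONFLICT_CODE:
        return None  # other errors are handled elsewhere
    return {
        "status": CONFLICT_STATUS_BY_SOURCE[source],
        # The source prefix gives log readers the originating component.
        "last_error_message": f"[{source}] {error_code}",
    }
```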

Performance Monitoring & Logging

Purpose: Comprehensive performance tracking for optimization and troubleshooting.

New Monitoring:

  • Timing Logs: Millisecond-precision timing for all major operations:
    • get_schedule: Schedule fetching duration
    • execute_api_calls: Create and update API call timing
    • connect_and_update_via_websocket: WebSocket workflow duration
    • apply_for_job_and_update_status: End-to-end application timing
    • load_or_refresh_global_job_id_map: GlobalJobIdMap operations
    • process_single_job_concurrently: User matching performance
  • PERF Log Level: Dedicated performance log category for easy filtering
  • Cache Debug Logs: Detailed logging of index updates and cache state changes
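
A dedicated PERF level and millisecond timing can be built from the standard library alone. The following is a sketch under assumptions: the numeric level 25, the logger name, and the `timed` helper are all hypothetical choices, not taken from the project.

```python
import logging
import time
from contextlib import contextmanager
from typing import Optional

PERF = 25  # assumed value, placed between INFO (20) and WARNING (30)
logging.addLevelName(PERF, "PERF")
logger = logging.getLogger("orchestrator.perf")  # hypothetical logger name


@contextmanager
def timed(operation: str, sink: Optional[dict] = None):
    """Log how long the wrapped block took, in milliseconds, at PERF level."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        logger.log(PERF, "%s took %.3f ms", operation, elapsed_ms)
        if sink is not None:
            sink[operation] = elapsed_ms  # optional in-memory copy
```

Wrapping a call as `with timed("get_schedule"):` would emit one PERF record per invocation, and because the timing runs in a `finally` block, failed operations are measured too.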

Migration Path:

  • Backward Compatibility: New AppIdApplicationOrchestrator.py maintains full compatibility with existing data structures
  • Gradual Rollout: Pre-creation system gracefully handles cases where applications don't exist, falling back to the original create-then-update flow
  • Zero Downtime: Changes can be deployed without service interruption

1. Introduction

1.1. Document Purpose

This document provides a comprehensive technical description of the Automated Job Application System. It details the system's architecture, components, data models, workflows, and operational aspects. It is intended for developers, maintainers, and anyone seeking to understand the inner workings of the system.

1.2. System Purpose & Goals

The primary purpose of this system is to automate the process of discovering and applying for job postings on a target website (e.g., Amazon Hiring). It aims to achieve this with high speed and reliability on behalf of multiple configured users, maximizing their chances of securing positions, especially those filled on a first-come, first-served basis.

Core Goals:

  • Speed & Efficiency: Minimize latency from job discovery to application submission through asynchronous operations, optimized matching, and pre-created applications.
  • Reliability & Robustness: Ensure continuous operation with comprehensive error handling, session management, and fault tolerance.
  • Scalability: Design components to handle an increasing number of users and job application volumes through indexed caching and optimized algorithms.
  • Accuracy: Precisely match jobs to user-defined preferences (location, hours).
  • Security: Protect user credentials and sensitive session data at all stages.
  • Maintainability: Provide clear logging and a modular design for ease of maintenance and future development.

1.3. Scope

In Scope:

  • Secure storage and management of user credentials and job preferences.
  • Automated login to the target job website with pre-creation of applications for matching job types.
  • Real-time (high-frequency polling) job monitoring from a specified GraphQL API.
  • Intelligent matching of jobs against user profiles using optimized indexed algorithms (location, job hours).
  • High-speed, concurrent application submissions via the target website's internal APIs (REST and WebSocket).
  • Dynamic learning and management of job type mappings through the GlobalJobIdMap system.
  • Detailed logging of all operations, attempts, successes, and failures to Google Firestore and console.
  • Management of user session validity, including proactive and reactive re-login triggers.
  • Performance monitoring and optimization through comprehensive timing logs.

Out of Scope (for current version):

  • A web-based administration UI (planned for the future).
  • Direct interaction with email services for OTPs (currently uses a third-party API for OTPs associated with temporary emails).
  • Advanced AI/ML for job description analysis or preference learning.
  • Support for multiple target job websites or job source APIs without code modification.

1.4. Definitions & Acronyms

| Term | Definition |
| --- | --- |
| User | An individual whose job preferences and credentials are managed by the system. |
| user_id | A unique identifier for each user, consistent across Firestore collections (typically the document ID). |
| Target Job Website | The specific website (e.g., hiring.amazon.ca) where users have accounts and jobs are applied for. |
| Job Source API | The external GraphQL API used to discover new job postings. |
| candidate_id | A user-specific identifier on the Target Job Website. |
| Session Data | website_access_token and website_cookies required to interact with the Target Job Website's APIs. |
| Pre-created Application | An application ID created during login for a specific job type, stored for later use during job discovery. |
| GlobalJobIdMap | Centralized mapping system that correlates structured keys (location-hour_type) with job IDs. |
| Structured Key | A standardized format (LOCATION_UPPERCASE-HOUR_TYPE_UPPERCASE) used for job type identification. |
| Firestore | Google Cloud Firestore: a NoSQL cloud database used for data storage and real-time synchronization. |
| Selenium | Browser automation framework used for logging into the Target Job Website. |
| ChromeDriver | A WebDriver executable that Selenium uses to control Google Chrome. |
| OTP | One-Time Password, used for login verification. |
| EC2 | Amazon Elastic Compute Cloud, where the ApplicationOrchestrator runs. |
| API | Application Programming Interface. |
| REST API | Representational State Transfer API. |
| WebSocket API | A communication protocol providing full-duplex communication channels over a single TCP connection. |
| GraphQL | A query language for APIs and a server-side runtime for executing those queries. |
| aiohttp | An asynchronous HTTP client/server framework for Python. |
| websockets (library) | A Python library for building WebSocket clients and servers. |
| Epoch Milliseconds (ms) | Time represented as the number of milliseconds that have elapsed since January 1, 1970, at 00:00:00 UTC. |

2. System Architecture

2.1. Architectural Overview

The system employs a decoupled, service-oriented architecture with significant performance and reliability enhancements. Two primary Python services, LoginManager.py and AppIdApplicationOrchestrator.py, operate independently but coordinate their activities through Google Cloud Firestore and the new GlobalJobIdMap system. This design allows for specialized deployment environments: the LoginManager (requiring a GUI for Selenium) runs locally, while the AppIdApplicationOrchestrator (headless, I/O-bound) runs on a cloud server (AWS EC2).

Firestore acts as the central nervous system, storing all persistent data (user profiles, preferences, session tokens, application logs, job mappings) and facilitating real-time updates between components.

2.2. Core Components & Deployment Strategy

  1. LoginManager.py (Local Machine) - Enhanced:

    • Purpose: Manages user login sessions and pre-creates applications for matching job types.
    • New Functionality: After successful login, analyzes user preferences against the GlobalJobIdMap and pre-creates applications using the create-application API with scheduleId: null.
    • Deployment: Runs on a user's local machine (Windows, macOS, or Linux with a desktop environment).
    • Interaction: Reads user credentials and GlobalJobIdMap from Firestore, performs login via Selenium, pre-creates applications, and writes updated session data and pre-created application IDs back to Firestore.
  2. AppIdApplicationOrchestrator.py (AWS EC2) - Optimized:

    • Purpose: Discovers new jobs, matches them against active user profiles using optimized algorithms, and submits applications at high speed using pre-created application IDs when available.
    • New Features:
      • Utilizes indexed user caches for O(1) user-job matching
      • Manages and updates the GlobalJobIdMap dynamically
      • Leverages pre-created application IDs to skip create API calls
    • Deployment: Runs on a headless Linux server (e.g., AWS EC2 instance).
    • Interaction:
      • Maintains optimized in-memory caches with preference indexes, synchronized with Firestore via a real-time listener.
      • Polls an external Job Source API (GraphQL).
      • Manages the GlobalJobIdMap for job type discovery and learning.
      • Interacts with the Target Job Website's internal APIs (REST and WebSocket) for application submission.
      • Logs all application attempts and outcomes to Firestore with detailed performance metrics.
  3. Google Cloud Firestore (Cloud Database) - Extended:

    • Purpose: Centralized data persistence, state synchronization, and job type mapping.
    • Key Roles:
      • Stores UserJobProfiles (credentials, preferences, session details, status, pre-created applications).
      • Stores ApplicationLogs (detailed records of each application attempt with performance metrics).
      • Stores SystemConfiguration (GlobalJobIdMap for job type management).
      • Enables real-time updates to the ApplicationOrchestrator's optimized cache when user data changes.
  4. External APIs:

    • Job Source API (GraphQL): Polled by ApplicationOrchestrator for new job listings.
    • Target Job Website APIs (REST & WebSocket): Used by both components for application creation/submission.
    • OTP API (temporamail.co): Used by LoginManager to retrieve OTPs for specific email accounts.

2.3. Technology Stack & Justifications

| Category | Technology | Version (Typical) | Justification |
| --- | --- | --- | --- |
| Programming Language | Python | 3.9+ | Rich ecosystem, excellent libraries for web automation (Selenium), async operations (asyncio), and cloud integration. |
| Browser Automation | Selenium WebDriver | Latest | Industry standard for browser automation, necessary for complex login flows on the Target Job Website. |
| | ChromeDriver | Matching Chrome | WebDriver for Google Chrome. |
| Asynchronous HTTP | aiohttp | Latest | High-performance asynchronous HTTP client for the ApplicationOrchestrator, enabling concurrent API calls. |
| | websockets (library) | Latest | Efficient Python library for WebSocket client interactions, used by ApplicationOrchestrator. |
| Database | Google Cloud Firestore | N/A (Cloud Service) | Scalable, real-time NoSQL database. Excellent for distributed systems and synchronizing state. |
| Cloud SDK | google-cloud-firestore, firebase-admin | Latest | Official Python libraries for interacting with Firestore. |
| Environment Mgmt | python-dotenv | Latest | Manages environment variables from a .env file for configuration. |
| Deployment (Local) | Standard Python Environment | N/A | LoginManager runs directly. |
| Deployment (Cloud) | AWS EC2 (e.g., t3.micro/small with Amazon Linux 2) | N/A | Cost-effective and reliable platform for running the headless ApplicationOrchestrator. |
| Process Management | systemd (Linux) or supervisor (Alternative) | System Default | Ensures the ApplicationOrchestrator service runs continuously and restarts on failure on the EC2 instance. |
| Configuration | .env file, Python constants | N/A | Easy to manage and deploy different configurations. |

3. Data Model (Google Cloud Firestore)

Firestore is used to store all persistent data for the system. The primary collections are:

3.1. Collection: UserJobProfiles

Stores all information related to a user's profile, their preferences for job applications, their credentials for the target job website, their current session status, and their pre-created applications.

  • Document ID: user_id (String) - A unique identifier for the user (e.g., "user_alpha", "john_doe").
  • Fields:
    • email: (String) The email address used for logging into the Target Job Website. Example: "user@example.com"
    • pin: (String) The PIN or password for the Target Job Website. Note: The LoginManager.py expects this in plaintext. For production, consider encrypting this field and decrypting it only within the LoginManager.
    • token: (String) A secondary identifier, often the same as user_id or a related token used for OTP retrieval (e.g., the part of the email before "@" if using a service like temporamail.co). Example: "user_alpha"
    • status: (String) Current operational status of the user profile. Enum:
      • ACTIVE: Profile is active, session is expected to be valid. Processed by ApplicationOrchestrator.
      • INACTIVE: Profile is manually disabled. Ignored by all services.
      • LOGIN_REQUIRED: Session is invalid, expired, or initial login needed. Processed by LoginManager.
      • LOGIN_IN_PROGRESS: LoginManager is currently attempting to log this user in.
      • LOGIN_FAILED: The last login attempt by LoginManager failed.
      • APPLICATION_SUCCESS: The ApplicationOrchestrator successfully submitted an application and completed the WebSocket workflow.
      • APPLICATION_FAILED: The ApplicationOrchestrator encountered an application conflict (e.g., APPLICATION_ALREADY_EXIST_CAN_BE_RESET).
      • SHIFT_PICKED: The ApplicationOrchestrator submitted an application, but the WebSocket workflow indicated a "shift picked" state (may imply success or a specific intermediate state).
    • website_access_token: (String, Nullable) The authentication token (e.g., Bearer token) obtained after successful login to the Target Job Website. Used for API calls by ApplicationOrchestrator.
    • website_cookies: (String (JSON) or Array of Objects, Nullable) Cookies obtained from the Target Job Website session. Stored as a JSON string by LoginManager if they are a list of dicts. ApplicationOrchestrator parses this. Example (as JSON string): '[{"name": "session_id", "value": "xyz123"}, ...]'
    • candidate_id: (String, Nullable) The unique identifier for the user on the Target Job Website (e.g., Amazon's bbCandidateId).
    • user_agent: (String) The User-Agent string to be used for API calls and WebSocket connections, mimicking the browser used for login. Example: "Mozilla/5.0..."
    • login_expiry_timestamp: (Integer, Nullable) Epoch milliseconds timestamp indicating when the current website_access_token and website_cookies are expected to expire.
    • last_login_attempt_timestamp: (Timestamp, Nullable) Firestore server timestamp of the last login attempt by LoginManager.
    • last_login_success_timestamp: (Timestamp, Nullable) Firestore server timestamp of the last successful login by LoginManager.
    • last_error_message: (String, Nullable) Details of the last error encountered by LoginManager or ApplicationOrchestrator for this user.
    • location_preferences: (Array of Strings) A list of preferred location codes/cities. Case-insensitive matching is performed. "ANY" is a special value. Example: ["Toronto", "Vancouver", "ANY"]
    • job_hours_preferences: (String) User's preference for job hours. Case-insensitive matching. Expected values for direct matching: "FULL_TIME", "PART_TIME", "FLEX_TIME". Special values: "ANY", "PART_TIME_OR_FLEX". Example: "ANY"
    • pre_created_applications: (Map, New) Stores pre-created application IDs organized by job ID for faster application submission.
      • Key: job_id (String) - The job ID from the GlobalJobIdMap
      • Value: (Map) containing:
        • application_id: (String) The pre-created application ID
        • retrieved_at_timestamp: (Timestamp) When this application ID was created/verified
        • source_job_key: (String) The structured key from GlobalJobIdMap that matched this job
    • application_details: (Map, Nullable) Stores details of the last successful application submission.
      • application_id: (String) The ID of the application created on the Target Job Website.
      • job: (Map) Information about the job applied for (e.g., jobId, jobTitle).
      • schedule: (Map) Information about the schedule chosen (e.g., scheduleId, hoursPerWeek).
      • updated_at: (Timestamp) Firestore server timestamp of when these details were last updated.
      • updated_at_iso: (String) ISO 8601 formatted string of updated_at.
    • created_at: (Timestamp) Firestore server timestamp when the user profile was created.
    • updated_at: (Timestamp) Firestore server timestamp of the last update to this document.
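
Because website_cookies may arrive either as a JSON string or as a list of objects, a consumer has to normalize it before use. The sketch below shows one way to do that and produce a Cookie header value; `cookies_to_header` is a hypothetical helper, not a function from the project.

```python
import json
from typing import Union


def cookies_to_header(website_cookies: Union[str, list, None]) -> str:
    """Turn stored cookies (JSON string or list of dicts) into a Cookie header value."""
    if not website_cookies:
        return ""  # null or empty field: no cookies to send
    if isinstance(website_cookies, str):
        cookies = json.loads(website_cookies)
    else:
        cookies = website_cookies
    # Only name/value pairs belong in the header; other attributes are ignored.
    return "; ".join(
        f"{c['name']}={c['value']}" for c in cookies if "name" in c and "value" in c
    )
```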

3.2. Collection: ApplicationLogs

Records every job application attempt made by the ApplicationOrchestrator, providing a detailed audit trail with performance metrics.

  • Document ID: Auto-generated by Firestore.
  • Fields:
    • user_id: (String) Foreign key referencing the user_id in UserJobProfiles.
    • job_id_external: (String) The external identifier of the job from the Job Source API.
    • schedule_id_external: (String, Nullable) The external identifier of the schedule chosen for the job.
    • application_id: (String, Nullable) The application ID returned by the Target Job Website's API after successful creation.
    • applied_at_timestamp: (Timestamp) Firestore server timestamp when the application attempt was made.
    • status: (String) Outcome of the application attempt. Enum:
      • SUCCESS_API: API calls (create-application, update-application) were successful.
      • SUCCESS_WS: WebSocket workflow completed successfully after successful API calls.
      • FAILED_AUTH: Authentication failed during API calls (token likely expired).
      • FAILED_API: API calls failed for reasons other than auth (e.g., server error, bad request, schedule unavailable).
      • FAILED_WS: WebSocket connection or workflow failed after successful API calls.
      • SKIPPED_RECENTLY_PROCESSED: Skipped because this job-schedule was processed recently for another user.
      • (Other specific failure reasons could be added)
    • api_response_status_code: (Integer, Nullable) HTTP status code from the primary application API call (e.g., create-application or update-application).
    • api_response_body: (String, Nullable) A snippet or full body of the API response, especially for errors.
    • error_details: (String, Nullable) A more specific error message or traceback snippet.
    • matched_preferences: (Map, Nullable) Details of how the user's preferences matched the job/schedule.
      • location_matched_on: (String) The specific location that matched.
      • user_location_preferences: (Array of Strings) User's original location preferences.
      • user_hours_preference: (String) User's original hours preference.
      • matched_schedule_type: (String) The type of the schedule that matched (e.g., "FLEX_TIME").
      • matched_schedule_hours: (Number) The hours per week of the matched schedule.
    • attempt_duration_ms: (Float, Nullable) Time taken for the entire application attempt for this user and job, in milliseconds.
    • country_code: (String) The country code (e.g., "CA", "US") for which the application was made.

3.3. Collection: SystemConfiguration

New collection that stores system-wide configuration and mapping data.

Document: GlobalJobIdMap_CA

  • Purpose: Centralized mapping between structured job keys and job IDs for Canada region.
  • Fields:
    • job_map: (Map) Forward mapping from structured keys to job IDs
      • Key: Structured key format (LOCATION_UPPERCASE-HOUR_TYPE_UPPERCASE)
      • Value: Job ID (e.g., "JOB-CA-0000000354")
      • Example: "TORONTO-FULL_TIME": "JOB-CA-0000000267"
    • reverse_job_map: (Map) Reverse mapping from job IDs to structured keys
      • Key: Job ID
      • Value: Structured key
      • Example: "JOB-CA-0000000267": "TORONTO-FULL_TIME"
    • last_updated_timestamp: (Timestamp) Firestore server timestamp of the last update
    • source_script: (String, Optional) Identifies which script last updated the map
    • last_upload_time_utc: (String, Optional) ISO timestamp for manual tracking
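
Given this job_map layout, selecting the job IDs that match a user's preferences could be sketched as follows. Two points are assumptions rather than documented behavior: that "PART_TIME_OR_FLEX" expands to PART_TIME and FLEX_TIME, and that hour types never contain a hyphen (so the key can be split on its last hyphen). `matching_job_ids` is a hypothetical name.

```python
HOUR_ALIASES = {
    # Assumed expansion of the special value; not confirmed by the source.
    "PART_TIME_OR_FLEX": {"PART_TIME", "FLEX_TIME"},
}


def matching_job_ids(job_map: dict, location_preferences: list,
                     job_hours_preferences: str) -> dict:
    """Return {job_id: structured_key} for every job_map entry the user matches."""
    locations = {loc.upper() for loc in location_preferences}
    hours = job_hours_preferences.upper()
    wanted_hours = HOUR_ALIASES.get(hours, {hours})
    matches = {}
    for key, job_id in job_map.items():
        # Split on the last hyphen so multi-part locations stay intact.
        location, _, hour_type = key.rpartition("-")
        location_ok = "ANY" in locations or location in locations
        hours_ok = hours == "ANY" or hour_type in wanted_hours
        if location_ok and hours_ok:
            matches[job_id] = key
    return matches
```

The returned structured key corresponds to the source_job_key stored with each pre-created application.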

3.4. Indexing Considerations

  • UserJobProfiles Collection:
    • Single-field index on status, which Firestore creates automatically. It serves the status == "ACTIVE" query used by ApplicationOrchestrator for the initial cache load and by the listener.
    • The same single-field index serves the status == "LOGIN_REQUIRED" query used by LoginManager.
    • Composite index on (status, login_expiry_timestamp). It serves the query status == "ACTIVE" AND login_expiry_timestamp <= NOW + GRACE_PERIOD, used by LoginManager and potentially the proactive token checker to find users needing re-login due to impending expiry. The query combines an equality filter with a range filter on a single field, so one composite index covers it.
  • ApplicationLogs Collection:
    • Indexes on user_id and applied_at_timestamp would be beneficial for querying logs for specific users or time ranges.
    • Index on status for analyzing success/failure rates.
  • SystemConfiguration Collection:
    • Documents are read directly by ID, so no additional indexes are needed for the current access patterns.

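Firestore automatically creates single-field indexes. Composite indexes must be created explicitly: in the Firebase console, in a firestore.indexes.json file deployed with the Firebase CLI, or by following the index-creation link that Firestore includes in the error message of a query that needs one.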


4. Component Deep Dive: LoginManager.py

4.1. Overview & Responsibilities

The LoginManager.py script is a crucial component responsible for maintaining active login sessions for users on the Target Job Website and pre-creating applications for efficient job application submission. It automates the often complex login process, which may involve multi-factor authentication (OTP) and CAPTCHA challenges, and then strategically pre-creates applications for job types that match the user's preferences.

4.2. Deployment & Execution Environment

  • Deployment: Runs on a user's local machine (Windows, macOS, or Linux with a desktop environment).
  • Reasoning: This is due to its reliance on Selenium WebDriver, which automates a real web browser (Google Chrome). CAPTCHA solving extensions and the visual feedback for debugging OTP/login issues are more effectively managed in an environment with a GUI. Running this on a headless server would be significantly more complex and brittle.
  • Execution: The script implements its own continuous loop (while True with time.sleep()) that checks for users every 4 minutes by default, so it only needs to be started once. An external scheduler such as a cron job or Task Scheduler is an alternative way to trigger it periodically.

4.3. Detailed Workflow & Logic

The script operates in a main loop:

  1. Initialization:

    • Loads environment variables (e.g., FIREBASE_SERVICE_ACCOUNT_KEY_PATH).
    • Initializes the Firebase Admin SDK and Firestore client.
    • Fetches the GlobalJobIdMap from Firestore for use in pre-creation logic.
    • Enters a while True loop for continuous operation.
  2. User Identification (get_users_needing_login function):

    • Queries the UserJobProfiles collection in Firestore for users matching two criteria:
      1. Users with status == "LOGIN_REQUIRED".
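      2. Users with status == "ACTIVE" AND login_expiry_timestamp <= (current_time_utc + PROACTIVE_RELOGIN_WINDOW_SECONDS). This proactively identifies users whose sessions are about to expire. Because login_expiry_timestamp is stored as epoch milliseconds, the right-hand side must be expressed in epoch milliseconds as well.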
    • Converts the stream of Firestore documents to a list.
    • Returns a list of user data dictionaries.
  3. Per-User Login Processing (within main function loop):

    • If no users need processing, sleeps and continues the main loop.
    • For each identified user:
      • Updates the user's status in Firestore to LOGIN_IN_PROGRESS.
      • WebDriver Setup (setup_chrome_driver):
        • Configures Chrome options:
          • Disables automation-controlled flags (--disable-blink-features=AutomationControlled).
          • Disables infobars.
          • Excludes "enable-automation" switch.
          • Sets useAutomationExtension to False.
          • Configures preferences to disable password manager and enable JavaScript.
          • Loads the CAPTCHA solver extension (NopeCHA) from the local extensions/gui directory.
          • Optionally runs in headless mode if RUN_HEADLESS environment variable is "true" (though typically run with GUI for reliability).
        • Instantiates webdriver.Chrome with these options.
      • Login Sequence (perform_login_sequence):
        • Retrieves user's email and pin from the provided user_data.
        • Navigates to the CAPTCHA solver extension's configuration page (CONFIG["CAPTCHA_SOLVER_CONFIG_URL"]) to ensure it's active.
        • Navigates to the Target Job Website's login URL.
        • Checks if already logged in by verifying the current URL.
        • Handles cookie consent pop-ups (attempts to press Escape).
        • Enters email, clicks continue.
        • Enters PIN, clicks continue.
        • Handles radio button selection for OTP delivery method (if present).
        • Clicks "Send Verification Code" and records the current UTC milliseconds (send_code_utc_ms) for OTP validation.
        • CAPTCHA Handling (wait_for_captcha_solve):
          • Checks for the presence of the awswaf-captcha shadow DOM element.
          • If found, waits for the CAPTCHA solver extension to remove the CAPTCHA modal (up to a timeout).
        • OTP Retrieval (get_latest_otp_with_retry):
          • Calls the OTP API (temporamail.co/api/otp?tokenName={user_token}) with retries.
          • Validates that the OTP timestamp (part of the API response) is recent enough relative to send_code_utc_ms.
        • Enters the retrieved OTP into the OTP input field.
        • Clicks continue after OTP (if the button exists).
        • Success Verification: Waits for the URL to contain the job search page partial URL (CONFIG["JOB_WEBSITE_JOBSEARCH_URL_PARTIAL"]).
        • Data Extraction (on success):
          • Executes JavaScript to get localStorage.getItem('accessToken') and localStorage.getItem('bbCandidateId').
          • Gets all browser cookies using driver.get_cookies().
      • New: Pre-creation Logic (process_pre_creation_for_user):
        • Fetches fresh user data from Firestore after successful login
        • Analyzes user's location and hour preferences against the GlobalJobIdMap
        • For each matching job type:
          • Calls _make_create_application_call with scheduleId: null
          • Stores successful application IDs in pre_created_applications
        • Updates Firestore with the pre-created application map
        • Handles specific errors like APPLICATION_ALREADY_EXIST_CAN_BE_RESET
      • Error Handling: Enhanced to include pre-creation failures and specific error codes
      • Driver Teardown: driver.quit() is called in a finally block to ensure the browser closes.
    • A short random delay is introduced between processing users.
  4. Loop Continuation:

    • After processing all users in the current batch, sleeps for 4 minutes (or CONFIG["LOGIN_CHECK_INTERVAL_SECONDS"]).
    • Handles KeyboardInterrupt for graceful shutdown.
    • Catches any other exceptions in the main loop, logs them, and waits before retrying.
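
The selection rule applied in the User Identification step can be written as a plain predicate over a profile dict. This is a sketch that mirrors the two Firestore queries in memory; `needs_login` is a hypothetical name, and the actual script is described as querying Firestore instead.

```python
def needs_login(profile: dict, now_ms: int, window_seconds: int) -> bool:
    """Return True if the profile should be picked up for (re-)login.

    now_ms is the current time in epoch milliseconds; window_seconds plays
    the role of PROACTIVE_RELOGIN_WINDOW_SECONDS.
    """
    status = profile.get("status")
    if status == "LOGIN_REQUIRED":
        return True
    expiry_ms = profile.get("login_expiry_timestamp")
    if status == "ACTIVE" and expiry_ms is not None:
        # Convert the window to milliseconds to match the stored expiry.
        return expiry_ms <= now_ms + window_seconds * 1000
    return False
```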

4.4. Selenium Configuration & CAPTCHA Handling

  • Chrome Options: The script uses several Chrome options to make Selenium less detectable and to configure the browser environment (see setup_chrome_driver).
  • CAPTCHA Solver Extension: It relies on a pre-loaded Chrome extension (specified by CONFIG["CAPTCHA_SOLVER_EXTENSION_PATH"], e.g., NopeCHA) to automatically solve CAPTCHAs encountered during login. The script navigates to the extension's configuration URL (CONFIG["CAPTCHA_SOLVER_CONFIG_URL"]) to ensure its settings are applied.
  • The wait_for_captcha_solve function specifically looks for an awswaf-captcha element and waits for its shadow DOM content (the CAPTCHA modal) to disappear, indicating the extension has solved it.

4.5. OTP Retrieval Mechanism

  • The get_latest_otp_with_retry function is responsible for fetching OTPs.
  • It makes HTTP GET requests to an external API endpoint: https://temporamail.co/api/otp?tokenName={token}.
    • {token} is derived from the user's profile (field token in UserJobProfiles).
  • The API response is expected to be JSON, containing the OTP and a timestamp (e.g., {"otp": "123456_1678886400000", "exists": true}).
  • The script parses the OTP and its timestamp.
  • Timestamp Validation: Crucially, it checks if the OTP's timestamp is greater than or equal to (send_code_utc_ms - tolerance_ms). This ensures that an old, potentially re-used OTP is not accepted if the "Send Code" button was clicked multiple times.
  • Retries: The function retries fetching the OTP max_retries times with a wait_time interval if an OTP is not found or is too old.
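As an illustration of the parsing and freshness check described above, here is a minimal sketch. It assumes the otp field always has the form code_timestampMs shown in the example response; the function names, the fetch callable (standing in for the HTTP GET), and the default values are hypothetical, not taken from LoginManager.py.

```python
import time
from typing import Callable, Optional, Tuple


def parse_otp_payload(payload: dict) -> Optional[Tuple[str, int]]:
    """Split an 'otp' value like '123456_1678886400000' into (code, timestamp_ms)."""
    if not payload.get("exists") or "otp" not in payload:
        return None
    code, sep, ts = str(payload["otp"]).partition("_")
    if not sep or not ts.isdigit():
        return None
    return code, int(ts)


def is_fresh(otp_ts_ms: int, send_code_utc_ms: int, tolerance_ms: int) -> bool:
    """Accept only OTPs issued no earlier than (send time - tolerance)."""
    return otp_ts_ms >= send_code_utc_ms - tolerance_ms


def get_otp_with_retry_sketch(
    fetch: Callable[[], dict],   # stand-in for the GET to the OTP API
    send_code_utc_ms: int,
    max_retries: int = 5,        # assumed default
    wait_time: float = 2.0,      # assumed default, in seconds
    tolerance_ms: int = 5000,    # assumed default
) -> Optional[str]:
    """Poll until a fresh OTP is available or the retries are exhausted."""
    for attempt in range(max_retries):
        parsed = parse_otp_payload(fetch())
        if parsed and is_fresh(parsed[1], send_code_utc_ms, tolerance_ms):
            return parsed[0]
        if attempt < max_retries - 1:
            time.sleep(wait_time)
    return None
```

Keeping the fetch step separate from parsing and validation makes the freshness rule easy to test without network access.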

4.6. Session Data Management

  • Extracted Data:
    • accessToken (from localStorage) -> website_access_token
    • bbCandidateId (from localStorage) -> candidate_id
    • Browser cookies (driver.get_cookies()) -> website_cookies (stored as a JSON string of the list of cookie objects)
  • Expiry Calculation: login_expiry_timestamp is calculated as datetime.now(timezone.utc) + timedelta(seconds=CONFIG["SESSION_DURATION_SECONDS"]), then converted to epoch milliseconds. The default duration is 1 hour and 42 minutes (1.70 hours).
  • Storage: All session data is written to the corresponding user's document in the UserJobProfiles collection in Firestore.
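The expiry calculation above amounts to the following small sketch. The function name compute_login_expiry_ms and the injectable now parameter are illustrative assumptions; 6120 seconds is simply 1 hour 42 minutes expressed in seconds.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SESSION_DURATION_SECONDS = 6120  # 1 h 42 min (1.70 h), the documented default


def compute_login_expiry_ms(
    now: Optional[datetime] = None,
    duration_seconds: int = SESSION_DURATION_SECONDS,
) -> int:
    """Return the session expiry as epoch milliseconds (UTC)."""
    now = now or datetime.now(timezone.utc)
    expiry = now + timedelta(seconds=duration_seconds)
    return int(expiry.timestamp() * 1000)
```

Using a timezone-aware datetime avoids the local-time ambiguity that a naive datetime.now() would introduce when converting to an epoch value.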

4.7. Pre-creation of Applications

New Feature: After successful login, the system now pre-creates applications for job types that match the user's preferences.

Process:

  1. Preference Analysis: Compares user's location_preferences and job_hours_preferences against the GlobalJobIdMap
  2. Matching Logic:
    • Location: Matches if user has "ANY" in preferences OR specific location matches
    • Hours: Handles "ANY", "PART_TIME_OR_FLEX", and specific hour type matching
  3. API Calls: For each match, calls create-application API with:
    • jobId: From the GlobalJobIdMap
    • scheduleId: null (allows scheduling flexibility)
    • activeApplicationCheckEnabled: true
  4. Storage: Successful application IDs stored in user's pre_created_applications field
  5. Error Handling:
    • APPLICATION_ALREADY_EXIST_CAN_BE_RESET: Sets user to LOGIN_FAILED state
    • Auth failures: Sets user to LOGIN_REQUIRED for retry
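A rough sketch of steps 1 to 3 follows. It assumes the GlobalJobIdMap is a flat mapping from CITY-HOUR_TYPE keys to job IDs and that "PART_TIME_OR_FLEX" means either PART_TIME or FLEX_TIME; both are assumptions for illustration, and the helper names are hypothetical rather than the ones used in process_pre_creation_for_user.

```python
from typing import Dict, Iterable, List

# Assumed expansion of the combined preference into concrete hour types.
HOUR_PREF_EXPANSION = {"PART_TIME_OR_FLEX": {"PART_TIME", "FLEX_TIME"}}


def hours_match(pref: str, hour_type: str) -> bool:
    pref = pref.upper()
    if pref == "ANY":
        return True
    return hour_type in HOUR_PREF_EXPANSION.get(pref, {pref})


def location_matches(prefs: Iterable[str], city: str) -> bool:
    normalized = {p.upper() for p in prefs}
    return "ANY" in normalized or city in normalized


def build_pre_creation_payloads(
    location_prefs: Iterable[str],
    hours_pref: str,
    global_job_id_map: Dict[str, str],
) -> List[dict]:
    """Return one create-application payload per matching job type."""
    location_prefs = list(location_prefs)
    payloads = []
    for key, job_id in global_job_id_map.items():
        # rpartition keeps hyphenated city names intact.
        city, _, hour_type = key.rpartition("-")
        if location_matches(location_prefs, city) and hours_match(hours_pref, hour_type):
            payloads.append({
                "jobId": job_id,
                "scheduleId": None,  # serialized as JSON null
                "activeApplicationCheckEnabled": True,
            })
    return payloads
```

Each payload would then be sent to the create-application API, with successful application IDs collected into pre_created_applications.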

Benefits:

  • Reduces job application latency by 50-200ms
  • Enables immediate application updates when jobs are discovered
  • Provides better error handling for application conflicts

4.8. Logging & Error Handling

  • Logging (log_step function):
    • Provides timestamped, color-coded console output.
    • Log levels: INFO, DEBUG, WARNING, ERROR, CRITICAL.
    • Uses ANSI escape codes for colors.
  • Error Handling:
    • Extensive try-except blocks cover Selenium errors, network issues, API errors, and OTP retrieval failures.
    • Detailed error messages, including tracebacks for critical errors, are logged.
    • User status in Firestore is updated to LOGIN_FAILED upon critical failures during their processing, along with a last_error_message.
    • The main loop has a catch-all exception handler to prevent the script from crashing entirely, logging the error and continuing after a delay.
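A timestamped, color-coded logger of the kind described above could look roughly like this. The color choices, timestamp format, and the names format_log_line and log_step_sketch are assumptions for illustration; the real log_step may differ.

```python
from datetime import datetime
from typing import Optional

# Assumed ANSI color per level; the actual mapping is not documented here.
ANSI_COLORS = {
    "DEBUG": "\033[36m",       # cyan
    "INFO": "\033[32m",        # green
    "WARNING": "\033[33m",     # yellow
    "ERROR": "\033[31m",       # red
    "CRITICAL": "\033[1;31m",  # bold red
}
RESET = "\033[0m"


def format_log_line(message: str, level: str = "INFO", now: Optional[datetime] = None) -> str:
    """Build one colored log line; unknown levels are left uncolored."""
    level = level.upper()
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    color = ANSI_COLORS.get(level, "")
    return f"{color}[{stamp}] [{level}] {message}{RESET if color else ''}"


def log_step_sketch(message: str, level: str = "INFO") -> None:
    print(format_log_line(message, level))
```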

5. Component Deep Dive: AppIdApplicationOrchestrator.py

5.1. Overview & Responsibilities

AppIdApplicationOrchestrator.py is the core engine of the job application system. It runs continuously, identifying new job opportunities and rapidly applying for them on behalf of all eligible users, using indexed user matching and pre-created applications. It is designed for high concurrency, low latency, and dynamic learning of new job types.

5.2. Deployment & Execution Environment

  • Deployment: Runs on a headless Linux server, typically an AWS EC2 instance.
  • Reasoning: Its tasks are I/O-bound (network requests) and benefit from an asyncio-based architecture. No GUI is required. A cloud server provides the necessary uptime and network connectivity.
  • Execution: Intended to be run as a persistent service, managed by a process manager like systemd or supervisor.

5.3. Optimized In-Memory User Cache

Enhanced from the previous version with indexed data structures that give O(1) lookup of users by preference:

  • Purpose: Provide microsecond-level access to active user data and enable instant user-job matching through preference indexes.
  • Structure: Three interconnected data structures:
    • user_details_cache: Dict[str, Dict[str, Any]] - Full user data by user_id
    • location_to_users_index: Dict[str, Set[str]] - Maps locations to user_id sets
    • hours_to_users_index: Dict[str, Set[str]] - Maps hour preferences to user_id sets
  • New Cached Data:
    • pre_created_applications: Pre-created application IDs for faster submission
    • processed_location_preferences: Normalized uppercase location list
    • processed_job_hours_preference: Normalized uppercase hour preference
  • Index Management:
    • _add_user_to_indexes(): Adds user to preference indexes
    • _remove_user_from_indexes(): Removes user from preference indexes
    • Automatically maintained during cache updates
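The three structures and their maintenance could be sketched as below. The class name UserCacheSketch and the upsert_user and remove_user methods are hypothetical; only the attribute names and the two index helpers come from the description above, and the normalization shown (strip and uppercase) is an assumption.

```python
from collections import defaultdict
from typing import Any, Dict, Set


class UserCacheSketch:
    def __init__(self) -> None:
        self.user_details_cache: Dict[str, Dict[str, Any]] = {}
        self.location_to_users_index: Dict[str, Set[str]] = defaultdict(set)
        self.hours_to_users_index: Dict[str, Set[str]] = defaultdict(set)

    def _add_user_to_indexes(self, user_id: str, user: Dict[str, Any]) -> None:
        for loc in user["processed_location_preferences"]:
            self.location_to_users_index[loc].add(user_id)
        self.hours_to_users_index[user["processed_job_hours_preference"]].add(user_id)

    def _remove_user_from_indexes(self, user_id: str) -> None:
        user = self.user_details_cache.get(user_id)
        if user is None:
            return
        for loc in user["processed_location_preferences"]:
            self.location_to_users_index[loc].discard(user_id)
        self.hours_to_users_index[user["processed_job_hours_preference"]].discard(user_id)

    def upsert_user(self, user_id: str, raw: Dict[str, Any]) -> None:
        """Replace a user's cached data and keep both indexes in sync."""
        self._remove_user_from_indexes(user_id)  # drop stale index entries first
        user = dict(raw)
        user["processed_location_preferences"] = [
            loc.strip().upper() for loc in raw.get("location_preferences", [])
        ]
        user["processed_job_hours_preference"] = (
            str(raw.get("job_hours_preferences", "ANY")).strip().upper()
        )
        self.user_details_cache[user_id] = user
        self._add_user_to_indexes(user_id, user)

    def remove_user(self, user_id: str) -> None:
        self._remove_user_from_indexes(user_id)
        self.user_details_cache.pop(user_id, None)
```

Removing the old index entries before writing the new ones is what keeps the indexes consistent when a user changes preferences.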

5.7. Enhanced User-Job-Schedule Matching Logic

Optimized from an O(n) scan of all users per job to O(1) index lookups followed by set operations whose cost depends on the size of the candidate sets rather than the total user count:

Previous Approach:

For each job:
  For each user:  # O(n) users
    Check location match
    Check hour preference match

New Approach:

For each job:
  candidate_users = location_to_users_index[job_location] ∪ location_to_users_index["ANY"]  # O(1) lookups; union scales with set sizes
  For each schedule_category in [FLEX_TIME, FULL_TIME, PART_TIME]:
    matching_hour_users = hours_to_users_index[relevant_hour_prefs]  # O(1) lookup per preference
    final_candidates = candidate_users ∩ matching_hour_users  # O(min(sets))

Performance Impact: For typical scenarios with 100+ users and 10+ jobs, this reduces matching time from ~100ms to ~1ms per job.
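The index-based approach can be expressed in Python along these lines. The mapping from schedule category to relevant hour preferences is an assumption (the source does not spell it out), and match_users_for_job is a hypothetical name.

```python
from typing import Dict, Set

# Assumed: which stored user preferences qualify for each schedule category.
RELEVANT_HOUR_PREFS = {
    "FLEX_TIME": ("ANY", "FLEX_TIME", "PART_TIME_OR_FLEX"),
    "PART_TIME": ("ANY", "PART_TIME", "PART_TIME_OR_FLEX"),
    "FULL_TIME": ("ANY", "FULL_TIME"),
}


def match_users_for_job(
    job_location: str,
    location_to_users_index: Dict[str, Set[str]],
    hours_to_users_index: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per schedule category, the user_ids eligible for this job."""
    loc = job_location.upper()
    # Two dictionary lookups, then a union of the resulting sets.
    candidate_users = (
        location_to_users_index.get(loc, set())
        | location_to_users_index.get("ANY", set())
    )
    result: Dict[str, Set[str]] = {}
    for category, prefs in RELEVANT_HOUR_PREFS.items():
        matching_hour_users: Set[str] = set()
        for pref in prefs:
            matching_hour_users |= hours_to_users_index.get(pref, set())
        result[category] = candidate_users & matching_hour_users
    return result
```

Using .get with an empty-set default avoids inserting new keys into the indexes when a job's location has no interested users.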

5.8. Application Submission Workflow (API & WebSocket)

Enhanced with pre-created application support:

  1. execute_api_calls (Enhanced):

    • New Parameter: pre_existing_application_id (optional)
    • Optimization: If pre-existing ID provided, skips create API call entirely
    • Fallback: If no pre-existing ID, uses original create-then-update flow
    • Error Handling: Specific handling for APPLICATION_ALREADY_EXIST_CAN_BE_RESET in orchestrator context
  2. apply_for_job_and_update_status (Enhanced):

    • Checks user's pre_created_applications for existing application ID
    • Passes pre-existing ID to execute_api_calls if available
    • Handles new error status ALREADY_EXISTS_ADVANCED_ORCH from orchestrator's create attempts

5.9. GlobalJobIdMap Management

New Feature: Dynamic learning and management of job type mappings.

Components:

  • load_or_refresh_global_job_id_map(): Loads map from Firestore on startup and periodically
  • derive_representative_hour_type(): Analyzes schedules to determine job's hour type
  • update_global_job_id_map_in_firestore(): Transactionally updates Firestore with new mappings

Dynamic Learning Process:

  1. When orchestrator encounters unknown job_id:
    • Fetches all schedules for the job
    • Derives representative hour type (FLEX_TIME, PART_TIME, FULL_TIME)
    • Creates structured key: JOB_CITY_UPPER-DERIVED_HOUR_TYPE
    • Asynchronously updates GlobalJobIdMap in Firestore
    • Updates local cache for immediate use

Conflict Resolution:

  • Transactional updates prevent race conditions
  • Existing mappings are preserved: the first mapping written for a key wins (first come, first served)
  • Detailed logging for conflict investigation
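The key format and the first-writer-wins rule can be sketched as pure functions. In the real system the merge would run inside a Firestore transaction; that part is omitted here, and build_job_type_key and merge_mapping are hypothetical names.

```python
from typing import Dict, Tuple


def build_job_type_key(job_city: str, derived_hour_type: str) -> str:
    """Build the structured key JOB_CITY_UPPER-DERIVED_HOUR_TYPE."""
    return f"{job_city.strip().upper()}-{derived_hour_type.strip().upper()}"


def merge_mapping(
    current_map: Dict[str, str], key: str, job_id: str
) -> Tuple[Dict[str, str], str]:
    """Return (map, outcome) where outcome is 'added', 'unchanged', or 'conflict'.

    An existing mapping is never overwritten, so the first writer wins.
    """
    existing = current_map.get(key)
    if existing is None:
        return {**current_map, key: job_id}, "added"
    if existing == job_id:
        return current_map, "unchanged"
    # A different job_id already owns this key: keep it and report the conflict.
    return current_map, "conflict"
```

Returning an explicit outcome gives the caller something concrete to log when a conflict needs investigating.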

5.13. Logging & Error Handling

Enhanced with comprehensive performance monitoring:

  • New PERF Log Level: Dedicated category for performance metrics
  • Timing Coverage:
    • load_or_refresh_global_job_id_map: GlobalJobIdMap operations
    • update_global_job_id_map_in_firestore: Firestore update operations
    • process_single_job_concurrently: User matching performance
    • All existing API and WebSocket timing logs
  • Enhanced Error Context: All error messages now include source context (e.g., "Orchestrator:", "LoginManager:")
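One way to produce PERF entries like those described is a timing context manager. This is a sketch under assumptions: the perf_timer name, the message format, and the sink parameter are illustrative, not the orchestrator's actual logging code.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def perf_timer(
    operation: str,
    source: str = "Orchestrator",
    sink: Callable[[str], None] = print,
) -> Iterator[None]:
    """Time the wrapped block and emit one PERF line, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        sink(f"[PERF] {source}: {operation} took {elapsed_ms:.2f} ms")


if __name__ == "__main__":
    with perf_timer("load_or_refresh_global_job_id_map"):
        sum(range(100_000))  # placeholder for the timed work
```

time.perf_counter is a monotonic clock, so the measurement is not affected by system clock adjustments.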

8. Setup, Running & Maintenance

8.1. Prerequisites

  • Python: Version 3.9 or higher.
  • Google Cloud Project: With Firestore enabled.
  • Firebase Service Account Key: Download the JSON key file.
  • Google Chrome: Installed (for LoginManager.py).
  • ChromeDriver: A version matching your installed Chrome, placed in your system's PATH or specified directly.
  • CAPTCHA Solver Extension: (e.g., NopeCHA) downloaded and placed in the extensions/gui directory.
  • Required Python Packages: Install using pip install -r requirements.txt. The requirements.txt should include:
    • firebase-admin
    • google-cloud-firestore
    • selenium
    • python-dotenv
    • aiohttp
    • websockets
    • requests (for LoginManager's OTP client)
  • AWS Account & EC2 Instance: (For AppIdApplicationOrchestrator.py) A Linux EC2 instance configured with Python and credentials for Firestore access (e.g., the service account key on the instance, or Google Cloud Workload Identity Federation so the instance's IAM role can impersonate a service account).

8.2. Running LoginManager.py

  1. Ensure all prerequisites for LoginManager.py are met on your local machine.
  2. Create a .env file in the project root and set FIREBASE_SERVICE_ACCOUNT_KEY_PATH to the path of the service account key JSON file.
  3. New Step: Initialize the GlobalJobIdMap by running python tests/AddGlobalJobIdMap.py once to populate the system with initial job type mappings.
  4. Populate Firestore UserJobProfiles with user data (email, pin, preferences, initial status LOGIN_REQUIRED).
  5. Run the script from the project root: python src/LoginManager.py
  6. Monitor the console output for login progress, pre-creation activity, and errors.

8.3. Running AppIdApplicationOrchestrator.py

  1. Set up an AWS EC2 instance with Python and required packages.
  2. Copy the project files to the EC2 instance.
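  3. Place the Firebase service account key JSON file on the EC2 instance and update the .env file with its path. Alternatively, configure Google Cloud Workload Identity Federation so the instance's IAM role can impersonate a service account with Firestore permissions; an AWS IAM role alone does not grant access to Firestore.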
  4. Set TARGET_COUNTRY in the .env file if not "CA".
  5. Updated: Run the enhanced orchestrator: python src/AppIdApplicationOrchestrator.py
  6. For production, use a process manager like systemd or supervisor to ensure the script runs continuously and restarts on failure.

8.4. Maintenance Notes

  • Monitor Performance Logs: The new PERF logging provides detailed timing information. Monitor for performance degradation over time.
  • GlobalJobIdMap Monitoring: Periodically review the GlobalJobIdMap for completeness and accuracy. New job types should be automatically discovered and added.
  • Pre-creation Success Rates: Monitor LoginManager logs for pre-creation success rates and APPLICATION_ALREADY_EXIST_CAN_BE_RESET errors.
  • Cache Performance: Monitor the indexed cache performance logs to ensure O(1) user lookup performance is maintained.
  • Firestore Costs: The new system generates additional Firestore operations for GlobalJobIdMap management and pre-created application storage. Monitor costs accordingly.
  • API Changes: As before, with additional monitoring of how the create-application API behaves when scheduleId is null.

10. Future Enhancements & Roadmap

Enhanced from previous roadmap:

  • Advanced Pre-creation Strategies:

    • Machine learning models to predict which job types are most likely to appear
    • Batch pre-creation optimization during low-traffic periods
    • Intelligent application ID recycling and management
  • GlobalJobIdMap Enhancements:

    • Multi-region support (US, UK, etc.) with separate mapping documents
    • Automated job type validation and cleanup
    • Integration with job posting analytics for predictive mapping
  • Performance Optimization:

    • Redis integration for ultra-fast cache operations
    • Advanced concurrency patterns with worker pools
    • WebSocket connection pooling and reuse
  • Enhanced Monitoring & Analytics:

    • Real-time dashboards for pre-creation success rates
    • Performance trend analysis and alerting
    • User success rate analytics and optimization recommendations
  • Admin Web Application (Enhanced):

    • Pre-creation management and troubleshooting tools
    • GlobalJobIdMap administration interface
    • Performance monitoring and optimization dashboard
    • Advanced user preference analysis and recommendation engine
