
Add onprem-data-verifier tool #31

Merged
firat-kaya merged 3 commits into main from on-prem-verify on Mar 11, 2026

Conversation

@firat-kaya
Contributor

This commit introduces a new Go application, onprem-data-verifier, designed to validate Looker on-premise backup artifacts before migration to Looker Cloud. The tool performs several checks, including:

  • Workspace structure verification
  • MD5 checksum integrity checks
  • GPG key validation against the customer's LUID
  • SQL dump analysis (version, charset, extended inserts, critical tables)
  • CMK validation (raw and base64)

Unit tests are included for the individual components.

@firat-kaya firat-kaya requested a review from a team as a code owner February 19, 2026 13:39
@gemini-code-assist

Summary of Changes

Hello @firat-kaya, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical new utility designed to enhance the reliability and security of Looker on-premise to cloud migrations. By systematically verifying backup artifacts, the tool aims to prevent common data integrity and security issues, ensuring a smoother transition process. It provides a structured, step-by-step validation workflow, outputting clear results and a comprehensive metadata report.

Highlights

  • New Tool Introduction: A new Go application, onprem-data-verifier, has been added to validate Looker on-premise backup artifacts before migration to Looker Cloud.
  • Comprehensive Validation Pipeline: The tool performs a series of checks including workspace structure, MD5 checksum integrity, GPG key validation, SQL dump analysis (version, charset, extended inserts, critical tables), and Customer Master Key (CMK) validation.
  • CLI Interface and Reporting: The tool features a Cobra-based command-line interface with required flags for backup directory, customer name, and LUID. It provides colored console output and generates a detailed JSON report (metadata.json).
  • Modular Design with Unit Tests: The validation logic is organized into distinct packages (cmd, logger, metadata, validator) and is thoroughly covered by unit tests for checksums, CMK, configuration, GPG, metadata, and SQL structure.
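The flag handling described above can be sketched with the standard library alone. The actual tool uses spf13/cobra with MarkFlagRequired; the options struct and the parse/validate helpers below are hypothetical stand-ins that only illustrate the required-flag behavior:

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// options mirrors the four flags described in the PR.
type options struct {
	backupDir    string
	customerName string
	luid         string
	outputFile   string
}

// validate approximates Cobra's MarkFlagRequired for the three
// mandatory flags; --output falls back to metadata.json.
func validate(o options) error {
	if o.backupDir == "" || o.customerName == "" || o.luid == "" {
		return errors.New("--backupDir, --customerName and --luid are required")
	}
	return nil
}

func parse(args []string) (options, error) {
	fs := flag.NewFlagSet("onprem-data-verifier", flag.ContinueOnError)
	var o options
	fs.StringVar(&o.backupDir, "backupDir", "", "path to the backup workspace")
	fs.StringVar(&o.customerName, "customerName", "", "customer identifier")
	fs.StringVar(&o.luid, "luid", "", "customer LUID for GPG validation")
	fs.StringVar(&o.outputFile, "output", "metadata.json", "report destination")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	return o, validate(o)
}

func main() {
	o, err := parse(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("verifying %s for %s (%s) -> %s\n",
		o.backupDir, o.customerName, o.luid, o.outputFile)
}
```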


Changelog
  • .gitignore
    • Added .idea directory to ignore IDE-specific files.
  • onprem-data-verifier/.idea/vcs.xml
    • Added IntelliJ IDEA VCS mapping configuration for the new project.
  • onprem-data-verifier/README.md
    • Added comprehensive documentation for the onprem-data-verifier tool, detailing its purpose, features, prerequisites, build instructions, usage, workspace requirements, validation pipeline, and output formats.
  • onprem-data-verifier/cmd/root.go
    • Added the main Cobra command structure for the onprem-data-verifier CLI tool.
    • Implemented flag parsing for backupDir, customerName, luid, and outputFile.
    • Integrated the validator.Orchestrator to run the validation process and generate the final report.
    • Updated error handling to use the custom logger package for consistent output.
  • onprem-data-verifier/go.mod
    • Added Go module definition for onprem-data-verifier.
    • Included dependencies for github.com/spf13/cobra, github.com/spf13/viper, and github.com/stretchr/testify.
  • onprem-data-verifier/go.sum
    • Added Go module checksums for all direct and indirect dependencies.
  • onprem-data-verifier/logger/logger.go
    • Added a new logger package to provide standardized, colored console output for different log levels (Step, Success, Warn, Error, Info, Title, Fatal, Completion).
    • Implemented a formatBytes helper function for human-readable size output.
  • onprem-data-verifier/main.go
    • Added the main entry point for the onprem-data-verifier application, calling cmd.Execute().
  • onprem-data-verifier/metadata/metadata.go
    • Added a Report struct to define the structure of the validation output metadata.
    • Implemented ToJSON method for serializing the report to JSON.
    • Implemented Save method for writing the report to a specified file path, with directory validation and fallback logic.
  • onprem-data-verifier/tests/checksum_test.go
    • Added unit tests for CalculateMD5 to verify MD5 hash generation for files.
    • Added unit tests for ParseMD5Manifest to ensure correct parsing of MD5 manifest files, including handling absolute paths.
    • Added unit tests for VerifyFile to confirm file integrity against expected checksums.
  • onprem-data-verifier/tests/cmk_validator_test.go
    • Added unit tests for ValidateCMK to verify Customer Master Key formats (Raw and Base64) and handle invalid lengths or corrupt data.
  • onprem-data-verifier/tests/config_test.go
    • Added unit tests for GetDefaultConfig to ensure the default validation policies, supported versions, and table checks are correctly loaded.
  • onprem-data-verifier/tests/gpg_verify_test.go
    • Added unit tests for ParseGPGColons to extract GPG key IDs from GPG output.
    • Added unit tests for ParseGPGColons_NotFound to handle cases where no keys are found in GPG output.
  • onprem-data-verifier/tests/metadata_test.go
    • Added unit tests for metadata.Report JSON serialization and field mapping.
    • Added unit tests for Report.Save to test saving to default, custom, and non-existent directories with fallback behavior.
  • onprem-data-verifier/tests/sql_structure_test.go
    • Added unit tests for IsLookerVersionSupported to check version compatibility.
    • Added comprehensive unit tests for AnalyzeSQLDump to verify extraction of Looker version, detection of extended inserts, charset, collation, and critical tables from SQL dumps, including gzipped files and edge cases.
  • onprem-data-verifier/validator/ValidationConfig.go
    • Added ValidationConfig struct to define validation rules and policies.
    • Implemented GetDefaultConfig to provide hardcoded default validation settings for supported Looker versions and critical tables.
  • onprem-data-verifier/validator/checksum.go
    • Added CalculateMD5 function for streaming MD5 hash calculation.
    • Implemented ParseMD5Manifest to read and normalize MD5 manifest files.
    • Added VerifyFile function to compare file hashes against expected values.
  • onprem-data-verifier/validator/cmk_validator.go
    • Added ValidateCMK function to check the validity and format (Raw or Base64) of Customer Master Keys.
  • onprem-data-verifier/validator/gpg_verify.go
    • Added ParseGPGColons to extract GPG key IDs from gpg --list-keys output.
    • Implemented GetKeyIDsFromEmail to query the local GPG keyring for keys associated with a specific email.
    • Added VerifyRecipient to check if an encrypted file is encrypted for any of the valid GPG key IDs.
  • onprem-data-verifier/validator/sql_structure.go
    • Added SQLAnalysisResult struct to store results from SQL dump analysis.
    • Implemented AnalyzeSQLDump for a single-pass scan of SQL files (including gzipped) to extract Looker version, detect extended inserts, charsets, collations, and identify critical tables.
    • Added IsLookerVersionSupported helper function for version compatibility checks.
  • onprem-data-verifier/validator/validator.go
    • Added the Orchestrator struct to manage the overall validation workflow.
    • Implemented NewOrchestrator for initializing the validator with customer-specific paths and configuration.
    • Added the Run method to execute all validation steps sequentially, including workspace verification, integrity checks, GPG key validation, database analysis, CMK validation, and filesystem analysis.
    • Implemented individual validation methods (validateDatabase, validateCMK, validateFileSystem, verifyWorkspace, verifyIntegrity, verifyGPGKeys) that update the internal report and log progress.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new Go application, onprem-data-verifier, to validate Looker on-premise backup artifacts. The tool is well-structured, with clear separation of concerns into validation logic, command-line interface, and logging. The inclusion of unit tests is also a great practice. My review focuses on improving correctness, code clarity, and adherence to best practices. I've provided suggestions to fix an invalid Go version, correct a broken example in the command's help text, use a standard timestamp format, and remove an unnecessary IDE file from version control.

Contributor

@drstrangelooker drstrangelooker left a comment


Please address the Gemini issues, then commit.

@firat-kaya
Contributor Author

/gemini review

@firat-kaya firat-kaya merged commit e36ffde into main Mar 11, 2026
2 checks passed

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new Go application, onprem-data-verifier, designed to validate Looker on-premise backup artifacts. The tool is well-structured with a clear separation of concerns and includes a comprehensive suite of validations and tests. However, two significant vulnerabilities were identified: a weak GPG recipient verification that can be bypassed by crafted GPG packets, and a potential Denial of Service vulnerability via gzip bombs. Addressing these issues will significantly improve the robustness and security of the migration process. Additionally, opportunities for improvement were noted in logging, output formatting, and the removal of IDE-specific files and unreachable code to refine the implementation and enhance user-facing output for clarity.

Comment on lines +61 to +65
for _, id := range validKeyIDs {
if strings.Contains(output, id) {
return true, nil
}
}


Severity: high (security)

The VerifyRecipient function uses a weak check (strings.Contains(output, id)) to verify if a file is encrypted for a specific Key ID. This is insecure for several reasons:

  1. Substring Matching: It can match substrings of other Key IDs (e.g., Key ID '1234' would match '0123456789ABCDEF'), leading to false positives.
  2. Security Bypass: An attacker can bypass this check by crafting a GPG message that includes the target Key ID in a non-recipient field that gpg --list-packets displays, such as a literal data packet filename or a fake signature packet.
  3. Empty ID Match: If the Key ID list contains an empty string (e.g., due to malformed GPG output), strings.Contains will always return true, bypassing the check for any file.

This allows an attacker to provide a malicious backup that appears to be encrypted for the correct recipient, bypassing a critical security control.

	// Check if the file is encrypted for ANY of the valid IDs
	// Use a more robust check to ensure the ID is associated with a public key encryption packet
	for _, id := range validKeyIDs {
		if id == "" {
			continue
		}
		// Match the specific pattern for a recipient key ID in gpg --list-packets output
		pattern := fmt.Sprintf("keyid %s", id)
		if strings.Contains(output, pattern) {
			return true, nil
		}
	}
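To see why anchoring the match to the keyid field matters, here is a runnable sketch of the stricter check. hasRecipient is a hypothetical helper, and the --list-packets line format shown is an assumption that varies across gpg versions:

```go
package main

import (
	"fmt"
	"regexp"
)

// hasRecipient looks for a valid key ID in the "keyid" field of gpg
// --list-packets output, instead of a bare substring match. QuoteMeta
// plus a trailing word boundary prevents one ID from matching inside a
// longer one, and empty IDs are skipped entirely.
func hasRecipient(output string, validKeyIDs []string) bool {
	for _, id := range validKeyIDs {
		if id == "" {
			continue
		}
		re := regexp.MustCompile(`keyid ` + regexp.QuoteMeta(id) + `\b`)
		if re.MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	out := ":pubkey enc packet: version 3, algo 1, keyid 0123456789ABCDEF"
	fmt.Println(hasRecipient(out, []string{"1234"}))             // false
	fmt.Println(hasRecipient(out, []string{"0123456789ABCDEF"})) // true
	fmt.Println(hasRecipient(out, []string{""}))                 // false
}
```

This addresses all three weaknesses above: no substring false positives, no match outside the keyid field, and no empty-string bypass.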

Comment on lines +1 to +6
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="VcsDirectoryMappings">
<mapping directory="$PROJECT_DIR$/.." vcs="Git" />
</component>
</project>

Severity: high

IDE-specific configuration files and directories (like .idea) should not be committed to the version control system. While .gitignore has been updated to exclude .idea, this directory was likely added before the ignore rule was in place. Please remove the entire .idea directory from the repository to maintain a clean project structure.

Comment on lines +54 to +60
if strings.HasSuffix(filePath, ".gz") {
gzReader, err = gzip.NewReader(file)
if err != nil {
return nil, fmt.Errorf("failed to create gzip reader: %w", err)
}
defer gzReader.Close()
scanner = bufio.NewScanner(gzReader)

Severity: medium (security)

The AnalyzeSQLDump function uses gzip.NewReader to decompress SQL dumps without any limits on the decompression ratio or the total size of the decompressed data. An attacker can provide a specially crafted 'gzip bomb' (a small compressed file that expands to an enormous size) that can exhaust system resources (CPU and memory), leading to a Denial of Service (DoS) when the tool attempts to scan the file.

Suggested change

-	if strings.HasSuffix(filePath, ".gz") {
-		gzReader, err = gzip.NewReader(file)
-		if err != nil {
-			return nil, fmt.Errorf("failed to create gzip reader: %w", err)
-		}
-		defer gzReader.Close()
-		scanner = bufio.NewScanner(gzReader)
+	if strings.HasSuffix(filePath, ".gz") {
+		gzReader, err = gzip.NewReader(file)
+		if err != nil {
+			return nil, fmt.Errorf("failed to create gzip reader: %w", err)
+		}
+		defer gzReader.Close()
+		// Wrap the reader with a limit to prevent gzip bombs (e.g., a 100 GiB cap)
+		// Note: requires importing the "io" package
+		limitedReader := io.LimitReader(gzReader, 100*1024*1024*1024)
+		scanner = bufio.NewScanner(limitedReader)
+	}

--backupDir ./workspace \
--customerName lookersre-scotty-1 \
--luid "u-12345-6789" \
--output metadata.json"`,

Severity: medium

The example usage in the Long help text includes an extraneous double quote at the end of the line, which could be confusing for users. It should be removed.

Suggested change

-	--output metadata.json"`,
+	--output metadata.json`,

div *= unit
exp++
}
return fmt.Sprintf("%.2f %cB", float64(b)/float64(div), "KMGTPE"[exp])

Severity: medium

The current formatBytes function produces slightly awkward output (e.g., 1.23 K B). To improve clarity and align with IEC standards for units based on powers of 1024, it would be better to format the output as 1.23 KiB, 1.23 MiB, etc.

Suggested change

-	return fmt.Sprintf("%.2f %cB", float64(b)/float64(div), "KMGTPE"[exp])
+	return fmt.Sprintf("%.2f %ciB", float64(b)/float64(div), "KMGTPE"[exp])
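For reference, a complete formatBytes with the suggested IEC units looks roughly like this (a sketch; the surrounding code in the logger package may differ):

```go
package main

import "fmt"

// formatBytes renders a byte count using IEC binary units (KiB, MiB, ...),
// dividing by powers of 1024 and picking the unit letter from "KMGTPE".
func formatBytes(b int64) string {
	const unit = 1024
	if b < unit {
		return fmt.Sprintf("%d B", b)
	}
	div, exp := int64(unit), 0
	for n := b / unit; n >= unit; n /= unit {
		div *= unit
		exp++
	}
	return fmt.Sprintf("%.2f %ciB", float64(b)/float64(div), "KMGTPE"[exp])
}

func main() {
	fmt.Println(formatBytes(512))   // 512 B
	fmt.Println(formatBytes(1536))  // 1.50 KiB
	fmt.Println(formatBytes(3<<20)) // 3.00 MiB
}
```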

Comment on lines +45 to +49
if finalPath != "metadata.json" && finalPath == reportPath {
logger.Info("Saving report to the user provided output path: %s", finalPath)
} else if finalPath == "metadata.json" {
logger.Info("Saving report to the current directory: %s", finalPath)
}

Severity: medium

The logging logic for the report's save path is incomplete. When a user specifies a path in a non-existent directory, the tool correctly falls back to the current directory but fails to log the actual save location, which could cause confusion. Simplifying the logging to always report the final path would be more robust.

	logger.Info("Saving report to: %s", finalPath)

Comment on lines +168 to +170
if len(stats.DetectedCharsets) > 0 {
logger.Success("Database Charset: utf8mb4")
}

Severity: medium

In validateDatabase, if no DEFAULT CHARSET definitions are found in the SQL dump, nothing is logged about the charset check. While this isn't a failure, it's an important piece of information. It would be beneficial to log a warning in this scenario to inform the user.

Suggested change

-	if len(stats.DetectedCharsets) > 0 {
-		logger.Success("Database Charset: utf8mb4")
-	}
+	if len(stats.DetectedCharsets) > 0 {
+		logger.Success("Database Charset: utf8mb4")
+	} else {
+		logger.Warn("No 'DEFAULT CHARSET' definitions found in the SQL dump.")
+	}

Comment on lines +207 to +209
// Execution will never reach here but we need to add a return to keep compiler happy
o.Report.CmkStatus = "Invalid"
return fmt.Errorf("CMK is invalid")

Severity: medium

This block of code is unreachable. The ValidateCMK helper function returns an error on failure, which is handled in the preceding if err != nil block. If no error occurs, isValid is guaranteed to be true, and the function will return from within the if isValid block. This unreachable code should be removed to improve code clarity and maintainability.

