Visual Control Detection (OmniParser)

We also support visual control detection with OmniParser-v2. This method is useful for detecting custom controls in an application that standard UIA methods may miss. Visual control detection uses computer vision techniques to identify UI elements by their visual appearance and to interact with them.

Deployment

On your remote GPU server, clone the OmniParser repository:

git clone https://github.com/microsoft/OmniParser.git

Start the omniparserserver service:

cd OmniParser/omnitool/omniparserserver
python gradio_demo.py

This prints a local URL and a public short URL:

* Running on local URL:  http://0.0.0.0:7861
* Running on public URL: https://xxxxxxxxxxxxxxxxxx.gradio.live

Note: If you have questions about deploying OmniParser, please refer to the README in the OmniParser repository.
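The public URL from the startup output is the value that later goes into the configuration as the endpoint. As a minimal sketch, assuming the log lines look exactly like the sample output above, the URL can be pulled out of captured server output as follows. The helper name `extract_public_url` is hypothetical and not part of OmniParser or this project.

```python
import re
from typing import Optional

# Assumption: the startup log uses the wording shown in the sample output.
_PUBLIC_URL = re.compile(r"Running on public URL:\s*(https?://\S+)")


def extract_public_url(log_text: str) -> Optional[str]:
    """Return the first public URL found in the startup log, or None."""
    match = _PUBLIC_URL.search(log_text)
    return match.group(1) if match else None


sample_log = """\
* Running on local URL:  http://0.0.0.0:7861
* Running on public URL: https://xxxxxxxxxxxxxxxxxx.gradio.live
"""

endpoint = extract_public_url(sample_log)
print(endpoint)
```

If no public URL line is present, the helper returns `None`, so the caller can fall back to the local URL or report the problem.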

Configuration

After deploying the OmniParser model, configure the OmniParser settings in the config.yaml file.

OMNIPARSER: {
  ENDPOINT: "<YOUR_END_POINT>", # The endpoint for the omniparser deployment
  BOX_THRESHOLD: 0.05, # The box confidence threshold for the omniparser, default is 0.05
  IOU_THRESHOLD: 0.1, # The iou threshold for the omniparser, default is 0.1
  USE_PADDLEOCR: True, # Whether to use the paddleocr for the omniparser
  IMGSZ: 640 # The image size for the omniparser
}

To enable icon control filtering, set CONTROL_BACKEND to ["omniparser"] in the config_dev.yaml file.

CONTROL_BACKEND: ["omniparser"]
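To illustrate how the OMNIPARSER settings relate to the keyword arguments of `predict`, here is a minimal sketch. It assumes the YAML file has already been loaded into a plain Python dictionary; the `OmniparserSettings` class and its methods are hypothetical and only mirror the keys and defaults documented on this page.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class OmniparserSettings:
    """Hypothetical container for the OMNIPARSER config section."""

    endpoint: str
    box_threshold: float = 0.05
    iou_threshold: float = 0.1
    use_paddleocr: bool = True
    imgsz: int = 640

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "OmniparserSettings":
        section = config.get("OMNIPARSER", {})
        endpoint = section.get("ENDPOINT", "")
        # Reject a missing endpoint or the unedited placeholder value.
        if not endpoint or endpoint == "<YOUR_END_POINT>":
            raise ValueError("OMNIPARSER.ENDPOINT must point to your deployment")
        return cls(
            endpoint=endpoint,
            box_threshold=float(section.get("BOX_THRESHOLD", 0.05)),
            iou_threshold=float(section.get("IOU_THRESHOLD", 0.1)),
            use_paddleocr=bool(section.get("USE_PADDLEOCR", True)),
            imgsz=int(section.get("IMGSZ", 640)),
        )

    def predict_kwargs(self) -> Dict[str, Any]:
        """Keyword arguments matching the documented predict() signature."""
        return {
            "box_threshold": self.box_threshold,
            "iou_threshold": self.iou_threshold,
            "use_paddleocr": self.use_paddleocr,
            "imgsz": self.imgsz,
        }


config = {"OMNIPARSER": {"ENDPOINT": "https://example.invalid", "IMGSZ": 1280}}
settings = OmniparserSettings.from_config(config)
print(settings.predict_kwargs())
```

Keys that are absent from the config fall back to the documented defaults, so only `ENDPOINT` is strictly required in this sketch.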

Reference

The following classes are used for visual control detection in OmniParser:

Bases: BasicGrounding

The OmniparserGrounding class is a subclass of BasicGrounding that represents the Omniparser grounding model.

`parse_results(results, application_window=None)`

Parse the grounding results into a list of dictionaries containing control element information.

Parameters

`results` (`List[Dict[str, Any]]`) – The list of grounding result dictionaries returned by the grounding model.
`application_window` (`UIAWrapper`, default: `None`) – The application window, used to compute absolute coordinates.

Returns

List[Dict[str, Any]] –

The list of control element information dictionaries. Each dictionary should contain the following keys:
{
    "control_type": the control type of the element,
    "name": the name of the element,
    "x0": the absolute left coordinate of the bounding box, as an integer,
    "y0": the absolute top coordinate of the bounding box, as an integer,
    "x1": the absolute right coordinate of the bounding box, as an integer,
    "y1": the absolute bottom coordinate of the bounding box, as an integer,
}

Source code in automator/ui_control/grounding/omniparser.py

def parse_results(
    self, results: List[Dict[str, Any]], application_window: UIAWrapper = None
) -> List[Dict[str, Any]]:
    """
    Parse the grounding results string into a list of control elements information dictionaries.
    :param results: The list of grounding results dictionaries from the grounding model.
    :param application_window: The application window to get the absolute coordinates.
    :return: The list of control elements information dictionaries, the dictionary should contain the following keys:
    {
        "control_type": The control type of the element,
        "name": The name of the element,
        "x0": The absolute left coordinate of the bounding box in integer,
        "y0": The absolute top coordinate of the bounding box in integer,
        "x1": The absolute right coordinate of the bounding box in integer,
        "y1": The absolute bottom coordinate of the bounding box in integer,
    }
    """

    control_elements_info = []

    if application_window is None:
        application_rect = RECT(0, 0, 0, 0)
    else:
        try:
            application_rect = application_window.rectangle()
        except Exception:
            application_rect = RECT(0, 0, 0, 0)

    for control_info in results:

        if not self._filter_interactivity and control_info.get(
            "interactivity", True
        ):
            continue

        application_left, application_top = (
            application_rect.left,
            application_rect.top,
        )

        control_box = control_info.get("bbox", [0, 0, 0, 0])

        control_left = int(
            application_left + control_box[0] * application_rect.width()
        )
        control_top = int(
            application_top + control_box[1] * application_rect.height()
        )
        control_right = int(
            application_left + control_box[2] * application_rect.width()
        )
        control_bottom = int(
            application_top + control_box[3] * application_rect.height()
        )

        control_elements_info.append(
            {
                "control_type": control_info.get("type", "Button"),
                "name": control_info.get("content", ""),
                "x0": control_left,
                "y0": control_top,
                "x1": control_right,
                "y1": control_bottom,
            }
        )

    return control_elements_info
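The coordinate arithmetic in `parse_results` treats each `bbox` as fractions of the window size and offsets them by the window's top-left corner. The following sketch reproduces only that step with a worked example. The `Rect` class is a hypothetical stand-in for the window rectangle, and `to_absolute` is an illustrative helper, not a function from the library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rect:
    """Hypothetical stand-in for a window rectangle in screen pixels."""

    left: int
    top: int
    right: int
    bottom: int

    def width(self) -> int:
        return self.right - self.left

    def height(self) -> int:
        return self.bottom - self.top


def to_absolute(bbox: List[float], window: Rect) -> Dict[str, int]:
    """Map a normalized [x0, y0, x1, y1] box to absolute pixel coordinates."""
    x0, y0, x1, y1 = bbox
    return {
        "x0": int(window.left + x0 * window.width()),
        "y0": int(window.top + y0 * window.height()),
        "x1": int(window.left + x1 * window.width()),
        "y1": int(window.top + y1 * window.height()),
    }


# An 800 x 600 window whose top-left corner sits at (100, 50).
window = Rect(left=100, top=50, right=900, bottom=650)
box = to_absolute([0.25, 0.5, 0.5, 0.75], window)
print(box)  # {'x0': 300, 'y0': 350, 'x1': 500, 'y1': 500}
```

With an all-zero rectangle, which is what `parse_results` falls back to when no application window is given or its rectangle cannot be read, every coordinate collapses to 0.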

`predict(image_path, box_threshold=0.05, iou_threshold=0.1, use_paddleocr=True, imgsz=640, api_name='/process')`

Predict the grounding for the given image.

Parameters

image_path (str) – The path to the image.
box_threshold (float, default: 0.05) – The confidence threshold for bounding boxes.
iou_threshold (float, default: 0.1) – The intersection-over-union (IoU) threshold.
use_paddleocr (bool, default: True) – Whether to use paddleocr.
imgsz (int, default: 640) – The image size.
api_name (str, default: '/process') – The name of the API.

Returns	`List[Dict[str, Any]]` – The predicted grounding results, parsed into a list of dictionaries.

Source code in automator/ui_control/grounding/omniparser.py

def predict(
    self,
    image_path: str,
    box_threshold: float = 0.05,
    iou_threshold: float = 0.1,
    use_paddleocr: bool = True,
    imgsz: int = 640,
    api_name: str = "/process",
) -> List[Dict[str, Any]]:
    """
    Predict the grounding for the given image.
    :param image_path: The path to the image.
    :param box_threshold: The threshold for the bounding box.
    :param iou_threshold: The threshold for the intersection over union.
    :param use_paddleocr: Whether to use paddleocr.
    :param imgsz: The image size.
    :param api_name: The name of the API.
    :return: The predicted grounding results string.
    """

    list_of_grounding_results = []

    if not os.path.exists(image_path):
        print_with_color(
            f"Warning: The image path {image_path} does not exist.", "yellow"
        )
        return list_of_grounding_results

    try:
        results = self.service.chat_completion(
            image_path, box_threshold, iou_threshold, use_paddleocr, imgsz, api_name
        )
        grounding_results = results[1].splitlines()

    except Exception as e:
        print_with_color(
            f"Warning: Failed to get grounding results for Omniparser. Error: {e}",
            "yellow",
        )

        return list_of_grounding_results

    for item in grounding_results:
        try:
            item = json.loads(item)
            list_of_grounding_results.append(item)
        except json.JSONDecodeError:
            try:
                # the item string is a string converted from python's dict
                item = ast.literal_eval(item[item.index("{"):item.rindex("}") + 1])
                list_of_grounding_results.append(item)
            except Exception:
                pass

    return list_of_grounding_results
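The last loop of `predict` tries JSON first and falls back to `ast.literal_eval` for lines that are Python dict representations. The sketch below isolates that two-stage parsing so it can be run without a server. The function name `parse_grounding_lines` and the sample lines are assumptions made for illustration; the real service output may be formatted differently.

```python
import ast
import json
from typing import Any, Dict, List


def parse_grounding_lines(raw: str) -> List[Dict[str, Any]]:
    """Parse each line as JSON, falling back to a Python dict literal."""
    parsed: List[Dict[str, Any]] = []
    for line in raw.splitlines():
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            try:
                # Keep only the outermost {...} span; this drops any prefix
                # text and raises ValueError if the line has no braces.
                start, end = line.index("{"), line.rindex("}") + 1
                parsed.append(ast.literal_eval(line[start:end]))
            except (ValueError, SyntaxError):
                # Lines that are neither JSON nor a dict literal are skipped.
                pass
    return parsed


# Hypothetical sample output: one JSON line, one Python-repr line, one junk line.
raw = "\n".join(
    [
        '{"type": "text", "bbox": [0.0, 0.0, 0.5, 0.1], "content": "File"}',
        "icon 0: {'type': 'icon', 'bbox': [0.1, 0.2, 0.3, 0.4], 'content': 'Save'}",
        "not a result line",
    ]
)

results = parse_grounding_lines(raw)
print(len(results))  # 2
```

`ast.literal_eval` only evaluates literals, so the fallback accepts single-quoted keys and Python's `True`/`False` without executing arbitrary code from the service response.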