The Signal · Human reviewed

Surgical AI developments, translated for the operating room.

Every article is source-linked, reviewed by a human editor, and explicit about evidence status, limitations, and commercial origin.

evergreen editorial · educational synthesis

The Case for Surgeon-Led Data Communities

Deck: Clinicians can help define useful questions, labels, failure modes, and governance, provided participation is structured and accountable.

Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.

Estimated reading time: 9–12 minutes.

Tags: community, surgeon-led, collaboration.

The issue beneath the headline

Data communities can reduce duplication and improve clinical relevance by sharing protocols, definitions, lessons, and carefully governed resources. Leadership should include technical, patient, legal, and operational perspectives. Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

Surgeons from several hospitals could agree on a phase taxonomy and hard-case review process before pooling any video, making later comparisons more meaningful. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for communities that focus on data extraction without contributor recognition, institutional authority, sustainable stewardship, or patient interests. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

What shared definition or governance tool would be more valuable than immediately pooling raw video? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. STANDING Together Recommendations for Health Datasets. https://www.nature.com/articles/s41591-024-03147-2
4. WHO Ethics and Governance of Artificial Intelligence for Health. https://www.who.int/publications/i/item/9789240029200

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
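A shared phase taxonomy is easier to discuss when annotation disagreement, one of the omissions named in this article, is actually measured. The sketch below computes Cohen's kappa for two annotators who labelled the same video segments and lists the segments where they differ, which is one possible input to a hard-case review. The phase names, the two label lists, and the function name `cohens_kappa` are illustrative assumptions, not data or tooling from any cited source.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical phase labels for eight video segments (assumed for illustration).
surgeon_1 = ["prep", "prep", "dissect", "dissect", "clip", "clip", "close", "close"]
surgeon_2 = ["prep", "dissect", "dissect", "dissect", "clip", "clip", "clip", "close"]

kappa = cohens_kappa(surgeon_1, surgeon_2)
# Indices of segments that could go to a hard-case review.
disagreements = [i for i, (a, b) in enumerate(zip(surgeon_1, surgeon_2)) if a != b]

print(f"kappa = {kappa:.3f}")
print("segments to review:", disagreements)
```

Kappa is only one summary. The list of disputed segments is often more useful to a community than the coefficient, because it shows where the definitions themselves need work.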

evergreen editorial · educational synthesis

What “Generalization” Should Mean in Surgical AI

Deck: Generalization should name the variation crossed: patients, surgeons, centers, devices, procedures, time, or clinical prevalence.

Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.

Estimated reading time: 9–12 minutes.

Tags: generalization, external-validation, claims.

The issue beneath the headline

Saying that a model generalizes without specifying the domain change makes the claim impossible to interpret. Each new setting tests different assumptions. Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A model may generalize to new surgeons using the same equipment but fail when moved to a hospital with different cameras and workflow. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for random splits described as generalization, pooled multicenter results without center-specific analysis, and no uncertainty around subgroup estimates. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Across which concrete boundary does the reported evidence support transfer? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. On Calibration of Modern Neural Networks. https://proceedings.mlr.press/v70/guo17a.html
4. Variable Generalization Performance of a Deep Learning Model for Chest Radiographs. https://doi.org/10.1371/journal.pmed.1002683
5. TRIPOD+AI Statement. https://www.bmj.com/content/385/bmj-2023-078378

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
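The warning in this article about pooled multicenter results and missing uncertainty around subgroup estimates can be illustrated with a few lines of code. The sketch below reports accuracy per center with a 95% Wilson score interval, assuming one record per independent case. The center names and counts are invented for illustration, and the Wilson interval is one reasonable choice among several; no cited source prescribes it.

```python
import math
from collections import defaultdict


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def per_center_accuracy(records):
    """records: iterable of (center, correct) pairs, one per independent case."""
    tally = defaultdict(lambda: [0, 0])  # center -> [correct cases, total cases]
    for center, correct in records:
        tally[center][0] += int(correct)
        tally[center][1] += 1
    report = {}
    for center, (hits, n) in tally.items():
        report[center] = {
            "n": n,
            "accuracy": hits / n,
            "ci95": wilson_interval(hits, n),
        }
    return report


# Hypothetical case-level results from two centers (assumed for illustration).
records = (
    [("hospital_a", True)] * 45 + [("hospital_a", False)] * 5
    + [("hospital_b", True)] * 6 + [("hospital_b", False)] * 4
)

report = per_center_accuracy(records)
pooled = sum(correct for _, correct in records) / len(records)

print(f"pooled accuracy: {pooled:.2f}")
for center, row in sorted(report.items()):
    low, high = row["ci95"]
    print(f"{center}: n={row['n']} accuracy={row['accuracy']:.2f} ci95=({low:.2f}, {high:.2f})")
```

In this invented example the pooled figure is dominated by the larger center, and the smaller center has both a lower estimate and a much wider interval. A single pooled number hides exactly that pattern.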

evergreen editorial · educational synthesis

Why Retrospective Accuracy Is Not Clinical Readiness

Deck: Archived-data performance is a necessary technical result for many systems, but readiness requires live reliability, usability, safety, and impact evidence.

Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.

Estimated reading time: 9–12 minutes.

Tags: retrospective, clinical-readiness, validation.

The issue beneath the headline

Retrospective studies can estimate discrimination under documented conditions. They cannot fully measure workflow adaptation, missing live inputs, user behavior, or maintenance. Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A retrospective anatomy detector may be accurate on selected frames while lacking a way to handle low-quality live video or communicate uncertainty. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for product claims based solely on benchmark metrics and no definition of intended user, decision, latency, or fallback. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

What additional evidence would be required before this output influenced an intraoperative decision? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. On Calibration of Modern Neural Networks. https://proceedings.mlr.press/v70/guo17a.html
4. Variable Generalization Performance of a Deep Learning Model for Chest Radiographs. https://doi.org/10.1371/journal.pmed.1002683
5. TRIPOD+AI Statement. https://www.bmj.com/content/385/bmj-2023-078378
6. DECIDE-AI: Reporting Guideline for Early-Stage Clinical Evaluation of AI. https://doi.org/10.1136/bmj-2022-070904

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
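This article asks engineers to measure calibration and notes that a detector may have no way to communicate uncertainty. One common summary is expected calibration error with equal-width confidence bins, sketched below. The function name, the bin count, and the example confidences are assumptions made for illustration; bin edges here are left-closed, which may differ slightly from other implementations.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece


# Hypothetical detector outputs (assumed): always 95% confident, right 6 times in 10.
overconfident = expected_calibration_error([0.95] * 10, [True] * 6 + [False] * 4)
# Hypothetical well-calibrated outputs: 75% confident, right 3 times in 4.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])

print(f"overconfident ECE: {overconfident:.2f}")
print(f"calibrated ECE: {calibrated:.2f}")
```

A low value on archived data says nothing about live video quality or user behavior, so this number belongs with the retrospective evidence and does not replace the prospective steps described in this article.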

evergreen editorial · educational synthesis

How to Read a Surgical AI Abstract Critically

Deck: Abstracts compress complex studies and often foreground technical results while omitting the conditions that limit them.

Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.

Estimated reading time: 9–12 minutes.

Tags: critical-reading, abstracts, evidence.

The issue beneath the headline

A critical reading identifies the dataset, independent case count, task, split, comparator, metric, validation setting, and exact claim. Missing information becomes a question for the full text. Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

An abstract reporting state-of-the-art accuracy may not reveal that evaluation used one center, selected frames, or an offline sequence model with future context. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for clinical verbs such as improves, prevents, or supports when the study only measured retrospective label prediction. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Which sentence in the conclusion goes beyond what the methods directly tested? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
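The ten-percent test raised in this article is simple enough to run whenever per-case results are available. The sketch below compares the mean per-case score with the mean after the highest-scoring tenth of cases is dropped. Treating "easiest" as "highest per-case score" is an assumption made here for illustration; a clinical rating of case difficulty would be a different and arguably better definition. The scores are invented.

```python
def score_without_easiest(case_scores, fraction=0.10):
    """Return (overall mean, mean after dropping the highest-scoring fraction)."""
    if not case_scores:
        raise ValueError("case_scores must not be empty")
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    ordered = sorted(case_scores)  # ascending, so the easiest cases come last
    drop = round(len(ordered) * fraction)
    if drop >= len(ordered):
        raise ValueError("fraction would remove every case")
    kept = ordered[: len(ordered) - drop]
    overall = sum(ordered) / len(ordered)
    trimmed = sum(kept) / len(kept)
    return overall, trimmed


# Hypothetical per-case scores, one value per independent procedure (assumed).
scores = [0.99, 0.98, 0.97, 0.95, 0.93, 0.90, 0.85, 0.80, 0.60, 0.40]

overall, trimmed = score_without_easiest(scores)
print(f"all cases: {overall:.3f}")
print(f"without easiest 10%: {trimmed:.3f}")
```

The two low scores at the end of the list matter more than either mean. Printing the sorted per-case values alongside the summary is a cheap way to look at the error distribution the article refers to.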

evergreen editorial · educational synthesis

Dataset Stewardship: Why Uploading Is Not Enough

Deck: Long-term value depends on documentation, versioning, access decisions, corrections, and responsibility for derived work.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: stewardship, datasets, maintenance.

The issue beneath the headline

A dataset is infrastructure. Stewardship covers provenance, data dictionaries, licenses, issue handling, withdrawal, updates, security, and communication with users.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

If annotation definitions change, maintainers need versioned releases and guidance about whether benchmark results remain comparable. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Look for abandoned download links, unclear licenses, undocumented corrections, and no contact path for reporting problems. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Who remains accountable for the dataset five years after its initial release? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. STANDING Together Recommendations for Health Datasets. https://www.nature.com/articles/s41591-024-03147-2
4. WHO Ethics and Governance of Artificial Intelligence for Health. https://www.who.int/publications/i/item/9789240029200

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
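
The versioning point in this article's example can be made concrete. What follows is a minimal sketch, not a prescribed tool. It assumes a hypothetical release record with a "major.minor.patch" version string, and it assumes a convention in which a change in the major number signals that annotation definitions changed. The class name, the field names (`version`, `label_definitions`), and the convention itself are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRelease:
    """Hypothetical release record; field names are illustrative only."""
    version: str                  # assumed format: "major.minor.patch"
    label_definitions: frozenset  # names of the annotated classes


def major(version: str) -> int:
    """Return the major component of a 'major.minor.patch' string."""
    return int(version.split(".")[0])


def results_comparable(a: DatasetRelease, b: DatasetRelease) -> bool:
    """Assumed convention: benchmark results are comparable only if the
    major version matches and the set of label names is identical."""
    same_major = major(a.version) == major(b.version)
    same_labels = a.label_definitions == b.label_definitions
    return same_major and same_labels


# Made-up releases for illustration.
v1 = DatasetRelease("1.2.0", frozenset({"preparation", "dissection", "closure"}))
v1_fix = DatasetRelease("1.2.1", frozenset({"preparation", "dissection", "closure"}))
v2 = DatasetRelease("2.0.0", frozenset({"preparation", "dissection", "clipping", "closure"}))

print(results_comparable(v1, v1_fix))  # True
print(results_comparable(v1, v2))      # False
```

A check like this does not replace written guidance from maintainers. It only makes the comparability rule explicit enough to be questioned.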

evergreen editorial · educational synthesis

Why De-identification Is Harder Than Blurring Faces

Deck: Surgical data can remain identifiable through audio, metadata, timestamps, rare events, and linkage to other records.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: de-identification, privacy, governance.

The issue beneath the headline

De-identification is a risk-reduction process, not a single image filter. The relevant identifiers depend on data type, context, access model, and possible linkage.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

An endoscopic video may show no face but contain spoken names, device serial numbers, procedure dates, and a rare anatomical finding connected to a public case report. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for claims of anonymity without threat modeling, metadata review, audio handling, access controls, or assessment of derived data. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Which nonvisual detail could reconnect this recording to a patient or procedure? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. STANDING Together Recommendations for Health Datasets. https://www.nature.com/articles/s41591-024-03147-2
4. WHO Ethics and Governance of Artificial Intelligence for Health. https://www.who.int/publications/i/item/9789240029200

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
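
The nonvisual identifiers named in this article's example (spoken names, device serial numbers, procedure dates) suggest a simple first-pass audit of the metadata that travels with a recording. The sketch below is illustrative only. The field names, the date pattern, and the `has_audio` flag are hypothetical, and a real list would come from a documented threat model. An empty result means nothing was flagged; it does not mean the recording is anonymous.

```python
import re

# Hypothetical field names chosen for illustration.
RISKY_FIELDS = {"patient_name", "device_serial", "procedure_date", "operator"}
# Flags ISO-style dates (YYYY-MM-DD) embedded in free text.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def audit_metadata(metadata: dict) -> list:
    """Return a sorted list of findings for one recording's metadata."""
    findings = []
    for key, value in metadata.items():
        if key in RISKY_FIELDS and value:
            findings.append(f"{key}: direct identifier field is populated")
        elif isinstance(value, str) and DATE_PATTERN.search(value):
            findings.append(f"{key}: free text contains a date")
    # Audio is handled separately because a filter on images never touches it.
    if metadata.get("has_audio"):
        findings.append("has_audio: audio track may contain spoken names")
    return sorted(findings)


# Made-up record with placeholder values.
record = {
    "device_serial": "SN-0000",
    "notes": "recorded 2021-03-04, routine case",
    "has_audio": True,
    "resolution": "1920x1080",
}
for finding in audit_metadata(record):
    print(finding)
```

A field-level audit of this kind cannot detect linkage risk, such as a rare finding that matches a public case report. That still requires human review.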

evergreen editorial · educational synthesis

The Hidden Importance of Out-of-Body Frames

Deck: Camera removal and low-information moments test whether a system can recognize when its task is unsupported.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: out-of-body, abstention, video-quality.

The issue beneath the headline

Out-of-body frames are common workflow events. Excluding them can create systems that confidently classify phases or anatomy when no surgical scene is visible.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A phase model might infer a late procedural phase from timing while the camera shows a drape. That may be acceptable for indexing but unsafe for visual guidance. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for datasets that silently remove these frames and models without an unknown, low-quality, or abstention state. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Should the system preserve temporal context, abstain, or reset when the camera leaves the body? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. https://doi.org/10.1109/TMI.2016.2593957
4. TeCNO for Online Recognition of Surgical Phases. https://arxiv.org/abs/2003.10751
5. Endoscapes: A Critical View of Safety Dataset for Laparoscopic Cholecystectomy. https://arxiv.org/abs/2312.12429

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
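
The abstention state this article asks for can be sketched as a thin wrapper around a phase classifier. This is a minimal illustration under stated assumptions. It presumes a separate out-of-body detector already produces a probability, and the function name, the thresholds (0.5 and 0.6), and the phase names are hypothetical, not validated values. It shows only the abstain option from the practical question above; it does not address whether temporal context should be preserved or reset.

```python
from typing import Dict

ABSTAIN = "abstain"


def predict_phase(phase_probs: Dict[str, float],
                  out_of_body_prob: float,
                  oob_threshold: float = 0.5,
                  min_confidence: float = 0.6) -> str:
    """Return a phase label, or ABSTAIN when the frame does not support one.

    Thresholds are illustrative assumptions, not validated values.
    """
    # An upstream detector (assumed to exist) scores whether the camera
    # is outside the body; if so, no phase is reported at all.
    if out_of_body_prob >= oob_threshold:
        return ABSTAIN
    label, confidence = max(phase_probs.items(), key=lambda item: item[1])
    # Also abstain when no phase is predicted with enough confidence.
    if confidence < min_confidence:
        return ABSTAIN
    return label


# Made-up probabilities for illustration.
probs = {"dissection": 0.85, "clipping": 0.10, "closure": 0.05}
print(predict_phase(probs, out_of_body_prob=0.05))  # dissection
print(predict_phase(probs, out_of_body_prob=0.95))  # abstain
```

Putting the out-of-body check first is a design choice: a confident phase score is ignored whenever the scene itself is unsupported, which is the failure the drape example describes.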

evergreen editorial · educational synthesis

Why Surgical AI Needs Better Negative Examples

Deck: Models learn boundaries from what is absent as well as what is present, yet negative cases are often underspecified or too easy.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: negative-examples, dataset-design, shortcuts.

The issue beneath the headline

Useful negatives resemble the target without meeting its definition. They help distinguish true clinical criteria from visual shortcuts.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

For critical-view assessment, negatives should include plausible but incomplete views, not only early dissection frames where the answer is obvious. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Look for hard negatives, prevalence matching, mislabeled uncertainty, and whether negative examples cover failure conditions expected in practice. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Which near-miss example would most effectively test whether the model learned the intended concept? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. STANDING Together Recommendations for Health Datasets. https://www.nature.com/articles/s41591-024-03147-2
4. WHO Ethics and Governance of Artificial Intelligence for Health. https://www.who.int/publications/i/item/9789240029200

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
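
The distinction this article draws between obvious negatives and plausible but incomplete views can be checked numerically. The following is a minimal sketch with made-up data. It assumes a reviewer has tagged each negative case as `easy` or `hard`, which is a hypothetical tagging scheme, and it reports the fraction of negatives correctly rejected in each group instead of one pooled number.

```python
from collections import defaultdict


def specificity_by_stratum(records):
    """Compute the fraction of negatives correctly rejected per stratum.

    records: iterable of (stratum, predicted_positive) pairs, restricted to
    cases whose reference label is negative.
    """
    rejected = defaultdict(int)
    total = defaultdict(int)
    for stratum, predicted_positive in records:
        total[stratum] += 1
        if not predicted_positive:
            rejected[stratum] += 1
    return {stratum: rejected[stratum] / total[stratum] for stratum in total}


# Toy, made-up data: every case here has a negative reference label.
negatives = [
    ("easy", False), ("easy", False), ("easy", False), ("easy", False),
    ("hard", True), ("hard", False), ("hard", True), ("hard", False),
]
print(specificity_by_stratum(negatives))  # {'easy': 1.0, 'hard': 0.5}
```

In this toy case the pooled figure would be 0.75, which hides the fact that half of the hard negatives are accepted as positives.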

evergreen editorial · educational synthesis

What Foundation Models Will Not Fix

Deck: Scale cannot repair unclear labels, poor governance, missing populations, weak intended-use definitions, or absent clinical evaluation.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: foundation-models, limitations, governance.

The issue beneath the headline

Foundation models can improve representation learning but inherit the data and objectives used to train them. They do not determine which clinical question is worth solving.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A model pretrained on millions of routine frames may still have little evidence for rare complications or unusual anatomy. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for universal claims based on average benchmark gains and for deployments that substitute model scale for local validation. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Which project weakness would remain unchanged if the model became ten times larger? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. STANDING Together Recommendations for Health Datasets. https://www.nature.com/articles/s41591-024-03147-2
4. WHO Ethics and Governance of Artificial Intelligence for Health. https://www.who.int/publications/i/item/9789240029200
5. Segment Anything. https://arxiv.org/abs/2304.02643
6. DINOv2: Learning Robust Visual Features without Supervision. https://arxiv.org/abs/2304.07193

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
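
The ten-percent test described above is easy to run once scores exist per independent case. The sketch below is one possible reading of that test, with made-up numbers. It assumes one score per procedure (not per frame), that a higher score means an easier case for the model, and that the number of cases to drop is rounded to the nearest whole case. The function name and the `fraction` parameter are illustrative.

```python
from statistics import mean


def mean_without_easiest(case_scores, fraction=0.10):
    """Mean per-case score after dropping the highest-scoring cases.

    Assumes one score per independent case and that a higher score
    marks an easier case for the model.
    """
    if not case_scores:
        raise ValueError("case_scores must not be empty")
    # Number of cases to drop, rounded to the nearest whole case.
    n_drop = round(len(case_scores) * fraction)
    kept = sorted(case_scores)[: len(case_scores) - n_drop]
    if not kept:
        raise ValueError("fraction removes every case")
    return mean(kept)


# Toy, made-up per-case scores for ten procedures.
scores = [0.99, 0.95, 0.94, 0.93, 0.92, 0.91, 0.90, 0.60, 0.55, 0.40]
print(round(mean(scores), 3))                  # 0.809
print(round(mean_without_easiest(scores), 3))  # 0.789
```

Defining "easiest" by the model's own score is a convenience. A stricter version would use a difficulty rating assigned independently of the model, for example by clinician review.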


evergreen editorial · educational synthesis

What Foundation Models Could Change in Surgical Video

Deck: Reusable representations may reduce labeling needs and support several downstream tasks, especially where local datasets are small.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: foundation-models, pretraining, surgical-video.

The issue beneath the headline

Large-scale self-supervised or multimodal pretraining can capture recurring visual and temporal structure. Adaptation may improve retrieval, segmentation, workflow recognition, or report support.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A hospital could fine-tune a pretrained surgical encoder for a local phase taxonomy with fewer annotations than training from scratch. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for leakage between pretraining and benchmarks, opaque training data, compute barriers, and evaluation limited to familiar procedures. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it.
It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Which downstream task benefits, and how much local data and validation are still required? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. https://doi.org/10.1109/TMI.2016.2593957
4. TeCNO for Online Recognition of Surgical Phases. https://arxiv.org/abs/2003.10751
5. Endoscapes: A Critical View of Safety Dataset for Laparoscopic Cholecystectomy. https://arxiv.org/abs/2312.12429
6. Segment Anything. https://arxiv.org/abs/2304.02643

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
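
Two of the checks named in this article can be sketched in code: keeping whole procedures on one side of a split, and looking for overlap between a pretraining corpus and a benchmark. The sketch below is illustrative only. It assumes every frame carries a reliable case identifier, and the function names and IDs are hypothetical. Leakage can also occur without shared identifiers, for example when the same video is re-encoded or re-cropped, so ID matching may miss real overlap.

```python
import random

def split_by_case(frame_to_case, test_fraction=0.2, seed=0):
    """Assign whole cases, not individual frames, to train or test.

    frame_to_case: dict mapping a frame ID to the case (procedure) it came from.
    Returns (train_frames, test_frames) as sorted lists of frame IDs.
    """
    cases = sorted(set(frame_to_case.values()))
    rng = random.Random(seed)
    rng.shuffle(cases)
    n_test = max(1, round(len(cases) * test_fraction))
    held_out = set(cases[:n_test])
    train = sorted(f for f, c in frame_to_case.items() if c not in held_out)
    test = sorted(f for f, c in frame_to_case.items() if c in held_out)
    return train, test

def overlapping_cases(pretraining_cases, benchmark_cases):
    """Case IDs present in both the pretraining corpus and the benchmark."""
    return set(pretraining_cases) & set(benchmark_cases)

# Hypothetical IDs: 5 cases with 4 frames each.
frames = {f"case{c}_frame{i}": f"case{c}" for c in range(5) for i in range(4)}
train_frames, test_frames = split_by_case(frames)
train_cases = {frames[f] for f in train_frames}
test_cases = {frames[f] for f in test_frames}
print("cases shared between splits:", train_cases & test_cases)
print("pretraining/benchmark overlap:",
      overlapping_cases(["case0", "case7"], ["case0", "case3"]))
```

Splitting by case is the minimum; a split by patient, hospital, or time period follows the same pattern with a different grouping key.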


evergreen editorial · educational synthesis

How to Think About Explainability in Surgical AI

Deck: An explanation is useful only when it is faithful to the model, understandable to its user, and connected to a real review task.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: explainability, interpretability, interfaces.

The issue beneath the headline

Heatmaps, examples, concept scores, and language rationales reveal different aspects of behavior. None automatically proves that the model used clinically appropriate evidence.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A heatmap over Calot's triangle may reassure a viewer, but it should be tested by changing that region and observing whether the prediction responds coherently. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for explanations selected only because they look plausible, no user study, unstable outputs, and claims that explanation compensates for weak validation. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it.
It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

What decision can the user make more safely because this explanation is present? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
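
The heatmap test in this article's example (change the highlighted region and observe whether the prediction responds) can be sketched as a simple occlusion comparison. This is a toy illustration, not any published method's reference implementation. The `toy_model` function is a hypothetical stand-in for a real network, images are plain nested lists, and the region coordinates are invented. Occlusion with a constant fill is also only one way to perturb an input and can itself push the image out of distribution.

```python
def occlude(image, region, fill=0.0):
    """Return a copy of image (list of rows) with region set to fill.

    region: (top, left, height, width) in pixel coordinates.
    """
    top, left, h, w = region
    out = [row[:] for row in image]
    for r in range(top, top + h):
        for c in range(left, left + w):
            out[r][c] = fill
    return out

def perturbation_check(model, image, highlighted, control):
    """Score drop from occluding the highlighted region vs. a control region."""
    base = model(image)
    drop_highlighted = base - model(occlude(image, highlighted))
    drop_control = base - model(occlude(image, control))
    return drop_highlighted, drop_control

# Hypothetical stand-in for a real network: it scores only the top-left 2x2 patch.
def toy_model(image):
    patch = [image[r][c] for r in range(2) for c in range(2)]
    return sum(patch) / len(patch)

image = [[1.0] * 4 for _ in range(4)]
heatmap_region = (2, 2, 2, 2)   # where a (hypothetical) heatmap is brightest
control_region = (0, 0, 2, 2)   # same-sized region elsewhere
drop_h, drop_c = perturbation_check(toy_model, image, heatmap_region, control_region)
print(f"drop when occluding highlighted region: {drop_h:.2f}")
print(f"drop when occluding control region:     {drop_c:.2f}")
```

In this toy case the score does not move when the highlighted region is hidden but collapses when the control region is hidden, which is the pattern that would suggest the heatmap is not faithful to the model.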


evergreen editorial · educational synthesis

Why Confidence Scores Can Mislead Surgeons

Deck: A numerical probability can look precise even when the model is miscalibrated or facing data unlike its training set.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: confidence, calibration, uncertainty.

The issue beneath the headline

Confidence usually reflects a model's internal scoring after training. It is not a direct measure of clinical correctness, image adequacy, or familiarity with the case.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

An anatomy model may report 95 percent confidence on a smoke-obscured frame because neural networks can be overconfident outside their training distribution. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Look for calibration curves, external calibration, uncertainty under shift, abstention rules, and interfaces that avoid false precision. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it.
It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

What does the displayed number claim to measure, and was that interpretation validated? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. On Calibration of Modern Neural Networks. https://proceedings.mlr.press/v70/guo17a.html
4. Variable Generalization Performance of a Deep Learning Model for Chest Radiographs. https://doi.org/10.1371/journal.pmed.1002683
5. TRIPOD+AI Statement. https://www.bmj.com/content/385/bmj-2023-078378

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
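
The gap between a displayed confidence and observed correctness can be summarized with expected calibration error, the binned measure used in the calibration literature cited above. The sketch below is a simplified version under stated assumptions: equal-width bins that are closed on the left, one confidence value per prediction, and hypothetical example data. It measures calibration on the data supplied and says nothing about behavior under shift.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: 1 if the prediction was right, else 0 (same length).
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical outputs: the model reports 0.95 every time but is right 6 times in 10.
confs = [0.95] * 10
hits = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
ece_value = expected_calibration_error(confs, hits)
print(f"ECE = {ece_value:.2f}")
```

A single summary number can hide where the miscalibration sits, so a calibration curve per center or per condition is usually more informative than one pooled value.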


evergreen editorial · educational synthesis

The Difference Between Automation and Assistance in Surgery

Deck: Automation transfers execution of a task; assistance supplies information or capability while leaving defined decisions and actions with people.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: automation, assistance, responsibility.

The issue beneath the headline

The distinction changes evidence, responsibility, interface design, and risk. Systems also exist on a continuum rather than in two neat categories.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A model that indexes procedure phases after surgery is analytical assistance. A robot that adjusts an instrument trajectory acts on the physical world and requires substantially stronger controls. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for marketing language that calls a suggestion autonomous or calls an automated action mere support. Ask who initiates, confirms, monitors, and can override the action. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it.
It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

At which exact point does the system change from informing a person to executing part of the task? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.


evergreen editorial · educational synthesis

Why Annotation Is a Clinical Act, Not Just a Labeling Task

Deck: Every label embeds a definition of what matters, what is visible, and how ambiguity should be resolved.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: annotation, clinical-expertise, labels.

The issue beneath the headline

Clinical annotation requires translating expertise into operational rules. The process can shape the target more strongly than the model architecture.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

Defining a bleeding event requires decisions about active flow, pooled blood, irrigation, duration, and visibility. Those decisions determine prevalence and what errors mean. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Look for annotator expertise, protocols, examples, agreement, adjudication, and versioning rather than accepting the phrase "expert annotated" as sufficient. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it.
It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period. For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone. Practical question for readers Which clinical judgment has been compressed into the label used by the model? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model. Closing perspective The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit. Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps. References 1. Surgical Data Science — from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306 2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184 3. 
STANDING Together Recommendations for Health Datasets. https://www.nature.com/articles/s41591-024-03147-2 4. WHO Ethics and Governance of Artificial Intelligence for Health. https://www.who.int/publications/i/item/9789240029200 5. DECIDE-AI: Reporting Guideline for Early-Stage Clinical Evaluation of AI. https://doi.org/10.1136/bmj-2022-070904 6. CONSORT-AI Extension. https://doi.org/10.1038/s41591-020-1034-x

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
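Annotation agreement, which this article asks readers to look for, can be made concrete with a chance-corrected statistic. The following is a minimal sketch, assuming two annotators have labeled the same clips under one protocol. The function name, the label values, and the example data are hypothetical illustrations, not taken from any cited study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical clip-level bleeding labels from two annotators.
annotator_1 = ["bleed", "none", "none", "bleed", "none", "none", "bleed", "none"]
annotator_2 = ["bleed", "none", "bleed", "bleed", "none", "none", "none", "none"]

kappa = cohens_kappa(annotator_1, annotator_2)
# Indices of clips that would go to adjudication.
disputed = [i for i, (a, b) in enumerate(zip(annotator_1, annotator_2)) if a != b]
print(f"kappa={kappa:.3f}, disputed clips={disputed}")
```

Raw agreement in this example is 75 percent, yet the chance-corrected value is lower because both annotators label most clips as "none". The disputed indices are the cases where the written definition of a bleeding event is most likely incomplete.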

evergreen editorial · educational synthesis

What Makes a Dataset Multicenter

Deck: A multicenter label is meaningful only when center variation is preserved and evaluated rather than pooled away. Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance. Estimated reading time: 9–12 minutes. Tags: multicenter, generalization, datasets. The issue beneath the headline Multicenter datasets include data from distinct institutions, but their value depends on independent provenance, meaningful variation, and center-aware splits. Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness. The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2] A concrete way to see the problem Combining two hospitals and randomly mixing all cases can improve diversity while still failing to test transfer. Holding one center out asks a stronger generalization question. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks. The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? 
The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion. Why this matters The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support. Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system. What to watch for Watch for unbalanced center sizes, common equipment suppliers, harmonized protocols, and results reported only after pooling. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude. Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary. What better evidence would look like Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. 
It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period. For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone. Practical question for readers Does the evaluation show performance at each center and on a center excluded from training? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model. Closing perspective The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit. Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps. References 1. Surgical Data Science — from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306 2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184 3. 
STANDING Together Recommendations for Health Datasets. https://www.nature.com/articles/s41591-024-03147-2 4. WHO Ethics and Governance of Artificial Intelligence for Health. https://www.who.int/publications/i/item/9789240029200 5. On Calibration of Modern Neural Networks. https://proceedings.mlr.press/v70/guo17a.html 6. Variable Generalization Performance of a Deep Learning Model for Chest Radiographs. https://doi.org/10.1371/journal.pmed.1002683

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
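The difference between pooling centers and holding one out, discussed in this article, is easy to see in code. The following is a minimal sketch, assuming each case record carries a center identifier. The field names `case_id` and `center` and the example records are hypothetical.

```python
from collections import defaultdict


def leave_one_center_out(cases):
    """Yield (held_out_center, train_ids, test_ids), one fold per center.

    `cases` is a list of dicts with the hypothetical keys
    'case_id' and 'center'.
    """
    by_center = defaultdict(list)
    for case in cases:
        by_center[case["center"]].append(case["case_id"])
    for held_out in sorted(by_center):
        # Every case from the held-out center goes to the test set.
        test_ids = list(by_center[held_out])
        # Training uses only cases from the remaining centers.
        train_ids = [
            case_id
            for center in sorted(by_center)
            if center != held_out
            for case_id in by_center[center]
        ]
        yield held_out, train_ids, test_ids


# Hypothetical case list: three centers of unequal size.
cases = [
    {"case_id": "a1", "center": "A"},
    {"case_id": "a2", "center": "A"},
    {"case_id": "a3", "center": "A"},
    {"case_id": "b1", "center": "B"},
    {"case_id": "b2", "center": "B"},
    {"case_id": "c1", "center": "C"},
]

folds = list(leave_one_center_out(cases))
for held_out, train_ids, test_ids in folds:
    print(held_out, "train:", train_ids, "test:", test_ids)
```

The split works at the case level, so frames from one operation never straddle training and test. If a patient can contribute several cases, the same grouping idea would need to be applied to patient identifiers as well. Reporting the metric for each fold separately, rather than averaging, keeps the center variation visible.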

evergreen editorial · educational synthesis

Why Prospective Validation Matters

Deck: Retrospective performance cannot reproduce the missing data, timing constraints, behavior changes, and operational friction of live use. Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance. Estimated reading time: 9–12 minutes. Tags: prospective-validation, clinical-evidence, deployment. The issue beneath the headline Prospective validation fixes the model and protocol before observing incoming cases. It can reveal distribution changes, latency, sensor failures, exclusions, and interactions with staff. Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness. The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2] A concrete way to see the problem A model may perform well on curated archived video but fail silently when a live feed changes resolution or briefly disconnects. Silent prospective deployment can identify this before outputs influence care. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks. 
The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion. Why this matters The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support. Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system. What to watch for Check whether the study was merely prospective data collection or whether it evaluated an intervention, and whether model updates occurred during the study. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude. Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary. 
What better evidence would look like Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period. For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone. Practical question for readers What live condition could not have been represented by the retrospective test set? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model. Closing perspective The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit. Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps. References 1. Surgical Data Science — from Concepts toward Clinical Translation. 
https://doi.org/10.1016/j.media.2021.102306 2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184 3. On Calibration of Modern Neural Networks. https://proceedings.mlr.press/v70/guo17a.html 4. Variable Generalization Performance of a Deep Learning Model for Chest Radiographs. https://doi.org/10.1371/journal.pmed.1002683 5. TRIPOD+AI Statement. https://www.bmj.com/content/385/bmj-2023-078378 6. DECIDE-AI: Reporting Guideline for Early-Stage Clinical Evaluation of AI. https://doi.org/10.1136/bmj-2022-070904

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
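The live-feed example in this article, a resolution change or a brief disconnect, suggests one thing a silent prospective run could log. The following is a minimal sketch, assuming frames arrive with a timestamp and pixel dimensions. The `Frame` type, the expected resolution, and the gap threshold are all hypothetical choices for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp_s: float  # seconds since the start of recording
    width: int
    height: int


def check_feed(frames, expected_size=(1920, 1080), max_gap_s=0.5):
    """Return (timestamp, issue) pairs for inputs the model was not tested on."""
    issues = []
    previous = None
    for frame in frames:
        # A resolution the retrospective test set never contained.
        if (frame.width, frame.height) != expected_size:
            issues.append((frame.timestamp_s, "unexpected_resolution"))
        # A pause between frames longer than the tolerated gap.
        if previous is not None and frame.timestamp_s - previous.timestamp_s > max_gap_s:
            issues.append((frame.timestamp_s, "gap_in_feed"))
        previous = frame
    return issues


# Hypothetical feed: a dropout of about two seconds, then a resolution switch.
frames = [
    Frame(0.00, 1920, 1080),
    Frame(0.04, 1920, 1080),
    Frame(2.00, 1920, 1080),
    Frame(2.04, 1280, 720),
]

issues = check_feed(frames)
for timestamp, issue in issues:
    print(f"{timestamp:.2f}s: {issue}")
```

In a silent evaluation, a log like this would be reviewed alongside the model outputs, without either being shown to the surgical team. The point of the sketch is that unsupported inputs are recorded explicitly instead of being passed to the model as if nothing had changed.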

evergreen editorial · educational synthesis

What Surgical AI Can Learn from Aviation Safety

Deck: The useful lesson is systematic management of complex systems, not a claim that cockpits and operating rooms are interchangeable. Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance. Estimated reading time: 9–12 minutes. Tags: safety, human-factors, systems. The issue beneath the headline Aviation safety emphasizes standardization, reporting, simulation, human factors, redundancy, and learning from near misses. Surgical AI can adopt those habits while respecting clinical variability. Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness. The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2] A concrete way to see the problem An AI alert should be evaluated as part of a team and interface, much like another instrument in a safety-critical system. Its failure modes include communication and workload, not only classification error. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks. 
The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion. Why this matters The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support. Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system. What to watch for Avoid shallow analogies that assume checklists or automation transfer unchanged. Look for concrete analysis of roles, escalation, workload, and fallback. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude. Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary. 
What better evidence would look like Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period. For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone. Practical question for readers Which system-level failure would remain even if the model prediction were technically correct? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model. Closing perspective The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit. Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps. References 1. 
Surgical Data Science — from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306 2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184 3. DECIDE-AI: Reporting Guideline for Early-Stage Clinical Evaluation of AI. https://doi.org/10.1136/bmj-2022-070904 4. CONSORT-AI Extension. https://doi.org/10.1038/s41591-020-1034-x 5. SPIRIT-AI Extension. https://doi.org/10.1136/bmj.m3210 6. FUTURE-AI: International Consensus Guideline for Trustworthy Healthcare AI. https://www.bmj.com/content/388/bmj-2024-081554

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
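The test of removing the easiest ten percent of cases, described above, can be sketched in a few lines. This is a minimal illustration, assuming each case has a per-case score and an independent difficulty rating assigned before the model was evaluated. The rating scale, the scores, and the function name are hypothetical.

```python
def score_without_easiest(cases, drop_fraction=0.10):
    """Return (mean score over all cases, mean score with the easiest dropped).

    `cases` is a list of (difficulty, score) pairs, where a lower
    difficulty rating means an easier case (an assumed convention).
    """
    if not cases:
        raise ValueError("need at least one case")
    # Sort from easiest to hardest using the independent difficulty rating.
    ordered = sorted(cases, key=lambda case: case[0])
    n_drop = int(len(ordered) * drop_fraction)
    kept = ordered[n_drop:]
    full_mean = sum(score for _, score in ordered) / len(ordered)
    trimmed_mean = sum(score for _, score in kept) / len(kept)
    return full_mean, trimmed_mean


# Hypothetical per-case results: (difficulty rating 1-10, per-case score).
cases = [
    (1, 0.99), (2, 0.95), (3, 0.90), (4, 0.90), (5, 0.85),
    (6, 0.80), (7, 0.80), (8, 0.70), (9, 0.60), (10, 0.50),
]

full_mean, trimmed_mean = score_without_easiest(cases)
print(f"all cases: {full_mean:.3f}, easiest 10% removed: {trimmed_mean:.3f}")
```

Difficulty is rated independently here because ranking cases by the model's own score would make the test circular. A large gap between the two means is a signal that the headline figure leans on routine cases and that the intended claim should narrow accordingly.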

evergreen editorial · educational synthesis

Critical View of Safety as a Case Study for Surgical AI

Deck: A clinically meaningful concept becomes difficult to model when its criteria depend on anatomy, dissection, timing, and expert judgment. Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance. Estimated reading time: 9–12 minutes. Tags: critical-view-of-safety, clinical-labels, validation. The issue beneath the headline The critical view of safety is attractive because it has explicit criteria and safety relevance. Translating it into labels still requires decisions about frames, temporal evidence, visibility, and acceptable uncertainty. Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness. The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2] A concrete way to see the problem A single frame may appear to show two structures entering the gallbladder while earlier or alternative views reveal ambiguity. Criteria-level review can be more informative than a single binary score. This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks. 
The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion. Why this matters The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support. Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system. What to watch for Watch for datasets that overrepresent ideal views, labels without adjudication, and claims that a retrospective classifier is equivalent to intraoperative decision support. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude. Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary. 
What better evidence would look like Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period. For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone. Practical question for readers What evidence beyond a high-quality frame would a surgeon use before accepting a critical-view assessment? Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model. Closing perspective The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit. Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps. References 1. 
Surgical Data Science — from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306 2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184 3. On Calibration of Modern Neural Networks. https://proceedings.mlr.press/v70/guo17a.html 4. Variable Generalization Performance of a Deep Learning Model for Chest Radiographs. https://doi.org/10.1371/journal.pmed.1002683 5. TRIPOD+AI Statement. https://www.bmj.com/content/385/bmj-2023-078378 6. DECIDE-AI: Reporting Guideline for Early-Stage Clinical Evaluation of AI. https://doi.org/10.1136/bmj-2022-070904

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
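The claim that criteria-level review can be more informative than a single binary score can be illustrated with a small comparison. The following is a minimal sketch, assuming each assessment records three separate criteria and that the binary label is true only when all three are met. The criterion names are deliberately generic placeholders, and the reference and predicted values are invented for illustration.

```python
CRITERIA = ("criterion_1", "criterion_2", "criterion_3")  # placeholder names


def binary_achieved(assessment):
    """Binary label: achieved only if every criterion is met."""
    return all(assessment[name] for name in CRITERIA)


def compare(reference, predicted):
    """Return (binary agreement, per-criterion agreement) as fractions."""
    n = len(reference)
    binary_agreement = sum(
        binary_achieved(r) == binary_achieved(p) for r, p in zip(reference, predicted)
    ) / n
    per_criterion = {
        name: sum(r[name] == p[name] for r, p in zip(reference, predicted)) / n
        for name in CRITERIA
    }
    return binary_agreement, per_criterion


def make(c1, c2, c3):
    """Build one assessment from three booleans."""
    return dict(zip(CRITERIA, (c1, c2, c3)))


# Hypothetical adjudicated reference labels and model outputs for four cases.
reference = [make(True, True, True), make(True, False, True),
             make(False, False, True), make(True, True, False)]
predicted = [make(True, True, True), make(False, True, True),
             make(False, False, True), make(True, False, False)]

binary_agreement, per_criterion = compare(reference, predicted)
print("binary agreement:", binary_agreement)
print("per-criterion agreement:", per_criterion)
```

In this invented example the binary labels agree on every case while two of the three criteria disagree on some cases. A study that reports only the binary score would hide exactly the disagreement a surgeon would want to inspect.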

evergreen editorial · educational synthesis

Anatomy Recognition in Surgery: Why It Is Hard

Deck: Anatomy is deformable, partially visible, altered by dissection, and often defined by relationships rather than fixed appearance.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: anatomy, segmentation, uncertainty.

The issue beneath the headline

Unlike many natural objects, tissue changes shape, color, position, and visibility throughout a procedure. Boundaries may be uncertain even for experienced observers.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

The cystic duct can be obscured by fat, clips, smoke, blood, or instruments. A segmentation model may produce a smooth plausible contour despite inadequate visual evidence.

This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Look for expert annotation protocols, inter-annotator variability, quality-condition analysis, abstention behavior, and external testing across anatomy and pathology. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

How should a system behave when the image does not support a reliable anatomical boundary?

Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. https://doi.org/10.1109/TMI.2016.2593957
4. TeCNO for Online Recognition of Surgical Phases. https://arxiv.org/abs/2003.10751
5. Endoscapes: A Critical View of Safety Dataset for Laparoscopic Cholecystectomy. https://arxiv.org/abs/2312.12429
6. Segment Anything. https://arxiv.org/abs/2304.02643

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
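
The abstention behavior mentioned under "What to watch for" can be made concrete with a small sketch. It assumes a segmentation model that exposes per-pixel foreground probabilities for a frame, and it withholds the contour when too many pixels fall in an ambiguous middle band. The function `should_abstain`, the thresholds, and the example values are all hypothetical; a rule like this only helps if the probabilities are reasonably calibrated, which cannot be taken for granted.

```python
def should_abstain(probabilities, low=0.3, high=0.7, max_uncertain_fraction=0.25):
    """Decide whether to withhold a contour for one frame.

    probabilities: flat list of per-pixel foreground probabilities in [0, 1].
    A pixel counts as "uncertain" when its probability lies strictly between
    `low` and `high`. All thresholds are illustrative assumptions and would
    need tuning and validation before any real use.
    """
    if not probabilities:
        return True  # no evidence at all: withhold the output
    uncertain = sum(1 for p in probabilities if low < p < high)
    return uncertain / len(probabilities) > max_uncertain_fraction


# Hypothetical frames: one with confident pixels, one obscured by smoke.
clear_frame = [0.02, 0.01, 0.97, 0.98, 0.99, 0.03, 0.96, 0.02]
smoky_frame = [0.45, 0.55, 0.60, 0.40, 0.52, 0.05, 0.95, 0.48]

abstain_clear = should_abstain(clear_frame)
abstain_smoky = should_abstain(smoky_frame)
print(abstain_clear, abstain_smoky)
```

The design choice worth debating is what the user sees when the system abstains, which is a human-factors question rather than a modeling one.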

evergreen editorial · educational synthesis

Why Tool Detection Is Useful but Clinically Incomplete

Deck: Knowing which instruments are visible can support workflow analysis, yet instrument presence is only a proxy for activity and intent.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: tool-detection, proxies, workflow.

The issue beneath the headline

Tools are visually distinctive and comparatively straightforward to label. Their presence can correlate with phases, actions, resource use, and technique.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A clip showing a hook does not establish what tissue it contacts or why it is being used. Occlusion can also hide an active tool while its effect remains clinically important.

This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for conclusions that move from detection accuracy to claims about performance, safety, or decision-making without validating those intermediate assumptions. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Which clinically important fact remains unknown after a tool has been detected correctly?

Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. https://doi.org/10.1109/TMI.2016.2593957
4. TeCNO for Online Recognition of Surgical Phases. https://arxiv.org/abs/2003.10751
5. Endoscapes: A Critical View of Safety Dataset for Laparoscopic Cholecystectomy. https://arxiv.org/abs/2312.12429
6. DECIDE-AI: Reporting Guideline for Early-Stage Clinical Evaluation of AI. https://doi.org/10.1136/bmj-2022-070904

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
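
"False alerts per procedure" appears in the list of common omissions, and it is easy to compute once frames are grouped by operation. The following sketch is one possible reading of that metric, not a standard definition: it treats a run of consecutive alert frames as a single episode and counts the episode as false when no frame in the run coincides with a labeled event. The tuple layout, the function name `false_alerts_per_procedure`, and the example data are assumptions made for illustration.

```python
from itertools import groupby


def false_alerts_per_procedure(frames):
    """Count false alert episodes for each procedure.

    frames: iterable of (procedure_id, alert_fired, event_present) tuples,
    in temporal order within each procedure. An episode is a run of
    consecutive frames with alert_fired=True; it is counted as false when no
    frame in the run has event_present=True. This definition is an assumption.
    """
    by_procedure = {}
    for proc_id, alert, event in frames:
        by_procedure.setdefault(proc_id, []).append((alert, event))

    counts = {}
    for proc_id, sequence in by_procedure.items():
        false_episodes = 0
        # groupby splits the sequence into runs of equal alert state.
        for fired, run in groupby(sequence, key=lambda frame: frame[0]):
            if fired and not any(event for _, event in run):
                false_episodes += 1
        counts[proc_id] = false_episodes
    return counts


# Hypothetical frame-level outputs for two procedures.
frames = [
    ("case_a", False, False),
    ("case_a", True, False),   # false episode (two frames, counted once)
    ("case_a", True, False),
    ("case_a", False, False),
    ("case_a", True, True),    # alert that coincides with a labeled event
    ("case_b", True, False),   # false episode
    ("case_b", False, False),
    ("case_b", True, False),   # second false episode
    ("case_b", False, True),   # missed event, not an alert
]

counts = false_alerts_per_procedure(frames)
print(counts)
```

Reporting the count per procedure rather than per frame keeps the unit of analysis at the level a surgical team would actually experience.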

evergreen editorial · educational synthesis

Why Surgical Phase Recognition Became a Benchmark Task

Deck: Phases offer a clinically legible way to organize long procedures, but their benchmark popularity should not be confused with direct clinical benefit.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: workflow, phase-recognition, benchmarks.

The issue beneath the headline

Phase labels compress hours of video into a manageable temporal structure. They are easier to annotate than many fine-grained actions and support comparison across sequence models.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

Cholecystectomy phases can help index video or study workflow. A correct phase label does not by itself indicate whether a maneuver was safe or technically appropriate.

This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Examine phase definitions, transition tolerance, online versus offline inference, class imbalance, and whether repeated or atypical phases are represented. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

What real user decision would improve if the current phase were known reliably?

Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. https://doi.org/10.1109/TMI.2016.2593957
4. TeCNO for Online Recognition of Surgical Phases. https://arxiv.org/abs/2003.10751
5. Endoscapes: A Critical View of Safety Dataset for Laparoscopic Cholecystectomy. https://arxiv.org/abs/2312.12429
6. DECIDE-AI: Reporting Guideline for Early-Stage Clinical Evaluation of AI. https://doi.org/10.1136/bmj-2022-070904

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
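
Transition tolerance, one of the items listed under "What to watch for", changes reported numbers more than readers may expect. The sketch below compares strict frame accuracy with a relaxed variant for a single procedure. The relaxed rule used here (a mismatch is forgiven if the predicted phase matches the ground truth of a frame within `tolerance` frames) is an illustrative assumption, not the definition used by any particular benchmark, and the phase names and sequences are made up.

```python
def frame_accuracy(truth, pred, tolerance=0):
    """Fraction of frames counted as correct for one procedure.

    A frame is correct when pred == truth. With tolerance > 0, a mismatch is
    also accepted if the predicted phase equals the ground-truth phase of some
    frame at most `tolerance` frames away, i.e. the error sits next to a
    transition. This relaxed rule is an assumption made for illustration.
    """
    if len(truth) != len(pred) or not truth:
        raise ValueError("truth and pred must be non-empty and of equal length")
    correct = 0
    for i, (t, p) in enumerate(zip(truth, pred)):
        start = max(0, i - tolerance)
        stop = min(len(truth), i + tolerance + 1)
        if p == t or p in truth[start:stop]:
            correct += 1
    return correct / len(truth)


# Hypothetical ten-frame procedure with three phases.
truth = ["prep"] * 4 + ["dissect"] * 4 + ["clip"] * 2
pred = ["prep"] * 3 + ["dissect"] * 5 + ["prep"] + ["clip"]

strict = frame_accuracy(truth, pred, tolerance=0)
relaxed = frame_accuracy(truth, pred, tolerance=1)
print(strict, relaxed)
```

In this example the early switch to "dissect" is forgiven under tolerance, while the implausible jump back to "prep" is not, so the same predictions yield two different headline numbers. A paper should state which rule it used.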

evergreen editorial · educational synthesis

What Domain Shift Means in the Operating Room

Deck: A model leaves its training distribution whenever the hospital, equipment, team, case mix, or recording process changes.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: domain-shift, generalization, external-validation.

The issue beneath the headline

Domain shift is not one event. It includes changes in pixels, workflows, populations, labels, prevalence, and intended use. Different shifts require different tests.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A phase model trained on one video processor may use color balance or overlays as shortcuts. A camera upgrade can then lower performance without any change in surgery.

This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for external validation that changes only one minor factor, and for papers that call a random held-out split evidence of generalization. Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Which local change would be most likely to invalidate the assumptions behind a model you are considering?

Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. On Calibration of Modern Neural Networks. https://proceedings.mlr.press/v70/guo17a.html
4. Variable Generalization Performance of a Deep Learning Model for Chest Radiographs. https://doi.org/10.1371/journal.pmed.1002683
5. TRIPOD+AI Statement. https://www.bmj.com/content/385/bmj-2023-078378

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
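
The difference between a random held-out split and validation that crosses a named boundary can be shown with a short sketch. It assumes each case record carries a site label, and it builds one split per hospital in which every case from that hospital is held out. The dictionary keys `case_id` and `site`, the function name `leave_one_site_out`, and the example records are hypothetical.

```python
def leave_one_site_out(cases):
    """Yield (held_out_site, train_ids, test_ids), one tuple per site.

    cases: list of dicts with the hypothetical keys "case_id" and "site".
    Every case from the held-out site goes to the test set, so no hospital
    contributes data to both sides of a split.
    """
    sites = sorted({case["site"] for case in cases})
    for site in sites:
        train_ids = [c["case_id"] for c in cases if c["site"] != site]
        test_ids = [c["case_id"] for c in cases if c["site"] == site]
        yield site, train_ids, test_ids


# Hypothetical case index covering three hospitals.
cases = [
    {"case_id": "v01", "site": "hospital_a"},
    {"case_id": "v02", "site": "hospital_a"},
    {"case_id": "v03", "site": "hospital_b"},
    {"case_id": "v04", "site": "hospital_b"},
    {"case_id": "v05", "site": "hospital_c"},
]

splits = list(leave_one_site_out(cases))
for site, train_ids, test_ids in splits:
    print(site, "train:", train_ids, "test:", test_ids)
```

The same pattern applies to other boundaries named in the text, such as device, surgeon group, or time period, by swapping the grouping key. Splitting at the case level also avoids placing frames from one operation on both sides.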

evergreen editorial · educational synthesis

Why ImageNet Pretraining Does Not Solve Surgery

Deck: General visual pretraining can help optimization, but it does not supply surgical concepts, temporal context, or clinical validation.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: pretraining, computer-vision, domain-shift.

The issue beneath the headline

ImageNet teaches models statistical features from ordinary photographs. Some edges, textures, and shapes transfer, yet surgical scenes differ in anatomy, lighting, motion, occlusion, and meaning.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A pretrained detector may adapt quickly to instruments, but it still needs surgical labels and may fail on smoke, blood, unfamiliar devices, or subtle tissue boundaries.

This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Distinguish improved benchmark initialization from evidence of surgical understanding. Ask what domain-specific data were required and where the adapted model was externally tested.

Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Which part of the task genuinely benefits from generic visual features, and which part still depends on surgical context?

Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. On Calibration of Modern Neural Networks. https://proceedings.mlr.press/v70/guo17a.html
4. Variable Generalization Performance of a Deep Learning Model for Chest Radiographs. https://doi.org/10.1371/journal.pmed.1002683
5. TRIPOD+AI Statement. https://www.bmj.com/content/385/bmj-2023-078378
6. Segment Anything. https://arxiv.org/abs/2304.02643

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
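
The "easiest ten percent" test mentioned above can be sketched in a few lines. This is a minimal illustration, not a method taken from any cited study: the case identifiers and scores are hypothetical, and "easiest" is approximated by the model's own per-case score, which is an assumption. A clinician-assigned difficulty rating could be substituted.

```python
from statistics import mean


def drop_easiest(case_scores, fraction=0.10):
    """Return the per-case scores left after removing the highest-scoring cases.

    case_scores maps a case identifier to one score per procedure (higher is
    better). "Easiest" is approximated here by the model's own score, which is
    an assumption made for this sketch.
    """
    ranked = sorted(case_scores.items(), key=lambda item: item[1], reverse=True)
    n_drop = int(len(ranked) * fraction)  # rounds down; 0 for very small sets
    return dict(ranked[n_drop:])


# Hypothetical per-case scores for ten procedures (illustrative values only).
scores = {
    "case_01": 0.97, "case_02": 0.95, "case_03": 0.93, "case_04": 0.91,
    "case_05": 0.90, "case_06": 0.88, "case_07": 0.72, "case_08": 0.65,
    "case_09": 0.58, "case_10": 0.41,
}

remaining = drop_easiest(scores, fraction=0.10)

# Report the average next to the worst case, since the spread of errors
# usually says more than the mean alone.
print(f"all cases:       mean={mean(scores.values()):.3f}, worst={min(scores.values()):.2f}")
print(f"easiest removed: mean={mean(remaining.values()):.3f}, n={len(remaining)}")
```

Scores are kept per procedure rather than per frame so that each value corresponds to one independent case. If the mean drops sharply once the easiest cases are removed, the intended claim probably needs to narrow.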


evergreen editorial · educational synthesis

Why Public Surgical Datasets Are Smaller Than They Look

Deck: Frame counts can create the impression of scale even when the number of independent patients, surgeons, or centers is modest.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: datasets, sampling, independence.

The issue beneath the headline

Video datasets contain highly correlated frames. A million images sampled from a few procedures do not provide the same diversity as a million independent clinical observations.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A model can see thousands of nearly identical frames from one cholecystectomy. If frames cross train and test partitions, evaluation can reward recognition of the case rather than generalizable surgical features.

This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Look for patient-level splits, numbers of procedures and centers, case completeness, surgeon diversity, and whether extracted clips preserve the intended clinical distribution.

Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

When a paper reports millions of frames, how many independent opportunities for generalization were actually tested?

Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. STANDING Together Recommendations for Health Datasets. https://www.nature.com/articles/s41591-024-03147-2
4. WHO Ethics and Governance of Artificial Intelligence for Health. https://www.who.int/publications/i/item/9789240029200

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
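
The patient-level separation this article asks for can be illustrated with a short sketch. It assumes each frame record carries a patient identifier. The function name, identifiers, and dataset sizes are hypothetical and chosen only to show the idea of assigning whole patients, rather than individual frames, to a partition.

```python
import random
from collections import defaultdict


def split_by_patient(frames, test_fraction=0.2, seed=0):
    """Split frame records so that no patient appears in both partitions.

    frames is a list of (frame_id, patient_id) pairs. Patients, not frames,
    are shuffled and assigned, so correlated frames stay together.
    """
    by_patient = defaultdict(list)
    for frame_id, patient_id in frames:
        by_patient[patient_id].append(frame_id)

    patients = sorted(by_patient)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [f for p in patients if p not in test_patients for f in by_patient[p]]
    test = [f for p in test_patients for f in by_patient[p]]
    return train, test, test_patients


# Hypothetical dataset: 5 patients, 1,000 frames each (identifiers are made up).
frames = [(f"p{p}_f{i}", f"patient_{p}") for p in range(5) for i in range(1000)]
train, test, test_patients = split_by_patient(frames)

n_patients = len({patient for _, patient in frames})
print(f"frames: {len(frames)}, independent patients: {n_patients}")
print(f"train frames: {len(train)}, test frames: {len(test)}, "
      f"test patients: {len(test_patients)}")
```

In this toy case the 1,000 test frames all come from one patient, so however large the frame count looks, generalization is being checked against a single independent case. The same grouping idea extends to surgeon or center when those are the boundaries a study claims to cross.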


evergreen editorial · educational synthesis

Why Surgical AI Has a Data Problem

Deck: The central constraint is not simply a shortage of recordings. It is the shortage of well-governed, representative data connected to stable clinical definitions.
Evidence status: Evergreen educational analysis based on established Surgical Data Science principles. Not clinical guidance.
Estimated reading time: 9–12 minutes.
Tags: datasets, governance, data-quality.

The issue beneath the headline

Surgical data are fragmented across institutions, devices, formats, and governance systems. The cases easiest to collect are rarely a complete representation of the cases where a model may eventually be used.

Surgical AI discussions often compress several different questions into one: whether a signal can be predicted, whether the prediction transfers, whether a user can interpret it, and whether acting on it improves anything that matters. Separating those questions is not a technicality. It prevents a benchmark result from being mistaken for evidence of clinical readiness.

The operating room is a particularly demanding setting for this reasoning. Procedures unfold over time, visibility changes, anatomy varies, and team decisions depend on information that may never appear in the video. Data are also produced within institutions that have different equipment, governance, documentation, and case mixes. A model can be technically competent and still be inappropriate for a proposed use.[1,2]

A concrete way to see the problem

A hospital may hold thousands of laparoscopic videos but lack reliable procedure identifiers, outcomes, device metadata, or permission for secondary research. Counting recordings therefore overstates the evidence available.

This example shows why the unit of analysis matters. Frames from one operation are highly related. Labels may be operational definitions rather than direct observations of a clinical state. A model may use a visually convenient proxy, and a strong average score may hide the specific conditions in which that proxy breaks.

The first question should therefore be: what evidence is actually available in the input? The second is: how was the target defined? The third is: where was the relationship tested? Only after those questions should architecture and performance improvements dominate the discussion.

Why this matters

The distinction matters for both research efficiency and clinical responsibility. Teams waste time when they optimize a benchmark before confirming that its target maps to a useful question. They also create avoidable risk when they present a retrospective output as though it were validated decision support.

Clinicians can improve projects by defining events, edge cases, consequences, and workflow constraints. Engineers can improve them by exposing assumptions, testing shift, measuring calibration, and designing systems that recognize unsupported inputs. Neither group can replace the other. Surgical AI is strongest when the clinical definition, data representation, model, interface, and evaluation are designed as one connected system.

What to watch for

Watch for claims that equate hours of video with dataset quality, omit missing cases, or describe one-center data as broadly representative.

Also examine what the study does not report. Common omissions include independent case counts, excluded recordings, annotation disagreement, center-specific performance, false alerts per procedure, prospective latency, and the exact role of the user. Missing detail does not prove poor work, but it limits what a reader can conclude.

Be cautious with words such as understands, assists, prevents, autonomous, and clinically useful. Each implies evidence beyond simple label prediction. A careful article can still describe promising technical progress while stating that external validation, prospective testing, human-factors work, or regulatory review remains necessary.

What better evidence would look like

Better evidence starts with a precise intended use and a dataset that reflects the variation relevant to it. It separates patients and procedures across splits, documents provenance, reports difficult cases, and uses metrics connected to the consequences of error. External validation should cross a named boundary such as hospital, device, surgeon group, or time period.

For systems intended to operate live, silent prospective evaluation can test data flow, latency, missing inputs, and domain shift before users rely on outputs. Human-factors studies can then examine whether information is understood, whether it changes behavior, and whether it creates alert burden or automation bias. Clinical benefit requires an appropriately designed comparative study; it cannot be inferred from retrospective accuracy alone.

Practical question for readers

Which missing context would make a large local video archive unsafe to treat as a training dataset?

Try answering the question without using the words AI or accuracy. Describe the user, the moment, the input, the output, the possible response, and the consequence of an error. If those elements remain vague, the project probably needs a clearer intended use before it needs a larger model.

Closing perspective

The goal is not to demand a clinical trial for every exploratory model. Early research needs room to test ideas. The goal is to label evidence honestly so that feasibility, validation, integration, and impact are not confused. A prototype can be valuable because it reveals a measurable signal or a better experimental method. It becomes more credible, not less, when its limits are explicit.

Surgical AI will advance through accumulation of well-defined questions, representative data, reproducible methods, external tests, and careful integration. Hype shortens that chain in language. Good editorial work restores the missing steps.

References

1. Surgical Data Science – from Concepts toward Clinical Translation. https://doi.org/10.1016/j.media.2021.102306
2. Surgical Data Science: A Consensus Perspective. https://arxiv.org/abs/1806.03184
3. STANDING Together Recommendations for Health Datasets. https://www.nature.com/articles/s41591-024-03147-2
4. WHO Ethics and Governance of Artificial Intelligence for Health. https://www.who.int/publications/i/item/9789240029200

One additional test is to ask how the conclusion would change if the easiest ten percent of cases were removed. If performance depends heavily on routine, well-framed examples, the result may still be scientifically useful, but the intended claim should narrow. Error distributions usually teach more than a single average.

The same caution applies to newer foundation and multimodal models. Their broad pretraining may improve adaptation, yet scale does not define clinical labels, obtain consent, prevent leakage, or establish safe use. Those responsibilities remain with the people designing and evaluating the system.
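
The gap between recordings held and recordings usable, as in the hospital archive example in this article, can be made concrete with a small audit sketch. The field names, record values, and the rule that permission must be explicitly granted are all assumptions made for illustration. A real archive would define these with its governance and clinical teams.

```python
from collections import Counter

# Hypothetical fields a recording needs before it counts as research evidence.
REQUIRED_FIELDS = ("procedure_id", "outcome", "device", "research_permission")


def is_missing(value):
    """Treat absent or empty values as missing; False is a recorded answer."""
    return value is None or value == ""


def usable_recordings(records):
    """Return records with every required field present and permission granted."""
    usable = []
    for record in records:
        complete = not any(is_missing(record.get(f)) for f in REQUIRED_FIELDS)
        if complete and record["research_permission"] is True:
            usable.append(record)
    return usable


def missing_field_counts(records):
    """Count, per required field, how many records lack a value."""
    counts = Counter()
    for record in records:
        for field in REQUIRED_FIELDS:
            if is_missing(record.get(field)):
                counts[field] += 1
    return counts


# Made-up archive entries (illustrative only).
archive = [
    {"video": "rec_001", "procedure_id": "LC-17", "outcome": "uneventful",
     "device": "scope_A", "research_permission": True},
    {"video": "rec_002", "procedure_id": "", "outcome": "uneventful",
     "device": "scope_A", "research_permission": True},
    {"video": "rec_003", "procedure_id": "LC-22", "outcome": None,
     "device": None, "research_permission": True},
    {"video": "rec_004", "procedure_id": "LC-30", "outcome": "conversion",
     "device": "scope_B", "research_permission": False},
]

usable = usable_recordings(archive)
print(f"recordings held: {len(archive)}, usable for research: {len(usable)}")
print("missing values by field:", dict(missing_field_counts(archive)))
```

Reporting both numbers side by side keeps the count of recordings from standing in for the count of cases that can support a claim.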
